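How I got an LLM too large for one GPU's VRAM running by splitting it across two devices

I have two Jetson Orin Nano NX 16G units on hand, so I will try running a model that does not fit in one unit's VRAM by distributing inference across both.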

Model used

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF

From this repository I use

mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf (26.4G)

As the name suggests, the Jetson Orin Nano NX 16G has 16G of VRAM, so this model does not fit in the memory of a single unit.
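As a rough check of that claim, the sketch below compares the file size with the nominal memory per unit. It assumes the whole 26.4G file has to be resident and ignores the KV cache, the OS, and other overhead, so the real requirement is higher than this estimate. The helper name min_units is my own.

```python
import math

MODEL_SIZE_GB = 26.4     # size of mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
MEMORY_PER_UNIT_GB = 16  # nominal memory of one Jetson unit (usable memory is lower)

def min_units(model_gb: float, per_unit_gb: float) -> int:
    """Smallest number of units whose combined nominal memory holds the model."""
    return math.ceil(model_gb / per_unit_gb)

fits_on_one = MODEL_SIZE_GB <= MEMORY_PER_UNIT_GB
print(f"fits on one unit: {fits_on_one}")
print(f"minimum units (nominal): {min_units(MODEL_SIZE_GB, MEMORY_PER_UNIT_GB)}")
```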

Library used

llama.cpp

https://github.com/ggerganov/llama.cpp

I chose llama.cpp because it supports RPC:

https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc

vllm and TensorRT-LLM are alternatives, but at the time of writing both look quite difficult to run on a Jetson.

vllm requires torch 2.2.1, which I could not install. TensorRT-LLM is officially unsupported on Jetson (support is in progress).

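Conclusion

To state the conclusion up front:

It works, but responses are very slow, and at least in a home environment it is not usable in practice.

My own lack of knowledge is probably a large part of the cause, but network throughput becomes the bottleneck.

In my environment I use a BUFFALO LSW6-GT-8NS switch (Gigabit-capable).

https://amzn.to/3wSef3y

The LAN cables are CAT8 (40GBASE-T):

https://amzn.to/3KdMu8F

With this setup the theoretical throughput is 1Gbps, but that is only the theoretical figure.

In practice the throughput is around 100Mbps.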

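Overview

The setup follows the architecture diagram in the llama.cpp RPC documentation.

The model is stored on the Main Host. At run time it is split and sent to each rpc-server, where a Backend (CUDA, Metal, etc.) processes its part and returns the result.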
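To make the idea of splitting concrete, here is a minimal sketch that divides a model's layers among backends in proportion to their free memory. It is an illustration under my own assumptions, not llama.cpp's actual placement code: the layer count of 32 and the helper name split_layers are hypothetical, chosen only for the example. The 12584 MB figure is the backend memory that rpc-server reports at startup later in this article.

```python
def split_layers(n_layers: int, free_mb: list[int]) -> list[int]:
    """Assign layers to backends in proportion to free memory.

    Illustrative only: uses largest-remainder rounding so the counts
    always sum to n_layers.
    """
    total = sum(free_mb)
    exact = [n_layers * m / total for m in free_mb]
    counts = [int(x) for x in exact]
    # Hand out the layers lost to rounding, largest fractional part first.
    leftovers = n_layers - sum(counts)
    order = sorted(range(len(free_mb)),
                   key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts

# Two and three backends with equal free memory (hypothetical 32-layer model)
print(split_layers(32, [12584, 12584]))
print(split_layers(32, [12584, 12584, 12584]))
```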

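For mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf (26.4G), sending the model to the three rpc-servers in the documentation's diagram means at least 26.4G/3=8.8G of traffic per server.

This transfer is the first thing that is slow. Perhaps there is a caching mechanism? If I find one, I will add a note here.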
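A back-of-the-envelope estimate shows why the link speed matters so much. The sketch below assumes "G" means decimal gigabytes and ignores protocol overhead, so real transfers will take longer. Under those assumptions, 8.8G takes roughly 70 seconds at 1Gbps and roughly 12 minutes at 100Mbps.

```python
def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Time to move size_gb (decimal gigabytes) over a link of link_mbps megabits/s."""
    bits = size_gb * 1e9 * 8
    return bits / (link_mbps * 1e6)

PER_SERVER_GB = 26.4 / 3  # the 8.8G figure from the text

for mbps in (1000, 100):
    secs = transfer_seconds(PER_SERVER_GB, mbps)
    print(f"{mbps:>5} Mbps: {secs:7.1f} s ({secs / 60:.1f} min)")
```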

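Building the server side (Host A, Host B)

Installing jetson-containers

For Jetson, there are ready-made containers that bundle a variety of libraries used for AI. That is jetson-containers.

There is no reason not to use it when doing anything on a Jetson, so I install it.

The documentation is at

https://github.com/dusty-nv/jetson-containers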

$ git clone https://github.com/dusty-nv/jetson-containers
$ bash jetson-containers/install.sh

Trying ollama (getting a feel for jetson-containers)

This step is optional, but to see what jetson-containers is like, I run ollama, which I use regularly, in a container.

A detailed manual is at

https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/ollama

$ jetson-containers run --name ollama $(autotag ollama)

Namespace(packages=['ollama'], prefer=['local', 'registry', 'build'], disable=[''], user='dustynv', output='/tmp/autotag', quiet=False, verbose=False)
-- L4T_VERSION=36.3.0  JETPACK_VERSION=6.0  CUDA_VERSION=12.2
-- Finding compatible container image for ['ollama']

Found compatible container dustynv/ollama:r36.3.0 (2024-05-14, 3.9GB) - would you like to pull it? [Y/n] Y
dustynv/ollama:r36.3.0
+ sudo docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/jetson-01/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock --name ollama dustynv/ollama:r36.3.0
Unable to find image 'dustynv/ollama:r36.3.0' locally
r36.3.0: Pulling from dustynv/ollama
:
:
OLLAMA_MODELS /data/models/ollama/models
OLLAMA_LOGS   /data/logs/ollama.log

ollama server is now started, and you can run commands here like 'ollama run llama3'

root@jetson-01:/# ollama run llama3
pulling manifest
pulling 6a0746a1ec1a...  77% ▕██████████████                              ▏ 3.6 GB/4.7 GB   62 MB/s     17s
:
writing manifest
removing any unused layers
success
>>> Send a message (/? for help)
>>> おはようー
おはようー！😊 Good morning! 🌞 How are you today? 😊

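ollama, which takes a surprising amount of effort to set up without jetson-containers, ran with no fuss 🍺

Installing llama.cpp

Now for the main event. I install llama.cpp.

Starting a container with torch preinstalled

As the documentation says, jetson-containers supports llama.cpp, but it ships a somewhat older llama.cpp that probably lacks the recently added RPC feature, so I use l4t-pytorch and build llama.cpp myself.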

$ jetson-containers run $(autotag l4t-pytorch)
:
root@jetson-01:/#

Building llama.cpp

root@jetson-01:/# git clone https://github.com/ggerganov/llama.cpp
root@jetson-01:/# cd llama.cpp
root@jetson-01:/llama.cpp# make LLAMA_CUDA=1

LLAMA_CUDA=1 tells the build to use the GPU.

Checking that llama.cpp works correctly

root@jetson-01:/llama.cpp# apt install wget
root@jetson-01:/llama.cpp# wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf?download=true
root@jetson-01:/llama.cpp# mv 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf?download=true' llama3.gguf
root@jetson-01:/llama.cpp# text="AIについて教えて"
root@jetson-01:/llama.cpp# ./main -m ./llama3.gguf --temp 0.1 -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant. Please answer in Japanese<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n$text<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -ngl 32 -b 512
:
<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant. Please answer in Japanese<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAIについて教えて<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAI（Artificial Intelligence）は、人工知能のことです。人工知能は、コンピュータープログラムを使用して、人間の知能を模倣したり、超えることを目指す技術です。AIは、学習や推測、判断など、人間の知能の様々な機能を実現することができます。

AI技術は、現在、各分野で活用されており、生活様々な面で影響を与えています。例えば、スマートフォンの音声助手や、自動運転車、医療診断支援システムなど、日常生活に密着した分野で活用されています。

AI技術の分野には、以下のような分野があります。

* 機械学習（Machine Learning）：AIが学習する技術
* 自然言語処理（Natural Language Processing）：AIが自然言語を理解する技術
* 画像認識（Image Recognition）：AIが画像を認識する技術
* 音声認識（Speech Recognition）：AIが音声を認識する技術
* 予測分析（Predictive Analytics）：AIが将来の出来事を予測する技術

AI技術は、将来的にますます発展し、生活様々な面で影響を与えることが期待されています。<|eot_id|> [end of text]

llama_print_timings:        load time =    8672.15 ms
llama_print_timings:      sample time =      45.38 ms /   289 runs   (    0.16 ms per token,  6368.02 tokens per second)
llama_print_timings: prompt eval time =     791.07 ms /    39 tokens (   20.28 ms per token,    49.30 tokens per second)
llama_print_timings:        eval time =   27496.79 ms /   288 runs   (   95.47 ms per token,    10.47 tokens per second)
llama_print_timings:       total time =   28817.78 ms /   327 tokens
Log end

It works correctly 🍺

The speed was:

Model load: 8672.15 ms = 8.7 s
Input: 20.28 ms per token
Output: 95.47 ms per token
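These figures are read off the llama_print_timings lines by hand. If you want to compare several runs, a small parser helps. The sketch below is my own and assumes the log format is exactly as printed in this article; the regular expression is not part of llama.cpp and may need adjusting for other builds.

```python
import re

# Matches lines such as
# "llama_print_timings:  eval time = 27496.79 ms / 288 runs ( 95.47 ms per token, ..."
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<total>[\d.]+) ms"
    r"(?:\s+/\s+\d+\s+\w+\s+\(\s*(?P<per_token>[\d.]+) ms per token)?"
)

def parse_timings(log: str) -> dict:
    """Map each timing name to its total ms and, when present, ms per token."""
    result = {}
    for m in TIMING_RE.finditer(log):
        entry = {"total_ms": float(m.group("total"))}
        if m.group("per_token") is not None:
            entry["ms_per_token"] = float(m.group("per_token"))
        result[m.group("name")] = entry
    return result

sample_log = """\
llama_print_timings:        load time =    8672.15 ms
llama_print_timings: prompt eval time =     791.07 ms /    39 tokens (   20.28 ms per token,    49.30 tokens per second)
llama_print_timings:        eval time =   27496.79 ms /   288 runs   (   95.47 ms per token,    10.47 tokens per second)
"""

timings = parse_timings(sample_log)
for name, entry in timings.items():
    print(name, entry)
```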

Building and starting rpc-server

Building rpc-server

root@jetson-01:/llama.cpp# cd /llama.cpp
root@jetson-01:/llama.cpp# mkdir build-rpc-cuda
root@jetson-01:/llama.cpp# cd build-rpc-cuda
root@jetson-01:/llama.cpp/build-rpc-cuda# cmake .. -DLLAMA_CUDA=ON -DLLAMA_RPC=ON
:
-- Configuring done (5.2s)
-- Generating done (0.2s)
-- Build files have been written to: /llama.cpp/build-rpc-cuda

root@jetson-01:/llama.cpp/build-rpc-cuda# cmake --build . --config Release
:
[ 99%] Built target vdot
[100%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
[100%] Linking CXX executable ../../bin/q8dot
[100%] Built target q8dot

Build complete 🍺

Starting rpc-server

root@jetson-01:/llama.cpp/build-rpc-cuda# CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 12584 MB

Startup complete 🍺

Incidentally, running something like curl http://localhost:50052 produces

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

The server crashes with this error. I do not know whether this is by design or a bug.
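If you only want to confirm that the port is listening, one option is to open a TCP connection and close it again without sending any bytes. The sketch below does that with Python's standard socket module. I have not verified that rpc-server tolerates a connection that sends nothing, so treat that as an assumption and try it on a server you can afford to restart.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds. Sends no data."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    # 50052 is the port passed to rpc-server with -p in this article.
    print(port_is_open("127.0.0.1", 50052))
```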

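Since there are two units, I could have copied this image and run it on the second one, but for unimportant personal reasons I ran the same commands on the second unit to bring it up as well.

Building the client side (Main Host)

I could have repeated the same commands partway once more, but that was too tedious, so I ran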

$ docker ps
CONTAINER ID   IMAGE                         COMMAND       CREATED          STATUS          PORTS     NAMES
8e1b8336fcce   dustynv/l4t-pytorch:r36.2.0   "/bin/bash"   53 minutes ago   Up 53 minutes             blissful_swirles

$ docker commit 8e1b8336fcce micro_llama_cpp

to create an image, and started it with

$ docker run --runtime nvidia -it --rm --network host \
    --volume /tmp/argus_socket:/tmp/argus_socket \
    --volume /etc/enctune.conf:/etc/enctune.conf \
    --volume /etc/nv_tegra_release:/etc/nv_tegra_release \
    --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model \
    --volume /var/run/dbus:/var/run/dbus \
    --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket \
    --volume /var/run/docker.sock:/var/run/docker.sock \
    --volume /home/jetson-01/jetson-containers/data:/data \
    --device /dev/snd \
    --device /dev/bus/usb \
    --device /dev/i2c-0 \
    --device /dev/i2c-1 \
    --device /dev/i2c-2 \
    --device /dev/i2c-4 \
    --device /dev/i2c-5 \
    --device /dev/i2c-7 \
    --device /dev/i2c-9 \
    micro_llama_cpp

The options are the same as those printed when running $ jetson-containers run --name ollama $(autotag ollama) above.

Building the RPC Backend

root@jetson-01:/llama.cpp# mkdir build-rpc
root@jetson-01:/llama.cpp# cd build-rpc
root@jetson-01:/llama.cpp/build-rpc# cmake .. -DLLAMA_RPC=ON
root@jetson-01:/llama.cpp/build-rpc# cmake --build . --config Release

Build complete 🍺

Running distributed inference

Downloading the model

root@jetson-01:/llama.cpp/build-rpc# cd ..
root@jetson-01:/llama.cpp# mkdir mymodels
root@jetson-01:/llama.cpp# cd mymodels/
root@jetson-01:/llama.cpp/mymodels# wget https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf?download=true
:
root@jetson-01:/llama.cpp/mymodels# mv mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf\?download\=true ./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
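The mv is needed because wget keeps the ?download=true query string in the saved file name. One way to avoid the rename is to derive the target name from the URL path and pass it to wget's -O option. The helper below is a small sketch of that derivation; the function name filename_from_url is my own.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def filename_from_url(url: str) -> str:
    """Return the last path component of url, ignoring any query string."""
    return PurePosixPath(urlsplit(url).path).name

url = ("https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF"
       "/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf?download=true")
print(filename_from_url(url))
```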

Executing the distributed inference

root@jetson-01:/llama.cpp/mymodels# cd /llama.cpp/build-rpc
root@jetson-01:/llama.cpp/build-rpc# bin/main -m ../mymodels/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -p "Tell me about AI." --repeat-penalty 1.0 -n 640 --rpc 127.0.0.1:50052,10.0.0.22:50052 -ngl 99

Change the IP addresses to match your own environment.

During execution you can see the model being loaded into the VRAM of both Jetsons.

Result

Log start
main: build = 2999 (b9adcbbf)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu
:
:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 640, n_keep = 1


<s> Tell me about AI.

AI, or artificial intelligence, is a branch of computer science that is concerned with the creation of intelligent agents, which are systems that can reason, learn, and act autonomously. AI systems can be designed to perform a wide range of tasks, from simple tasks like playing a game of chess, to complex tasks like diagnosing medical conditions or driving a car. AI is a rapidly growing field, and it has the potential to revolutionize many areas of society, from healthcare and education to manufacturing and transportation.

There are several different approaches to building AI systems, including rule-based systems, machine learning, and deep learning. Rule-based systems are built using a set of predefined rules and logic, and they are typically used for tasks that involve well-defined problem-solving strategies. Machine learning is a type of AI that involves training algorithms to learn and improve their performance on a task by analyzing large amounts of data. Deep learning is a subset of machine learning that uses artificial neural networks to model and solve complex problems.

One of the key challenges in AI is building systems that can learn and adapt to new situations and environments. This is known as general intelligence, and it is a major goal of AI research. While there have been significant advances in AI in recent years, building a truly general AI system remains a significant challenge.

There are also concerns about the potential risks and ethical implications of AI, such as the possibility of AI systems becoming too powerful or being used for malicious purposes. As a result, there is ongoing debate about how to ensure that AI is developed and used in a responsible and ethical manner.

Overall, AI is a fascinating and rapidly evolving field that has the potential to have a profound impact on many aspects of our lives. Whether you are a researcher, a developer, or just someone who is interested in learning more about AI, there is a wealth of resources available to help you explore this exciting and dynamic field.</s> [end of text]

llama_print_timings:        load time =  179671.51 ms
llama_print_timings:      sample time =      49.16 ms /   403 runs   (    0.12 ms per token,  8197.22 tokens per second)
llama_print_timings: prompt eval time =     671.81 ms /     6 tokens (  111.97 ms per token,     8.93 tokens per second)
llama_print_timings:        eval time =   84008.77 ms /   402 runs   (  208.98 ms per token,     4.79 tokens per second)
llama_print_timings:       total time =   85139.29 ms /   408 tokens
Log end

Model load: 179671.51 ms = 179.67 s = about 3.0 minutes
Input: 111.97 ms per token
Output: 208.98 ms per token

For the Meta-Llama-3-8B-Instruct.Q4_K_M.gguf run above, the figures were

Model load: 8672.15 ms = 8.7 s
Input: 20.28 ms per token
Output: 95.47 ms per token

so this run took

Model load: 20.7 times as long
Input: 5.5 times as long
Output: 2.2 times as long

The models differ in size, so being slower is expected, but a model load time of about 3 minutes is not practical.
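The ratios can be reproduced directly from the two sets of timings. The sketch below only does the division; the dictionary keys are names I made up for this example.

```python
# ms values copied from the two llama_print_timings outputs in this article
llama3_8b = {"load": 8672.15, "input_per_token": 20.28, "output_per_token": 95.47}
mixtral_rpc = {"load": 179671.51, "input_per_token": 111.97, "output_per_token": 208.98}

def slowdown(baseline: dict, other: dict) -> dict:
    """Ratio other/baseline for every metric, rounded to one decimal place."""
    return {key: round(other[key] / baseline[key], 1) for key in baseline}

ratios = slowdown(llama3_8b, mixtral_rpc)
for key, value in ratios.items():
    print(f"{key}: {value}x")
```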

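One of the servers runs on the same Jetson as the client, which makes its transfer faster, and this is still the result. With the client and every server on separate devices it should be even slower.

Sad...

Throughput while the model is being transferred

The transfer runs at 100Mbps. That is fast for browsing the web, but for distributed LLM inference it becomes a bottleneck.

What happens when a model that fits in one unit's VRAM is split across two?

As shown above, the large model was not usable, so I also tried a small model ( tinyllama-1.1b-chat-v1.0.Q8_0.gguf ).

Run on two units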

root@jetson-01:/llama.cpp/build-rpc# bin/main -m ../mymodels/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -p "Tell me about AI." --repeat-penalty 1.0 -n 64 --rpc 127.0.0.1:50052,10.0.0.22:50052 -ngl 99
:
llama_print_timings:        load time =    5952.46 ms
llama_print_timings:      sample time =       1.30 ms /    11 runs   (    0.12 ms per token,  8429.12 tokens per second)
llama_print_timings: prompt eval time =     103.17 ms /     7 tokens (   14.74 ms per token,    67.85 tokens per second)
llama_print_timings:        eval time =     740.73 ms /    10 runs   (   74.07 ms per token,    13.50 tokens per second)
llama_print_timings:       total time =     856.47 ms /    17 tokens

Run on one unit (server and client on the same device = 127.0.0.1)

root@jetson-01:/llama.cpp/build-rpc# bin/main -m ../mymodels/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -p "Tell me about AI." --repeat-penalty 1.0 -n 64 --rpc 127.0.0.1:50052 -ngl 99
:
llama_print_timings:        load time =    1211.68 ms
llama_print_timings:      sample time =       1.41 ms /    12 runs   (    0.12 ms per token,  8510.64 tokens per second)
llama_print_timings: prompt eval time =      51.70 ms /     7 tokens (    7.39 ms per token,   135.39 tokens per second)
llama_print_timings:        eval time =     486.24 ms /    11 runs   (   44.20 ms per token,    22.62 tokens per second)
llama_print_timings:       total time =     550.17 ms /    18 tokens

Run on one unit (server and client on separate devices)

root@jetson-01:/llama.cpp/build-rpc# bin/main -m ../mymodels/TinyLlama-1.1B-Chat-v1.0-GGUF/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -p "Tell me about AI." --repeat-penalty 1.0 -n 64 --rpc 10.0.0.22:50052 -ngl 99
:
llama_print_timings:        load time =   10553.43 ms
llama_print_timings:      sample time =       7.53 ms /    64 runs   (    0.12 ms per token,  8504.98 tokens per second)
llama_print_timings: prompt eval time =      62.09 ms /     7 tokens (    8.87 ms per token,   112.74 tokens per second)
llama_print_timings:        eval time =    2594.82 ms /    63 runs   (   41.19 ms per token,    24.28 tokens per second)
llama_print_timings:       total time =    2730.82 ms /    70 tokens

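When the model is sent over the network, loading takes much longer: 1.2 s with a local server, versus 6.0 s with two servers and 10.6 s with only a remote server.

Distributed inference across two units is less efficient than a single unit, so it takes more than 1.5 times as long: 74.07 vs 44.20 ms per token for output, and 14.74 vs 7.39 ms per token for input.

Impressions

At least in a home environment, it turned out to be far more efficient to use a model that fits in one unit's VRAM and load-balance across the two units than to run distributed inference. Still, I learned a lot, so I am glad I tried it. I encourage you to give it a try too!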
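To put a rough number on that comparison, the sketch below uses the TinyLlama tokens-per-second figures from the runs above. It assumes two independent units each sustain the single-unit rate and that requests are spread evenly, which I have not measured. The round-robin dispatcher is only an illustration of load balancing: the address 10.0.0.21 and port 8080 are hypothetical, and how each unit serves its copy of the model is left open.

```python
from itertools import cycle

# Tokens per second from the TinyLlama runs in this article
SINGLE_LOCAL_TPS = 22.62  # one unit, server on the same device
DISTRIBUTED_TPS = 13.50   # model split across two units

# Assumption: two independent units each sustain the single-unit rate.
load_balanced_tps = 2 * SINGLE_LOCAL_TPS
print(f"load-balanced: {load_balanced_tps:.2f} tok/s, "
      f"distributed: {DISTRIBUTED_TPS:.2f} tok/s")

# Hypothetical endpoints: each unit serving its own full copy of the model.
HOSTS = ["10.0.0.21:8080", "10.0.0.22:8080"]

def make_dispatcher(hosts):
    """Return a function that hands out hosts in round-robin order."""
    ring = cycle(hosts)
    return lambda: next(ring)

next_host = make_dispatcher(HOSTS)
print([next_host() for _ in range(4)])
```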

Oh, and rpc-server crashes every now and then. It may not be stable yet. Here's hoping for future improvements!