Installing faster-whisper on a Jetson Orin Nano NX 16G and transcribing audio

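I wanted to try faster-whisper on a Jetson Orin Nano NX 16G, but installing it turned out to be a lot of trouble. This post records the steps that got it working, and I also built a Docker image from the result.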

A plain install did not work

I ran into errors like the ones below, and it took considerable effort before anything ran.

Error: unable to process file: ./sample.mp4 with exception 'No SGEMM backend on CPU'
Transcription results written to '/data/models/whisper' directory

Traceback (most recent call last):
  File "/root/./faster-whiper.py", line 1, in <module>
    from faster_whisper import WhisperModel
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/__init__.py", line 2, in <module>
    from faster_whisper.transcribe import WhisperModel
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 10, in <module>
    import ctranslate2
  File "/usr/local/lib/python3.10/dist-packages/ctranslate2/__init__.py", line 21, in <module>
    from ctranslate2._ext import (
ImportError: libctranslate2.so.4: cannot open shared object file: No such file or directory

Traceback (most recent call last):
  File "/root/./faster-whiper.py", line 6, in <module>
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 144, in __init__
    self.model = ctranslate2.models.Whisper(
RuntimeError: CUDA failed with error CUDA driver version is insufficient for CUDA runtime version

Starting a container from the whisper_trt image

The Python version and other details of the whisper_trt image suited this setup, so I started the container from it.

# sudo docker run --runtime nvidia -it --network=host dustynv/whisper_trt:r36.3.0

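Building and installing CTranslate2

The ctranslate2 package that pip installs could not be used with faster-whisper in this environment, so we build it ourselves.

There are only a few commands to run, but building the library takes a very long time (3 to 4 hours).

Installing the library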

# cd /
# git clone --recursive https://github.com/OpenNMT/CTranslate2.git
# cd CTranslate2
# cmake -Bbuild_folder -DWITH_MKL=OFF -DOPENMP_RUNTIME=NONE -DWITH_CUDA=ON -DWITH_CUDNN=ON
# cmake --build build_folder
# cd build_folder
# make install
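
The ImportError shown earlier (libctranslate2.so.4: cannot open shared object file) means the dynamic loader could not find the shared library. After make install, a quick check along the following lines can confirm that the library is visible. This is a minimal sketch and not part of the original procedure: the helper name find_ctranslate2_lib is made up for illustration, and it uses only Python's standard ctypes.util.find_library. If the library is reported missing, refreshing the loader cache with ldconfig is a common remedy, assuming the default /usr/local install prefix was used.

```python
import ctypes.util


def find_ctranslate2_lib():
    """Return the CTranslate2 shared library name the system can locate, or None.

    Hypothetical helper for illustration only.
    """
    # On Linux, find_library consults external tools such as ldconfig
    # to locate a library by its short name ("ctranslate2" -> libctranslate2.so.*).
    return ctypes.util.find_library("ctranslate2")


if __name__ == "__main__":
    lib = find_ctranslate2_lib()
    if lib is None:
        print("libctranslate2 was not found; running ldconfig may help")
    else:
        print("found:", lib)
```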

Setting up and installing the Python library

# cd ../python
# pip3 install -r install_requirements.txt
# python3 setup.py bdist_wheel
# pip3 install dist/*.whl

Installing faster-whisper

# pip3 install faster-whisper
:
Requirement already satisfied: ctranslate2<5,>=4.0 in /usr/local/lib/python3.10/dist-packages (from faster-whisper) (4.3.0)
Collecting huggingface-hub>=0.13 (from faster-whisper)
  Downloading http://jetson.webredirect.org/root/pypi/%2Bf/487/27a16e704d409/huggingface_hub-0.23.2-py3-none-any.whl (401 kB)
:

The important point in the install log is that ctranslate2 is reported as "already satisfied", which shows that pip recognizes the ctranslate2 we just built and installed.
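
The same check can be made without reading the pip log. The sketch below looks up the installed ctranslate2 version and compares it with the range printed in the log above (ctranslate2<5,>=4.0). The function names are invented for this example, and the comparison only looks at the major version number, which is enough for that range but is not a general version parser.

```python
from importlib import metadata


def satisfies_faster_whisper(version):
    """Check a version string against the range ctranslate2<5,>=4.0 from the pip log.

    Hypothetical helper: only the major version number is inspected.
    """
    major = int(version.split(".")[0])
    return 4 <= major < 5


def installed_ctranslate2_version():
    """Return the installed ctranslate2 version, or None if the package is not installed."""
    try:
        return metadata.version("ctranslate2")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_ctranslate2_version()
    if version is None:
        print("ctranslate2 is not installed")
    elif satisfies_faster_whisper(version):
        print(version, "is within the required range")
    else:
        print(version, "is outside the required range")
```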

Transcribing with faster-whisper

Download an audio file to transcribe

# cd /root
# wget https://pro-video.jp/voice/announce/mp3/ohayo01mayu.mp3

The audio file comes from SCL学びのソリューション (service name: ボイスプロ).

Many thanks for the lovely audio.

Sample code

# nano faster-whiper.py
from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

segments, info = model.transcribe("ohayo01mayu.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
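
The sample lists three compute_type choices in its comments. As a rough sketch of how one might choose between them automatically, the code below asks CTranslate2 which compute types the CUDA device supports and picks the first match from a preference list. pick_compute_type and its preference order are assumptions made for this example; ctranslate2.get_supported_compute_types is the only library call used, and the script just prints a message if the query fails.

```python
def pick_compute_type(supported, preferred=("float16", "int8_float16", "int8")):
    """Return the first preferred compute type found in `supported`, or None.

    Hypothetical helper; the preference order mirrors the options in the sample code.
    """
    for name in preferred:
        if name in supported:
            return name
    return None


if __name__ == "__main__":
    try:
        import ctranslate2

        # Returns the set of compute types usable on the given device.
        supported = ctranslate2.get_supported_compute_types("cuda")
    except Exception as exc:  # import failure or no usable CUDA device
        print("could not query CUDA compute types:", exc)
    else:
        print("suggested compute_type:", pick_compute_type(supported))
```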

Running the sample code

# python3 ./faster-whiper.py
config.json: 100%|████████████████████████████████████████████████████████| 2.39k/2.39k [00:00<00:00, 6.03MB/s]
preprocessor_config.json: 100%|████████████████████████████████████████████████| 340/340 [00:00<00:00, 884kB/s]
vocabulary.json: 100%|████████████████████████████████████████████████████| 1.07M/1.07M [00:00<00:00, 1.65MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████| 2.48M/2.48M [00:01<00:00, 2.46MB/s]
model.bin: 100%|██████████████████████████████████████████████████████████| 3.09G/3.09G [00:41<00:00, 75.1MB/s]
Detected language 'ja' with probability 0.952637
[0.00s -> 0.80s] おはよう

The audio was transcribed correctly! 🍺
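
One detail worth knowing about the sample: the segments value returned by model.transcribe is a generator, so the transcription work happens while the for loop runs, and the result can be iterated only once. If the lines are needed again later, they can be collected into a list. The sketch below shows the idea with stand-in objects rather than a real model; format_segment and collect_lines are names made up for this example, and the stand-in only mimics the start, end, and text attributes used by the sample code.

```python
from types import SimpleNamespace


def format_segment(segment):
    """Format one segment the same way the sample code prints it."""
    return "[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text)


def collect_lines(segments):
    """Consume the segments iterable once and return the formatted lines as a list."""
    return [format_segment(s) for s in segments]


if __name__ == "__main__":
    # Stand-in generator instead of real model output (assumption for illustration).
    fake_segments = (s for s in [SimpleNamespace(start=0.0, end=0.8, text="おはよう")])
    lines = collect_lines(fake_segments)
    print("\n".join(lines))  # prints: [0.00s -> 0.80s] おはよう
```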

Docker image available

The image built in this post is published at:

https://github.com/microaijp/jetson-faster-whisper/pkgs/container/jetson-faster-whisper