Stable: v1.7.2 / Roadmap | F.A.Q.
High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:
Supported platforms:
The entire high-level implementation of the model is contained in whisper.h and whisper.cpp. The rest of the code is part of the ggml machine learning library.
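To give a concrete sense of the API surface, here is a minimal sketch against the C-style API declared in whisper.h (a sketch only: the model path is a placeholder and audio loading is omitted; consult whisper.h for the authoritative signatures):

```cpp
// Minimal transcription sketch using the whisper.h C API.
// Assumes a ggml model on disk and 16 kHz mono float PCM already in memory.
#include "whisper.h"

#include <cstdio>
#include <vector>

int main() {
    // load the model
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    // pcmf32 must hold 16 kHz mono audio as 32-bit floats (loading omitted)
    std::vector<float> pcmf32;

    // run the full encoder + decoder pipeline with default (greedy) sampling
    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) != 0) {
        whisper_free(ctx);
        return 1;
    }

    // print the transcribed segments
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("%s\n", whisper_full_get_segment_text(ctx, i));
    }

    whisper_free(ctx);
    return 0;
}
```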
Having such a lightweight implementation of the model allows to easily integrate it in different platforms and applications. As an example, here is a video of the model running on an iPhone 13 device - fully offline, on-device: whisper.objc
You can also easily make your own offline voice assistant application: command
On Apple Silicon, the inference runs fully on the GPU via Metal:
Or you can even run it straight in the browser: talk.wasm
The tensor operators are heavily optimized for Apple Silicon CPUs. Depending on the computation size, ARM NEON SIMD intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes, since the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
First clone the repository:
git clone https://github.com/ggerganov/whisper.cpp.git
Navigate into the directory:
cd whisper.cpp
Then, download one of the Whisper models converted in ggml format. For example:
sh ./models/download-ggml-model.sh base.en
Now build the main example and transcribe an audio file like this:
# build the main example
make -j
# transcribe an audio file
./main -f samples/jfk.wav
For a quick demo, simply run make base.en:
$ make -j base.en
cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
./main -h
usage: ./main [options] file0.wav file1.wav ...
options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-sow, --split-on-word [false ] split on word rather than on token
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [5 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-debug, --debug-mode [false ] enable debug mode (eg. dump log_mel)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-tdrz, --tinydiarize [false ] enable tinydiarize (requires a tdrz model)
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-olrc, --output-lrc [false ] output result in a lrc file
-owts, --output-words [false ] output script for generating karaoke video
-fp, --font-path [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-oj, --output-json [false ] output result in a JSON file
-ojf, --output-json-full [false ] include more information in the JSON file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [false ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
-dl, --detect-language [false ] exit after automatically detecting language
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path
-oved D, --ov-e-device DNAME [CPU ] the OpenVINO device used for encode inference
-ls, --log-score [false ] log best decoder scores of tokens
-ng, --no-gpu [false ] disable GPU
sh ./models/download-ggml-model.sh base.en
Downloading ggml model base.en ...
ggml-base.en.bin 100%[========================>] 141.11M 6.34MB/s in 24s
Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
You can now use it like this:
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
===============================================
Running base.en on all samples in ./samples ...
===============================================
----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------
whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 113.81 ms
whisper_print_timings: mel time = 15.40 ms
whisper_print_timings: sample time = 11.58 ms / 27 runs ( 0.43 ms per run)
whisper_print_timings: encode time = 266.60 ms / 1 runs ( 266.60 ms per run)
whisper_print_timings: decode time = 66.11 ms / 27 runs ( 2.45 ms per run)
whisper_print_timings: total time = 476.31 ms
The command downloads the base.en model converted to a custom ggml format and runs inference on all the .wav samples in the folder samples.
For detailed usage instructions, run: ./main -h
Note that the main example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool. For example, you can use ffmpeg like this:
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
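If you are calling the library directly instead of going through ./main, note that whisper_full consumes 16 kHz mono audio as 32-bit floats, so 16-bit WAV samples need a normalization pass first. A minimal sketch of that conversion (assuming the WAV has already been decoded into int16_t samples):

```cpp
#include <cstdint>
#include <vector>

// Convert signed 16-bit PCM to the normalized 32-bit float samples
// expected by whisper_full (values in [-1.0f, 1.0f]).
std::vector<float> pcm16_to_f32(const std::vector<int16_t> & pcm16) {
    std::vector<float> pcmf32(pcm16.size());
    for (size_t i = 0; i < pcm16.size(); ++i) {
        pcmf32[i] = pcm16[i] / 32768.0f;
    }
    return pcmf32;
}
```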
If you want some extra audio samples to play with, simply run:
make -j samples
This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via ffmpeg.
You can download and run the other models as follows:
make -j tiny.en
make -j tiny
make -j base.en
make -j base
make -j small.en
make -j small
make -j medium.en
make -j medium
make -j large-v1
make -j large-v2
make -j large-v3
make -j large-v3-turbo
Model | Disk | Mem
---|---|---
tiny | 75 MiB | ~273 MB
base | 142 MiB | ~388 MB
small | 466 MiB | ~852 MB
medium | 1.5 GiB | ~2.1 GB
large | 2.9 GiB | ~3.9 GB
whisper.cpp supports integer quantization of the Whisper ggml models. Quantized models require less memory and disk space and, depending on the hardware, can be processed more efficiently.
Here are the steps for creating and using a quantized model:
# quantize a model with Q5_0 method
make -j quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
# run the examples as usual, specifying the quantized model file
./main -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav
On Apple Silicon devices, the Encoder inference can be executed on the Apple Neural Engine (ANE) via Core ML. This can result in significant speed-up - more than 3x faster compared with CPU-only execution. Here are the instructions for generating a Core ML model and using it with whisper.cpp:
Install the Python dependencies needed for the creation of the Core ML model:
pip install ane_transformers
pip install openai-whisper
pip install coremltools
To ensure coremltools operates correctly, please confirm that Xcode is installed and execute xcode-select --install to install the command-line tools. Python 3.10 is recommended; it is convenient to use a dedicated environment, for example via conda:
conda create -n py310-whisper python=3.10 -y
conda activate py310-whisper
Generate a Core ML model. For example, to generate a base.en model, use:
./models/generate-coreml-model.sh base.en
This will generate the folder models/ggml-base.en-encoder.mlmodelc
Build whisper.cpp with Core ML support:
# using Makefile
make clean
WHISPER_COREML=1 make -j
# using CMake
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
Run the examples as usual. For example:
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
...
whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |
...
The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format. Subsequent runs are faster.
For more information about the Core ML implementation, please refer to PR #566.
On platforms that support OpenVINO, the Encoder inference can be executed on OpenVINO-supported devices including x86 CPUs and Intel GPUs (integrated and discrete).
This can result in a significant speedup in encoder performance. Here are the instructions for generating an OpenVINO model and using it with whisper.cpp:
First, set up a Python virtual environment and install the Python dependencies. Python 3.10 is recommended.
Windows:
cd models
python -m venv openvino_conv_env
openvino_conv_env\Scripts\activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt
Linux and macOS:
cd models
python3 -m venv openvino_conv_env
source openvino_conv_env/bin/activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt
Generate an OpenVINO encoder model. For example, to generate a base.en model, use:
python convert-whisper-to-openvino.py --model base.en
This will produce the ggml-base.en-encoder-openvino.xml/.bin IR model files. It is recommended to relocate these to the same folder as the ggml models, as that is the default location the OpenVINO extension will search at runtime.
Build whisper.cpp with OpenVINO support:
Download the OpenVINO package from the release page. The recommended version to use is 2023.0.0.
After downloading and extracting the package onto your development system, set up the required environment by sourcing the setupvars script. For example:
Linux:
source /path/to/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh
Windows (cmd):
C:\Path\To\w_openvino_toolkit_windows_2023.0.0.10926.b4452d56304_x86_64\setupvars.bat
And then build the project using CMake:
cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release
Run the examples as usual. For example:
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
...
whisper_ctx_init_openvino_encoder: loading OpenVINO model from 'models/ggml-base.en-encoder-openvino.xml'
whisper_ctx_init_openvino_encoder: first run on a device may take a while ...
whisper_openvino_init: path_model = models/ggml-base.en-encoder-openvino.xml, device = GPU, cache_dir = models/ggml-base.en-encoder-openvino-cache
whisper_ctx_init_openvino_encoder: OpenVINO model loaded
system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 1 |
...
The first run on an OpenVINO device is slow, since the OpenVINO framework compiles the IR (Intermediate Representation) model to a device-specific "blob". This device-specific blob gets cached for subsequent runs.
For more information about the OpenVINO implementation, please refer to PR #1037.
With an NVIDIA card, the processing of the models is done efficiently on the GPU via cuBLAS and custom CUDA kernels. First, make sure you have installed CUDA: https://developer.nvidia.com/cuda-downloads
Now build whisper.cpp with CUDA support:
make clean
GGML_CUDA=1 make -j
Vulkan is a cross-vendor solution which allows you to accelerate the workload on your GPU. First, make sure your graphics card driver provides support for the Vulkan API.
Now build whisper.cpp with Vulkan support:
make clean
make GGML_VULKAN=1 -j
Encoder processing can be accelerated on the CPU via OpenBLAS. First, make sure you have installed openblas: https://www.openblas.net/
Now build whisper.cpp with OpenBLAS support:
make clean
GGML_OPENBLAS=1 make -j
Encoder processing can be accelerated on the CPU via the BLAS-compatible interface of Intel's Math Kernel Library. First, make sure you have installed Intel's MKL runtime and development packages: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html
Now build whisper.cpp with Intel MKL BLAS support:
source /opt/intel/oneapi/setvars.sh
mkdir build
cd build
cmake -DWHISPER_MKL=ON ..
WHISPER_MKL=1 make -j
Ascend NPU provides inference acceleration via CANN and AI cores.
First, check whether your Ascend NPU device is supported:
Verified devices

Ascend NPU | Status
---|---
Atlas 300T A2 | Support
Then, make sure you have installed the CANN toolkit. The latest version of CANN is recommended.
Now build whisper.cpp with CANN support:
mkdir build
cd build
cmake .. -D GGML_CANN=on
make -j
Run the inference examples as usual, for example:
./build/bin/main -f samples/jfk.wav -m models/ggml-base.en.bin -t 8
Notes:
- If you run successfully with your Ascend NPU device, please help update the Verified devices table above.

We have two Docker images available for this project:
ghcr.io/ggerganov/whisper.cpp:main: This image includes the main executable file as well as curl and ffmpeg. (platforms: linux/amd64, linux/arm64)
ghcr.io/ggerganov/whisper.cpp:main-cuda: Same as main but compiled with CUDA support. (platforms: linux/amd64)

# download model and persist it in a local folder
docker run -it --rm \
  -v path/to/models:/models \
  whisper.cpp:main "./models/download-ggml-model.sh base /models"
# transcribe an audio file
docker run -it --rm \
  -v path/to/models:/models \
  -v path/to/audios:/audios \
  whisper.cpp:main "./main -m /models/ggml-base.bin -f /audios/jfk.wav"
# transcribe an audio file in samples folder
docker run -it --rm \
  -v path/to/models:/models \
  whisper.cpp:main "./main -m /models/ggml-base.bin -f ./samples/jfk.wav"
You can install pre-built binaries for whisper.cpp or build it from source using Conan. Use the following command:
conan install --requires="whisper-cpp/[*]" --build=missing
For detailed instructions on how to use Conan, please refer to the Conan documentation.
Here is another example, transcribing a 3:24 min speech in about half a minute on a MacBook M1 Pro using the medium.en model:
$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8
whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1720.00 MB (+ 43.00 MB per decoder)
whisper_model_load: kv self size = 42.00 MB
whisper_model_load: kv cross size = 140.62 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:08.000 --> 00:00:17.000] At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
[00:00:17.000 --> 00:00:23.000] A short time later, debris was seen falling from the skies above Texas.
[00:00:23.000 --> 00:00:29.000] The Columbia's lost. There are no survivors.
[00:00:29.000 --> 00:00:32.000] On board was a crew of seven.
[00:00:32.000 --> 00:00:39.000] Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
[00:00:39.000 --> 00:00:48.000] Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
[00:00:48.000 --> 00:00:52.000] a colonel in the Israeli Air Force.
[00:00:52.000 --> 00:00:58.000] These men and women assumed great risk in the service to all humanity.
[00:00:58.000 --> 00:01:03.000] In an age when space flight has come to seem almost routine,
[00:01:03.000 --> 00:01:07.000] it is easy to overlook the dangers of travel by rocket
[00:01:07.000 --> 00:01:12.000] and the difficulties of navigating the fierce outer atmosphere of the Earth.
[00:01:12.000 --> 00:01:18.000] These astronauts knew the dangers, and they faced them willingly,
[00:01:18.000 --> 00:01:23.000] knowing they had a high and noble purpose in life.
[00:01:23.000 --> 00:01:31.000] Because of their courage and daring and idealism, we will miss them all the more.
[00:01:31.000 --> 00:01:36.000] All Americans today are thinking as well of the families of these men and women
[00:01:36.000 --> 00:01:40.000] who have been given this sudden shock and grief.
[00:01:40.000 --> 00:01:45.000] You're not alone. Our entire nation grieves with you,
[00:01:45.000 --> 00:01:52.000] and those you love will always have the respect and gratitude of this country.
[00:01:52.000 --> 00:01:56.000] The cause in which they died will continue.
[00:01:56.000 --> 00:02:04.000] Mankind is led into the darkness beyond our world by the inspiration of discovery
[00:02:04.000 --> 00:02:11.000] and the longing to understand. Our journey into space will go on.
[00:02:11.000 --> 00:02:16.000] In the skies today, we saw destruction and tragedy.
[00:02:16.000 --> 00:02:22.000] Yet farther than we can see, there is comfort and hope.
[00:02:22.000 --> 00:02:29.000] In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
[00:02:29.000 --> 00:02:35.000] who created all these. He who brings out the starry hosts one by one
[00:02:35.000 --> 00:02:39.000] and calls them each by name."
[00:02:39.000 --> 00:02:46.000] Because of His great power and mighty strength, not one of them is missing.
[00:02:46.000 --> 00:02:55.000] The same Creator who names the stars also knows the names of the seven souls we mourn today.
[00:02:55.000 --> 00:03:01.000] The crew of the shuttle Columbia did not return safely to earth,
[00:03:01.000 --> 00:03:05.000] yet we can pray that all are safely home.
[00:03:05.000 --> 00:03:13.000] May God bless the grieving families, and may God continue to bless America.
[00:03:13.000 --> 00:03:19.000] [Silence]
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: load time = 569.03 ms
whisper_print_timings: mel time = 146.85 ms
whisper_print_timings: sample time = 238.66 ms / 553 runs ( 0.43 ms per run)
whisper_print_timings: encode time = 18665.10 ms / 9 runs ( 2073.90 ms per run)
whisper_print_timings: decode time = 13090.93 ms / 549 runs ( 23.85 ms per run)
whisper_print_timings: total time = 32733.52 ms
This is a naive example of performing real-time inference on audio from your microphone. The stream tool samples the audio every half a second and runs the transcription continuously. More info is available in issue #10.
make stream -j
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
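Conceptually, the stream tool keeps appending captured samples to a sliding window and re-runs whisper_full on it. A rough sketch of that loop is shown below (capture_new_samples and trim_window are hypothetical helpers standing in for the actual audio capture code; ctx is an initialized whisper_context):

```cpp
// Naive sliding-window transcription loop (a sketch, not the actual stream code).
struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.no_context     = true; // do not condition on previously decoded text
wparams.single_segment = true; // emit one segment per step

std::vector<float> window; // most recent audio, 16 kHz mono float

while (running) {
    capture_new_samples(window); // hypothetical: append ~500 ms of microphone audio
    trim_window(window, 5);      // hypothetical: keep only the last ~5 s

    if (whisper_full(ctx, wparams, window.data(), (int) window.size()) == 0) {
        printf("\33[2K\r%s", whisper_full_get_segment_text(ctx, 0));
        fflush(stdout);
    }
}
```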
Adding the --print-colors argument will print the transcribed text using an experimental color coding strategy to highlight words with high or low confidence:
./main -m models/ggml-base.en.bin -f samples/gb0.wav --print-colors
For example, to limit the line length to a maximum of 16 characters, simply add -ml 16:
$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.850] And so my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:04.140] Americans, ask
[00:00:04.140 --> 00:00:05.660] not what your
[00:00:05.660 --> 00:00:06.840] country can do
[00:00:06.840 --> 00:00:08.430] for you, ask
[00:00:08.430 --> 00:00:09.440] what you can do
[00:00:09.440 --> 00:00:10.020] for your
[00:00:10.020 --> 00:00:11.000] country.
The --max-len argument can be used to obtain word-level timestamps. Simply use -ml 1:
$ ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
whisper_model_load: loading model from './models/ggml-base.en.bin'
...
system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:00.320]
[00:00:00.320 --> 00:00:00.370] And
[00:00:00.370 --> 00:00:00.690] so
[00:00:00.690 --> 00:00:00.850] my
[00:00:00.850 --> 00:00:01.590] fellow
[00:00:01.590 --> 00:00:02.850] Americans
[00:00:02.850 --> 00:00:03.300] ,
[00:00:03.300 --> 00:00:04.140] ask
[00:00:04.140 --> 00:00:04.990] not
[00:00:04.990 --> 00:00:05.410] what
[00:00:05.410 --> 00:00:05.660] your
[00:00:05.660 --> 00:00:06.260] country
[00:00:06.260 --> 00:00:06.600] can
[00:00:06.600 --> 00:00:06.840] do
[00:00:06.840 --> 00:00:07.010] for
[00:00:07.010 --> 00:00:08.170] you
[00:00:08.170 --> 00:00:08.190] ,
[00:00:08.190 --> 00:00:08.430] ask
[00:00:08.430 --> 00:00:08.910] what
[00:00:08.910 --> 00:00:09.040] you
[00:00:09.040 --> 00:00:09.320] can
[00:00:09.320 --> 00:00:09.440] do
[00:00:09.440 --> 00:00:09.760] for
[00:00:09.760 --> 00:00:10.020] your
[00:00:10.020 --> 00:00:10.510] country
[00:00:10.510 --> 00:00:11.000] .
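The same word-level output can be obtained programmatically. A sketch, assuming a loaded context and float PCM as in the earlier API example:

```cpp
// Request experimental per-token timestamps and one word per segment.
struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.token_timestamps = true; // enable timestamp estimation per token
wparams.max_len          = 1;    // maximum segment length of 1 character
wparams.split_on_word    = true; // split on word rather than on token

if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) == 0) {
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        // segment timestamps are in units of 10 ms
        const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
        const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
        printf("[%lld --> %lld] %s\n", (long long) t0, (long long) t1,
               whisper_full_get_segment_text(ctx, i));
    }
}
```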
More information about this approach is available here: #1058
Sample usage:
# download a tinydiarize compatible model
./models/download-ggml-model.sh small.en-tdrz

# run as usual, adding the "-tdrz" command-line argument
./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz
...
main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...
...
[00:00:00.000 --> 00:00:03.800]   Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260]   Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320]   We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820]   Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100]   Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020]   So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.
The main example provides support for output of karaoke-style movies, where the currently pronounced word is highlighted. Use the -owts argument and run the generated bash script. This requires ffmpeg to be installed.
Here are a few "typical" examples:
./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
source ./samples/jfk.wav.wts
ffplay ./samples/jfk.wav.mp4
./main -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
source ./samples/mm0.wav.wts
ffplay ./samples/mm0.wav.mp4
./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
source ./samples/gb0.wav.wts
ffplay ./samples/gb0.wav.mp4
Use the scripts/bench-wts.sh script to generate a video in the following format:
./scripts/bench-wts.sh samples/jfk.wav
ffplay ./samples/jfk.wav.all.mp4
In order to have an objective comparison of the performance of inference across different system configurations, use the bench tool. The tool simply runs the Encoder part of the model and prints how much time it took to execute it. The results are summarized in the following GitHub issue:
Benchmark results
Additionally, a script for running whisper.cpp with different models and audio files is provided.
You can run it with the following command; by default it will run against any standard model in the models folder.
python3 scripts/bench.py -f samples/jfk.wav -t 2,4,8 -p 1,2
It is written in Python, with the intention of being easy to modify and extend for your benchmarking use case.
It outputs a CSV file with the results of the benchmarking.
The original models are converted to a custom binary ggml format. This allows packing everything needed into a single file:
- model parameters
- mel filters
- vocabulary
- weights
You can download the converted models using the models/download-ggml-model.sh script or manually from here:
For more details, see the conversion script models/convert-pt-to-ggml.py or models/README.md.
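As an illustration of the single-file layout, here is a sketch that reads the start of a ggml model file. The magic value and the assumption that the hyperparameters follow as int32 values in the order printed by whisper_model_load above are based on the current loader and are not a stable format specification:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    FILE * f = fopen("models/ggml-base.en.bin", "rb");
    if (f == nullptr) {
        return 1;
    }

    uint32_t magic = 0;
    fread(&magic, sizeof(magic), 1, f);
    printf("magic: 0x%08x\n", magic); // "ggml" == 0x67676d6c

    // assumption: hyperparameters follow as int32 values, in the order
    // shown in the whisper_model_load logs (n_vocab, n_audio_ctx, ...)
    int32_t n_vocab = 0, n_audio_ctx = 0;
    fread(&n_vocab,     sizeof(n_vocab),     1, f);
    fread(&n_audio_ctx, sizeof(n_audio_ctx), 1, f);
    printf("n_vocab = %d, n_audio_ctx = %d\n", n_vocab, n_audio_ctx);

    fclose(f);
    return 0;
}
```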
There are various examples of using the library for different projects in the examples folder. Some of the examples are even ported to run in the browser using WebAssembly. Check them out!
Example | Web | Description
---|---|---
main | whisper.wasm | Tool for translating and transcribing audio using Whisper
bench | bench.wasm | Benchmark the performance of Whisper on your machine
stream | stream.wasm | Real-time transcription of raw microphone capture
command | command.wasm | Basic voice assistant example for receiving voice commands from the mic
wchess | wchess.wasm | Voice-controlled chess
talk | talk.wasm | Talk with a GPT-2 bot
talk-llama | | Talk with a LLaMA bot
whisper.objc | | iOS mobile application using whisper.cpp
whisper.swiftui | | SwiftUI iOS / macOS application using whisper.cpp
whisper.android | | Android mobile application using whisper.cpp
whisper.nvim | | Speech-to-text plugin for Neovim
generate-karaoke.sh | | Helper script to easily generate a karaoke video of raw audio capture
livestream.sh | | Livestream audio transcription
yt-wsp.sh | | Download + transcribe and/or translate any VOD (original)
server | | HTTP transcription server with OAI-like API
If you have any kind of feedback about this project, feel free to use the Discussions section and open a new topic. You can use the Show and tell category to share your own projects that use whisper.cpp. If you have a question, make sure to check the Frequently asked questions (#126) discussion.