Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?
This benchmark uses llama.cpp to test the inference speed of LLaMA 3 models on different GPUs on RunPod, and on a 13-inch M1 MacBook Air, 14-inch M1 Max MacBook Pro, M2 Ultra Mac Studio, and 16-inch M3 Max MacBook Pro.
Average speed (tokens/s) of generating 1024 tokens by GPUs on LLaMA 3. Higher speed is better.
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
3070 8GB | 70.94 | OOM | OOM | OOM |
3080 10GB | 106.40 | OOM | OOM | OOM |
3080Ti 12GB | 106.71 | OOM | OOM | OOM |
4070Ti 12GB | 82.21 | OOM | OOM | OOM |
4080 16GB | 106.22 | 40.29 | OOM | OOM |
RTX 4000 Ada 20GB | 58.59 | 20.85 | OOM | OOM |
3090 24GB | 111.74 | 46.51 | OOM | OOM |
4090 24GB | 127.74 | 54.34 | OOM | OOM |
RTX 5000 Ada 32GB | 89.87 | 32.67 | OOM | OOM |
3090 24GB * 2 | 108.07 | 47.15 | 16.29 | OOM |
4090 24GB * 2 | 122.56 | 53.27 | 19.06 | OOM |
RTX A6000 48GB | 102.22 | 40.25 | 14.58 | OOM |
RTX 6000 Ada 48GB | 130.99 | 51.97 | 18.36 | OOM |
A40 48GB | 88.95 | 33.95 | 12.08 | OOM |
L40S 48GB | 113.60 | 43.42 | 15.31 | OOM |
RTX 4000 Ada 20GB * 4 | 56.14 | 20.58 | 7.33 | OOM |
A100 PCIe 80GB | 138.31 | 54.56 | 22.11 | OOM |
A100 SXM 80GB | 133.38 | 53.18 | 24.33 | OOM |
H100 PCIe 80GB | 144.49 | 67.79 | 25.01 | OOM |
3090 24GB * 4 | 104.94 | 46.40 | 16.89 | OOM |
4090 24GB * 4 | 117.61 | 52.69 | 18.83 | OOM |
RTX 5000 Ada 32GB * 4 | 82.73 | 31.94 | 11.45 | OOM |
3090 24GB * 6 | 101.07 | 45.55 | 16.93 | 5.82 |
4090 24GB * 8 | 116.13 | 52.12 | 18.76 | 6.45 |
RTX A6000 48GB * 4 | 93.73 | 38.87 | 14.32 | 4.74 |
RTX 6000 Ada 48GB * 4 | 118.99 | 50.25 | 17.96 | 6.06 |
A40 48GB * 4 | 83.79 | 33.28 | 11.91 | 3.98 |
L40S 48GB * 4 | 105.72 | 42.48 | 14.99 | 5.03 |
A100 PCIe 80GB * 4 | 117.30 | 51.54 | 22.68 | 7.38 |
A100 SXM 80GB * 4 | 97.70 | 45.45 | 19.60 | 6.92 |
H100 PCIe 80GB * 4 | 118.14 | 62.90 | 26.20 | 9.63 |
M1 7-Core GPU 8GB | 9.72 | OOM | OOM | OOM |
M1 Max 32-Core GPU 64GB | 34.49 | 18.43 | 4.09 | OOM |
M2 Ultra 76-Core GPU 192GB | 76.28 | 36.25 | 12.13 | 4.71 |
M3 Max 40-Core GPU 64GB | 50.74 | 22.39 | 7.53 | OOM |
Average prompt eval speed (tokens/s) for a 1024-token prompt by GPUs on LLaMA 3. Higher speed is better.
GPU | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
---|---|---|---|---|
3070 8GB | 2283.62 | OOM | OOM | OOM |
3080 10GB | 3557.02 | OOM | OOM | OOM |
3080Ti 12GB | 3556.67 | OOM | OOM | OOM |
4070Ti 12GB | 3653.07 | OOM | OOM | OOM |
4080 16GB | 5064.99 | 6758.90 | OOM | OOM |
RTX 4000 Ada 20GB | 2310.53 | 2951.87 | OOM | OOM |
3090 24GB | 3865.39 | 4239.64 | OOM | OOM |
4090 24GB | 6898.71 | 9056.26 | OOM | OOM |
RTX 5000 Ada 32GB | 4467.46 | 5835.41 | OOM | OOM |
3090 24GB * 2 | 4004.14 | 4690.50 | 393.89 | OOM |
4090 24GB * 2 | 8545.00 | 11094.51 | 905.38 | OOM |
RTX A6000 48GB | 3621.81 | 4315.18 | 466.82 | OOM |
RTX 6000 Ada 48GB | 5560.94 | 6205.44 | 547.03 | OOM |
A40 48GB | 3240.95 | 4043.05 | 239.92 | OOM |
L40S 48GB | 5908.52 | 2491.65 | 649.08 | OOM |
RTX 4000 Ada 20GB * 4 | 3369.24 | 4366.64 | 306.44 | OOM |
A100 PCIe 80GB | 5800.48 | 7504.24 | 726.65 | OOM |
A100 SXM 80GB | 5863.92 | 681.47 | 796.81 | OOM |
H100 PCIe 80GB | 7760.16 | 10342.63 | 984.06 | OOM |
3090 24GB * 4 | 4653.93 | 5713.41 | 350.06 | OOM |
4090 24GB * 4 | 9609.29 | 12304.19 | 898.17 | OOM |
RTX 5000 Ada 32GB * 4 | 6530.78 | 2877.66 | 541.54 | OOM |
3090 24GB * 6 | 5153.05 | 5952.55 | 739.40 | 927.23 |
4090 24GB * 8 | 9706.82 | 11818.92 | 1336.26 | 1890.48 |
RTX A6000 48GB * 4 | 5340.10 | 6448.85 | 539.20 | 792.23 |
RTX 6000 Ada 48GB * 4 | 9679.55 | 12637.94 | 714.93 | 1270.39 |
A40 48GB * 4 | 4841.98 | 5931.06 | 263.36 | 900.79 |
L40S 48GB * 4 | 9008.27 | 2541.61 | 634.05 | 1478.83 |
A100 PCIe 80GB * 4 | 8889.35 | 11670.74 | 978.06 | 1733.41 |
A100 SXM 80GB * 4 | 7782.25 | 674.11 | 539.08 | 1834.16 |
H100 PCIe 80GB * 4 | 11560.23 | 15612.81 | 1133.23 | 2420.10 |
M1 7-Core GPU 8GB | 87.26 | OOM | OOM | OOM |
M1 Max 32-Core GPU 64GB | 355.45 | 418.77 | 33.01 | OOM |
M2 Ultra 76-Core GPU 192GB | 1023.89 | 1202.74 | 117.76 | 145.82 |
M3 Max 40-Core GPU 64GB | 678.04 | 751.49 | 62.88 | OOM |
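The two tables above can be combined into a rough end-to-end latency estimate: time to process the prompt plus time to generate the reply. The sketch below is only an illustration. The function name `estimate_latency_s` is hypothetical, and it assumes the 1024-token speeds hold for the whole request, which ignores the slowdown at longer contexts.

```python
def estimate_latency_s(prompt_tokens, gen_tokens, pp_speed, tg_speed):
    """Rough wall-clock estimate: prompt processing time plus generation time."""
    if pp_speed <= 0 or tg_speed <= 0:
        raise ValueError("speeds must be positive")
    return prompt_tokens / pp_speed + gen_tokens / tg_speed


# Values copied from the tables above: single 4090 24GB, LLaMA 3 8B Q4_K_M.
PP_SPEED_4090 = 6898.71  # tokens/s for a 1024-token prompt
TG_SPEED_4090 = 127.74   # tokens/s when generating 1024 tokens

latency = estimate_latency_s(1024, 1024, PP_SPEED_4090, TG_SPEED_4090)
print(f"Estimated latency: {latency:.2f} s")
```

With these numbers the estimate is a little over 8 seconds, and almost all of it is generation time, which is why the text-generation column matters most for interactive use.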
Thanks to shawwn for the LLaMA model weights (7B, 13B, 30B, 65B): llama-dl. Access LLaMA 2 from Meta AI. Access LLaMA 3 from Meta Llama 3 on Hugging Face or from my Hugging Face repos: Xiongjie Dai.
For NVIDIA GPUs, this provides BLAS acceleration using the CUDA cores of your NVIDIA GPU:
! make clean && LLAMA_CUBLAS=1 make -j
For Apple Silicon, Metal is enabled by default:
! make clean && make -j
Use the argument -ngl 0 to run inference on the CPU only, and -ngl 10000 to ensure all layers are offloaded to the GPU.
! ./main -ngl 10000 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf --color --temp 1.1 --repeat_penalty 1.1 -c 0 -n 1024 -e -s 0 -p """\
First Citizen:\n\n\
Before we proceed any further, hear me speak.\n\n\
\n\n\
All:\n\n\
Speak, speak.\n\n\
\n\n\
First Citizen:\n\n\
You are all resolved rather to die than to famish?\n\n\
\n\n\
All:\n\n\
Resolved. resolved.\n\n\
\n\n\
First Citizen:\n\n\
First, you know Caius Marcius is chief enemy to the people.\n\n\
\n\n\
All:\n\n\
We know't, we know't.\n\n\
\n\n\
First Citizen:\n\n\
Let us kill him, and we'll have corn at our own price. Is't a verdict?\n\n\
\n\n\
All:\n\n\
No more talking on't; let it be done: away, away!\n\n\
\n\n\
Second Citizen:\n\n\
One word, good citizens.\n\n\
\n\n\
First Citizen:\n\n\
We are accounted poor citizens, the patricians good. What authority surfeits on would relieve us: if they would yield us but the superfluity, \
while it were wholesome, we might guess they relieved us humanely; but they think we are too dear: the leanness that afflicts us, the object of \
our misery, is as an inventory to particularise their abundance; our sufferance is a gain to them Let us revenge this with our pikes, \
ere we become rakes: for the gods know I speak this in hunger for bread, not in thirst for revenge.\n\n\
\n\n\
"""
Note: For Apple Silicon, check recommendedMaxWorkingSetSize in the output to see how much memory can be allocated on the GPU while maintaining performance. Currently only about 70% of unified memory can be allocated to the GPU on a 32GB M1 Max, and we expect around 78% of memory to be usable by the GPU on machines with more memory. (Source: https://developer.apple.com/videos/play/tech-talks/10580/?time=346) To use all of the memory, pass -ngl 0 to run inference on the CPU only. (Thanks to: ggerganov/llama.cpp#1826)
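As a quick sanity check of the note above, the sketch below turns the 70% and 78% figures into a GPU memory budget and compares it against model file sizes. It is a rough illustration only. The helper names are hypothetical, it counts model weights alone (no KV cache or other buffers), and it assumes the 78% figure applies to both 64GB and 192GB machines.

```python
def gpu_budget_gb(unified_memory_gb: float, gpu_fraction: float) -> float:
    """Memory the GPU may use, given the fraction of unified memory it can claim."""
    if not 0.0 < gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in (0, 1]")
    return unified_memory_gb * gpu_fraction


def fits_on_gpu(model_size_gb: float, unified_memory_gb: float, gpu_fraction: float) -> bool:
    """True if the model weights alone fit in the GPU budget (KV cache ignored)."""
    return model_size_gb <= gpu_budget_gb(unified_memory_gb, gpu_fraction)


# Fractions from the note above: about 70% on a 32GB machine, about 78% on larger ones.
print(f"32GB machine:  {gpu_budget_gb(32, 0.70):.2f} GB for the GPU")
print(f"64GB machine:  {gpu_budget_gb(64, 0.78):.2f} GB for the GPU")

# Model sizes are the LLaMA 3 file sizes listed in this document.
print(fits_on_gpu(39.59, 64, 0.78))    # 70B Q4_K_M on a 64GB machine
print(fits_on_gpu(131.42, 64, 0.78))   # 70B F16 on a 64GB machine
print(fits_on_gpu(131.42, 192, 0.78))  # 70B F16 on a 192GB machine
```

The three checks agree with the benchmark results in this document: 70B Q4_K_M runs on the 64GB machines, while 70B F16 runs only on the 192GB M2 Ultra.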
! ./main -ngl 10000 -m ./models/8B-v3-instruct/ggml-model-Q4_K_M.gguf --color -c 0 -n -2 -e -s 0 --mirostat 2 -i --no-display-prompt --keep -1 \
-r '<|eot_id|>' -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
--in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
! ./llama-bench -p 512,1024,4096,8192 -n 512,1024,4096,8192 -m ./models/8B-v3/ggml-model-Q4_K_M.gguf
Model | Quantized size (Q4_K_M) | Original size (f16) |
---|---|---|
8B | 4.58GB | 14.96GB |
70B | 39.59GB | 131.42GB |
You can estimate the VRAM requirement using this tool: LLM RAM Calculator
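The size table also gives a feel for how aggressive Q4_K_M is. Assuming file size scales with the average number of bits stored per weight, and that the f16 file uses 16 bits per weight, the ratio of the two sizes gives an effective bits-per-weight figure. This is a back-of-the-envelope sketch, not how llama.cpp reports it, and the function name is hypothetical.

```python
F16_BITS = 16  # bits per weight in the f16 file


def effective_bits_per_weight(quantized_gb: float, f16_gb: float) -> float:
    """Average bits per weight implied by the quantized/f16 size ratio.

    Assumes size is proportional to bits per weight and ignores file metadata.
    """
    if f16_gb <= 0:
        raise ValueError("f16_gb must be positive")
    return F16_BITS * quantized_gb / f16_gb


# Sizes copied from the table above.
for name, q4_gb, f16_gb in [("8B", 4.58, 14.96), ("70B", 39.59, 131.42)]:
    bpw = effective_bits_per_weight(q4_gb, f16_gb)
    print(f"{name}: about {bpw:.2f} bits per weight")
```

Both models come out a little under 5 bits per weight, so a Q4_K_M file needs roughly 30% of the memory of the f16 original.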
Lower perplexity is better. (credit to: dranger003)
Quantization | Size (GiB) | Perplexity (wiki.test) | Delta (FP16) |
---|---|---|---|
IQ1_S | 14.29 | 9.8655 +/- 0.0625 | 248.51% |
IQ1_M | 15.60 | 8.5193 +/- 0.0530 | 201.94% |
IQ2_XXS | 17.79 | 6.6705 +/- 0.0405 | 135.64% |
IQ2_XS | 19.69 | 5.7486 +/- 0.0345 | 103.07% |
IQ2_S | 20.71 | 5.5215 +/- 0.0318 | 95.05% |
Q2_K_S | 22.79 | 5.4334 +/- 0.0325 | 91.94% |
IQ2_M | 22.46 | 4.8959 +/- 0.0276 | 72.35% |
Q2_K | 24.56 | 4.7763 +/- 0.0274 | 68.73% |
IQ3_XXS | 25.58 | 3.9671 +/- 0.0211 | 40.14% |
IQ3_XS | 27.29 | 3.7210 +/- 0.0191 | 31.45% |
Q3_K_S | 28.79 | 3.6502 +/- 0.0192 | 28.95% |
IQ3_S | 28.79 | 3.4698 +/- 0.0174 | 22.57% |
IQ3_M | 29.74 | 3.4402 +/- 0.0171 | 21.53% |
Q3_K_M | 31.91 | 3.3617 +/- 0.0172 | 18.75% |
Q3_K_L | 34.59 | 3.3016 +/- 0.0168 | 16.63% |
IQ4_XS | 35.30 | 3.0310 +/- 0.0149 | 7.07% |
IQ4_NL | 37.30 | 3.0261 +/- 0.0149 | 6.90% |
Q4_K_S | 37.58 | 3.0050 +/- 0.0148 | 6.15% |
Q4_K_M | 39.60 | 2.9674 +/- 0.0146 | 4.83% |
Q5_K_S | 45.32 | 2.8843 +/- 0.0141 | 1.89% |
Q5_K_M | 46.52 | 2.8656 +/- 0.0139 | 1.23% |
Q6_K | 53.91 | 2.8441 +/- 0.0138 | 0.47% |
Q8_0 | 69.83 | 2.8316 +/- 0.0138 | 0.03% |
F16 | 131.43 | 2.8308 +/- 0.0138 | 0.00% |
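The Delta column appears to be the relative increase in perplexity over the F16 baseline. The source does not state the formula, so the sketch below is an assumption, but it reproduces the listed percentages for the rows checked. The function name is hypothetical.

```python
def perplexity_delta_pct(ppl: float, ppl_f16: float) -> float:
    """Relative perplexity increase over the F16 baseline, in percent."""
    if ppl_f16 <= 0:
        raise ValueError("baseline perplexity must be positive")
    return (ppl / ppl_f16 - 1.0) * 100.0


PPL_F16 = 2.8308  # F16 row of the table above

# A few rows copied from the table above.
for name, ppl in [("IQ1_S", 9.8655), ("Q4_K_M", 2.9674), ("Q8_0", 2.8316)]:
    print(f"{name}: {perplexity_delta_pct(ppl, PPL_F16):.2f}%")
```

Read this way, Q4_K_M costs under 5% in perplexity while cutting the file to roughly 30% of the F16 size, which is why it is the quantization used throughout the benchmarks.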
TG means "text generation" and PP means "prompt processing". The number after each (#) is the total number of tokens generated or processed. OOM means out of memory. All values are average speed in tokens/s.
GPU | Model | tg 512 | tg 1024 | tg 4096 | tg 8192 | pp 512 | pp 1024 | pp 4096 | pp 8192 |
---|---|---|---|---|---|---|---|---|---|
3070 8GB | 8B Q4_K_M | 72.79 | 70.94 | 67.01 | 61.64 | 2402.51 | 2283.62 | 1826.59 | 1419.97 |
3080 10GB | 8B Q4_K_M | 109.57 | 106.40 | 98.67 | 89.90 | 3728.86 | 3557.02 | 2852.06 | 2232.21 |
3080Ti 12GB | 8B Q4_K_M | 110.60 | 106.71 | 98.34 | 88.63 | 3690.30 | 3556.67 | 2947.11 | 2381.52 |
4070Ti 12GB | 8B Q4_K_M | 83.50 | 82.21 | 78.59 | 73.46 | 3936.29 | 3653.07 | 2729.71 | 2019.71 |
4080 16GB | 8B Q4_K_M | 108.15 | 106.22 | 100.44 | 93.71 | 5389.74 | 5064.99 | 3790.96 | 2882.03 |
4080 16GB | 8B F16 | 40.58 | 40.29 | 39.44 | OOM | 7246.97 | 6758.90 | 4720.22 | OOM |
3090 24GB | 8B Q4_K_M | 115.42 | 111.74 | 97.31 | 87.49 | 4030.40 | 3865.39 | 3169.91 | 2527.40 |
3090 24GB | 8B F16 | 47.40 | 46.51 | 44.79 | 42.62 | 4444.65 | 4239.64 | 3410.47 | 2667.14 |
4090 24GB | 8B Q4_K_M | 130.58 | 127.74 | 119.44 | 110.66 | 7138.99 | 6898.71 | 5265.68 | 4039.68 |
4090 24GB | 8B F16 | 54.84 | 54.34 | 52.63 | 50.88 | 9382.00 | 9056.26 | 6531.36 | 4744.18 |
3090 24GB * 2 | 8B Q4_K_M | 111.67 | 108.07 | 99.60 | 90.77 | 3336.37 | 4004.14 | 4013.34 | 3433.59 |
3090 24GB * 2 | 8B F16 | 47.72 | 47.15 | 45.56 | 43.61 | 4122.66 | 4690.50 | 4788.60 | 3851.37 |
3090 24GB * 2 | 70B Q4_K_M | 16.57 | 16.29 | 15.36 | 14.34 | 357.32 | 393.89 | 379.52 | 338.82 |
4090 24GB * 2 | 8B Q4_K_M | 124.65 | 122.56 | 114.32 | 106.18 | 7003.51 | 8545.00 | 8422.04 | 6895.68 |
4090 24GB * 2 | 8B F16 | 53.64 | 53.27 | 51.64 | 49.83 | 9177.92 | 11094.51 | 10329.29 | 8067.29 |
4090 24GB * 2 | 70B Q4_K_M | 19.22 | 19.06 | 18.54 | 17.92 | 839.43 | 905.38 | 846.38 | 723.24 |
3090 24GB * 4 | 8B Q4_K_M | 108.66 | 104.94 | 97.09 | 88.35 | 3742.66 | 4653.93 | 5826.91 | 4913.40 |
3090 24GB * 4 | 8B F16 | 47.07 | 46.40 | 44.76 | 42.81 | 4608.40 | 5713.41 | 6596.17 | 5361.52 |
3090 24GB * 4 | 70B Q4_K_M | 17.07 | 16.89 | 16.24 | 15.39 | 300.79 | 350.06 | 367.75 | 331.37 |
4090 24GB * 4 | 8B Q4_K_M | 120.32 | 117.61 | 110.52 | 103.13 | 6748.96 | 9609.29 | 12491.10 | 10993.75 |
4090 24GB * 4 | 8B F16 | 53.10 | 52.69 | 51.00 | 49.21 | 8750.57 | 12304.19 | 15143.84 | 12919.74 |
4090 24GB * 4 | 70B Q4_K_M | 19.80 | 18.83 | 18.35 | 17.66 | 834.74 | 898.17 | 839.97 | 718.01 |
3090 24GB * 6 | 8B Q4_K_M | 104.17 | 101.07 | 94.06 | 85.93 | 3359.99 | 5153.05 | 7690.65 | 7084.44 |
3090 24GB * 6 | 8B F16 | 46.23 | 45.55 | 43.99 | 42.15 | 3875.97 | 5952.55 | 9437.91 | 8780.49 |
3090 24GB * 6 | 70B Q4_K_M | 17.09 | 16.93 | 16.32 | 15.45 | 456.95 | 739.40 | 786.79 | 695.44 |
3090 24GB * 6 | 70B F16 | 5.85 | 5.82 | 5.76 | 5.53 | 579.00 | 927.23 | 998.79 | 813.99 |
4090 24GB * 8 | 8B Q4_K_M | 118.09 | 116.13 | 108.37 | 100.95 | 6172.06 | 9706.82 | 15089.45 | 13802.08 |
4090 24GB * 8 | 8B F16 | 52.51 | 52.12 | 50.39 | 48.72 | 7889.26 | 11818.92 | 16462.18 | 14300.98 |
4090 24GB * 8 | 70B Q4_K_M | 18.94 | 18.76 | 18.23 | 17.57 | 812.95 | 1336.26 | 1488.36 | 1320.36 |
4090 24GB * 8 | 70B F16 | 6.47 | 6.45 | 6.39 | 6.31 | 1183.87 | 1890.48 | 2311.43 | 1995.85 |
GPU | Model | tg 512 | tg 1024 | tg 4096 | tg 8192 | pp 512 | pp 1024 | pp 4096 | pp 8192 |
---|---|---|---|---|---|---|---|---|---|
RTX 4000 Ada 20GB | 8B Q4_K_M | 59.15 | 58.59 | 55.94 | 52.39 | 2451.93 | 2310.53 | 1798.01 | 1337.15 |
RTX 4000 Ada 20GB | 8B F16 | 20.92 | 20.85 | 20.50 | 20.01 | 3121.67 | 2951.87 | 2200.58 | 1557.00 |
RTX 5000 Ada 32GB | 8B Q4_K_M | 91.39 | 89.87 | 85.01 | 80.00 | 4761.12 | 4467.46 | 3272.94 | 2422.33 |
RTX 5000 Ada 32GB | 8B F16 | 32.84 | 32.67 | 32.04 | 31.27 | 6160.57 | 5835.41 | 4008.30 | 2808.89 |
RTX A6000 48GB | 8B Q4_K_M | 105.39 | 102.22 | 94.82 | 86.73 | 3780.55 | 3621.81 | 2917.23 | 2292.61 |
RTX A6000 48GB | 8B F16 | 40.71 | 40.25 | 39.14 | 37.73 | 4511.02 | 4315.18 | 3365.79 | 2566.46 |
RTX A6000 48GB | 70B Q4_K_M | 14.71 | 14.58 | 14.09 | 13.42 | 482.19 | 466.82 | 404.61 | 340.73 |
RTX 6000 Ada 48GB | 8B Q4_K_M | 133.44 | 130.99 | 120.74 | 111.57 | 5791.74 | 5560.94 | 4495.19 | 3542.57 |
RTX 6000 Ada 48GB | 8B F16 | 52.32 | 51.97 | 50.21 | 48.79 | 6663.13 | 6205.44 | 4969.46 | 3915.81 |
RTX 6000 Ada 48GB | 70B Q4_K_M | 18.52 | 18.36 | 17.80 | 16.97 | 565.98 | 547.03 | 481.59 | 419.76 |
A40 48GB | 8B Q4_K_M | 91.27 | 88.95 | 83.10 | 76.45 | 3324.98 | 3240.95 | 2586.50 | 2013.34 |
A40 48GB | 8B F16 | 34.26 | 33.95 | 33.06 | 31.93 | 4203.75 | 4043.05 | 3069.98 | 2295.02 |
A40 48GB | 70B Q4_K_M | 11.60 | 12.08 | 11.68 | 11.26 | 209.38 | 239.92 | 268.89 | 291.13 |
L40S 48GB | 8B Q4_K_M | 115.55 | 113.60 | 105.50 | 97.98 | 6035.24 | 5908.52 | 4335.18 | 3192.70 |
L40S 48GB | 8B F16 | 43.69 | 43.42 | 42.22 | 41.05 | 2253.93 | 2491.65 | 2887.70 | 3312.16 |
L40S 48GB | 70B Q4_K_M | 15.46 | 15.31 | 14.92 | 14.45 | 673.63 | 649.08 | 542.29 | 446.48 |
RTX 4000 Ada 20GB * 4 | 8B Q4_K_M | 56.64 | 56.14 | 53.58 | 50.19 | 2413.07 | 3369.24 | 4404.45 | 3733.15 |
RTX 4000 Ada 20GB * 4 | 8B F16 | 20.65 | 20.58 | 20.24 | 19.74 | 3220.21 | 4366.64 | 5366.39 | 4323.70 |
RTX 4000 Ada 20GB * 4 | 70B Q4_K_M | 7.36 | 7.33 | 7.12 | 6.84 | 282.28 | 306.44 | 290.70 | 243.45 |
A100 PCIe 80GB | 8B Q4_K_M | 140.62 | 138.31 | 127.22 | 117.60 | 5981.04 | 5800.48 | 4959.84 | 4083.37 |
A100 PCIe 80GB | 8B F16 | 54.84 | 54.56 | 53.02 | 51.24 | 7741.34 | 7504.24 | 6137.54 | 4849.11 |
A100 PCIe 80GB | 70B Q4_K_M | 22.31 | 22.11 | 20.93 | 19.53 | 744.12 | 726.65 | 653.20 | 573.95 |
A100 SXM 80GB | 8B Q4_K_M | 135.04 | 133.38 | 125.09 | 115.92 | 5947.64 | 5863.92 | 5121.60 | 4137.08 |
A100 SXM 80GB | 8B F16 | 53.49 | 53.18 | 52.03 | 50.52 | 603.76 | 681.47 | 866.13 | 1323.07 |
A100 SXM 80GB | 70B Q4_K_M | 24.61 | 24.33 | 22.91 | 21.32 | 817.58 | 796.81 | 714.07 | 625.66 |
H100 PCIe 80GB | 8B Q4_K_M | 145.55 | 144.49 | 136.06 | 126.83 | 8125.45 | 7760.16 | 6423.31 | 5185.03 |
H100 PCIe 80GB | 8B F16 | 68.03 | 67.79 | 65.97 | 63.55 | 10815.51 | 10342.63 | 8106.53 | 6191.45 |
H100 PCIe 80GB | 70B Q4_K_M | 25.03 | 25.01 | 23.82 | 22.39 | 1012.73 | 984.06 | 863.37 | 741.52 |
RTX 5000 Ada 32GB * 4 | 8B Q4_K_M | 84.07 | 82.73 | 78.45 | 74.11 | 4671.34 | 6530.78 | 8004.94 | 6790.82 |
RTX 5000 Ada 32GB * 4 | 8B F16 | 32.10 | 31.94 | 31.32 | 30.58 | 2427.96 | 2877.66 | 3836.89 | 5235.00 |
RTX 5000 Ada 32GB * 4 | 70B Q4_K_M | 11.51 | 11.45 | 11.24 | 10.94 | 502.37 | 541.54 | 504.23 | 424.29 |
RTX A6000 48GB * 4 | 8B Q4_K_M | 96.48 | 93.73 | 87.72 | 80.88 | 3712.99 | 5340.10 | 7126.45 | 6438.82 |
RTX A6000 48GB * 4 | 8B F16 | 39.34 | 38.87 | 37.81 | 36.51 | 4508.60 | 6448.85 | 8327.16 | 7298.18 |
RTX A6000 48GB * 4 | 70B Q4_K_M | 14.44 | 14.32 | 13.91 | 13.32 | 496.08 | 539.20 | 511.22 | 434.31 |
RTX A6000 48GB * 4 | 70B F16 | 4.76 | 4.74 | 4.70 | 4.63 | 510.31 | 792.23 | 751.37 | 748.06 |
RTX 6000 Ada 48GB * 4 | 8B Q4_K_M | 121.21 | 118.99 | 110.65 | 103.18 | 6640.86 | 9679.55 | 11734.85 | 10278.14 |
RTX 6000 Ada 48GB * 4 | 8B F16 | 50.61 | 50.25 | 48.69 | 47.18 | 8953.30 | 12637.94 | 13971.34 | 11702.36 |
RTX 6000 Ada 48GB * 4 | 70B Q4_K_M | 18.13 | 17.96 | 17.49 | 16.89 | 656.61 | 714.93 | 697.10 | 612.54 |
RTX 6000 Ada 48GB * 4 | 70B F16 | 6.08 | 6.06 | 6.01 | 5.94 | 864.12 | 1270.39 | 1363.75 | 1182.28 |
A40 48GB * 4 | 8B Q4_K_M | 85.91 | 83.79 | 78.56 | 72.70 | 3321.27 | 4841.98 | 6442.38 | 5742.84 |
A40 48GB * 4 | 8B F16 | 33.60 | 33.28 | 32.42 | 31.38 | 4144.88 | 5931.06 | 7544.92 | 6516.60 |
A40 48GB * 4 | 70B Q4_K_M | 11.99 | 11.91 | 11.60 | 11.17 | 236.86 | 263.36 | 300.57 | 312.31 |
A40 48GB * 4 | 70B F16 | 3.99 | 3.98 | 3.95 | 3.90 | 610.51 | 900.79 | 893.28 | 735.16 |
L40S 48GB * 4 | 8B Q4_K_M | 107.53 | 105.72 | 98.59 | 92.20 | 6125.69 | 9008.27 | 10566.97 | 9017.90 |
L40S 48GB * 4 | 8B F16 | 42.70 | 42.48 | 41.33 | 40.19 | 2211.45 | 2541.61 | 3093.33 | 4336.81 |
L40S 48GB * 4 | 70B Q4_K_M | 15.12 | 14.99 | 14.63 | 14.17 | 591.05 | 634.05 | 605.66 | 541.67 |
L40S 48GB * 4 | 70B F16 | 5.05 | 5.03 | 4.99 | 4.94 | 1042.13 | 1478.83 | 1427.77 | 1150.63 |
A100 PCIe 80GB * 4 | 8B Q4_K_M | 119.28 | 117.30 | 110.75 | 103.87 | 6076.58 | 8889.35 | 12724.54 | 11803.39 |
A100 PCIe 80GB * 4 | 8B F16 | 51.63 | 51.54 | 50.20 | 48.73 | 8088.79 | 11670.74 | 16025.11 | 14269.17 |
A100 PCIe 80GB * 4 | 70B Q4_K_M | 22.91 | 22.68 | 21.41 | 19.96 | 771.28 | 978.06 | 1138.60 | 1043.15 |
A100 PCIe 80GB * 4 | 70B F16 | 7.40 | 7.38 | 7.23 | 7.06 | 1172.14 | 1733.41 | 1846.36 | 1592.37 |
A100 SXM 80GB * 4 | 8B Q4_K_M | 99.73 | 97.70 | 92.09 | 86.27 | 4850.88 | 7782.25 | 12242.53 | 11535.66 |
A100 SXM 80GB * 4 | 8B F16 | 45.53 | 45.45 | 44.33 | 43.09 | 626.75 | 674.11 | 1003.37 | 1612.05 |
A100 SXM 80GB * 4 | 70B Q4_K_M | 19.87 | 19.60 | 18.48 | 17.19 | 468.86 | 539.08 | 712.08 | 802.23 |
A100 SXM 80GB * 4 | 70B F16 | 6.95 | 6.92 | 6.77 | 6.58 | 1233.31 | 1834.16 | 1972.48 | 1699.56 |
H100 PCIe 80GB * 4 | 8B Q4_K_M | 123.08 | 118.14 | 113.12 | 110.34 | 8054.58 | 11560.23 | 16128.27 | 14682.97 |
H100 PCIe 80GB * 4 | 8B F16 | 64.00 | 62.90 | 61.45 | 59.72 | 11107.40 | 15612.81 | 20561.03 | 17762.96 |
H100 PCIe 80GB * 4 | 70B Q4_K_M | 26.40 | 26.20 | 24.60 | 23.68 | 1048.29 | 1133.23 | 1088.99 | 950.92 |
H100 PCIe 80GB * 4 | 70B F16 | 9.67 | 9.63 | 9.46 | 9.23 | 1681.45 | 2420.10 | 2437.53 | 2031.77 |
GPU | Model | tg 512 | tg 1024 | tg 4096 | tg 8192 | pp 512 | pp 1024 | pp 4096 | pp 8192 |
---|---|---|---|---|---|---|---|---|---|
M1 7-Core GPU 8GB | 8B Q4_K_M | 10.20 | 9.72 | 11.77 | OOM | 94.48 | 87.26 | 96.53 | OOM |
M1 Max 32-Core GPU 64GB | 8B Q4_K_M | 35.73 | 34.49 | 31.18 | 26.84 | 408.23 | 355.45 | 329.84 | 302.92 |
M1 Max 32-Core GPU 64GB | 8B F16 | 18.75 | 18.43 | 16.33 | 15.03 | 517.34 | 418.77 | 374.09 | 351.46 |
M1 Max 32-Core GPU 64GB | 70B Q4_K_M | 4.34 | 4.09 | 4.09 | 3.71 | 34.96 | 33.01 | 32.64 | 30.97 |
M2 Ultra 76-Core GPU 192GB | 8B Q4_K_M | 78.81 | 76.28 | 64.58 | 54.13 | 994.04 | 1023.89 | 979.47 | 913.55 |
M2 Ultra 76-Core GPU 192GB | 8B F16 | 36.90 | 36.25 | 33.67 | 30.68 | 1175.40 | 1202.74 | 1194.21 | 1103.44 |
M2 Ultra 76-Core GPU 192GB | 70B Q4_K_M | 12.48 | 12.13 | 10.75 | 9.34 | 118.79 | 117.76 | 109.53 | 108.57 |
M2 Ultra 76-Core GPU 192GB | 70B F16 | 4.76 | 4.71 | 4.48 | 4.23 | 147.58 | 145.82 | 133.75 | 135.15 |
M3 Max 40-Core GPU 64GB | 8B Q4_K_M | 48.97 | 50.74 | 44.21 | 36.12 | 693.32 | 678.04 | 573.09 | 505.32 |
M3 Max 40-Core GPU 64GB | 8B F16 | 22.04 | 22.39 | 20.72 | 18.74 | 769.84 | 751.49 | 609.97 | 515.15 |
M3 Max 40-Core GPU 64GB | 70B Q4_K_M | 7.65 | 7.53 | 6.58 | 5.60 | 70.19 | 62.88 | 64.90 | 61.96 |
Models of the same size and quantization deliver the same performance. Using multiple NVIDIA GPUs may slightly reduce text generation speed, but it can still increase prompt processing speed.
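The multi-GPU observation above can be checked against the benchmark tables by dividing a multi-GPU speed by the matching single-GPU speed. The sketch below does this for one configuration only, so it is an illustration rather than a general result. The function name is hypothetical.

```python
def scaling_factor(multi_gpu_speed: float, single_gpu_speed: float) -> float:
    """Speed ratio of a multi-GPU setup to one GPU (above 1.0 means faster)."""
    if single_gpu_speed <= 0:
        raise ValueError("single_gpu_speed must be positive")
    return multi_gpu_speed / single_gpu_speed


# LLaMA 3 8B Q4_K_M at 1024 tokens: 4090 24GB * 4 versus a single 4090 24GB.
# All four values are copied from the tables in this document.
tg_scaling = scaling_factor(117.61, 127.74)    # text generation
pp_scaling = scaling_factor(9609.29, 6898.71)  # prompt processing

print(f"text generation:   {tg_scaling:.2f}x")
print(f"prompt processing: {pp_scaling:.2f}x")
```

For this pair, text generation drops to about 0.92x of the single-GPU speed while prompt processing rises to about 1.39x.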
Buy NVIDIA gaming GPUs to save money. Buy professional GPUs for your business. Buy a Mac if you want a computer that sits on your desk, saves energy, stays quiet, needs no maintenance, and is more fun.
If you find this information helpful, please give me a star. Feel free to contact me if you have any advice. Thank you.