LLM Quantisations for Local Models
Old Quants
| Configuration | Block size (weights) | Quantised bits | Scale type | Bias type | Effective bits per weight | 
|---|---|---|---|---|---|
| q4_0 | 32 | 4 | 32-bit float | N/A | 5 bits | 
| q4_1 | 32 | 4 | 32-bit float | 32-bit float | 6 bits | 
| q4_2 | 16 | 4 | 16-bit float | N/A | 5 bits | 
| q4_3 | 16 | 4 | 16-bit float | 16-bit float | 6 bits | 
| q5_0 | 32 | 5 | 16-bit float | N/A | 5.5 bits | 
| q5_1 | 32 | 5 | 16-bit float | 16-bit float | 6 bits | 
| q8_0 | 32 | 8 | 32-bit float | N/A | 9 bits | 
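To make the table concrete, below is a minimal Python sketch of an absmax block quantiser in the spirit of q4_0 (one scale per block, no bias), plus the arithmetic behind the "Effective bits per weight" column. The function names are illustrative and the rounding rules are not the exact llama.cpp code; treat it as a sketch of the scheme, not the implementation.

```python
import numpy as np

def quantize_block_scale_only(block: np.ndarray):
    """Quantise one block of weights to signed 4-bit ints plus a single scale.

    Absmax sketch in the spirit of q4_0; llama.cpp's exact rounding differs.
    """
    amax = np.abs(block).max()
    scale = amax / 7.0 if amax > 0 else 1.0   # map largest magnitude to the 4-bit range edge
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, np.float32(scale)

def dequantize_block(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

def effective_bits_per_weight(bits: int, block_size: int,
                              scale_bits: int, bias_bits: int = 0) -> float:
    """Storage cost per weight = payload bits + amortised scale/bias bits."""
    return bits + (scale_bits + bias_bits) / block_size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(size=32).astype(np.float32)      # one 32-weight block
    q, scale = quantize_block_scale_only(block)
    err = np.abs(block - dequantize_block(q, scale)).max()
    print(f"max abs round-trip error: {err:.4f}")

    # Reproduce the "Effective bits per weight" column above.
    print(effective_bits_per_weight(4, 32, 32))         # q4_0 -> 5.0
    print(effective_bits_per_weight(4, 32, 32, 32))     # q4_1 -> 6.0
    print(effective_bits_per_weight(5, 32, 16))         # q5_0 -> 5.5
    print(effective_bits_per_weight(8, 32, 32))         # q8_0 -> 9.0
```

The last column of the table is just this amortisation: the per-weight payload bits plus the block's scale (and bias, if present) spread across the block size.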
“With 7B, use Q5_1. With 13B and above, Q4_2 is a great compromise between speed and quality if you don’t want to go with the slower, resource-heavier Q5_1. Q4_0 is only relevant for compatibility and should be avoided when possible.”
K-Quants
https://github.com/ggerganov/llama.cpp/pull/1684
| Model | Measure | F16 | Q2_K | Q3_K_M | Q4_K_S | Q5_K_S | Q6_K | 
|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.7764 | 6.1503 | 6.0215 | 5.9419 | 5.9110 | 
| 7B | file size | 13.0G | 2.67G | 3.06G | 3.56G | 4.33G | 5.15G | 
| 7B | ms/tok @ 4 threads, M2 Max | 116 | 56 | 69 | 50 | 70 | 75 | 
| 7B | ms/tok @ 8 threads, M2 Max | 111 | 36 | 36 | 36 | 44 | 51 | 
| 7B | ms/tok @ 4 threads, RTX-4080 | 60 | 15.5 | 17.0 | 15.5 | 16.7 | 18.3 | 
| 7B | ms/tok @ 4 threads, Ryzen | 214 | 57 | 61 | 68 | 81 | 93 | 
| 13B | perplexity | 5.2543 | 5.8545 | 5.4498 | 5.3404 | 5.2785 | 5.2568 | 
| 13B | file size | 25.0G | 5.13G | 5.88G | 6.80G | 8.36G | 9.95G | 
| 13B | ms/tok @ 4 threads, M2 Max | 216 | 103 | 148 | 95 | 132 | 142 | 
| 13B | ms/tok @ 8 threads, M2 Max | 213 | 67 | 77 | 68 | 81 | 95 | 
| 13B | ms/tok @ 4 threads, RTX-4080 | - | 25.3 | 29.3 | 26.2 | 28.6 | 30.0 | 
| 13B | ms/tok @ 4 threads, Ryzen | 414 | 109 | 118 | 130 | 156 | 180 | 
Q4_K_S looks like the best speed/perplexity trade-off in these numbers; see the comparison below.
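The trade-off is easiest to see when the 7B figures are put side by side. The short Python sketch below only reuses the numbers from the table above (nothing is re-measured) and prints, for each K-quant, the perplexity increase over F16 and the compression factor.

```python
# 7B figures copied from the K-quants table above (PR #1684); nothing here is measured.
F16_PPL, F16_SIZE_GB = 5.9066, 13.0

kquants = {
    "Q2_K":   (6.7764, 2.67),
    "Q3_K_M": (6.1503, 3.06),
    "Q4_K_S": (6.0215, 3.56),
    "Q5_K_S": (5.9419, 4.33),
    "Q6_K":   (5.9110, 5.15),
}

for name, (ppl, size_gb) in kquants.items():
    ppl_increase = 100.0 * (ppl - F16_PPL) / F16_PPL   # % worse than F16
    compression = F16_SIZE_GB / size_gb                # x smaller than F16
    print(f"{name:7s}  +{ppl_increase:5.2f}% perplexity  {compression:4.2f}x smaller")
```

By this arithmetic Q4_K_S costs roughly a 2% perplexity increase for about a 3.7x size reduction, which is why it reads as the sweet spot here; Q2_K shrinks the file further but at a much larger perplexity penalty.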