LLM Quantisations for Local Models
Old Quants
| Configuration | Block size (weights) | Quantised bits | Scale type | Bias type | Effective bits per weight | 
|---|---|---|---|---|---|
| q4_0 | 32 | 4 | 32-bit float | N/A | 5 bits | 
| q4_1 | 32 | 4 | 32-bit float | 32-bit float | 6 bits | 
| q4_2 | 16 | 4 | 16-bit float | N/A | 5 bits | 
| q4_3 | 16 | 4 | 16-bit float | 16-bit float | 6 bits | 
| q5_0 | 32 | 5 | 16-bit float | N/A | 5.5 bits | 
| q5_1 | 32 | 5 | 16-bit float | 16-bit float | 6 bits | 
| q8_0 | 32 | 8 | 32-bit float | N/A | 9 bits | 
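To make the table concrete, below is a minimal Python sketch of an absmax block quantiser in the spirit of q4_0 (one scale per block, no bias), plus the arithmetic behind the "Effective bits per weight" column. The function names are illustrative and the rounding rules are not the exact llama.cpp code; treat it as a sketch of the scheme, not the implementation.

```python
import numpy as np

def quantize_block_scale_only(block: np.ndarray):
    """Quantise one block of weights to signed 4-bit ints plus a single scale.

    Absmax sketch in the spirit of q4_0; llama.cpp's exact rounding differs.
    """
    amax = np.abs(block).max()
    scale = amax / 7.0 if amax > 0 else 1.0   # map largest magnitude to the 4-bit range edge
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, np.float32(scale)

def dequantize_block(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

def effective_bits_per_weight(bits: int, block_size: int,
                              scale_bits: int, bias_bits: int = 0) -> float:
    """Storage cost per weight = payload bits + amortised scale/bias bits."""
    return bits + (scale_bits + bias_bits) / block_size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(size=32).astype(np.float32)      # one 32-weight block
    q, scale = quantize_block_scale_only(block)
    err = np.abs(block - dequantize_block(q, scale)).max()
    print(f"max abs round-trip error: {err:.4f}")

    # Reproduce the "Effective bits per weight" column above.
    print(effective_bits_per_weight(4, 32, 32))         # q4_0 -> 5.0
    print(effective_bits_per_weight(4, 32, 32, 32))     # q4_1 -> 6.0
    print(effective_bits_per_weight(5, 32, 16))         # q5_0 -> 5.5
    print(effective_bits_per_weight(8, 32, 32))         # q8_0 -> 9.0
```

The last column of the table is just this amortisation: the per-weight payload bits plus the block's scale (and bias, if present) spread across the block size.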
“With 7B, use Q5_1. With 13B and above, Q4_2 is a great compromise between speed and quality if you don’t want to go with the slower, resource-heavier Q5_1. Q4_0 is only relevant for compatibility and should be avoided when possible.”
K-Quants
https://github.com/ggerganov/llama.cpp/pull/1684
| Model | Measure | F16 | Q2_K | Q3_K_M | Q4_K_S | Q5_K_S | Q6_K | 
|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.7764 | 6.1503 | 6.0215 | 5.9419 | 5.9110 | 
| 7B | file size | 13.0G | 2.67G | 3.06G | 3.56G | 4.33G | 5.15G | 
| 7B | ms/tok @ 4 threads, M2 Max | 116 | 56 | 69 | 50 | 70 | 75 | 
| 7B | ms/tok @ 8 threads, M2 Max | 111 | 36 | 36 | 36 | 44 | 51 | 
| 7B | ms/tok @ 4 threads, RTX-4080 | 60 | 15.5 | 17.0 | 15.5 | 16.7 | 18.3 | 
| 7B | ms/tok @ 4 threads, Ryzen | 214 | 57 | 61 | 68 | 81 | 93 | 
| 13B | perplexity | 5.2543 | 5.8545 | 5.4498 | 5.3404 | 5.2785 | 5.2568 | 
| 13B | file size | 25.0G | 5.13G | 5.88G | 6.80G | 8.36G | 9.95G | 
| 13B | ms/tok @ 4 threads, M2 Max | 216 | 103 | 148 | 95 | 132 | 142 | 
| 13B | ms/tok @ 8 threads, M2 Max | 213 | 67 | 77 | 68 | 81 | 95 | 
| 13B | ms/tok @ 4 threads, RTX-4080 | - | 25.3 | 29.3 | 26.2 | 28.6 | 30.0 | 
| 13B | ms/tok @ 4 threads, Ryzen | 414 | 109 | 118 | 130 | 156 | 180 | 
Q4_K_S looks like the best speed/perplexity trade-off in these numbers; see the comparison below.
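The trade-off is easiest to see when the 7B figures are put side by side. The short Python sketch below only reuses the numbers from the table above (nothing is re-measured) and prints, for each K-quant, the perplexity increase over F16 and the compression factor.

```python
# 7B figures copied from the K-quants table above (PR #1684); nothing here is measured.
F16_PPL, F16_SIZE_GB = 5.9066, 13.0

kquants = {
    "Q2_K":   (6.7764, 2.67),
    "Q3_K_M": (6.1503, 3.06),
    "Q4_K_S": (6.0215, 3.56),
    "Q5_K_S": (5.9419, 4.33),
    "Q6_K":   (5.9110, 5.15),
}

for name, (ppl, size_gb) in kquants.items():
    ppl_increase = 100.0 * (ppl - F16_PPL) / F16_PPL   # % worse than F16
    compression = F16_SIZE_GB / size_gb                # x smaller than F16
    print(f"{name:7s}  +{ppl_increase:5.2f}% perplexity  {compression:4.2f}x smaller")
```

By this arithmetic Q4_K_S costs roughly a 2% perplexity increase for about a 3.7x size reduction, which is why it reads as the sweet spot here; Q2_K shrinks the file further but at a much larger perplexity penalty.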