Running a 70-billion parameter language model requires 140 GB of memory in 16-bit precision. That's more than most consumer GPUs can handle, and even enterprise deployments struggle with the costs. Quantization—the art of reducing numerical precision while preserving model quality—has become essential for making large language models practical. Recent breakthroughs have pushed the boundaries from 8-bit to 4-bit, and now even 2-bit precision, achieving compression ratios that seemed impossible just two years ago.
The Fundamentals of Quantization
What Is Quantization?
Quantization maps high-precision floating-point numbers to lower-precision representations. In the context of neural networks, this typically means converting 32-bit or 16-bit weights to 8-bit, 4-bit, or even lower bit-widths.
The core quantization formula maps a weight to an integer code and back:

$$w_q = \operatorname{round}\!\left(\frac{w}{s}\right) + z, \qquad \hat{w} = s \,(w_q - z)$$

Where:
- $w$ is the original weight value (and $w_q$ its integer code)
- $s$ is the scale factor (determines the step size between quantized values)
- $z$ is the zero-point (offset for asymmetric quantization)
- $\hat{w}$ is the quantized-then-dequantized value
Symmetric vs. Asymmetric Quantization
Symmetric quantization assumes weights are centered around zero:
Symmetric Quantization (8-bit):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-127 ────────────── 0 ────────────── +127
◀────────────▶◀────────────▶
negative positive
weights weights
Scale: s = max(|w|) / 127
Zero-point: z = 0
Asymmetric quantization allows the zero-point to shift, better capturing distributions that aren't centered at zero:
Asymmetric Quantization (8-bit):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0 ──────────────────────────────── 255
│◀──────── full range ──────────▶│
↓ ↓
w_min w_max
Scale: s = (w_max - w_min) / 255
Zero-point: z = -round(w_min / s)   (so that w_min maps to code 0)
| Method | Pros | Cons |
|---|---|---|
| Symmetric | Simpler computation, faster inference | Wastes precision if distribution is skewed |
| Asymmetric | Better utilizes full quantization range | Requires storing zero-point, slightly slower |
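To make these formulas concrete, here is a minimal NumPy sketch of 8-bit quantization in both modes, following the scale and zero-point definitions above. It is an illustration rather than any particular library's API.

```python
import numpy as np

def quantize_symmetric_int8(w):
    # Largest magnitude maps to +/-127; zero-point is fixed at 0.
    s = np.abs(w).max() / 127
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s

def quantize_asymmetric_uint8(w):
    # Scale spreads [w_min, w_max] over 256 levels; the zero-point shifts
    # the grid so that w_min maps to quantized code 0.
    s = (w.max() - w.min()) / 255
    z = np.round(-w.min() / s)
    q = np.clip(np.round(w / s) + z, 0, 255).astype(np.uint8)
    return q, s, z

def dequantize(q, s, z=0):
    return s * (q.astype(np.float32) - z)

w = 0.1 * np.random.randn(1024).astype(np.float32)
q_s, s_s = quantize_symmetric_int8(w)
q_a, s_a, z_a = quantize_asymmetric_uint8(w)
print("symmetric  MSE:", np.mean((w - dequantize(q_s, s_s)) ** 2))
print("asymmetric MSE:", np.mean((w - dequantize(q_a, s_a, z_a)) ** 2))
```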
Quantization Approaches: PTQ vs. QAT
Post-Training Quantization (PTQ)
PTQ quantizes a pre-trained model without additional training. It's fast and convenient but can suffer accuracy loss, especially at low bit-widths.
Post-Training Quantization Pipeline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────┐
│ Pre-trained │ Full precision (FP16/FP32)
│ Model │
└──────┬──────┘
│
▼
┌─────────────┐
│ Calibration │ Run small dataset to determine
│ Dataset │ weight/activation ranges
└──────┬──────┘
│
▼
┌─────────────┐
│ Compute │ Calculate scale (s) and
│ Scales & │ zero-point (z) per layer
│ Zero-points │
└──────┬──────┘
│
▼
┌─────────────┐
│ Quantized │ INT8/INT4 weights
│ Model │
└─────────────┘
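In its simplest form, the calibration step just tracks running min/max statistics over a handful of batches and derives the scale and zero-point from them. The sketch below assumes per-tensor asymmetric INT8 and a hypothetical `calibration_batches` iterable; production toolchains typically use percentile or MSE-based range selection rather than raw min/max.

```python
import numpy as np

def calibrate_range(calibration_batches):
    # Track the running min/max of a tensor across calibration batches.
    lo, hi = np.inf, -np.inf
    for batch in calibration_batches:
        lo = min(lo, float(batch.min()))
        hi = max(hi, float(batch.max()))
    return lo, hi

def scale_and_zero_point(lo, hi, n_levels=256):
    # Asymmetric parameters covering [lo, hi] with n_levels integer codes.
    s = (hi - lo) / (n_levels - 1)
    z = int(round(-lo / s))
    return s, z

# Hypothetical activations standing in for a real calibration dataset.
calibration_batches = [np.random.randn(32, 4096) for _ in range(8)]
s, z = scale_and_zero_point(*calibrate_range(calibration_batches))
print(f"scale = {s:.5f}, zero_point = {z}")
```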
Quantization-Aware Training (QAT)
QAT simulates quantization during training, allowing the model to adapt to reduced precision. It achieves better accuracy but requires significant computational resources.
Quantization-Aware Training:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Forward Pass:
w_float ──▶ Quantize ──▶ w_quant ──▶ Dequantize ──▶ w_fake
│ │
└─── "Fake quantization" ───┘
(differentiable)
Backward Pass:
Uses Straight-Through Estimator (STE):
∂L/∂w ≈ ∂L/∂w_fake (gradient passes through)
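In code, fake quantization plus the STE is often implemented with a `detach()` trick: the rounded weights are used in the forward pass, while the gradient is routed through the original weights unchanged. A minimal PyTorch-style sketch, not tied to any specific QAT framework:

```python
import torch

def fake_quantize(w, n_bits=4):
    # Symmetric uniform fake quantization: quantize-then-dequantize forward,
    # identity backward (straight-through estimator).
    qmax = 2 ** (n_bits - 1) - 1
    s = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / s), -qmax, qmax) * s
    # round() has zero gradient, so re-route the gradient through w itself.
    return w + (w_q - w).detach()

w = torch.randn(256, 256, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()
print(w.grad is not None)  # True: gradients pass through as if unquantized
```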
| Approach | Training Cost | Accuracy at 4-bit | Use Case |
|---|---|---|---|
| PTQ | None | Good (with calibration) | Quick deployment |
| QAT | High (full retraining) | Excellent | Production systems |
Quantization Error Analysis
Understanding quantization error is crucial for designing effective compression schemes. The mean squared error (MSE) from quantization is:

$$\mathrm{MSE} = \mathbb{E}\!\left[(w - \hat{w})^2\right]$$
For uniform quantization with step size $\Delta$, the expected error is approximately:

$$\mathrm{MSE} \approx \frac{\Delta^2}{12}$$

This assumes the quantization error is uniformly distributed over $[-\Delta/2, \Delta/2]$.
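The $\Delta^2/12$ figure is easy to check numerically by rounding random values to a grid with step $\Delta$ and comparing the empirical error to the prediction (a quick sanity check, not taken from any of the cited papers):

```python
import numpy as np

delta = 0.05                          # quantization step size
w = np.random.uniform(-1, 1, 1_000_000)
w_hat = delta * np.round(w / delta)   # uniform quantization with step delta
print("empirical MSE:", np.mean((w - w_hat) ** 2))
print("delta^2 / 12: ", delta ** 2 / 12)   # ~2.08e-4, closely matching
```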
The Outlier Problem
LLM weights often contain outliers—values significantly larger than the typical range. These outliers devastate naive quantization:
Weight Distribution with Outliers:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Count
│
██│
██│ 99% of weights
██│ ┌────────────┐
██│ │ │
██████████████████ │ │ Outliers (0.1%)
──────────────────────────────────────▶
-0.5 0 0.5 2.0 Value
Problem: If scale is set by outliers (max = 2.0),
most weights get very few quantization levels.
Solutions to the outlier problem include:
- Per-channel quantization: Different scales for different output channels (see the sketch after this list)
- Mixed-precision: Keep outlier channels in higher precision
- Outlier-aware methods: Explicitly handle outliers (e.g., SmoothQuant, AWQ)
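Per-channel quantization, in particular, is nearly a one-line change from per-tensor quantization: compute one scale per output channel (row of the weight matrix), so an outlier only degrades the resolution of its own channel. A minimal sketch:

```python
import numpy as np

def quantize_per_channel_int8(W):
    # One symmetric scale per output channel (row), so a single outlier
    # no longer dictates the scale for the entire matrix.
    s = np.abs(W).max(axis=1, keepdims=True) / 127   # shape: (out_channels, 1)
    q = np.clip(np.round(W / s), -127, 127).astype(np.int8)
    return q, s

W = 0.02 * np.random.randn(4096, 4096).astype(np.float32)
W[0, 0] = 2.0                          # a planted outlier in one channel
q, s = quantize_per_channel_int8(W)
W_hat = q.astype(np.float32) * s
print("per-channel MSE:", np.mean((W - W_hat) ** 2))
```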
QuIP#: State-of-the-Art Extreme Compression
The QuIP# paper from Cornell (2024) achieved a breakthrough in extreme quantization, enabling 2-bit precision with minimal accuracy loss. With 215 citations in under two years, it has become a foundational technique for LLM compression.
The Key Insight: Incoherence
QuIP#'s central innovation is using Hadamard transforms to make weight matrices incoherent before quantization. Incoherence means the weights are "spread out" uniformly, without concentrated outliers.
Incoherence via Hadamard Transform:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before (Coherent): After (Incoherent):
┌─────────────┐ ┌─────────────┐
│ ■ │ │ · · · · · · │
│ ■ │ H× │ · · · · · · │
│ ■ │ ───▶ │ · · · · · · │
│ ■ │ │ · · · · · · │
│ ■ │ │ · · · · · · │
└─────────────┘ └─────────────┘
Outliers concentrated Energy spread uniformly
on specific entries across all entries
The Hadamard Transform
The Hadamard matrix $H_n$ is a square matrix with entries $\pm 1$ satisfying:

$$H_n H_n^{\top} = n I_n$$

For dimension $n = 2^k$, it's constructed recursively:

$$H_1 = \begin{bmatrix} 1 \end{bmatrix}, \qquad H_{2n} = \begin{bmatrix} H_n & H_n \\ H_n & -H_n \end{bmatrix}$$

QuIP# applies Hadamard transforms to both rows and columns of weight matrices:

$$\tilde{W} = \frac{1}{\sqrt{mn}}\, H_m W H_n^{\top}$$

Because the normalized Hadamard matrix $\tfrac{1}{\sqrt{n}} H_n$ is orthogonal, the transformation is lossless (can be perfectly reversed) and can be fused into adjacent layers with minimal overhead.
"The Hadamard transform spreads the energy of outlier weights across all entries, creating a more uniform distribution that is far easier to quantize accurately." — Tseng et al., QuIP#, 2024
Lattice Codebooks
The second innovation in QuIP# is using lattice codebooks instead of uniform quantization grids. A lattice is a regular, repeating arrangement of points in space; a well-chosen lattice quantizes incoherent weight vectors far more efficiently than a uniform grid.
Uniform Grid vs. Lattice Codebook:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Uniform 2D Grid: E8 Lattice (8D):
• • • • Optimal sphere packing
• • • • in 8 dimensions
• • • •
• • • • ┌─────────────────────┐
│ Quantization points │
Wasted space │ arranged to minimize│
between points │ average distance to │
│ any weight vector │
└─────────────────────┘
QuIP# uses the E8 lattice for 2-bit quantization. The E8 lattice achieves the densest sphere packing in 8 dimensions, which makes it an exceptionally efficient quantizer: quantized weight vectors land, on average, closer to their original values than with a uniform grid of the same bit budget.
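For intuition, here is a sketch of the classical nearest-point search in E8, the basic rounding step a lattice codebook relies on. E8 can be written as the union of the D8 lattice (integer vectors with an even coordinate sum) and a copy of D8 shifted by one half in every coordinate. QuIP# builds finite 2-bit codebooks from E8 points and adds further machinery (including the incoherence processing above); none of that is shown here.

```python
import numpy as np

def nearest_D8(x):
    # Round each coordinate; D8 additionally requires an even coordinate sum.
    f = np.round(x)
    if int(f.sum()) % 2 == 0:
        return f
    # Parity is odd: flip the coordinate with the largest rounding error
    # to its second-nearest integer.
    err = x - f
    i = int(np.argmax(np.abs(err)))
    g = f.copy()
    g[i] += 1.0 if err[i] >= 0 else -1.0
    return g

def nearest_E8(x):
    # E8 = D8  union  (D8 + (1/2, ..., 1/2)); pick the closer candidate.
    half = np.full(8, 0.5)
    c0 = nearest_D8(x)
    c1 = nearest_D8(x - half) + half
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

x = np.random.randn(8)
print("input:           ", np.round(x, 3))
print("nearest E8 point:", nearest_E8(x))
```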
QuIP# Results
| Model | Method | Bits | Perplexity (WikiText-2) |
|---|---|---|---|
| Llama-2-70B | FP16 | 16 | 3.32 |
| Llama-2-70B | GPTQ | 4 | 3.85 |
| Llama-2-70B | QuIP | 4 | 3.48 |
| Llama-2-70B | QuIP# | 2 | 4.16 |
QuIP# achieves 2-bit quantization with perplexity within 25% of full precision—a remarkable result that was considered impossible just a few years ago. This translates to 8x compression, reducing a 140 GB model to under 18 GB.
KV Cache Compression with GEAR
While weight quantization reduces model storage, KV cache compression addresses a different bottleneck: memory usage during inference. The GEAR paper (2024), with 128 citations, presents an elegant solution.
The KV Cache Problem
During autoregressive generation, transformers cache the key and value tensors from all previous tokens to avoid recomputation. For long sequences, this cache dominates memory usage:
KV Cache Memory Growth:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Memory
│ ╱
│ ╱
│ ╱
│ ╱
│ KV Cache
│ ╱
│ ╱
│ ╱
│────────────────────────────────────── Model Weights
│ (constant)
└──────────────────────────────────────▶
Sequence Length
At 128K context (Llama-3):
- Model weights: ~140 GB (70B params × 2 bytes)
- KV cache: ~160 GB (grows with sequence × batch)
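The cache size itself is simple arithmetic: two tensors (K and V) per layer, times KV heads, head dimension, tokens, batch size, and bytes per element. The sketch below plugs in commonly reported Llama-3-70B-style shapes (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and a batch of 4; these numbers are assumptions for illustration, not taken from the figure above.

```python
def kv_cache_bytes(seq_len, batch, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # Factor of 2 covers both the key and the value tensor in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

gb = kv_cache_bytes(seq_len=128_000, batch=4) / 1e9
print(f"KV cache at 128K context, batch 4: ~{gb:.0f} GB")   # ~168 GB
```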
GEAR: Quantization + Low-Rank + Sparse Residuals
GEAR (Generative Efficient Attention caching with Residual approximation) combines three compression techniques:
GEAR Compression Pipeline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────────────────────────────────────────┐
│ Original KV Cache │
│ (FP16) │
└─────────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Uniform Quantization (INT4) │
│ Captures bulk of the information cheaply │
└─────────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Low-Rank Approximation (SVD) │
│ K̂ ≈ U × S × V^T (top-r singular values) │
│ Captures systematic quantization error │
└─────────────────────┬───────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Sparse Outlier Storage │
│ Store only top-k largest residual entries │
│ (handles remaining outliers) │
└─────────────────────────────────────────────────┘
The reconstruction is:

$$\hat{K} = \operatorname{Quant}_4(K) + L + S$$

Where:
- $\operatorname{Quant}_4(\cdot)$ is 4-bit uniform quantization
- $L$ is the low-rank correction (rank 2-4)
- $S$ is the sparse residual (top 1-2% of entries)
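A toy version of this reconstruction fits in a few lines of NumPy: quantize, correct the residual with a truncated SVD, then keep only the largest remaining entries as a sparse term. This is a schematic re-implementation of the idea (per-tensor INT4, plain SVD, magnitude top-k), not the authors' code, and it omits GEAR's grouping and streaming details.

```python
import numpy as np

def quant_dequant_int4(x):
    # Per-tensor asymmetric 4-bit quantization (16 levels) as the cheap base.
    s = (x.max() - x.min()) / 15
    z = np.round(-x.min() / s)
    q = np.clip(np.round(x / s) + z, 0, 15)
    return s * (q - z)

def gear_reconstruct(K, rank=4, sparse_frac=0.02):
    base = quant_dequant_int4(K)
    resid = K - base
    # Low-rank correction: top singular components of the quantization error.
    U, S, Vt = np.linalg.svd(resid, full_matrices=False)
    low_rank = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank]
    # Sparse residual: keep only the largest leftover entries.
    leftover = resid - low_rank
    k = max(1, int(sparse_frac * leftover.size))
    thresh = np.partition(np.abs(leftover).ravel(), -k)[-k]
    sparse = np.where(np.abs(leftover) >= thresh, leftover, 0.0)
    return base + low_rank + sparse

K = np.random.randn(1024, 128).astype(np.float32)   # hypothetical cache slice
K_hat = gear_reconstruct(K)
print("reconstruction MSE:", np.mean((K - K_hat) ** 2))
```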
GEAR Results
| Method | Bits | Memory Reduction | Accuracy Loss |
|---|---|---|---|
| FP16 (baseline) | 16 | 1x | 0% |
| Naive INT4 | 4 | 4x | 8-15% |
| KIVI | 2 | 8x | 3-5% |
| GEAR | ~3 | 5-6x | <1% |
"GEAR achieves near-lossless compression of the KV cache with 5-6x memory reduction, enabling longer context windows and larger batch sizes without quality degradation." — Kang et al., GEAR, 2024
Practical Quantization: Choosing the Right Approach
Decision Tree for Quantization Strategy
Quantization Strategy Selection:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌────────────────┐
│ What's your │
│ constraint? │
└───────┬────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Memory │ │ Accuracy │ │ Both │
│ (edge) │ │ critical │ │ │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ QuIP# │ │ QAT │ │ GPTQ/AWQ │
│ (2-bit) │ │ (8-bit) │ │ (4-bit) │
└──────────┘ └──────────┘ └──────────┘
For long-context inference: Add GEAR for KV cache
Compression Comparison
| Technique | Compression | Quality | Complexity | Best For |
|---|---|---|---|---|
| INT8 PTQ | 2x | Excellent | Low | Quick deployment |
| GPTQ 4-bit | 4x | Good | Medium | Consumer GPUs |
| AWQ 4-bit | 4x | Very Good | Medium | Production |
| QuIP# 2-bit | 8x | Good | High | Maximum compression |
| GEAR (KV) | 5-6x | Excellent | Medium | Long context |
The Mathematics of Optimal Quantization
Rate-Distortion Theory
The fundamental limit of quantization is described by rate-distortion theory. For a source $X$ with distribution $p(x)$ and distortion measure $d(x, \hat{x})$, the minimum achievable rate (bits) at distortion $D$ is:

$$R(D) = \min_{p(\hat{x} \mid x):\; \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})$$

For Gaussian sources with mean squared error distortion:

$$R(D) = \frac{1}{2} \log_2 \frac{\sigma^2}{D}, \qquad 0 \le D \le \sigma^2$$
This means each additional bit per value cuts the minimum achievable distortion by a factor of four, as the identity below makes explicit.
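Spelled out in distortion-rate form (a standard identity, stated here for reference rather than drawn from the papers cited above):

$$D(R) = \sigma^2 \, 2^{-2R} \qquad\Longrightarrow\qquad D(R+1) = \frac{D(R)}{4}$$

so one extra bit of rate buys roughly 6 dB less quantization noise.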
Why Lattices Are Optimal
Vector quantization outperforms scalar quantization because it exploits the geometry of higher dimensions. For $n$-dimensional quantization cells, the normalized second moment $G$ measures quantization efficiency:

$$G = \frac{1}{n\, V^{1 + 2/n}} \int_{\mathcal{V}} \|x\|^2 \, dx$$

Where $V$ is the volume of the quantization cell $\mathcal{V}$. The E8 lattice achieves:

$$G(E_8) \approx 0.0717$$

Compared to uniform scalar quantization:

$$G_{\text{scalar}} = \frac{1}{12} \approx 0.0833$$

This roughly 14% reduction in quantization error per dimension compounds significantly over millions of weights.
Future Directions
Emerging Techniques
- 1-bit quantization: Binary and ternary networks are approaching practical quality for specific tasks
- Dynamic quantization: Adjusting precision based on input difficulty or layer importance
- Quantization-aware architecture design: Models designed from scratch for efficient quantization
- Hardware co-design: Custom accelerators optimized for specific quantization formats
The Path to Sub-2-Bit
Theoretical Compression Limits:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bits/Parameter Technique Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
16 FP16 (baseline) Standard
8 INT8 PTQ Mature
4 GPTQ/AWQ Mainstream
2 QuIP# State-of-art
1.5 Ternary + Entropy Research
1 Binary Emerging
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Halving the bit-width = 2x compression
16-bit → 2-bit = 8x smaller = 70B model in ~18 GB
Conclusion
Quantization has transformed from a nice-to-have optimization into a critical enabling technology for large language models. The journey from 16-bit to 2-bit precision—an 8x compression—makes models like Llama-2-70B accessible on consumer hardware.
The key insights from recent research are:
- Incoherence matters: QuIP# showed that preprocessing weights with Hadamard transforms enables extreme quantization by eliminating outliers
- Optimal codebooks: Lattice-based quantization (E8 lattice) provides theoretically optimal compression for incoherent weight distributions
- KV cache is the next frontier: GEAR demonstrates that combining quantization, low-rank, and sparse approximations achieves near-lossless compression
- Theory guides practice: Rate-distortion theory and lattice coding theory provide principled frameworks for understanding quantization limits
As models continue to grow—GPT-4 class systems likely exceed 1 trillion parameters—efficient compression becomes ever more critical. The mathematics of quantization, from Hadamard transforms to lattice codebooks, will remain essential knowledge for anyone deploying large language models at scale.
This article cites peer-reviewed research from Semantic Scholar, including studies from leading institutions in machine learning and systems research. For complete bibliographic information, see the hyperlinked references throughout the text.
