AI/ML/NLP · January 5, 2026 · 13 min read

LLM Quantization Guide: Run 70B Models on Consumer GPUs with GPTQ, AWQ & GGUF

Run 70B parameter models on a single GPU. Learn LLM quantization from 8-bit to 2-bit precision - GPTQ, AWQ, GGUF, QuIP#, and when to use each method.

Running a 70-billion parameter language model requires 140 GB of memory in 16-bit precision. That's more than most consumer GPUs can handle, and even enterprise deployments struggle with the costs. Quantization—the art of reducing numerical precision while preserving model quality—has become essential for making large language models practical. Recent breakthroughs have pushed the boundaries from 8-bit to 4-bit, and now even 2-bit precision, achieving compression ratios that seemed impossible just two years ago.


The Fundamentals of Quantization

What Is Quantization?

Quantization maps high-precision floating-point numbers to lower-precision representations. In the context of neural networks, this typically means converting 32-bit or 16-bit weights to 8-bit, 4-bit, or even lower bit-widths.

The core quantization formula is:

Q(w) = \text{round}\left(\frac{w - z}{s}\right) \cdot s + z

Where:

  • w is the original weight value
  • s is the scale factor (determines the step size between quantized values)
  • z is the zero-point (offset for asymmetric quantization)
  • Q(w) is the quantized-then-dequantized value
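
To make these quantities concrete, here is a minimal NumPy sketch of 8-bit asymmetric quantize-then-dequantize (variable names are illustrative, not taken from any particular library):

import numpy as np

def quantize_dequantize(w, num_bits=8):
    """Asymmetric uniform quantization: map w to integers in [0, 2^b - 1] and back."""
    qmax = 2 ** num_bits - 1
    w_min, w_max = w.min(), w.max()
    s = (w_max - w_min) / qmax           # scale: step size between levels
    z = w_min                            # zero-point expressed as a value offset
    q = np.clip(np.round((w - z) / s), 0, qmax)   # integer codes in [0, qmax]
    return q * s + z                     # dequantized approximation of w

weights = np.random.randn(1024).astype(np.float32)
w_hat = quantize_dequantize(weights)
print("max abs error:", np.abs(weights - w_hat).max())   # at most about s/2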

Symmetric vs. Asymmetric Quantization

Symmetric quantization assumes weights are centered around zero:

Q_{\text{sym}}(w) = \text{round}\left(\frac{w}{s}\right) \cdot s

Symmetric Quantization (8-bit):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    -127 ────────────── 0 ────────────── +127
           ◀────────────▶◀────────────▶
              negative      positive
              weights       weights

    Scale: s = max(|w|) / 127
    Zero-point: z = 0

Asymmetric quantization allows the zero-point to shift, better capturing distributions that aren't centered at zero:

Q_{\text{asym}}(w) = \text{round}\left(\frac{w - w_{\min}}{s}\right) \cdot s + w_{\min}

Asymmetric Quantization (8-bit):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    0 ──────────────────────────────── 255
    │◀──────── full range ──────────▶│
    ↓                                 ↓
    w_min                           w_max

    Scale: s = (w_max - w_min) / 255
    Zero-point: z = w_min
Method | Pros | Cons
Symmetric | Simpler computation, faster inference | Wastes precision if distribution is skewed
Asymmetric | Better utilizes full quantization range | Requires storing zero-point, slightly slower
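
The trade-off in this table can be checked numerically. The sketch below (illustrative only) compares the mean squared error of symmetric and asymmetric 8-bit quantization on a deliberately skewed, non-zero-centred weight distribution:

import numpy as np

def quant_error(w, num_bits=8, symmetric=True):
    if symmetric:
        qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8-bit
        s = np.abs(w).max() / qmax
        w_hat = np.clip(np.round(w / s), -qmax, qmax) * s
    else:
        qmax = 2 ** num_bits - 1                    # e.g. 255 for 8-bit
        s = (w.max() - w.min()) / qmax
        w_hat = np.clip(np.round((w - w.min()) / s), 0, qmax) * s + w.min()
    return np.mean((w - w_hat) ** 2)

# Skewed, all-positive values: half the symmetric range goes unused
w = np.random.exponential(scale=0.1, size=100_000)
print("symmetric MSE: ", quant_error(w, symmetric=True))
print("asymmetric MSE:", quant_error(w, symmetric=False))   # roughly 4x lower here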

Quantization Approaches: PTQ vs. QAT

Post-Training Quantization (PTQ)

PTQ quantizes a pre-trained model without additional training. It's fast and convenient but can suffer accuracy loss, especially at low bit-widths.

Post-Training Quantization Pipeline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    ┌─────────────┐
    │ Pre-trained │  Full precision (FP16/FP32)
    │   Model     │
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │ Calibration │  Run small dataset to determine
    │   Dataset   │  weight/activation ranges
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │  Compute    │  Calculate scale (s) and
    │  Scales &   │  zero-point (z) per layer
    │ Zero-points │
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │  Quantized  │  INT8/INT4 weights
    │   Model     │
    └─────────────┘
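
A minimal sketch of the calibration and scale-computation steps in this pipeline, with hypothetical helper names (calibrate_ranges, ranges_to_qparams) that are not part of any particular toolkit:

import numpy as np

def calibrate_ranges(layer_outputs):
    """layer_outputs: dict mapping layer name -> list of activation arrays
    gathered while running the calibration dataset."""
    ranges = {}
    for name, activations in layer_outputs.items():
        lo = min(a.min() for a in activations)
        hi = max(a.max() for a in activations)
        ranges[name] = (lo, hi)
    return ranges

def ranges_to_qparams(ranges, num_bits=8):
    """Convert observed (min, max) ranges into asymmetric (scale, zero-point) pairs."""
    qmax = 2 ** num_bits - 1
    return {name: ((hi - lo) / qmax, lo) for name, (lo, hi) in ranges.items()}

# Toy calibration data standing in for real per-layer activations
fake_outputs = {"layer0": [np.random.randn(64) for _ in range(8)]}
qparams = ranges_to_qparams(calibrate_ranges(fake_outputs))
print(qparams["layer0"])   # (scale, zero_point) for layer0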

Quantization-Aware Training (QAT)

QAT simulates quantization during training, allowing the model to adapt to reduced precision. It achieves better accuracy but requires significant computational resources.

Quantization-Aware Training:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Forward Pass:
    w_float ──▶ Quantize ──▶ w_quant ──▶ Dequantize ──▶ w_fake
                   │                           │
                   └─── "Fake quantization" ───┘
                        (differentiable)

Backward Pass:
    Uses Straight-Through Estimator (STE):

    ∂L/∂w ≈ ∂L/∂w_fake  (gradient passes through)
Approach | Training Cost | Accuracy at 4-bit | Use Case
PTQ | None | Good (with calibration) | Quick deployment
QAT | High (full retraining) | Excellent | Production systems
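
The fake-quantization-with-STE idea can be written in a few lines of PyTorch. This is a simplified sketch; production QAT frameworks also quantize activations and track ranges with running observers:

import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; pass gradients straight through."""
    @staticmethod
    def forward(ctx, w, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1
        s = w.abs().max() / qmax
        return torch.clamp(torch.round(w / s), -qmax, qmax) * s

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None    # STE: dL/dw ≈ dL/dw_fake

w = torch.randn(16, 16, requires_grad=True)
loss = (FakeQuantSTE.apply(w) ** 2).sum()
loss.backward()
print(w.grad.shape)   # gradients flow despite the non-differentiable round()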

Quantization Error Analysis

Understanding quantization error is crucial for designing effective compression schemes. The mean squared error (MSE) from quantization is:

\text{MSE} = \mathbb{E}[(w - Q(w))^2]

For uniform quantization with step size Δ, the error is bounded by:

\text{MSE} \leq \frac{\Delta^2}{12}

This assumes the quantization error is uniformly distributed over [-Δ/2, Δ/2].
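
This bound is easy to verify empirically. The snippet below (a quick numerical check, not production code) quantizes uniformly distributed values with a fixed step size and compares the measured MSE against Δ²/12:

import numpy as np

delta = 0.05                                    # quantization step size
x = np.random.uniform(-1, 1, size=1_000_000)    # values well inside the grid
x_hat = np.round(x / delta) * delta             # uniform quantization
print("measured MSE :", np.mean((x - x_hat) ** 2))
print("delta^2 / 12 :", delta ** 2 / 12)        # the two nearly match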

The Outlier Problem

LLM weights often contain outliers—values significantly larger than the typical range. These outliers devastate naive quantization:

Weight Distribution with Outliers:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Count
      │
    ██│
    ██│    99% of weights
    ██│   ┌────────────┐
    ██│   │            │
    ██████████████████ │  │          Outliers (0.1%)
    ──────────────────────────────────────▶
   -0.5            0           0.5      2.0    Value

Problem: If scale is set by outliers (max = 2.0),
         most weights get very few quantization levels.

Solutions to the outlier problem include:

  • Per-channel quantization: Different scales for different output channels (see the sketch after this list)
  • Mixed-precision: Keep outlier channels in higher precision
  • Outlier-aware methods: Explicitly handle outliers (e.g., SmoothQuant, AWQ)
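
Per-channel quantization, the first of these fixes, is simple to express: compute one scale per output channel instead of one scale for the whole tensor, so a single outlier only degrades its own row. A minimal NumPy sketch (illustrative, not any library's API):

import numpy as np

def per_channel_quantize(W, num_bits=8):
    """Symmetric quantization with one scale per output channel (row of W)."""
    qmax = 2 ** (num_bits - 1) - 1
    scales = np.abs(W).max(axis=1, keepdims=True) / qmax   # shape: (out_channels, 1)
    q = np.clip(np.round(W / scales), -qmax, qmax)
    return q.astype(np.int8), scales

# One row contains an outlier; only that row's scale is affected.
W = np.random.randn(4, 8) * 0.1
W[2, 3] = 2.0
q, scales = per_channel_quantize(W)
W_hat = q * scales
print("per-row max error:", np.abs(W - W_hat).max(axis=1))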

QuIP#: State-of-the-Art Extreme Compression

The QuIP# paper from Cornell (2024) achieved a breakthrough in extreme quantization, enabling 2-bit precision with minimal accuracy loss. With 215 citations in under two years, it has become a foundational technique for LLM compression.

The Key Insight: Incoherence

QuIP#'s central innovation is using Hadamard transforms to make weight matrices incoherent before quantization. Incoherence means the weights are "spread out" uniformly, without concentrated outliers.

Incoherence via Hadamard Transform:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Before (Coherent):          After (Incoherent):

    ┌─────────────┐         ┌─────────────┐
    │ ■           │         │ · · · · · · │
    │   ■         │   H×    │ · · · · · · │
    │     ■       │  ───▶   │ · · · · · · │
    │       ■     │         │ · · · · · · │
    │         ■   │         │ · · · · · · │
    └─────────────┘         └─────────────┘

    Outliers concentrated    Energy spread uniformly
    on specific entries      across all entries

The Hadamard Transform

The Hadamard matrix H_n is a square matrix with entries ±1/√n satisfying:

H_n H_n^T = I

For dimension n = 2^k, it's constructed recursively:

H_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}

H_{2n} = \frac{1}{\sqrt{2}} \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix}

QuIP# applies Hadamard transforms to both rows and columns of weight matrices:

W' = H_m W H_n

Because H is orthogonal, the transformation is lossless (can be perfectly reversed) and can be fused into adjacent layers with minimal overhead.
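
The recursive construction and the energy-spreading effect are easy to see in code. A short NumPy sketch (simplified: QuIP# actually uses randomized Hadamard transforms fused into adjacent layers, which this toy version omits):

import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix for n = 2^k, built via the recursion above."""
    if n == 1:
        return np.array([[1.0]])
    H = hadamard(n // 2)
    return np.block([[H, H], [H, -H]]) / np.sqrt(2)

n = 8
H = hadamard(n)
print(np.allclose(H @ H.T, np.eye(n)))        # True: H is orthogonal, so the transform is lossless

W = np.zeros((n, n))
W[3, 5] = 10.0                                # one concentrated outlier weight
W_inc = H @ W @ H                             # W' = H_m W H_n (the Sylvester H is symmetric)
print(np.abs(W).max(), np.abs(W_inc).max())   # 10.0 vs 1.25: the outlier's energy is spread out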

"The Hadamard transform spreads the energy of outlier weights across all entries, creating a more uniform distribution that is far easier to quantize accurately." — Tseng et al., QuIP#, 2024

Lattice Codebooks

The second innovation in QuIP# is using lattice codebooks instead of uniform quantization grids. A lattice is a regular geometric arrangement of points that provides optimal quantization for incoherent vectors.

Uniform Grid vs. Lattice Codebook:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Uniform 2D Grid:              E8 Lattice (8D):
    • • • •                   Optimal sphere packing
    • • • •                   in 8 dimensions
    • • • •
    • • • •                   ┌─────────────────────┐
                              │ Quantization points │
    Wasted space              │ arranged to minimize│
    between points            │ average distance to │
                              │ any weight vector   │
                              └─────────────────────┘

QuIP# uses the E8 lattice for 2-bit quantization. The E8 lattice achieves the optimal packing density in 8 dimensions, meaning quantized weights are, on average, closer to their original values than any other arrangement.
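
To make the lattice codebook idea concrete, here is a sketch of nearest-point search in E8 using the classical decomposition E8 = D8 ∪ (D8 + ½). This is only the rounding step; QuIP# additionally rescales weights and restricts the codebook to a finite subset of E8 points suitable for 2 bits per weight, which is omitted here.

import numpy as np

def closest_point_D8(x):
    """Nearest point in D8: integer vectors whose coordinates sum to an even number."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Fix parity by re-rounding the coordinate with the largest rounding error.
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] >= f[i] else -1.0
    return f

def closest_point_E8(x):
    """E8 = D8 ∪ (D8 + 1/2): take the nearer of the two coset candidates."""
    a = closest_point_D8(x)
    b = closest_point_D8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

x = np.random.randn(8)          # an 8-dimensional block of weights
print(x)
print(closest_point_E8(x))      # its nearest E8 lattice point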

QuIP# Results

Model | Method | Bits | Perplexity (WikiText-2)
Llama-2-70B | FP16 | 16 | 3.32
Llama-2-70B | GPTQ | 4 | 3.85
Llama-2-70B | QuIP | 4 | 3.48
Llama-2-70B | QuIP# | 2 | 4.16

QuIP# achieves 2-bit quantization with perplexity within 25% of full precision—a remarkable result that was considered impossible just a few years ago. This translates to 8x compression, reducing a 140 GB model to under 18 GB.


KV Cache Compression with GEAR

While weight quantization reduces model storage, KV cache compression addresses a different bottleneck: memory usage during inference. The GEAR paper (2024), with 128 citations, presents an elegant solution.

The KV Cache Problem

During autoregressive generation, transformers cache the key and value tensors from all previous tokens to avoid recomputation. For long sequences, this cache dominates memory usage:

KV Cache Memory Growth:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Memory
  │                                          ╱
  │                                        ╱
  │                                      ╱
  │                                    ╱
  │                              KV Cache
  │                            ╱
  │                          ╱
  │                        ╱
  │──────────────────────────────────────  Model Weights
  │                                        (constant)
  └──────────────────────────────────────▶
                                         Sequence Length

At 128K context (Llama-3):
- Model weights: ~140 GB (70B params × 2 bytes)
- KV cache: ~160 GB (grows with sequence × batch)

GEAR: Quantization + Low-Rank + Sparse Residuals

GEAR combines three compression techniques:

GEAR Compression Pipeline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    ┌─────────────────────────────────────────────────┐
    │              Original KV Cache                   │
    │                   (FP16)                         │
    └─────────────────────┬───────────────────────────┘
                          │
                          ▼
    ┌─────────────────────────────────────────────────┐
    │         Uniform Quantization (INT4)              │
    │    Captures bulk of the information cheaply      │
    └─────────────────────┬───────────────────────────┘
                          │
                          ▼
    ┌─────────────────────────────────────────────────┐
    │        Low-Rank Approximation (SVD)              │
    │    K̂ ≈ U × S × V^T  (top-r singular values)     │
    │    Captures systematic quantization error        │
    └─────────────────────┬───────────────────────────┘
                          │
                          ▼
    ┌─────────────────────────────────────────────────┐
    │          Sparse Outlier Storage                  │
    │    Store only top-k largest residual entries     │
    │    (handles remaining outliers)                  │
    └─────────────────────────────────────────────────┘

The reconstruction is:

\text{KV} \approx \text{Dequant}(Q_{\text{int4}}(\text{KV})) + U \Sigma V^T + S

Where:

  • Q_{\text{int4}} is 4-bit uniform quantization
  • U \Sigma V^T is the low-rank correction (rank 2-4)
  • S is the sparse residual (top 1-2% of entries)
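
To see how the three terms combine, here is a toy NumPy sketch of this reconstruction for a single cache matrix. It is a simplification of the method described in the paper: the real GEAR streams the cache during generation and applies these steps per token group, with rank and sparsity chosen adaptively.

import numpy as np

def gear_compress(KV, num_bits=4, rank=4, sparse_frac=0.02):
    # 1. Uniform INT4 quantization of the cache tensor.
    qmax = 2 ** num_bits - 1
    lo, hi = KV.min(), KV.max()
    s = (hi - lo) / qmax
    dequant = np.clip(np.round((KV - lo) / s), 0, qmax) * s + lo

    # 2. Low-rank SVD correction of the quantization residual.
    R = KV - dequant
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    lowrank = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]

    # 3. Sparse storage of the largest remaining residual entries.
    R2 = R - lowrank
    k = max(1, int(sparse_frac * R2.size))
    thresh = np.sort(np.abs(R2).ravel())[-k]
    sparse = np.where(np.abs(R2) >= thresh, R2, 0.0)

    return dequant + lowrank + sparse    # KV ≈ Dequant(Q(KV)) + UΣVᵀ + S

KV = np.random.randn(256, 128).astype(np.float32)    # toy K or V block
KV_hat = gear_compress(KV)
print("relative error:", np.linalg.norm(KV - KV_hat) / np.linalg.norm(KV))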

GEAR Results

Method | Bits | Memory Reduction | Accuracy Loss
FP16 (baseline) | 16 | 1x | 0%
Naive INT4 | 4 | 4x | 8-15%
KIVI | 2 | 8x | 3-5%
GEAR | ~3 | 5-6x | <1%

"GEAR achieves near-lossless compression of the KV cache with 5-6x memory reduction, enabling longer context windows and larger batch sizes without quality degradation." — Kang et al., GEAR, 2024


Practical Quantization: Choosing the Right Approach

Decision Tree for Quantization Strategy

Quantization Strategy Selection:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                    ┌────────────────┐
                    │ What's your    │
                    │ constraint?    │
                    └───────┬────────┘
                            │
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
     ┌──────────┐    ┌──────────┐    ┌──────────┐
     │ Memory   │    │ Accuracy │    │ Both     │
     │ (edge)   │    │ critical │    │          │
     └────┬─────┘    └────┬─────┘    └────┬─────┘
          │               │               │
          ▼               ▼               ▼
     ┌──────────┐    ┌──────────┐    ┌──────────┐
     │ QuIP#    │    │ QAT      │    │ GPTQ/AWQ │
     │ (2-bit)  │    │ (8-bit)  │    │ (4-bit)  │
     └──────────┘    └──────────┘    └──────────┘

For long-context inference: Add GEAR for KV cache

Compression Comparison

Technique | Compression | Quality | Complexity | Best For
INT8 PTQ | 2x | Excellent | Low | Quick deployment
GPTQ 4-bit | 4x | Good | Medium | Consumer GPUs
AWQ 4-bit | 4x | Very Good | Medium | Production
QuIP# 2-bit | 8x | Good | High | Maximum compression
GEAR (KV cache) | 5-6x | Excellent | Medium | Long context

The Mathematics of Optimal Quantization

Rate-Distortion Theory

The fundamental limit of quantization is described by rate-distortion theory. For a source with distribution p(x) and distortion measure d(x, x̂), the minimum achievable rate (in bits) at distortion D is:

R(D) = \min_{p(\hat{x}|x):\, \mathbb{E}[d(x,\hat{x})] \leq D} I(X; \hat{X})

For Gaussian sources with mean squared error distortion:

R(D) = \frac{1}{2} \log_2 \frac{\sigma^2}{D}

By this formula, each additional bit per value cuts the achievable distortion by a factor of four, or equivalently halves the root-mean-square error.
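
A quick worked check of this statement, using the Gaussian formula above (a numerical illustration only):

import math

def gaussian_rate(D, sigma2=1.0):
    """R(D) = 0.5 * log2(sigma^2 / D) for a Gaussian source, valid for D < sigma^2."""
    return 0.5 * math.log2(sigma2 / D)

print(gaussian_rate(0.10))    # ≈ 1.66 bits per value
print(gaussian_rate(0.025))   # ≈ 2.66 bits: 4x lower distortion costs exactly one extra bit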

Why Lattices Are Optimal

Vector quantization outperforms scalar quantization even when values are independent, because it packs quantization cells more efficiently in higher dimensions. For n-dimensional vectors, the normalized second moment measures quantization efficiency:

G_n = \frac{1}{n} \frac{\mathbb{E}[\|X - Q(X)\|^2]}{V^{2/n}}

Where V is the volume of the quantization cell. The E8 lattice achieves:

G_{E_8} \approx 0.0717

Compared to uniform scalar quantization:

G_1 = \frac{1}{12} \approx 0.0833

This 14% reduction in quantization error per dimension compounds significantly over millions of weights.


Future Directions

Emerging Techniques

  1. 1-bit quantization: Binary and ternary networks are approaching practical quality for specific tasks

  2. Dynamic quantization: Adjusting precision based on input difficulty or layer importance

  3. Quantization-aware architecture design: Models designed from scratch for efficient quantization

  4. Hardware co-design: Custom accelerators optimized for specific quantization formats

The Path to Sub-2-Bit

Theoretical Compression Limits:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Bits/Parameter    Technique              Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     16          FP16 (baseline)         Standard
      8          INT8 PTQ                Mature
      4          GPTQ/AWQ                Mainstream
      2          QuIP#                   State-of-art
     1.5         Ternary + Entropy       Research
      1          Binary                  Emerging
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Each halving of bit-width = 2x compression
16-bit → 2-bit = 8x smaller = 70B model in ~18 GB

Conclusion

Quantization has transformed from a nice-to-have optimization into a critical enabling technology for large language models. The journey from 16-bit to 2-bit precision—an 8x compression—makes models like Llama-2-70B accessible on consumer hardware.

The key insights from recent research are:

  1. Incoherence matters: QuIP# showed that preprocessing weights with Hadamard transforms enables extreme quantization by eliminating outliers

  2. Optimal codebooks: Lattice-based quantization (E8 lattice) provides theoretically optimal compression for incoherent weight distributions

  3. KV cache is the next frontier: GEAR demonstrates that combining quantization, low-rank, and sparse approximations achieves near-lossless compression

  4. Theory guides practice: Rate-distortion theory and lattice coding theory provide principled frameworks for understanding quantization limits

As models continue to grow—GPT-4 class systems likely exceed 1 trillion parameters—efficient compression becomes ever more critical. The mathematics of quantization, from Hadamard transforms to lattice codebooks, will remain essential knowledge for anyone deploying large language models at scale.

This article cites peer-reviewed research from Semantic Scholar, including studies from leading institutions in machine learning and systems research. For complete bibliographic information, see the hyperlinked references throughout the text.
