Quantization
Infrastructure
A technique for reducing AI model size and memory requirements by representing weights with lower-precision numbers.
Full Explanation
A full-precision model stores weights as 32-bit floats; modern LLMs are typically trained and released in 16-bit formats (FP16/BF16). 4-bit quantization reduces each weight to 4 bits: an 8x size reduction from 32-bit, or 4x from 16-bit, with modest quality loss. This makes large models runnable on consumer hardware. LLaMA 3.1 405B in its native 16-bit precision requires about 810GB of VRAM; 4-bit quantized versions need roughly 200GB, fitting on a single multi-GPU server rather than a cluster. Smaller models benefit the same way: an 8B model drops from ~16GB to ~4-5GB and fits on one consumer GPU. Tools: llama.cpp, GGUF format, bitsandbytes.
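The size arithmetic above can be sketched with a toy symmetric 4-bit quantizer in NumPy. This is an illustrative example, not the algorithm used by llama.cpp or bitsandbytes (production quantizers work per-block and use more sophisticated schemes); the function names here are hypothetical.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    # Symmetric quantization: map floats onto signed 4-bit integers.
    # scale is chosen so the largest weight maps to 7 (the max positive value).
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights from the integers and the scale.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)

q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)

# 32-bit floats cost 4 bytes per weight; 4-bit ints cost 0.5 bytes
# (two weights packed per byte), hence the 8x reduction.
fp32_bytes = w.nbytes            # 1000 * 4 = 4000
int4_bytes = len(q) / 2          # 1000 * 0.5 = 500
print(fp32_bytes / int4_bytes)   # 8.0
```

Each weight is only approximated (the rounding error is bounded by half the scale), which is why aggressive quantization trades a little quality for the large memory savings.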
Related Terms
Open Weights: AI models whose weights and (sometimes) training code are publicly released, allowing anyone to run, modify, and build on them.
Inference: The process of running a trained AI model to generate outputs — what happens when you use an AI tool.
Large Language Model (LLM): A type of AI model trained on vast amounts of text data that can generate, summarize, translate, and reason about language.