Quantization is a widely adopted method for improving the efficiency of AI models, but recent research highlights its limitations. The technique reduces the number of bits used to represent information, much like simplifying "12:00:01.004" to just "noon." Representing a model's internal variables, known as parameters, with fewer bits lowers the memory and computing power the model needs, a meaningful saving given the millions of calculations these models perform.
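As a rough illustration of the idea (a minimal sketch, not any particular library's implementation), the snippet below rounds a float32 weight tensor onto a symmetric 8-bit integer grid and measures the error that rounding introduces; the helper names and the scaling scheme are made up for the example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric 8-bit quantization: map the largest weight magnitude to 127."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately reconstruct the original float32 weights."""
    return q.astype(np.float32) * scale

weights = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)  # stand-in parameters
q, scale = quantize_int8(weights)
print("mean rounding error:", np.abs(weights - dequantize(q, scale)).mean())
```

The int8 copy takes a quarter of the memory of the float32 original; the small rounding error it prints is the price of that saving.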
However, studies suggest that quantization may have significant drawbacks. Research conducted by experts from institutions like Harvard and MIT shows that heavily quantized models can perform worse if their original, unquantized versions were trained extensively on large datasets. This means it might be more effective to train smaller models from the start rather than downsizing larger ones through quantization.
This challenge is becoming evident with certain large-scale models, such as Meta’s Llama 3. Developers have found that quantizing these models degrades their quality more than expected, likely because they were trained on such large amounts of data. This raises concerns for AI companies that rely on training massive models and then quantizing them to reduce serving costs.
While training AI models is expensive, inference (running a trained model to generate results) is often even more expensive in aggregate. For example, training Google’s Gemini model cost an estimated $191 million, yet using a model to generate 50-word answers to half of all Google Search queries could cost roughly $6 billion a year.
Despite the diminishing returns of scaling up models with larger datasets, the industry continues to focus on this approach. For instance, Meta trained Llama 3 on 15 trillion tokens, far exceeding the 2 trillion tokens used for Llama 2. Yet larger datasets don’t always lead to proportionally better results, as seen in models from companies like Anthropic and Google.
To address these challenges, researchers are exploring ways to make models more robust during training. Training in lower precision, such as using 8-bit instead of the more common 16-bit precision, could help. However, reducing precision too much—below 7 or 8 bits—can noticeably degrade performance, especially in smaller models.
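To see what "training in lower precision" can look like in practice, here is a hedged toy sketch (not the researchers' actual setup): the forward pass only ever sees weights rounded to a b-bit grid, while updates are applied to a full-precision master copy, a straight-through-style trick often used to simulate low-precision training. The `fake_quantize` helper, the bit width, and the linear-regression task are all illustrative assumptions.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Round w onto a symmetric b-bit grid but keep it stored as floats."""
    levels = 2 ** (bits - 1) - 1                  # e.g. 127 representable steps at 8 bits
    scale = np.abs(w).max() / levels + 1e-12
    return np.clip(np.round(w / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                     # toy inputs
true_w = rng.normal(size=8)
y = X @ true_w                                    # toy targets

w = np.zeros(8)                                   # full-precision master weights
lr, bits = 0.05, 8
for step in range(200):
    w_q = fake_quantize(w, bits)                  # low-precision view used in the forward pass
    grad = 2 * X.T @ (X @ w_q - y) / len(y)       # gradient of the mean squared error
    w -= lr * grad                                # update the master copy
print("final loss:", np.mean((X @ fake_quantize(w, bits) - y) ** 2))
```

Lowering `bits` in this toy (say, to 4) makes the final loss noticeably worse, loosely mirroring the degradation the researchers describe at very low precision.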
Hardware developments, such as Nvidia’s Blackwell chip with its support for 4-bit precision (a data type called FP4), aim to make very low-precision quantization practical. Researchers caution against overly aggressive quantization, however, emphasizing that AI models have finite capacity: trying to compress vast amounts of training data into ever smaller, lower-precision models eventually compromises their quality.
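For a concrete sense of why very low bit widths are risky, the short measurement below (reusing the same simple symmetric integer rounding from the sketches above, which is cruder than Blackwell's FP4 format) shows how the average rounding error on a random weight tensor grows as the bit width shrinks; each bit removed roughly doubles the spacing between representable values.

```python
import numpy as np

def quant_error(w: np.ndarray, bits: int) -> float:
    """Mean absolute error after rounding w onto a symmetric b-bit integer grid."""
    levels = 2 ** (bits - 1) - 1                  # 127 at 8 bits, 7 at 4 bits
    scale = np.abs(w).max() / levels
    w_q = np.clip(np.round(w / scale), -levels, levels) * scale
    return float(np.abs(w - w_q).mean())

w = np.random.default_rng(1).normal(size=100_000).astype(np.float32)
for bits in (8, 6, 4):
    print(f"{bits}-bit mean rounding error: {quant_error(w, bits):.5f}")
```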
This research underscores the complexity of optimizing AI models. Reducing inference costs through quantization isn’t a limitless solution. Instead, the focus may need to shift toward curating high-quality data and developing new architectures that support stable, low-precision training.
As Tanishq Kumar, the study’s lead author, puts it, "Bit precision matters, and it’s not free. Careful data selection and innovative architectures will be crucial for creating efficient and effective AI models in the future."