Google develops a technique for AI models to use up to six times less memory

24/03/2026

AI models need large amounts of memory to run quickly. Google Research has introduced TurboQuant, a compression algorithm that cuts that memory use by up to six times with no loss of accuracy.

AI models, like those powering virtual assistants or modern search engines, work by processing enormous amounts of information. To do so quickly, they store part of that information in a kind of working memory, similar to notes taken while studying to avoid rereading an entire book each time. The problem is that this memory takes up a lot of space and becomes a bottleneck that slows systems down and drives up operating costs.

Google Research has developed TurboQuant, a technique that drastically reduces the space used by this working memory without causing the model to make more errors. In tests, the team compressed that information by up to six times with no loss of accuracy, storing each value in just 3 bits instead of 32, while the system ran up to eight times faster than the uncompressed version on specialised hardware such as Nvidia H100 GPUs.
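As a rough illustration of what the change in bit width means, the sketch below computes the raw storage needed for a block of values at 32 bits and at 3 bits each. The function name `memory_bytes` and the one-million-value block are assumptions made for illustration and do not come from Google's tests. The raw bit-width ratio is also not the same thing as the reported "up to six times" figure, and the article does not break down how that figure was measured.

```python
def memory_bytes(num_values: int, bits_per_value: int) -> float:
    """Raw storage in bytes, ignoring any metadata or padding."""
    return num_values * bits_per_value / 8


num_values = 1_000_000  # hypothetical block size, chosen for illustration
full = memory_bytes(num_values, 32)
compressed = memory_bytes(num_values, 3)

print(f"32-bit: {full:,.0f} bytes")
print(f" 3-bit: {compressed:,.0f} bytes")
print(f"raw bit-width ratio: {full / compressed:.2f}x")
```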

The approach combines two mathematical techniques. The first reorganises data more compactly, in a way comparable to describing a location with an angle and a distance rather than X and Y coordinates: less information is needed to convey the same thing. The second uses just one additional bit to correct the small errors introduced by compression, acting as an automatic corrector that maintains the accuracy of the final result.
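The two ideas can be illustrated with a toy example in two dimensions. This is a simplified sketch, not TurboQuant's actual algorithm: the names `encode` and `decode`, the choice to keep the distance at full precision, and the sign-bit correction rule are all assumptions made for illustration. The point is described by an angle and a distance, the angle is rounded to one of 2^bits levels, and one extra bit records on which side of that level the true angle lies.

```python
import math

TWO_PI = 2 * math.pi


def wrap(angle: float) -> float:
    """Map an angle difference into the range [-pi, pi)."""
    return (angle + math.pi) % TWO_PI - math.pi


def encode(x: float, y: float, bits: int = 3):
    """Return (distance, angle index, sign bit) for the point (x, y)."""
    levels = 1 << bits
    step = TWO_PI / levels
    r = math.hypot(x, y)      # distance, kept at full precision here
    theta = math.atan2(y, x)  # angle in (-pi, pi]
    index = round((theta % TWO_PI) / step) % levels
    residual = wrap(theta - index * step)
    sign_bit = 1 if residual >= 0 else 0  # the one extra correction bit
    return r, index, sign_bit


def decode(r: float, index: int, sign_bit: int, bits: int = 3,
           use_correction: bool = True):
    """Rebuild (x, y); optionally nudge the angle using the sign bit."""
    step = TWO_PI / (1 << bits)
    theta = index * step
    if use_correction:
        # Move a quarter step toward the side the true angle was on.
        theta += step / 4 if sign_bit else -step / 4
    return r * math.cos(theta), r * math.sin(theta)


r, index, sign_bit = encode(1.0, 0.5)
print("coarse:   ", decode(r, index, sign_bit, use_correction=False))
print("corrected:", decode(r, index, sign_bit, use_correction=True))
```

In this toy version the worst-case angular error falls from half a quantisation step to a quarter of a step once the sign bit is used. For an individual point that already sits very close to a level, the nudge can make the result slightly worse, so the gain applies to the worst case rather than to every value.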

One of the most notable practical advantages is that it requires neither retraining models from scratch nor fine-tuning them. TurboQuant applies directly to existing models, making adoption considerably easier. Google notes that the technique also improves semantic search engines, the kind that let search tools match the meaning of a query rather than look for exact keywords.
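To show what matching by meaning rather than by keywords looks like in practice, here is a minimal sketch of vector-based search. The three-number vectors, the document titles, and the function names are invented for illustration; real systems use vectors with hundreds or thousands of dimensions produced by a trained model, and this sketch does not apply any compression.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def rank(query_vector, documents):
    """Return document titles sorted from most to least similar."""
    return sorted(
        documents,
        key=lambda title: cosine_similarity(query_vector, documents[title]),
        reverse=True,
    )


# Hypothetical vectors: the first two documents are about travel.
documents = {
    "cheap flights to Paris": [0.9, 0.1, 0.0],
    "budget airfare France": [0.8, 0.2, 0.1],
    "how to bake bread": [0.0, 0.1, 0.9],
}

# Hypothetical vector for the query "low-cost plane tickets".
query_vector = [0.85, 0.15, 0.05]

for title in rank(query_vector, documents):
    print(title)
```

The query shares no keywords with either travel document, yet both rank above the baking one because their vectors point in a similar direction.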

The research is backed by theoretical proofs showing that the results come close to the mathematical limit of achievable efficiency.

Key points

  • Google Research introduces TurboQuant, an algorithm that compresses AI model working memory by up to six times.
  • The technique does not reduce model accuracy and requires no retraining.
  • In tests, the system ran up to eight times faster on specialised hardware.
  • Compression is achieved by storing each value in just 3 bits instead of the usual 32.
  • It also improves the speed of large-scale semantic search engines.

