
Why QTT for vector search (and what that sentence actually means)



Most people who have stared at a hundred million 384-dimensional vectors for a while have, at some point, tried to compress them. Product quantization, scalar quantization, dimensionality reduction, binary embeddings. The menu is long. What we're doing with Quantized Tensor Train decomposition (QTT) is a different kind of compression, and the difference matters for search.

What QTT actually is

A Tensor Train factorizes a high-order tensor into a chain of low-rank cores. The "quantized" part means we first split each dimension into a product of small factors (typically 2) and treat the result as a much higher-order tensor. Concretely, a length-1024 vector reshapes into an order-10 binary tensor (2 × 2 × … × 2). Then you factor that.
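To make the reshape-then-factor step concrete, here is a minimal TT-SVD sketch in plain NumPy. This is illustrative only: `tt_svd` and `tt_reconstruct` are hypothetical helpers for this post, not our shipped kernels.

```python
import numpy as np

def tt_svd(vec, max_rank, tol=1e-10):
    """Toy TT-SVD on a quantized (binary-reshaped) vector.

    Reshapes a length-2**d vector into a d-order tensor of shape
    (2,) * d and factors it into d cores by repeated truncated SVD.
    """
    d = int(np.log2(vec.size))
    assert 2 ** d == vec.size, "length must be a power of two"
    cores = []
    rest = vec.reshape(1, -1)              # (r_prev, remaining entries)
    r_prev = 1
    for _ in range(d - 1):
        rest = rest.reshape(r_prev * 2, -1)
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        r = min(max_rank, int(np.sum(s > tol * s[0])))  # truncate rank
        cores.append(u[:, :r].reshape(r_prev, 2, r))
        rest = s[:r, None] * vt[:r]        # carry the remainder forward
        r_prev = r
    cores.append(rest.reshape(r_prev, 2, 1))
    return cores

def tt_reconstruct(cores):
    """Contract the cores back into a dense vector (for checking)."""
    out = cores[0].reshape(2, -1)
    for core in cores[1:]:
        out = out @ core.reshape(core.shape[0], -1)
        out = out.reshape(-1, core.shape[2])
    return out.ravel()
```

On a structured input (say, a sampled sinusoid) the truncation step keeps the ranks tiny; on unstructured input it cannot, which is the whole story of the next section.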

If the underlying data has structure (smoothness, separability, low effective rank) the resulting cores are small, and the total parameter count grows with the logarithm of the original size. That is where the compression numbers on /benchmarks come from.
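The logarithmic growth is easy to check by counting parameters directly: a length-2^d vector becomes d cores, the interior ones at most r × 2 × r, so the total is O(r² · log₂ N). A back-of-envelope calculator (illustrative arithmetic with an assumed uniform rank cap, not library code):

```python
def qtt_params(n, r):
    """Parameter count for a rank-r QTT chain over n = 2**d entries.

    Boundary cores are (1, 2, r) and (r, 2, 1); the d - 2 interior
    cores are (r, 2, r). Assumes every rank hits the cap r.
    """
    d = n.bit_length() - 1                 # n assumed a power of two
    return 2 * (2 * r) + (d - 2) * (2 * r * r)

print(qtt_params(1024, 4))    # 272 numbers vs 1024 raw entries
print(qtt_params(2**20, 4))   # 592 numbers vs ~1M raw entries
```

Doubling the vector length adds one more core, not twice the storage; the rank, not the length, is what you pay for.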

The precondition you have to read

QTT only compresses data that has structure. An i.i.d. Gaussian tensor does not factor usefully, and neither do the worst-case inputs that information-theoretic compression bounds are built around. Embedding spaces from real models (BGE, E5, text-embedding-3, image backbones, etc.) are structured in practice. That is what makes this useful in the first place. We publish the regime where this holds and the regime where it doesn't.
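One way to see the precondition at work is to compare the numerical rank of a QTT unfolding for a structured vector against an i.i.d. Gaussian one. This is a toy check with an assumed tolerance, not our rank-selection logic:

```python
import numpy as np

def qtt_unfolding_rank(vec, tol=1e-6):
    """Numerical rank of the middle QTT unfolding of a length-1024 vector.

    Splits the 10 binary modes 5/5 into a 32 x 32 matrix and counts
    singular values above tol times the largest one.
    """
    s = np.linalg.svd(vec.reshape(32, 32), compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
smooth = np.sin(np.linspace(0, 4 * np.pi, 1024))   # structured
noise = rng.standard_normal(1024)                  # i.i.d. Gaussian
print(qtt_unfolding_rank(smooth))   # 2: two exponentials, tiny cores
print(qtt_unfolding_rank(noise))    # near-full rank: nothing to compress
```

The structured signal collapses to rank 2; the Gaussian one stays essentially full rank, so its "compressed" form would be as big as the original.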

Why this helps vector search specifically

Traditional approximate nearest neighbor (ANN) indexes (HNSW, IVF, ScaNN) trade recall for speed. They build a graph or a coarse partitioning and then do a bounded walk. Memory scales with the number of embeddings; the index is a second copy on top of the raw vectors.

With QTT, the representation is the index. The inner product against a query is computed directly on the compressed cores, which is cheap when the cores are low rank. There's no separate "graph" to walk. That is what lets the fp32 tier hit a 1.3 GB index over 10M vectors at D=384, as opposed to the 15–20 GB HNSW numbers you see in published benchmarks.
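The core-level inner product works by contracting the two chains left to right, carrying only a small rank-by-rank matrix and never materializing the dense vectors. A sketch of that idea (illustrative NumPy under the assumption of matching mode sizes, not the shipped GPU kernel):

```python
import numpy as np

def tt_dot(a_cores, b_cores):
    """Inner product of two vectors given as TT core chains.

    Contract core by core: the running state is an (ra, rb) matrix,
    so the cost is linear in the number of cores and polynomial in
    the ranks, independent of the dense length.
    """
    m = np.ones((1, 1))                    # running (ra, rb) contraction
    for a, b in zip(a_cores, b_cores):
        # m[ra, rb] * a[ra, i, ra'] * b[rb, i, rb'] -> m'[ra', rb']
        m = np.einsum("ab,aic,bid->cd", m, a, b)
    return float(m[0, 0])

def tt_to_dense(cores):
    """Expand a core chain back to a dense vector (for checking only)."""
    out = cores[0].reshape(2, -1)
    for core in cores[1:]:
        out = out @ core.reshape(core.shape[0], -1)
        out = out.reshape(-1, core.shape[2])
    return out.ravel()
```

When the ranks are small, `tt_dot` touches a few hundred numbers per query instead of the full vector, which is where the per-query cost advantage comes from.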

Why this doesn't help everywhere

  • Rank-limited speedup. The core of the method is a rank parameter. Workloads whose embeddings have genuinely high effective rank get smaller speedups. At the extreme (random unit vectors in R^D) QTT gives you almost nothing.
  • Insert throughput. Rebuilding the TT cores on the hot path is expensive. Batched insert is fine; single-record streaming insert is bounded by the kernel we ship (~885/s today). We surface this on /platform under Honest tradeoffs.
  • Concurrency ceiling. A single GPU node. One representation. Beyond ~64 simultaneous clients at 10M scale, p50 starts climbing. Multi-node sharding is designed but not shipped.

What we are not claiming

We are not claiming QTT beats a well-tuned HNSW on p50 for small working sets. HNSW is genuinely faster under those conditions, and the numbers on the per-vendor comparison on /benchmarks say so. We are claiming three things that are backed by signed receipts:

  • 100M vectors fit on a single H100 80GB node with 100% exact recall on the QTT-Native fp32 path.
  • The index build is 10–30× faster than the HNSW reference at 10M.
  • The representation is the only copy of the data. Not an index on top of it.

Further reading

The closest academic reference points are Oseledets' original tensor-train paper and subsequent work on QTT in scientific computing. The core idea is old; the interesting engineering is the GPU kernel path and the per-tier recall curves, which live on /benchmarks.