Most people who have stared at a hundred million 384-dimensional vectors for a while have, at some point, tried to compress them. Product quantization, scalar quantization, dimensionality reduction, binary embeddings. The menu is long. What we're doing with Quantized Tensor Train decomposition (QTT) is a different kind of compression, and the difference matters for search.
What QTT actually is
A Tensor Train factorizes a high-order tensor into a chain of low-rank cores. The "quantized" part means we first reshape each dimension into a sequence of small factors (typically of size 2) and treat the result as a much higher-order tensor. Concretely, a length-1024 vector becomes an order-10 tensor with each mode of size 2. Then you factor it.
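To make the reshape-then-factor step concrete, here is a minimal TT-SVD sketch in NumPy. The names `tt_decompose` and `tt_reconstruct` are ours for illustration, not a library or product API, and the sequential-SVD scheme is the textbook construction, not the shipped kernel:

```python
import numpy as np

def tt_decompose(vec, mode=2, max_rank=8, tol=1e-10):
    """Reshape a length-mode**d vector into an order-d tensor of mode
    size `mode`, then factor it into a chain of low-rank TT cores via
    sequential truncated SVDs."""
    d = int(round(np.log(vec.size) / np.log(mode)))
    assert mode ** d == vec.size
    cores = []
    rest = vec.reshape(1, -1)  # start with a rank-1 left boundary
    r_prev = 1
    for _ in range(d - 1):
        # Unfold: rows = (left rank * mode), columns = everything else.
        mat = rest.reshape(r_prev * mode, -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = max(1, min(max_rank, int(np.sum(s > tol))))
        cores.append(u[:, :r].reshape(r_prev, mode, r))
        rest = s[:r, None] * vt[:r]  # carry the remainder forward
        r_prev = r
    cores.append(rest.reshape(r_prev, mode, 1))
    return cores

def tt_reconstruct(cores):
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape(-1)

# A smooth signal has low effective TT rank, so it compresses well.
x = np.sin(np.linspace(0, 4 * np.pi, 1024))
cores = tt_decompose(x)
approx = tt_reconstruct(cores)
n_params = sum(c.size for c in cores)
print(np.allclose(x, approx, atol=1e-6), n_params)  # far fewer than 1024 parameters
```

The parameter count is the sum of the core sizes, which is what grows logarithmically when the ranks stay bounded.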
The precondition you have to read
If the underlying data has structure (smoothness, separability, low effective rank), the resulting cores are small, and the total parameter count grows with the logarithm of the original size. That is where the compression numbers on /benchmarks come from. Without that structure, the cores do not shrink, and neither does your index.
Why this helps vector search specifically
Traditional approximate nearest neighbor (ANN) indexes (HNSW, IVF, ScaNN) trade recall for speed. They build a graph or a coarse partitioning and then do a bounded walk. Memory scales with the number of embeddings; the index is a second copy on top of the raw vectors.
With QTT, the representation is the index. The inner product against a query is computed directly on the compressed cores, which is cheap when the cores are low rank. There's no separate "graph" to walk. That is what lets the fp32 tier hit a 1.3 GB index over 10M vectors at D=384, as opposed to the 15–20 GB HNSW numbers you see in published benchmarks.
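A sketch of what "computed directly on the compressed cores" means, using rank-1 (Kronecker-structured) vectors for brevity. The helper names are hypothetical and this is not the shipped GPU kernel; the point is that the inner product reduces to a chain of small per-core contractions, never materializing the length-2^d vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10  # order: each TT represents a length-2^10 = 1024 vector

def rank1_tt(factors):
    # A Kronecker-structured vector as rank-1 TT cores of shape (1, 2, 1).
    return [f.reshape(1, 2, 1) for f in factors]

def tt_inner(a_cores, b_cores):
    """Inner product computed core by core: O(d) small contractions
    whose cost depends on the TT ranks, not on 2^d."""
    M = np.ones((1, 1))  # boundary matrix, shape (rank_a, rank_b)
    for A, B in zip(a_cores, b_cores):
        # A: (ra_prev, 2, ra), B: (rb_prev, 2, rb)
        M = np.einsum('ij,ink,jnl->kl', M, A, B)
    return M[0, 0]

fa = [rng.standard_normal(2) for _ in range(d)]
fb = [rng.standard_normal(2) for _ in range(d)]
a, b = rank1_tt(fa), rank1_tt(fb)

# Cross-check against the dense inner product of the full 1024-vectors.
dense_a, dense_b = fa[0], fb[0]
for f in fa[1:]:
    dense_a = np.kron(dense_a, f)
for f in fb[1:]:
    dense_b = np.kron(dense_b, f)

print(np.isclose(tt_inner(a, b), dense_a @ dense_b))
```

With higher-rank cores the same contraction applies; the cost scales with the ranks, which is why low-rank data is the fast path.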
Why this doesn't help everywhere
- Rank-limited speedup. The core of the method is a rank parameter. Workloads whose embeddings have genuinely high effective rank get smaller speedups. At the extreme (random unit vectors in R^D), QTT gives you almost nothing.
- Insert throughput. Rebuilding the TT cores on the hot path is expensive. Batched insert is fine; single-record streaming insert is bounded by the kernel we ship (~885/s today). We surface this on /platform under Honest tradeoffs.
- Concurrency ceiling. A single GPU node. One representation. Beyond ~64 simultaneous clients at 10M scale, p50 starts climbing. Multi-node sharding is designed but not shipped.
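The rank-limited caveat is easy to check numerically: TT ranks are bounded by the ranks of the matrix unfoldings of the reshaped vector, and a random vector's unfoldings are full rank while a smooth signal's stay small. A quick illustration of our own (not part of the product):

```python
import numpy as np

def unfolding_rank(vec, split, tol=1e-8):
    """Effective rank of the unfolding at one bond of the order-10
    reshaping of a length-1024 vector; TT ranks are bounded by these."""
    mat = vec.reshape(2 ** split, -1)
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
smooth = np.sin(np.linspace(0, 4 * np.pi, 1024))
noise = rng.standard_normal(1024)

for split in range(1, 10):
    print(split, unfolding_rank(smooth, split), unfolding_rank(noise, split))
```

The smooth signal's unfolding ranks stay at 2 at every bond; the random vector's hit the maximum possible (32 at the middle bond), which is the regime where QTT buys you almost nothing.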
What we are not claiming
We are not claiming QTT beats a well-tuned HNSW on p50 for small working sets. HNSW is genuinely faster under those conditions, and the per-vendor comparison on /benchmarks shows it. We are claiming three things that are backed by signed receipts:
- 100M vectors fit on a single H100 80GB node with 100% exact recall on the QTT-Native fp32 path.
- The index build is 10–30× faster than the HNSW reference at 10M.
- The representation is the only copy of the data. Not an index on top of it.
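For context on the first claim, the back-of-envelope arithmetic: the raw fp32 footprint of 100M vectors at D=384 is roughly double a single H100's 80 GB, so an uncompressed copy cannot even be resident on the node.

```python
# Raw fp32 storage for 100M embeddings at D=384, versus 80 GB on one H100.
n_vectors, dim, bytes_per_float = 100_000_000, 384, 4
raw_gb = n_vectors * dim * bytes_per_float / 1e9
print(raw_gb)  # 153.6 GB
```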
Further reading
The closest academic reference points are Oseledets' original tensor-train paper and subsequent work on QTT in scientific computing. The core idea is old; the interesting engineering is the GPU kernel path and the per-tier recall curves, which live on /benchmarks.