Benchmarks

1 billion vectors. Single GPU. 100% exact recall.

Measured on an H100 80GB. No approximation. Every number cryptographically signed.

NVIDIA H100 80GB HBM3 · CUDA 12.8 · D=384, rank=32 · April 2026

1,000,000,000 vectors on a single GPU
38.5ms p50 latency
100% Recall@10
73 GB VRAM
46.5x reduction

ML-DSA-65 signed (FIPS 204). SHA-256 manifest for all artifacts.
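
As a sanity-check sketch, the SHA-256 half of that manifest could be verified like this; the manifest filename and the sha256sum-style line format are assumptions, and ML-DSA-65 signature verification is out of scope here.

    import hashlib
    from pathlib import Path

    def verify_manifest(manifest_path: str) -> bool:
        """Recompute SHA-256 for every artifact listed in the manifest and compare.

        Assumes sha256sum-style lines: '<hex digest>  <relative path>'.
        """
        root = Path(manifest_path).parent
        ok = True
        for line in Path(manifest_path).read_text().splitlines():
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            digest = hashlib.sha256((root / name).read_bytes()).hexdigest()
            if digest != expected:
                print(f"MISMATCH {name}: {digest} != {expected}")
                ok = False
        return ok

    if __name__ == "__main__":
        # "benchmarks.sha256" is a placeholder name for the published manifest.
        print("manifest OK" if verify_manifest("benchmarks.sha256") else "manifest FAILED")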

Scale ceiling

fp16 | Scale Tier

QTT compression converges as the dataset grows. More data, better recall.

Scale | Build  | Serving | VRAM    | p50    | p99    | R@10
400M  | 485s   | 26.4 GB | 29.7 GB | 15.8ms | 18.0ms | 96%
500M  | 699s   | 33.0 GB | 36.9 GB | 20.5ms | 24.4ms | 98%
600M  | 813s   | 39.6 GB | 44.1 GB | 23.4ms | 26.3ms | 99%
800M  | 1,122s | 52.8 GB | 58.5 GB | 31.0ms | 33.5ms | 99%
1B    | 1,418s | 66.0 GB | 72.9 GB | 38.5ms | 41.0ms | 100%

Why recall improves

More vectors provide richer structure for the tensor train to exploit. At 1B entries the compression reaches full convergence — a mathematical property, not a tuning parameter.
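
To make that intuition concrete, here is a toy numpy sketch of tensor-train compression (plain TT-SVD) applied to a synthetic embedding matrix with shared low-rank structure: with many vectors sharing structure, fixed-rank cores capture most of the data in far fewer parameters. The reshaping, the rank cap of 32, and the low-rank-plus-noise data model are illustrative assumptions for this sketch, not the product's actual QTT pipeline.

    import numpy as np

    def tt_svd(tensor, max_rank=32):
        """Sequential-SVD tensor-train decomposition (TT-SVD), ranks capped at max_rank."""
        dims = tensor.shape
        cores, r_prev = [], 1
        mat = tensor.reshape(dims[0], -1)
        for k in range(len(dims) - 1):
            u, s, vt = np.linalg.svd(mat, full_matrices=False)
            r = min(max_rank, len(s))
            cores.append(u[:, :r].reshape(r_prev, dims[k], r))
            mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
            r_prev = r
        cores.append(mat.reshape(r_prev, dims[-1], 1))
        return cores

    def tt_reconstruct(cores):
        """Contract the cores back into the full tensor."""
        out = cores[0]
        for core in cores[1:]:
            out = np.tensordot(out, core, axes=([-1], [0]))
        return out.reshape(tuple(c.shape[1] for c in cores))

    rng = np.random.default_rng(0)
    n, d, signal_rank = 4096, 384, 16
    # Synthetic embeddings: shared low-rank structure plus a small amount of noise.
    emb = rng.standard_normal((n, signal_rank)) @ rng.standard_normal((signal_rank, d))
    emb += 0.01 * rng.standard_normal((n, d))

    # View the 4096 x 384 matrix as an 8x8x8x8 x 4x4x4x6 tensor (the "quantized" reshaping idea).
    tensor = emb.reshape(8, 8, 8, 8, 4, 4, 4, 6)
    cores = tt_svd(tensor, max_rank=32)

    params = sum(c.size for c in cores)
    rel_err = np.linalg.norm(tt_reconstruct(cores).reshape(n, d) - emb) / np.linalg.norm(emb)
    print(f"{emb.size / params:.1f}x fewer parameters, relative error {rel_err:.4f}")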

Production

fp32 | 100% recall at every tested scale

Scale | Build | Serving | VRAM    | p50    | p99    | R@10
200M  | 354s  | 25.8 GB | 28.1 GB | 34.9ms | 39.7ms | 100%
300M  | 347s  | 39.6 GB | 44.2 GB | 46.3ms | 47.5ms | 100%
400M  | 495s  | 52.8 GB | 58.6 GB | 61.5ms | 62.6ms | 100%
500M  | 622s  | 66.0 GB | 73.0 GB | 76.5ms | 77.7ms | 100%

Exact

fp64 | Full double-precision arithmetic

Scale | Build | p50     | p99     | R@10 | VRAM    | B/entry
1K    | 0.0s  | 0.70ms  | 1.97ms  | 100% | 0 MB    | 420
100K  | 0.2s  | 0.86ms  | 1.45ms  | 100% | 26 MB   | 278
1M    | 2.1s  | 1.18ms  | 5.10ms  | 100% | 264 MB  | 277
10M   | 24.2s | 2.68ms  | 4.66ms  | 100% | 2.6 GB  | 277
50M   | 164s  | 9.37ms  | 10.88ms | 100% | 12.9 GB | 277
100M  | 104s  | 20.40ms | 21.50ms | 100% | 25.8 GB | 277
200M  | 212s  | 40.40ms | 41.80ms | 100% | 50.4 GB | 271

Comparison

vs HNSW at 10M vectors

Milvus / Qdrant / Weaviate. D=384.

Metric  | HX-SDP (fp32) | HNSW (fp32)
Index   | 1.3 GB        | 15–20 GB
VRAM    | 3.0 GB        | 18–24 GB
Recall  | 100% exact    | 95–99% approx
Build   | 21s           | 2–10 min
p50     | 2.2ms         | 0.5–2.0ms
p99     | 4.1ms         | 5–15ms
B/entry | 132           | 1,500–2,000

HNSW is faster at p50 for approximate search. HX-SDP delivers exact results with 5–15x less memory. At 100M+ the latency gap closes as HNSW degrades and HX-SDP scales linearly.
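
The memory claim follows directly from the per-entry footprints in the comparison above; a quick back-of-the-envelope check (treating the quoted GB figures as GiB is an assumption):

    # Bytes per entry at 10M vectors, taken from the comparison table.
    hx_sdp_b_per_entry = 132
    hnsw_b_per_entry = (1_500, 2_000)

    # Index footprint implied at 10M entries (GiB).
    print(10e6 * hx_sdp_b_per_entry / 2**30)                   # ~1.23 GiB, matches the 1.3 GB figure
    print([10e6 * b / 2**30 for b in hnsw_b_per_entry])        # ~14.0 to 18.6 GiB, roughly the 15-20 GB range

    # Memory ratio behind the "5-15x less" claim.
    print([b / hx_sdp_b_per_entry for b in hnsw_b_per_entry])  # ~11.4x to 15.2x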

Durability

24-hour soak test

Continuous mixed workload. S-10M, H100 80GB.

8.6M inserts
2.9M queries
99.99992% success rate (9 errors)
86.3s cold restart (mean of 5)

VRAM drifted 17.2% over 24 hours (4.75 GB to 5.03 GB). Bounded, not zero. These are the real numbers.
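
The success rate is just the error count over the total operation count; a one-line check:

    inserts, queries, errors = 8_600_000, 2_900_000, 9
    success = 1 - errors / (inserts + queries)
    print(f"{success:.7%}")   # 99.9999217%, which rounds to the 99.99992% above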

Transparency

Honest tradeoffs

Insert throughput is batch-optimized. Online single-insert is ~885/s vs HNSW at 5–50K/s. Streaming ingest exists, but it's not yet a strength.
Concurrency has a ceiling. Single-client QPS is ~350 at 10M. At 64 concurrent clients, latency spikes to ~300ms p50 (see the measurement sketch after this list). Single-GPU constraint. Multi-GPU sharding is designed but not shipped.
9 errors out of 11.5M operations. VRAM drifted 17.2%. Bounded, not zero.
Build times scale linearly. 21s at 10M. 1,418s at 1B. Rebuilding a billion-entry index takes 24 minutes.
Self-hosted today. Docker with GPU overlay. Auth, rate limiting, Prometheus, Grafana, client SDK, AWS/GCP/Azure/K8s patterns included. Managed cloud on the roadmap.
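
For context on how the concurrency numbers above could be reproduced, here is a generic closed-loop load harness; the client object, search call, and parameters are placeholders, not the product's actual SDK or API.

    import time
    import statistics
    from concurrent.futures import ThreadPoolExecutor

    def measure(query_fn, clients=64, requests_per_client=200):
        """Closed-loop load test: each client issues queries back to back; report latency and QPS."""
        def worker(_):
            latencies = []
            for _ in range(requests_per_client):
                t0 = time.perf_counter()
                query_fn()                      # one search call against the server
                latencies.append(time.perf_counter() - t0)
            return latencies

        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=clients) as pool:
            latencies = sorted(l for result in pool.map(worker, range(clients)) for l in result)
        wall = time.perf_counter() - start

        p50 = statistics.median(latencies)
        p99 = latencies[int(0.99 * (len(latencies) - 1))]
        qps = len(latencies) / wall
        return p50, p99, qps

    # Usage against a hypothetical client object (placeholder, not the real SDK):
    # p50, p99, qps = measure(lambda: client.search(query_vec, k=10), clients=64)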

Economics

Flat cost. No per-query billing.

One GPU per client. The cost is the GPU. No namespace multipliers. No usage-based scaling. No per-query metering.

Every access pattern (search, analytics, inference, streaming) runs on the same representation in VRAM. One bill, one runtime.

Paid pilot available. Bring your embeddings.