Benchmarks

1 billion vectors. Single GPU. 100% exact recall.

Measured on an H100 80GB. No approximation. Every number cryptographically signed.

NVIDIA H100 80GB HBM3 · CUDA 12.8 · D=384, rank=32 · April 2026

1,000,000,000 vectors on a single GPU
38.5ms p50 latency
100% Recall@10
73 GB VRAM
46.5x reduction

ML-DSA-65 signed (FIPS 204). SHA-256 manifest for all artifacts.
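
As a sanity-check sketch, the SHA-256 half of that manifest could be verified like this; the manifest filename and the sha256sum-style line format are assumptions, and ML-DSA-65 signature verification is out of scope here.

    import hashlib
    from pathlib import Path

    def verify_manifest(manifest_path: str) -> bool:
        """Recompute SHA-256 for every artifact listed in the manifest and compare.

        Assumes sha256sum-style lines: '<hex digest>  <relative path>'.
        """
        root = Path(manifest_path).parent
        ok = True
        for line in Path(manifest_path).read_text().splitlines():
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            digest = hashlib.sha256((root / name).read_bytes()).hexdigest()
            if digest != expected:
                print(f"MISMATCH {name}: {digest} != {expected}")
                ok = False
        return ok

    if __name__ == "__main__":
        # "benchmarks.sha256" is a placeholder name for the published manifest.
        print("manifest OK" if verify_manifest("benchmarks.sha256") else "manifest FAILED")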

Scale ceiling

fp16 | Scale Tier

QTT compression converges as the dataset grows. More data, better recall.

Scale | Build  | Serving | VRAM    | p50    | p99    | R@10
400M  | 485s   | 26.4 GB | 29.7 GB | 15.8ms | 18.0ms | 96%
500M  | 699s   | 33.0 GB | 36.9 GB | 20.5ms | 24.4ms | 98%
600M  | 813s   | 39.6 GB | 44.1 GB | 23.4ms | 26.3ms | 99%
800M  | 1,122s | 52.8 GB | 58.5 GB | 31.0ms | 33.5ms | 99%
1B    | 1,418s | 66.0 GB | 72.9 GB | 38.5ms | 41.0ms | 100%

Why recall improves

More vectors provide richer structure for the tensor train to exploit. At 1B entries the compression reaches full convergence — a mathematical property, not a tuning parameter.
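
To make that intuition concrete, here is a toy numpy sketch of tensor-train compression (plain TT-SVD) applied to a synthetic embedding matrix with shared low-rank structure: with many vectors sharing structure, fixed-rank cores capture most of the data in far fewer parameters. The reshaping, the rank cap of 32, and the low-rank-plus-noise data model are illustrative assumptions for this sketch, not the product's actual QTT pipeline.

    import numpy as np

    def tt_svd(tensor, max_rank=32):
        """Sequential-SVD tensor-train decomposition (TT-SVD), ranks capped at max_rank."""
        dims = tensor.shape
        cores, r_prev = [], 1
        mat = tensor.reshape(dims[0], -1)
        for k in range(len(dims) - 1):
            u, s, vt = np.linalg.svd(mat, full_matrices=False)
            r = min(max_rank, len(s))
            cores.append(u[:, :r].reshape(r_prev, dims[k], r))
            mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
            r_prev = r
        cores.append(mat.reshape(r_prev, dims[-1], 1))
        return cores

    def tt_reconstruct(cores):
        """Contract the cores back into the full tensor."""
        out = cores[0]
        for core in cores[1:]:
            out = np.tensordot(out, core, axes=([-1], [0]))
        return out.reshape(tuple(c.shape[1] for c in cores))

    rng = np.random.default_rng(0)
    n, d, signal_rank = 4096, 384, 16
    # Synthetic embeddings: shared low-rank structure plus a small amount of noise.
    emb = rng.standard_normal((n, signal_rank)) @ rng.standard_normal((signal_rank, d))
    emb += 0.01 * rng.standard_normal((n, d))

    # View the 4096 x 384 matrix as an 8x8x8x8 x 4x4x4x6 tensor (the "quantized" reshaping idea).
    tensor = emb.reshape(8, 8, 8, 8, 4, 4, 4, 6)
    cores = tt_svd(tensor, max_rank=32)

    params = sum(c.size for c in cores)
    rel_err = np.linalg.norm(tt_reconstruct(cores).reshape(n, d) - emb) / np.linalg.norm(emb)
    print(f"{emb.size / params:.1f}x fewer parameters, relative error {rel_err:.4f}")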

Production

fp32 | 100% recall at every tested scale

Scale | Build | Serving | VRAM    | p50    | p99    | R@10
200M  | 354s  | 25.8 GB | 28.1 GB | 34.9ms | 39.7ms | 100%
300M  | 347s  | 39.6 GB | 44.2 GB | 46.3ms | 47.5ms | 100%
400M  | 495s  | 52.8 GB | 58.6 GB | 61.5ms | 62.6ms | 100%
500M  | 622s  | 66.0 GB | 73.0 GB | 76.5ms | 77.7ms | 100%

Exact

fp64 | Full double-precision arithmetic

Scale | Build | p50     | p99     | R@10 | VRAM    | B/entry
1K    | 0.0s  | 0.70ms  | 1.97ms  | 100% | 0 MB    | 420
100K  | 0.2s  | 0.86ms  | 1.45ms  | 100% | 26 MB   | 278
1M    | 2.1s  | 1.18ms  | 5.10ms  | 100% | 264 MB  | 277
10M   | 24.2s | 2.68ms  | 4.66ms  | 100% | 2.6 GB  | 277
50M   | 164s  | 9.37ms  | 10.88ms | 100% | 12.9 GB | 277
100M  | 104s  | 20.40ms | 21.50ms | 100% | 25.8 GB | 277
200M  | 212s  | 40.40ms | 41.80ms | 100% | 50.4 GB | 271

Comparison

vs HNSW at 10M vectors

Milvus / Qdrant / Weaviate. D=384.

Metric  | HX-SDP (fp32) | HNSW (fp32)
Index   | 1.3 GB        | 15–20 GB
VRAM    | 3.0 GB        | 18–24 GB
Recall  | 100% exact    | 95–99% approx
Build   | 21s           | 2–10 min
p50     | 2.2ms         | 0.5–2.0ms
p99     | 4.1ms         | 5–15ms
B/entry | 132           | 1,500–2,000

HNSW is faster at p50 for approximate search. HX-SDP delivers exact results with 5–15x less memory. At 100M+ the latency gap closes as HNSW degrades and HX-SDP scales linearly.
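
The memory claim follows directly from the per-entry footprints in the comparison above; a quick back-of-the-envelope check (treating the quoted GB figures as GiB is an assumption):

    # Bytes per entry at 10M vectors, taken from the comparison table.
    hx_sdp_b_per_entry = 132
    hnsw_b_per_entry = (1_500, 2_000)

    # Index footprint implied at 10M entries (GiB).
    print(10e6 * hx_sdp_b_per_entry / 2**30)                   # ~1.23 GiB, matches the 1.3 GB figure
    print([10e6 * b / 2**30 for b in hnsw_b_per_entry])        # ~14.0 to 18.6 GiB, roughly the 15-20 GB range

    # Memory ratio behind the "5-15x less" claim.
    print([b / hx_sdp_b_per_entry for b in hnsw_b_per_entry])  # ~11.4x to 15.2x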

Durability

24-hour soak test

Continuous mixed workload. S-10M, H100 80GB.

8.6M inserts
2.9M queries
99.99992% success rate (9 errors)
86.3s cold restart (mean of 5)

VRAM drifted 17.2% over 24 hours (4.75 GB to 5.03 GB). Bounded, not zero. These are the real numbers.
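
The success rate is just the error count over the total operation count; a one-line check:

    inserts, queries, errors = 8_600_000, 2_900_000, 9
    success = 1 - errors / (inserts + queries)
    print(f"{success:.7%}")   # 99.9999217%, which rounds to the 99.99992% above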

Transparency

Honest tradeoffs

Insert throughput is batch-optimized. Online single-insert is ~885/s vs HNSW at 5–50K/s. Streaming ingest exists, but it's not yet a strength.
Concurrency has a ceiling. Single-client QPS is ~350 at 10M. At 64 concurrent clients, latency spikes to ~300ms p50 (see the measurement sketch after this list). Single-GPU constraint. Multi-GPU sharding is designed but not shipped.
9 errors out of 11.5M operations. VRAM drifted 17.2%. Bounded, not zero.
Build times scale linearly. 21s at 10M. 1,418s at 1B. Rebuilding a billion-entry index takes 24 minutes.
Self-hosted today. Docker with GPU overlay. Auth, rate limiting, Prometheus, Grafana, client SDK, AWS/GCP/Azure/K8s patterns included. Managed cloud on the roadmap.
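
For context on how the concurrency numbers above could be reproduced, here is a generic closed-loop load harness; the client object, search call, and parameters are placeholders, not the product's actual SDK or API.

    import time
    import statistics
    from concurrent.futures import ThreadPoolExecutor

    def measure(query_fn, clients=64, requests_per_client=200):
        """Closed-loop load test: each client issues queries back to back; report latency and QPS."""
        def worker(_):
            latencies = []
            for _ in range(requests_per_client):
                t0 = time.perf_counter()
                query_fn()                      # one search call against the server
                latencies.append(time.perf_counter() - t0)
            return latencies

        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=clients) as pool:
            latencies = sorted(l for result in pool.map(worker, range(clients)) for l in result)
        wall = time.perf_counter() - start

        p50 = statistics.median(latencies)
        p99 = latencies[int(0.99 * (len(latencies) - 1))]
        qps = len(latencies) / wall
        return p50, p99, qps

    # Usage against a hypothetical client object (placeholder, not the real SDK):
    # p50, p99, qps = measure(lambda: client.search(query_vec, k=10), clients=64)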

Economics

Flat cost. No per-query billing.

One GPU per client. The cost is the GPU. No namespace multipliers. No usage-based scaling. No per-query metering.

Every access pattern (search, analytics, inference, streaming) runs on the same representation in VRAM. One bill, one runtime.

Paid pilot available. Bring your embeddings.