Benchmarking quantum computers: metrics that matter
Qubit count alone doesn't tell the story. Here's how to evaluate quantum hardware claims.
“We have 1,000 qubits!” sounds impressive. But is it?
Benchmarking quantum computers is harder than classical systems because the relevant metrics are multidimensional and context-dependent.
Here’s a framework for evaluating hardware claims.
The metrics that matter
1) Qubit count (necessary but not sufficient)
What it is: Total number of physical qubits.
Why it matters: Algorithms scale with qubit count. Shor’s algorithm to factor a 2048-bit RSA key needs thousands of logical qubits.
Why it’s not enough:
- 1,000 qubits with 90% fidelity are worse than 10 qubits with 99.9% fidelity
- Error correction overhead means 1,000 physical qubits might yield only a few logical qubits
Useful follow-up: How many are actually usable? What’s the connectivity?
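The error correction overhead can be made concrete with a back-of-envelope sketch. Assuming a surface code, one logical qubit costs roughly 2d² physical qubits at code distance d (data plus ancilla); the numbers and the `logical_qubit_estimate` helper below are illustrative, not a vendor spec:

```python
# Error-correction overhead sketch: a distance-d surface code uses roughly
# 2 * d**2 physical qubits per logical qubit (data + ancilla).

def logical_qubit_estimate(physical_qubits, code_distance):
    return physical_qubits // (2 * code_distance ** 2)

print(logical_qubit_estimate(1000, 15))  # 1,000 physical -> 2 logical at d = 15
```

This is why a headline "1,000 qubits" can mean only a handful of error-corrected ones.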
2) Gate fidelity (quality over quantity)
What it is: Probability that a gate operation does what you intended.
Example:
- Single-qubit gate: 99.9% fidelity = 0.1% error per gate
- Two-qubit gate: 99.5% fidelity = 0.5% error per gate
Why it matters: Errors compound with circuit depth. A 100-layer circuit with 99% fidelity per layer has a (0.99^{100} \approx 37%) success probability.
Threshold for fault tolerance: Surface-code thresholds sit around 99% two-qubit gate fidelity, and in practice you want ~99.9% to keep the overhead manageable. Below threshold, error correction makes things worse.
Useful follow-up: What’s the distribution of fidelities across qubits? (Worst-case often matters more than average.)
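The compounding effect is easy to model. As a sketch, assuming independent gate errors (a first-order approximation), circuit success probability is just the product of per-gate fidelities; the gate counts below are hypothetical:

```python
# Estimate circuit success probability as the product of per-gate
# fidelities, assuming independent gate errors (a first-order model).

def success_probability(n_1q, n_2q, f_1q=0.999, f_2q=0.995):
    """Probability that every gate in the circuit succeeds."""
    return (f_1q ** n_1q) * (f_2q ** n_2q)

# 50 single-qubit and 100 two-qubit gates at the default fidelities:
print(f"{success_probability(50, 100):.2f}")
```

Notice how the two-qubit gates dominate the error budget, which is why two-qubit fidelity is the number to scrutinize.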
3) Coherence time (how long before noise wins)
What it is: Time scale over which quantum information degrades.
Common metrics:
- (T_1) (relaxation): How long before (|1\rangle) decays to (|0\rangle)
- (T_2) (dephasing): How long before relative phase is lost
Why it matters: Longer circuits need longer coherence. If your algorithm takes 100 µs but (T_2 = 50) µs, you’re in trouble.
Platform differences:
- Superconducting: ~50-200 µs
- Trapped ion: seconds to minutes
- Neutral atoms: ~1-10 seconds
Useful follow-up: What’s the gate time relative to coherence time? (You want (T_2 / t_{\text{gate}} \gg 1).)
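A rough coherence budget follows directly from that ratio: how many sequential gates fit within (T_2)? The gate times below are illustrative order-of-magnitude assumptions, not measured values:

```python
# Rough coherence budget: how many sequential gates fit within T2?
# Gate times are illustrative order-of-magnitude assumptions.

def gate_budget(t2_us, gate_time_us):
    return round(t2_us / gate_time_us)

print(gate_budget(100, 0.2))        # superconducting: T2 ~ 100 us, ~0.2 us gates
print(gate_budget(1_000_000, 100))  # trapped ion: T2 ~ 1 s, ~100 us gates
```

Fast gates can make up for short coherence, which is why the ratio matters more than either number alone.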
4) Connectivity (topology matters)
What it is: Which qubits can directly interact.
Examples:
- Heavy-hex (IBM): Each qubit connects to 2-3 neighbors in a hexagonal lattice
- Linear chain (some trapped ions): Qubit (i) connects to (i \pm 1)
- All-to-all (some trapped ions): Any qubit can interact with any other
Why it matters: If your algorithm needs qubit 1 to interact with qubit 50, and they’re not neighbors, you need SWAP gates to move data. Each SWAP adds depth and error.
Useful follow-up: What’s the effective depth after compilation? (Logical depth vs physical depth.)
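For a linear chain, the SWAP overhead is easy to count: bringing qubits (i) and (j) adjacent takes (|i - j| - 1) SWAPs, and each SWAP decomposes into 3 CNOTs. A minimal sketch:

```python
# SWAP overhead on a linear chain: making qubits i and j adjacent takes
# |i - j| - 1 SWAPs, and each SWAP decomposes into 3 CNOTs.

def swap_overhead(i, j):
    swaps = max(abs(i - j) - 1, 0)
    return swaps, 3 * swaps  # (SWAP count, extra two-qubit gates)

print(swap_overhead(1, 50))  # qubit 1 talking to qubit 50
```

Forty-eight SWAPs means 144 extra two-qubit gates, each carrying its own error, before the interaction you actually wanted.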
5) Readout fidelity (garbage in, garbage out)
What it is: Probability that measurement correctly identifies the qubit state.
Typical values: 95-99.5% (worse than gate fidelity on many platforms).
Why it matters: Even if your circuit is perfect, bad readout gives you wrong answers.
Mitigation: Readout error mitigation (REM) can help, but adds classical overhead.
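The simplest form of REM inverts a measured confusion matrix. A one-qubit sketch, where the readout fidelities and observed frequencies are illustrative assumptions, not from any specific device:

```python
# Minimal one-qubit readout error mitigation: build the 2x2 confusion
# matrix from calibration runs and apply its inverse to raw counts.
# All numbers here are illustrative assumptions.

p0_given_0 = 0.98  # P(read 0 | prepared |0>)
p1_given_1 = 0.95  # P(read 1 | prepared |1>)

# Confusion matrix M[r][s] = P(read r | true state s)
M = [[p0_given_0, 1 - p1_given_1],
     [1 - p0_given_0, p1_given_1]]

det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
M_inv = [[M[1][1] / det, -M[0][1] / det],
         [-M[1][0] / det, M[0][0] / det]]

raw = [0.60, 0.40]  # observed outcome frequencies
mitigated = [M_inv[0][0] * raw[0] + M_inv[0][1] * raw[1],
             M_inv[1][0] * raw[0] + M_inv[1][1] * raw[1]]
print(mitigated)
```

The classical overhead is the catch: the confusion matrix for (n) qubits is (2^n \times 2^n), so naive inversion stops scaling quickly.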
6) Quantum volume (IBM’s composite metric)
What it is: A single number that combines qubit count, gate fidelity, connectivity, and circuit depth.
Formula (simplified):
- Run square circuits ((n) qubits, depth (n)) with random gates
- Quantum volume = (2^n) for the largest width (n) at which the heavy output generation (HOG) test passes (heavy outputs observed more than 2/3 of the time)
Pros:
- One number for marketing
- Captures some tradeoffs (more qubits but worse fidelity → lower QV)
Cons:
- Doesn’t reflect specific algorithm performance
- Can be gamed (optimize for the benchmark, not real workloads)
- Doesn’t account for error correction overhead
Useful follow-up: What’s the QV trend over time? (Doubling QV each year is a common goal.)
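The scoring rule reduces to a few lines. A sketch, noting that the official protocol also requires statistical confidence bounds, and that the pass rates below are hypothetical:

```python
# Quantum volume sketch: QV = 2^n for the largest square-circuit width n
# whose heavy-output fraction exceeds 2/3. The official protocol also
# demands statistical confidence; these pass rates are hypothetical.

def quantum_volume(heavy_output_fraction):
    """heavy_output_fraction: {circuit width n: observed heavy-output rate}."""
    passing = [n for n, rate in heavy_output_fraction.items() if rate > 2 / 3]
    return 2 ** max(passing) if passing else 1

print(quantum_volume({2: 0.84, 3: 0.78, 4: 0.71, 5: 0.62}))  # 2^4 = 16
```

Note the exponential scale: QV 16 vs QV 32 is one extra usable qubit-layer, not a 2x better machine.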
7) Circuit depth (how deep before noise kills you)
What it is: Number of sequential gate layers.
Why it matters: Deeper circuits accumulate more errors. The “NISQ cliff” is when depth × error rate becomes too large.
Practical limits (NISQ era):
- Superconducting: ~100-1000 gates before noise dominates
- Trapped ion: ~10,000-100,000 gates (better fidelity + coherence)
Useful follow-up: What’s the effective depth after error mitigation?
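The cliff can be estimated directly: the depth at which success probability halves, given a per-layer fidelity. The fidelities below are illustrative stand-ins for the two platforms:

```python
from math import log

# "NISQ cliff" estimate: the depth d at which per-layer fidelity f
# gives f**d = 1/2, i.e. d = ln(0.5) / ln(f).

def cliff_depth(layer_fidelity):
    return int(log(0.5) / log(layer_fidelity))

print(cliff_depth(0.99))   # superconducting-like layer fidelity
print(cliff_depth(0.999))  # trapped-ion-like layer fidelity
```

A 10x improvement in per-layer error rate buys roughly 10x more usable depth, which is why fidelity gains compound so strongly.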
Composite benchmarks to watch
Algorithm-specific benchmarks
Instead of abstract metrics, run actual algorithms:
- VQE for H₂ molecule: Can you get within chemical accuracy?
- QAOA for MaxCut: What approximation ratio do you achieve?
- Grover search: How many iterations before decoherence?
Why it’s better: This is what users care about. Qubit count alone doesn’t tell you if VQE will work.
Randomized benchmarking (RB)
What it is: Run random gate sequences, measure average fidelity.
Pros:
- Robust to state preparation and measurement errors
- Gives average gate error rate
Cons:
- Averages can hide worst-case qubits
- Doesn’t reflect specific circuit structures
Cross-entropy benchmarking (XEB)
What it is: A random-circuit sampling benchmark, used by Google for its 2019 “quantum supremacy” claim.
How it works:
- Run random circuits
- Measure output distribution
- Compare to classical simulation (cross-entropy between measured and ideal)
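The linear XEB estimator itself is simple: (F = 2^n \langle p_{\text{ideal}}(x_i) \rangle - 1), averaged over measured bitstrings (x_i), with (p_{\text{ideal}}) from classical simulation. A sketch with fabricated sample probabilities:

```python
# Linear XEB fidelity estimator: F = 2^n * mean(p_ideal(x_i)) - 1, where
# x_i are measured bitstrings and p_ideal comes from classical simulation.
# The sample probabilities below are fabricated to illustrate the formula.

def linear_xeb_fidelity(n_qubits, ideal_probs):
    mean_p = sum(ideal_probs) / len(ideal_probs)
    return 2 ** n_qubits * mean_p - 1

# A uniform (fully decohered) sampler scores 0; an ideal device scores ~1.
print(linear_xeb_fidelity(3, [1 / 8] * 100))  # 0.0
```

The expensive part is not this formula but computing (p_{\text{ideal}}) classically, which is exactly what caps the circuit sizes XEB can certify.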
Pros:
- Tests full-system performance
- Hard to fake
Cons:
- Doesn’t correspond to useful algorithms
- Classical simulation limits constrain problem size
Red flags when evaluating claims
❌ “We have X qubits” (no other details)
Ask:
- Gate fidelities?
- Connectivity?
- Coherence times?
❌ “Achieved quantum advantage” (on a synthetic benchmark)
Ask:
- What problem?
- Can classical algorithms improve?
- Is the problem useful?
❌ “Ready for practical applications” (NISQ era)
Ask:
- What specific application?
- What’s the error rate vs algorithm requirement?
- How does it compare to classical state-of-the-art?
❌ “Fault-tolerant quantum computer” (without logical qubit demos)
Ask:
- How many logical qubits at what error rate?
- What’s the physical-to-logical overhead?
- Can you run error correction cycles continuously?
The checklist
When you see a quantum computing announcement, ask:
- Qubit count: How many?
- Gate fidelity: Single-qubit? Two-qubit?
- Coherence: (T_1) and (T_2)?
- Connectivity: Nearest-neighbor or all-to-all?
- Readout fidelity: What percentage?
- Circuit depth: How deep can you go?
- Algorithm performance: Any real workload benchmarks?
- Error correction: Physical or logical qubits?
If the press release gives you only qubit count, be skeptical.
What “good enough” looks like (for different goals)
NISQ experiments:
- 50-100 qubits
- 99%+ gate fidelity
- Depth ~100-1000
- Example: VQE for small molecules, QAOA for toy optimization
Pre-fault-tolerance demonstrations:
- 100-1000 qubits
- 99.5%+ gate fidelity
- Depth ~1000-10,000
- Example: Error detection codes, small logical qubits
Fault-tolerant quantum computing:
- 10,000+ physical qubits → 50-100 logical qubits
- 99.9%+ gate fidelity (physical)
- Logical error rate < 10⁻⁶ per gate
- Continuous error correction cycles
- Example: Shor’s algorithm, large-scale chemistry simulation
Takeaway
Qubit count is just one dimension. Quality × quantity × connectivity × coherence all matter.
The best metric is: Can it run the algorithm you care about with acceptable error rates?
Everything else is a proxy.