Benchmarking quantum computers: metrics that matter
Qubit count alone doesn't tell the story. Here's how to evaluate quantum hardware claims.
“We have 1,000 qubits!” sounds impressive. But is it?
Benchmarking quantum computers is harder than classical systems because the relevant metrics are multidimensional and context-dependent.
Here’s a framework for evaluating hardware claims.
The metrics that matter
1) Qubit count (necessary but not sufficient)
What it is: Total number of physical qubits.
Why it matters: Algorithms scale with qubit count. Shor’s algorithm to factor a 2048-bit RSA key needs thousands of logical qubits.
Why it’s not enough:
- 1,000 qubits with 90% fidelity are worse than 10 qubits with 99.9% fidelity
- Error correction overhead means 1,000 physical qubits might yield only a few logical qubits
Useful follow-up: How many are actually usable? What’s the connectivity?
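The error correction overhead can be made concrete with a back-of-envelope sketch. Assuming a surface code, one logical qubit costs roughly 2d² physical qubits at code distance d (data plus ancilla); the numbers and the `logical_qubit_estimate` helper below are illustrative, not a vendor spec:

```python
# Error-correction overhead sketch: a distance-d surface code uses roughly
# 2 * d**2 physical qubits per logical qubit (data + ancilla).

def logical_qubit_estimate(physical_qubits, code_distance):
    return physical_qubits // (2 * code_distance ** 2)

print(logical_qubit_estimate(1000, 15))  # 1,000 physical -> 2 logical at d = 15
```

This is why a headline "1,000 qubits" can mean only a handful of error-corrected ones.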
2) Gate fidelity (quality over quantity)
What it is: Probability that a gate operation does what you intended.
Example:
- Single-qubit gate: 99.9% fidelity = 0.1% error per gate
- Two-qubit gate: 99.5% fidelity = 0.5% error per gate
Why it matters: Errors compound with circuit depth. A 100-layer circuit with 99% fidelity per layer has a (0.99^{100} \approx 37%) success probability.
Threshold for fault tolerance: Surface-code thresholds sit around 99% two-qubit gate fidelity, and in practice you want ~99.9% to keep the overhead manageable. Below threshold, error correction makes things worse.
Useful follow-up: What’s the distribution of fidelities across qubits? (Worst-case often matters more than average.)
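The compounding effect is easy to model. As a sketch, assuming independent gate errors (a first-order approximation), circuit success probability is just the product of per-gate fidelities; the gate counts below are hypothetical:

```python
# Estimate circuit success probability as the product of per-gate
# fidelities, assuming independent gate errors (a first-order model).

def success_probability(n_1q, n_2q, f_1q=0.999, f_2q=0.995):
    """Probability that every gate in the circuit succeeds."""
    return (f_1q ** n_1q) * (f_2q ** n_2q)

# 50 single-qubit and 100 two-qubit gates at the default fidelities:
print(f"{success_probability(50, 100):.2f}")
```

Notice how the two-qubit gates dominate the error budget, which is why two-qubit fidelity is the number to scrutinize.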
3) Coherence time (how long before noise wins)
What it is: Time scale over which quantum information degrades.
Common metrics:
- (T_1) (relaxation): How long before (|1\rangle) decays to (|0\rangle)
- (T_2) (dephasing): How long before relative phase is lost
Why it matters: Longer circuits need longer coherence. If your algorithm takes 100 µs but (T_2 = 50) µs, you’re in trouble.
Platform differences:
- Superconducting: ~50-200 µs
- Trapped ion: seconds to minutes
- Neutral atoms: ~1-10 seconds
Useful follow-up: What’s the gate time relative to coherence time? (You want (T_2 / t_{\text{gate}} \gg 1).)
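A rough coherence budget follows directly from that ratio: how many sequential gates fit within (T_2)? The gate times below are illustrative order-of-magnitude assumptions, not measured values:

```python
# Rough coherence budget: how many sequential gates fit within T2?
# Gate times are illustrative order-of-magnitude assumptions.

def gate_budget(t2_us, gate_time_us):
    return round(t2_us / gate_time_us)

print(gate_budget(100, 0.2))        # superconducting: T2 ~ 100 us, ~0.2 us gates
print(gate_budget(1_000_000, 100))  # trapped ion: T2 ~ 1 s, ~100 us gates
```

Fast gates can make up for short coherence, which is why the ratio matters more than either number alone.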
4) Connectivity (topology matters)
What it is: Which qubits can directly interact.
Examples:
- Heavy-hex (IBM): Each qubit connects to 2-3 neighbors in a hexagonal lattice
- Linear chain (some trapped ions): Qubit (i) connects to (i \pm 1)
- All-to-all (some trapped ions): Any qubit can interact with any other
Why it matters: If your algorithm needs qubit 1 to interact with qubit 50, and they’re not neighbors, you need SWAP gates to move data. Each SWAP adds depth and error.
Useful follow-up: What’s the effective depth after compilation? (Logical depth vs physical depth.)
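For a linear chain, the SWAP overhead is easy to count: bringing qubits (i) and (j) adjacent takes (|i - j| - 1) SWAPs, and each SWAP decomposes into 3 CNOTs. A minimal sketch:

```python
# SWAP overhead on a linear chain: making qubits i and j adjacent takes
# |i - j| - 1 SWAPs, and each SWAP decomposes into 3 CNOTs.

def swap_overhead(i, j):
    swaps = max(abs(i - j) - 1, 0)
    return swaps, 3 * swaps  # (SWAP count, extra two-qubit gates)

print(swap_overhead(1, 50))  # qubit 1 talking to qubit 50
```

Forty-eight SWAPs means 144 extra two-qubit gates, each carrying its own error, before the interaction you actually wanted.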
5) Readout fidelity (garbage in, garbage out)
What it is: Probability that measurement correctly identifies the qubit state.
Typical values: 95-99.5% (worse than gate fidelity on many platforms).
Why it matters: Even if your circuit is perfect, bad readout gives you wrong answers.
Mitigation: Readout error mitigation (REM) can help, but adds classical overhead.
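The simplest form of REM inverts a measured confusion matrix. A one-qubit sketch, where the readout fidelities and observed frequencies are illustrative assumptions, not from any specific device:

```python
# Minimal one-qubit readout error mitigation: build the 2x2 confusion
# matrix from calibration runs and apply its inverse to raw counts.
# All numbers here are illustrative assumptions.

p0_given_0 = 0.98  # P(read 0 | prepared |0>)
p1_given_1 = 0.95  # P(read 1 | prepared |1>)

# Confusion matrix M[r][s] = P(read r | true state s)
M = [[p0_given_0, 1 - p1_given_1],
     [1 - p0_given_0, p1_given_1]]

det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
M_inv = [[M[1][1] / det, -M[0][1] / det],
         [-M[1][0] / det, M[0][0] / det]]

raw = [0.60, 0.40]  # observed outcome frequencies
mitigated = [M_inv[0][0] * raw[0] + M_inv[0][1] * raw[1],
             M_inv[1][0] * raw[0] + M_inv[1][1] * raw[1]]
print(mitigated)
```

The classical overhead is the catch: the confusion matrix for (n) qubits is (2^n \times 2^n), so naive inversion stops scaling quickly.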
6) Quantum volume (IBM’s composite metric)
What it is: A single number that combines qubit count, gate fidelity, connectivity, and circuit depth.
Formula (simplified):
- Run square circuits ((n) qubits, depth (n)) with random gates
- Quantum volume = (2^n) for the largest width (n) at which the heavy output generation (HOG) test passes (heavy outputs observed more than 2/3 of the time)
Pros:
- One number for marketing
- Captures some tradeoffs (more qubits but worse fidelity → lower QV)
Cons:
- Doesn’t reflect specific algorithm performance
- Can be gamed (optimize for the benchmark, not real workloads)
- Doesn’t account for error correction overhead
Useful follow-up: What’s the QV trend over time? (Doubling QV each year is a common goal.)
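The scoring rule reduces to a few lines. A sketch, noting that the official protocol also requires statistical confidence bounds, and that the pass rates below are hypothetical:

```python
# Quantum volume sketch: QV = 2^n for the largest square-circuit width n
# whose heavy-output fraction exceeds 2/3. The official protocol also
# demands statistical confidence; these pass rates are hypothetical.

def quantum_volume(heavy_output_fraction):
    """heavy_output_fraction: {circuit width n: observed heavy-output rate}."""
    passing = [n for n, rate in heavy_output_fraction.items() if rate > 2 / 3]
    return 2 ** max(passing) if passing else 1

print(quantum_volume({2: 0.84, 3: 0.78, 4: 0.71, 5: 0.62}))  # 2^4 = 16
```

Note the exponential scale: QV 16 vs QV 32 is one extra usable qubit-layer, not a 2x better machine.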
7) Circuit depth (how deep before noise kills you)
What it is: Number of sequential gate layers.
Why it matters: Deeper circuits accumulate more errors. The “NISQ cliff” is when depth × error rate becomes too large.
Practical limits (NISQ era):
- Superconducting: ~100-1000 gates before noise dominates
- Trapped ion: ~10,000-100,000 gates (better fidelity + coherence)
Useful follow-up: What’s the effective depth after error mitigation?
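The cliff can be estimated directly: the depth at which success probability halves, given a per-layer fidelity. The fidelities below are illustrative stand-ins for the two platforms:

```python
from math import log

# "NISQ cliff" estimate: the depth d at which per-layer fidelity f
# gives f**d = 1/2, i.e. d = ln(0.5) / ln(f).

def cliff_depth(layer_fidelity):
    return int(log(0.5) / log(layer_fidelity))

print(cliff_depth(0.99))   # superconducting-like layer fidelity
print(cliff_depth(0.999))  # trapped-ion-like layer fidelity
```

A 10x improvement in per-layer error rate buys roughly 10x more usable depth, which is why fidelity gains compound so strongly.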
Composite benchmarks to watch
Algorithm-specific benchmarks
Instead of abstract metrics, run actual algorithms:
- VQE for H₂ molecule: Can you get within chemical accuracy?
- QAOA for MaxCut: What approximation ratio do you achieve?
- Grover search: How many iterations before decoherence?
Why it’s better: This is what users care about. Qubit count alone doesn’t tell you if VQE will work.
Randomized benchmarking (RB)
What it is: Run random gate sequences, measure average fidelity.
Pros:
- Robust to state preparation and measurement errors
- Gives average gate error rate
Cons:
- Averages can hide worst-case qubits
- Doesn’t reflect specific circuit structures
Cross-entropy benchmarking (XEB)
What it is: A random-circuit sampling benchmark, used by Google for its 2019 “quantum supremacy” claim.
How it works:
- Run random circuits
- Measure output distribution
- Compare to classical simulation (cross-entropy between measured and ideal)
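The linear XEB estimator itself is simple: (F = 2^n \langle p_{\text{ideal}}(x_i) \rangle - 1), averaged over measured bitstrings (x_i), with (p_{\text{ideal}}) from classical simulation. A sketch with fabricated sample probabilities:

```python
# Linear XEB fidelity estimator: F = 2^n * mean(p_ideal(x_i)) - 1, where
# x_i are measured bitstrings and p_ideal comes from classical simulation.
# The sample probabilities below are fabricated to illustrate the formula.

def linear_xeb_fidelity(n_qubits, ideal_probs):
    mean_p = sum(ideal_probs) / len(ideal_probs)
    return 2 ** n_qubits * mean_p - 1

# A uniform (fully decohered) sampler scores 0; an ideal device scores ~1.
print(linear_xeb_fidelity(3, [1 / 8] * 100))  # 0.0
```

The expensive part is not this formula but computing (p_{\text{ideal}}) classically, which is exactly what caps the circuit sizes XEB can certify.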
Pros:
- Tests full-system performance
- Hard to fake
Cons:
- Doesn’t correspond to useful algorithms
- Classical simulation limits constrain problem size
Red flags when evaluating claims
❌ “We have X qubits” (no other details)
Ask:
- Gate fidelities?
- Connectivity?
- Coherence times?
❌ “Achieved quantum advantage” (on a synthetic benchmark)
Ask:
- What problem?
- Can classical algorithms improve?
- Is the problem useful?
❌ “Ready for practical applications” (NISQ era)
Ask:
- What specific application?
- What’s the error rate vs algorithm requirement?
- How does it compare to classical state-of-the-art?
❌ “Fault-tolerant quantum computer” (without logical qubit demos)
Ask:
- How many logical qubits at what error rate?
- What’s the physical-to-logical overhead?
- Can you run error correction cycles continuously?
The checklist
When you see a quantum computing announcement, ask:
- Qubit count: How many?
- Gate fidelity: Single-qubit? Two-qubit?
- Coherence: (T_1) and (T_2)?
- Connectivity: Nearest-neighbor or all-to-all?
- Readout fidelity: What percentage?
- Circuit depth: How deep can you go?
- Algorithm performance: Any real workload benchmarks?
- Error correction: Physical or logical qubits?
If the press release gives you only qubit count, be skeptical.
What “good enough” looks like (for different goals)
NISQ experiments:
- 50-100 qubits
- 99%+ gate fidelity
- Depth ~100-1000
- Example: VQE for small molecules, QAOA for toy optimization
Pre-fault-tolerance demonstrations:
- 100-1000 qubits
- 99.5%+ gate fidelity
- Depth ~1000-10,000
- Example: Error detection codes, small logical qubits
Fault-tolerant quantum computing:
- 10,000+ physical qubits → 50-100 logical qubits
- 99.9%+ gate fidelity (physical)
- Logical error rate < 10⁻⁶ per gate
- Continuous error correction cycles
- Example: Shor’s algorithm, large-scale chemistry simulation
Takeaway
Qubit count is just one dimension. Quality × quantity × connectivity × coherence all matter.
The best metric is: Can it run the algorithm you care about with acceptable error rates?
Everything else is a proxy.