Logical Qubit Benchmarking: What to Measure and Why

Measuring the logical error rate of a surface-code logical qubit is more subtle than measuring physical gate fidelity. We describe the benchmark protocol QECSync uses, how to isolate decoder contribution from hardware contribution, and what numbers to report.

[Figure: logical qubit benchmarking methodology, showing error rate isolation]

Demonstrating a logical qubit is easy. Demonstrating a logical qubit with a measured, reproducible logical error rate — and correctly interpreting what that number means — is not.

The quantum computing field has accumulated a set of common benchmark interpretation errors that inflate apparent performance: conflating logical error rate per syndrome cycle with logical error rate per gate, comparing numbers from different code distances without stating the distance, and attributing performance to the decoder when the limiting factor is hardware fidelity, or vice versa. This article describes the benchmark protocol QECSync uses and explains why each design choice matters for producing honest, reproducible numbers.

The primary metric: logical error rate per syndrome cycle

The most fundamental measurable quantity for a surface-code logical qubit is the logical error rate per syndrome cycle: the probability that the logical qubit's state is incorrect after one cycle of stabilizer measurement and classical decoding, where a cycle is a block of d syndrome rounds decoded jointly.

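This is measured by the idle logical qubit experiment: prepare a logical |0⟩ or |+⟩ state, apply N syndrome cycles, apply the decoder's corrections, measure the logical state, and count the fraction of trials in which the measured state disagrees with the prepared state. When that fraction is small (well below 1/2), dividing it by N gives the logical error rate per cycle. At larger fractions, errors in different cycles can cancel each other, and the simple division underestimates the per-cycle rate.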
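The division step can be sketched in a few lines. This is a minimal illustration, not part of the QECSync SDK: the function name and arguments are hypothetical, and the "corrected" estimate assumes each cycle flips the logical state independently with the same probability, so that an even number of flips cancels.

```python
def per_cycle_error_rate(failures: int, trials: int, n_cycles: int) -> tuple[float, float]:
    """Return (linear, corrected) estimates of the logical error rate per cycle.

    linear    : failure fraction divided by N (valid when the fraction is small)
    corrected : inverts P = (1 - (1 - 2*eps)**N) / 2, which assumes independent,
                identically distributed logical flips in each cycle
    """
    if trials <= 0 or n_cycles <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0, n_cycles > 0, 0 <= failures <= trials")
    p_total = failures / trials
    if p_total >= 0.5:
        # At 1/2 the logical state is fully randomized; the rate is not identifiable.
        raise ValueError("failure fraction must be below 0.5")
    linear = p_total / n_cycles
    corrected = 0.5 * (1.0 - (1.0 - 2.0 * p_total) ** (1.0 / n_cycles))
    return linear, corrected


if __name__ == "__main__":
    # Illustrative counts only, not measured data.
    print(per_cycle_error_rate(failures=100, trials=10_000, n_cycles=10))
```

The two estimates agree closely when few logical errors accumulate over the N cycles. A visible gap between them is a sign that N is large enough for cancellations to matter.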

Several subtleties:

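  • Use N and a trial count large enough that the error floor is detectable. The expected number of observed logical errors is roughly trials × N × (error rate per cycle), and the relative statistical uncertainty shrinks only as one over the square root of that count. At a logical error rate of 10⁻⁴ per cycle, 10,000 trials × several cycles yields only a handful of logical errors, which is the bare minimum for detecting them at all. At 10⁻⁶, you need millions of trials. Plan your experiment around the error rate you expect to measure, not a round number of shots.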
  • Separate cycle count from syndrome round count. Each "cycle" in the idle logical qubit experiment consists of d syndrome rounds (the standard choice), not 1. The decoder processes all d rounds jointly — this is the temporal redundancy that handles measurement errors. The reported error rate per cycle should be per d-round block, not per single syndrome round.
  • Account for preparation and measurement errors. State preparation errors (imperfect logical |0⟩ initialization) and logical measurement errors (misidentifying the final logical state) contribute to the measured error rate independently of the syndrome cycle error rate. These can be calibrated independently and subtracted, or the benchmark can be made insensitive to SPAM errors in the same way physical randomized benchmarking is: repeat the experiment at several values of N and fit the decay, so that SPAM errors appear as a constant offset rather than in the per-cycle rate.
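The shot-planning advice in the first point above can be made concrete with a rough estimate. The sketch below assumes simple binomial statistics and the small-rate approximation (total failure probability ≈ N × rate per cycle). The function name and the default 10% target are illustrative choices, not QECSync defaults.

```python
import math


def trials_needed(expected_rate_per_cycle: float, n_cycles: int,
                  target_rel_error: float = 0.1) -> int:
    """Estimate how many trials give the requested relative standard error.

    Uses the binomial result: relative std. error = sqrt((1 - P) / (P * trials)),
    with P ~= n_cycles * expected_rate_per_cycle (small-rate approximation).
    """
    if target_rel_error <= 0:
        raise ValueError("target_rel_error must be positive")
    p_total = n_cycles * expected_rate_per_cycle
    if not 0.0 < p_total < 0.5:
        raise ValueError("n_cycles * rate must lie in (0, 0.5) for this approximation")
    return math.ceil((1.0 - p_total) / (p_total * target_rel_error ** 2))


if __name__ == "__main__":
    for rate in (1e-4, 1e-6):
        print(rate, trials_needed(rate, n_cycles=10))
```

For a rate of 10⁻⁴ per cycle and N = 10, this comes out on the order of 10⁵ trials for a 10% relative uncertainty, which is why a fixed round number of shots is rarely the right choice.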

Measuring the threshold crossing: the key experiment

The most informative logical qubit benchmark is not a single data point but the threshold crossing experiment: measure logical error rate per cycle at code distances d=3, d=5, d=7 (or d=3, d=5 at minimum) on the same device with the same noise conditions.

If your device is below the fault-tolerance threshold, logical error rate should decrease as code distance increases. If d=5 gives a higher logical error rate than d=3 at the same physical error rate, your effective threshold is not where you think it is — likely because of spatially correlated noise, measurement error that is not being corrected, or decoder suboptimality.

The suppression factor per distance step (the ratio of logical error rates at consecutive code distances, for example d=3 to d=5) directly measures how far below threshold you are operating. A factor of 10× per distance step indicates comfortable margin. A factor of 2× indicates you are operating near threshold, where each additional distance step costs many more physical qubits for only a modest reduction in logical error rate.
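Computing the suppression factor from a threshold crossing experiment is a one-line ratio per distance step. The helper below is a hypothetical sketch; it assumes the rates were measured at odd distances spaced by 2 under the same noise conditions, and the numbers in the example are made up for illustration.

```python
def suppression_factors(rates: dict[int, float]) -> dict[tuple[int, int], float]:
    """Map each consecutive distance pair (d, d+2) to p_L(d) / p_L(d+2)."""
    distances = sorted(rates)
    factors = {}
    for d_lo, d_hi in zip(distances, distances[1:]):
        if d_hi - d_lo != 2:
            raise ValueError(f"distances {d_lo} and {d_hi} are not consecutive steps")
        if rates[d_hi] <= 0:
            raise ValueError(f"no logical errors recorded at d={d_hi}; ratio undefined")
        factors[(d_lo, d_hi)] = rates[d_lo] / rates[d_hi]
    return factors


if __name__ == "__main__":
    # Illustrative per-cycle logical error rates, not measured data.
    example = {3: 3e-3, 5: 1e-3, 7: 4e-4}
    for pair, factor in suppression_factors(example).items():
        print(pair, round(factor, 2))
```

A factor above 1 at every step means the logical error rate falls with distance; a factor below 1 is the d=5-worse-than-d=3 situation described above.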

Isolating decoder contribution from hardware contribution

When logical error rate is higher than expected, the cause is either hardware (physical error rates are worse than measured, or noise is more correlated than assumed) or the decoder (the decoder is making suboptimal correction decisions for this noise model). Distinguishing these requires running two experiments:

Offline decoder test. Capture the syndrome data from a hardware experiment and store it. Run the decoder against the stored syndrome data using multiple configurations: uniform edge weights, per-device calibrated edge weights, and if available, an optimal maximum-likelihood decoder. Compare the logical error rates. If calibrated-weight MWPM significantly outperforms uniform-weight MWPM, the decoder was the limiting factor. If both perform similarly and both are worse than expected, the hardware noise model is the issue.

Syndrome injection test. Run the QECSync decoder on synthetic syndrome data generated by Monte Carlo simulation of your device's measured noise model. If the decoder on synthetic data performs as expected but the decoder on real hardware data does not, there is structure in the real hardware noise that your noise model does not capture — likely correlated errors not reflected in per-qubit RB numbers.

QECSync's Python SDK includes a syndrome logging mode that captures the full syndrome array from each experiment shot. This data can be replayed through the decoder offline for comparison experiments. See the API reference for the SyndromeCaptureMode parameter.
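The offline decoder test described above amounts to replaying the same stored shots through several decoder configurations and comparing failure rates. The sketch below shows that comparison loop only. It does not use the QECSync SDK: the record layout, the `replay` function, and the two lookup-table decoders are hypothetical stand-ins built around a 3-qubit bit-flip repetition code, and the shot records are hand-written toy values.

```python
from typing import Callable, Sequence

# One stored shot: (syndrome bits, whether the logical observable actually flipped).
Shot = tuple[tuple[int, ...], int]
# A decoder maps a syndrome to its prediction of the logical flip (0 or 1).
Decoder = Callable[[tuple[int, ...]], int]


def replay(shots: Sequence[Shot], decoders: dict[str, Decoder]) -> dict[str, float]:
    """Run every decoder over the same stored shots; return logical failure fractions."""
    if not shots:
        raise ValueError("no stored shots to replay")
    results = {}
    for name, decode in decoders.items():
        failures = sum(1 for syndrome, flipped in shots if decode(syndrome) != flipped)
        results[name] = failures / len(shots)
    return results


# Toy decoders for a 3-qubit repetition code, observable read from qubit 0.
# "uniform" picks the minimum-weight error for each syndrome.
UNIFORM = {(0, 0): 0, (1, 0): 1, (1, 1): 0, (0, 1): 0}
# "calibrated" assumes qubit 0 is far more reliable than qubits 1 and 2, so
# syndrome (1, 0) is attributed to errors on qubits 1 and 2 instead of qubit 0.
CALIBRATED = {(0, 0): 0, (1, 0): 0, (1, 1): 0, (0, 1): 0}

# Hand-written toy records, not measured data.
TOY_SHOTS: list[Shot] = (
    [((0, 0), 0)] * 6 + [((1, 0), 0)] * 2 + [((1, 0), 1)] + [((1, 1), 0)]
)

if __name__ == "__main__":
    print(replay(TOY_SHOTS, {"uniform": UNIFORM.__getitem__,
                             "calibrated": CALIBRATED.__getitem__}))
```

Because every configuration sees identical syndrome data, any difference in failure fraction is attributable to the decoder rather than to run-to-run hardware drift. On real data the difference should be checked against the statistical uncertainty of each fraction before drawing a conclusion.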

What numbers to report

A complete logical qubit benchmark report should include:

  • Code distance and physical qubit count
  • Physical error rate (median two-qubit gate error, i.e. one minus fidelity, from simultaneous RB, measured in the same session)
  • Number of syndrome rounds per cycle
  • Number of experimental trials per distance
  • Logical error rate per cycle (with statistical uncertainty — a 95% confidence interval from binomial statistics, not just a point estimate)
  • Decoder algorithm used (MWPM or Union-Find) and whether per-device calibration was applied
  • Whether SPAM errors have been calibrated out and how
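For the confidence interval in the list above, one common binomial choice is the Wilson score interval, which behaves sensibly even when very few (or zero) logical errors are observed. The sketch below is one option among several (Clopper-Pearson is another); the function name is illustrative.

```python
import math


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    p_hat = failures / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Illustrative counts only, not measured data.
    print(wilson_interval(failures=50, trials=1000))
    print(wilson_interval(failures=0, trials=1000))
```

The interval applies to the total failure fraction over N cycles. In the small-rate regime, dividing both bounds by N gives an interval for the per-cycle rate.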

What not to report without qualification: logical fidelity after a single syndrome round (rather than a full cycle), a logical error rate with no code distance stated, or comparisons between hardware platforms that do not report physical error rates side-by-side.

Interpreting suppression: when is it real?

Distance-dependent suppression of logical error rate is the core experimental claim of below-threshold fault-tolerant operation. Interpreting it correctly requires a sanity check on the suppression mechanism.

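True distance suppression arises because longer error chains (required to produce logical errors at higher d) are exponentially less likely than shorter chains. This produces a specific functional form: p_L ≈ A × (p/p_th)^⌈d/2⌉, where p is the physical error rate, p_th is the threshold error rate, and A is a constant prefactor. For odd d, ⌈d/2⌉ equals (d+1)/2, so each step from d to d+2 should multiply p_L by the same factor p/p_th. The suppression exponent should match ⌈d/2⌉ for the distances you measure.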
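One way to check measured rates against this functional form is a straight-line fit in log space: ln p_L = ln A + ⌈d/2⌉ · ln(p/p_th). The sketch below is a plain least-squares fit, offered as an illustration only; it is not the statistical test from TR-2025-01, and the function name and example numbers are hypothetical.

```python
import math


def fit_suppression(rates: dict[int, float]) -> tuple[float, float, float]:
    """Fit p_L = A * ratio**ceil(d/2) by least squares on ln(p_L).

    Returns (A, ratio, max_abs_log_residual), where ratio estimates p/p_th.
    """
    if len(rates) < 2:
        raise ValueError("need logical error rates at two or more distances")
    xs = [math.ceil(d / 2) for d in rates]
    ys = [math.log(rate) for rate in rates.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # estimate of ln(p / p_th)
    intercept = mean_y - slope * mean_x    # estimate of ln(A)
    max_resid = max(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))
    return math.exp(intercept), math.exp(slope), max_resid


if __name__ == "__main__":
    # Illustrative per-cycle rates, not measured data.
    a, ratio, resid = fit_suppression({3: 4e-3, 5: 8e-4, 7: 1.6e-4})
    print(a, ratio, resid)
```

A fitted ratio below 1 corresponds to below-threshold behavior, and its inverse is the suppression factor per distance step. With only two distances the line passes through both points exactly, so the residual carries no information; three or more distances are needed before the residual says anything about whether the data follow the expected form.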

A common confound: an improvement in logical error rate from d=3 to d=5 that is not followed by a further improvement from d=5 to d=7 at the expected suppression exponent is not evidence of true threshold operation. It is instead evidence of correlated noise that the code handles at d=5 but not at higher distances, or of a hardware preparation step that improves qubit quality at higher qubit count without being a threshold effect.

The QECSync benchmarking methodology described in our technical report (TR-2025-01, available on request from the contact page) includes a statistical test for whether observed suppression is consistent with below-threshold operation versus other explanations. We recommend applying this test before interpreting suppression as evidence of below-threshold fault-tolerant operation in publications.