‘The accuracy of Apple Watch measurements: a living systematic review and meta-analysis’

Author: Rory Lambe et al.
Journal: Nature (2026)

AI-generated summary

This study presents a living systematic review and meta-analysis evaluating the measurement accuracy of the Apple Watch across multiple health and fitness metrics. A total of 82 validation studies were included, covering 14 metrics and involving 430,052 participants. Eligible studies compared Apple Watch outputs against accepted criterion methods under controlled, free-living, exercise, and clinical conditions.

Methods

Nine databases were systematically searched up to 24 September 2025, without language or publication-type restrictions. Measurement agreement between Apple Watch and criterion methods was assessed using Bland–Altman mean bias and limits of agreement (LoA), as well as sensitivity and specificity for diagnostic outcomes such as atrial fibrillation. Risk of bias was evaluated using an adapted COSMIN/INTERLIVE framework. Random-effects meta-analyses were conducted where sufficient data were available.
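
For readers unfamiliar with the Bland–Altman statistics named above, the calculation for a single validation study can be sketched as follows. This is a minimal illustration, not the review's pooling procedure: the function name `bland_altman` and the heart-rate readings are hypothetical, and the 1.96 multiplier assumes the paired differences are approximately normally distributed.

```python
import statistics


def bland_altman(device, criterion):
    """Return (mean bias, lower LoA, upper LoA) for paired readings.

    Bias is the mean of (device - criterion); the limits of agreement
    are bias +/- 1.96 * sample SD of those differences.
    """
    if len(device) != len(criterion) or len(device) < 2:
        raise ValueError("need at least two paired readings of equal length")
    diffs = [d - c for d, c in zip(device, criterion)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical heart-rate readings in bpm (illustrative, not from the review)
watch = [72, 85, 90, 110, 128, 140]
ecg = [70, 86, 92, 108, 130, 141]

bias, lower, upper = bland_altman(watch, ecg)
print(f"bias = {bias:.2f} bpm, LoA = ({lower:.2f}, {upper:.2f}) bpm")
```

A mean bias near zero combined with wide limits of agreement means the device is accurate on average, while individual readings can still deviate noticeably from the criterion.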

Results
  • Heart rate (HR):
    Meta-analysis of 22 studies showed a low pooled mean bias (−0.27 bpm; 95% CI −0.72 to 0.17), but moderately wide limits of agreement (−7.19 to 6.64 bpm), indicating individual-level variability. Accuracy improved substantially with newer (third-generation) optical sensors (LoA −3.68 to 2.59 bpm) and was sufficient to quantify exercise intensity in healthy adults, though misestimation occurred more frequently in clinical populations.
  • Blood oxygen saturation (SpO₂):
    Mean bias was minimal (−0.04%), but limits of agreement were wide (−4.01 to 3.94%), suggesting potential misclassification in hypoxic ranges, particularly among clinical populations. Accuracy was higher in healthy adults under controlled conditions and met FDA/ISO standards in select studies.
  • Atrial fibrillation detection:
    Meta-analysis of ECG-based detection showed moderate-to-high diagnostic accuracy, with pooled sensitivity of 0.79 (95% CI 0.61–0.90) and specificity of 0.91 (95% CI 0.81–0.96). High specificity indicates that positive notifications likely reflect true arrhythmia, though sensitivity varied and inconclusive readings were common (15–25%).
  • Other metrics:
    Accuracy was moderate for sleep (sleep–wake classification) and step count, but poor and inconsistent for energy expenditure and VO₂max, with errors often exceeding clinically meaningful thresholds. Several metrics (e.g. respiratory rate, wrist temperature, sedentary behaviour) lack sufficient validation.
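
The sensitivity and specificity figures reported for atrial fibrillation detection follow from a standard 2×2 confusion matrix. The sketch below shows the arithmetic under stated assumptions: the function name `diagnostic_accuracy` is hypothetical, and the counts are invented so that they reproduce the pooled point estimates (0.79 and 0.91). They are not data from the review, which pooled study-level results in a random-effects model. Inconclusive readings are left out of the counts, since this summary does not describe how they were handled.

```python
def diagnostic_accuracy(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true AF cases that were flagged
    specificity = TN / (TN + FP): share of non-AF cases correctly cleared
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts: 100 people with AF, 100 without (illustrative only)
sens, spec = diagnostic_accuracy(tp=79, fn=21, tn=91, fp=9)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these illustrative counts, 21 of 100 true AF cases would be missed while only 9 of 100 people without AF would be flagged, which is why a positive notification is more informative than a negative one.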

Interpretation

Measurement accuracy varied substantially by metric, sensor generation, measurement conditions, and individual physiology. Metrics derived directly from photoplethysmography (e.g. HR, SpO₂) showed stronger agreement than those relying on multi-sensor fusion and proprietary algorithms (e.g. energy expenditure). The authors emphasise that intended use (personal monitoring vs. clinical decision-making) should guide interpretation of accuracy thresholds.

Conclusion

The Apple Watch demonstrates acceptable accuracy for heart rate monitoring and atrial fibrillation detection in healthy populations, while accuracy for other metrics remains limited. The findings support its use for longitudinal personal health monitoring, but caution against uncritical clinical application for metrics with large error margins. Ongoing validation is required, particularly in diverse and clinical populations, reinforcing the value of this work as a living systematic review.

Niels de Vries