Large-scale real-world validation of an AI-driven ultrasound breast cancer screening system in rural settings

Chenzhong Wang ¹,†

Tao He ¹,†

Xi Wang ²,†

Junjie Hu ¹

Jianyong Wang ¹

Xiang Li ¹

Jixiang Guo ¹

Qing Lyu ¹

Xiaoying Hao ³

Jian Liu ⁴

Bo Gou ⁴

Chuanbo Xie ⁵

Han Wang ⁶

Jianping Gou ⁷

Fanxin Zeng ⁸

Xi Lin ⁹

Cuiju Yan ⁹

Huiling Xiang ⁹

Weihan Cao ¹⁰

Lan Wu ¹

Ling Liu ¹

Shimin Hu ¹¹

Hao Chen ²,*

Zhang Yi ¹,*

Nature

¹Sichuan University, ²The Hong Kong University of Science and Technology, ³Zigong Maternal and Child Health Hospital, ⁴Chengdu Medical College, ⁵Zigong Health Commission, ⁶Peking University, ⁷Southwest University, ⁸Dazhou Central Hospital, ⁹Sun Yat-sen University Cancer Center, ¹⁰Kunming Medical University, ¹¹Tsinghua University. *Corresponding authors: Hao Chen (jhc@ust.hk), Zhang Yi (zhangyi@scu.edu.cn). †These authors contributed equally to this work.


Abstract

Breast cancer is the most commonly diagnosed cancer among women, and early detection is essential for improving outcomes. Ultrasound is particularly important for screening Asian populations with dense breasts, but reliance on physician-operated acquisition and interpretation limits scalability, especially in underserved regions. Here we present AIBS, an artificial intelligence-enabled system for large-scale breast ultrasound screening. AIBS combines two components for scalable screening: a real-time quality-control module supporting standardized acquisition by non-physician operators, and a video-based risk-stratification model built on the first breast ultrasound video foundation model (BUVFM). BUVFM was self-supervised pretrained on 423,493 frames and 84,021 video sequences, then fine-tuned for BI-RADS risk stratification. Coupled with portable ultrasound devices, AIBS enables a hospital-independent screening workflow that expands access to population-level screening. With AI-assisted quality control, non-physician operators achieved scanning performance non-inferior to that of trained ultrasound physicians (8.00 versus 8.18; lower bound of the 95% confidence interval, -0.374; non-inferiority $P=0.079$). In retrospective validation, AIBS outperformed state-of-the-art models for risk stratification in both internal and external cohorts. On the internal validation cohort, it achieved a macro sensitivity of 95.42% and a macro specificity of 95.60%; across the 11 external validation cohorts, macro sensitivity ranged from 94.11% to 99.39%, and macro specificity ranged from 94.28% to 99.39%. In a video-based reader study involving 16 ultrasound physicians, standalone AIBS showed stronger risk-stratification performance than all readers in the independent setting, while AI assistance improved reader interpretation by 4.18 percentage points and reduced interpretation time by 42.85%. In a prospective study across 4 screening cohorts ($n=1,282$), the average screening time was 94 s per case and the average screening cost was 4.19 CNY per case (4.48× lower than traditional screening). These findings support AIBS as a scalable and standardized approach to breast ultrasound screening in resource-constrained settings.

Standardizing Breast Ultrasound Screening

[Placeholder for the system workflow animation: operator scanning -> edge-side quality control -> cloud-side model -> result analysis]

Sample Screening Videos by Risk Level in Our Dataset

High Risk

Mid Risk

Low Risk

No Lesion

Study Design and AI Evaluation Workflow

Overview of study design and AI evaluation workflow. a, b, Participant demographics and study cohorts across screening centers. c, Self-supervised pre-training framework (nmODE-JEPA) using a context encoder to predict latent features from masked regions. d, Real-time AI pipeline for ultrasound videos, enabling continuous lesion localization and dynamic risk stratification (low, mid, high-risk). e, Scanning quality control validated via expert scoring of volunteer scans by physicians and non-physicians. f, Retrospective internal (3 sites, n = 790) and external (4 sites, n = 1,080) validation of diagnostic accuracy. g, Crossover study (500 videos) evaluating 16 readers with and without AI (4-week washout) to assess diagnostic performance. h, Prospective real-world validation across 23 sites (n = 11,485), demonstrating clinical robustness and generalizability.

The Architecture of AIBS

AIBS is a cloud-edge collaborative breast ultrasound screening system designed to reconcile two competing demands: the need for immediate feedback during acquisition and the need for computationally intensive video-level analysis. The system comprises two tightly coupled components operating in parallel:

1. Real-Time Quality-Control Module (Edge-Side)

Device: Portable Ultrasound Devices.
Function: Processes continuous video streams in real-time to provide immediate feedback on image quality and scan completeness.
Guidance: Displays visual prompts and warnings on the device interface, allowing non-physician operators to dynamically adjust probe contact, imaging depth, sweep speed, and gain settings for standardized scanning.

2. Staged Diagnostic Workflow

The diagnostic pipeline distributes inference across edge-side and cloud-side environments based on latency and computational requirements. It is organized into a hierarchical three-stage pipeline:

Stage 1: Edge-Side Rapid Filtering (Valid Frame Identification)

Function: A lightweight cascade model continuously sieves raw video streams to identify whether incoming frames or short temporal clips contain suspicious lesion-related content. It operates with low latency to provide immediate front-end response.

Stage 2: Edge-Side Lesion Localization (Key-Clip Extraction)

Function: Triggered when Stage 1 predictions exceed predefined probability thresholds. This model localizes suspicious lesion regions and extracts key diagnostic clips, selecting lesion-positive clips for downstream analysis while reducing cloud upload bandwidth.

Stage 3: Cloud-Side Risk Stratification (Video Foundation Model)

The Core Model: A breast ultrasound video foundation model fine-tuned for downstream classification.
Function: Handles computationally intensive high-precision temporal interpretation. Uploaded suspicious clips undergo further video-level lesion confirmation while preserving vital video-level context.
Output: Generates the final examination-level risk categorization (Low, Medium, or High Risk).

Summary of Collaboration: By allocating low-latency filtering and quality control to the edge, and high-precision temporal interpretation to the cloud, AIBS ensures real-time responsiveness during acquisition while delivering expert-level risk assessment at the back-end.
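The edge/cloud division of labour described above can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the AIBS implementation: the `Clip` type, the three callables, the 0.5 threshold, the toy scoring rules, and the `no_lesion` fallback for examinations with no uploaded clip are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Frame = List[float]  # hypothetical stand-in for one ultrasound frame


@dataclass
class Clip:
    frames: List[Frame]


def run_cascade(
    clips: Sequence[Clip],
    stage1_score: Callable[[Clip], float],
    stage2_localize: Callable[[Clip], Optional[Clip]],
    stage3_classify: Callable[[List[Clip]], str],
    stage1_threshold: float = 0.5,  # assumed value; the source only says "predefined"
) -> str:
    """Return one examination-level label for a sequence of clips."""
    key_clips: List[Clip] = []
    for clip in clips:
        # Stage 1 (edge): cheap filter; skip clips that look lesion-free.
        if stage1_score(clip) < stage1_threshold:
            continue
        # Stage 2 (edge): localize and keep only lesion-positive key clips.
        key = stage2_localize(clip)
        if key is not None:
            key_clips.append(key)
    if not key_clips:
        # Nothing to upload; this fallback label is an assumption.
        return "no_lesion"
    # Stage 3 (cloud): video-level risk stratification over uploaded clips.
    return stage3_classify(key_clips)


# Toy stand-ins for the three models, for demonstration only.
def toy_stage1(clip: Clip) -> float:
    return max(max(frame) for frame in clip.frames)


def toy_stage2(clip: Clip) -> Optional[Clip]:
    kept = [frame for frame in clip.frames if max(frame) >= 0.5]
    return Clip(kept) if kept else None


def toy_stage3(key_clips: List[Clip]) -> str:
    peak = max(max(frame) for clip in key_clips for frame in clip.frames)
    if peak >= 0.9:
        return "high"
    return "mid" if peak >= 0.7 else "low"


if __name__ == "__main__":
    exam = [Clip([[0.1, 0.2]]), Clip([[0.3, 0.95]])]
    print(run_cascade(exam, toy_stage1, toy_stage2, toy_stage3))
```

Passing the stage models in as callables keeps the control flow visible: only clips that pass both edge-side stages reach the expensive cloud-side call, which is where the bandwidth saving described above comes from.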

Model Architecture. The framework comprises self-supervised pre-training (left) and downstream clinical adaptation (right). The foundation model is pre-trained via video contrastive learning on a large-scale unlabeled dataset (>80,000 videos, >400,000 images). It integrates a 3D convolution encoder, a momentum-updated teacher-student Vision Transformer (ViT), and an nmODE predictor to learn robust spatiotemporal representations optimized by content and prediction losses. Through distillation or fine-tuning, the model adapts to three progressive stages: Stage I (Image-based filtering) leverages Mobile-ODE and YOLO-11 for frame validation and lesion detection; Stage II (Video-based filtering) utilizes ResNet50 and nmODE to model temporal evolution for video-level lesion filtering; Stage III (Risk stratification) applies fully supervised fine-tuning on labeled datasets to classify lesion severity into Low, Mid, and High clinical risk categories.
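Two ingredients named in this caption, a momentum-updated teacher and a prediction loss on masked content, can be illustrated in a few lines. This is a generic sketch of the technique, not the authors' training code: the flat parameter lists, the momentum value of 0.99, and the use of squared error are assumptions, and the nmODE predictor and 3D convolution encoder are not modelled here.

```python
from typing import List, Sequence


def ema_update(
    teacher: Sequence[float],
    student: Sequence[float],
    momentum: float = 0.99,  # assumed value, not taken from the source
) -> List[float]:
    """Momentum (exponential moving average) update of teacher weights."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def masked_latent_loss(
    predicted: Sequence[float],
    target: Sequence[float],
    mask: Sequence[bool],
) -> float:
    """Mean squared error computed over masked positions only.

    `predicted` would come from the student/predictor and `target` from the
    teacher; squared error is chosen here purely for illustration.
    """
    errors = [(p - t) ** 2 for p, t, m in zip(predicted, target, mask) if m]
    if not errors:
        return 0.0
    return sum(errors) / len(errors)


if __name__ == "__main__":
    teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
    print(teacher)
    print(masked_latent_loss([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [False, True, True]))
```

The teacher receives no gradient; it only tracks a slow average of the student, which gives the predictor a stable target for the masked regions.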

Datasets & Scale

Our system is built and validated on a massive scale of real-world clinical data:

Dataset Category	Scope	Scale
Pre-training Corpus	Unlabeled Multicenter Data	6.3M items (5.8M video frames + 507k static images)
SCUBC Cohort	Development & Validation	22,534 videos from 9,780 patients
Prospective Cohort	Nationwide Rural Deployment	11,485 exams across 23 sites (6 provinces)
QC Validation	Scanning Standardization	12 operators vs. senior physicians

Experimental Results

Quantitative Performance

nmODE-VJEPA outperforms SOTA models (OpenUS, USFM) across all cohorts:

Internal Validation (N=790): 98.23% Accuracy, 0.9956 Macro AUC.
External Validation (N=1,082): Accuracy up to 100% (Ext1) and 96.15% (Ext2).
Prospective Real-world (N=11,485): 91.3% Patient-level Accuracy; No high-risk case was downgraded to low-risk.
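For readers who want to reproduce this kind of summary, the sketch below shows one common way to compute macro sensitivity and macro specificity from a three-class confusion matrix, plus a check that no high-risk case was predicted as low-risk. The matrix layout (rows are reference labels, columns are predictions, ordered low, mid, high) and the toy counts are assumptions for illustration; they are not data from the study.

```python
from typing import List, Tuple


def macro_sensitivity_specificity(cm: List[List[int]]) -> Tuple[float, float]:
    """One-vs-rest sensitivity and specificity, averaged over classes.

    cm[i][j] counts cases with reference class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    sens, spec = [], []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
    return sum(sens) / n, sum(spec) / n


def high_to_low_downgrades(cm: List[List[int]], low: int = 0, high: int = 2) -> int:
    """Number of reference high-risk cases predicted as low-risk."""
    return cm[high][low]


if __name__ == "__main__":
    # Toy counts, class order: low, mid, high (made up for illustration).
    toy_cm = [
        [8, 2, 0],
        [1, 8, 1],
        [0, 1, 9],
    ]
    print(macro_sensitivity_specificity(toy_cm))
    print(high_to_low_downgrades(toy_cm))
```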

Clinical Utility & Efficiency

Non-Inferiority: AI-assisted non-physicians achieved image quality scores comparable to professional sonographers (8.00 vs 8.16).
Efficiency: Reading efficiency improved by 42% ($P < 0.001$).
Inter-reader Agreement: Pairwise Cohen’s kappa increased from 0.60 to 0.70 with AI assistance.
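The agreement statistic above can be computed as follows. This is a minimal sketch of unweighted Cohen's kappa and its average over all reader pairs; the label strings and the toy readings are hypothetical, and the source does not say whether a weighted variant was used.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa between two readers' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("readers must label the same non-empty set of cases")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    # Agreement expected by chance from each reader's label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(readers: List[Sequence[str]]) -> float:
    """Average kappa over every unordered pair of readers."""
    pairs = list(combinations(readers, 2))
    return sum(cohens_kappa(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    reader_1 = ["low", "low", "mid", "high"]
    reader_2 = ["low", "mid", "mid", "high"]
    print(cohens_kappa(reader_1, reader_2))
```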

Comprehensive performance evaluation and feature representation across multicenter cohorts. a, Radar charts comparing macro-averaged classification metrics, including Macro Sensitivity, Macro Specificity, and Macro Precision. The twelve axes represent the internal validation cohort and eleven external validation cohorts (Ext 1–Ext 11), demonstrating the models’ generalization capabilities across varied multicenter data distributions. b, Radar charts illustrating model robustness across clinical risk stratifications (Low, Medium, and High Risk). Performance is measured by class-level AUC across the same twelve validation cohorts, showing the consistency of BUVFM across different clinical subgroups. c, UMAP visualizations of high-dimensional feature embeddings for all evaluated methods. Points are colored by risk level, demonstrating the superior class-separability and distinct feature clustering achieved by BUVFM compared to baseline models.

Reader study performance and behavior analysis. a–f, Diagnostic performance and efficiency of 16 individual readers (left columns) and reader subgroups (right columns). Metrics include macro-averaged sensitivity (a, b), specificity (c, d), and precision (e, f), comparing independent reading (solid bars) and AI-assisted reading (hatched bars), alongside standalone AI performance (dashed horizontal lines). g, h, Corresponding reading times per case for individuals (g) and subgroups (h). For subgroup analysis, readers were categorized into Junior (≤10 years experience) and Senior (>10 years experience) groups. i, Transitions in reader behavior under AI assistance, categorized by the initial correctness of both the human reader and the AI system (e.g., AI-corrected, AI-misled). j, Error contribution analysis across low-, medium-, and high-risk BI-RADS categories for Junior and Senior groups. k, Heatmap of reader agreement (Cohen’s κ), displaying intra-reader consistency (blue, diagonal; assisted vs. independent) and inter-reader agreement (red, off-diagonal; assisted vs. assisted). Error bars represent the standard deviation (s.d.) derived from 1,000 bootstrap iterations; P values were calculated using two-tailed paired t-tests.
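The bootstrap error bars mentioned in this caption can be approximated with a case-level resampling loop such as the one below. It is a sketch under assumptions: the caption states 1,000 iterations but not the resampling unit, so resampling cases with replacement is assumed here, and the per-case correctness values are made up.

```python
import random
import statistics
from typing import Callable, Sequence


def bootstrap_sd(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float],
    n_boot: int = 1000,
    seed: int = 0,
) -> float:
    """Standard deviation of `statistic` across bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Draw n cases with replacement, then recompute the statistic.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Toy per-case correctness (1 = correct): 80 of 100 cases correct.
    per_case_correct = [1.0] * 80 + [0.0] * 20
    print(bootstrap_sd(per_case_correct, statistics.mean))
```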

Prospective validation and real-world evaluation of the AI screening system across sequential cohorts. (a–c) Diagnostic performance of the proposed AI against baseline models, measured by (a) Macro-Sensitivity, (b) Macro-Specificity, and (c) Macro-Precision across four prospective phases (Cohort 1–4) and the overall cohort. Statistical significance is annotated for top-performing comparisons. (d) Average screening time (MM:SS) per patient, stratified by clinical risk (High, Medium, Low) to illustrate workflow efficiency across sequential cohorts. (e) Real-world economic analysis showing average screening cost per capita (RMB). Total expenses are broken down into logistics, consumables, equipment depreciation, and labor to highlight the economic sustainability and scaling efficiency of the program.
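As a rough reading of the cost figure reported in the abstract (4.19 CNY per case, 4.48 times lower than traditional screening), and assuming the factor is the ratio of traditional to AIBS cost per case, the implied traditional cost is approximately:

$$
c_{\text{traditional}} \approx 4.48 \times 4.19\ \text{CNY} \approx 18.77\ \text{CNY per case}
$$

This value is derived here from the two reported numbers and is not stated in the source.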

BibTeX citation

@article{wang2026nmode,
  title={nmODE-V: ...},
  author={Zhang, San and Li, Si...},
  journal={Nature},
  year={2026}
}

Collaborating Institutions