MATERIALS AND METHODS: A literature search was conducted on Web of Science, PubMed, and IEEE Xplore for relevant studies published from January 2015 to February 2024. The study was registered with the PROSPERO International Prospective Register of Systematic Reviews (protocol no. CRD42024485371). The Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool and the Must AI Criteria-10 (MAIC-10) checklist were used to assess quality and risk of bias. The meta-analysis included studies reporting the performance of deep learning (DL) models for breast cancer diagnosis, from which pooled summary estimates for the area under the curve (AUC), sensitivity, and specificity were calculated.
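The pooling step described above can be illustrated with a minimal sketch. The abstract does not state which pooling model was used, so the DerSimonian-Laird random-effects estimator on logit-transformed proportions shown here is an assumption, as are the function name `pool_random_effects`, the 0.5 continuity correction, and the normal-approximation 95% interval. Any counts passed to it in this sketch are hypothetical and are not taken from the included studies.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_random_effects(events, totals):
    """Pool per-study proportions (e.g. sensitivity = TP / (TP + FN)).

    Assumed method: DerSimonian-Laird random effects on the logit scale.
    Returns (pooled, ci_low, ci_high, tau2).
    """
    y, v = [], []
    for e, n in zip(events, totals):
        # 0.5 continuity correction keeps the logit finite when e == 0 or e == n
        e_c = e + 0.5
        f_c = (n - e) + 0.5
        y.append(logit(e_c / (e_c + f_c)))
        v.append(1.0 / e_c + 1.0 / f_c)  # approximate variance of a logit proportion

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))

    # Between-study variance tau^2, truncated at zero
    k = len(y)
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate
    w_re = [1.0 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return (
        inv_logit(y_re),
        inv_logit(y_re - 1.96 * se),
        inv_logit(y_re + 1.96 * se),
        tau2,
    )
```

Calling `pool_random_effects([45, 80, 30], [50, 90, 35])` with hypothetical true-positive counts and diseased-case totals returns a pooled sensitivity with its interval. Diagnostic accuracy meta-analyses often pool sensitivity and specificity jointly with a bivariate model instead; the univariate form is used here only to keep the sketch short.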
RESULTS: A total of 40 studies were included, of which only 21 were eligible for quantitative analysis. Convolutional neural networks (CNNs) were used in 62.5% (25/40) of the implemented models, with the remaining 37.5% (15/40) being hybrid composite models (HCMs). The pooled estimates of AUC, sensitivity, and specificity were 0.90 (95% CI: 0.87, 0.93), 88% (95% CI: 86%, 91%), and 90% (95% CI: 87%, 93%), respectively.
CONCLUSIONS: DL models used for breast cancer diagnosis on MRI achieve high performance. However, there is considerable inherent variability in this analysis. Therefore, continuous evaluation and refinement of DL models are essential to ensure their practicality in the clinical setting.
KEY POINTS: Question: Can DL models improve diagnostic accuracy in breast MRI, addressing challenges like overfitting and heterogeneity in study designs and imaging sequences? Findings: DL achieved high diagnostic accuracy (AUC 0.90, sensitivity 88%, specificity 90%) in breast MRI, with training size significantly impacting performance metrics (p
METHODS: In this study, we present an integrated pipeline that combines weakly supervised learning (reducing the need for detailed annotations) with local AI model training via swarm learning (SL), which circumvents centralized data sharing. To train the models, we used three datasets comprising 1372 bilateral breast MRI exams of female patients from institutions in three countries: the United States (US), Switzerland, and the United Kingdom (UK). These models were then validated on two external datasets consisting of 649 bilateral breast MRI exams from Germany and Greece.
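The idea of training locally and sharing only model parameters can be sketched as follows. The abstract does not describe how parameters are merged or how peers are coordinated, so the size-weighted averaging rule, the function names `merge_weights` and `swarm_round`, and the flat-list parameter representation are all assumptions made for illustration. A single-process loop stands in for the peer-to-peer coordination of a real SL deployment.

```python
def merge_weights(peer_weights, peer_sizes):
    """Average the peers' parameter vectors, weighted by local dataset size.

    Assumption: size-weighted averaging; the source does not specify the rule.
    """
    total = sum(peer_sizes)
    merged = [0.0] * len(peer_weights[0])
    for weights, size in zip(peer_weights, peer_sizes):
        for i, w in enumerate(weights):
            merged[i] += w * size / total
    return merged


def swarm_round(peers, local_step):
    """Run one synchronization round over hypothetical peers.

    Each peer is a dict with "weights" (list of floats) and "data" (local
    exams that never leave the peer). Only parameters are exchanged.
    """
    # 1. Every peer trains on its own data
    updated = [local_step(p["weights"], p["data"]) for p in peers]
    sizes = [len(p["data"]) for p in peers]

    # 2. Parameters, not data, are merged
    merged = merge_weights(updated, sizes)

    # 3. Every peer continues from the merged parameters
    for p in peers:
        p["weights"] = list(merged)
    return merged
```

For example, `merge_weights([[1.0, 2.0], [3.0, 4.0]], [1, 3])` weights the second peer three times as heavily as the first. In this sketch `local_step` is a placeholder for whatever local training routine each site runs.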
RESULTS: Upon systematically benchmarking various weakly supervised two-dimensional (2D) and three-dimensional (3D) deep learning (DL) methods, we find that the 3D-ResNet-101 demonstrates superior performance. By implementing a real-world SL setup across three international centers, we observe that these collaboratively trained models outperform those trained locally. Even with a smaller dataset, we demonstrate the practical feasibility of deploying SL internationally with on-site data processing, addressing challenges such as data privacy and annotation variability.
CONCLUSIONS: Combining weakly supervised learning with SL enhances inter-institutional collaboration, improving the utility of distributed datasets for medical AI training without requiring detailed annotations or centralized data sharing.