Digital health: Peer-reviewed study reveals significant disparities in coverage and accuracy among symptom assessment apps

New study compares world’s most popular symptom assessment apps on condition coverage, accuracy and safety
The peer-reviewed study, published in BMJ Open, was conducted by a team of doctors and scientists led by Ada Health alongside independent digital health experts
Eight symptom assessment apps were tested: Ada, Babylon, Buoy, K Health, Mediktor, Symptomate, WebMD, and Your.MD

London & Berlin, 16 December 2020 - A new peer-reviewed study testing the coverage, accuracy and safety of the eight most popular online symptom assessment apps has found that the performance of apps varies widely, with only a handful performing close to the levels of human general practitioners (GPs). Published today in BMJ Open, the study is the first of its kind to be published since 2015 and was conducted by a team of doctors and scientists led by global digital health company Ada Health.

Key findings

Coverage: Coverage is an important measure for digital health tools that might be deployed at scale, since it demonstrates how well apps can handle the wide variety of cases encountered within complex real-world healthcare environments. A tool with low coverage may, for example, exclude users who are too young, too old, pregnant, or living with a pre-existing mental health condition.

The study looked at how comprehensively the apps covered possible conditions and user types, and found that just a few of the most popular apps are configured to cover all patients. The most comprehensive app was Ada, which provided a condition suggestion 99 percent of the time. The other apps tested provided a suggestion 69.5 percent of the time on average, with the lowest scoring just 51.5 percent. The least comprehensive apps were not able to suggest conditions for significant numbers of cases, including key groups such as children, patients with a mental health condition, or those who were pregnant. Human GPs provided 100 percent coverage.
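
As a rough illustration of the coverage measure described above, the sketch below counts the share of cases for which an app returned at least one condition suggestion. The function name `coverage` and the toy data are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence


def coverage(suggestions_per_case: Sequence[Optional[Sequence[str]]]) -> float:
    """Return the share of cases (in percent) with at least one suggestion.

    Each entry holds the conditions an app suggested for one case.
    None or an empty list means the app gave no suggestion for that case.
    """
    if not suggestions_per_case:
        raise ValueError("need at least one case")
    covered = sum(1 for suggested in suggestions_per_case if suggested)
    return 100.0 * covered / len(suggestions_per_case)


# Hypothetical toy data: four cases, one of which the app declined to assess.
toy_cases = [
    ["migraine"],
    None,
    ["gastroenteritis", "appendicitis"],
    ["otitis media"],
]
print(coverage(toy_cases))  # 75.0
```

On this definition, 99 percent coverage across 200 vignettes corresponds to a suggestion in 198 of the 200 cases.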

Accuracy: The study also considered the accuracy of each symptom assessment app by comparing the conditions suggested with what was deemed to be the ‘gold standard’ answer for each case as determined by a panel of doctors.

The study found that the apps’ clinical accuracy was also highly variable. Ada was rated as the most accurate, suggesting the right condition in its top three suggestions 71 percent of the time. The average across all the other apps was just 38 percent, with scores falling in a range between 23.5 percent and 43 percent. This means that none of the apps other than Ada identified the right condition in the majority of cases. Human GPs were the most accurate, with 82 percent accuracy.
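
The "top three suggestions" measure can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's scoring procedure: the name `top_k_accuracy` is hypothetical, conditions are matched by exact string, and cases where the app gave no suggestion are assumed to count as misses.

```python
from typing import Collection, Optional, Sequence


def top_k_accuracy(
    suggestions_per_case: Sequence[Optional[Sequence[str]]],
    gold_per_case: Sequence[Collection[str]],
    k: int = 3,
) -> float:
    """Percent of cases where a gold-standard condition is in the top k.

    Assumption: a case with no suggestions (None or empty) is a miss.
    """
    if len(suggestions_per_case) != len(gold_per_case) or not gold_per_case:
        raise ValueError("need equally sized, non-empty inputs")
    hits = 0
    for suggested, gold in zip(suggestions_per_case, gold_per_case):
        top_k = list(suggested or [])[:k]  # ranked suggestions, best first
        if any(condition in gold for condition in top_k):
            hits += 1
    return 100.0 * hits / len(gold_per_case)


# Hypothetical toy data: ranked suggestions and gold-standard sets per case.
toy_suggestions = [
    ["migraine", "tension headache"],
    None,
    ["gastroenteritis", "constipation", "urinary tract infection", "appendicitis"],
    ["otitis media"],
]
toy_gold = [{"migraine"}, {"asthma"}, {"appendicitis"}, {"otitis media"}]
print(top_k_accuracy(toy_suggestions, toy_gold))  # 50.0
```

In the third toy case the correct condition is ranked fourth, so it does not count as a hit when k is 3.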

Safety: Finally, the study also assessed the safety of the apps’ advice by examining whether the guidance they provided - such as staying at home to manage symptoms, or going to see a doctor - was considered to have the appropriate level of urgency.

While most apps gave safe advice in the majority of cases, only three apps performed close to the level of human GPs: Ada, Babylon, and Symptomate. Although every app assessed for safety scored 80 percent or higher, compared to 97 percent for human GPs, even a small disparity in the safety of advice could potentially have a major impact upon patient outcomes if deployed at scale.
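
One way to picture a safety score of this kind is to compare each piece of advice with the gold-standard urgency level on an ordered scale. The sketch below is purely illustrative: the three-level scale, the names `is_safe` and `safe_advice_rate`, and the rule that advice is safe when it is no less urgent than the gold standard (within an optional tolerance) are all assumptions, not the definitions used in the study.

```python
from typing import Sequence, Tuple

# Hypothetical urgency scale, ordered from least to most urgent.
URGENCY_LEVELS = ["self-care", "see a doctor", "emergency care"]


def is_safe(advice: str, gold: str, tolerance: int = 0) -> bool:
    """Assumed rule: advice may be at most `tolerance` levels less urgent."""
    return URGENCY_LEVELS.index(advice) >= URGENCY_LEVELS.index(gold) - tolerance


def safe_advice_rate(pairs: Sequence[Tuple[str, str]], tolerance: int = 0) -> float:
    """Percent of (advice, gold standard) pairs judged safe."""
    if not pairs:
        raise ValueError("need at least one case")
    safe = sum(1 for advice, gold in pairs if is_safe(advice, gold, tolerance))
    return 100.0 * safe / len(pairs)


# Hypothetical toy data: (advice given, gold-standard advice) per case.
toy_pairs = [
    ("see a doctor", "see a doctor"),    # matches the gold standard
    ("emergency care", "see a doctor"),  # over-cautious, still safe
    ("self-care", "emergency care"),     # under-triaged, unsafe
    ("self-care", "self-care"),
]
print(safe_advice_rate(toy_pairs))  # 75.0
```

Under this assumed rule, over-cautious advice counts as safe, so a safety score on its own says nothing about how often an app sends users to care they do not need.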

Methodology

The study is the only international large-scale peer-reviewed comparison of the performance and safety of apps across a broad range of medical conditions to be published in the last five years. It was developed by a team of digital health experts and clinical practitioners, including practising GPs, independent primary care clinical experts, and members of the clinical and scientific teams at Ada Health.

To ensure a fair comparison, the study used 200 ‘clinical vignettes’ - fictional patients, generated from a mix of real patient experiences gleaned from anonymised transcripts of calls to the UK’s NHS 111 telephone triage service and from the many years’ combined experience of the research team.[1] The vignettes were reviewed externally by a panel of three experienced primary care practitioners to ensure quality and clarity and to set the list of ‘gold standard’ correct conditions and urgency advice level for each case.

The vignettes were then entered into the eight apps by eight external GPs playing the role of ‘patient’. Each app was tested once against every vignette. Seven external GPs were also tested with the vignettes, providing condition suggestions (preliminary diagnoses) for the clinical vignettes after telephone consultations. Human GPs were included to provide a benchmark by which to assess the apps.

Commentary

Dr. Hamish S F Fraser, Associate Professor of Medical Science, Brown Center for Biomedical Informatics:

“Symptom assessment apps are now used by tens of millions of patients annually in the US and UK alone. This study of eight of the most commonly used symptom assessment apps provides valuable evidence regarding the coverage of conditions, and the accuracy of condition suggestion and urgency advice.”

“Compared to a similar study from five years ago, this larger and more rigorous study shows improved performance with results closer to those of physicians. It also demonstrates the importance of knowing when apps cannot handle certain conditions. While this is a preclinical study, the one-third of clinical vignettes based on real NHS 111 helpline consultations provide an important link to real urgent care challenges. Notably, both the GPs and the apps tended to perform somewhat worse when tested on those cases.”

“These results should help to determine which apps are ready for clinical testing in observational studies and then randomized controlled trials. The study design could form a model for future evaluations of symptom checker apps, and as part of assessment for regulatory approval.”

Dr. Claire Novorol, co-founder and Chief Medical Officer, Ada Health:

“Symptom assessment apps have seen rapid uptake by users in recent years as they are easy to use, convenient and can provide invaluable guidance and peace of mind. When used in a clinical setting to support - rather than replace - doctors, they also have huge potential to reduce the burden on strained healthcare systems and improve outcomes. This peer-reviewed study provides important new insights into the development and performance of these tools. In particular, it shows that there is still much work to be done to make sure that these technologies are being built to be inclusive and to cover all patients. We believe this is vital if symptom assessment apps are to fulfil their potential: human doctors don’t have the luxury of cherry-picking which patients they help and digital health must be held to the same standard.”

Results breakdown:

App	Coverage	Accuracy	Safety
GPs (for comparison)	100.0%	82.1%	97.0%
Ada Health	99.0%	70.5%	97.0%
Babylon	51.5%	32.0%	95.1%
Buoy	88.5%	43.0%	80.0%
K Health	74.5%	36.0%	87.3%
Mediktor	80.5%	36.0%	87.3%
Symptomate	61.5%	27.5%	97.8%
WebMD[2]	93.0%	35.5%	N/A
Your.MD	64.5%	23.5%	92.6%
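
For readers who want to check the headline ranges against the table, the following sketch transcribes the published values and derives two of the figures quoted in the key findings. The helper names and the two-percentage-point threshold for "close to GPs" are illustrative assumptions, not criteria from the study.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Tuple[float, float, Optional[float]]  # coverage, accuracy, safety (percent)
COVERAGE, ACCURACY, SAFETY = 0, 1, 2

# Values transcribed from the results table; None marks WebMD's N/A safety score.
RESULTS: Dict[str, Row] = {
    "GPs": (100.0, 82.1, 97.0),
    "Ada Health": (99.0, 70.5, 97.0),
    "Babylon": (51.5, 32.0, 95.1),
    "Buoy": (88.5, 43.0, 80.0),
    "K Health": (74.5, 36.0, 87.3),
    "Mediktor": (80.5, 36.0, 87.3),
    "Symptomate": (61.5, 27.5, 97.8),
    "WebMD": (93.0, 35.5, None),
    "Your.MD": (64.5, 23.5, 92.6),
}


def column_range(column: int, exclude: Sequence[str] = ("GPs",)) -> Tuple[float, float]:
    """Lowest and highest score in a column, skipping excluded rows and N/A."""
    values = [
        row[column]
        for name, row in RESULTS.items()
        if name not in exclude and row[column] is not None
    ]
    return min(values), max(values)


def close_to_gps(column: int, max_gap: float) -> List[str]:
    """Apps whose score is at most `max_gap` points below the GP benchmark."""
    benchmark = RESULTS["GPs"][column]
    return [
        name
        for name, row in RESULTS.items()
        if name != "GPs" and row[column] is not None and benchmark - row[column] <= max_gap
    ]


# Accuracy range for the apps other than Ada.
print(column_range(ACCURACY, exclude=("GPs", "Ada Health")))  # (23.5, 43.0)
# Apps within an assumed two points of the GP safety score.
print(close_to_gps(SAFETY, max_gap=2.0))  # ['Ada Health', 'Babylon', 'Symptomate']
```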

More details about the study are available in the report.

[1] Clinical vignettes are created to reflect a typical GP caseload, such as “abdominal pain in an eight-year-old boy” or “painful shoulder in a 63-year-old woman”. The transcripts used in the study had previously been used as part of an NHS Direct benchmarking exercise for recommended outcomes, and were used with full consent of NHS Direct.
[2] Because WebMD does not provide overall triage advice to users as the other apps tested do, meaningful comparison with the other apps or the tested GPs was not possible, and WebMD was excluded from the advice safety analysis in this study.