Setting the bar high: the challenges of benchmarking digital health

At Ada, we are pleased that the evaluation of digital health technologies has become a central topic in the media and within the medical community. We appreciate the efforts of publications such as Pulse to contribute to this important dialogue and are glad to see that Ada performed well in their recent tests, with our symptom assessment platform attaining almost 90 percent accuracy.

Evidence-based evaluation is crucial to any legitimate healthcare development, and to continued improvement and innovation in the provision of care. For the industry to progress and adopt new technologies, digital health innovations must be rigorously tested. Companies building these technologies must also be held accountable for ensuring that their innovations are ethical, beneficial, and medically sound. The ultimate goal is for individuals to have absolute confidence in the new technologies changing how they interact with their health and care.

Tests like Pulse’s show that the industry is rightly keen to evaluate these new technologies. However, benchmarking symptom assessment tools is far from simple, and the methodology used must be carefully considered if we want to truly give clinicians, providers, and patients the ability to make informed decisions about which technologies they should use. I therefore wanted to share our views on the complexities and challenges that need to be addressed if, as an industry, we are to benchmark properly.

As a team we believe it is our responsibility to hold ourselves to the highest possible standards. We employ a number of methods that combine rigorous internal and external review to ensure that these standards are met. We encourage fellow companies to do the same when measuring different aspects of their technology.

In turn, we believe that to effectively and accurately benchmark and test emerging digital health technologies, external evaluation should also be both rigorous and sophisticated. For example, in settings where a definitive diagnosis or ‘ground truth’ might not be available, alternative approaches must be employed to ensure that evaluation is objective and reliable. For symptom assessment applications such as Ada, at the very least this means multiple expert clinician opinions should be sought as the benchmark, alongside clinical guidelines and evidence where available. In addition, the information and guidance provided should be evaluated as a whole. Health is complex, and there are multiple factors that should be taken into account within a clinical assessment and the clinical decision-making process.

One challenge in particular when benchmarking in healthcare is differing medical opinion. It’s not unusual for two clinicians to have contrasting views on what they consider an appropriate differential and advice level for any given medical assessment. Indeed, in my experience, having conducted numerous such evaluations at significant scale and across the breadth of medical presentations and acuity levels, this inter-clinician variability is the norm rather than the exception. Many benchmarking evaluations rely on only one clinician’s opinion and a small number of test cases. While these can be valuable, using multiple expert opinions and a larger data set enables evaluators to draw more meaningful conclusions, with greater statistical confidence.
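To make the two points above concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how Ada or Pulse run their evaluations. It assumes two hypothetical clinicians who each assign one of three made-up advice levels to the same ten cases, and measures their agreement with Cohen's kappa, a standard chance-corrected agreement statistic. It then uses the Wilson score interval to show how the uncertainty around an accuracy figure narrows as the number of test cases grows. All ratings and counts are invented for the example.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Share of cases on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (roughly 95% for z=1.96)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical advice levels from two clinicians on the same ten cases.
clinician_1 = ["self-care", "gp", "gp", "emergency", "gp",
               "self-care", "gp", "emergency", "self-care", "gp"]
clinician_2 = ["self-care", "gp", "emergency", "emergency", "self-care",
               "self-care", "gp", "gp", "self-care", "gp"]

kappa = cohen_kappa(clinician_1, clinician_2)
print(f"Cohen's kappa between the two clinicians: {kappa:.3f}")

# Same observed accuracy (90%), measured on a small and a larger test set.
small = wilson_interval(9, 10)
large = wilson_interval(90, 100)
print(f"10 cases:  {small[0]:.2f} to {small[1]:.2f}")
print(f"100 cases: {large[0]:.2f} to {large[1]:.2f}")
```

In this invented data the two clinicians agree on seven of ten cases, yet the kappa is noticeably lower than 0.7 once chance agreement is removed. Likewise, the interval around a 90 percent accuracy estimate is much wider for ten cases than for a hundred, which is the practical reason a single rater and a handful of cases support only limited conclusions.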

Knowing that clinicians will not always agree on the outcome of assessments or suggested next steps, we always use multiple clinicians and feedback processes in our own internal testing, as well as in all external evaluations.

We also recognize that the condition considered most likely should not be the sole determinant of guidance around next steps; all aspects of the assessment should be taken into account. This is why, in some cases, the guidance provided to users in our app will be escalated if a red flag feature is present, if a potentially more serious condition is considered possible, or if timely clinical review or treatment is likely to improve outcomes. This guidance is based on all symptoms and information shared by the patient, as well as the broader differential, and never just on a single condition.
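The escalation principle described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not Ada's actual implementation: the `AdviceLevel` names, the `Assessment` fields, and the level each trigger escalates to are all hypothetical, chosen only to show that the final guidance is the highest level implied by any part of the assessment rather than by the most likely condition alone.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AdviceLevel(IntEnum):
    """Hypothetical advice levels, ordered from least to most urgent."""
    SELF_CARE = 1
    ROUTINE_REVIEW = 2
    URGENT_REVIEW = 3
    EMERGENCY = 4


@dataclass
class Assessment:
    """Hypothetical summary of one symptom assessment."""
    # Advice level implied by the most likely condition on its own.
    top_condition_level: AdviceLevel
    # True if any red flag feature was reported.
    red_flag_present: bool = False
    # Advice level of the most serious condition still considered possible.
    serious_condition_level: Optional[AdviceLevel] = None
    # True if timely clinical review or treatment is likely to help.
    timely_review_improves_outcome: bool = False


def advice_for(a: Assessment) -> AdviceLevel:
    """Return the highest advice level implied by any part of the assessment."""
    level = a.top_condition_level
    if a.red_flag_present:
        # Assumed floor for red flags; a real system would be more granular.
        level = max(level, AdviceLevel.URGENT_REVIEW)
    if a.serious_condition_level is not None:
        level = max(level, a.serious_condition_level)
    if a.timely_review_improves_outcome:
        # Assumed floor when early review is likely to improve outcomes.
        level = max(level, AdviceLevel.ROUTINE_REVIEW)
    return level


# Most likely condition suggests self-care, but a serious condition is possible.
example = Assessment(
    top_condition_level=AdviceLevel.SELF_CARE,
    serious_condition_level=AdviceLevel.EMERGENCY,
)
print(advice_for(example).name)
```

The design choice worth noting is that the function only ever raises the level and never lowers it, so an evaluator who compares the app's advice solely against the most likely condition would misread every escalated case.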

At an industry level, we believe establishing a culture of accountability and evidence-based decision-making is paramount, and that benchmarking AI-powered health apps will require a nuanced and collaborative effort between clinicians, industry, and regulators. This should include defining the data and parameters with which solutions should be tested, understanding how errors and biases can occur, and working to ensure that assessments are based on all the available information.

As a first step to address some of these challenges, we are proud to have joined the World Health Organization (WHO) and the International Telecommunication Union’s (ITU) joint AI for Health (AI4H) Focus Group to establish standardized benchmarking for AI technologies in health. Ada has been selected as the topic driver for the AI-based symptom assessment domain of the AI4H Focus Group. In this role we are working alongside several other digital health companies in our space, and a wide range of experts and stakeholders, to define and establish a globally recognized standardized approach to benchmarking in our particular field. The scope is comprehensive, including the ethical collection of data, benchmarking infrastructure, selecting scores and metrics, gathering undisclosed datasets, and reporting.

If we are to benefit from the full potential of emerging digital health technologies, then regulators, developers, clinicians, and patients need to be able to have an honest and open conversation about this topic. We welcome further conversations around this with clinicians, experts, and other key stakeholders, and appreciate how studies like Pulse’s are continuing to move the dialogue forward. Benchmarking is vital to the development and adoption of new health technologies, and we as an industry have a responsibility to work together to ensure that the methods used for evaluation are held to the same high standards and expectations of rigor as the solutions they are designed to test.