The Oximeter Oxymoron
Yesterday the Independent Review of Equity in Medical Devices published a report on bias in medical devices. Set up in 2022 by the then Secretary of State for Health and Social Care, Sajid Javid, the review sought to establish the extent and impact of ethnic and other unfair biases in the performance of medical devices commonly used in the NHS.
The report found that the pulse oximeter, a life-saving device during the COVID-19 pandemic, can over-estimate the amount of oxygen in the blood of people with darker skin tones. This is not entirely unexpected: different skin tones affect how light is absorbed and reflected. What is surprising is that the device wasn’t tested for this bias, that its inventors appear neither to have anticipated it nor mitigated against it, and that nobody thought testing for such bias should be a requirement.
In my field, machine learning, the narrower branch of Artificial Intelligence that currently sees the most use and media attention, AI systems are assumed to be biased.
This is not without cause. There is a rich history of AI products producing prejudicial outcomes contrary to their original intent: facial recognition algorithms that were up to 100 times more likely to misidentify Asian and African American faces than white faces, mortgage decision-making programmes that systematically overcharged people of colour, and recruitment software that discriminated against women.
With such examples, it is unsurprising that Government recommendations, such as the UK’s AI Whitepaper, and full-blown regulations, such as the EU AI Act, ask the creators of AI systems to explicitly test for a number of equality and diversity parameters, including age, gender, and ethnicity. The EU AI Act requires that such bias be anticipated, measured, and mitigated.
And mitigating these biases is entirely possible.
For example, at BLUESKEYE AI we use machine learning to monitor minute changes in facial and vocal behaviour that can be early warning signs of anxiety and depression, as well as of other conditions that manifest through the face. And we can demonstrate that our AI systems are unbiased with respect to age, gender, and ethnicity.
How? By using a representative dataset of people of all apparent ages, ethnicities, and genders, and running statistical tests to show that the predictions do not differ with age, gender, or ethnicity.
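As a rough illustration of what such a test might look like (this is not our actual pipeline, and the file and column names are hypothetical), one can compare the distribution of prediction errors across demographic groups:

```python
# Hypothetical sketch: test whether model errors differ between groups.
# File and column names are illustrative only.
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("validation_predictions.csv")  # one row per test subject
df["prediction_error"] = df["predicted_score"] - df["clinical_score"]

for attribute in ["age_group", "gender", "ethnicity"]:
    groups = [g["prediction_error"].values for _, g in df.groupby(attribute)]
    stat, p_value = kruskal(*groups)  # non-parametric test of equal distributions
    print(f"{attribute}: H={stat:.2f}, p={p_value:.3f}")
    # A large p-value means no evidence that errors differ between the groups.
```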
Why? Because we designed our AI to penalise any bias. As a result, when exposed to large, diverse datasets, the AI system implicitly learns to identify and account for individual differences, such as the different light absorption and reflection properties of different skin tones.
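A minimal sketch of what “penalising bias” can mean in practice, assuming a simple regression setting; the group-mean penalty and the weighting are assumptions for illustration, not our production objective:

```python
# Illustrative only: one way to add a bias penalty to a training loss.
import numpy as np

def loss_with_bias_penalty(predictions, targets, group_ids, lam=1.0):
    task_loss = np.mean((predictions - targets) ** 2)   # ordinary accuracy term
    overall_error = np.mean(predictions - targets)
    penalty = 0.0
    for g in np.unique(group_ids):                      # e.g. ethnicity labels
        mask = group_ids == g
        group_error = np.mean(predictions[mask] - targets[mask])
        penalty += (group_error - overall_error) ** 2   # punish group-specific error
    return task_loss + lam * penalty
```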
Traditional, or “hand-crafted”, measurement systems don’t use data to train the function they’re trying to approximate. A scientist writes down a formula, on paper or in software, based on their knowledge of physics and chemistry and the experience of experts. A small amount of data is then used to verify that the device performs as planned, and tweaks are made to the formula or software as needed to get better results on that test data.
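To make the contrast concrete, here is a hand-crafted device in miniature: a simplified calibration of the kind described in the pulse-oximetry literature, where the designer writes the formula down rather than learning it. The constants are illustrative; real devices use device-specific, empirically fitted curves.

```python
# A textbook-style pulse-oximetry calibration: a fixed formula written by
# the designer, not learned from data. Constants are illustrative only.
def spo2_estimate(ac_red, dc_red, ac_infrared, dc_infrared):
    # "Ratio of ratios": pulsatile over static absorbance at two wavelengths
    r = (ac_red / dc_red) / (ac_infrared / dc_infrared)
    # Linear approximation of the calibration curve
    return 110.0 - 25.0 * r
```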
Machine learning systems, on the other hand, don’t start with a prescribed function. Instead, they learn from examples of combinations of the sensor input data (say, the reflection of light by the blood as captured by a sensor on your index finger) and the desired diagnostic output (say, heart rate). Machine learning systems can approximate incredibly complex input-output relations and can learn all the anticipated and unanticipated dependencies between them, including the effects of individual differences in protected characteristics such as age, gender, and ethnicity.
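The machine-learning counterpart to the hand-crafted sketch above learns that mapping from examples instead of prescribing it. A hedged sketch, assuming a hypothetical dataset of raw optical signals paired with reference oxygen-saturation readings:

```python
# Illustrative only: learn the input-output mapping from data rather than
# writing a formula. File, column, and feature names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv("oximeter_recordings.csv")        # raw optical signals per subject
X = data[["ac_red", "dc_red", "ac_infrared", "dc_infrared"]]
y = data["reference_spo2"]                           # ground truth from a blood-gas analyser

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("Held-out R^2:", model.score(X_test, y_test))
```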
So, ironically, it seems very likely that AI systems, assumed to be biased, will, when correctly regulated and correctly built, end up being the most unbiased medical devices, while more traditionally developed devices, such as the pulse oximeter, will be less so.