“Model drifts are more common when models operate in high-stakes domains, such as air quality sensing or ultrasound scanning, due to the lack of curated datasets.”
When AI models are applied in high-stakes domains like health and industrial automation, data quality becomes a critical part of the whole pipeline. Models in the real world are prone to many vulnerabilities that go undetected in a controlled environment. Even the seasons have their say in model outcomes, and wind can unexpectedly move image sensors in deployment, triggering a form of cascade. Google’s research showed that even a small drop of oil or water can corrupt data used to train a cancer prediction model. These small deviations can go unnoticed for two to three years before they show up in production. This is why Google researchers want the whole community to take the issue of Data Cascades seriously. The researchers surveyed practices and challenges among 53 AI practitioners in India, East and West African countries, and the US, working in cutting-edge, high-stakes domains of health, wildlife conservation, food systems, road safety, credit, and environment.
About Data Cascades
Data Cascades, as the name suggests, involve a string of trivial-looking errors that compound into a catastrophe. Data cascades are elusive yet avoidable. A study by the Google Research team found that 92% of the teams surveyed experienced at least one cascade. According to the researchers, data cascades are usually influenced by:
- The activities and interactions of developers, governments, and other stakeholders.
- Location of data collection (e.g., rural hospitals where sensor data collection occurs).
According to the researchers, model drifts are more common when models operate in high-stakes domains, such as sensing air quality or performing an ultrasound scan, because there are no pre-existing and/or curated datasets. So-called good models work well in a lab setting where everything is under control; the real world presents unique challenges.
“In the live systems of new digital environments with resource constraints, it is more common for data to be collected with physical artefacts such as fingerprints, shadows, dust, improper lighting, and pen markings, which can add noise that affects model performance,” explained the researchers.
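The artefacts the researchers describe can often be caught with cheap screening at ingestion time, before flawed images ever reach training. The sketch below is an illustrative assumption, not from the paper: the function name, thresholds, and the Laplacian-variance sharpness proxy are all hypothetical choices for flagging improper lighting or smudged optics on a grayscale image.

```python
import numpy as np

def flag_image_quality(img: np.ndarray,
                       dark_thresh: float = 0.15,
                       bright_thresh: float = 0.90,
                       blur_thresh: float = 0.005) -> list[str]:
    """Return quality warnings for a grayscale image with values in [0, 1].

    Thresholds are illustrative; real deployments would tune them
    against known-good data from the same sensor.
    """
    issues = []
    mean = img.mean()
    if mean < dark_thresh:
        issues.append("underexposed")   # e.g. improper lighting, shadows
    if mean > bright_thresh:
        issues.append("overexposed")    # e.g. glare or reflections
    # Variance of a discrete Laplacian response is a cheap sharpness
    # proxy; very low values suggest blur, smudges, or an occluded lens.
    lap = (-4 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    if lap.var() < blur_thresh:
        issues.append("low_sharpness")
    return issues

# A high-contrast checkerboard passes; a near-constant frame is flagged.
rng = np.random.default_rng(0)
sharp = np.indices((64, 64)).sum(axis=0) % 2 * 0.8 + 0.1
flat = np.full((64, 64), 0.5) + rng.normal(0, 0.001, (64, 64))
print(flag_image_quality(sharp))  # []
print(flag_image_quality(flat))   # ['low_sharpness']
```

Checks like these do not fix the upstream collection problem, but they make the artefacts visible instead of letting them silently degrade model performance.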
What to do about data cascades?
Data cascades are opaque in diagnosis and manifestation, with no clear indicators, tools, or metrics to detect and measure their effects on the system. They occur when conventional AI practices are applied in high-stakes domains characterised by high accountability, interdisciplinary work, and resource constraints. A majority of curricula for degrees, diplomas, and nano-degrees in AI concentrate on model development, leaving graduates under-prepared for the science, engineering, and art of working with data, including data collection, infrastructure building, data documentation, and data sense-making. To address these gaps, the researchers suggest the following:
- Measure phenomenological fidelity, i.e., how accurately and comprehensively the data represents the phenomena of interest.
- Incentivise the community to shift their focus from models to data.
- Foster collaboration for data work. Teams that encountered the fewest data cascades commonly had step-wise feedback loops throughout the project, ran models frequently, worked closely with application-domain experts and field partners, maintained clear data documentation, and regularly monitored incoming data.
- The socio-economic status of a nation should be considered, as the availability of curated datasets varies across geographies. Google researchers recommend setting up open dataset banks, creating data policies, and boosting the ML literacy of policymakers to address current global data inequalities.
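The "regularly monitored incoming data" practice above can be sketched with a simple drift check: compare each incoming batch against a training-time baseline using the population stability index (PSI), a standard distribution-shift statistic. The function, the synthetic data, and the 0.2 alert threshold are illustrative assumptions, not details from the Google study.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a new sample."""
    # Bin edges come from the baseline so both samples share them.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # A small epsilon avoids log(0) for empty bins.
    e_frac, a_frac = e_frac + 1e-6, a_frac + 1e-6
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 5_000)   # distribution seen at training time
stable = rng.normal(0.0, 1.0, 1_000)     # fresh batch, same conditions
drifted = rng.normal(1.5, 1.0, 1_000)    # e.g. a shifted or dusty sensor

# Common rule of thumb: PSI above 0.2 signals significant drift.
print(f"stable batch:  {psi(baseline, stable):.3f}")
print(f"drifted batch: {psi(baseline, drifted):.3f}")
```

Running a check like this per feature on every incoming batch turns an otherwise invisible cascade into an alert that can be investigated before it reaches production models.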
As systems mature, they usually end up with a wide range of configurable options, such as the features used, how data is selected, algorithm-specific learning settings, and verification methods. And since data cascades often originate early in the lifecycle of an ML system, diagnosing them becomes even more challenging. The researchers lament that there are no clear indicators, tools, or metrics to detect and measure data cascade effects. Another challenge is the costly system-level changes that may be required once a data cascade is identified. Nevertheless, the researchers believe that data cascades can be avoided through early interventions in ML development, as outlined above.