
It’s High Time ML Community Looked Into Effects of Data Cascades


“Model drifts are more common for models in high-stakes domains, such as air quality sensing or ultrasound scanning, due to a lack of curated datasets.”

When AI models are applied in high-stakes domains like health and industrial automation, data quality becomes a critical part of the whole pipeline. Models in the real world are prone to many vulnerabilities that go undetected in a controlled environment. Even the seasons have their say in model outcomes: wind can unexpectedly move image sensors in deployment, one form of cascade. Google’s research showed that even a small drop of oil or water can affect data used to train a cancer prediction model. Such small deviations can go unnoticed for two to three years before they show up in production. This is why Google researchers want the whole community to take the issue of data cascades seriously. The researchers surveyed practices and challenges among 53 AI practitioners in India, East and West African countries, and the US, working on cutting-edge, high-stakes domains of health, wildlife conservation, food systems, road safety, credit, and the environment.

About Data Cascades

Image credits: Google AI blog

Data cascades, as the name suggests, involve a string of trivial-looking errors that compound into a catastrophe. They are elusive yet avoidable. A study by the Google Research team found that 92% of the teams surveyed experienced at least one cascade. According to the researchers, data cascades are usually influenced by:

  • The activities and interactions of developers, governments, and other stakeholders.
  • Location of data collection (e.g., rural hospitals where sensor data collection occurs).

According to the researchers, model drifts are more common for models in high-stakes domains, such as air quality sensing or ultrasound scanning, because there are no pre-existing and/or curated datasets. The so-called good models work well in a lab setting where everything is under control; the real world presents unique challenges.

“In the live systems of new digital environments with resource constraints, it is more common for data to be collected with physical artefacts such as fingerprints, shadows, dust, improper lighting, and pen markings, which can add noise that affects model performance,” explained the researchers.
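The effect the researchers describe can be illustrated with a toy simulation. The sketch below (an illustrative assumption, not the study’s methodology) trains a simple nearest-centroid classifier on clean “lab” data, then evaluates it on the same data with additive sensor noise standing in for physical artefacts like dust or smudges:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "lab" data: two well-separated classes of clean sensor readings.
n = 500
X_train = np.vstack([rng.normal(0.0, 1.0, (n, 2)),
                     rng.normal(3.0, 1.0, (n, 2))])
y_train = np.array([0] * n + [1] * n)

# A minimal nearest-centroid classifier "trained" in the lab.
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(X):
    # Assign each point to the nearest class centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def accuracy(X, y):
    return (predict(X) == y).mean()

# Clean held-out data vs. the same data with deployment artefacts,
# modelled here as heavy additive noise (dust, shadows, pen markings...).
X_test = np.vstack([rng.normal(0.0, 1.0, (n, 2)),
                    rng.normal(3.0, 1.0, (n, 2))])
y_test = np.array([0] * n + [1] * n)
X_noisy = X_test + rng.normal(0.0, 2.5, X_test.shape)

print(f"clean accuracy: {accuracy(X_test, y_test):.2f}")
print(f"noisy accuracy: {accuracy(X_noisy, y_test):.2f}")
```

Nothing about the model changes between the two evaluations; only the data degrades, which is exactly why such cascades can slip past model-centric checks.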

What to do about data cascades?

Data cascades are opaque in diagnosis and manifestation, with no clear indicators, tools, or metrics to detect and measure their effects on the system. They occur when conventional AI practices are applied in high-stakes domains characterised by high accountability, interdisciplinary work, and resource constraints. A majority of AI curricula for degrees, diplomas, and nano-degrees concentrate on model development, leaving graduates under-prepared for the science, engineering, and art of working with data: data collection, infrastructure building, data documentation, and data sense-making.

  • Measure phenomenological fidelity: how accurately and comprehensively the data represents the phenomenon.
  • Incentivise the community to shift their focus from models to data.
  • Foster collaboration for data work. Teams that encountered the fewest data cascades had step-wise feedback loops throughout, ran models frequently, worked closely with application-domain experts and field partners, maintained clear data documentation, and regularly monitored incoming data.
  • Consider a nation’s socio-economic status, as the availability of curated datasets varies across geographies. Google researchers recommend setting up open dataset banks, creating data policies, and boosting the ML literacy of policy makers to address current global data inequalities.
Image credits: Google PAIR
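One concrete form the “regularly monitor incoming data” recommendation can take is a routine distribution check between training data and live data. A minimal sketch, assuming a single numeric feature and an illustrative alert threshold (neither comes from the study), using a two-sample Kolmogorov–Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Feature values seen at training time (e.g., a calibrated air-quality sensor).
train_feature = rng.normal(loc=50.0, scale=5.0, size=2000)

# Incoming production data whose distribution has drifted
# (e.g., the sensor was nudged by wind or re-sited).
live_feature = rng.normal(loc=58.0, scale=5.0, size=500)

# Two-sample KS test: are the two samples drawn from the same distribution?
stat, p_value = ks_2samp(train_feature, live_feature)

ALERT_P = 0.01  # illustrative threshold; tune to your alerting budget
if p_value < ALERT_P:
    print(f"Drift alert: KS statistic {stat:.2f}, p = {p_value:.3g}")
else:
    print("No significant drift detected")
```

Running such a check on each batch of incoming data is one cheap way to surface a cascade years earlier than a drop in production metrics would.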

As systems mature, they accumulate a wide range of configurable options: features used, how data is selected, algorithm-specific learning settings, verification methods, and so on. And since data cascades often originate early in the lifecycle of an ML system, detecting them becomes even more challenging. The researchers lament that there are no clear indicators, tools, or metrics to detect and measure data cascade effects. Another challenge is the costly system-level changes one might have to perform to identify a data cascade. Nevertheless, the researchers believe that data cascades can be avoided through the early interventions in ML development mentioned above.

PS: The story was written using a keyboard.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.