Data Nutrition Project
Helping data scientists understand what's inside datasets before they're used in machine learning.Problem
Artificial intelligence (AI) systems built on incomplete or biased data will often exhibit problematic outcomes, leading to negative unintended consequences that affect the very communities that are already marginalized, underserved, or underrepresented. And yet there are few, if any, standard methods of data analysis to check for the ‘health’ of data, particularly before model development.
Our Approach
With an aim of mitigating harms caused by automated decision-making systems, The Dataset Nutrition Label tool enhances context, content, and legibility of datasets. Drawing from the analogy of the Nutrition Facts Label on food, the Label highlights the “ingredients” of a dataset to help shed light on how (or whether) the dataset is healthy for use.
Led by Innovation Lab Fellow, Kasia Chmielinski, the Dataset Nutrition Label is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset “ingredients” before AI model development. This framework is optimized for the data practitioner journey and leverages potential use cases for the data alongside alerts or flags that highlight known issues and possible mitigation strategies.
The Label, now in its third generation and being leveraged across a diversity of domains including healthcare and humanitarian use cases, is intended to drive robust data analysis practices by making it easier and faster for data scientists to interrogate and select datasets; increase overall quality of models by driving the use of better and more appropriate datasets for those models; and enable the creation and publishing of responsible datasets by those who collect, clean and publish data.
At a glance information
The web-based label includes four distinct panes of information: About the Label (top bar), Metadata (left side), Use Case Matrix (top right), and Inference Risks (bottom right).
Milestones Ahead
The Data Nutrition Project is a research organization and product development team composed of technologists, designers, academics and scientists. Together, we are excited to continue the work of driving better AI through the exploration and development of practical tools.
We plan to launch the third generation of the Dataset Nutrition Label and Label Maker Tool (beta) in early 2023, and will continue working on a number of initiatives for this year and beyond:
- Publish a Labels Library. Creating additional Labels for high-impact datasets often used to train AI.
- Conduct User Research. With data scientists, we plan to conduct user research to refine the utility and legibility of the Label for data scientists.
- Research the Landscape. Continue research on the broader ecosystem of labeling and algorithmic accountability and make this broadly available for academics and policy makers.