Automatically Detect Annotation Errors in Image/Text Tagging Datasets

Hey guys! Many of us in ML work with multi-label data, where each image or text is tagged with multiple labels. These datasets often contain label errors and/or missing tags (check what we found below in the CelebA dataset) that make it hard to train highly accurate ML models. Support for multi-label data was one of our most-requested features, so we added it, blogged about it, benchmarked it, and published all of the research.

We are excited to share our newest research on algorithms that automatically find label errors in multi-label classification datasets. Image and document tagging are important instances of multi-label classification, where each example can belong to multiple (or none) of K possible classes. Because annotating such data requires many decisions per example, these datasets often contain tons of label errors, which harm the performance of ML models.

We’ve open-sourced our algorithms in the recent release of cleanlab v2.2. Using them takes a single call to cleanlab.filter.find_label_issues:

from cleanlab.filter import find_label_issues

# labels: list of lists of (multiple) labels of each example
# pred_probs: predicted class probabilities from any trained classifier
ranked_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)
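If the input format is unclear, here is a rough, dependency-free sketch of what the two arguments look like and of the general idea of ranking examples by how much a model disagrees with their tags. To be clear, this is a simplified stand-in and not cleanlab's actual algorithm: the function name rank_possible_tag_issues, the min-over-classes scoring rule, and the toy numbers are all assumptions made up for illustration. It also assumes each class probability is predicted independently, so a row need not sum to 1.

```python
# Simplified illustration only: NOT cleanlab's actual algorithm.
# rank_possible_tag_issues is a hypothetical helper, not a cleanlab function.
def rank_possible_tag_issues(labels, pred_probs):
    """Return example indices sorted from most to least suspicious."""
    scores = []
    for given_tags, probs in zip(labels, pred_probs):
        given = set(given_tags)
        # Per class, how strongly the model agrees with the annotation:
        # p if the tag was given, 1 - p if it was not.
        per_class = [p if k in given else 1.0 - p for k, p in enumerate(probs)]
        # Assumed rule: an example is scored by its least plausible tag decision.
        scores.append(min(per_class))
    # Lowest agreement first.
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Toy data with K = 3 classes (values invented for illustration).
labels = [[0, 2], [1], [], [0]]  # one list of class indices per example
pred_probs = [
    [0.9, 0.1, 0.8],    # agrees with tags [0, 2]
    [0.2, 0.3, 0.9],    # suggests class 2 is missing and class 1 is doubtful
    [0.1, 0.05, 0.1],   # agrees with "no tags"
    [0.6, 0.1, 0.3],    # mild agreement with tag [0]
]

ranked = rank_possible_tag_issues(labels, pred_probs)
print(ranked)
```

In this toy run the second example (index 1) comes out as the most suspicious, since the model is confident about a class that was never tagged. The real library returns a ranking of the same kind, but computes its scores differently.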

SURPRISE: Running the new find_label_issues() function on the CelebA image tagging dataset reveals around 30,000 mislabeled images! Check out a few of them in the blog post above!

Hope you find these practical tools useful in your real-world data science and ML applications!
