New paper on Automatically Detecting Label Errors in Entity Recognition Data

Think your entity recognition data is perfectly labeled? Just-published research investigates automated methods to find sentences containing mislabeled words in such datasets. Mislabeling is especially common in ML tasks like token classification, where a label must be chosen for each individual word. It is exhausting to get every single word labeled right!

This paper benchmarks a variety of candidate algorithms on real-world data (with actual label errors, rather than the synthetic errors often considered in academic studies) and identifies a straightforward approach that finds mislabeled words with better precision/recall than the alternatives.
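To give a flavor of what such an approach can look like, here is a minimal sketch (my own illustration, not necessarily the paper's exact algorithm): score each token by the model's predicted probability of its given label, then score each sentence by its worst token, so that low-scoring sentences are the ones most likely to contain a mislabeled word.

```python
import numpy as np

def sentence_label_quality(pred_probs, labels):
    """Hypothetical helper illustrating token-level self-confidence scoring.

    pred_probs: list of (num_tokens, num_classes) arrays, one per sentence,
                holding a model's predicted class probabilities per token.
    labels: list of integer label sequences, one per sentence.
    Returns one quality score per sentence (lower = more likely mislabeled).
    """
    scores = []
    for probs, labs in zip(pred_probs, labels):
        labs = np.asarray(labs)
        # Probability the model assigns to each token's given label.
        token_scores = probs[np.arange(len(labs)), labs]
        # A sentence is only as trustworthy as its least-confident token.
        scores.append(token_scores.min())
    return np.array(scores)

# Toy example with 2 classes (say, O vs. ENTITY):
pred_probs = [
    np.array([[0.90, 0.10], [0.20, 0.80]]),  # labels look consistent
    np.array([[0.95, 0.05], [0.90, 0.10]]),  # second token's label looks wrong
]
labels = [[0, 1], [0, 1]]

quality = sentence_label_quality(pred_probs, labels)
suspects = np.argsort(quality)  # most suspect sentences first
```

Ranking all sentences by such a score lets a reviewer inspect only the most suspicious ones first, rather than re-checking the whole dataset.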

This algorithm is now available to run on your own text data in one line of open-source code. Running it on the famous CoNLL-2003 entity recognition dataset revealed hundreds of label errors.