From Label Error Detection to Correction: A Modular
Framework and Benchmark for Object Detection Datasets

Sarina Penquitt1,4Jonathan Klees1,4Rinor Cakaj3Daniel Kondermann3Matthias Rottmann2Lars Schmarje3

1 Department of Mathematics, University of Wuppertal, Wuppertal, Germany
2 Institute of Computer Science, Osnabrück University, Osnabrück, Germany
3 Quality Match, Heidelberg, Germany
4 Equal contribution


Abstract

We introduce a semi-automated framework for label-error correction called REC✓D (Rechecked). Building on existing detectors, the framework pairs their error proposals with lightweight, crowdsourced microtasks. These tasks enable multiple annotators to independently verify each candidate bounding box, and their responses are aggregated to estimate ambiguity and improve label quality. To demonstrate the effectiveness of REC✓D, we apply it to the class pedestrian in the KITTI dataset. Our crowdsourced review yields high-quality corrected annotations, which indicate a rate of at least 24% of missing and inaccurate annotations in original annotations. This validated set will be released as a new real-world benchmark for label error detection and correction. We show that current label error detection methods, when combined with our correction framework, can recover hundreds of errors in the time it would take a human to annotate bounding boxes from scratch. However, even the best methods still miss up to 66% of the true errors and with low quality labels introduce more errors than they find. This highlights the urgent need for further research, now enabled by our released benchmark.

Method Overview

We propose a semi-automatic framework that combines label error detection methods with a microtask based review process to efficiently correct label errors. It consists of three main stages: object detection, label error proposal, and microtask based correction.
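The three stages can be sketched as a simple composition. This is a hedged illustration only; the function names below are placeholders, not the released implementation:

```python
def correct_labels(images, annotations, detect, propose, review):
    """Illustrative three-stage pipeline: each stage is passed in as a callable."""
    predictions = detect(images)                   # stage 1: object detection
    proposals = propose(predictions, annotations)  # stage 2: label error proposals
    return review(proposals)                       # stage 3: microtask-based correction
```

In the framework, `detect` corresponds to the trained detectors, `propose` to the label error detection methods, and `review` to the crowdsourced microtask step.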

Figure: Overview of our workflow.

Figure: Microtasks interface.

Object Detection Models

A trained object detector is used to obtain predictions on the respective dataset. We use YOLOX and Cascade R-CNN; the detectors are trained sequentially, with the output of one detector serving as the training set for the subsequent one.
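A minimal sketch of this sequential scheme, with a toy detector class standing in for YOLOX and Cascade R-CNN (the real training uses the respective detection frameworks):

```python
class MemorizingDetector:
    """Toy stand-in for a real detector: it simply memorizes its training labels."""
    def fit(self, images, labels):
        self.labels = labels
    def predict(self, images):
        return self.labels

def sequential_training(detector_factories, images, initial_labels):
    """Train detectors in sequence: detector i+1 trains on detector i's output."""
    labels = initial_labels
    trained = []
    for make_detector in detector_factories:
        model = make_detector()
        model.fit(images, labels)       # train on the current label set
        labels = model.predict(images)  # predictions become the next training set
        trained.append(model)
    return trained, labels
```

With real detectors, each `fit`/`predict` pair would be a full training and inference run of the corresponding framework.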

Label Error Detection Methods

We automatically generate label error proposals by integrating existing label error detection methods, including MetaDetect, loss-based instance-wise scoring, and ObjectLab. For every detected bounding box, each method predicts a score reflecting the likelihood that the box indicates a label error.
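As a simplified, hedged stand-in for such scores (not the actual MetaDetect, loss-based, or ObjectLab computation), a predicted box can be scored by how poorly it is matched by any ground-truth box:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def label_error_scores(pred_boxes, gt_boxes):
    """Score each predicted box: a high score means no ground-truth box
    matches it well, i.e. a likely missing or inaccurate annotation."""
    return [1.0 - max((iou(p, g) for g in gt_boxes), default=0.0)
            for p in pred_boxes]
```

The actual methods use richer signals (e.g. uncertainty features or per-instance losses), but all reduce to one score per detected box, as above.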

Error Correction

The label error proposals are then reviewed manually. To this end, we created short, easy-to-answer questions referred to as microtasks; with microtasks, annotators answer one simple question at a time. Each task is repeated by several people independently, which yields an estimate of the underlying distribution of answers to that question. We used this microtask setup to construct the validated annotations of the KITTI dataset.
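As a hedged sketch (the framework's actual aggregation may differ in detail), repeated binary answers can be turned into a soft label by taking the fraction of confirmations:

```python
def soft_label(answers):
    """Fraction of annotators answering 'yes' (1) for one microtask.
    Values near 0 or 1 indicate agreement; values near 0.5 indicate
    an ambiguous box."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(answers) / len(answers)
```

For example, five annotators answering `[1, 1, 1, 0, 1]` yield a soft label of 0.8, which both validates the box and quantifies its remaining ambiguity.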

Metrics at a Glance

Number of identified label errors for the class pedestrian in the KITTI dataset

The number of identified label errors depends on the probability threshold for soft-label annotations and on the minimal height of considered objects. By comparing the original and validated annotations, we identify a substantial number of label errors in the original dataset, even when considering only evident errors.
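As a hedged illustration (not the paper's exact evaluation code), such a count reduces to a filter over soft-label probability and box height; the thresholds below mirror the defaults used for the most evident errors:

```python
def count_label_errors(candidates, p_min=0.8, min_height=40):
    """Count label error candidates whose soft-label probability and
    bounding box height (in pixels) meet the chosen thresholds.
    `candidates` is a list of (probability, height) pairs."""
    return sum(1 for prob, height in candidates
               if prob >= p_min and height >= min_height)
```

Raising `p_min` or `min_height` restricts the count to the most evident errors; lowering them admits smaller or more ambiguous boxes.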


Validated GT probability vs. Human Perception

Comparison of the validated GT probability of being a pedestrian (y-axis) with human perception. Each dot represents an annotation by one of three expert annotators together with its corresponding probability in the validated GT. The diamond marks the mean, the dashed lines the standard deviation, and the star the median of each annotator's answers.

Figure 4

Most evident overlooked objects in KITTI GT

Overlooked objects (red) that are not contained in the original annotations (green). These examples have a soft-label probability of 0.8 or higher and a bounding box height of at least 40 pixels.


Resources & Links

Acknowledgments

S.P. and M.R. acknowledge support by the German Federal Ministry of Education and Research (BMBF) within the junior research group project “UnrEAL” (grant no. 01IS22069).

J.K. and M.R. acknowledge support by the German Federal Ministry of Education and Research (BMBF) within the project “RELiABEL” (grant no. 01IS24019B).

R.C., D.K. and L.S. also acknowledge support within the project “RELiABEL” (grant no. 01IS24019A).

Citation

@article{penquitt2025labelerrordetectioncorrection,
  title={From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets},
  author={Penquitt, S. and Klees, J. and Cakaj, R. and Kondermann, D. and Rottmann, M. and Schmarje, L.},
  journal={arXiv:2508.06556},
  year={2025}
}

Contact

Have questions or want to collaborate? Reach out: