
Detecting mislabeled samples in image datasets is a notoriously difficult task, and one that matters a great deal for training computer vision models. That’s why at Visual Layer Research, we’ve developed a state-of-the-art mislabel detection algorithm that detects label errors with unprecedented precision and generalizes across different data distributions and domains. We call this algorithm LabelRank, hinting at its ability to quantitatively rank the quality of an image or object label.
tl;dr - results summary
In our head-to-head evaluations against seven variants of well-regarded mislabel detectors, LabelRank consistently outperformed every other offering. The full experimental setup and methodology are detailed later in this post, but for those in a hurry, we have summarized the results in the table below. Note that all of the algorithms tested were given the same image embeddings (CLIP ViT-B/32) wherever an algorithm uses precomputed embeddings in its approach.
Background
Label noise in computer vision datasets can have detrimental effects on model training: a noisy dataset must be larger to reach the same post-training accuracy that a dataset with lower label noise would achieve.¹ Unfortunately, such label noise appears to be pervasive in many of the open-source and academic datasets used to train vision models,² and the cost of manually reviewing every label is onerous. To reduce the cost of label correction, we need a method for surfacing potentially mislabeled samples with high recall and without sacrificing precision. Because this is an important challenge in data-driven machine learning, significant research has gone into developing such methods, and a variety of approaches exist in the literature.³ Even so, there remains significant room for improvement, and Visual Layer Research has spent the past few months pursuing exactly that.
Evaluation Methodology
To evaluate LabelRank, we compared its performance against other well-regarded mislabel detection algorithms, using the Caltech101 dataset⁴ as our benchmark. The experimental framework was designed to assess performance under different types and degrees of label noise, using CLIP ViT-B/32 as the embedding model for all evaluations.
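For reference, embeddings like these can be extracted with any off-the-shelf CLIP implementation. The snippet below is a minimal sketch using the Hugging Face transformers library; it is illustrative rather than our exact pipeline, and the `embed_images` helper and its defaults are our own naming choices.

```python
# Minimal sketch: extracting CLIP ViT-B/32 image embeddings with Hugging Face
# transformers. Illustrative only, not necessarily the exact pipeline we ran.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths, batch_size=32):
    """Return L2-normalized CLIP image embeddings for a list of image file paths."""
    chunks = []
    for start in range(0, len(paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in paths[start:start + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        feats = model.get_image_features(**inputs)               # (B, 512) for ViT-B/32
        chunks.append(feats / feats.norm(dim=-1, keepdim=True))  # normalize for cosine similarity
    return torch.cat(chunks)
```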
We benchmarked LabelRank against:
- RER, from Voxel51’s published algorithm: Class-wise Autoencoders Measure Classification Difficulty and Detect Label Mistakes⁵
- Confident Learning, from CleanLab’s published algorithm: Confident Learning: Estimating Uncertainty in Dataset Labels⁶
- SimiFeat, from: Detecting Corrupted Labels Without Training a Model to Predict⁷
- SelfClean, from: Intrinsic Self-Supervision for Data Quality Audits⁸
- SEMD, from: An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets⁹
Dataset Preparation and Mislabel Seeding
We seeded mislabels at two different noise levels, 5% and 30% of samples, simulating patterns of annotation error that commonly occur in real-world scenarios. We introduced isolated label errors based on embedding similarity. Given a dataset D with n samples, where each sample i has a ground-truth label yᵢ, the process follows these steps:
1. For each sample xᵢ, we compute its embedding eᵢ using CLIP ViT-B/32.
2. We construct a similarity matrix S from the cosine similarities between all embeddings.
3. For each class c, we identify candidate mislabels by selecting the samples most similar to samples from other classes, subject to the constraint:
Where Nₖ(i) represents the k-nearest neighbors of sample i, Dc is the set of samples in class c, and α is the maximum fraction of samples per class that can be mislabeled (set to 0.4 in our experiments).
This seeding methodology maintains strict control over the total percentage of introduced mislabels (p) while ensuring:
where y’ᵢ represents the potentially corrupted label and ε is the maximum allowed deviation (set to 0.25).
This setup enabled us to systematically evaluate LabelRank against other mislabel detection algorithms under different noise patterns and intensities.
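Since the exact constraint formulas are not reproduced above, here is an illustrative sketch of how such embedding-similarity-based seeding can be implemented. The function name `seed_mislabels` and the neighbor-voting candidate score are our own illustrative choices; only the parameters p, α, and k come from the description above, and handling of the ε tolerance is omitted because this sketch hits the budget exactly.

```python
# Illustrative sketch of embedding-similarity-based mislabel seeding (not our exact code).
# Assumes `embeddings` is an (n, d) array of L2-normalized CLIP features and
# `labels` is an (n,) integer array of ground-truth class indices.
import numpy as np

def seed_mislabels(embeddings, labels, p=0.05, alpha=0.4, k=10):
    n = len(labels)
    sims = embeddings @ embeddings.T              # cosine similarity matrix S
    np.fill_diagonal(sims, -np.inf)               # ignore self-similarity

    # Score each sample by the fraction of its k nearest neighbors N_k(i)
    # that belong to a different class (higher -> better mislabel candidate).
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    other_class = labels[nn_idx] != labels[:, None]
    scores = other_class.mean(axis=1)

    budget = int(round(p * n))                    # total mislabels to introduce
    quota = {c: int(alpha * np.sum(labels == c))  # per-class cap from alpha
             for c in np.unique(labels)}
    new_labels = labels.copy()

    for i in np.argsort(-scores):                 # most confusable samples first
        if budget == 0:
            break
        c = labels[i]
        if scores[i] == 0 or quota[c] == 0:
            continue
        # Flip to the class of the most similar differently-labeled neighbor.
        donor = next(j for j in nn_idx[i] if labels[j] != c)
        new_labels[i] = labels[donor]
        quota[c] -= 1
        budget -= 1

    return new_labels
```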
Results
LabelRank successfully detects the vast majority (>99% in the 5% condition) of mislabeled samples, even when the seeded mislabel is exceedingly similar to the ground-truth label. Some especially difficult examples are shown below:

Our evaluation demonstrated that LabelRank consistently outperforms existing state-of-the-art mislabel detection methods across both benchmark configurations. The results are quantified using the area under the receiver operating characteristic curve (AUROC), a threshold-free metric that summarizes detection performance across all possible detection thresholds. The full results are provided in the table below:
LabelRank achieved AUROC scores of 0.990 and 0.982 at the 5% and 30% noise levels respectively, surpassing the next-best-performing methods, SEMD (in the 5% condition) and SimiFeat (in the 30% condition). This performance gap, particularly evident in the high-noise regime (30%), indicates LabelRank’s superior robustness when detecting errors even in heavily mislabeled datasets.
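To make the metric concrete, here is a generic sketch of how AUROC is computed for mislabel detection: each detector assigns every sample a suspicion score, which is then ranked against the known seeded-corruption mask. The variable names and toy values below are illustrative.

```python
# Generic sketch: scoring a mislabel detector with AUROC.
# `is_corrupted` marks which labels were deliberately flipped during seeding;
# `suspicion` is the detector's per-sample score (higher = more likely mislabeled).
import numpy as np
from sklearn.metrics import roc_auc_score

is_corrupted = np.array([False, True, False, False, True])   # toy ground truth
suspicion    = np.array([0.12,  0.91, 0.40,  0.05,  0.77])   # toy detector output

auroc = roc_auc_score(is_corrupted, suspicion)
print(f"AUROC = {auroc:.3f}")  # 1.000 here, since all corrupted samples outrank clean ones
```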
Conclusion
LabelRank is available immediately as part of the suite of Quality Analysis features in the Visual Layer Platform. Visual Layer's customers are already applying these capabilities across wide-ranging domains, spanning the defense, e-commerce, manufacturing, and biomedical industries, for applications that range from model training to defect inspection and intelligence analysis. In the near future, we expect to release more dataset evaluations beyond natural scenes, as well as to open-source the evaluation datasets for others to use in their own benchmarks.
References
[1] Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
[2] Northcutt, C. G., Athalye, A., & Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
[3] [9] Srikanth, M., Irvin, J., Hill, B. W., Godoy, F., Sabane, I., & Ng, A. Y. (2023). An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets. arXiv preprint arXiv:2312.02200.
[4] Li, F.-F., Andreeto, M., Ranzato, M., & Perona, P. (2022). Caltech 101 (1.0) [Data set]. CaltechDATA. https://doi.org/10.22002/D1.20086
[5] Marks, J., Griffin, B. A., & Corso, J. J. (2024). Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes. arXiv preprint arXiv:2412.02596.
[6] Northcutt, C., Jiang, L., & Chuang, I. (2021). Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70, 1373–1411.
[7] Zhu, Z., Dong, Z., & Liu, Y. (2022, June). Detecting corrupted labels without training a model to predict. In International conference on machine learning (pp. 27412–27427). PMLR.
[8] Gröger, F., Lionetti, S., Gottfrois, P., Gonzalez-Jimenez, A., Amruthalingam, L., Groh, M., … & Pouly, M. (2024). Intrinsic Self-Supervision for Data Quality Audits. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.










