Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?

Annika Mütze1,3, Sadia Ilyas2,3, Christian Dörpelkus1, Matthias Rottmann1

1 Institute of Computer Science, University of Osnabrück, Osnabrück, Germany
2 Aptiv Services Deutschland GmbH, Wuppertal, Germany
3 Equal contribution

WACV (2026)


Abstract

Open-vocabulary object detectors such as Grounding DINO are trained on vast and diverse data and achieve remarkable performance on challenging datasets. As a result, it is unclear where their limitations lie, which is a major concern when such models are used in safety-critical applications. Real-world data does not provide the control required for a rigorous evaluation of model generalization. Synthetically generated data, in contrast, allows us to systematically explore the boundaries of model competence and generalization. In this work, we address two research questions: 1) Can we challenge open-vocabulary object detectors with generated image content? 2) Can we find systematic failure modes of these models? To address these questions, we design two automated pipelines that use Stable Diffusion to inpaint unusual objects with high semantic diversity, sampling nouns from WordNet and ChatGPT. On the synthetically generated data, we evaluate and compare multiple open-vocabulary object detectors as well as a classical object detector. The synthetic data is derived from two real-world datasets: LostAndFound, a challenging out-of-distribution (OOD) detection benchmark, and the NuImages dataset. Our results indicate that inpainting can challenge open-vocabulary object detectors into overlooking objects. Additionally, we find a strong dependence of open-vocabulary models on object location rather than on object semantics. This provides a systematic approach to challenging open-vocabulary models and yields valuable insights into how data could be acquired to effectively improve these models.
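To illustrate the pipeline structure described above, the following is a minimal sketch of how street scenes could be paired with sampled object nouns to form inpainting prompts. The noun pool, prompt template, and function names here are illustrative stand-ins (assumptions), not the paper's implementation; in the paper, nouns are sampled from WordNet and ChatGPT, and the actual inpainting is done with Stable Diffusion.

```python
import random

# Hypothetical noun pool standing in for WordNet/ChatGPT sampling (assumption).
NOUN_POOL = ["wheelbarrow", "flamingo", "typewriter", "cello", "snowman"]

def sample_nouns(n, rng):
    """Sample n distinct object nouns; the paper samples from WordNet and ChatGPT."""
    return rng.sample(NOUN_POOL, n)

def build_inpainting_jobs(scenes, nouns_per_scene, seed=0):
    """Pair each street scene with sampled nouns and an inpainting prompt.

    Each job would then be handed to a Stable Diffusion inpainting model,
    which inserts the object at a chosen mask location in the scene.
    """
    rng = random.Random(seed)
    jobs = []
    for scene in scenes:
        for noun in sample_nouns(nouns_per_scene, rng):
            jobs.append({"scene": scene, "prompt": f"a {noun} on the road"})
    return jobs
```

The resulting job list decouples prompt generation from image synthesis, so the same sampled nouns can be reused across both source datasets and all evaluated detectors.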

Resources & Links

Acknowledgments

A.M. and M.R. gratefully acknowledge financial support from the German Federal Ministry for Research, Technology and Space (BMFTR) within the project "REFRAME" (grant no. 16IS24073C) and the junior research group project "UnrEAL" (grant no. 16IS22069).

Citation

@inproceedings{mutze2026can,
  title={Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?},
  author={M{\"u}tze, Annika and Ilyas, Sadia and D{\"o}rpelkus, Christian and Rottmann, Matthias},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={740--750},
  year={2026}}
          

Contact

Have questions or want to collaborate? Reach out: