Abstract

The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose regional prompt learning, which aligns the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources via a novel self-training framework, which allows the proposed detector to be trained on a large corpus of noisy, uncurated web images; (iv) to evaluate the proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever.
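To make contribution (i) concrete, the following is a minimal sketch of how a region feature might be classified against category text embeddings by cosine similarity. It is not the PromptDet implementation: the function names, the toy 3-dimensional vectors, and the temperature value are all assumptions chosen for illustration, and a real system would obtain the embeddings from the detector head and the pre-trained text encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def classify_proposal(region_feat, text_embeds, temperature=0.01):
    """Classify one region feature against category text embeddings.

    region_feat: list of floats (hypothetical visual embedding of a box proposal).
    text_embeds: dict mapping category name -> list of floats (hypothetical
                 text-encoder output for that category).
    temperature: assumed softmax temperature, not taken from the paper.
    Returns (best_category, probabilities).
    """
    r = l2_normalize(region_feat)
    names = list(text_embeds)
    logits = []
    for name in names:
        t = l2_normalize(text_embeds[name])
        cos = sum(a * b for a, b in zip(r, t))  # cosine similarity
        logits.append(cos / temperature)
    # Numerically stable softmax over the categories.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings (assumed values, for illustration only).
text_embeds = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
best, probs = classify_proposal([0.1, 0.2, 0.9], text_embeds)
print(best)
```

Because the classifier is just a set of text embeddings, extending the vocabulary in this sketch amounts to adding another entry to `text_embeds`; no new classification weights are introduced.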

Datasets

A summary of dataset statistics on the open-vocabulary LVIS benchmark. LVIS is a large-vocabulary object detection dataset, where the frequent and common classes are treated as base categories (referred to as LVIS-base) and the rare classes as novel categories. We also use an external dataset, LAION-400M, which consists of 400 million image-text pairs filtered by pre-trained CLIP. We search for images using its 64G KNN indices and download about 300 images for each novel category (referred to as LAION-novel). We train on both LVIS-base and LAION-novel, and evaluate on the LVIS validation set.
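The retrieval step described above can be illustrated with a small sketch that ranks candidate images by cosine similarity to a category embedding and keeps the top k. This is a brute-force stand-in for the prebuilt KNN indices, which is an assumption made for readability; the function names, image identifiers, and 2-dimensional embeddings are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def retrieve_top_k(query_embed, image_embeds, k=300):
    """Return the ids of the k images most similar to the query embedding.

    query_embed: hypothetical text embedding of a novel category.
    image_embeds: dict mapping image id -> hypothetical image embedding.
    k: number of images to keep per category (about 300 in the text above).
    """
    scored = (
        (cosine(query_embed, emb), image_id)
        for image_id, emb in image_embeds.items()
    )
    # nlargest returns results sorted from most to least similar.
    return [image_id for _, image_id in heapq.nlargest(k, scored)]


# Toy example with assumed 2-D embeddings.
image_embeds = {
    "img_a": [0.9, 0.1],
    "img_b": [0.1, 0.9],
    "img_c": [0.7, 0.3],
    "img_d": [-1.0, 0.0],
}
top = retrieve_top_k([1.0, 0.0], image_embeds, k=2)
print(top)
```

An exhaustive scan like this would not scale to 400 million pairs, which is presumably why a precomputed index is used in practice.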

Results

R1: Comparison with state-of-the-art open-vocabulary object detectors on the LVIS benchmark.

R2: Comparison with state-of-the-art open-vocabulary object detectors on the MS-COCO benchmark.

R3: Generalisation from LVIS-base to MS-COCO with different prompts.

Visualizations

Qualitative results from PromptDet on images from the LVIS validation set. Green boxes denote objects from novel categories, while blue boxes denote objects from base categories.

Publication

C. Feng, Y. Zhong, Z. Jie, X. Chu, H. Ren, X. Wei, W. Xie, L. Ma
PromptDet: Towards Open-vocabulary Detection using Uncurated Images
ECCV 2022
