[Detic] Object Detection over Twenty-thousand Classes [Python]

This article shows how to detect objects that match arbitrary search keywords using Detic (Detecting Twenty-thousand Classes using Image-level Supervision), released by Meta (formerly Facebook).

Detic

Abstract

Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare classes. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning.

Below is an example of the image classification dataset ImageNet.

Below is an example of the object detection dataset LVIS.

Features

As the figures above show, an object detection dataset must annotate the position of every target object in each image. For this reason, it is far more costly to build a dataset for object detection than one for image classification. Detic can train the detector's classifier on image classification datasets, so it learns a large number of object categories from much larger datasets. This makes it possible to detect a wider variety of objects than before. In addition, because the classifier is built from CLIP text embeddings, Detic can detect objects that match any keyword.
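
To make the keyword matching idea concrete, here is a minimal sketch under simplifying assumptions: each keyword is represented by a text embedding, and a detected region is labeled with the keyword whose embedding is most similar to the region's feature vector. The function names and the three-dimensional vectors are made up for illustration. Real CLIP embeddings have hundreds of dimensions, and this is not Detic's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_region(region_feature, keyword_embeddings):
    """Return (best_keyword, score) for one region feature.

    keyword_embeddings maps each keyword to its text embedding.
    """
    scores = {k: cosine(region_feature, e) for k, e in keyword_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy 3-dimensional embeddings (made-up numbers, not real CLIP outputs)
keyword_embeddings = {
    "laptop": [0.9, 0.1, 0.0],
    "cup": [0.0, 1.0, 0.1],
    "chair": [0.1, 0.0, 1.0],
}
region_feature = [0.8, 0.2, 0.1]  # hypothetical feature of one detected box

label, score = classify_region(region_feature, keyword_embeddings)
print(label, round(score, 3))
```

Because the class list is just a set of text embeddings, changing the vocabulary only means swapping the embeddings. The detector itself does not need to be retrained.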

Demo (Colaboratory)

This section walks through a demo on Google Colab.

All of the code described below is posted here. To run it, change the runtime type to GPU.

Setup environment

import torch
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)

Install detectron2 using pip.

# Install detectron2
!pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/$CUDA_VERSION/torch$TORCH_VERSION/index.html
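
The wheel index URL in the pip command is assembled from the two version variables printed earlier. The sketch below shows that assembly as plain Python so it can be checked without installing anything. The helper names parse_torch_version and wheel_index_url are hypothetical, and the URL pattern is simply the one used in the command above. Note that a version string without a "+" suffix carries no CUDA tag, so the sketch raises an error in that case instead of guessing.

```python
def parse_torch_version(version):
    """Split a string such as '1.10.0+cu111' into ('1.10', 'cu111').

    The build tag is None when the string has no '+' suffix.
    """
    base, sep, build = version.partition("+")
    major_minor = ".".join(base.split(".")[:2])
    return major_minor, (build if sep else None)

def wheel_index_url(version):
    """Build the detectron2 wheel index URL for a given torch version string."""
    torch_version, build = parse_torch_version(version)
    if build is None:
        # No CUDA tag available, so the URL cannot be derived from the version alone
        raise ValueError(f"no build tag in torch version {version!r}")
    return (
        "https://dl.fbaipublicfiles.com/detectron2/wheels/"
        f"{build}/torch{torch_version}/index.html"
    )

print(wheel_index_url("1.10.0+cu111"))
```

If the resulting URL has no matching prebuilt wheel, the pip command fails and detectron2 has to be installed another way, for example from source.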

Then clone the code from GitHub.

# clone and install Detic
!git clone https://github.com/facebookresearch/Detic.git --recurse-submodules
%cd Detic
!pip install -r requirements.txt

The environment setup is now complete.

Load model

Next, load the model. I download and use the pretrained model provided by Facebook.

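import sys

from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Make the CenterNet2 submodule importable (assumed path; adjust if the submodule layout differs)
sys.path.insert(0, 'third_party/CenterNet2/')
from centernet.config import add_centernet_config
from detic.config import add_detic_config

# Build the detector and download the pretrained weights
cfg = get_cfg()
add_centernet_config(cfg)
add_detic_config(cfg)
cfg.merge_from_file("configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml")
cfg.MODEL.WEIGHTS = 'https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth'
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
cfg.MODEL.ROI_BOX_HEAD.ZEROSHOT_WEIGHT_PATH = 'rand'  # placeholder; the classifier is replaced later by reset_cls_test
cfg.MODEL.ROI_HEADS.ONE_CLASS_PER_PROPOSAL = True # For better visualization purpose. Set to False for all classes.
predictor = DefaultPredictor(cfg)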

Object detection

Next, upload the images for object detection. You can upload one or more files.

from google.colab import files
uploaded = files.upload()
uploaded = list(uploaded.keys())
print(uploaded)

This time, I use the following four images.

Now run object detection. Enter the name of the object you want to find in detect_target, and objects matching that keyword are detected. Here I try to detect a laptop.

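import cv2
from google.colab.patches import cv2_imshow
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer
from detic.modeling.text.text_encoder import build_text_encoder
from detic.modeling.utils import reset_cls_test

def get_clip_embeddings(vocabulary, prompt='a '):
    # Encode each class name as the text "a <name>" with the CLIP text encoder
    text_encoder = build_text_encoder(pretrain=True)
    text_encoder.eval()
    texts = [prompt + x for x in vocabulary]
    emb = text_encoder(texts).detach().permute(1, 0).contiguous().cpu()
    return emb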

vocabulary = 'custom'
metadata = MetadataCatalog.get("__unused")

detect_target = 'laptop' #@param {type:"string"}

metadata.thing_classes = [detect_target]

classifier = get_clip_embeddings(metadata.thing_classes)
num_classes = len(metadata.thing_classes)
reset_cls_test(predictor.model, classifier, num_classes)

for file in uploaded:
  im = cv2.imread(file)

  # Reset visualization threshold
  output_score_threshold = 0.3
  for cascade_stages in range(len(predictor.model.roi_heads.box_predictor)):
    predictor.model.roi_heads.box_predictor[cascade_stages].test_score_thresh = output_score_threshold

  # Run model and show results
  outputs = predictor(im)
  v = Visualizer(im[:, :, ::-1], metadata)
  out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
  cv2_imshow(out.get_image()[:, :, ::-1])

del metadata.thing_classes

The detection results are as follows.

The laptop is detected quite accurately. Finally, load a built-in vocabulary and try to detect every object the model can recognize.

# Set up the model's vocabulary using built-in datasets

BUILDIN_CLASSIFIER = {
    'lvis': 'datasets/metadata/lvis_v1_clip_a+cname.npy',
    'objects365': 'datasets/metadata/o365_clip_a+cnamefix.npy',
    'openimages': 'datasets/metadata/oid_clip_a+cname.npy',
    'coco': 'datasets/metadata/coco_clip_a+cname.npy',
}

BUILDIN_METADATA_PATH = {
    'lvis': 'lvis_v1_val',
    'objects365': 'objects365_v2_val',
    'openimages': 'oid_val_expanded',
    'coco': 'coco_2017_val',
}

vocabulary = 'lvis' # change to 'lvis', 'objects365', 'openimages', or 'coco'
metadata = MetadataCatalog.get(BUILDIN_METADATA_PATH[vocabulary])
classifier = BUILDIN_CLASSIFIER[vocabulary]
num_classes = len(metadata.thing_classes)
reset_cls_test(predictor.model, classifier, num_classes)

The code above loads the vocabulary of the LVIS dataset. Now run detection again with this vocabulary.

for file in uploaded:
  im = cv2.imread(file)

  # Reset visualization threshold
  output_score_threshold = 0.3
  for cascade_stages in range(len(predictor.model.roi_heads.box_predictor)):
    predictor.model.roi_heads.box_predictor[cascade_stages].test_score_thresh = output_score_threshold

  # Run model and show results
  outputs = predictor(im)
  v = Visualizer(im[:, :, ::-1], metadata)
  out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
  cv2_imshow(out.get_image()[:, :, ::-1])

The detection results are as follows.

As you can see, objects are detected down to very fine detail.
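
When many objects are drawn on one image, a text summary can be easier to read than the visualization. The following is a minimal sketch, assuming the predicted class indices and scores have already been converted to plain Python lists. With detectron2 these would typically come from outputs["instances"].pred_classes.tolist() and outputs["instances"].scores.tolist(), with the names taken from metadata.thing_classes. The function name summarize_detections and the sample values are hypothetical.

```python
from collections import Counter

def summarize_detections(class_ids, scores, class_names, min_score=0.3):
    """Count detections per class name, keeping only those scoring at least min_score."""
    kept = [class_names[c] for c, s in zip(class_ids, scores) if s >= min_score]
    return Counter(kept)

# Hypothetical values standing in for real model output
class_names = ["laptop", "cup", "chair"]
class_ids = [0, 1, 1, 2]
scores = [0.92, 0.55, 0.21, 0.40]

summary = summarize_detections(class_ids, scores, class_names)
for name, count in summary.most_common():
    print(name, count)
```

Raising min_score trades recall for precision, which is the same trade-off controlled by output_score_threshold in the detection loop.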

Summary

In this article, I introduced how to use Detic to detect objects with arbitrary keywords. It is quite impressive that Detic extends what conventional detectors can do and improves accuracy while also making training data easier to obtain. As anyone who has built one knows, annotating an object detection dataset is painful work. Many people would be helped if Detic freed them from that task.

References

1. Paper - Detecting Twenty-thousand Classes using Image-level Supervision

2. GitHub - facebookresearch/Detic
