[Detic] Object Detection over Twenty-thousand Classes [Python]

2022年6月21日火曜日

Artificial Intelligence English

In this article, I will introduce how to detect objects with arbitrary search keywords using Detic (Detecting Twenty-thousand Classes using Image-level Supervision) announced by Meta (formerly Facebook).

eyecatch

Detic

Abstract

Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare classes. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning.

Below is an example of the image classification dataset ImageNet.

ImageNetデータセット例

Below is an example of the object detection dataset LVIS.

LVISデータセット例

feature

As shown in the above figure, the object detection dataset needs to be given the position information of the object to be detected in the image. For this reason, it is more costly to secure a data set for object detection than to classify images. Detic is now able to train object detection on image classification datasets, so it is learning a large number of objects from a large dataset. This makes it possible to detect more and more various objects than before. In addition, by incorporating CLIP, object detection suitable for any keyword is also realized.

demo(Colaboratory)

This chapter describes a demo by Google Colab.

In addition, all the implementations described below are posted here. To run it, please change runtime to GPU.

Setup environment

import torch
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)

install detectron2 using pip.

# detectron2をインストール
!pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/$CUDA_VERSION/torch$TORCH_VERSION/index.html

then, code clone from GitHub

# clone and install Detic
!git clone https://github.com/facebookresearch/Detic.git --recurse-submodules
%cd Detic
!pip install -r requirements.txt

Setup environment is done.

Load model

Then load the model. I will download and use the trained model provided by facebook.

# Build the detector and download our pretrained weights
cfg = get_cfg()
add_centernet_config(cfg)
add_detic_config(cfg)
cfg.merge_from_file("configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml")
cfg.MODEL.WEIGHTS = 'https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth'
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
cfg.MODEL.ROI_BOX_HEAD.ZEROSHOT_WEIGHT_PATH = 'rand'
cfg.MODEL.ROI_HEADS.ONE_CLASS_PER_PROPOSAL = True # For better visualization purpose. Set to False for all classes.
predictor = DefaultPredictor(cfg)

Object detection

Next, upload the image for object detection. You can upload one or more files.

from google.colab import files
uploaded = files.upload()
uploaded = list(uploaded.keys())
print(uploaded)

This time, enter these 4 images.

テストデータ

Performs object detection. If you enter the name of the detected object in detect_target, the corresponding object will be detected. This time I will try to detect the laptop.

from detic.modeling.text.text_encoder import build_text_encoder
def get_clip_embeddings(vocabulary, prompt='a '):
    text_encoder = build_text_encoder(pretrain=True)
    text_encoder.eval()
    texts = [prompt + x for x in vocabulary]
    emb = text_encoder(texts).detach().permute(1, 0).contiguous().cpu()
    return emb

vocabulary = 'custom'
metadata = MetadataCatalog.get("__unused")

detect_target = 'laptop' #@param {type:"string"}

metadata.thing_classes = [detect_target]

classifier = get_clip_embeddings(metadata.thing_classes)
num_classes = len(metadata.thing_classes)
reset_cls_test(predictor.model, classifier, num_classes)

for file in uploaded:
  im = cv2.imread(file)

  # Reset visualization threshold
  output_score_threshold = 0.3
  for cascade_stages in range(len(predictor.model.roi_heads.box_predictor)):
    predictor.model.roi_heads.box_predictor[cascade_stages].test_score_thresh = output_score_threshold

  # Run model and show results
  outputs = predictor(im)
  v = Visualizer(im[:, :, ::-1], metadata)
  out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
  cv2_imshow(out.get_image()[:, :, ::-1])

del metadata.thing_classes

The detection results are as follows.

laptop検出結果1
laptop検出結果2
laptop検出結果3
laptop検出結果4

It has been detected quite accurately. Finally, load the vocabulary and try to detect everything that can be detected.

# Setup the model's vocabulary using build-in datasets

BUILDIN_CLASSIFIER = {
    'lvis': 'datasets/metadata/lvis_v1_clip_a+cname.npy',
    'objects365': 'datasets/metadata/o365_clip_a+cnamefix.npy',
    'openimages': 'datasets/metadata/oid_clip_a+cname.npy',
    'coco': 'datasets/metadata/coco_clip_a+cname.npy',
}

BUILDIN_METADATA_PATH = {
    'lvis': 'lvis_v1_val',
    'objects365': 'objects365_v2_val',
    'openimages': 'oid_val_expanded',
    'coco': 'coco_2017_val',
}

vocabulary = 'lvis' # change to 'lvis', 'objects365', 'openimages', or 'coco'
metadata = MetadataCatalog.get(BUILDIN_METADATA_PATH[vocabulary])
classifier = BUILDIN_CLASSIFIER[vocabulary]
num_classes = len(metadata.thing_classes)
reset_cls_test(predictor.model, classifier, num_classes)

Load the vocabulary from the lvis dataset.

for file in uploaded:
  im = cv2.imread(file)

  # Reset visualization threshold
  output_score_threshold = 0.3
  for cascade_stages in range(len(predictor.model.roi_heads.box_predictor)):
    predictor.model.roi_heads.box_predictor[cascade_stages].test_score_thresh = output_score_threshold

  # Run model and show results
  outputs = predictor(im)
  v = Visualizer(im[:, :, ::-1], metadata)
  out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
  cv2_imshow(out.get_image()[:, :, ::-1])

The detection results are as follows.

全ての検出結果1
全ての検出結果2
全ての検出結果3
全ての検出結果4

We can see that the object can be detected in very small detail.

Summary

In this article, I introduced how to use Detic to detect objects with arbitrary keywords. It is quite amazing to expand the conventional functions and improve the accuracy while making it easy to secure the data set. It's painful to actually create it, but annotating the object detection dataset is a painful task. Many people would be saved if Detic freed them from that task.

References

1.  Paper - Detecting Twenty-thousand Classes using Image-level Supervision

2. GitHub - facebookresearch/Detic

AIで副業ならココから!

まずは無料会員登録

プロフィール

メーカーで研究開発を行う現役エンジニア
組み込み機器開発や機会学習モデル開発に従事しています

本ブログでは最新AI技術を中心にソースコード付きでご紹介します


Twitter

カテゴリ

このブログを検索

ブログ アーカイブ

TeDokology