MDETR

MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

This repository contains code and links to pre-trained models for MDETR (Modulated DETR). It covers pre-training on data with aligned text and images with box annotations, as well as fine-tuning on tasks requiring fine-grained understanding of image and text.

We show large gains on phrase grounding (Flickr30k), referring expression comprehension (RefCOCO, RefCOCO+ and RefCOCOg), and referring expression segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).

TL;DR. We depart from the fixed, frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. This allows us to expand our vocabulary to anything found in free-form text, making it possible to detect and reason over novel combinations of object classes and attributes.
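
To make the idea of "class labels are the relevant words in the text query" concrete, here is a minimal sketch. It is not taken from this repository: the whitespace tokenizer, the function names, and the example caption are all assumptions made for illustration, and the real model operates on the tokens of its own text encoder. The sketch only shows how a box annotated with a character span of the caption could be turned into a target over token positions instead of an index into a fixed class list.

```python
import re


def tokenize_with_offsets(text):
    """Hypothetical whitespace tokenizer: returns (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def span_to_token_target(text, char_spans):
    """Build a uniform distribution over the tokens that overlap the given
    character spans. Each span is a (start, end) pair with an exclusive end."""
    tokens = tokenize_with_offsets(text)
    hits = {
        i
        for i, (_, start, end) in enumerate(tokens)
        if any(start < s_end and s_start < end for s_start, s_end in char_spans)
    }
    if not hits:
        raise ValueError("spans do not overlap any token")
    weight = 1.0 / len(hits)
    return [weight if i in hits else 0.0 for i in range(len(tokens))]


# Illustrative caption with two annotated objects (made up for this example).
caption = "a brown dog chasing a red ball"
dog_target = span_to_token_target(caption, [(2, 11)])    # "brown dog"
ball_target = span_to_token_target(caption, [(22, 30)])  # "red ball"
```

Because the target is defined over positions in the query rather than over a closed label set, any phrase that appears in the text can serve as a label, which is what allows the vocabulary to extend to free-form text.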

For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.

Aishwarya Kamath and Nicolas Carion made equal contributions to this codebase.

Referring expression comprehension

Instructions for data preparation and scripts to run fine-tuning and evaluation can be found in the Referring Expression Instructions.

License

MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

If you find this repository useful, please give it a star and cite it as follows:

    @article{kamath2021mdetr,
      title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},
      author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},
      journal={arXiv preprint arXiv:2104.12763},
      year={2021}
    }
