CLIP-ViL: How Much Can CLIP Benefit Vision-and-Language Tasks?

Introduction

This repository contains the code and checkpoints for the vision-and-language pre-training model in our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?" (Link). With pre-training, CLIP-ViL sets a new single-model state of the art on benchmarks such as VQA v2.0 (76.70 on test-std).

The code is adapted from both the CLIP repo and the LXMERT repo. Many thanks to the authors of these repos!

Data & Files Required

Annotation Files

  1. Download the annotation files with [0dhw] and save them under the data/ folder. At this point, the gqa and mscoco folders are still incomplete; the image data below completes them.

Image Data

  1. Download the COCO images and unzip them into the data/mscoco folder:

    wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
    wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
    wget http://images.cocodataset.org/zips/test2015.zip -P data/mscoco
    
    unzip data/mscoco/train2014.zip -d data/mscoco/ && rm data/mscoco/train2014.zip
    unzip data/mscoco/val2014.zip -d data/mscoco/ && rm data/mscoco/val2014.zip
    unzip data/mscoco/test2015.zip -d data/mscoco/ && rm data/mscoco/test2015.zip
  2. Download the original GQA dataset, including Scene Graphs (ver 1.1 / 42.7MB), Questions (ver 1.2 / 1.4GB), and Images (20.3GB), and unzip them into the data/gqa folder, as sketched after this list.

  3. Refer to data/shot_for_check.jpg to verify the resulting directory layout (a command-line sanity check is also sketched below).
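
A minimal sanity-check sketch for steps 1 and 2 above. The GQA archive names below are assumptions based on the official GQA download page and may differ from the files you actually downloaded, so adjust them as needed:

    # Unzip the GQA archives into data/gqa (archive names are assumed; adjust as needed).
    unzip data/gqa/sceneGraphs.zip -d data/gqa/ && rm data/gqa/sceneGraphs.zip
    unzip data/gqa/questions1.2.zip -d data/gqa/ && rm data/gqa/questions1.2.zip
    unzip data/gqa/images.zip -d data/gqa/ && rm data/gqa/images.zip

    # Rough check of the COCO splits: train2014 / val2014 / test2015 should contain
    # roughly 82,783 / 40,504 / 81,434 images respectively.
    ls data/mscoco/train2014 | wc -l
    ls data/mscoco/val2014 | wc -l
    ls data/mscoco/test2015 | wc -l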

Environment Setup

  1. Run pip install -r requirements.txt to install exactly the same dependencies.

  2. Alternatively, use conda-pack to unpack the packed environment downloaded from here with [0dhw], then activate it as sketched after this list:

    pip install conda-pack
    mkdir -p [path_to_conda_env]    # (e.g., ~/anaconda/envs/ENV_NAME)
    tar -zxvf [ENV_NAME].tar.gz -C [path_to_conda_env]
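
A typical conda-pack workflow after extracting the archive is sketched below ([path_to_conda_env] is whatever directory you chose above):

    # Activate the unpacked environment, then fix its hard-coded prefix paths.
    source [path_to_conda_env]/bin/activate
    conda-unpack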

Fine-Tuning

Caveat: To reduce CPU memory usage, annotation files are shared across data readers via shared memory. Be sure to delete any files with the prefix sharearray_ under /dev/shm/ after you finish training, for example as shown below.
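
A minimal cleanup command after training (assuming the prefix and location from the caveat above):

    # Remove leftover shared-memory caches created by the data readers.
    rm -f /dev/shm/sharearray_*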

  1. Training (download the pre-trained checkpoint with [0dhw] and place it at snap/pretrained/CLIP_VL_RN50x4/Epoch11_LXRT.pth):

    ./scripts/[train_side_xxx.sh] 0,1,2,3 [model_name] 9590 4

     When training finishes, the best checkpoint is saved to snap/vqa/[model_name]/BEST.pth (or snap/gqa/[model_name]/BEST.pth for GQA).

  2. Testing:

    ./scripts/[test_side_xxx.sh] 0,1,2,3 [model_name] 9590 4

     This generates snap/vqa/[model_name]/test_predict.json for VQA or snap/gqa/[model_name]/submit_predict.json for GQA, which can be submitted to the VQA leaderboard or the GQA leaderboard for Dev and Std results.

  3. Our best VQA and GQA checkpoints can be downloaded with [0dhw]. A concrete example invocation is sketched below.
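
As a concrete (hypothetical) example, fine-tuning and then testing a VQA model named clip_rn50x4 on GPUs 0-3 with port 9590 might look like the following; the script names are placeholders, so substitute the actual train/test scripts shipped under scripts/:

    # Hypothetical script names; replace with the actual scripts in scripts/.
    bash ./scripts/train_side_vqa.sh 0,1,2,3 clip_rn50x4 9590 4
    bash ./scripts/test_side_vqa.sh 0,1,2,3 clip_rn50x4 9590 4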
