Official code implementation of the paper AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?


AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

ICLR 2024

[Website] [arXiv] [PDF]

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after their current action (e.g. crack eggs)? What if we also know the actor's longer-term goal (e.g. making egg fried rice)? We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge about likely next actions, and they can infer the goal from the observed part of a procedure.

AntGPT is the framework proposed in our paper to leverage LLMs for video-based long-term action anticipation. AntGPT achieves state-of-the-art performance on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ at the time of publication.

Contents

Setup Environment

Clone this repository.

git clone git@github.com:brown-palm/AntGPT.git
cd AntGPT

Set up a Python 3.9 virtual environment and install PyTorch with the right CUDA version.

python3 -m venv venv/forecasting
source venv/forecasting/bin/activate
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --extra-index-url https://download.pytorch.org/whl/cu117

Install CLIP.

pip install git+https://github.com/openai/CLIP.git

Install other packages.

pip install -r requirements.txt 

Install the llama-recipes packages following the instructions here.
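
To verify the environment, you can run a quick sanity check like the one below (a minimal, optional sketch; it only confirms that PyTorch sees your GPU and that CLIP loads and encodes an image).

# sanity_check.py -- optional: verify that PyTorch, CUDA, and CLIP work together
import torch
import clip
from PIL import Image

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CLIP models:", clip.available_models())

# Load a small CLIP model and encode a blank image end to end
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)
print("CLIP image feature shape:", tuple(features.shape))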

Prepare Data

In our experiments, we used data from Ego4D, EPIC-Kitchens-55, and EGTEA GAZE+. For EPIC-Kitchens-55 and EGTEA GAZE+, we also used the data annotations and splits from EGO-TOPO. First, create a data folder in the root directory.

mkdir data

Datasets

Download the Ego4D dataset, annotations, and pretrained models from here.
Download the EPIC-Kitchens-55 dataset and annotations.
Download the EGTEA GAZE+ dataset from here.
Download the data annotations from EGO-TOPO, following their instructions.

Preprocessed Files

You can find our preprocessed files, including text prompts, goal features, etc., here.
Download and unzip both folders.
Place the goal_features folder under the data folder.
Place the dataset folder under the Llama2_models folder.
Make a symlink in the ICL subfolder of the Llama2_models folder.

ln -s {path_to_dataset} AntGPT/Llama2_models/ICL
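
If you want to confirm that the goal features were placed correctly, a minimal inspection sketch like the one below can help (it assumes the .pkl files are standard pickle objects; the exact keys and value types may differ).

# Optional: inspect a downloaded goal-feature file.
# Assumes the .pkl files are standard pickle objects; adjust the path to your setup.
import pickle

with open("data/goal_features/ego4d_feature_gt_val.pkl", "rb") as f:
    goal_features = pickle.load(f)

print(type(goal_features))
# If the object is a dict (not guaranteed), peek at a few entries
if isinstance(goal_features, dict):
    for key in list(goal_features)[:3]:
        value = goal_features[key]
        print(key, type(value), getattr(value, "shape", None))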

Features

We used CLIP to extract features from these datasets. You can use the feature extraction script under transformer_models to extract the features.

python -m transformer_models.generate_clip_img_embedding
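
For reference, the core of CLIP image-feature extraction looks roughly like the sketch below (a simplified illustration, not the actual script; the CLIP variant, frame sampling, and output format used by generate_clip_img_embedding may differ).

# Simplified illustration of CLIP image feature extraction; see
# transformer_models/generate_clip_img_embedding.py for the real pipeline.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # the exact CLIP variant is an assumption

def extract_frame_features(frame_paths):
    """Encode a list of frame image paths into L2-normalized CLIP features."""
    images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths]).to(device)
    with torch.no_grad():
        features = model.encode_image(images)
        features = features / features.norm(dim=-1, keepdim=True)
    return features.cpu()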

Data Folder Structure

Our data folder structure is illustrated below. Feel free to use your own setup, but remember to adjust the path configs accordingly.

data
├── ego4d
│   ├── annotations
│   │   ├── fho_lta_taxonomy.json
│   │   ├── fho_test_unannotated.json
│   │   ├── ...
│   │
│   └── clips
│       ├── 0a7a74bf-1564-41dc-a516-f5f1fa7f75d1.mp4
│       ├── 0a975e6e-4b13-426d-be5f-0ef99b123358.mp4
│       ├── ...
│
├── ek
│   ├── annotations
│   │   ├── EPIC_many_shot_verbs.csv
│   │   ├── ...
│   │
│   └── clips
│       ├── rgb
│       ├── obj
│       └── flow
│
├── gaze
│   ├── annotations
│   │   ├── action_list_t+v.csv
│   │   ├── ...
│   │
│   └── clips
│       ├── OP01-R01-PastaSalad.mp4
│       ├── ...
│
├── goal_features
│   ├── ego4d_feature_gt_val.pkl
│   ├── ...
│
├── output_CLIP_img_embedding_ego4d
│
...
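
Before running experiments, a quick check that the expected folders exist can save time (a minimal sketch based on the tree above; only check the datasets you actually downloaded).

# Optional: check that the expected data folders exist
import os

expected = [
    "data/ego4d/annotations",
    "data/ego4d/clips",
    "data/ek/annotations",
    "data/ek/clips",
    "data/gaze/annotations",
    "data/gaze/clips",
    "data/goal_features",
    "data/output_CLIP_img_embedding_ego4d",
]

for path in expected:
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{status:8s} {path}")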

Running Experiments

Our codebase consists of three parts: the transformer experiments, the GPT experiments, and the Llama2 experiments. The implementation of each part is located in the transformer_models, GPT_models, and Llama2_models folders, respectively.

Download Outputs and Checkpoints

You can find our model checkpoints and output files for Ego4D LTA here.
Unzip both folders.
Place the ckpt folder under the llama_recipe subfolder of the Llama2_models folder.
Place the ego4d_outputs folder under the llama_recipe subfolder of the Llama2_models folder.

Evaluation on Ego4D LTA

Submit the output files to the leaderboard.
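
Before uploading, you may want a quick sanity check that the output file loads and contains the expected number of entries (a minimal sketch; it assumes the output is a JSON file, and the required schema is defined by the Ego4D LTA challenge, not by this sketch).

# Optional: sanity-check a prediction file before submitting to the leaderboard.
# Assumes the output is a JSON file; consult the Ego4D LTA challenge page for the required schema.
import json

with open("predictions.json") as f:  # replace with your output file path
    predictions = json.load(f)

print("Number of entries:", len(predictions))
if isinstance(predictions, dict):
    first_key = next(iter(predictions))
    print("Example entry:", first_key, "->", type(predictions[first_key]))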

Inference on Ego4D LTA

cd Llama2_models/Finetune/llama-recipes
CUDA_VISIBLE_DEVICES=0 python inference/inference_lta.py --model_name {your llama checkpoint path} --peft_model {pretrained model path} --prompt_file ../dataset/test_nseg8_recog_egovlp.jsonl --response_path {output file path}

Transformer Experiments

To run an experiment with the transformer models, please use the following command:

python -m transformer_models.run --cfg transformer_models/configs/ego4d_image_pred_in8.yaml --exp_name ego4d_lta/clip_feature_in8

GPT Experiments

To run a GPT experiment, please use one of the workflow illustration notebooks.

Llama2 Experiments

To run a Llama2 experiment, please refer to the instructions in that folder.

Our Paper

Our paper is available on arXiv. If you find our work useful, please consider citing us.

@inproceedings{zhao2023antgpt,
  title     = {AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?},
  author    = {Qi Zhao and Shijie Wang and Ce Zhang and Changcheng Fu and Minh Quan Do and Nakul Agarwal and Kwonjoon Lee and Chen Sun},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}
}

License

This project is released under the MIT license.