TASED-Net is a novel fully-convolutional network architecture for video saliency detection. The main idea is simple but effective: spatially decoding 3D video features while jointly aggregating all the temporal information. TASED-Net significantly outperforms previous state-of-the-art approaches on all three major large-scale datasets of video saliency detection: DHF1K, Hollywood2, and UCFSports. We observe that our model is especially better at attending to salient moving objects.
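To make the idea concrete, here is a minimal toy sketch of this encode-aggregate-decode pattern in PyTorch. It is not the released implementation (the real model uses an S3D encoder and a more elaborate temporal-aggregation decoder); it only illustrates how the temporal dimension is collapsed while the decoder upsamples spatially.

```python
# Toy sketch of the TASED-Net idea (NOT the released implementation):
# encode a clip with 3D convolutions, collapse the temporal axis,
# then decode spatially into a single-frame saliency map.
import torch
import torch.nn as nn

class ToyTASED(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D encoder stand-in (the real model uses an S3D backbone)
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # spatial decoder: temporal kernel/stride of 1, so only H and W grow
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(128, 64, kernel_size=(1, 4, 4),
                               stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 1, kernel_size=(1, 4, 4),
                               stride=(1, 2, 2), padding=(0, 1, 1)),
        )

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        feat = self.encoder(clip)               # (B, 128, T/2, H/4, W/4)
        feat = feat.mean(dim=2, keepdim=True)   # aggregate all temporal info
        out = self.decoder(feat)                # (B, 1, 1, H, W)
        return torch.sigmoid(out.squeeze(2).squeeze(1))  # (B, H, W)

sal = ToyTASED()(torch.randn(1, 3, 8, 112, 112))
print(sal.shape)  # torch.Size([1, 112, 112])
```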
TASED-Net is currently leading the leaderboard of the DHF1K online benchmark.
Model | Year | NSS↑ | CC↑ | SIM↑ | AUC-J↑ | s-AUC↑ |
---|---|---|---|---|---|---|
TASED-Net (updated) | 2019 | 2.797 | 0.489 | 0.393 | 0.897 | 0.712 |
TASED-Net (reported) | 2019 | 2.667 | 0.470 | 0.361 | 0.895 | 0.712 |
SalEMA | 2019 | 2.574 | 0.449 | 0.466 | 0.890 | 0.667 |
STRA-Net | 2019 | 2.558 | 0.458 | 0.355 | 0.895 | 0.663 |
ACLNet | 2018 | 2.354 | 0.434 | 0.315 | 0.890 | 0.601 |
SalGAN | 2017 | 2.043 | 0.370 | 0.262 | 0.866 | 0.709 |
SALICON | 2015 | 1.901 | 0.327 | 0.232 | 0.857 | 0.590 |
GBVS | 2007 | 1.474 | 0.283 | 0.186 | 0.828 | 0.554 |
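For reference, a minimal sketch of how two of the metrics above can be computed for a single predicted frame (this is not the official benchmark code): NSS is the mean of the z-scored prediction at the human fixation points, and CC is the Pearson correlation between the prediction and the continuous ground-truth map. Both inputs are 2D NumPy arrays of the same shape.

```python
import numpy as np

def nss(saliency_map, fixation_map):
    """Normalized Scanpath Saliency: mean of the z-scored prediction
    at the binary human fixation locations."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(s[fixation_map.astype(bool)].mean())

def cc(saliency_map, gt_density_map):
    """Linear Correlation Coefficient between the prediction and the
    continuous (blurred) ground-truth density map."""
    p = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    g = (gt_density_map - gt_density_map.mean()) / (gt_density_map.std() + 1e-8)
    return float((p * g).mean())
```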
Video saliency detection aims to model the gaze fixation patterns of humans when viewing a dynamic scene. Because the predicted saliency map can be used to prioritize the video information across space and time, this task has a number of applications such as video surveillance, video captioning, and video compression.
We compare our TASED-Net to ACLNet, the previous state-of-the-art method. As shown in the examples below, TASED-Net is better at attending to the salient information. We would also like to point out that TASED-Net has a much smaller network size (82 MB vs. 252 MB).
First, clone this repository and download this weight file. Then, just run the code using
$ python run_example.py
This will generate frame-wise saliency maps. You can also specify the input and output directories as command-line arguments. For example,
$ python run_example.py ./example ./output
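For context, TASED-Net predicts the saliency map of the last frame of a short input clip, so frame-wise prediction amounts to sliding a clip window over the video. The sketch below illustrates that loop; the model class name, weight-file name, clip length, and input resolution are assumptions here, so refer to run_example.py for the actual script.

```python
# Rough sketch of frame-wise sliding-window inference (names and sizes are
# hypothetical; see run_example.py for the actual script).
import os
import glob
import numpy as np
import torch
from PIL import Image

from model import TASED_v2   # hypothetical import; use this repo's model class

CLIP_LEN = 32                             # assumed clip length

def load_frame(path, size=(384, 224)):    # (width, height), assumed
    img = Image.open(path).convert('RGB').resize(size)
    return torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 255.0

model = TASED_v2()
model.load_state_dict(torch.load('TASED_updated.pt', map_location='cpu'))
model.eval()

frames = sorted(glob.glob('./example/*.jpg'))
os.makedirs('./output', exist_ok=True)

with torch.no_grad():
    for i in range(CLIP_LEN - 1, len(frames)):
        # stack the last CLIP_LEN frames into a (1, 3, T, H, W) clip
        clip = torch.stack([load_frame(p) for p in frames[i - CLIP_LEN + 1:i + 1]], dim=1)
        sal = model(clip.unsqueeze(0)).squeeze().cpu().numpy()  # map for frame i
        sal = (255.0 * sal / (sal.max() + 1e-8)).astype(np.uint8)
        Image.fromarray(sal).save('./output/%05d.png' % i)
```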
- The released model is a modified version that improves performance. The updated results are reported in the table above.
- We recommend using PNG image files as input (although the examples in this repository are in JPEG format).
- For the encoder of TASED-Net, we use the S3D network. We pretrained S3D on the Kinetics-400 dataset using PyTorch; it achieves 72.08% top-1 accuracy (top-5: 90.35%) on the validation set. We release our weight file for S3D together with this project. If you find it useful, please consider citing our work.
- For training, we recommend using ViP, a general-purpose video platform for PyTorch. Otherwise, you can simply use run_train.py. Before running the training code, make sure to download our weight file for S3D (a sketch of this setup is given after these notes).
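As referenced in the last note, here is a rough sketch of that setup: initialize the encoder from the pretrained S3D checkpoint by copying only the matching weights, and train against a KL-divergence loss between predicted and ground-truth saliency maps, a common choice for saliency prediction. The model class, checkpoint file name, and loss details are assumptions here; run_train.py contains the actual pipeline.

```python
# Sketch of initializing the encoder from the S3D checkpoint plus a
# KL-divergence saliency loss (file names and the model class are assumptions).
import torch
from model import TASED_v2   # hypothetical import; use this repo's model class

model = TASED_v2()

# Copy only the pretrained S3D weights whose names and shapes match the encoder.
s3d_dict = torch.load('S3D_kinetics400.pt', map_location='cpu')
own_dict = model.state_dict()
own_dict.update({k: v for k, v in s3d_dict.items()
                 if k in own_dict and v.shape == own_dict[k].shape})
model.load_state_dict(own_dict)

def kld_loss(pred, gt, eps=1e-8):
    """KL divergence between predicted and ground-truth saliency maps of shape
    (B, H, W), each normalized to a probability distribution over pixels."""
    pred = pred / (pred.sum(dim=(-2, -1), keepdim=True) + eps)
    gt = gt / (gt.sum(dim=(-2, -1), keepdim=True) + eps)
    return (gt * torch.log(gt / (pred + eps) + eps)).sum(dim=(-2, -1)).mean()
```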
@inproceedings{min2019tased,
title={TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection},
author={Min, Kyle and Corso, Jason J},
booktitle={Proceedings of the IEEE International Conference on Computer Vision},
pages={2394--2403},
year={2019}
}