The implementation of the NeurIPS 2025 Spotlight (Top 3%) paper *Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI Models in Sound Localization*.

In this paper, we build the first framework to benchmark current audio-visual models on the sound localization task. It consists of a large-scale stereo image-audio dataset generated by a physics-based 3D simulator with Head-Related Transfer Function (HRTF) filtering, and a neuroscience-inspired model, EchoPin, that uses a cochleagram representation.
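To illustrate the binaural-rendering idea behind the dataset, here is a hedged toy sketch, not the simulator's actual HRTF pipeline: it models only the interaural time difference (ITD) and interaural level difference (ILD) with a delayed, attenuated impulse per ear, whereas real measured HRIRs also encode direction-dependent spectral (pinna) cues. The function name and parameters are illustrative only.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(mono, sr, itd_s, ild_db):
    """Toy binaural rendering: convolve a mono source with one impulse
    response per ear. The near ear gets a unit impulse; the far ear gets
    a delayed (ITD) and attenuated (ILD) impulse."""
    delay = int(round(abs(itd_s) * sr))          # ITD in samples
    gain = 10.0 ** (-abs(ild_db) / 20.0)         # ILD as linear gain
    near = np.zeros(delay + 1); near[0] = 1.0    # ear closer to the source
    far = np.zeros(delay + 1); far[delay] = gain # ear farther from the source
    # positive ITD here means the source is on the left, so the right ear lags
    hl, hr = (near, far) if itd_s >= 0 else (far, near)
    left = fftconvolve(mono, hl)
    right = fftconvolve(mono, hr)
    return np.stack([left, right])               # (2, T) stereo signal
```

A real HRTF pipeline replaces the two impulse responses with measured left/right HRIRs for the source direction; the convolution step is the same.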
- [2025/10/20]: We released all related methods. Due to space limitations, we do not release the cochleagram `.npy` files; if you want to build cochleagrams for testing or fine-tuning, please follow the guidance below. If you run into any problems or bugs, feel free to contact me or open an issue, and we will respond as soon as possible.
- [2025/09/25]: We are releasing the code in stages: first the Unity rendering part, then the dataset part, and finally the training and testing code. Due to other deadlines, the full process may take a few weeks. Thank you for your patience.
- [2025/09/18]: Our SSHS has been selected as a Spotlight paper at NeurIPS 2025 (Top 3% of 21,575 submissions)!
- Unity 6000.10f1
- MATLAB 2022b

```shell
conda create -n SSHS python=3.9
conda activate SSHS
git submodule init
git submodule update
# or: git submodule update --init --recursive
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```

If you want to build the full AudioCOCO dataset from scratch, please follow the guidance below.
- Use `select_image.py` to select the images that contain sounding objects.
- Use `filter_audio.py` and `filter_image.py` to keep only the high-quality part.
- Set up the DepthAnything model, move `depth.py` into the DepthAnything folder, then estimate the depth of all images and update all JSON files.
- Open the UnityProject in Unity, choose the generation file, and check the console. For reference, each audio takes about 10 seconds to render.
- Use Pycochleagram to convert the waveforms into cochleagrams. For details, refer to Cochleagram_README.
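As a rough, self-contained illustration of what the cochleagram step computes (hypothetical helper names; ERB-spaced half-cosine filters and Hilbert envelopes only — the actual Pycochleagram package uses a more carefully designed filterbank with padding and resampling options), a minimal NumPy version looks like:

```python
import numpy as np
from scipy.signal import hilbert

def erb(f):
    # Glasberg & Moore frequency -> ERB-number scale
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def inv_erb(e):
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def toy_cochleagram(signal, sr, n_filters=38, low=50.0, high=8000.0):
    """Toy cochleagram: ERB-spaced half-cosine bandpass filters applied
    in the frequency domain, Hilbert envelopes, power-law compression."""
    n = len(signal)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    cuts = inv_erb(np.linspace(erb(low), erb(high), n_filters + 2))
    centers = cuts[1:-1]                          # channel center frequencies
    coch = np.empty((n_filters, n))
    for i in range(n_filters):
        lo, hi = cuts[i], cuts[i + 2]
        filt = np.zeros_like(freqs)
        band = (freqs > lo) & (freqs < hi)
        # half-cosine bump on the ERB axis, peaking at the channel center
        pos = (erb(freqs[band]) - erb(lo)) / (erb(hi) - erb(lo))
        filt[band] = np.cos((pos - 0.5) * np.pi)
        subband = np.fft.irfft(spec * filt, n)    # filtered subband signal
        env = np.abs(hilbert(subband))            # subband envelope
        coch[i] = env ** 0.3                      # cube-root-like compression
    return coch, centers
```

The resulting array can then be saved with `np.save` to produce `.npy` cochleagram files in the same spirit as the pipeline above.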
For each evaluation condition, set the corresponding flags:

```shell
# Congruent condition
--config config1.json --condition no
# ConflictVCue condition
--config config2.json --condition no
# AbsentVCue condition
--config config3.json --condition no
# AOnly (Vision noise) condition
--config config4.json --condition blind --label noise
# AOnly (Vision gray) condition
--config config4.json --condition blind --label gray
# VOnly (Audio noise) condition
--config config1.json --condition silent --label noise
# VOnly (Audio silent) condition
--config config1.json --condition silent --label silent
# Multi-Instance Localization condition
--config config6.json --condition no
```

The training dataset contains about 1,800,000 pairs; you can generate it with Unity and then train on it.
```shell
python train.py --train_config path_to_the_train_config_file \
    --image_root path_to_image \
    --audio_root path_to_audio \
    --coch_root path_to_cochleagram \
    --pretrained_path path_to_load_the_IS3_checkpoint_file_to_finetune
```

The test dataset and the checkpoint can be downloaded here.
```shell
python test.py --pretrained_path path_to_the_checkpoint_file \
    --config path_to_the_test_config_file \
    --image_root path_to_image \
    --coch_root path_to_cochleagram \
    --condition select_the_condition4or5 \
    --label select_the_different_input_for_condition4or5
```

```shell
python human_aacc.py       # calculate the A-Accuracy
python human_vacc.py       # calculate the V-Accuracy
python human_asymmetry.py  # analyze the direction bias
```

Our code is based on IS3, DepthAnything, and Pycochleagram. We sincerely appreciate their contributions.
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@article{jia2025seeing,
title={Seeing sound, hearing sight: Uncovering modality bias and conflict of ai models in sound localization},
author={Jia, Yanhao and Xie, Ji and Jivaganesh, S and Li, Hao and Wu, Xu and Zhang, Mengmi},
journal={arXiv preprint arXiv:2505.11217},
year={2025}
}