Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking

International Journal of Computer Vision

Abstract

Masked Autoencoders (MAE) have become a popular paradigm for large-scale vision representation pre-training. However, MAE only reconstructs low-level RGB signals after the decoder and provides no supervision on high-level semantics for the encoder, which leads to sub-optimal learned representations and long pre-training schedules. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with features encoded by pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose to Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE applies a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss of MAE to predict RGB pixel values for the 75% masked tokens after the decoder. Since MR-MAE applies the high-level and low-level targets to different token partitions, the learning conflicts between them are naturally avoided, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, an MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by \(+2.2\)% and the previous state-of-the-art BEiT V2 base by \(+0.3\)%. Pre-trained checkpoints are released at https://github.com/Alpha-VL/ConvMAE.
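
To make the two disjoint objectives concrete, the following is a minimal sketch (assuming PyTorch; not the authors' released code) of how a mimic loss on the 25% visible tokens and a pixel reconstruction loss on the 75% masked tokens could be combined. The tensor names, the cosine form of the mimic loss, and the loss weight are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn.functional as F

def mr_mae_loss(visible_feats,   # (B, N_vis, C)  encoder outputs for the 25% visible tokens
                teacher_feats,   # (B, N_vis, C)  frozen CLIP/DINO features for the same tokens
                pred_pixels,     # (B, N_mask, P) decoder predictions for the 75% masked tokens
                target_pixels,   # (B, N_mask, P) ground-truth RGB patches for the masked tokens
                mimic_weight=1.0):
    # High-level mimic loss on the VISIBLE tokens: pull encoder features toward
    # the pre-trained targets (here a cosine-style loss; an MSE on normalized
    # features would be a common alternative).
    mimic = (1.0 - F.cosine_similarity(visible_feats, teacher_feats, dim=-1)).mean()
    # Low-level reconstruction loss on the MASKED tokens, as in MAE: per-pixel MSE.
    recon = F.mse_loss(pred_pixels, target_pixels)
    # The two targets act on disjoint token partitions, so their gradients do not
    # compete on the same outputs.
    return recon + mimic_weight * mimic

# Illustrative usage with random tensors (shapes are hypothetical):
v = torch.randn(8, 49, 768); t = torch.randn(8, 49, 768)
p = torch.randn(8, 147, 768); g = torch.randn(8, 147, 768)
loss = mr_mae_loss(v, t, p, g)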


References

  • Baevski, A., Hsu, W. N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555.

  • Bai, Y. (2022). Masked autoencoders enable efficient knowledge distillers. arXiv preprint arXiv:2208.12256.

  • Bao, H., Dong, L., & Wei, F. (2021). BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254.

  • Brown, T., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

  • Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650–9660).

  • Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., & Wang, J. (2022). Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026.

  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597–1607). PMLR.

  • Chen, X., Xie, S. & He, K. (2021). An empirical study of training self-supervised vision transformers, In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 9620–9629).

  • Chen, A., Zhang, K., Zhang, R., Wang, Z., Lu, Y., Guo, Y., & Zhang, S. (2023). PIMAE: Point cloud and image interactive masked autoencoders for 3d object detection. In: CVPR 2023.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition (Vol. 1, pp. 886–893). IEEE.

  • Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth \(16\times 16\) words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

  • Fu, K., Gao, P., Liu, S., Zhang, R., Qiao, Y., & Wang, M. (2022). Pos-bert: Point cloud one-stage bert pre-training. arXiv preprint arXiv:2204.00989.

  • Gao, P., Lu, J., Li, H., Mottaghi, R. & Kembhavi, A. (2021). Container: Context aggregation network. arXiv preprint arXiv:2106.01401.

  • Gao, P., Ma, T., Li, H., Dai, J. & Qiao, Y. (2022). MCMAE: Masked convolution meets masked autoencoders. Advances in Neural Information Processing Systems, 35, 35632–35644.

  • Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., & Xu, C. (2022). CMT: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12175–12185).

  • Guo, Z., Zhang, R., Qiu, L., Li, X. & Heng, P. A. (2023). Joint-mae: 2d-3d joint masked autoencoders for 3d point cloud pre-training. In: IJCAI 2023.

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16000–16009).

  • He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729–9738).

  • He, K., Gkioxari, G., Dollár, P. & Girshick, R. (2017). Mask r-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

  • Hou, Z., Sun, F., Chen, Y.-K., Xie, Y. & Kung, S.-Y. (2022). Milan: Masked image pretraining on language assisted representation. arXiv preprint arXiv:2208.06049.

  • Huang, Z., Jin, X., Lu, C., Hou, Q., Cheng, M.M., Fu, D., Shen, X., & Feng, J. (2022). Contrastive masked autoencoders are stronger vision learners. arXiv preprint arXiv:2207.13532.

  • Huang, L., You, S., Zheng, M., Wang, F., Qian, C., & Yamasaki, T. (2022). Green hierarchical vision transformer for masked image modeling. arXiv preprint arXiv:2205.13515.

  • Kakogeorgiou, I., Gidaris, S., Psomas, B., Avrithis, Y., Bursuc, A., Karantzalos, K., & Komodakis, N. (2022). What to hide from your students: Attention-guided masked image modeling. arXiv preprint arXiv:2203.12719.

  • Li, Y., Mao, H., Girshick, R., & He, K. (2022). Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527.

  • Li, X., Wang, W., Yang, L. & Yang, J. (2022). Uniform masking: Enabling mae pre-training for pyramid-based vision transformers with locality. arXiv preprint arXiv:2205.10063.

  • Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., & Qiao, Y. (2022). Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450.

  • Li, W., Xie, J. & Loy, C. C. (2023). Correlational image modeling for self-supervised visual pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15105–15115).

  • Li, Y., Xie, S., Chen, X., Dollar, P., He, K., & Girshick, R. (2021). Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429.

  • Li, Z., Chen, Z., Yang, F., Li, W., Zhu, Y., Zhao, C., Deng, R., Wu, L., Zhao, R., Tang, M., & Wang, J. (2021). MST: Masked self-supervised transformer for visual representation. Advances in Neural Information Processing Systems, 34, 13165–13176.

  • Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117–2125).

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (pp. 740–755). Springer.

  • Liu, J., Huang, X., Liu, Y., & Li, H. (2022). Mixmim: Mixed and masked image modeling for efficient visual representation learning. arXiv preprint arXiv:2205.13137.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022).

  • Loshchilov, I. & Hutter, F. (2018). Fixing weight decay regularization in ADAM. https://openreview.net/pdf?id=rk6qdGgCZ

  • Mu, N., Kirillov, A., Wagner, D., & Xie, S. (2022). Slip: Self-supervision meets language-image pre-training. In European conference on computer vision (pp. 529–544). Springer.

  • Peng, Z., Dong, L., Bao, H., Ye, Q., & Wei, F. (2022). BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366.

  • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G. (2021). Learning transferable visual models from natural language supervision. In: International conference on machine learning (pp. 8748–8763). PMLR.

  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

  • Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

  • Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. In International conference on machine learning (pp. 8821–8831). PMLR.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., & Berg, A. C. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Shi, H., Gao, J., Xu, H., Liang, X., Li, Z., Kong, L., Lee, S., & Kwok, J. T. (2022). Revisiting over-smoothing in bert from the perspective of graph. arXiv preprint arXiv:2202.08625.

  • Shi, Y., Siddharth, N., Torr, P., & Kosiorek, A. R. (2022). Adversarial masking for self-supervised learning. In International conference on machine learning (pp. 20026–20040). PMLR.

  • Wang, L., Liang, F., Li, Y., Zhang, H., Ouyang, W., & Shao, J. (2022). Repre: Improving self-supervised vision transformer with reconstructive pre-training. arXiv preprint arXiv:2201.06857.

  • Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14668–14678).

  • Wei, Y., Hu, H., Xie, Z., Zhang, Z., Cao, Y., Bao, J., Chen, D., & Guo, B. (2022). Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv preprint arXiv:2205.14141.

  • Wei, L., Xie, L., Zhou, W., Li, H. & Tian, Q. (2022). MVP: Multimodality-guided visual pre-training. In European conference on computer vision (pp. 337–353). Springer.

  • Wu, Z., Xiong, Y., Yu, S. X. & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3733–3742).

  • Xiang, W., Yang, H., Huang, D. & Wang, Y. (2023). Denoising diffusion autoencoders are unified self-supervised learners. arXiv preprint arXiv:2303.09769.

  • Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., & Girshick, R. (2021). Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 34, 30392–30400.

  • Xie, J., Li, W., Zhan, X., Liu, Z., Ong, Y. S., & Loy, C. C. (2022). Masked frequency modeling for self-supervised visual pre-training. arXiv preprint arXiv:2206.07706.

  • Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2022). Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9653–9663).

  • Zhang, R., Guo, Z., Gao, P., Fang, R., Zhao, B., Wang, D., Qiao, Y., & Li, H. (2022). Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. In: NeurIPS 2022.

  • Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., & Shum, H. Y. (2022). DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605.

  • Zhang, R., Wang, L., Qiao, Y., Gao, P. & Li, H. (2023). Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In: CVPR 2023.

Acknowledgements

This project is funded in part by the National Natural Science Foundation of China (No. 62206272), by the National Key R&D Program of China (No. 2022ZD0161100), by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Commission (ITC)’s InnoHK, and by the General Research Fund of Hong Kong RGC Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.

Author information

Corresponding authors

Correspondence to Hongsheng Li or Yu Qiao.

Additional information

Communicated by Kaiyang Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gao, P., Lin, Z., Zhang, R. et al. Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking. Int J Comput Vis 132, 1546–1556 (2024). https://doi.org/10.1007/s11263-023-01898-4
