GitHub - huggingface/pytorch-image-models at v1.0.19
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

PyTorch Image Models

What's New

July 23, 2025

  • Add set_input_size() method to EVA models, used by OpenCLIP 3.0.0 to allow resizing for timm based encoder models.
  • Release 1.0.18, needed for PE-Core S & T models in OpenCLIP 3.0.0
  • Fix small typing issue that broke Python 3.9 compat. 1.0.19 patch release.

July 21, 2025

  • ROPE support added to NaFlexViT. All models covered by the EVA base (eva.py), including EVA, EVA02, Meta PE ViT, timm SBB ViT w/ ROPE, and Naver ROPE-ViT, can now be loaded in NaFlexViT when use_naflex=True is passed at model creation time.
  • More Meta PE ViT encoders added, including small/tiny variants, lang variants w/ tiling, and more spatial variants.
  • PatchDropout fixed with NaFlexViT and also w/ EVA models (regression after adding Naver ROPE-ViT)
  • Fix XY order with grid_indexing='xy'. This impacted non-square image use in 'xy' mode (only ROPE-ViT and PE were affected).
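
To show why an 'xy' ordering mix-up tends to surface only with non-square inputs, here is a minimal pure-Python sketch of the two meshgrid conventions ('ij' and 'xy', as offered by torch.meshgrid and numpy.meshgrid). The helper meshgrid_2d is hypothetical and is not timm code. It does not reproduce the actual bug; it only shows that the output shape is transposed between the two modes, which is easy to miss when height equals width.

```python
def meshgrid_2d(a, b, indexing="ij"):
    """Pure-Python stand-in for a two-input meshgrid (hypothetical helper)."""
    if indexing == "ij":
        # Output shape is (len(a), len(b)).
        grid_a = [[ai for _ in b] for ai in a]
        grid_b = [[bj for bj in b] for _ in a]
    elif indexing == "xy":
        # Output shape is (len(b), len(a)): the first two dims are swapped.
        grid_a = [[aj for aj in a] for _ in b]
        grid_b = [[bi for _ in a] for bi in b]
    else:
        raise ValueError(f"unknown indexing: {indexing!r}")
    return grid_a, grid_b


def shape(grid):
    return (len(grid), len(grid[0]))


height, width = 2, 3  # a non-square patch grid
ys, xs = list(range(height)), list(range(width))

# 'ij' takes (rows, cols) order; 'xy' takes (x, y) order. Both give (height, width).
ij_y, ij_x = meshgrid_2d(ys, xs, indexing="ij")
xy_x, xy_y = meshgrid_2d(xs, ys, indexing="xy")

# Passing the 'ij' argument order in 'xy' mode silently transposes the grid.
bad_a, bad_b = meshgrid_2d(ys, xs, indexing="xy")

print(shape(ij_y), shape(xy_x), shape(bad_a))
```

With a square grid all three shapes would be identical, so a shape check alone would not flag the swapped argument order.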

July 7, 2025

  • MobileNet-v5 backbone tweaks for improved Google Gemma 3n behaviour (to pair with updated official weights)
    • Add stem bias (zero'd in updated weights, compat break with old weights)
    • GELU -> GELU (tanh approx). A minor change to be closer to JAX
  • Add two arguments to layer-decay support, a min scale clamp and 'no optimization' scale threshold
  • Add 'Fp32' LayerNorm, RMSNorm, SimpleNorm variants that can be enabled to force computation of norm in float32
  • Some typing, argument cleanup for norm, norm+act layers done with above
  • Support Naver ROPE-ViT (https://github.com/naver-ai/rope-vit) in eva.py, add RotaryEmbeddingMixed module for mixed mode, weights on HuggingFace Hub
model img_size top1 top5 param_count
vit_large_patch16_rope_mixed_ape_224.naver_in1k 224 84.84 97.122 304.4
vit_large_patch16_rope_mixed_224.naver_in1k 224 84.828 97.116 304.2
vit_large_patch16_rope_ape_224.naver_in1k 224 84.65 97.154 304.37
vit_large_patch16_rope_224.naver_in1k 224 84.648 97.122 304.17
vit_base_patch16_rope_mixed_ape_224.naver_in1k 224 83.894 96.754 86.59
vit_base_patch16_rope_mixed_224.naver_in1k 224 83.804 96.712 86.44
vit_base_patch16_rope_ape_224.naver_in1k 224 83.782 96.61 86.59
vit_base_patch16_rope_224.naver_in1k 224 83.718 96.672 86.43
vit_small_patch16_rope_224.naver_in1k 224 81.23 95.022 21.98
vit_small_patch16_rope_mixed_224.naver_in1k 224 81.216 95.022 21.99
vit_small_patch16_rope_ape_224.naver_in1k 224 81.004 95.016 22.06
vit_small_patch16_rope_mixed_ape_224.naver_in1k 224 80.986 94.976 22.06
  • Some cleanup of ROPE modules, helpers, and FX tracing leaf registration
  • Preparing version 1.0.17 release
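
The "GELU -> GELU (tanh approx)" change listed above for the MobileNet-v5 backbone swaps the erf-based GELU for its common tanh approximation. The sketch below compares the two standard formulas using only the Python standard library. The function names and sample points are chosen for illustration and are not part of timm.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU defined with the Gaussian CDF, computed via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


samples = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
max_abs_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in samples)

for x in samples:
    print(f"x={x:+.1f} exact={gelu_exact(x):+.6f} tanh={gelu_tanh(x):+.6f}")
print(f"max abs difference on samples: {max_abs_diff:.2e}")
```

The two curves differ only slightly, which is consistent with the note calling this a minor change. Even so, weights trained with one variant are best evaluated with the same variant.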

June 26, 2025

  • MobileNetV5 backbone (w/ encoder only variant) for Gemma 3n image encoder
  • Version 1.0.16 released

June 23, 2025

  • Add F.grid_sample based 2D and factorized pos embed resize to NaFlexViT. Faster when there are many different sizes (based on example by https://github.com/stas-sl).
  • Further speed up patch embed resample by replacing vmap with matmul (based on snippet by https://github.com/stas-sl).
  • Add 3 initial native aspect NaFlexViT checkpoints created while testing, ImageNet-1k and 3 different pos embed configs w/ same hparams.
Model Top-1 Acc Top-5 Acc Params (M) Eval Seq Len
naflexvit_base_patch16_par_gap.e300_s576_in1k 83.67 96.45 86.63 576
naflexvit_base_patch16_parfac_gap.e300_s576_in1k 83.63 96.41 86.46 576
naflexvit_base_patch16_gap.e300_s576_in1k 83.50 96.46 86.63 576
  • Support gradient checkpointing for forward_intermediates and fix some checkpointing bugs. Thanks https://github.com/brianhou0208
  • Add 'corrected weight decay' (https://arxiv.org/abs/2506.02285) as option to AdamW (legacy), Adopt, Kron, Adafactor (BV), Lamb, LaProp, Lion, NadamW, RmsPropTF, SGDW optimizers
  • Switch PE (perception encoder) ViT models to use native timm weights instead of remapping on the fly
  • Fix cuda stream bug in prefetch loader
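
The "Eval Seq Len" column in the NaFlexViT table above counts patch tokens. For example, a 24 x 24 grid of 16x16 patches (a 384x384 image) gives 576 tokens. The following sketch shows one way a native-aspect image could be mapped to a patch grid within such a token budget. The function fit_patch_grid and its rounding strategy are assumptions for illustration and do not reproduce timm's actual NaFlex resize logic.

```python
import math


def fit_patch_grid(height, width, max_seq_len=576):
    """Return (rows, cols) of a patch grid that roughly keeps the image
    aspect ratio while satisfying rows * cols <= max_seq_len.

    Hypothetical helper, not timm's implementation.
    """
    aspect = width / height
    # Real-valued solution of rows * cols = max_seq_len with cols / rows = aspect,
    # rounded down so the budget is never exceeded.
    rows = max(1, math.floor(math.sqrt(max_seq_len / aspect)))
    cols = max(1, math.floor(rows * aspect))
    # Extreme aspect ratios can hit the rows >= 1 floor, so clamp cols to the budget.
    cols = min(cols, max_seq_len // rows)
    return rows, cols


patch_size = 16
for h, w in [(384, 384), (480, 640), (10, 10000)]:
    rows, cols = fit_patch_grid(h, w)
    print(f"{h}x{w} -> grid {rows}x{cols} ({rows * cols} tokens), "
          f"resized to {rows * patch_size}x{cols * patch_size}")
```

Rounding down wastes part of the budget for some aspect ratios. A production pipeline might search nearby grid sizes to use more of it.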

June 5, 2025

  • Initial NaFlexVit model code. NaFlexVit is a Vision Transformer with:
    1. Encapsulated embedding and position encoding in a single module
    2. Support for nn.Linear patch embedding on pre-patchified (dictionary) inputs
    3. Support for NaFlex variable aspect, variable resolution (SigLip-2: https://arxiv.org/abs/2502.14786)
    4. Support for FlexiViT variable patch size (https://arxiv.org/abs/2212.08013)
    5. Support for NaViT fractional/factorized position embedding (https://arxiv.org/abs/2307.06304)
  • Existing vit models in vision_transformer.py can be loaded into the NaFlexVit model by adding the use_naflex=True flag to create_model
    • Some native weights coming soon
  • A full NaFlex data pipeline is available that allows training / fine-tuning / evaluating with variable aspect / size images
    • To enable in train.py and validate.py add the --naflex-loader arg, must be used with a NaFlexVit
  • To evaluate an existing (classic) ViT loaded in NaFlexVit model w/ NaFlex data pipe:
    • python validate.py /imagenet --amp -j 8 --model vit_base_patch16_224 --model-kwargs use_naflex=True --naflex-loader --naflex-max-seq-len 256
  • Training has some extra argument features worth noting
    • The --naflex-train-seq-lens argument specifies which sequence lengths to randomly pick from per batch during training
    • The --naflex-max-seq-len argument sets the target sequence length for validation
    • Adding --model-kwargs enable_patch_interpolator=True --naflex-patch-sizes 12 16 24 will enable random patch size selection per-batch w/ interpolation
    • The --naflex-loss-scale arg changes the loss scaling mode per batch relative to the batch size; timm NaFlex loading changes the batch size for each seq len
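
The notes above say that NaFlex loading changes the batch size for each sequence length. One way to picture this is a fixed token budget per batch, so shorter sequences get larger batches. The sketch below assumes that model. The function batch_size_for_seq_len, the token budget, and the divisor rounding are hypothetical and are not taken from timm's loader.

```python
import random


def batch_size_for_seq_len(seq_len, tokens_per_batch, divisor=8):
    """Batch size (a multiple of `divisor`, floored at `divisor`) chosen so
    that seq_len * batch_size stays close to `tokens_per_batch`.

    Hypothetical helper for illustration only.
    """
    raw = tokens_per_batch // seq_len
    return max(divisor, (raw // divisor) * divisor)


train_seq_lens = [128, 256, 576]      # candidate lengths, one picked per batch
tokens_per_batch = 256 * 576          # assumed budget: 256 samples at seq len 576

rng = random.Random(0)
for step in range(3):
    seq_len = rng.choice(train_seq_lens)
    batch_size = batch_size_for_seq_len(seq_len, tokens_per_batch)
    print(f"step {step}: seq_len={seq_len} batch_size={batch_size}")
```

Because the number of samples per batch varies under such a scheme, a per-batch loss scaling option like the one described above is a natural companion.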

May 28, 2025

Feb 21, 2025

  • SigLIP 2 ViT image encoders added (https://huggingface.co/collections/timm/siglip-2-67b8e72ba08b09dd97aecaf9)
    • Variable resolution / aspect NaFlex versions are a WIP
  • Add 'SO150M2' ViT weights trained with SBB recipes, great results, better for ImageNet than previous attempt w/ less training.
    • vit_so150m2_patch16_reg1_gap_448.sbb_e200_in12k_ft_in1k - 88.1% top-1
    • vit_so150m2_patch16_reg1_gap_384.sbb_e200_in12k_ft_in1k - 87.9% top-1
    • vit_so150m2_patch16_reg1_gap_256.sbb_e200_in12k_ft_in1k - 87.3% top-1
    • vit_so150m2_patch16_reg4_gap_256.sbb_e200_in12k
  • Updated InternViT-300M '2.5' weights
  • Release 1.0.15

Feb 1, 2025

  • FYI PyTorch 2.6 & Python 3.13 are tested and working w/ current main and released version of timm

Jan 27, 2025

Jan 19, 2025

  • Fix loading of LeViT safetensor weights, remove conversion code which should have been deactivated
  • Add 'SO150M' ViT weights trained with SBB recipes, decent results, but not optimal shape for ImageNet-12k/1k pretrain/ft
    • vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k_ft_in1k - 86.7% top-1
    • vit_so150m_patch16_reg4_gap_384.sbb_e250_in12k_ft_in1k - 87.4% top-1
    • vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k
  • Misc typing, typo, etc. cleanup
  • 1.0.14 release to get above LeViT fix out

Jan 9, 2025

  • Add support to train and validate in pure bfloat16 or float16
  • wandb project name arg added by https://github.com/caojiaolong, use arg.experiment for name
  • Fix old issue w/ checkpoint saving not working on filesystem w/o hard-link support (e.g. FUSE fs mounts)
  • 1.0.13 release
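
The checkpoint saving fix above concerns filesystems without hard-link support. A common pattern for this situation is to try a hard link first and fall back to a plain copy. The sketch below shows that pattern with the standard library only. The helper link_or_copy and the file names are illustrative assumptions, not timm's actual checkpoint saver code.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dst):
    """Create `dst` as a hard link to `src`, falling back to a copy when
    linking fails. Returns the method that was used."""
    if os.path.exists(dst):
        os.remove(dst)
    try:
        os.link(src, dst)
        return "link"
    except OSError:
        # Raised on filesystems without hard-link support, among other cases.
        shutil.copy2(src, dst)
        return "copy"


with tempfile.TemporaryDirectory() as tmp:
    last = os.path.join(tmp, "last.pth.tar")
    best = os.path.join(tmp, "model_best.pth.tar")
    with open(last, "wb") as f:
        f.write(b"checkpoint-bytes")

    method = link_or_copy(last, best)
    with open(best, "rb") as f:
        restored = f.read()

    print(method, restored == b"checkpoint-bytes")
```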

Jan 6, 2025

  • Add torch.utils.checkpoint.checkpoint() wrapper in timm.models that defaults use_reentrant=False, unless TIMM_REENTRANT_CKPT=1 is set in env.
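
The wrapper described above picks a default for use_reentrant from the environment. The sketch below illustrates that pattern without requiring PyTorch: the _backend argument stands in for torch.utils.checkpoint.checkpoint, and the way TIMM_REENTRANT_CKPT is parsed here is an assumption, since the notes only say that setting it to 1 switches the default.

```python
import os


def _default_use_reentrant():
    """Environment-controlled default (parsing rule is an assumption)."""
    return os.environ.get("TIMM_REENTRANT_CKPT", "0") == "1"


def checkpoint(function, *args, use_reentrant=None, _backend=None, **kwargs):
    """Fill in `use_reentrant` when the caller did not set it, then delegate.

    `_backend` is a stand-in for torch.utils.checkpoint.checkpoint so that
    this sketch runs without PyTorch installed.
    """
    if use_reentrant is None:
        use_reentrant = _default_use_reentrant()
    return _backend(function, *args, use_reentrant=use_reentrant, **kwargs)


def fake_backend(function, *args, use_reentrant, **kwargs):
    # Stand-in: call the function and report which mode was requested.
    return function(*args, **kwargs), use_reentrant


os.environ.pop("TIMM_REENTRANT_CKPT", None)
result_default = checkpoint(lambda x: x * 2, 21, _backend=fake_backend)

os.environ["TIMM_REENTRANT_CKPT"] = "1"
result_env = checkpoint(lambda x: x * 2, 21, _backend=fake_backend)
os.environ.pop("TIMM_REENTRANT_CKPT", None)

# An explicit argument always wins over the environment default.
result_explicit = checkpoint(lambda x: x * 2, 21, use_reentrant=True, _backend=fake_backend)

print(result_default, result_env, result_explicit)
```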

Dec 31, 2024

Nov 28, 2024

Nov 12, 2024

  • Optimizer factory refactor
    • New factory works by registering optimizers using an OptimInfo dataclass w/ some key traits
    • Add list_optimizers, get_optimizer_class, get_optimizer_info to reworked create_optimizer_v2 fn to explore optimizers, get info or class
    • Deprecate optim.optim_factory, move fns to optim/_optim_factory.py and optim/_param_groups.py and encourage import via timm.optim
  • Add Adopt (https://github.com/iShohei220/adopt) optimizer
  • Add 'Big Vision' variant of Adafactor (https://github.com/google-research/big_vision/blob/main/big_vision/optax.py) optimizer
  • Fix original Adafactor to pick better factorization dims for convolutions
  • Tweak LAMB optimizer with some improvements in torch.where functionality since original, refactor clipping a bit
  • Dynamic img size support in vit, deit, eva improved to support resize from non-square patch grids, thanks https://github.com/wojtke
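
The optimizer factory refactor above is described as registering optimizers through an info dataclass that can then be listed and queried. The toy registry below sketches that general pattern in plain Python. ToyOptimInfo, its fields, and the helper functions are hypothetical stand-ins and do not mirror the real OptimInfo fields or the timm.optim function signatures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToyOptimInfo:
    """Toy registry entry; the fields are assumptions for illustration."""
    name: str
    factory: Callable[..., object]
    description: str = ""
    has_momentum: bool = False


_REGISTRY: Dict[str, ToyOptimInfo] = {}


def register(info: ToyOptimInfo) -> None:
    # Store under a lower-case key so lookups are case-insensitive.
    _REGISTRY[info.name.lower()] = info


def list_registered() -> List[str]:
    return sorted(_REGISTRY)


def get_info(name: str) -> ToyOptimInfo:
    return _REGISTRY[name.lower()]


def create(name: str, **kwargs) -> object:
    # Look up the entry and build the optimizer from its factory.
    return get_info(name).factory(**kwargs)


class ToySGD:
    """Minimal stand-in for an optimizer class."""

    def __init__(self, lr: float = 0.1, momentum: float = 0.0):
        self.lr, self.momentum = lr, momentum


register(ToyOptimInfo("toy_sgd", ToySGD, "plain SGD stand-in", has_momentum=True))

opt = create("toy_sgd", lr=0.05, momentum=0.9)
print(list_registered(), get_info("toy_sgd").has_momentum, opt.lr)
```

Keeping traits such as has_momentum in the registry entry lets a factory validate or adapt arguments without importing each optimizer's internals.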

Oct 31, 2024

  • Add a set of new very well trained ResNet & ResNet-V2 18/34 (basic block) weights. See https://huggingface.co/blog/rwightman/resnet-trick-or-treat

Oct 19, 2024

  • Clean up torch amp usage to avoid cuda specific calls, merge support for Ascend (NPU) devices from MengqingCao that should work now in PyTorch 2.5 w/ new device extension autoloading feature. Tested Intel Arc (XPU) in PyTorch 2.5 too and it (mostly) worked.

Oct 16, 2024

Oct 14, 2024

  • Pre-activation (ResNetV2) version of 18/18d/34/34d ResNet model defs added by request (weights pending)
  • Release 1.0.10

Oct 11, 2024

  • MambaOut (https://github.com/yuweihao/MambaOut) model & weights added. A cheeky take on SSM vision models w/o the SSM (essentially ConvNeXt w/ gating). A mix of original weights + custom variations & weights.
model img_size top1 top5 param_count
mambaout_base_plus_rw.sw_e150_r384_in12k_ft_in1k 384 87.506 98.428 101.66
mambaout_base_plus_rw.sw_e150_in12k_ft_in1k 288 86.912 98.236 101.66
mambaout_base_plus_rw.sw_e150_in12k_ft_in1k 224 86.632 98.156 101.66
mambaout_base_tall_rw.sw_e500_in1k 288 84.974 97.332 86.48
mambaout_base_wide_rw.sw_e500_in1k 288 84.962 97.208 94.45
mambaout_base_short_rw.sw_e500_in1k 288 84.832 97.27 88.83
mambaout_base.in1k 288 84.72 96.93 84.81
mambaout_small_rw.sw_e450_in1k 288 84.598 97.098 48.5
mambaout_small.in1k 288 84.5 96.974 48.49
mambaout_base_wide_rw.sw_e500_in1k 224 84.454 96.864 94.45
mambaout_base_tall_rw.sw_e500_in1k 224 84.434 96.958 86.48
mambaout_base_short_rw.sw_e500_in1k 224 84.362 96.952 88.83
mambaout_base.in1k 224 84.168 96.68 84.81
mambaout_small.in1k 224 84.086 96.63 48.49
mambaout_small_rw.sw_e450_in1k 224 84.024 96.752 48.5
mambaout_tiny.in1k 288 83.448 96.538 26.55
mambaout_tiny.in1k 224 82.736 96.1 26.55
mambaout_kobe.in1k 288 81.054 95.718 9.14
mambaout_kobe.in1k 224 79.986 94.986 9.14
mambaout_femto.in1k 288 79.848 95.14 7.3
mambaout_femto.in1k 224 78.87 94.408 7.3
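
Several rows in the table above report the same weights evaluated at two image sizes. As a small worked example, the sketch below parses four of those rows (copied verbatim), splits each model name at the first dot into architecture and pretrained tag, and computes the top-1 change from 224 to 288. The parsing code is illustrative only and is not a timm utility.

```python
rows = """
mambaout_base.in1k 288 84.72 96.93 84.81
mambaout_base.in1k 224 84.168 96.68 84.81
mambaout_tiny.in1k 288 83.448 96.538 26.55
mambaout_tiny.in1k 224 82.736 96.1 26.55
""".split("\n")

by_model = {}
for line in rows:
    if not line.strip():
        continue
    name, img_size, top1, top5, param_count = line.split()
    arch, _, tag = name.partition(".")  # 'architecture.pretrained_tag'
    by_model.setdefault((arch, tag), {})[int(img_size)] = float(top1)

gains = {}
for (arch, tag), scores in by_model.items():
    # Difference in top-1 accuracy (percentage points) between eval sizes.
    gains[arch] = round(scores[288] - scores[224], 3)
    print(f"{arch} ({tag}): +{gains[arch]} top-1 from 224 to 288")
```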

Sept 2024

Aug 21, 2024

  • Updated SBB ViT models trained on ImageNet-12k and fine-tuned on ImageNet-1k, challenging quite a number of much larger, slower models
model top1 top5 param_count img_size
vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k 87.438 98.256 64.11 384
vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k 86.608 97.934 64.11 256
vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k 86.594 98.02 60.4 384
vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k 85.734 97.61 60.4 256
  • MobileNet-V1 1.25, EfficientNet-B1, & ResNet50-D weights w/ MNV4 baseline challenge recipe
model top1 top5 param_count img_size
resnet50d.ra4_e3600_r224_in1k 81.838 95.922 25.58 288
efficientnet_b1.ra4_e3600_r240_in1k 81.440 95.700 7.79 288
resnet50d.ra4_e3600_r224_in1k 80.952 95.384 25.58 224
efficientnet_b1.ra4_e3600_r240_in1k 80.406 95.152 7.79 240
mobilenetv1_125.ra4_e3600_r224_in1k 77.600 93.804 6.27 256
mobilenetv1_125.ra4_e3600_r224_in1k 76.924 93.234 6.27 224
  • Add SAM2 (HieraDet) backbone arch & weight loading support
  • Add Hiera Small weights trained w/ abswin pos embed on in12k & fine-tuned on 1k
model top1 top5 param_count
hiera_small_abswin_256.sbb2_e200_in12k_ft_in1k 84.912 97.260 35.01
hiera_small_abswin_256.sbb2_pd_e200_in12k_ft_in1k 84.560 97.106 35.01

Aug 8, 2024

July 28, 2024

  • Add mobilenet_edgetpu_v2_m weights w/ ra4 mnv4-small based recipe. 80.1% top-1 @ 224 and 80.7 @ 256.
  • Release 1.0.8

July 26, 2024

  • More MobileNet-v4 weights, ImageNet-12k pretrain w/ fine-tunes, and anti-aliased ConvLarge models
model top1 top1_err top5 top5_err param_count img_size
mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k 84.99 15.01 97.294 2.706 32.59 544
mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k 84.772 15.228 97.344 2.656 32.59 480
mobilenetv4_conv_aa_large.e230_r448_in12k_ft_in1k 84.64 15.36 97.114 2.886 32.59 448
mobilenetv4_conv_aa_large.e230_r384_in12k_ft_in1k 84.314 15.686 97.102 2.898 32.59 384
mobilenetv4_conv_aa_large.e600_r384_in1k 83.824 16.176 96.734 3.266 32.59 480
mobilenetv4_conv_aa_large.e600_r384_in1k 83.244 16.756 96.392 3.608 32.59 384
mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k 82.99 17.01 96.67 3.33 11.07 320
mobilenetv4_hybrid_medium.e200_r256_in12k_ft_in1k 82.364 17.636 96.256 3.744 11.07 256
model top1 top1_err top5 top5_err param_count img_size
efficientnet_b0.ra4_e3600_r224_in1k 79.364 20.636 94.754 5.246 5.29 256
efficientnet_b0.ra4_e3600_r224_in1k 78.584 21.416 94.338 5.662 5.29 224
mobilenetv1_100h.ra4_e3600_r224_in1k 76.596 23.404 93.272 6.728 5.28 256
mobilenetv1_100.ra4_e3600_r224_in1k 76.094 23.906 93.004 6.996 4.23 256
mobilenetv1_100h.ra4_e3600_r224_in1k 75.662 24.338 92.504 7.496 5.28 224
mobilenetv1_100.ra4_e3600_r224_in1k 75.382 24.618 92.312 7.688 4.23 224
  • Prototype of set_input_size() added to vit and swin v1/v2 models to allow changing image size, patch size, window size after model creation.
  • Improved support in swin for different size handling, in addition to set_input_size, always_partition and strict_img_size args have been added to __init__ to allow more flexible input size constraints
  • Fix out of order indices info for intermediate 'Getter' feature wrapper, check out of range indices for same.
  • Add several tiny < .5M param models for testing that are actually trained on ImageNet-1k
model top1 top1_err top5 top5_err param_count img_size crop_pct
test_efficientnet.r160_in1k 47.156 52.844 71.726 28.274 0.36 192 1.0
test_byobnet.r160_in1k 46.698 53.302 71.674 28.326 0.46 192 1.0
test_efficientnet.r160_in1k 46.426 53.574 70.928 29.072 0.36 160 0.875
test_byobnet.r160_in1k 45.378 54.622 70.572 29.428 0.46 160 0.875
test_vit.r160_in1k 42.0 58.0 68.664 31.336 0.37 192 1.0
test_vit.r160_in1k 40.822 59.178 67.212 32.788 0.37 160 0.875
  • Fix vit reg token init, thanks Promisery
  • Other misc fixes

June 24, 2024

  • 3 more MobileNetV4 hybrid weights with different MQA weight init scheme
model top1 top1_err top5 top5_err param_count img_size
mobilenetv4_hybrid_large.ix_e600_r384_in1k 84.356 15.644 96.892 3.108 37.76 448
mobilenetv4_hybrid_large.ix_e600_r384_in1k 83.990 16.010 96.702 3.298 37.76 384
mobilenetv4_hybrid_medium.ix_e550_r384_in1k 83.394 16.606 96.760 3.240 11.07 448
mobilenetv4_hybrid_medium.ix_e550_r384_in1k 82.968 17.032 96.474 3.526 11.07 384
mobilenetv4_hybrid_medium.ix_e550_r256_in1k 82.492 17.508 96.278 3.722 11.07 320
mobilenetv4_hybrid_medium.ix_e550_r256_in1k 81.446 18.554 95.704 4.296 11.07 256
  • florence2 weight loading in DaViT model

June 12, 2024

  • MobileNetV4 models and initial set of timm trained weights added:
model top1 top1_err top5 top5_err param_count img_size
mobilenetv4_hybrid_large.e600_r384_in1k 84.266 15.734 96.936 3.064 37.76 448
mobilenetv4_hybrid_large.e600_r384_in1k 83.800 16.200 96.770 3.230 37.76 384
mobilenetv4_conv_large.e600_r384_in1k 83.392 16.608 96.622 3.378 32.59 448
mobilenetv4_conv_large.e600_r384_in1k 82.952 17.048 96.266 3.734 32.59 384
mobilenetv4_conv_large.e500_r256_in1k 82.674 17.326 96.31 3.69 32.59 320
mobilenetv4_conv_large.e500_r256_in1k 81.862 18.138 95.69 4.31 32.59 256
mobilenetv4_hybrid_medium.e500_r224_in1k 81.276 18.724 95.742 4.258 11.07 256
mobilenetv4_conv_medium.e500_r256_in1k 80.858 19.142 95.768 4.232 9.72 320
mobilenetv4_hybrid_medium.e500_r224_in1k 80.442 19.558 95.38 4.62 11.07 224
mobilenetv4_conv_blur_medium.e500_r224_in1k 80.142 19.858 95.298 4.702 9.72 256
mobilenetv4_conv_medium.e500_r256_in1k 79.928 20.072 95.184 4.816 9.72 256
mobilenetv4_conv_medium.e500_r224_in1k 79.808 20.192 95.186 4.814 9.72 256
mobilenetv4_conv_blur_medium.e500_r224_in1k 79.438 20.562 94.932 5.068 9.72 224
mobilenetv4_conv_medium.e500_r224_in1k 79.094 20.906 94.77 5.23 9.72 224
mobilenetv4_conv_small.e2400_r224_in1k 74.616 25.384 92.072 7.928 3.77 256
mobilenetv4_conv_small.e1200_r224_in1k 74.292 25.708 92.116 7.884 3.77 256
mobilenetv4_conv_small.e2400_r224_in1k 73.756 26.244 91.422 8.578 3.77 224
mobilenetv4_conv_small.e1200_r224_in1k 73.454 26.546 91.34 8.66 3.77 224
  • Apple MobileCLIP (https://arxiv.org/pdf/2311.17049, FastViT and ViT-B) image tower model support & weights added (part of OpenCLIP support).
  • ViTamin (https://arxiv.org/abs/2404.02132) CLIP image tower model & weights added (part of OpenCLIP support).
  • OpenAI CLIP Modified ResNet image tower modelling & weight support (via ByobNet). Refactor AttentionPool2d.

May 14, 2024

  • Support loading PaliGemma jax weights into SigLIP ViT models with average pooling.
  • Add Hiera models from Meta (https://github.com/facebookresearch/hiera).
  • Add normalize= flag for transforms, return non-normalized torch.Tensor with original dtype (for chug)
  • Version 1.0.3 release

May 11, 2024

  • Searching for Better ViT Baselines (For the GPU Poor) weights and vit variants released. Exploring model shapes between Tiny and Base.
model top1 top5 param_count img_size
vit_mediumd_patch16_reg4_gap_256.sbb_in12k_ft_in1k 86.202 97.874 64.11 256
vit_betwixt_patch16_reg4_gap_256.sbb_in12k_ft_in1k 85.418 97.48 60.4 256
vit_mediumd_patch16_rope_reg1_gap_256.sbb_in1k 84.322 96.812 63.95 256
vit_betwixt_patch16_rope_reg4_gap_256.sbb_in1k 83.906 96.684 60.23 256
vit_base_patch16_rope_reg1_gap_256.sbb_in1k 83.866 96.67 86.43 256
vit_medium_patch16_rope_reg1_gap_256.sbb_in1k 83.81 96.824 38.74 256
vit_betwixt_patch16_reg4_gap_256.sbb_in1k 83.706 96.616 60.4 256
vit_betwixt_patch16_reg1_gap_256.sbb_in1k 83.628 96.544 60.4 256
vit_medium_patch16_reg4_gap_256.sbb_in1k 83.47 96.622 38.88 256
vit_medium_patch16_reg1_gap_256.sbb_in1k 83.462 96.548 38.88 256
vit_little_patch16_reg4_gap_256.sbb_in1k 82.514 96.262 22.52 256
vit_wee_patch16_reg1_gap_256.sbb_in1k 80.256 95.360 13.42 256
vit_pwee_patch16_reg1_gap_256.sbb_in1k 80.072 95.136 15.25 256
vit_mediumd_patch16_reg4_gap_256.sbb_in12k N/A N/A 64.11 256
vit_betwixt_patch16_reg4_gap_256.sbb_in12k N/A N/A 60.4 256
