

\n","updatedAt":"2025-04-08T09:38:14.569Z","author":{"_id":"65d66b494bbd0d92b641cdbb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d66b494bbd0d92b641cdbb/6-7dm7B-JxcoS1QlCPdMN.jpeg","fullname":"Andres Marafioti","name":"andito","type":"user","isPro":false,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":483}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.9228962659835815},"editors":["andito"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/65d66b494bbd0d92b641cdbb/6-7dm7B-JxcoS1QlCPdMN.jpeg"],"reactions":[{"reaction":"πŸš€","users":["orrzohar","AdinaY","onuralpszr","Aurelien-Morgan","iky1e","Inflammable1230"],"count":6},{"reaction":"πŸ”₯","users":["Jaward","Inflammable1230","pierrci"],"count":3}],"isReport":false}},{"id":"67f560f8612ec3067a466c36","author":{"_id":"65fc5dca9c557fcbc68a664e","avatarUrl":"/avatars/fbd2635ff79f4053f72b90db374f4b99.svg","fullname":"Nurmyrat","name":"Nurmyrat1998","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false},"createdAt":"2025-04-08T17:46:32.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"ΠŸΡ€ΠΈΠ²Π΅Ρ‚ ","html":"

ΠŸΡ€ΠΈΠ²Π΅Ρ‚

\n","updatedAt":"2025-04-08T17:46:32.520Z","author":{"_id":"65fc5dca9c557fcbc68a664e","avatarUrl":"/avatars/fbd2635ff79f4053f72b90db374f4b99.svg","fullname":"Nurmyrat","name":"Nurmyrat1998","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"ru","probability":0.9691303968429565},"editors":["Nurmyrat1998"],"editorAvatarUrls":["/avatars/fbd2635ff79f4053f72b90db374f4b99.svg"],"reactions":[{"reaction":"πŸ‘","users":["parlorsky"],"count":1}],"isReport":false}},{"id":"67f56b4abd0e5f3893ae98e7","author":{"_id":"653a7d0732bd4db35d6f0067","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/KSsNVRk4DwaTLJgoYwApj.jpeg","fullname":"Ihor VItenko","name":"Strongich","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false},"createdAt":"2025-04-08T18:30:34.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Based on your findings in Section 3.4 that \"excessive CoT data harms compact model performance\", would you expect the same effect when CoT reasoning is learned through RL in compact VLMs? Or is this relationship not obviously transferable and would require separate testing?","html":"

Based on your findings in Section 3.4 that \"excessive CoT data harms compact model performance\", would you expect the same effect when CoT reasoning is learned through RL in compact VLMs? Or is this relationship not obviously transferable and would require separate testing?

\n","updatedAt":"2025-04-08T18:30:34.571Z","author":{"_id":"653a7d0732bd4db35d6f0067","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/KSsNVRk4DwaTLJgoYwApj.jpeg","fullname":"Ihor VItenko","name":"Strongich","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9726264476776123},"editors":["Strongich"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/KSsNVRk4DwaTLJgoYwApj.jpeg"],"reactions":[],"isReport":false},"replies":[{"id":"67f57c6e42d840e0e712e69c","author":{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","fullname":"Orr Zohar","name":"orrzohar","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":97},"createdAt":"2025-04-08T19:43:42.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"We did not try RL-style post-training, so it is hard to draw a definitive conclusion. However, larger LLMs (e.g., S1) can use SFT on CoT and succeed in 'distilling' such reasoning from CoT data. \n\nMy intuition is that these small models do not have that \"emergent\" property to learn to reason -- which is why CoT distillation was not helpful. ","html":"

We did not try RL-style post-training, so it is hard to draw a definitive conclusion. However, larger LLMs (e.g., S1) can use SFT on CoT and succeed in 'distilling' such reasoning from CoT data.

\n

My intuition is that these small models do not have that \"emergent\" property to learn to reason -- which is why CoT distillation was not helpful.

\n","updatedAt":"2025-04-08T19:43:42.197Z","author":{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","fullname":"Orr Zohar","name":"orrzohar","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":97}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9754781723022461},"editors":["orrzohar"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg"],"reactions":[{"reaction":"πŸ‘","users":["Jaward","Strongich","Inflammable1230"],"count":3}],"isReport":false,"parentCommentId":"67f56b4abd0e5f3893ae98e7"}},{"id":"67f647d953800b079151c0c6","author":{"_id":"653a7d0732bd4db35d6f0067","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/KSsNVRk4DwaTLJgoYwApj.jpeg","fullname":"Ihor VItenko","name":"Strongich","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false},"createdAt":"2025-04-09T10:11:37.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Great answer, much appreciated! I was reading through Section 3.1 about \"Learned Tokens\" and noticed you mentioned they work better than raw text tokens. Could you explain how these learned tokens actually work, or point me to a specific paper where I can learn more about this? I'd like to understand the mechanics behind them (: ","html":"

Great answer, much appreciated! I was reading through Section 3.1 about \"Learned Tokens\" and noticed you mentioned they work better than raw text tokens. Could you explain how these learned tokens actually work, or point me to a specific paper where I can learn more about this? I'd like to understand the mechanics behind them (:

\n","updatedAt":"2025-04-09T10:11:37.722Z","author":{"_id":"653a7d0732bd4db35d6f0067","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/KSsNVRk4DwaTLJgoYwApj.jpeg","fullname":"Ihor VItenko","name":"Strongich","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9805828332901001},"editors":["Strongich"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/KSsNVRk4DwaTLJgoYwApj.jpeg"],"reactions":[],"isReport":false,"parentCommentId":"67f56b4abd0e5f3893ae98e7"}},{"id":"67f6da8881eb6175bfdca899","author":{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","fullname":"Orr Zohar","name":"orrzohar","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":97},"createdAt":"2025-04-09T20:37:28.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"We add `` to the tokenizer, so these get unique tokens/embeddings in the tokenizer, and they are optimized during training. \nThis is different then using the naitive text tokenizer which: `` -> `<`,`row`,`_`,`{}`,`_`,`col`,`_`,`{}`,`>` so instead of treating these as on-learable string representations, we basically learn new tokens to represent the spatial position of each image patch.\n","html":"

We add &lt;row_{}_col_{}&gt; to the tokenizer, so these get unique tokens/embeddings in the tokenizer, and they are optimized during training.
This is different then using the naitive text tokenizer which: &lt;row_{}_col_{}&gt; -&gt; &lt;,row,_,{},_,col,_,{},&gt; so instead of treating these as on-learable string representations, we basically learn new tokens to represent the spatial position of each image patch.

\n","updatedAt":"2025-04-09T20:37:28.743Z","author":{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","fullname":"Orr Zohar","name":"orrzohar","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":97}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.804935097694397},"editors":["orrzohar"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg"],"reactions":[{"reaction":"🀯","users":["Strongich"],"count":1},{"reaction":"πŸ‘","users":["Strongich"],"count":1}],"isReport":false,"parentCommentId":"67f56b4abd0e5f3893ae98e7"}}]},{"id":"67f720f1262969d32f241df6","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":286},"createdAt":"2025-04-10T01:37:53.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. \n\nThe following papers were recommended by the Semantic Scholar API \n\n* [Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model](https://huggingface.co/papers/2503.21782) (2025)\n* [Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI](https://huggingface.co/papers/2502.17092) (2025)\n* [SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding](https://huggingface.co/papers/2503.18943) (2025)\n* [Small Vision-Language Models: A Survey on Compact Architectures and Techniques](https://huggingface.co/papers/2503.10665) (2025)\n* [FCoT-VL:Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression](https://huggingface.co/papers/2502.18512) (2025)\n* [Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation](https://huggingface.co/papers/2502.13145) (2025)\n* [Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection](https://huggingface.co/papers/2503.11794) (2025)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

\n

The following papers were recommended by the Semantic Scholar API

\n\n

Please give a thumbs up to this comment if you found it helpful!

\n

If you want recommendations for any Paper on Hugging Face checkout this Space

\n

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: \n\n@librarian-bot\n\t recommend

\n","updatedAt":"2025-04-10T01:37:53.378Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":286}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6712037920951843},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}},{"id":"68353d07e759f596d01ffc87","author":{"_id":"6813ee19c9b224a738fea856","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g1uPHIKEgWe1ftHGHbo_U.png","fullname":"YJ","name":"yjh415","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false},"createdAt":"2025-05-27T04:18:15.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"an audio overview (~17 minutes): https://youtu.be/FWkAxDz0-4g\nby the way, the team's demo video is very impressive!","html":"

an audio overview (~17 minutes): https://youtu.be/FWkAxDz0-4g
by the way, the team's demo video is very impressive!

\n","updatedAt":"2025-05-27T04:18:15.349Z","author":{"_id":"6813ee19c9b224a738fea856","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g1uPHIKEgWe1ftHGHbo_U.png","fullname":"YJ","name":"yjh415","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8380816578865051},"editors":["yjh415"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/g1uPHIKEgWe1ftHGHbo_U.png"],"reactions":[],"isReport":false}},{"id":"69031eed3d61c223e6bd9b85","author":{"_id":"643795574aacf7bf787351ad","avatarUrl":"/avatars/04d5e6a8acaf147604967bf8af92c0ae.svg","fullname":"Huang Chi En","name":"josefph","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false},"createdAt":"2025-10-30T08:16:45.000Z","type":"comment","data":{"edited":true,"hidden":false,"latest":{"raw":"Unfortunately, smolvlm-256M-Instruct, 4 times of less params then internvl-1B, still slower then Internvl-v2-1B (event i enable flash attention, bf16) ?\n \nIt seems less input token & dimension didn't contribute to the inference speed . Does anyone plz give me some suggestion about this issue ?\nsee github issue : https://github.com/huggingface/transformers/issues/41947","html":"

Unfortunately, smolvlm-256M-Instruct, 4 times of less params then internvl-1B, still slower then Internvl-v2-1B (event i enable flash attention, bf16) ?

\n

It seems less input token &amp; dimension didn't contribute to the inference speed . Does anyone plz give me some suggestion about this issue ?
see github issue : https://github.com/huggingface/transformers/issues/41947

\n","updatedAt":"2025-10-30T08:21:01.621Z","author":{"_id":"643795574aacf7bf787351ad","avatarUrl":"/avatars/04d5e6a8acaf147604967bf8af92c0ae.svg","fullname":"Huang Chi En","name":"josefph","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false}},"numEdits":1,"identifiedLanguage":{"language":"en","probability":0.7348234057426453},"editors":["josefph"],"editorAvatarUrls":["/avatars/04d5e6a8acaf147604967bf8af92c0ae.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2504.05299","authors":[{"_id":"67f4cf5b504263bce1236d87","user":{"_id":"65d66b494bbd0d92b641cdbb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d66b494bbd0d92b641cdbb/6-7dm7B-JxcoS1QlCPdMN.jpeg","isPro":false,"fullname":"Andres Marafioti","user":"andito","type":"user"},"name":"AndrΓ©s Marafioti","status":"claimed_verified","statusLastChangedAt":"2025-04-08T08:46:33.366Z","hidden":false},{"_id":"67f4cf5b504263bce1236d88","user":{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","isPro":true,"fullname":"Orr Zohar","user":"orrzohar","type":"user"},"name":"Orr Zohar","status":"claimed_verified","statusLastChangedAt":"2025-04-08T09:11:26.498Z","hidden":false},{"_id":"67f4cf5b504263bce1236d89","user":{"_id":"61ed0ff29539bc0a3bbc89f4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61ed0ff29539bc0a3bbc89f4/iYWK7GParA7Ke5F6q132W.jpeg","isPro":false,"fullname":"Miquel FarrΓ©","user":"mfarre","type":"user"},"name":"Miquel FarrΓ©","status":"claimed_verified","statusLastChangedAt":"2025-04-08T09:11:24.824Z","hidden":false},{"_id":"67f4cf5b504263bce1236d8a","user":{"_id":"60315285d2c57896177ce764","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1613845120202-noauth.png","isPro":false,"fullname":"Merve Noyan","user":"mervenoyan","type":"user"},"name":"Merve Noyan","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:11.673Z","hidden":false},{"_id":"67f4cf5b504263bce1236d8b","user":{"_id":"651e96991b97c9f33d26bde6","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/651e96991b97c9f33d26bde6/-Bqs6qrmz0yCfwtB2e-6q.jpeg","isPro":true,"fullname":"Elie Bakouch","user":"eliebak","type":"user"},"name":"Elie Bakouch","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:19.329Z","hidden":false},{"_id":"67f4cf5b504263bce1236d8c","user":{"_id":"603d25b75f9d390ab190b777","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1617264212503-603d25b75f9d390ab190b777.jpeg","isPro":true,"fullname":"Pedro Cuenca","user":"pcuenq","type":"user"},"name":"Pedro Cuenca","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:25.903Z","hidden":false},{"_id":"67f4cf5b504263bce1236d8d","user":{"_id":"66ba71a4447411b9c0e19d71","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/noauth/4f93ZrYdaKfK3F53IB51x.jpeg","isPro":false,"fullname":"Cyril","user":"cyrilzakka","type":"user"},"name":"Cyril Zakka","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:33.323Z","hidden":false},{"_id":"67f4cf5b504263bce1236d8e","user":{"_id":"61c141342aac764ce1654e43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61c141342aac764ce1654e43/81AwoT5IQ_Xdw0OVw7TKu.jpeg","isPro":false,"fullname":"Loubna Ben Allal","user":"loubnabnl","type":"user"},"name":"Loubna Ben 
Allal","status":"claimed_verified","statusLastChangedAt":"2025-04-08T09:11:22.957Z","hidden":false},{"_id":"67f4cf5b504263bce1236d8f","user":{"_id":"602e6dee60e3dd96631c906e","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1613655355830-noauth.png","isPro":false,"fullname":"Anton Lozhkov","user":"anton-l","type":"user"},"name":"Anton Lozhkov","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:39.905Z","hidden":false},{"_id":"67f4cf5b504263bce1236d90","user":{"_id":"5ff8c9f4b2035d9a81a859f7","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1652134289581-5ff8c9f4b2035d9a81a859f7.jpeg","isPro":true,"fullname":"Nouamane Tazi","user":"nouamanetazi","type":"user"},"name":"Nouamane Tazi","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:46.326Z","hidden":false},{"_id":"67f4cf5b504263bce1236d91","user":{"_id":"61b85ce86eb1f2c5e6233736","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1655385361868-61b85ce86eb1f2c5e6233736.jpeg","isPro":true,"fullname":"Vaibhav Srivastav","user":"reach-vb","type":"user"},"name":"Vaibhav Srivastav","status":"claimed_verified","statusLastChangedAt":"2025-04-08T10:28:41.404Z","hidden":false},{"_id":"67f4cf5b504263bce1236d92","user":{"_id":"61b253b7ac5ecaae3d1efe0c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61b253b7ac5ecaae3d1efe0c/hwiQ0uvz3t-L5a-NtBIO6.png","isPro":false,"fullname":"Joshua","user":"Xenova","type":"user"},"name":"Joshua Lochner","status":"claimed_verified","statusLastChangedAt":"2025-04-08T10:28:45.098Z","hidden":false},{"_id":"67f4cf5b504263bce1236d93","user":{"_id":"641cc77c92cd25302998b740","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/641cc77c92cd25302998b740/5A81W5s3ecLaLXFir52Rw.jpeg","isPro":false,"fullname":"Hugo Larcher","user":"hlarcher","type":"user"},"name":"Hugo Larcher","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:52.171Z","hidden":false},{"_id":"67f4cf5b504263bce1236d94","user":{"_id":"664d7d1e4f54c9372970e121","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/664d7d1e4f54c9372970e121/TEWsq1_zSBBeuHLu0HlDc.png","isPro":false,"fullname":"Mathieu Morlon","user":"glutamatt","type":"user"},"name":"Mathieu Morlon","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:29:57.811Z","hidden":false},{"_id":"67f4cf5b504263bce1236d95","user":{"_id":"5f0c746619cb630495b814fd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1594651707950-noauth.jpeg","isPro":true,"fullname":"Lewis Tunstall","user":"lewtun","type":"user"},"name":"Lewis Tunstall","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:30:03.679Z","hidden":false},{"_id":"67f4cf5b504263bce1236d96","user":{"_id":"5e48005437cb5b49818287a5","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/5e48005437cb5b49818287a5/4uCXGGui-9QifAT4qelxU.png","isPro":false,"fullname":"Leandro von Werra","user":"lvwerra","type":"user"},"name":"Leandro von Werra","status":"claimed_verified","statusLastChangedAt":"2025-04-08T10:28:43.360Z","hidden":false},{"_id":"67f4cf5b504263bce1236d97","user":{"_id":"5df7e9e5da6d0311fd3d53f9","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1583857746553-5df7e9e5da6d0311fd3d53f9.jpeg","isPro":true,"fullname":"Thomas Wolf","user":"thomwolf","type":"user"},"name":"Thomas 
Wolf","status":"admin_assigned","statusLastChangedAt":"2025-04-08T10:30:10.389Z","hidden":false}],"publishedAt":"2025-04-07T17:58:57.000Z","submittedOnDailyAt":"2025-04-08T06:10:52.675Z","title":"SmolVLM: Redefining small and efficient multimodal models","submittedOnDailyBy":{"_id":"65d66b494bbd0d92b641cdbb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d66b494bbd0d92b641cdbb/6-7dm7B-JxcoS1QlCPdMN.jpeg","isPro":false,"fullname":"Andres Marafioti","user":"andito","type":"user"},"summary":"Large Vision-Language Models (VLMs) deliver exceptional performance but\nrequire significant computational resources, limiting their deployment on\nmobile and edge devices. Smaller VLMs typically mirror design choices of larger\nmodels, such as extensive image tokenization, leading to inefficient GPU memory\nusage and constrained practicality for on-device applications.\n We introduce SmolVLM, a series of compact multimodal models specifically\nengineered for resource-efficient inference. We systematically explore\narchitectural configurations, tokenization strategies, and data curation\noptimized for low computational overhead. Through this, we identify key design\nchoices that yield substantial performance gains on image and video tasks with\nminimal memory footprints.\n Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during\ninference and outperforms the 300-times larger Idefics-80B model, despite an\n18-month development gap. Our largest model, at 2.2B parameters, rivals\nstate-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend\nbeyond static images, demonstrating robust video comprehension capabilities.\n Our results emphasize that strategic architectural optimizations, aggressive\nyet efficient tokenization, and carefully curated training data significantly\nenhance multimodal performance, facilitating practical, energy-efficient\ndeployments at significantly smaller scales.","upvotes":200,"discussionId":"67f4cf5d504263bce1236dda","projectPage":"https://huggingface.co/collections/HuggingFaceTB/smolvlm2-smallest-video-lm-ever-67ab6b5e84bf8aaa60cb17c7","githubRepo":"https://github.com/huggingface/smollm","ai_summary":"SmolVLM, a series of compact multimodal models, achieves high performance with minimal GPU memory usage, making efficient deployment on mobile and edge devices possible.","ai_keywords":["Large Vision-Language Models","VLMs","SmolVLM","multimodal models","image tokenization","GPU memory","inference","architectural configurations","tokenization strategies","data curation","video comprehension","Idefics-80B","parameter-efficient inference"],"organization":{"_id":"5e67bd5b1009063689407478","name":"huggingface","fullname":"Hugging Face","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1583856921041-5dd96eb166059660ed1ee413.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"648c9605565e3a44f3c9bb7b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/648c9605565e3a44f3c9bb7b/W5chvk17Zol6-2QSWkFVR.jpeg","isPro":true,"fullname":"Orr Zohar","user":"orrzohar","type":"user"},{"_id":"61ed0ff29539bc0a3bbc89f4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61ed0ff29539bc0a3bbc89f4/iYWK7GParA7Ke5F6q132W.jpeg","isPro":false,"fullname":"Miquel FarrΓ©","user":"mfarre","type":"user"},{"_id":"64b74920fe6a108d03fed767","avatarUrl":"/avatars/a2c05b809c36fa5fab8e1a43b3e67051.svg","isPro":false,"fullname":"Minki 
Kang","user":"Nardien","type":"user"},{"_id":"63c6782e83ce71db8eda40fb","avatarUrl":"/avatars/f22f0722c3dbcd8c273117062a656301.svg","isPro":true,"fullname":"Mohammed Mohammed Ali","user":"MohammedEltoum","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"},{"_id":"65d66b494bbd0d92b641cdbb","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d66b494bbd0d92b641cdbb/6-7dm7B-JxcoS1QlCPdMN.jpeg","isPro":false,"fullname":"Andres Marafioti","user":"andito","type":"user"},{"_id":"65703fab7f50602340d23704","avatarUrl":"/avatars/324c45f5fba9cd8c38a89b30427c06b4.svg","isPro":false,"fullname":"Xiaohan Wang","user":"nicholswang","type":"user"},{"_id":"5f0c746619cb630495b814fd","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1594651707950-noauth.jpeg","isPro":true,"fullname":"Lewis Tunstall","user":"lewtun","type":"user"},{"_id":"61c141342aac764ce1654e43","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/61c141342aac764ce1654e43/81AwoT5IQ_Xdw0OVw7TKu.jpeg","isPro":false,"fullname":"Loubna Ben Allal","user":"loubnabnl","type":"user"},{"_id":"6621a958ae801a8e82ce36c5","avatarUrl":"/avatars/eff54c14f0ce2800e9714636e4c92ec6.svg","isPro":false,"fullname":"Omns TTBS","user":"SLBBS","type":"user"},{"_id":"648eb1eb59c4e5c87dc116e0","avatarUrl":"/avatars/c636cea39c2c0937f01398c94ead5dad.svg","isPro":false,"fullname":"fdsqefsgergd","user":"T-representer","type":"user"},{"_id":"62e0626a1b0ece20b8aaf2b8","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/62e0626a1b0ece20b8aaf2b8/TFFSkfqIcrHgoufx8HGya.png","isPro":false,"fullname":"neuralink","user":"neuralink","type":"user"}],"acceptLanguages":["*"],"dailyPaperRank":1,"organization":{"_id":"5e67bd5b1009063689407478","name":"huggingface","fullname":"Hugging Face","avatar":"https://cdn-uploads.huggingface.co/production/uploads/1583856921041-5dd96eb166059660ed1ee413.png"}}">

SmolVLM: Redefining small and efficient multimodal models

Abstract

AI-generated summary

SmolVLM, a series of compact multimodal models, achieves high performance with minimal GPU memory usage, making efficient deployment on mobile and edge devices possible.

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.

Community

Paper author Paper submitter β€’ edited Apr 8

SmolVLM2 video collection: https://huggingface.co/collections/HuggingFaceTB/smolvlm2-smallest-video-lm-ever-67ab6b5e84bf8aaa60cb17c7
Tiny Image collection: https://huggingface.co/collections/HuggingFaceTB/smolvlm-256m-and-500m-6791fafc5bb0ab8acc960fb0
Codebase: https://github.com/huggingface/smollm

Hello

Based on your findings in Section 3.4 that "excessive CoT data harms compact model performance", would you expect the same effect when CoT reasoning is learned through RL in compact VLMs? Or is this relationship not obviously transferable and would require separate testing?

Β·
Paper author

We did not try RL-style post-training, so it is hard to draw a definitive conclusion. However, larger LLMs (e.g., S1) can use SFT on CoT and succeed in 'distilling' such reasoning from CoT data.

My intuition is that these small models do not have that "emergent" property to learn to reason -- which is why CoT distillation was not helpful.

Β·

Great answer, much appreciated! I was reading through Section 3.1 about "Learned Tokens" and noticed you mentioned they work better than raw text tokens. Could you explain how these learned tokens actually work, or point me to a specific paper where I can learn more about this? I'd like to understand the mechanics behind them.

Β·
Paper author

We add <row_{}_col_{}> tokens to the tokenizer, so they get unique tokens/embeddings that are optimized during training. This differs from using the native text tokenizer, which would split <row_{}_col_{}> into <, row, _, {}, _, col, _, {}, >. Instead of treating these as non-learnable string representations, we learn new tokens that represent the spatial position of each image patch.
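To make the mechanism concrete, here is a minimal sketch of that idea using the transformers library; the backbone checkpoint, the 4x4 grid size, and the exact token naming are illustrative assumptions rather than the exact SmolVLM training setup.

```python
# Minimal sketch (not the SmolVLM training code) of "learned tokens":
# register <row_{r}_col_{c}> strings as dedicated special tokens so each
# grid position gets its own trainable embedding instead of being split
# into generic sub-word pieces.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M"  # illustrative backbone, not necessarily the one used
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(tokenizer.tokenize("<row_1_col_1>"))  # native tokenizer: split into several sub-tokens

# One dedicated token per (row, col) position of a 4x4 sub-image grid (grid size is illustrative).
grid_tokens = [f"<row_{r}_col_{c}>" for r in range(1, 5) for c in range(1, 5)]
tokenizer.add_special_tokens({"additional_special_tokens": grid_tokens})

# Grow the embedding matrix so the new tokens get their own vectors,
# which are then optimized during training like any other embedding row.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("<row_1_col_1>"))  # now a single learned token
```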

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model (2025): https://huggingface.co/papers/2503.21782
* Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (2025): https://huggingface.co/papers/2502.17092
* SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding (2025): https://huggingface.co/papers/2503.18943
* Small Vision-Language Models: A Survey on Compact Architectures and Techniques (2025): https://huggingface.co/papers/2503.10665
* FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression (2025): https://huggingface.co/papers/2502.18512
* Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation (2025): https://huggingface.co/papers/2502.13145
* Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection (2025): https://huggingface.co/papers/2503.11794

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

An audio overview (~17 minutes): https://youtu.be/FWkAxDz0-4g
By the way, the team's demo video is very impressive!

Unfortunately, SmolVLM-256M-Instruct, with roughly 4x fewer parameters than InternVL-1B, is still slower than InternVL2-1B (even with Flash Attention and bf16 enabled).

It seems that fewer input tokens and a smaller hidden dimension did not translate into faster inference. Could anyone offer suggestions on this issue?
See the GitHub issue: https://github.com/huggingface/transformers/issues/41947
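For anyone reproducing this kind of comparison, below is a rough latency-check sketch. It assumes a CUDA GPU, the flash-attn package installed, and a local test image named example.jpg; it follows the standard transformers usage for SmolVLM rather than any official benchmark script.

```python
# Rough single-request latency check (a sketch, not a rigorous benchmark):
# load SmolVLM-256M-Instruct in bf16 with FlashAttention-2 and time generation.
import time
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; fall back to "eager" otherwise
).to("cuda")

image = Image.open("example.jpg")  # any local test image
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize()

print(f"latency: {time.perf_counter() - start:.2f}s")
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```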


Models citing this paper 27


Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2504.05299 in a dataset README.md to link it from this page.

Spaces citing this paper 107

Collections including this paper 28