SmolDocling seems to be a significant advance in compact document understanding. This 256M-parameter vision-language model is designed for efficient document processing while maintaining high performance across a range of tasks. Developed by researchers from IBM Research and HuggingFace, it bridges the gap between large, resource-intensive models and more specialized ensemble approaches. I also like the name because it sounds like “Smol Duckling” and it’s nice to get a cute model name every once in a while.
Ahem…
Architecture and Design
SmolDocling is built on the SmolVLM architecture, specifically the SmolVLM-256M variant. It pairs a SigLIP base-patch16/512 visual backbone (93M parameters) with a lightweight language backbone from the SmolLM-2 family (135M parameters). That makes it 5 to 10 times smaller than comparable vision-language models, and up to 27 times smaller than some of the models it outperforms.
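To make the architecture concrete, here is a minimal sketch of how a SmolVLM-style checkpoint like this one is typically loaded and prompted through the transformers library. The Hub ID, the page.png path, and the prompt string are illustrative assumptions on my part; check the official model card for the exact identifiers and recommended prompts.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed Hub ID -- verify against the official model card.
MODEL_ID = "ds4sd/SmolDocling-256M-preview"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

image = Image.open("page.png")  # a rendered document page (hypothetical path)

# Pair the page image with a text instruction in chat format.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to markup."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# At 256M parameters, generation is feasible on CPU or a modest GPU.
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Note how a single processor call handles both modalities: the SigLIP backbone turns the page image into visual tokens, and the SmolLM-2 decoder consumes them alongside the text prompt.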