SmolDocling: An Ultra-Compact VLM for Document Understanding

SmolDocling looks like a significant step forward for compact document understanding models. This 256M-parameter vision-language model is designed to process documents efficiently while maintaining strong performance across a range of document understanding tasks. Developed by researchers from IBM Research and HuggingFace, it bridges the gap between large, resource-intensive models and more specialized ensemble approaches. I also like the name because it sounds like “Smol Duckling” and it’s nice to get a cute model name every once in a while.

Ahem…

Architecture and Design

“Figure 1: SmolDocling/SmolVLM architecture. SmolDocling converts images of document pages to DocTags sequences. First, input images are encoded using a vision encoder and reshaped via projection and pooling. Then, the projected embeddings are concatenated with the text embeddings of the user prompt, possibly with interleaving. Finally, the sequence is used by an LLM to autoregressively predict the DocTags sequence.”
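
To make the flow in that caption concrete, here is a minimal toy sketch in PyTorch. To be clear, this is not the actual SmolDocling code: the dimensions, the linear stand-ins for SigLIP and SmolLM-2, and the non-causal decoder block are all illustrative assumptions. It only mirrors the encode → pool/project → concatenate-with-prompt → predict-next-token sequence described above.

```python
import torch
import torch.nn as nn

class ToyDocVLM(nn.Module):
    """Toy stand-in for the SmolDocling/SmolVLM pipeline (illustrative only)."""

    def __init__(self, vision_dim=768, lm_dim=576, vocab_size=50_000, pool=4):
        super().__init__()
        # Stand-in for the SigLIP vision encoder: flattened patches -> patch embeddings.
        self.vision_encoder = nn.Linear(16 * 16 * 3, vision_dim)
        # Pooling: merge `pool` neighbouring patch embeddings into one wider vector,
        # then project into the language model's embedding space.
        self.pool = pool
        self.projector = nn.Linear(vision_dim * pool, lm_dim)
        # Stand-in for the SmolLM-2 backbone (the real model uses a causal decoder;
        # a plain transformer block stands in here).
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, patches, prompt_ids):
        # 1. Encode image patches.
        v = self.vision_encoder(patches)                  # (B, N, vision_dim)
        # 2. Reshape/pool, then project into the LM embedding space.
        b, n, d = v.shape
        v = v.reshape(b, n // self.pool, d * self.pool)
        v = self.projector(v)                             # (B, N/pool, lm_dim)
        # 3. Concatenate visual tokens with the prompt's text embeddings.
        t = self.text_embed(prompt_ids)                   # (B, T, lm_dim)
        seq = torch.cat([v, t], dim=1)
        # 4. Predict the next DocTags token (one autoregressive step shown).
        h = self.decoder(seq)
        return self.lm_head(h[:, -1])                     # (B, vocab_size)

model = ToyDocVLM()
patches = torch.randn(1, 64, 16 * 16 * 3)      # 64 dummy 16x16 RGB patches
prompt_ids = torch.randint(0, 50_000, (1, 8))  # dummy prompt tokens
print(model(patches, prompt_ids).shape)         # torch.Size([1, 50000])
```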

SmolDocling is built on the SmolVLM architecture, specifically the SmolVLM-256M variant. It pairs a SigLIP base patch-16/512 visual backbone (93M parameters) with a lightweight language backbone from the SmolLM-2 family (135M parameters). This makes it 5 to 10 times smaller in parameter count than comparable vision-language models, and up to 27 times smaller than some models it outperforms.
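
For a sense of how you would actually run it, inference through Hugging Face transformers looks roughly like the snippet below. Treat it as a hedged sketch based on the public preview release: the checkpoint name `ds4sd/SmolDocling-256M-preview`, the “Convert this page to docling.” prompt, and the decoding details are assumptions on my part, so double-check the model card before copying.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "ds4sd/SmolDocling-256M-preview"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# A single document page image to convert to DocTags.
image = Image.open("page.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},  # assumed prompt
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens; keep the DocTags markup intact.
doctags = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)[0]
print(doctags)
```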

