Recent advancements in multimodal models like
By analyzing web-crawled editing examples, the team categorized image editing into 11 distinct types. This taxonomy guided the creation of a data pipeline that generated over 20 million instruction-image triplets. After filtering with both multimodal LLMs and human annotators, the final dataset contained more than 1 million high-quality examples.