MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jarvis Guo Tuney Zheng Yuelin Bai Bo Li Yubo Wang King Zhu Yizhi Li Graham Neubig Wenhu Chen Xiang Yue
Abstract
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method for constructing a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs that cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model achieves notable improvements of up to 4% on non-reasoning benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.