SmolLM3: smol, multilingual, long-context reasoner
Tags: omnivore, paper, transformers, good-read!
Highlights
Grouped Query Attention (GQA): We replaced multi-head attention with grouped-query attention using 4 groups. Our ablations on a 3B model trained with 100B tokens from FineWeb-Edu showed that GQA matches the performance of multi-head attention while significantly reducing the KV cache size during inference. ⤴️
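To make the KV-cache saving concrete, here is a rough back-of-the-envelope sketch. Only the 4 KV groups come from the text; the layer count, head count, head dimension, and sequence length below are placeholders, not SmolLM3's actual configuration.

```python
# Minimal sketch: KV cache size for multi-head attention vs. grouped-query attention.
# All sizes here are illustrative placeholders, not SmolLM3's actual configuration.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; one cache entry per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

n_layers, n_q_heads, head_dim = 36, 16, 128   # placeholder values
seq_len, batch = 65_536, 1

mha = kv_cache_bytes(n_layers, n_q_heads, head_dim, seq_len, batch)  # each query head has its own K/V
gqa = kv_cache_bytes(n_layers, 4, head_dim, seq_len, batch)          # 4 KV groups shared across query heads

print(f"MHA KV cache: {mha / 2**30:.1f} GiB")
print(f"GQA (4 groups): {gqa / 2**30:.1f} GiB ({n_q_heads // 4}x smaller)")
```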
NoPE: We implemented NoPE from “RoPE to NoPE and Back Again: A New Hybrid Attention Strategy” (Yang et al., 2025), selectively removing rotary position embeddings from every 4th layer. This approach improves long context performance without affecting short context capabilities, as confirmed by our ablations. ⤴️
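A minimal sketch of the “every 4th layer” pattern, assuming a simple 1-based counting convention (the actual implementation may index layers differently):

```python
# Sketch of the NoPE pattern: rotary position embeddings are skipped in one out
# of every four decoder layers. The indexing convention is an assumption.

num_layers = 36  # placeholder depth

use_rope = [(layer_idx + 1) % 4 != 0 for layer_idx in range(num_layers)]
# Layers 3, 7, 11, ... (0-indexed) use no positional encoding at all;
# the remaining layers keep standard RoPE.
print([i for i, keeps_rope in enumerate(use_rope) if not keeps_rope])
```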
Intra-Document Masking: During training, we use attention masking to ensure tokens from different documents in the same training sequence don’t attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance. ⤴️
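A small sketch of what intra-document masking means for a packed training sequence, using toy document boundaries:

```python
import numpy as np

# Sketch of intra-document masking: a token may only attend to earlier tokens
# from the *same* document within the packed sequence. The document-id layout
# below is a toy example.

doc_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])        # which document each token came from
same_doc = doc_ids[:, None] == doc_ids[None, :]     # block-diagonal "same document" mask
causal = np.tril(np.ones((len(doc_ids), len(doc_ids)), dtype=bool))
attention_mask = same_doc & causal                  # causal attention, restricted per document
print(attention_mask.astype(int))
```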
Training Stability: Following OLMo 2, we remove weight decay from embedding layers to improve training stability. This modification contributed to more stable training dynamics, with embedding norms naturally stabilizing at healthier values during training without impacting overall performance in our ablations. ⤴️
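A minimal sketch of the idea, assuming a PyTorch-style optimizer with separate parameter groups (the model and hyperparameters are placeholders):

```python
import torch
from torch import nn

# Sketch of "no weight decay on embeddings": put embedding parameters in their
# own optimizer group with weight_decay=0. Everything else keeps the usual decay.

model = nn.ModuleDict({
    "embed_tokens": nn.Embedding(50_000, 512),
    "mlp": nn.Linear(512, 512),
})

decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if "embed" in name else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.1},    # placeholder weight decay
        {"params": no_decay, "weight_decay": 0.0}, # embeddings: no decay
    ],
    lr=2e-4,  # placeholder learning rate
)
```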
WSD (Warmup-Stable-Decay) scheduler ⤴️
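A minimal sketch of what a warmup-stable-decay schedule looks like; the step counts and the linear decay shape are assumptions, since implementations differ in how the final decay is shaped:

```python
# Sketch of a WSD learning-rate schedule: linear warmup, a long constant
# plateau, then a decay to zero over the last portion of training.

def wsd_lr(step, max_lr, warmup_steps, total_steps, decay_steps):
    if step < warmup_steps:                      # warmup
        return max_lr * step / warmup_steps
    if step < total_steps - decay_steps:         # stable plateau
        return max_lr
    remaining = total_steps - step               # final decay
    return max_lr * max(remaining, 0) / decay_steps

for s in (0, 1_000, 50_000, 99_000, 100_000):
    print(s, wsd_lr(s, max_lr=2e-4, warmup_steps=2_000, total_steps=100_000, decay_steps=10_000))
```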
We used the nanotron framework for training, datatrove for data processing, and lighteval for evaluation. ⤴️
Stage 1: Stable phase (0T → 8T tokens) This foundation stage establishes strong general capabilities with our core dataset mixture: ⤴️
Stage 2: Stable phase (8T → 10T tokens) We introduce higher quality math and code datasets while maintaining good web coverage: ⤴️
Stage 3: Decay Phase (10T → 11.1T tokens) The final stage further upsamples math and code data ⤴️
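Putting the three stages above side by side as a schedule: the token budgets come from the text, while the mixture notes stay deliberately qualitative because the actual per-stage sampling proportions are not reproduced in these highlights.

```python
# Sketch of the three-stage pre-training schedule described above.
# Token budgets are from the text; mixture notes are qualitative only.

pretraining_stages = [
    {"stage": 1, "phase": "stable", "tokens": "0T -> 8T",
     "mixture": "core dataset mixture"},
    {"stage": 2, "phase": "stable", "tokens": "8T -> 10T",
     "mixture": "higher-quality math and code introduced, web coverage maintained"},
    {"stage": 3, "phase": "decay", "tokens": "10T -> 11.1T",
     "mixture": "math and code further upsampled"},
]

for stage in pretraining_stages:
    print(stage)
```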
We call the long context adaptation and reasoning adaptation “mid-training”. ⤴️
We trained SmolLM3 on an additional 100B tokens to extend its context length. We sequentially extended the context window in two stages of 50B tokens each: first transitioning from 4k to 32k context with RoPE theta increased to 1.5M, then from 32k to 64k context with RoPE theta increased to 5M. ⤴️
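The two-stage extension can be summarized as a simple schedule; the dictionary keys below are illustrative, not the actual nanotron config fields:

```python
# Sketch of the two-stage long-context extension described above.
# 50B tokens per stage, context and rope_theta values are from the text.

long_context_stages = [
    {"tokens": "50B", "max_seq_len": 32_768, "rope_theta": 1_500_000},  # 4k -> 32k
    {"tokens": "50B", "max_seq_len": 65_536, "rope_theta": 5_000_000},  # 32k -> 64k
]

for stage in long_context_stages:
    print(stage)
```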
Following Qwen2.5, we use YARN to extrapolate beyond the training context length. During inference, the model can handle up to 128k context (2x extension beyond the 64k training length). ⤴️
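A hedged sketch of enabling YaRN-style scaling at load time with the transformers library; the checkpoint name and the exact `rope_scaling` keys are assumptions about how the released model is consumed, not a quote from the source:

```python
from transformers import AutoModelForCausalLM

# Sketch: YaRN RoPE scaling with factor 2.0 to extrapolate from the 64k
# training context to ~128k at inference. Keys and checkpoint name are assumed.

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",  # assumed checkpoint name
    rope_scaling={
        "rope_type": "yarn",
        "factor": 2.0,                               # 64k -> 128k
        "original_max_position_embeddings": 65_536,
    },
)
```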
The main difference between the mid-training stage and the pre- and post-training stages is that we targeted a general capability without yet focusing on a specific domain. ⤴️
dual instruction models that support both reasoning and non-reasoning modes. ⤴️
balance performance between reasoning and non-reasoning modes through a carefully designed training pipeline that includes mid-training for general reasoning capabilities, supervised fine-tuning with synthetic data generation, and alignment using Anchored Preference Optimization (APO) - a recent variant of DPO. ⤴️
In non-reasoning mode, we pre-fill the model’s response with empty think blocks, similar to Qwen3, to ensure direct answers without explicit reasoning. ⤴️
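A sketch of the empty-think-block prefill; the chat-format tokens below are placeholders, not SmolLM3's actual chat template:

```python
# Sketch of non-reasoning mode: the assistant turn is started with an
# already-closed <think> block, so the model skips the reasoning trace and
# answers directly. Chat-format tokens here are placeholders.

user_message = "What is the capital of France?"

prompt = (
    "<|user|>\n" + user_message + "\n"  # placeholder chat-format tokens
    "<|assistant|>\n"
    "<think>\n\n</think>\n"             # pre-filled empty reasoning block
)
# Generation continues after the closed think block, producing a direct answer.
print(prompt)
```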
scarcity of datasets containing reasoning traces for certain domains. To address this gap, we generated synthetic data by prompting Qwen3-32B in reasoning mode with prompts from existing non-reasoning datasets. This allowed us to improve performance in domains where the model initially struggled in reasoning mode, such as multi-turn conversations, multilinguality, and everyday conversations. ⤴️
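A rough sketch of that generation setup, assuming the transformers text-generation pipeline; the prompts, sampling settings, and post-processing are placeholders rather than the authors' exact pipeline:

```python
from transformers import pipeline

# Sketch of synthetic reasoning-trace generation: take prompts from an existing
# non-reasoning dataset and re-answer them with Qwen3-32B in reasoning mode.
# Prompts and generation settings are placeholders.

generator = pipeline("text-generation", model="Qwen/Qwen3-32B")

prompts = ["Summarize the plot of 'Le Petit Prince' in two sentences."]  # placeholder prompts

for prompt in prompts:
    messages = [{"role": "user", "content": prompt}]
    # Qwen3's chat template enables thinking by default, so the reasoning trace
    # appears inside <think>...</think> in the generated reply.
    out = generator(messages, max_new_tokens=1024)
    print(out[0]["generated_text"][-1]["content"])
```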
Anchored Preference Optimization (APO) is a variant of Direct Preference Optimization (DPO) that provides a more stable optimization objective. ⤴️
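As a sketch of how APO differs from DPO, here is the APO-zero form used in common implementations such as TRL; treat the exact objective as an assumption rather than SmolLM3's verbatim recipe:

```python
import torch
import torch.nn.functional as F

# rho_w / rho_l are the log-probability ratios log pi_theta(y|x) - log pi_ref(y|x)
# for the chosen and rejected completions; beta is the usual temperature.

def dpo_loss(rho_w, rho_l, beta=0.1):
    # DPO only constrains the *difference* between chosen and rejected ratios.
    return -F.logsigmoid(beta * (rho_w - rho_l))

def apo_zero_loss(rho_w, rho_l, beta=0.1):
    # APO-zero anchors each term separately: push the chosen ratio up and the
    # rejected ratio down, rather than only widening their gap.
    return (1 - torch.sigmoid(beta * rho_w)) + torch.sigmoid(beta * rho_l)

rho_w = torch.tensor([0.5])   # toy log-ratio for the chosen completion
rho_l = torch.tensor([-0.2])  # toy log-ratio for the rejected completion
print(dpo_loss(rho_w, rho_l), apo_zero_loss(rho_w, rho_l))
```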
The focus on reasoning capabilities impacted long context performance. ⤴️
Model merging is a popular and powerful technique that allows combining the strengths of different models without the computational overhead of ensembling or the need for additional training. We used the MergeKit library to perform the model merging, as it includes several merging methods, including linear and non-linear merging. ⤴️
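As a toy illustration of the simplest case, a linear merge is just a weighted average of two checkpoints' parameters; the sketch below shows the idea only and is not MergeKit's API or configuration format:

```python
import torch

# Toy sketch of linear model merging: a weighted average of two state dicts
# with identical parameter names and shapes.

def linear_merge(state_dict_a, state_dict_b, weight_a=0.5):
    return {
        name: weight_a * state_dict_a[name] + (1 - weight_a) * state_dict_b[name]
        for name in state_dict_a
    }

# Example with two tiny "models" sharing the same parameter names/shapes.
a = {"w": torch.ones(2, 2), "b": torch.zeros(2)}
b = {"w": torch.zeros(2, 2), "b": torch.ones(2)}
print(linear_merge(a, b, weight_a=0.9))
```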