Technical reports:

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| Web demo | Public AI Switzerland | Allen AI Playground | ? |
| Hugging Face | https://huggingface.co/swiss-ai/Apertus-70B-2509 | https://huggingface.co/allenai/Olmo-3-32B-Think | https://huggingface.co/utter-project/EuroLLM-22B-Instruct-2512 |

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| costs | 6 million GPU hours; the model can realistically be trained in approximately 90 days on 4096 GPUs | 56 days on a cluster with 1024 H100 GPUs, approx. 1.4 million GPU hours | 1.7 GPU hours credit |
| codebase | fork of Megatron-LM | OLMo-core | fork of Megatron-LM |
| location | Alps Research Infrastructure at the Swiss National Supercomputing Centre (CSCS) | ? | EuroHPC centres |
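
The GPU-hour figures in the costs row can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming every GPU runs 24 hours a day for the whole period; `gpu_hours` is a hypothetical helper, not something taken from any of the reports.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU hours for a run, assuming all GPUs are busy 24 hours a day (an assumption)."""
    return num_gpus * days * 24


# Olmo 3 32B Think: 1024 H100 GPUs for 56 days
olmo_hours = gpu_hours(1024, 56)

# Apertus 70B: 4096 GPUs for approximately 90 days
apertus_hours = gpu_hours(4096, 90)

print(f"Olmo 3:  {olmo_hours:,.0f} GPU hours")
print(f"Apertus: {apertus_hours:,.0f} GPU hours")
```

Under this assumption the Olmo 3 product is about 1.4 million GPU hours, which matches the figure in the table. The Apertus product comes out at about 8.8 million, above the 6 million quoted, so the 90-day estimate presumably covers more than pure compute time; the table does not say what.
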

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| Pre-training | 15 trillion tokens; the most recent crawling permissions specified by data owners are applied retroactively; FineWeb-HQ, FineWeb-Edu, DCLM-Edu, FineWeb-2 (1811 languages), FineWeb-2-HQ (20 high-resource languages), Translation Parallel Data, Clean Wikipedia, StarCoderData, StarCoder Edu, CommonPile/Stack v2 Edu, FineMath, MegaMath, EuroBlocks-SFT-Synthetic-1124, Flan filtered for licenses allowing commercial use; pipelines for creating each data part are available on GitHub | The Dolma 3 Mix, 6 trillion tokens: web content, academic publications, code, and more, mostly from CommonCrawl; the Dolma Python library is available for replicating the data creation process | 4 trillion tokens; EuroWeb: FineWeb-edu, RedPajama-Data-v2, HPLT, MADLAD-400, CulturaX, mC4; a lot of parallel data (e.g. Europarl, OPUS100, WikiTitles, …) |
| Post-training | instruction datasets from OLMo2, Tulu3, WildChat, SciRiff, OpenMath GSM8K, Llama-Nemotron mathematical reasoning and code data; function-calling datasets from xlam, Glaive, APIGen; SmolTalk2 conversational data; EuroBlocks synthetic multilingual instructions; s1k (42 languages); Tome; local Swiss and low-resource instructions; Chat format | Dolci Think SFT, Dolci Think DPO, Dolci Think RL; Dolci is sourced from the Olmo2 suite with additions (e.g. ZebraLogic, Simple QA) and includes AceCoder, The Algorithms, Llama Nemotron, OpenCodeReasoning, Tülu 3, WildChat, TableGPT, Aya | EuroBlocks (sourced from public data sets), Hermes-3, Tülu 3, Nemotron V1-2 |
| Focus | Compliance | Source data | Cleaning |
| Multilinguality | Strong | Weak | Medium |

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| what | v3 tekken tokenizer from Mistral-Nemo-Base-2407 | OpenAI’s cl100k | BPE-based |
| why | wins on intrinsic evaluation, smaller vocab | ?, same as Olmo2 | broad coverage of European and global languages, same as in previous versions |
| vocab size | 131,072, 47 custom tokens for math | not specified but could be 100k, special tokens only in SFT | 128,000 |

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| sequence length | initially 4096 then doubled | 8192 with sliding window attention (SWA) 4096 | 32,768 |
| number of layers | 80 | 64 | 54 |
| embedding size | ? | ? | 6,144 |
| hidden size | 8,192 | 5,120 | 16,384 |
| positional encodings | RoPE | RoPE | RoPE |
| batch size | initially 2048 then doubled | 1,024 | ? |
| activation | xIELU | SwiGLU | SwiGLU |
| loss function | Goldfish, prevents memorising | Z-Loss (?) | standard cross-entropy objective (?) |
| optimisation | AdEMAMix - scales more favourably with model size | AdamW (?) | ? |
| long context | 65,536 | 65,536 | 32,768 |

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| format and chat template | Apertus Chat Format Specification | OpenAPI specification | axolotl |
| SFT data size | 3.8 million examples | 2.3 million examples | 10.6 million multilingual examples |
| alignment goals | helpfulness, honesty, safety, and refusal | chat, math, coding, safety, science, multilingual | axolotl |
| alignment method | Quantile Reward Policy Optimization algorithm (QRPO) | Direct Preference Optimisation (DPO) | axolotl |
| alignment data size (number of prompts) | ? | 200,000 | ? |

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| MMLU | 65.2; 67.9 as reported by EuroLLM | 86.4; 80.1 for OLMo-3.1-32B as reported by EuroLLM | 69.8 |
| ARC | 70.6; 84.7 as reported by EuroLLM | ?; 93.6 for OLMo-3.1-32B as reported by EuroLLM | 89.8 |