
15. Fully open LLMs Apertus 70B, Olmo 3 32B Think and EuroLLM 22B (summary comparison)

Technical reports:

 

User interface

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| Web demo | Public AI Switzerland | Allen AI Playground | ? |
| Hugging Face | https://huggingface.co/swiss-ai/Apertus-70B-2509 | https://huggingface.co/allenai/Olmo-3-32B-Think | https://huggingface.co/utter-project/EuroLLM-22B-Instruct-2512 |

 

Training infrastructure

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| Costs | 6 million GPU hours; the model can realistically be trained in approximately 90 days on 4,096 GPUs | 56 days on a cluster with 1,024 H100 GPUs, approx. 1.4 million GPU hours | 1.7 GPU hours credit |
| Codebase | fork of Megatron-LM | OLMo-core | fork of Megatron-LM |
| Location | Alps Research Infrastructure at the Swiss National Supercomputing Centre (CSCS) | ? | EuroHPC centres |
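
The cost figures above can be sanity-checked with simple arithmetic. The sketch below assumes every GPU is busy around the clock for the whole run, which is an idealisation; the function name `gpu_hours` is made up for illustration.

```python
# Back-of-the-envelope estimate: GPU hours = number of GPUs x days x 24.
# Assumption: all GPUs run continuously for the whole period.
HOURS_PER_DAY = 24


def gpu_hours(num_gpus: int, days: float) -> float:
    """Return the GPU hours consumed by num_gpus running for the given days."""
    return num_gpus * days * HOURS_PER_DAY


# Olmo 3 32B Think: 56 days on 1,024 H100 GPUs (figures from the table).
olmo_hours = gpu_hours(1024, 56)

# Apertus 70B: 90 days on 4,096 GPUs, treated here as a full-utilisation ceiling.
apertus_ceiling = gpu_hours(4096, 90)

print(f"Olmo 3: {olmo_hours / 1e6:.2f} million GPU hours")
print(f"Apertus ceiling: {apertus_ceiling / 1e6:.2f} million GPU hours")
```

The Olmo 3 product matches the reported figure of approx. 1.4 million GPU hours. For Apertus, the full-utilisation product is larger than the reported 6 million GPU hours, so the 90-day figure presumably leaves room for downtime and restarts; that reading is an inference, not something stated in the table.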

 

Data

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| Pre-training | 15 trillion tokens; the most recent crawling permissions specified by data owners are applied retroactively; FineWeb-HQ, FineWeb-Edu, DCLM-Edu, FineWeb-2 (1811 languages), FineWeb-2-HQ (20 high-resource languages), Translation Parallel Data, Clean Wikipedia, StarCoderData, StarCoder Edu, CommonPile/Stack v2 Edu, FineMath, MegaMath, EuroBlocks-SFT-Synthetic-1124, Flan filtered for licenses allowing commercial use; pipelines for creating each data part are available on GitHub | The Dolma 3 Mix, 6 trillion tokens: web content, academic publications, code, and more, mostly from CommonCrawl; Dolma Python library available for replicating the data creation process | 4 trillion tokens; EuroWeb: FineWeb-edu, RedPajama-Data-v2, HPLT, MADLAD-400, CulturaX, mC4; a lot of parallel data (e.g. Europarl, OPUS100, WikiTitles, …) |
| Post-training | instruction datasets from OLMo2, Tulu3, WildChat, SciRiff, OpenMath GSM8Km; Llama-Nemotron mathematical reasoning and code data; function-calling datasets from xlam, Glaive, APIGen; SmolTalk2 conversational data; EuroBlocks synthetic multilingual instructions; s1k 42 langs; Tome; local Swiss and low-resource instructions; Chat format | Dolci Think SFT, Dolci Think DPO, Dolci Think RL; Dolci is sourced from the Olmo2 suite with additions (e.g. ZebraLogic, Simple QA) and includes AceCoder, The Algorithms, Llama Nemotron, OpenCodeReasoning, Tülu 3, WildChat, TableGPT, Aya | EuroBlocks (sourced from public data sets), Hermes-3, Tülu 3, Nemotron V1-2 |
| Focus | Compliance | Source data | Cleaning |
| Multilinguality | Strong | Weak | Medium |

 

Tokenisation

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| What | v3 tekken tokenizer from Mistral-Nemo-Base-2407 | OpenAI’s cl100k | BPE-based |
| Why | wins on intrinsic evaluation, smaller vocab | ?, same as Olmo2 | broad coverage of European and global languages, same as in previous versions |
| Vocab size | 131,072, 47 custom tokens for math | not specified but could be 100k, special tokens only in SFT | 128,000 |
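
Where a vocabulary size is unclear (as for Olmo 3 above), it can be checked directly against the published tokenizers. The following is a minimal sketch, assuming the `transformers` library is installed, the three repositories listed in the Hugging Face row are accessible without extra authentication, and the installed version supports each tokenizer. The helper name `vocab_sizes` is made up for illustration.

```python
# Repo IDs copied from the Hugging Face URLs in the user interface table.
MODEL_REPOS = {
    "Apertus 70B": "swiss-ai/Apertus-70B-2509",
    "Olmo 3 32B Think": "allenai/Olmo-3-32B-Think",
    "EuroLLM 22B Instruct": "utter-project/EuroLLM-22B-Instruct-2512",
}


def vocab_sizes(repos: dict[str, str]) -> dict[str, int]:
    """Download each tokenizer and return its size, including added tokens."""
    # Imported lazily so the repo table can be used without transformers.
    from transformers import AutoTokenizer

    sizes = {}
    for name, repo in repos.items():
        tokenizer = AutoTokenizer.from_pretrained(repo)
        # len(tokenizer) counts the base vocabulary plus added special tokens.
        sizes[name] = len(tokenizer)
    return sizes


# Usage (requires network access, downloads only tokenizer files):
# for name, size in vocab_sizes(MODEL_REPOS).items():
#     print(f"{name}: {size:,} tokens")
```

The number returned by the tokenizer can differ from the size of the model's embedding matrix, which is often padded to a convenient multiple, so small mismatches with the figures in the table would not necessarily indicate an error.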

 

Base model architecture choices (pre-training)

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| Sequence length | initially 4,096, then doubled | 8,192, with sliding window attention (SWA) of 4,096 | 32,768 |
| Number of layers | 80 | 64 | 54 |
| Embedding size | ? | ? | 6,144 |
| Hidden size | 8,192 | 5,120 | 16,384 |
| Positional encodings | RoPE | RoPE | RoPE |
| Batch size | initially 2,048, then doubled | 1,024 | ? |
| Activation | xIELU | SwiGLU | SwiGLU |
| Loss function | Goldfish, prevents memorising | Z-Loss (?) | standard cross-entropy objective (?) |
| Optimisation | AdEMAMix, scales more favourably with model size | AdamW (?) | ? |
| Long context | 65,536 | 65,536 | 32,768 |
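
To make the sliding window attention (SWA) entry concrete, here is a minimal sketch of a causal sliding-window mask in plain Python. It assumes the window counts the current token plus the preceding `window - 1` tokens; the exact convention used in Olmo 3, and which layers use SWA, are not given in the table. The function name is made up for illustration.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j], True when query position i may attend to key j.

    A position sees itself and at most window - 1 earlier positions,
    and never any later position (causal constraint).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes for readability; the table lists a window of 4,096 tokens
# inside a sequence of 8,192 tokens for Olmo 3.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row of the printed pattern has at most three marks, which is the point of the technique: attention cost per token is bounded by the window rather than growing with the full sequence length.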

 

Post-training choices

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| Format and chat template | Apertus Chat Format Specification | OpenAPI specification | axolotl |
| SFT data size | 3.8 million examples | 2.3 million examples | 10.6 million multilingual examples |
| Alignment goals | helpfulness, honesty, safety, and refusal | chat, math, coding, safety, science, multilingual | axolotl |
| Alignment method | Quantile Reward Policy Optimization algorithm (QRPO) | Direct Preference Optimisation (DPO) | axolotl |
| Alignment data size (number of prompts) | ? | 200,000 | ? |
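
As an illustration of the DPO entry, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It follows the published DPO objective, not the Olmo 3 training code; the function name, the argument names and the value `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from any report
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Inputs are log-probabilities of whole responses under the trained
    policy and under the frozen reference model.
    """
    # Implicit rewards: how much the policy moved away from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss falls as the policy raises the probability of the chosen response relative to the rejected one, measured against the reference model, which is how DPO uses preference pairs without training a separate reward model.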

 

Performance evaluation

| | Apertus 70B | Olmo 3 32B Think | EuroLLM 22B Instruct |
|---|---|---|---|
| MMLU | 65.2; score reported in EuroLLM: 67.9 | 86.4; score for OLMo-3.1-32B reported in EuroLLM: 80.1 | 69.8 |
| ARC | 70.6; score reported in EuroLLM: 84.7 | ?; score for OLMo-3.1-32B reported in EuroLLM: 93.6 | 89.8 |