
15. Fully open LLMs: Apertus 70B, Olmo 3 32B Think and EuroLLM 22B (summary comparison)

Technical reports:

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Olmo 3

EuroLLM-22B: Technical Report

User interface

	Apertus 70B	Olmo 3 32B Think	EuroLLM 22B Instruct
Web demo	Public AI Switzerland	Allen AI Playground	?
Hugging Face	https://huggingface.co/swiss-ai/Apertus-70B-2509	https://huggingface.co/allenai/Olmo-3-32B-Think	https://huggingface.co/utter-project/EuroLLM-22B-Instruct-2512
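
The Hugging Face links above can be turned into repository IDs and loaded with the `transformers` library. The following is a minimal sketch, not taken from any of the three technical reports: the helper names `repo_id_from_url` and `generate` are hypothetical, and it assumes a recent `transformers` release that supports these architectures, the `accelerate` package, and enough GPU memory for the chosen checkpoint.

```python
from urllib.parse import urlparse

# Model pages listed in the table above.
MODEL_URLS = {
    "Apertus 70B": "https://huggingface.co/swiss-ai/Apertus-70B-2509",
    "Olmo 3 32B Think": "https://huggingface.co/allenai/Olmo-3-32B-Think",
    "EuroLLM 22B Instruct": "https://huggingface.co/utter-project/EuroLLM-22B-Instruct-2512",
}


def repo_id_from_url(url: str) -> str:
    """Return the 'organisation/model' part of a Hugging Face model URL."""
    return urlparse(url).path.strip("/")


def generate(repo_id: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Load a checkpoint and continue a prompt (this downloads the weights)."""
    # Imported lazily so the URL helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # device_map="auto" relies on the accelerate package (an assumption here).
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)


REPO_IDS = {name: repo_id_from_url(url) for name, url in MODEL_URLS.items()}
```

Calling `generate(REPO_IDS["EuroLLM 22B Instruct"], "Translate to German: good morning")` would download the full checkpoint. For the instruction-tuned and reasoning models, the prompt would normally be built with the tokenizer's chat template (`tokenizer.apply_chat_template`) rather than passed as raw text.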

Training infrastructure

	Apertus 70B	Olmo 3 32B Think	EuroLLM 22B Instruct
costs	6 million GPU hours; the model can realistically be trained in approximately 90 days on 4,096 GPUs	56 days on a cluster of 1,024 H100 GPUs, approx. 1.4 million GPU hours	1.7 GPU hours credit
codebase	fork of Megatron-LM	OLMo-core	fork of Megatron-LM
location	Alps Research Infrastructure at the Swiss National Supercomputing Centre (CSCS)	?	EuroHPC centres
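
The cost figures in the table relate GPU count, wall-clock days and GPU hours. The arithmetic can be sketched as follows; the function names are hypothetical, and the calculation assumes every GPU is busy 24 hours a day, which a real training run does not guarantee.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU hours consumed if all GPUs run continuously for the given days."""
    return num_gpus * days * 24


def days_needed(total_gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock days needed to spend a GPU-hour budget at full utilisation."""
    return total_gpu_hours / (num_gpus * 24)


# Olmo 3 32B Think: 1,024 GPUs for 56 days.
olmo_hours = gpu_hours(1024, 56)            # 1,376,256, i.e. approx. 1.4 million

# Apertus 70B: upper bound for 4,096 GPUs over 90 days.
apertus_upper_bound = gpu_hours(4096, 90)   # 8,847,360

# Apertus 70B: days of continuous compute implied by 6 million GPU hours.
apertus_days = days_needed(6_000_000, 4096)

print(f"{olmo_hours:,.0f} {apertus_upper_bound:,.0f} {apertus_days:.1f}")
```

The Olmo figures agree with each other. For Apertus, 6 million GPU hours correspond to about 61 days of continuous use of 4,096 GPUs, so the quoted 90 days leaves room for time in which not all GPUs are computing.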

Data

	Apertus 70B	Olmo 3 32B Think	EuroLLM 22B Instruct
Pre-training	15 trillion tokens; the most recent crawling permissions specified by data owners are applied retroactively; FineWeb-HQ, FineWeb-Edu, DCLM-Edu, FineWeb-2 (1811 languages), FineWeb-2-HQ (20 high-resource languages), Translation Parallel Data, Clean Wikipedia, StarCoderData, StarCoder Edu, CommonPile/Stack v2 Edu, FineMath, MegaMath, EuroBlocks-SFT-Synthetic-1124, Flan filtered for licenses allowing commercial use; pipelines for creating each data part are available on GitHub	The Dolma 3 Mix, 6 trillion tokens: web content, academic publications, code, and more, mostly from CommonCrawl; the Dolma Python library is available for replicating the data creation process	4 trillion tokens, EuroWeb: FineWeb-edu, RedPajama-Data-v2, HPLT, MADLAD-400, CulturaX, mC4, a lot of parallel data (e.g. Europarl, OPUS100, WikiTitles, …)
Post-training	instruction datasets from OLMo2, Tülu 3, WildChat, SciRiff, OpenMath GSM8K, Llama-Nemotron mathematical reasoning and code data, function-calling datasets from xlam, Glaive, APIGen, SmolTalk2 conversational data, EuroBlocks synthetic multilingual instructions, s1k 42 langs, Tome, local Swiss and low-resource instructions, Chat format	Dolci Think SFT, Dolci Think DPO, Dolci Think RL; Dolci is sourced from the Olmo2 suite with additions (e.g. ZebraLogic, Simple QA) and includes AceCoder, The Algorithms, Llama Nemotron, OpenCodeReasoning, Tülu 3, WildChat, TableGPT, Aya	EuroBlocks (sourced from public data sets), Hermes-3, Tülu 3, Nemotron V1-2
Focus	Compliance	Source data	Cleaning
Multilinguality	Strong	Weak	Medium

Tokenisation

	Apertus 70B	Olmo 3 32B Think	EuroLLM 22B Instruct
what	v3 tekken tokenizer from Mistral-Nemo-Base-2407	OpenAI’s cl100k	BPE-based
why	wins on intrinsic evaluation, smaller vocab	?, same as Olmo2	broad coverage of European and global languages, same as in previous versions
vocab size	131,072, 47 custom tokens for math	not specified but could be 100k, special tokens only in SFT	128,000
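
Vocabulary size matters for model size because the input embedding table has one row per token. A rough sketch of that arithmetic is given below. The function name is hypothetical. For EuroLLM both numbers come from the tables in this comparison (vocabulary 128,000, embedding size 6,144); for Apertus the embedding size is not listed, so using its hidden size of 8,192 as the embedding width is an assumption.

```python
def embedding_parameters(vocab_size: int, embedding_width: int) -> int:
    """Number of weights in an embedding table with one row per token."""
    return vocab_size * embedding_width


# EuroLLM 22B: vocabulary and embedding size as given in the tables.
eurollm_embedding = embedding_parameters(128_000, 6_144)

# Apertus 70B: ASSUMPTION that the embedding width equals the hidden size 8,192.
apertus_embedding = embedding_parameters(131_072, 8_192)

print(f"EuroLLM: {eurollm_embedding:,}")   # 786,432,000
print(f"Apertus: {apertus_embedding:,}")   # 1,073,741,824
```

If the output projection is not tied to the input embeddings, a second matrix of the same shape is needed, which doubles these counts.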

Base model architecture choices (pre-training)

	Apertus 70B	Olmo 3 32B Think	EuroLLM 22B Instruct
sequence length	initially 4,096, then doubled	8,192 with sliding window attention (SWA) 4,096	32,768
number of layers	80	64	54
embedding size	?	?	6,144
hidden size	8,192	5,120	16,384
positional encodings	RoPE	RoPE	RoPE
batch size	initially 2,048, then doubled	1,024	?
activation	xIELU	SwiGLU	SwiGLU
loss function	Goldfish, prevents memorising	Z-Loss (?)	standard cross-entropy objective (?)
optimisation	AdEMAMix - scales more favourably with model size	AdamW (?)	?
long context	65,536	65,536	32,768
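
The sliding window attention (SWA) entry for Olmo 3 means that a token does not attend to the whole 8,192-token sequence but only to a limited number of preceding positions. A minimal sketch of such a mask follows. It is an illustration, not the Olmo implementation: the function names are hypothetical, and it assumes the window counts the current token, a convention that differs between codebases.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    A position sees itself and the window - 1 positions before it,
    and never a later position (causal masking).
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible_keys(position: int, window: int) -> int:
    """Number of keys visible from a 0-based query position."""
    return min(position + 1, window)


mask = sliding_window_causal_mask(seq_len=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a window of 4,096 and a sequence length of 8,192, `visible_keys(8191, 4096)` gives 4,096: the last token attends to half of the sequence, and the cost of attention per token stops growing once the window is full.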

Post-training choices

	Apertus 70B	Olmo 3 32B Think	EuroLLM 22B Instruct
format and chat template	Apertus Chat Format Specification	OpenAPI specification	axolotl
SFT data size	3.8 million examples	2.3 million examples	10.6 million multilingual examples
alignment goals	helpfulness, honesty, safety, and refusal	chat, math, coding, safety, science, multilingual	axolotl
alignment method	Quantile Reward Policy Optimization algorithm (QRPO)	Direct Preference Optimisation (DPO)	axolotl
alignment data size (number of prompts)	?	200,000	?
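
Direct Preference Optimisation, listed above as the alignment method for Olmo 3, trains on pairs of a preferred and a rejected response and compares the model being trained with a frozen reference model. The per-pair loss can be sketched in plain Python as below. This is the standard DPO formula, not code from the Olmo 3 report; the function names are hypothetical and `beta=0.1` is an assumed value chosen only for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # ASSUMPTION: illustrative value, not from the report
) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the full responses under the
    trained policy and under the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin).
    return softplus(-beta * margin)


# Policy identical to the reference: the margin is 0 and the loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(neutral, improved)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference does, and rises in the opposite case.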

Performance evaluation

	Apertus 70B	Olmo 3 32B Think	EuroLLM 22B Instruct
MMLU	65.2 (67.9 as reported in the EuroLLM report)	86.4 (80.1 for OLMo-3.1-32B as reported in the EuroLLM report)	69.8
ARC	70.6 (84.7 as reported in the EuroLLM report)	? (93.6 for OLMo-3.1-32B as reported in the EuroLLM report)	89.8