Teuken-7B and multilingual evaluation

Overview

OpenGPT-X (2022–2025) was a BMWK-funded German consortium led by Fraunhofer IAIS to build open, multilingual large language models for Europe. I joined the project as a Research Engineer at TU Dresden’s ZIH (Center for Information Services and High Performance Computing), where I focused on evaluation: the pipelines, benchmarks, and comparative studies that made multilingual LLMs comparable and their results reproducible and meaningful for European languages.

The four papers below tell one continuous story — from how to tokenize multilingual text, to how to instruction-tune it, to the Teuken-7B model itself, and finally to the open multilingual evaluation infrastructure used to compare it with the rest of the LLM landscape.

The story behind the papers

1. Getting the basics right — multilingual tokenizers

Tokenizer Choice For LLM Training: Negligible or Crucial? — Findings of NAACL 2024

Across 24 mono- and multilingual 2.6B-parameter models, we asked a basic question that turns out to matter a lot: how much does tokenizer design cost you in multilingual training? An English-centric tokenizer can inflate training cost by up to 68% on multilingual data. Tokenizers trained on a balanced language mix produce shorter sequences (lower fertility, i.e. fewer tokens per word), which keeps multilingual training within budget and yields better non-English downstream performance.
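To make the fertility measure concrete, here is a minimal sketch that computes tokens per whitespace-separated word for any tokenizer passed in as a function. The two toy tokenizers (`char_chunks`, `whole_words`) and the sample sentence are hypothetical and exist only for illustration; they are not the tokenizers studied in the paper. The sequence-length overhead it computes is also not the same thing as the training-cost figure quoted above.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


# Two hypothetical toy tokenizers, for illustration only.
def char_chunks(text: str, size: int = 3) -> List[str]:
    """Split every word into fixed-size character chunks (high fertility)."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


def whole_words(text: str) -> List[str]:
    """Keep each word as a single token (fertility of exactly 1.0)."""
    return text.split()


sample = ["Donaudampfschifffahrt ist ein langes Wort"]
f_chunks = fertility(sample, char_chunks)
f_words = fertility(sample, whole_words)

# Relative sequence-length overhead of the chunking tokenizer over the
# word-level one: more tokens for the same text means longer sequences.
overhead = f_chunks / f_words - 1.0
print(f"fertility: chunks={f_chunks:.2f}, words={f_words:.2f}, overhead={overhead:.0%}")
```

With a real tokenizer, `tokenize` would wrap its encode call, and `texts` would be a held-out sample for each language so that fertility can be compared across languages.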

2. Teaching the model to follow multilingual instructions

Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions? — EMNLP 2024

We built a 1,030-instruction multilingual dataset across five languages and asked whether multilingual models really need multilingual instructions. They do: multilingual instruction-tuning improves low-resource accuracy by up to 10%, and human-curated data substantially outperforms synthetic data. Diversity and dataset size matter most.

3. Teuken-7B — the open European LLM

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs — EACL 2025

The flagship deliverable: a 7B-parameter LLM trained from scratch on all 24 EU official languages, with a strong focus on non-English European language data, plus an instruction-tuned variant. The tokenizer and instruction-tuning design follow directly from the two studies above. The full pipeline — data curation, training, instruction-tuning, evaluation — is openly documented, and the models are released for research and commercial use on Hugging Face.

4. A trustworthy evaluation infrastructure for European LLMs

Towards Multilingual LLM Evaluation for European Languages — arXiv 2025

This work turned the in-house evaluation infrastructure into a public resource. We translated five widely used benchmarks into 20 EU languages — the EU20 benchmark suite — and evaluated 40+ state-of-the-art LLMs on reasoning, knowledge, and truthfulness. The resulting European LLM Leaderboard on Hugging Face exposes a >20% accuracy gap between high- and mid-resource languages. We also validated the automated benchmarks against human preferences from the Chatbot Arena and found strong agreement — supporting EU20 as a reliable multilingual evaluation instrument.
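One common way to quantify agreement between an automated benchmark and human preferences is a rank correlation over models. The sketch below implements Spearman's rho in plain Python. The summary above does not say which statistic was used, so the choice of Spearman is an assumption, and the scores for the four hypothetical models are made-up numbers, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up numbers for four hypothetical models, not results from the paper.
benchmark_accuracy = [0.41, 0.55, 0.62, 0.70]
arena_rating = [1010.0, 1080.0, 1065.0, 1150.0]

rho = spearman(benchmark_accuracy, arena_rating)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the benchmark orders the models almost the same way human raters do; a value near 0 means the two orderings are unrelated.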

Papers

Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali, Michael Fromm, Klaudia-Doris Thellmann, et al. Findings of NAACL 2024.
DOI: https://doi.org/10.18653/v1/2024.findings-naacl.247

Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?
Alexander Arno Weber, Klaudia-Doris Thellmann, Jan Ebert, et al. EMNLP 2024.
DOI: https://doi.org/10.18653/v1/2024.emnlp-main.1159

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali, Michael Fromm, Klaudia-Doris Thellmann, et al. EACL 2025.
arXiv: https://arxiv.org/abs/2410.03730

Towards Multilingual LLM Evaluation for European Languages
Klaudia-Doris Thellmann, Bernhard Stadler, Michael Fromm, et al. arXiv 2025.
arXiv: https://arxiv.org/abs/2410.08928

Code & models

Teuken-7B models: Hugging Face
European LLM Leaderboard: Hugging Face Space
Leaderboard data: openGPT-X/leaderboard_data · leaderboard_data_ogx
Code on GitHub: OpenGPTX organization — including OpenGPTX forks of lm-evaluation-harness (used and extended for multilingual evaluation) and Megatron-LM (used for training)
Hugging Face organization: openGPT-X
