I am a Machine Learning Engineer / NLP researcher at TU Dresden, working on multilingual LLM evaluation and scalable benchmarking pipelines. My background is in computer science (university degree) and data management (Big Data architectures & analytics); I now focus on NLP and large language models (LLMs). My research addresses the reliable evaluation of multilingual LLMs, especially the validity of translated benchmarks, translation-aware evaluation, and culturally robust multilingual benchmarking.
Experience
- TU Dresden (ZIH/VDR) — Machine Learning Engineer (NLP/LLMs), since 01/2023
- TU Dresden (ScaDS.AI) — Research Associate (NLP), 02/2020–02/2021
- Fraunhofer IAIS — Data Scientist / Lecturer (Big Data architectures), 01/2017–06/2019
- Fraunhofer IAIS — Software Engineer / Big Data Architect, 07/2013–12/2016
- University of Bonn (EIS) — Research Associate (Semantic Web), 01/2014–11/2015
Research Interests
- Multilingual LLM evaluation & benchmark design
- Translation artifacts & translation-aware metrics
- Automated benchmark QA (e.g., MT quality estimation, LLM-as-a-judge)
- Human-aligned evaluation / preference validation
- Efficient large-scale evaluation on HPC clusters
Contact: klaudia-doris.thellmann [at] tu-dresden [dot] de
News
- 2026 Under review — “Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation.”
- 2026 LREC 2026 — “Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite.”
- 2025 EACL 2025 — “Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs.”
- 2024 EMNLP 2024 — “Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?”
- 2024 Findings of NAACL 2024 — “Tokenizer Choice For LLM Training: Negligible or Crucial?”