Hello 👋

I am a Machine Learning Engineer / NLP researcher at TU Dresden, working on multilingual large language model (LLM) evaluation and scalable benchmarking pipelines. I hold a university degree in computer science and have prior experience in data management (Big Data architectures & analytics); my focus is now on NLP and LLMs. My research addresses the reliable evaluation of multilingual LLMs, especially the validity of translated benchmarks, translation-aware evaluation, and culturally robust multilingual benchmarking.

Experience

  • TU Dresden (ZIH/VDR) — Machine Learning Engineer (NLP/LLMs), since 01/2023
  • TU Dresden (ScaDS.AI) — Research Associate (NLP), 02/2020–02/2021
  • Fraunhofer IAIS — Data Scientist / Lecturer (Big Data architectures), 01/2017–06/2019
  • Fraunhofer IAIS — Software Engineer / Big Data Architect, 07/2013–12/2016
  • University of Bonn (EIS) — Research Associate (Semantic Web), 01/2014–11/2015

Research Interests

  • Multilingual LLM evaluation & benchmark design
  • Translation artifacts & translation-aware metrics
  • Automated benchmark QA (e.g., MT quality estimation, LLM-as-a-judge)
  • Human-aligned evaluation / preference validation
  • Efficient large-scale evaluation on HPC clusters

Contact: klaudia-doris.thellmann [at] tu-dresden [dot] de

News

  • 2026 Under review — “Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation.”
  • 2026 LREC 2026 — “Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite.”
  • 2025 EACL 2025 — “Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs.”
  • 2024 EMNLP 2024 — “Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?”
  • 2024 Findings of NAACL 2024 — “Tokenizer Choice For LLM Training: Negligible or Crucial?”