Overview

This project investigates how automated quality assurance (QA) methods can help diagnose problems in translated benchmark items before those items are used for multilingual LLM evaluation.

Associated publication

Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite.
Klaudia-Doris Thellmann, Bernhard Stadler, Michael Färber — LREC 2026.

Resources

Resource | Link
Paper | TODO
Code | TODO
Dataset / benchmark material | TODO
Slides / poster | TODO
Blog post | LREC 2026 conference notes

Research questions

  • How can translated benchmark items be checked systematically and at scale?
  • Which types of translation artifacts are especially relevant for model evaluation?
  • How can benchmark QA become part of reproducible multilingual evaluation pipelines?
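
To make the first question more concrete, here is a minimal sketch of what a rule-based check over translated items could look like. It is illustrative only and is not the method used in the associated publication: the `Item` fields, the flag names, and the length-ratio thresholds are all assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical record for one benchmark item (field names are assumptions)."""
    item_id: str
    source: str       # original text, e.g. English
    translation: str  # translated text to be checked


def check_item(item: Item, min_ratio: float = 0.5, max_ratio: float = 2.0) -> list[str]:
    """Return heuristic flags for one item; an empty list means nothing was flagged."""
    src = item.source.strip()
    tgt = item.translation.strip()
    if not tgt:
        return ["empty_translation"]
    flags = []
    # Identical source and target text suggests the item was left untranslated.
    if src.casefold() == tgt.casefold():
        flags.append("untranslated")
    # A very short or very long translation may indicate truncation or added text.
    # The default thresholds are illustrative, not tuned values.
    ratio = len(tgt) / max(len(src), 1)
    if not (min_ratio <= ratio <= max_ratio):
        flags.append("length_ratio")
    return flags


def check_all(items: list[Item]) -> dict[str, list[str]]:
    """Map item_id to its flags, keeping only items with at least one flag."""
    report = {}
    for item in items:
        flags = check_item(item)
        if flags:
            report[item.item_id] = flags
    return report


items = [
    Item("a", "What is the capital of France?", "Was ist die Hauptstadt von Frankreich?"),
    Item("b", "Which planet is largest?", "Which planet is largest?"),
    Item("c", "Name three primary colours and explain why.", "Ja"),
]
report = check_all(items)
```

Heuristics like these only surface candidates for review. Short answers such as proper names can legitimately be identical across languages, so flagged items would still need a human or model-based check.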