Overview
This project investigates how automated quality assurance methods can help diagnose problems in translated benchmark items before those items are used for multilingual LLM evaluation.
Associated publication
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite.
Klaudia-Doris Thellmann, Bernhard Stadler, Michael Färber. LREC 2026.
Resources
| Resource | Link |
|---|---|
| Paper | TODO |
| Code | TODO |
| Dataset / benchmark material | TODO |
| Slides / poster | TODO |
| Blog post | LREC 2026 conference notes |
Research questions
- How can translated benchmark items be checked systematically and at scale?
- Which types of translation artifacts are especially relevant for model evaluation?
- How can benchmark QA become part of reproducible multilingual evaluation pipelines?