Samsung Electronics has unveiled TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark), a proprietary framework developed by Samsung Research to measure AI productivity in enterprise settings.
TRUEBench aims to provide a realistic assessment of how large language models (LLMs) perform in real-world workplace tasks, emphasizing diverse dialogue scenarios and multilingual conditions that reflect actual business communication.
The benchmark covers common enterprise activities such as content generation, data analysis, summarization, and translation, spanning 10 task categories and 46 sub-categories.
Scoring relies on AI-powered automatic evaluation built on criteria designed and refined jointly by human experts and AI, enhancing reliability and reducing subjective bias. TRUEBench also supports cross-linguistic evaluation, comprising 2,485 test sets across 12 languages to mirror global workflows.
Input lengths range from brief prompts of just eight characters to documents exceeding 20,000 characters, ensuring coverage of simple requests as well as complex, multi-step tasks.
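Samsung has not published a formal schema for individual test items, but the structure described above suggests a record roughly like the sketch below. All field names and values here are illustrative assumptions, not the official TRUEBench format:

```python
# Hypothetical sketch of a single TRUEBench test item, inferred from the
# public description; field names and values are illustrative assumptions,
# not Samsung's published schema.
from dataclasses import dataclass, field

@dataclass
class TrueBenchItem:
    item_id: str                 # identifier within the 2,485 test sets
    category: str                # one of the 10 task categories
    sub_category: str            # one of the 46 sub-categories
    language: str                # one of the 12 covered languages
    prompt: str                  # input text, 8 to 20,000+ characters
    conditions: list[str] = field(default_factory=list)  # all must hold to pass

example = TrueBenchItem(
    item_id="sum-ko-0042",
    category="summarization",
    sub_category="meeting_minutes",
    language="ko",
    prompt="다음 회의록을 세 문장으로 요약하세요: ...",
    conditions=[
        "Summary is exactly three sentences",
        "Response is written in Korean",
        "Every decision recorded in the minutes is mentioned",
    ],
)
```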
Samsung Research notes that existing benchmarks are often English-centric, focus on single-turn Q&A, and fail to capture the complexities of real work environments. TRUEBench seeks to address these gaps by evaluating not only accuracy but also users' implicit needs, expressed as detailed conditions that a model must satisfy to pass each test.
The evaluation process features a collaborative, iterative approach. Human annotators establish the initial criteria, which are then reviewed by AI to identify errors, contradictions, or overly restrictive constraints. Afterward, human evaluators refine the criteria again, and this cycle repeats to produce increasingly precise standards.
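As a rough sketch, that cycle could be modeled as below. Here `ai_review` and `human_refine` are hypothetical stand-ins for the LLM reviewer and the human annotation pass (the actual pipeline is not public), and the no-remaining-issues stopping rule is an assumption:

```python
# Illustrative sketch of the iterative human-AI refinement cycle described
# above. The reviewer and refiner are passed in as callables because the
# real components are proprietary; the stopping rule is assumed.
from typing import Callable

def refine_criteria(
    criteria: list[str],
    ai_review: Callable[[list[str]], list[str]],                # flags errors, contradictions, over-constraints
    human_refine: Callable[[list[str], list[str]], list[str]],  # humans resolve the flagged issues
    max_rounds: int = 5,
) -> list[str]:
    for _ in range(max_rounds):
        issues = ai_review(criteria)
        if not issues:          # nothing left to fix: criteria are stable
            break
        criteria = human_refine(criteria, issues)
    return criteria
```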
The resulting automatic evaluation applies these cross-verified criteria, promoting consistency and minimizing bias. For each test, all stipulated conditions must be met, enabling more granular and precise scoring across tasks.
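Taken literally, that all-or-nothing rule is straightforward to express. The sketch below assumes each condition has already been judged to a boolean by the automatic evaluator, and that the aggregate score is the fraction of tests passed; both are assumptions about details Samsung has not published:

```python
# Sketch of the strict pass rule: a response passes a test only if every
# stipulated condition is met. Per-condition booleans are assumed to come
# from the AI-based automatic evaluator; fraction-passed aggregation is
# likewise an assumption, not a documented TRUEBench formula.

def passes(condition_results: list[bool]) -> bool:
    """A test counts as passed only when all of its conditions hold."""
    return all(condition_results)

def benchmark_score(per_test_results: list[list[bool]]) -> float:
    """Fraction of tests passed across the benchmark."""
    if not per_test_results:
        return 0.0
    return sum(passes(r) for r in per_test_results) / len(per_test_results)

# Example: three tests; only the first meets every condition.
print(benchmark_score([[True, True], [True, False], [False]]))  # ≈ 0.33
```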
TRUEBench data samples and leaderboards are hosted on Hugging Face, allowing public access to model performance comparisons. The platform supports up to five models per comparison and publishes metrics on both performance and efficiency, including data on average response lengths. Details about TRUEBench can be found on the Hugging Face page at https://huggingface.co/spaces/SamsungResearch/TRUEBench.
Paul (Kyungwhoon) Cheun, CTO of Samsung Electronics’ DX Division and Head of Samsung Research, underscored that TRUEBench distills Samsung’s extensive in-house experience of using AI to enhance productivity.
He stated that the benchmark aims to set new standards for productivity evaluation and reinforce Samsung’s position as a technology leader.