MLCommons Announces Expansion of Industry-Leading AILuminate Benchmark
World leader in AI benchmarking announces new partnership with India's NASSCOM; updated reliability grades for leading LLMs
SAN FRANCISCO, May 29, 2025 (GLOBE NEWSWIRE) -- MLCommons® today announced that it is expanding its first-of-its-kind AILuminate benchmark to measure AI reliability across new models, languages, and tools. As part of this expansion, MLCommons is partnering with NASSCOM, India's premier technology trade association, to bring AILuminate's globally recognized AI reliability benchmarks to South Asia. MLCommons is also unveiling new proof of concept testing for AILuminate's Chinese-language capabilities and new AILuminate reliability grades for an expanded suite of large language models (LLMs).
”We're looking forward to working with NASSCOM to develop India-specific, Hindi-language benchmarks and ensure companies in India and around the world can better measure the reliability and risk of their AI products,” said Peter Mattson, President of MLCommons. “This partnership, along with new AILuminate grades and proof of concept for Chinese language capabilities, represents a major step towards the development of globally inclusive industry standards for AI reliability.”
“The rapid development of AI is reshaping India's technology sector and, in order to harness risk and foster innovation, rigorous global standards can help align the growth of the industry with emerging best practices,” said Ankit Bose, Head of NASSCOM AI. “We plan to work alongside MLCommons to develop these standards and ensure that the growth and societal integration of AI technology continues responsibly.”
The NASSCOM collaboration builds on MLCommons' intentionally global approach to AI benchmarking. Modeled after MLCommons' ongoing partnership with Singapore's AI Verify Foundation, the NASSCOM partnership will help to meet South Asia's urgent need for standardized AI benchmarks that are collaboratively designed and trusted by the region's industry experts, policymakers, civil society members, and academic researchers. MLCommons' partnership with the AI Verify Foundation – in close collaboration with the National University of Singapore – has already resulted in significant progress towards globally-inclusive AI benchmarking across East Asia, including just-released proof of concept scores for Chinese-language LLMs.
AILuminate is also unveiling new reliability grades for an updated and expanded suite of LLMs, to help companies around the world better measure product risk. Like previous AILuminate testing, these grades are based on LLM responses to 24,000 test prompts across 12 hazard categories – including including violent and non-violent crimes, child sexual exploitation, hate, and suicide/self-harm. None of the LLMs evaluated were given any advance knowledge of the evaluation prompts (a common problem in non-rigorous benchmarking), nor access to the evaluator model used to assess responses. This independence provides a methodological rigor uncommon in standard academic research or private benchmarking.
“Companies are rapidly incorporating chatbots into their products, and these updated grades will help them better understand and compare risk across new and constantly-updated models,” said Rebecca Weiss, Executive Director of MLCommons.”We're grateful to our partners on the Risk and Reliability Working Group – including some of the foremost AI researchers, developers, and technical experts – for ensuring a rigorous, empirically-sound analysis that can be trusted by industry and academia like.”
Having successfully expanded the AILuminate benchmark to multiple languages, the AI Risk & Reliability Working Group is beginning the process of evaluating reliability across increasingly sophisticated AI tools, including mutli-modal LLMs and agentic AI. We hope to announce proof-of-concept benchmarks in these spaces later this year.
About MLCommons
MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf® benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI risk and reliability.
Press Inquiries:
press@mlcommons.org
