Efforts are underway to establish a common set of benchmarks for evaluating generative artificial intelligence (AI) products and to create a “body of knowledge” on how these tools should be tested. The initiative, called Sandbox, is led by Singapore’s Infocomm Media Development Authority (IMDA) and the AI Verify Foundation, with support from global market players including Amazon Web Services (AWS), Anthropic, Google, and Microsoft. The goal is to provide a standardized approach to evaluating generative AI applications and addressing their associated risks, marking a departure from previously fragmented evaluation efforts.
Sandbox is guided by a draft catalog that categorizes existing benchmarks and evaluation methods for large language models (LLMs). The catalog compiles commonly used technical testing tools and organizes them based on their testing objectives and methods. It also recommends a baseline set of tests for evaluating generative AI products. IMDA aims to establish a common language and promote the safe and trustworthy adoption of generative AI through this initiative. Rigorous evaluation of models is crucial for building trust and determining their intended uses and limitations. It also provides developers with a roadmap for improvement.
To achieve this common language, a standardized taxonomy and a baseline set of pre-deployment safety evaluations for LLMs are needed. IMDA hopes the draft catalog will serve as a starting point for global discussions and drive consensus on safety standards for LLMs. Beyond model developers, other stakeholders in the ecosystem, such as application developers and third-party testing tool developers, should be involved in developing these standards. Through Sandbox, IMDA aims to demonstrate how these different players can collaborate on generative AI use cases in sectors such as finance and telecommunications. Regulators, such as Singapore’s Personal Data Protection Commission, should also be involved to create an environment for experimentation and development in which all parties can be transparent about their needs.
IMDA expects Sandbox to uncover gaps in the current state of generative AI evaluations, particularly in domain-specific applications such as human resources, as well as in culture-specific contexts. The agency plans to develop benchmarks for evaluating model performance in these areas, taking cultural and language specificities into account. IMDA is collaborating with Anthropic on a Sandbox project to identify aspects suitable for red teaming, an adversarial approach that challenges the policies and assumptions built into AI systems. Anthropic’s models and research tooling platform will be used to develop red-teaming methodologies customized for Singapore’s diverse linguistic and cultural landscape.
In July, the Singapore government launched two sandboxes built on Google Cloud’s generative AI toolsets. One is used exclusively by government agencies to develop and test generative AI applications, while the other is available to local organizations free of charge for three months and supports up to 100 use cases.