As generative AI models evolve, customized test benchmarks and openness are crucial

As generative artificial intelligence (AI) models continue to evolve, industry collaboration and customized test benchmarks will be crucial as organizations work to find the right fit for their business.

This effort will be necessary as enterprises seek out large language models (LLMs) trained on data specific to their verticals, and as countries look to ensure AI models are trained on data as well as principles that are based on their own unique values, according to Ong Cheng Hui, assistant chief executive of the business and technology group at Infocomm Media Development Authority (IMDA). 

She questioned whether one large foundation model is really the way forward or whether there is a need for more specialized models, pointing to Bloomberg’s efforts to build its own large-scale generative AI model, BloombergGPT, that has been specifically trained on financial data. 

As long as the necessary expertise, data, and compute resources “are not locked up”, the industry can continue to drive developments forward, said Ong, who was speaking to media on the sidelines of the Red Hat Summit this week. 

The software vendor is a member of Singapore’s AI Verify Foundation, which aims to tap the open-source community to develop test toolkits to guide the responsible and ethical use of AI. Launched in June with six other premier members apart from Red Hat, including Google and Microsoft, the initiative is led by IMDA and currently has more than 60 general members.

Singapore has the highest adoption of open-source technologies and principles in the Asia-Pacific region, according to Guna Chellappan, Red Hat’s Singapore general manager. Citing findings from research the vendor commissioned, Chellappan noted that 72% of Singapore organizations said they have made “high or very high progress” in their adoption of open source. 

Port operator PSA Singapore and UOB are among Red Hat’s local customers, with the former deploying open-source applications to automate its operations. Local bank UOB taps Red Hat OpenShift to support its cloud development. 

Going the open-source route is key because transparency is important to driving the message around AI ethics, Ong said, noting that it would be ironic to ask the public to trust the foundation’s test toolkits if details on them were not freely available. 

She also took inspiration from other fields, in particular, cybersecurity, where tools are often developed in an open-source environment and where the community continuously contributes updates to improve these applications. 

“We want AI Verify to be the same,” she said, adding that if the foundation developed the toolkits in silos, it would not be able to keep up with the industry’s fast-changing developments. 

This open collaboration will also help steer efforts toward the best and most effective solutions, she noted. The automotive industry went through a similar cycle, in which seatbelts were designed, tested, and redesigned to determine which design best protected drivers.

The same approach now needs to happen with generative AI, where models and applications should be continuously tested and tweaked to ensure they can be safely deployed within the organization’s guardrails. 

As it is, though, decisions by major players such as OpenAI not to disclose technical details behind their LLMs are worrying some sections of the industry. 

A team of academics led by University of Oxford’s Emanuele La Malfa last month published a research paper highlighting issues that could surface from the lack of information about large language AI models in four areas: accessibility, replicability, reliability, and trustworthiness (AART).

The scholars note that “commercial pressure” has pushed market players to make their AI models accessible as a service to customers, typically via an API. However, information on the models’ architecture, implementation, training data, or training processes is neither provided nor made available to be inspected.  

These access restrictions, along with how LLMs are often black box in nature, contravene the public’s and research community’s need to understand, trust, and control these models better, La Malfa’s team wrote. “This causes a significant problem at the field’s core: the most potent and risky models are also the most difficult to analyze,” they noted. 

OpenAI previously defended its decision not to provide details of its GPT-4 iteration, pointing to the competitive landscape and the security implications of releasing such information on large-scale models, including their architecture, training method, and dataset construction.

Asked how organizations should go about adopting generative AI, Ong said two camps will emerge in the foundation model layer: one comprising a handful of proprietary large language models, including OpenAI’s GPT-4, and the other opting to build their models on an open-source architecture, such as Meta’s Llama 2.

Businesses that are concerned about transparency can choose the open-source alternatives, she suggested. 

Customized test benchmarks are needed 

At the same time, though, businesses increasingly will build on top of the foundation layer in order to deploy generative AI applications that better meet their domain-specific requirements, such as education and financial services.

This application layer will also need to have the guardrails and, hence, some level of transparency and trust will need to be established here, Ong said. 

Here is where AI Verify, with its test toolkits, hopes to help steer companies in the right direction. With organizations operating in different markets, regions, and industries, their primary concern will not be whether an AI model is open source, but whether their generative AI applications fulfil their AI ethics and safety principles, she explained. 

Ong noted that many businesses, as well as governments, are currently testing and assessing generative AI tools, for both consumer-facing and non-consumer-facing use cases. Often they start with the latter to minimize potential risks and customer impact, and expand their test pilots to include consumer-facing applications when they have reached a certain comfort level.

Organizations in highly regulated sectors, such as financial services, will exercise even more caution with consumer-facing applications, she added. 

Countries and societies also hold different values and cultures. Governments will want to ensure AI models are built on training data and principles that are based on their population’s unique mix. 

Singapore’s demographic, for instance, is multi-racial, multi-religion, and multi-lingual. Racial harmony is unique to its society as are local structures and policies, such as its national social security savings scheme, Ong said.

Noting that the LLMs that are widely used today do not perform uniformly well when tested against cultural questions, she pondered whether this deficiency suggested a need for Singapore to build its own LLM and, if so, whether it has sufficient data, as a country with a small population, to train the AI model.

With market players in other regions, specifically China, also releasing their own LLMs trained on local data, ZDNET asked if there was a way to fuse or integrate foundation models from different regions, so they are better adapted to Singapore’s population mix. 

Ong believes there may be a possibility for different LLMs to learn from each other, which is a potential application that can be explored in the research field. Efforts here will have to ensure data privacy and sensitive data remain protected, she said. 

Singapore is currently evaluating the feasibility of such options, including the potential of building its own LLM, according to Ong.

Requirements for specialized generative AI models will further drive the importance of customized toolkits and benchmarks against which AI models are tested and assessed, she said. 

These benchmarks will be needed to test generative AI applications, including third-party and vertical-specific tools, against an organization’s or country’s AI principles and to ensure their deployment remains responsible and ethical.
