IBM to test Southeast Asian LLM and facilitate localization efforts

May 28, 2024 TH Author

Image: Didier Marti/Getty Images

IBM has inked an agreement with AI Singapore (AISG) to test the latter’s Southeast Asian large language model (LLM) and make it available for developers to build customized artificial intelligence (AI) applications.

Under the partnership, IBM will test the Southeast Asian Languages in One Network (SEA-LION) model using Big Blue’s AI technology and data platform, Watsonx, and work with AISG to fine-tune the LLM. The goal is to help organizations choose suitable AI models for their business requirements, IBM and AISG said in a joint statement on Tuesday.

IBM will also make SEA-LION available in its AI use case library, dubbed Digital Self-Serve Co-Create Experience (DSCE), enabling developers and data scientists to build localized generative AI (GenAI) applications.

An open-source LLM developed by AISG, SEA-LION is designed to be smaller, more flexible, and faster than other LLMs, according to AISG. It is currently available in two base-model sizes: 3 billion parameters and 7 billion parameters. The training data comprises 981 billion tokens, which AISG defines as fragments of words created by breaking down text during tokenization. These include 623 billion English tokens, 128 billion tokens in Southeast Asian languages, and 91 billion Chinese tokens.
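To put the token figures quoted above in proportion, the short Python sketch below computes each named category's share of the 981-billion-token total. The function and variable names are illustrative assumptions, and the "Other" bucket is simply what remains after the three quoted categories; the figures above do not say what it contains.

```python
# Token counts quoted in the article, in billions.
TOTAL_TOKENS_B = 981
named_tokens_b = {
    "English": 623,
    "Southeast Asian languages": 128,
    "Chinese": 91,
}


def token_shares(total, named):
    """Return each category's share of the total as a percentage (1 decimal)."""
    shares = {name: round(100 * count / total, 1) for name, count in named.items()}
    # Whatever the quoted categories do not cover is reported as a remainder.
    remainder = total - sum(named.values())
    shares["Other (not itemized here)"] = round(100 * remainder / total, 1)
    return shares


shares = token_shares(TOTAL_TOKENS_B, named_tokens_b)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

On these numbers, English accounts for roughly 63.5% of the training tokens and Southeast Asian languages for about 13%, with around 14% falling outside the three quoted categories.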

With SEA-LION, Singapore aims to drive the development of LLMs that better reflect Southeast Asia’s societal mix and exhibit stronger contextual understanding of the region’s cultures and languages.

The partnership aims to advance a “custom-made foundation model” built for Southeast Asia by Southeast Asians, according to Leslie Teo, AISG’s senior director of AI products. The two organizations will also look to build use cases, drive SEA-LION’s adoption, and help organizations “scale AI safely and responsibly,” Teo said.

The collaboration encompasses efforts to incorporate AI governance into SEA-LION, so businesses can better navigate compliance, risk management, and model lifecycle management, even as government regulations on AI continue to evolve.

“[IBM] believes further progress of GenAI will bring greater performance in smaller language models, with users given the opportunity to personalize models based on their business and industry requirements,” Catherine Lian, IBM Asean’s general manager and technology leader, said in a statement.

“No one model is a one-size-fits-all for businesses, and organizations must be empowered with a choice to use their models based on their needs,” Lian said. “[The] SEA-LION LLM is a big step forward in creating an open AI system and addressing the Asean language challenges that companies and governments face when working with AI.”

In March, AISG also announced a partnership with Google to enhance datasets used to train, fine-tune, and assess AI models in languages specific to Southeast Asia. Called Project Southeast Asian Languages in One Network Data, the initiative aims to “improve cultural context awareness” in LLMs built for the region.

Initially, the project will focus on Indonesian, Thai, Tamil, Filipino, and Burmese. AISG and Google will develop translocalization and translation models for these languages. They will also build tools to help scale translocalization capabilities, share best practices for tuning datasets, and publish pre-training guides for Southeast Asian languages.
