Why language coding needs an uplift in a globalized economy
The case for standardization is well established. Not only do standards assist regulation but they also help effectively join up systems across a globalized economy. Hundreds of functions are subject to codes or standardization and languages are one of them.
Internationally recognized codes representing each language, language family and dialect help country systems and organizations correctly identify and manage data accordingly.
They are used for bibliographic purposes within libraries, information management systems, databases and websites, and to ensure that machine learning training data covers the language it is intended to cover. What’s more, correctly aligning language variants is not only efficient, it is convenient and protects your brand. So which code should you use for a precise language designation? ISO 639-3, used together with ISO 3166 country codes, unambiguously identifies almost all known languages in the world.
Before ISO 639-3
Before ISO 639-3, however, there was ISO 639-1, which uses two-letter designations. As the digital world has grown, so has the demand for more precise language support. For instance, “zh” for Chinese under ISO 639-1 is “zho” under ISO 639-3, which also provides around 16 additional codes for individual varieties, e.g. “cdo” for Min Dong Chinese, “cmn” for Mandarin Chinese and “hak” for Hakka Chinese.
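The relationship between a two-letter code, its three-letter equivalent and the more specific codes beneath it can be sketched in a few lines of Python. This is a minimal illustration that contains only the codes named above; the table names and the `expand` helper are hypothetical, and a real system would load the full ISO 639-3 code tables instead.

```python
# Illustrative, deliberately partial tables: only the codes mentioned in the text.
ISO_639_1_TO_3 = {"zh": "zho"}

# Individual language codes grouped under a broader three-letter code.
MEMBER_CODES = {
    "zho": {
        "cdo": "Min Dong Chinese",
        "cmn": "Mandarin Chinese",
        "hak": "Hakka Chinese",
    },
}


def expand(two_letter_code: str) -> tuple[str, list[str]]:
    """Return the three-letter code and any more specific codes under it."""
    three_letter_code = ISO_639_1_TO_3[two_letter_code]
    members = MEMBER_CODES.get(three_letter_code, {})
    return three_letter_code, sorted(members)


code, variants = expand("zh")
print(code, variants)
```

A request tagged only “zh” therefore leaves open which of the listed varieties is meant, which is exactly the ambiguity the three-letter codes remove.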
In the spirit of the 2022 World Cup in Qatar, we could take the English word “football” to understand the importance of differentiating language variants. To US English speakers, football is a completely different game from the one UK speakers mean, which Americans call soccer. However, the distinction between US and UK English is not actually made by ISO 639; it is captured by pairing the language code with a country code.
Even with all the various forms of English around the world, ISO 639 does not count English as a macrolanguage. The other English-related codes are mostly Creole or Pidgin variants, such as Jamaican Creole English. This perhaps underlines how inappropriate it is to collate Arabic variants under the single code “ara”, when Egyptian Arabic (arz) is differentiated from Standard Arabic (arb). When it comes to lexicons, training data and data management solutions, language differentiation is crucial to avoid messy results. The best practice for most applications is to combine ISO 639-3 and ISO 3166 to identify the specific language and region you intend to use.
ISO standards for languages
ISO (International Organization for Standardization) has released five parts of its language identification standard. ISO 639 establishes internationally recognized codes (of two, three or four letters) for representing languages or language families.
Part 1, ISO 639-1, is the oldest standard and represents major languages using a two-letter code. It covers the most commonly spoken languages but doesn’t account for variations within languages. Parts 2, 3 and 5 use three-letter codes and provide finer distinctions, with Part 3 aiming to cover all known natural languages, living or extinct.
ISO 639-3 extends much further than ISO 639-2, covering more than 7,000 languages, and is intended for use as a metadata code. It is commonly used in computer and information systems, such as the web and SaaS applications, to support many different languages.
Delivering increasingly personalized solutions to end customers means precise language identification is a must, so that applications can align with end-user expectations in each region and spoken language. Together, the 3-letter ISO 639-3 codes and ISO 3166 country codes make it possible to differentiate these language variants. Ethnologue, one of the largest and most comprehensive language databases available today, uses the 3-letter ISO system.
Yet there are still a surprising number of requests for training data that either target undifferentiated languages not defined by ISO 639 or expect outputs that include two or more variants sharing an ISO 639-1 code. The later the migration to ISO 639-3 happens, the more ambiguity will build up in systems where language classification is necessary, with a higher risk of cross-variant contamination and higher costs.
Once you start working with languages with more than one variant, it’s essential to migrate to the 3-letter code system. While a one-to-one mapping exists from every 2-letter code to a 3-letter code, it is not so easy the other way around, because most 3-letter codes have no 2-letter equivalent. Even so, updating procedures to the ISO 639-3 standard is a future-proof move worth making early.
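To see why migration is simple in one direction only, consider the following sketch. It assumes a tiny hand-written mapping rather than the official code tables, and the function names `to_iso_639_3` and `to_iso_639_1` are invented for illustration.

```python
from typing import Optional

# Illustrative subset of the 2-letter to 3-letter mapping (not the full tables).
TWO_TO_THREE = {"en": "eng", "zh": "zho", "ar": "ara"}

# The reverse table can only contain codes that have a 2-letter counterpart.
THREE_TO_TWO = {three: two for two, three in TWO_TO_THREE.items()}


def to_iso_639_3(two_letter_code: str) -> str:
    """Upgrade a 2-letter code; every 2-letter code has a 3-letter equivalent."""
    return TWO_TO_THREE[two_letter_code]


def to_iso_639_1(three_letter_code: str) -> Optional[str]:
    """Downgrade a 3-letter code, or return None when no 2-letter code exists."""
    return THREE_TO_TWO.get(three_letter_code)


print(to_iso_639_3("ar"))   # "ara"
print(to_iso_639_1("arz"))  # None: Egyptian Arabic has no 2-letter code
```

Returning `None` rather than falling back to a broader code is a deliberate choice in this sketch: silently downgrading “arz” to “ar” would reintroduce the ambiguity that the 3-letter codes were adopted to remove.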
Benefits of ISO standards
Training natural language processing (NLP) models needs detail and accuracy to be effective for spoken languages. The best combination is ISO 639-3 language codes and ISO 3166 country codes. For example, English (eng) can be divided into American English (eng-USA), British English (eng-GBR), Canadian English (eng-CAN), Australian English (eng-AUS), South African English (eng-ZAF), etc. A voice assistant designed to recognize speech should be able to identify the English dialect in order to understand the request accurately and produce the correct output.
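As a rough sketch of how such combined identifiers might be handled in code, the example below builds and parses tags in the `eng-USA` style shown above. The `LocaleTag` type and both helper functions are hypothetical, and the check covers only the shape of the tag (three lowercase letters, a hyphen, three uppercase letters), not whether the codes are actually registered in ISO 639-3 or ISO 3166.

```python
import re
from typing import NamedTuple


class LocaleTag(NamedTuple):
    language: str  # ISO 639-3 language code, e.g. "eng"
    country: str   # ISO 3166 three-letter country code, e.g. "USA"


# Shape check only: three lowercase letters, a hyphen, three uppercase letters.
_TAG_PATTERN = re.compile(r"([a-z]{3})-([A-Z]{3})")


def parse_tag(tag: str) -> LocaleTag:
    """Split a tag such as "eng-GBR" into its language and country parts."""
    match = _TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a language-country tag: {tag!r}")
    return LocaleTag(match.group(1), match.group(2))


def format_tag(language: str, country: str) -> str:
    """Build a normalized tag, raising ValueError if the shape is wrong."""
    tag = f"{language.lower()}-{country.upper()}"
    parse_tag(tag)  # reuse the shape check
    return tag


print(format_tag("eng", "zaf"))
print(parse_tag("eng-GBR"))
```

Storing the language and country as separate fields makes it straightforward to group training data by language alone or by language and region, depending on the task.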
The two key benefits of consistently applying these ISO standards throughout a system are the ability to accurately identify candidates with the correct language skills for every task, and the ability to refer consistently to the same language across organizations and applications.