In a sea of languages, it's easy to forget just how challenging mastering even one language can be. Today we are going to dive into one of the biggest challenges in translation: pluralization.
We have discussed pluralization on our blog before, and this is the perfect time to revisit what it's all about.
What is pluralization?
Pluralization (p11n) is the tricky process of changing nouns from singular to plural form, for example from one apple to two apples. Now, you might think: you just add a suffix, right? Well, yes and no.
Firstly, not all languages treat the number of nouns in the same way. For example, Chinese and Japanese use a single form regardless of number. German, English, and Spanish have two forms. Some Slavic languages, on the other hand, have three or more forms. And then there are languages like Arabic and Welsh that have six.
Secondly, there is a difference between plural rules and plural forms.
Plural rules tell you how to "get" the plural form. Sometimes that means adding a simple suffix like "s", which is how English goes from "hat" to "hats". Fairly easy.
But what do we do when things get trickier? What if there are different categories of suffixes? What do we do with irregular forms?
For example, if we wanted to use the noun dog in plural and singular form in English, it would look something like this:
My sister has a dog (sing.). I love dogs (plural). My brother is afraid of dogs (plural).
But in Serbian, grammatical case also influences the form of the plural. Cases change the form of words based on their function in the sentence (declension).
Moja sestra ima psa (singular, accusative). Ja volim pse (plural, accusative). Moj brat se boji pasa (plural, genitive).
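To make the point concrete, here is a minimal sketch of how the forms from the sentences above could be stored. It assumes a plain lookup keyed by number and case; the names DOG_FORMS and dog_form are made up for illustration, and only the three forms quoted above are included.

```python
# Forms of the Serbian noun for "dog", taken from the example sentences above.
# Keys are (number, case); a real lexicon would cover every case, not just these.
DOG_FORMS = {
    ("singular", "accusative"): "psa",
    ("plural", "accusative"): "pse",
    ("plural", "genitive"): "pasa",
}


def dog_form(number: str, case: str) -> str:
    """Return the stored form, or raise if this sketch does not cover it."""
    try:
        return DOG_FORMS[(number, case)]
    except KeyError:
        raise LookupError(f"no form recorded for {number} {case}") from None


print(dog_form("plural", "accusative"))
print(dog_form("plural", "genitive"))
```

The takeaway is that a single "plural" slot is not enough here: the same noun needs a different plural form depending on the case.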
Dealing with plural forms
Proper pluralization can be a big task to take on, and it might even be hard to know where to start. In his paper “An Algorithmic Approach to English Pluralization”, Damian Conway counts four general ways to deal with plurals in a text, but we opted for two main groups.
1. Evasive techniques
Evasive techniques for handling plurals in a text are cheap and don't take a lot of time. Because of that, these techniques are a common choice.
Basically, with evasive techniques you have a structured, fixed part of the text that is not directly influenced by the number of the noun.
However, whether you are ignoring the problem or avoiding it, the odds are that you end up with a rather inelegant product.
It might look something like this:
There are 0 tickets available.
OR
Number of ticket(s) available: 1
Number of ticket(s) available: 11
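In code, the evasive approach usually boils down to a single template that never changes with the number. The following is a small sketch of that idea; the function name evasive_ticket_message is hypothetical.

```python
def evasive_ticket_message(count: int) -> str:
    """Keep the noun out of the counted phrase so one template fits every number."""
    return f"Number of ticket(s) available: {count}"


# The same fixed text is reused whatever the count is.
for n in (0, 1, 11):
    print(evasive_ticket_message(n))
```

It is cheap to write and to translate, but the "(s)" makes it obvious that the problem was dodged rather than solved.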
2. Blood-sweat-and-tears techniques
The programmer could provide both the singular and the plural form, which the system would recognize and apply. Alternatively, a pluralizing algorithm could automatically add an appropriate suffix at the end of the word.
How difficult this part is depends on the languages you are working with. Take English as an example. While automatically adding "s" works for many nouns, that still leaves us with nouns that take other suffixes (like "es") and those with irregular forms.
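As a rough illustration of the algorithmic route for English, here is a deliberately naive sketch. The irregular list and the suffix rules are illustrative assumptions, not a complete description of English, and the names pluralize_en and count_phrase are made up for this example.

```python
# Irregular nouns have to be listed explicitly; this list is illustrative only.
IRREGULAR = {"man": "men", "child": "children", "mouse": "mice", "sheep": "sheep"}

VOWELS = "aeiou"


def pluralize_en(noun: str) -> str:
    """Naive English pluralizer: irregular lookup first, then suffix rules."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # Sibilant endings take "es": box -> boxes, dish -> dishes.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + "y" becomes "ies": city -> cities (but day -> days).
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Default rule: just add "s".
    return noun + "s"


def count_phrase(count: int, noun: str) -> str:
    """Use the singular for exactly 1 and the plural otherwise, as English does."""
    return f"{count} {noun if count == 1 else pluralize_en(noun)}"


for n in (0, 1, 11):
    print(count_phrase(n, "ticket"))
```

Even this short version needs a lookup table and several special cases, and it still gets plenty of words wrong (for example "quiz" or "leaf"), which is why the approach earns its name.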
Solving the p11n puzzle
International Components for Unicode (ICU)
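ICU provides MessageFormat for the specific task of localization. With it, you can define all the plural forms a message needs and let the plural rules pick the right one.
First, you write the fixed part of the message and then add variable elements ({placeholders}), which our Lingohubbers are already familiar with. To choose the right form for each number, MessageFormat needs to know the plural rules of the language, and this is where CLDR comes in handy.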
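The idea of a fixed message with per-category variants can be sketched in a few lines. This is not the ICU library, only a stand-in that mimics the concept for English; english_category, TICKET_MESSAGE and format_plural are hypothetical names, and "#" stands for the number, as it does in ICU plural messages.

```python
from typing import Mapping


def english_category(count: int) -> str:
    """Plural category for English integers: 'one' for 1, 'other' for the rest."""
    return "one" if count == 1 else "other"


# Fixed text plus a number placeholder, with one variant per plural category.
# In ICU MessageFormat syntax this would be written roughly as:
#   {count, plural, one {There is # ticket available.}
#                   other {There are # tickets available.}}
TICKET_MESSAGE = {
    "one": "There is # ticket available.",
    "other": "There are # tickets available.",
}


def format_plural(variants: Mapping[str, str], count: int) -> str:
    """Pick the variant for the count's category and replace '#' with the number."""
    category = english_category(count)
    # 'other' is the required fallback variant, as in ICU plural messages.
    template = variants.get(category, variants["other"])
    return template.replace("#", str(count))


for n in (0, 1, 11):
    print(format_plural(TICKET_MESSAGE, n))
```

The translator supplies the variants, while the choice between them is driven by the language's plural rules rather than by hand-written conditions scattered through the code.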
The CLDR, or Common Locale Data Repository, is a useful project by the Unicode Consortium.
Apple, Google, IBM, Microsoft, Amazon, Babel, jQuery, Mozilla, Oracle, Twitter, Yahoo and Wikimedia Foundation (Wikipedia) are just some of the companies and projects that use CLDR.
This repository holds a huge amount of data that tech giants use to handle internationalization more efficiently. CLDR covers number, date, and time formatting, as well as plural forms and rules for many languages. CLDR defines up to six different plural categories, although most languages do not need all of them. The categories are:
- Zero
- One
- Two
- Few
- Many
- Other
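To show how a number ends up in one of these categories, here is a small sketch of the integer rules for a handful of locales, written from the CLDR cardinal rules as we understand them. It ignores fractions entirely, covers only the locales listed, and plural_category is a made-up name, not a library function.

```python
def plural_category(locale: str, n: int) -> str:
    """Return the plural category for a non-negative integer n.

    Simplified: integers only, and only a few locales are covered.
    """
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if locale in ("zh", "ja", "ko"):
        # A single form for every number.
        return "other"
    if locale == "en":
        return "one" if n == 1 else "other"
    if locale == "sr":
        # 1, 21, 31... but not 11.
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        # 2-4, 22-24... but not 12-14.
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "other"
    if locale == "ar":
        # Arabic uses all six categories.
        if n == 0:
            return "zero"
        if n == 1:
            return "one"
        if n == 2:
            return "two"
        if 3 <= n % 100 <= 10:
            return "few"
        if 11 <= n % 100 <= 99:
            return "many"
        return "other"
    raise ValueError(f"locale not covered by this sketch: {locale}")


for number in (0, 1, 2, 5, 11, 21, 100):
    print(number, plural_category("sr", number), plural_category("ar", number))
```

Note how the Serbian rule looks at the last digits rather than the size of the number, so 21 falls into "one" while 11 falls into "other".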
CLDR Overview
Here’s how CLDR assigns cardinal numbers to plural categories for some of the most frequently used languages (a dash means the language does not use that category):
Language | Zero | One | Two | Few | Many | Other |
English | - | 1 | - | - | - | 0, >1 |
German | - | 1 | - | - | - | 0, >1 |
Dutch | - | 1 | - | - | - | 0, >1 |
Danish | - | 1 | - | - | - | 0, >1 |
Swedish | - | 1 | - | - | - | 0, >1 |
Norwegian | - | 1 | - | - | - | 0, >1 |
Italian | - | 1 | - | - | - | 0, >1 |
Portuguese | - | 1 | - | - | - | 0, >1 |
Spanish | - | 1 | - | - | - | 0, >1 |
French | - | 0, 1 | - | - | - | >1 |
Romanian | - | 1 | - | 0, 2-19, 101-119... | - | 20-100, 120-200... |
Russian | - | 1, 21, 191, 1001... | - | 2-4, 22-24, 52-54, 102-104... | 0, 5-19, 25-29, ... | 0.0-1.5, ... |
Polish | - | 1 | - | 2-4, 22-24, 52-54, 102-104... | 0, 5-21, 25-31, ... | 0.0-1.5, ... |
Czech | - | 1 | - | 2-4 | 0.0-1.5, ... | 0, 5-19, 100, 1000... |
Ukrainian | - | 1, 21, 191, 1001... | - | 2-4, 22-24, 52-54, 102-104... | 0, 5-19, 25-29, 100, 1000... | 0.0-1.5, ... |
Latvian | 0, 10-20, 30, 100... | 1, 21, 191, 1001... | - | - | - | 2-9, 22-29, 102, 1002... |
Lithuanian | - | 1, 21, 191, 1001... | - | 2-9, 22-29, 202-209, ... | 0.1-0.9, ... | 0, 10, 100, 1000, ... |
Estonian | - | 1 | - | - | - | 0, >1 |
Hungarian | - | 1 | - | - | - | 0, >1 |
Serbian | - | 1, 21, 191, 1001... | - | 2-4, 22-24, 52-54, 102-104... | - | 0, 5-19, 25-29, ... |
Slovenian | - | 1, 101, 201, 1001... | 2, 102, 202, ... | 3, 4, 103, 104, ... | - | 0, 5-19, 100, 1000... |
Bosnian | - | 1, 21, 191, 1001... | - | 2-4, 22-24, 52-54, 102-104... | - | 0, 5-19, 25-29, ... |
Croatian | - | 1, 21, 191, 1001... | - | 2-4, 22-24, 52-54, 102-104... | - | 0, 5-19, 25-29, ... |
Bulgarian | - | 1 | - | - | - | 0, >1 |
Macedonian | - | 1, 21, 191, 1001... | - | - | - | 0, 2-9, ... |
Albanian | - | 1 | - | - | - | 0, >1 |
Greek | - | 1 | - | - | - | 0, >1 |
Turkish | - | 1 | - | - | - | 0, >1 |
Irish | - | 1 | 2 | 3-6 | 7-10 | 0, 11-25, ... |
Hebrew | - | 1 | 2 | - | 20, 30, 100... | 0, 3-17, 101... |
Hindi | - | 0, 1 | - | - | - | >1 |
Mandarin Chinese | - | - | - | - | - | all numbers |
Japanese | - | - | - | - | - | all numbers |
Korean | - | - | - | - | - | all numbers |
The full CLDR chart of language plural rules is available here. Remember, combining CLDR with MessageFormat makes p11n much easier. But what else can we do to make p11n easier and better?
Tips and tricks
Any type of machine translation depends on the quality of the source text. To make sure your text is up to par, here are a couple of tips to make the whole p11n process easier.
What to avoid:
- Long sentences, which make the meaning harder to keep clear
- Overly short sentences, because a translation needs enough information to work with
- Bad text segmentation, which interferes with the processing of the language
- Pronouns, because they can be too ambiguous in certain languages
Revision - the mother of all good translation
You’ve done the work: the text is well written, and it’s machine translation-friendly. Now is the right time to plan a revision by skilled native speakers. They will catch all the nuances and tricks of the language, in singular and plural. And we at Lingohub have just the right people for the job. Order professional proofreading in our translation management system (TMS) to ensure perfect localization.