Marina Katić Development

Pluralization (p11n) – the many of plurals

In the sea of many languages, it’s sometimes easy to forget just how challenging mastering one language can be. Today we are going to dive into one of the biggest challenges when it comes to translation – pluralization. We discussed pluralization before on our blog, and this is the perfect time to remember what it’s all about. 

What is pluralization?

Pluralization (p11n) is a tricky process of changing nouns from singular to plural form – from one apple to two apples. Now, you might think – well, you just add a suffix, right? Well, yes and no.
 
Firstly, not all languages treat the number of nouns in the same way. For example, Chinese and Japanese use a single form regardless of number. German, English, and Spanish have two forms. Some Slavic languages, on the other hand, have three or more forms. And then there are languages like Arabic and Welsh that have six.
 
Secondly, there is a difference between plural rules and plural forms. Plural rules tell you how to “get” the plural form. Sometimes that means adding a simple suffix like “s”, which is how English goes from “hat” to “hats”. Fairly easy. But what do we do when things get trickier? What if there are different categories of suffixes? What do we do with irregular forms?

For example, if we wanted to use the noun dog in plural and singular form in English, it would look something like this:

My sister has a dog (singular). I love dogs (plural). My brother is afraid of dogs (plural).

But in Serbian, cases also influence the form of the plural. Cases change the form of a word based on its function in the sentence (declension).

Moja sestra ima psa (singular, accusative). Ja volim pse (plural, accusative). Moj brat se boji pasa (plural, genitive).

Dealing with plural forms

Proper pluralization can be a big task to take on, and it might even be hard to know where to start. In his paper “An Algorithmic Approach to English Pluralization”, Damian Conway counts four general ways to deal with plurals in a text, but we opted for two main groups.

1. Evasive techniques

Evasive techniques for handling plurals in a text are cheap and quick, which makes them a common choice. Basically, you write a fixed, structured piece of text that is not directly influenced by the number of the noun.
However, it doesn’t matter whether you are ignoring the problem or avoiding it: odds are you will end up with a rather inelegant product. It might look something like this:
There are 0 tickets available.
OR
Number of ticket(s) available: 1
Number of ticket(s) available: 11
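
To make the two evasive styles concrete, here is a minimal Python sketch. The function names are made up for illustration; the point is only that neither function ever looks at the count when choosing its wording.

```python
def tickets_ignoring(count: int) -> str:
    """Ignore the problem: always use the plural form."""
    return f"There are {count} tickets available."


def tickets_avoiding(count: int) -> str:
    """Avoid the problem: the fixed label never has to agree with the number."""
    return f"Number of ticket(s) available: {count}"


for n in (0, 1, 11):
    print(tickets_ignoring(n))
    print(tickets_avoiding(n))
```

With a count of 1, the first function produces the ungrammatical “There are 1 tickets available.”, which is exactly the inelegance described above.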

2. Blood-sweat-and-tears techniques

The programmer could provide both the singular and the plural form, which the system would then recognize and apply. Alternatively, a pluralizing algorithm could automatically add the appropriate suffix to the end of the word.
How difficult this part is depends on the languages you are working with. Take English: while it would be great to simply add “s” to every noun automatically, that still leaves many nouns that take other suffixes (like “es”) and those with irregular forms.
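
As a rough illustration of the algorithmic route for English, the sketch below combines a small lookup table of irregular nouns with a few suffix rules. The word list and the rules are deliberately incomplete assumptions, not a full implementation of Conway's algorithm, and the function names are hypothetical.

```python
# A tiny, incomplete table of irregular plurals (illustrative only).
IRREGULAR = {
    "child": "children",
    "man": "men",
    "mouse": "mice",
    "person": "people",
    "sheep": "sheep",
}


def pluralize_en(noun: str) -> str:
    """Return a plural form for a lowercase English noun using naive rules."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # Sibilant endings take "es": box -> boxes, dish -> dishes.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + "y" becomes "ies": city -> cities (but day -> days).
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def count_phrase(count: int, noun: str) -> str:
    """English uses the singular only for exactly 1."""
    word = noun if count == 1 else pluralize_en(noun)
    return f"{count} {word}"


print(count_phrase(1, "ticket"))
print(count_phrase(11, "ticket"))
print(count_phrase(3, "child"))
```

Even this short list of rules gets words like “hero” (heroes) or “leaf” (leaves) wrong, which is why the approach earns its blood-sweat-and-tears name.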

Solving the p11n puzzle 

International Components for Unicode (ICU) provides MessageFormat for exactly this localization task. With it, you can define all plural forms of a message and the rules for choosing between them. First you write the fixed part of the message, then you include variable elements, the {placeholders} our LingoHubbers are familiar with. This is where CLDR comes in handy.
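
For orientation, an ICU MessageFormat plural pattern looks like `{count, plural, one {# ticket} other {# tickets}}`, where `#` stands for the number. The Python sketch below only imitates that idea with plain dictionaries; it does not use the ICU library, and `format_plural`, `MESSAGES_EN`, and `plural_category_en` are hypothetical names chosen for this example.

```python
from typing import Callable, Dict


def plural_category_en(count: int) -> str:
    """English integers: 'one' for exactly 1, otherwise 'other'."""
    return "one" if count == 1 else "other"


# One message variant per plural category; '#' marks where the number goes.
MESSAGES_EN: Dict[str, str] = {
    "one": "There is # ticket available.",
    "other": "There are # tickets available.",
}


def format_plural(
    count: int,
    messages: Dict[str, str],
    category_of: Callable[[int], str],
) -> str:
    """Pick the variant for the count's category, falling back to 'other'."""
    template = messages.get(category_of(count), messages["other"])
    return template.replace("#", str(count))


for n in (0, 1, 11):
    print(format_plural(n, MESSAGES_EN, plural_category_en))
```

The translator supplies the message variants, while the rule that maps a number to a category comes from the language itself, which is the part CLDR provides.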
 
The CLDR, or Common Locale Data Repository, is a project of the Unicode Consortium. Apple, Google, IBM, Microsoft, Amazon, Babel, jQuery, Mozilla, Oracle, Twitter, Yahoo and the Wikimedia Foundation (Wikipedia) are just some of the companies and projects that use CLDR. The repository holds a huge amount of data that tech giants use to handle internationalization more efficiently. It covers number, date, and time formatting, as well as plural forms and rules for many languages. CLDR defines up to 6 different plural categories, although most languages do not need all of them. The categories are:
 
  • Zero
  • One
  • Two
  • Few
  • Many
  • Other

CLDR Overview

Here is how CLDR assigns cardinal numbers to plural categories for some of the most frequently used languages (details can differ between CLDR versions):

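| Language | Zero | One | Two | Few | Many | Other |
| --- | --- | --- | --- | --- | --- | --- |
| English | | 1 | | | | 0, >1 |
| German | | 1 | | | | 0, >1 |
| Dutch | | 1 | | | | 0, >1 |
| Danish | | 1 | | | | 0, >1 |
| Swedish | | 1 | | | | 0, >1 |
| Norwegian | | 1 | | | | 0, >1 |
| Italian | | 1 | | | | 0, >1 |
| Portuguese | | 1 | | | | 0, >1 |
| Spanish | | 1 | | | | 0, >1 |
| French | | 0, 1 | | | | >1 |
| Romanian | | 1 | | 0, 2-19, 101-119… | | 20-100, 120-200… |
| Russian | | 1, 21, 101, 1001… | | 2-4, 22-24, 52-54, 102-104… | 0, 5-20, 25-30, … | decimals (0.5, 1.5, …) |
| Polish | | 1 | | 2-4, 22-24, 52-54, 102-104… | 0, 5-21, 25-31, … | decimals (0.5, 1.5, …) |
| Czech | | 1 | | 2-4 | decimals (0.5, 1.5, …) | 0, >4 |
| Ukrainian | | 1, 21, 101, 1001… | | 2-4, 22-24, 52-54, 102-104… | 0, 5-20, 25-30, 100, 1000… | decimals (0.5, 1.5, …) |
| Latvian | 0, 10-20, 30, 100… | 1, 21, 101, 1001… | | | | 2-9, 22-29, 102, 1002… |
| Lithuanian | | 1, 21, 101, 1001… | | 2-9, 22-29, 202-209… | decimals (0.1-0.9, …) | 0, 10-20, 30, 100, 1000… |
| Estonian | | 1 | | | | 0, >1 |
| Hungarian | | 1 | | | | 0, >1 |
| Serbian | | 1, 21, 101, 1001… | | 2-4, 22-24, 52-54, 102-104… | | 0, 5-20, 25-30, … |
| Slovenian | | 1, 101, 201, 1001… | 2, 102, 202, … | 3, 4, 103, 104, … | | 0, 5-100, 105-200, … |
| Bosnian | | 1, 21, 101, 1001… | | 2-4, 22-24, 52-54, 102-104… | | 0, 5-20, 25-30, … |
| Croatian | | 1, 21, 101, 1001… | | 2-4, 22-24, 52-54, 102-104… | | 0, 5-20, 25-30, … |
| Bulgarian | | 1 | | | | 0, >1 |
| Macedonian | | 1, 21, 101, 1001… | | | | 0, 2-20, 22-30, … |
| Albanian | | 1 | | | | 0, >1 |
| Greek | | 1 | | | | 0, >1 |
| Turkish | | 1 | | | | 0, >1 |
| Irish | | 1 | 2 | 3-6 | 7-10 | 0, 11-25, … |
| Hebrew | | 1 | 2 | | 20, 30, 100… | 0, 3-17, 101… |
| Hindi | | 0, 1 | | | | >1 |
| Mandarin Chinese | | | | | | all numbers |
| Japanese | | | | | | all numbers |
| Korean | | | | | | all numbers |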
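
Behind each row of the table sits a small rule. The sketch below hand-codes the integer rules for Serbian and Irish as an illustration; it assumes non-negative whole numbers only and is not a substitute for the real CLDR data. The function names and the `DOG_SR` table are invented for this example.

```python
def category_sr(n: int) -> str:
    """Serbian plural category for a non-negative integer."""
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"    # 1, 21, 101 ... but not 11
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"    # 2-4, 22-24 ... but not 12-14
    return "other"      # 0, 5-20, 25-30 ...


def category_ga(n: int) -> str:
    """Irish plural category for a non-negative integer."""
    if n == 1:
        return "one"
    if n == 2:
        return "two"
    if 3 <= n <= 6:
        return "few"
    if 7 <= n <= 10:
        return "many"
    return "other"


# Forms of "pas" (dog) as used after a number in Serbian.
DOG_SR = {"one": "pas", "few": "psa", "other": "pasa"}


def count_dogs_sr(n: int) -> str:
    return f"{n} {DOG_SR[category_sr(n)]}"


for n in (1, 2, 5, 11, 21, 22):
    print(count_dogs_sr(n))
```

Note that 21 takes the same form as 1 and 22 the same form as 2, while 11 falls into “other”. That is precisely the kind of detail that is easy to get wrong without CLDR.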
 
The full CLDR chart of plural language rules is available here. Remember, combining CLDR and MessageFormat makes p11n much easier. But what else can we do to make p11n easier and better?

Tips and tricks

Any type of machine translation depends heavily on the quality of the source text. To make sure your text is up to par, here are a couple of tips to make the whole p11n process easier. What to avoid:
 
  • Long sentences, so that the meaning stays clear
  • Overly short sentences, because a translation needs actual information to work with
  • Bad text segmentation, because it interferes with the processing of the language
  • Pronouns, because they can be too ambiguous in certain languages

Revision – the mother of all good translation

You’ve done the work, the text was written well and it’s machine translation friendly. Now is the right time to plan a revision by skilled native speakers. They will catch all the nuances and tricks of the language – in singular and plural. And we at LingoHub have just the right people for the job.