Hi,
I did some preliminary work with the gettext msgcat command, which merges two .po files into a third one. Identical strings are merged automatically in the new file, while strings that differ are both kept and marked with a #, fuzzy tag to flag the conflict. I initially ran the tool on three files: admin.po, submission.po, and default.po, which together comprise 166 strings. I then went through every fuzzy string manually to check the actual differences. I identified three causes:
- inclusive writing,
- incorrect translation (ranging anywhere from a spacing difference to a flat-out invalid translation), and
- diverging translations (i.e. both strings are correct, but dissimilar).
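For anyone who wants to reproduce the merge-and-count step, here is a minimal Python sketch. It assumes gettext is installed so that msgcat is on the PATH. The function names (merge_po, count_entries), the message keys, and the sample text are made up for illustration; the sample is hand-written and is not actual msgcat output.

```python
import re
import subprocess

def merge_po(first, second, merged):
    """Merge two .po files into `merged` using msgcat (hypothetical helper)."""
    subprocess.run(["msgcat", "-o", merged, first, second], check=True)

def count_entries(po_text):
    """Return (total, fuzzy) counts of translatable entries in PO text."""
    total = fuzzy = 0
    # PO entries are separated by blank lines.
    for block in re.split(r"\n\s*\n", po_text.strip()):
        lines = block.splitlines()
        # Collect the msgid, which may continue over several quoted lines.
        msgid_parts = []
        in_msgid = False
        for line in lines:
            if line.startswith("msgid "):
                in_msgid = True
                msgid_parts.append(line[len("msgid "):].strip())
            elif in_msgid and line.startswith('"'):
                msgid_parts.append(line.strip())
            else:
                in_msgid = False
        if not msgid_parts:
            continue  # comment-only or obsolete (#~) block
        msgid = "".join(part[1:-1] for part in msgid_parts)
        if msgid == "":
            continue  # the header entry has an empty msgid
        total += 1
        # Flags are stored on comment lines that start with "#,".
        if any(l.startswith("#,") and "fuzzy" in l for l in lines):
            fuzzy += 1
    return total, fuzzy

# Hand-written sample with made-up keys: one clean entry, one fuzzy entry.
sample = r'''msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "common.save"
msgstr "Enregistrer"

#, fuzzy
msgid "common.read"
msgstr "Lire"
'''

total, fuzzy = count_entries(sample)
print(f"{total} entries, {fuzzy} fuzzy")
```

The counting only tells you how many strings need review; sorting the fuzzy ones by cause still has to be done by hand.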
Here are the numbers for each of those three files:
admin.po (55 strings):
- Identical: 25 strings (45%)
- Fuzzy: 30 strings (55%), distributed as follows:
  - inclusive writing: 14 strings (47%)
  - incorrect translation: 7 strings (23%)
  - diverging translation: 9 strings (30%)
submission.po (69 strings):
- Identical: 35 strings (51%)
- Fuzzy: 34 strings (49%), distributed as follows:
  - inclusive writing: 8 strings (24%)
  - incorrect translation: 19 strings (56%)
  - diverging translation: 7 strings (20%)
default.po (42 strings):
- Identical: 20 strings (48%)
- Fuzzy: 22 strings (52%), distributed as follows:
  - inclusive writing: 11 strings (50%)
  - incorrect translation: 9 strings (40%)
  - diverging translation: 2 strings (10%)
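To re-derive totals from the per-file counts above, a small Python sketch like the following can help. The dictionary layout and the pooled function are just one illustrative way to organise the numbers. Note that pooling the raw counts across files is not the same as averaging the per-file percentages, so the two approaches may give slightly different figures.

```python
# Per-file counts copied from the lists above.
counts = {
    "admin.po": {"identical": 25, "inclusive": 14, "incorrect": 7, "diverging": 9},
    "submission.po": {"identical": 35, "inclusive": 8, "incorrect": 19, "diverging": 7},
    "default.po": {"identical": 20, "inclusive": 11, "incorrect": 9, "diverging": 2},
}
CAUSES = ("inclusive", "incorrect", "diverging")

def pooled(counts):
    """Sum counts over all files; return (identical, fuzzy, per-cause totals)."""
    identical = sum(f["identical"] for f in counts.values())
    by_cause = {c: sum(f[c] for f in counts.values()) for c in CAUSES}
    fuzzy = sum(by_cause.values())
    return identical, fuzzy, by_cause

identical, fuzzy, by_cause = pooled(counts)
total = identical + fuzzy
print(f"identical: {identical}/{total} ({identical / total:.0%})")
for cause, n in by_cause.items():
    # Share of each cause among the fuzzy strings, pooled over the three files.
    print(f"{cause}: {n}/{fuzzy} fuzzy ({n / fuzzy:.0%})")
```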
Overall, about half of all strings in these files are identical and merge automatically. I ran the command on a bigger file to confirm this (author.po, 103 strings) and got similar results: 54 identical strings (52%), 49 fuzzy (48%).
Incorrect strings account for an average of 40% of all fuzzy strings. Many of them involve trivial spacing issues, a few typos, or erroneous translations (a wrong translation of the word “galley” in FR_Fr accounts for almost half of all incorrect strings in submission.po). In all cases but one, the error is limited to one of the two strings and the other is correct, meaning the pair should become identical once corrected. The only exception is admin.languages.supportedLocalesInstruction in admin.po (FR_Fr does not translate it correctly and FR_CA contains a typo).
Inclusive writing is another major source of divergence (37% of all fuzzy strings on average), but the main takeaway from the post dedicated to this issue in the translation section is that both locales can find common ground on this.
Actual diverging translations are quite limited both in number (23% of all fuzzy strings) and in substance. Most are trivial and limited to a single word (e.g. all diverging strings in admin.po boil down to these: ceci/cela, chemin/chemin d’accès, OJS/de OJS, PKP/de PKP, lire/voir, informations/renseignements). I identified what I believe to be one regionalism: intéressé à soumettre (CA)/intéressé pour soumettre (Fr), and I spotted one or two more while translating the email component, which is not included here. The few remaining strings are ones that, in my opinion, are better translated in one locale or the other, but both versions are correct and do not contain any Canadian regionalism (I don’t think they contain any French regionalism either, but a Canadian colleague would be better placed to confirm this).
All of this seems to indicate that FR_CA and FR_Fr are substantially “mergeable” since virtually all differences have little to do with linguistics. However, the workload would be substantial. As things stand (and assuming those stats accurately represent the whole picture), merging would require a review of half of all strings. Even after eliminating all errors and inclusive writing issues, we would still need to go through 20% of all strings.
I have uploaded the raw concatenated files to GitHub if anyone wants to check them out.
Paul