# This document contains text in Perl "POD" format.
# Use a POD viewer like perldoc or perlman to render it.

# This corrects some typos in the previous release.

=head1 NAME

Locale::Maketext::TPJ13 -- article about software localization

=head1 SYNOPSIS

  # This is an article, not a module.

=head1 DESCRIPTION

The following article by Sean M. Burke and Jordan Lachler
first appeared in I<The Perl Journal> #13 and is copyright 1999
The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal. This document may be
distributed under the same terms as Perl itself.

=head1 Localization and Perl: gettext breaks, Maketext fixes

by Sean M. Burke and Jordan Lachler

This article points out cases where gettext (a common system for
localizing software interfaces -- i.e., making them work in the user's
language of choice) fails because of basic differences between human
languages. This article then describes Maketext, a new system capable
of correctly treating these differences.

=head2 A Localization Horror Story: It Could Happen To You

=over

"There are a number of languages spoken by human beings in this
world."

-- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
Identification of Languages"

=back

Imagine that your task for the day is to localize a piece of software
-- and luckily for you, the only output the program emits is two
messages, like this:

  I scanned 12 directories.

  Your query matched 10 files in 4 directories.

So how hard could that be? You look at the code that
produces the first item, and it reads:

  printf("I scanned %g directories.",
         $directory_count);

You think about that, and realize that it doesn't even work right for
English, as it can produce this output:

  I scanned 1 directories.

So you rewrite it to read:

  printf("I scanned %g %s.",
         $directory_count,
         $directory_count == 1 ?
           "directory" : "directories",
  );

...which does the Right Thing. (In case you don't recall, "%g" is for
locale-specific number interpolation, and "%s" is for string
interpolation.)

But you still have to localize it for all the languages you're
producing this software for, so you pull Locale::gettext off of CPAN
so you can access the C<gettext> C functions you've heard are standard
for localization tasks.

And you write:

  printf(gettext("I scanned %g %s."),
         $dir_scan_count,
         $dir_scan_count == 1 ?
           gettext("directory") : gettext("directories"),
  );

But you then read in the gettext manual (Drepper, Miller, and Pinard 1995)
that this is not a good idea, since how a single word like "directory"
or "directories" is translated may depend on context -- and this is
true, since in a case language like German or Russian, you may need
these words with a different case ending in the first instance (where the
word is the object of a verb) than in the second instance, which you
haven't even gotten to yet (where the word is the object of a
preposition, "in %g directories") -- assuming these keep the same
syntax when translated into those languages.

So, on the advice of the gettext manual, you rewrite:

  printf( $dir_scan_count == 1 ?
           gettext("I scanned %g directory.") :
           gettext("I scanned %g directories."),
         $dir_scan_count );

So, you email your various translators (the boss decides that the
languages du jour are Chinese, Arabic, Russian, and Italian, so you
have one translator for each), asking for translations for "I scanned
%g directory." and "I scanned %g directories.".
When they reply,
you'll put them in the lexicons for gettext to use when it localizes
your software, so that when the user is running under the "zh"
(Chinese) locale, gettext("I scanned %g directory.") will return the
appropriate Chinese text, with a "%g" in there where printf can then
interpolate $dir_scan_count.

Your Chinese translator emails right back -- he says both of these
phrases translate to the same thing in Chinese, because, in linguistic
jargon, Chinese "doesn't have number as a grammatical category" --
whereas English does. That is, English has grammatical rules that
refer to "number", i.e., whether something is grammatically singular
or plural; and one of these rules is the one that forces nouns to take
a plural suffix (generally "s") when in a plural context, as they are when
they follow a number other than "one" (including, oddly enough, "zero").
Chinese has no such rules, and so has just the one phrase where English
has two. But, no problem, you can have this one Chinese phrase appear
as the translation for the two English phrases in the "zh" gettext
lexicon for your program.

Emboldened by this, you dive into the second phrase that your software
needs to output: "Your query matched 10 files in 4 directories.". You notice
that if you want to treat phrases as indivisible, as the gettext
manual wisely advises, you need four cases now, instead of two, to
cover the permutations of singular and plural on the two items,
$directory_count and $file_count. So you try this:

  printf( $file_count == 1 ?
    ( $directory_count == 1 ?
     gettext("Your query matched %g file in %g directory.") :
     gettext("Your query matched %g file in %g directories.") ) :
    ( $directory_count == 1 ?
     gettext("Your query matched %g files in %g directory.") :
     gettext("Your query matched %g files in %g directories.") ),
   $file_count, $directory_count,
  );

(The case of "1 file in 2 [or more] directories" could, I suppose,
occur in the case of symlinking or something of the sort.)

It occurs to you that this is not the prettiest code you've ever
written, but this seems the way to go. You mail off to the
translators asking for translations for these four cases. The
Chinese translator replies with the one phrase that these all translate
to in Chinese, and that phrase has two "%g"s in it, as it should -- but
there's a problem. He translates it word-for-word back: "In %g
directories contains %g files match your query." The %g
slots are in the reverse of their order in English. You wonder
how you'll get gettext to handle that.

But you put it aside for the moment, and optimistically hope that the
other translators won't have this problem, and that their languages
will be better behaved -- i.e., that they will be just like English.

But the Arabic translator is the next to write back. First off, your
code for "I scanned %g directory." or "I scanned %g directories."
assumes there's only singular or plural. But, to use linguistic
jargon again, Arabic has grammatical number, like English (but unlike
Chinese), but it's a three-term category: singular, dual, and plural.
In other words, the way you say "directory" depends on whether there's
one directory, or I<two> of them, or I<more than two> of them. Your
test of C<($directory_count == 1)> no longer does the job.
And it means
that where English's grammatical category of number necessitates
only the two permutations of the first sentence based on "directory
[singular]" and "directories [plural]", Arabic has three -- and,
worse, in the second sentence ("Your query matched %g file in %g
directory."), where English has four, Arabic has nine. You sense
an unwelcome, exponential trend taking shape.

Your Italian translator emails you back and says that "I scanned 0
directories" (a possible English output of your program) is stilted,
and if you think that's fine English, that's your problem, but that
I<just will not do> in the language of Dante. He insists that where
$directory_count is 0, your program should produce the Italian text
for "I I<didn't> scan I<any> directories.". And ditto for "I didn't
match any files in any directories", although he says the last part
about "in any directories" should probably just be left off.

You wonder how you'll get gettext to handle this; to accommodate the
ways Arabic, Chinese, and Italian deal with numbers in just these few
very simple phrases, you need to write code that will ask gettext for
different queries depending on whether the numerical values in
question are 1, 2, more than 2, or in some cases 0, and you still haven't
figured out the problem with the different word order in Chinese.

Then your Russian translator calls on the phone, to I<personally> tell
you the bad news about how really unpleasant your life is about to
become:

Russian, like German or Latin, is an inflectional language; that is, nouns
and adjectives have to take endings that depend on their case
(i.e., nominative, accusative, genitive, etc.)
-- which is roughly a matter of
what role they have in the syntax of the sentence --
as well as on the grammatical gender (i.e., masculine, feminine, neuter)
and number (i.e., singular or plural) of the noun, as well as on the
declension class of the noun. But unlike with most other inflected languages,
putting a number-phrase (like "ten" or "forty-three", or their Arabic
numeral equivalents) in front of a noun in Russian can change the case and
number that noun is, and therefore the endings you have to put on it.

He elaborates: In "I scanned %g directories", you'd I<expect>
"directories" to be in the accusative case (since it is the direct
object in the sentence) and the plural number,
except where $directory_count is 1, then you'd expect the singular, of
course. Just like Latin or German. I<But!> Where $directory_count %
10 is 1 ("%" for modulo, remember), assuming $directory_count is an
integer, and except where $directory_count % 100 is 11, "directories"
is forced to become grammatically singular, which means it gets the
ending for the accusative singular... You begin to visualize the code
it'd take to test for the problem so far, I<and still work for Chinese
and Arabic and Italian>, and how many gettext items that'd take, but
he keeps going... But where $directory_count % 10 is 2, 3, or 4
(except where $directory_count % 100 is 12, 13, or 14), the word for
"directories" is forced to be genitive singular -- which means another
ending... The room begins to spin around you, slowly at first... But
with I<all other> integer values, since "directory" is an inanimate
noun, when preceded by a number and in the nominative or accusative
cases (as it is here, just your luck!), it does stay plural, but it is
forced into the genitive case -- yet another ending...
And
you never hear him get to the part about how you're going to run into
similar (but maybe subtly different) problems with other Slavic
languages like Polish, because the floor comes up to meet you, and you
fade into unconsciousness.


The above cautionary tale relates how an attempt at localization can
lead from programmer consternation, to program obfuscation, to a need
for sedation. But careful evaluation shows that your choice of tools
merely needed further consideration.

=head2 The Linguistic View

=over

"It is more complicated than you think."

-- The Eighth Networking Truth, from RFC 1925

=back

The field of Linguistics has expended a great deal of effort over the
past century trying to find grammatical patterns which hold across
languages; it's been a constant process
of people making generalizations that should apply to all languages,
only to find out that, all too often, these generalizations fail --
sometimes failing for just a few languages, sometimes whole classes of
languages, and sometimes nearly every language in the world except
English. Broad statistical trends are evident in what the "average
language" is like as far as what its rules can look like, must look
like, and cannot look like. But the "average language" is just as
unreal a concept as the "average person" -- it runs up against the
fact that no language (or person) is, in fact, average. The wisdom of past
experience leads us to believe that any given language can do whatever
it wants, in any order, with appeal to any kind of grammatical
categories it wants -- case, number, tense, real or metaphoric
characteristics of the things that words refer to, arbitrary or
predictable classifications of words based on what endings or prefixes
they can take, degree or means of certainty about the truth of
statements expressed, and so on, ad infinitum.

Mercifully, most localization tasks are a matter of finding ways to
translate whole phrases, generally sentences, where the context is
relatively set, and where the only variation in content is I<usually>
in a number being expressed -- as in the example sentences above.
Translating specific, fully-formed sentences is, in practice, fairly
foolproof -- which is good, because that's what's in the phrasebooks
that so many tourists rely on. Now, a given phrase (whether in a
phrasebook or in a gettext lexicon) in one language I<might> have a
greater or lesser applicability than that phrase's translation into
another language -- for example, strictly speaking, in Arabic, the
"your" in "Your query matched..." would take a different form
depending on whether the user is male or female; so the Arabic
translation "your[feminine] query" is applicable in fewer cases than
the corresponding English phrase, which doesn't distinguish the user's
gender. (In practice, it's not feasible to have a program know the
user's gender, so the masculine "you" in Arabic is usually used, by
default.)

But in general, such surprises are rare when entire sentences are
being translated, especially when the functional context is restricted
to that of a computer interacting with a user either to convey a fact
or to prompt for a piece of information. So, for purposes of
localization, translation by phrase (generally by sentence) is both the
simplest and the least problematic.

=head2 Breaking gettext

=over

"It Has To Work."

-- First Networking Truth, RFC 1925

=back

Consider that sentences in a tourist phrasebook are of two types: ones
like "How do I get to the marketplace?"
that don't have any blanks to
fill in, and ones like "How much do these ___ cost?", where there's
one or more blanks to fill in (and these are usually linked to a
list of words that you can put in that blank: "fish", "potatoes",
"tomatoes", etc.) The ones with no blanks are no problem, but the
fill-in-the-blank ones may not be really straightforward. If it's a
Swahili phrasebook, for example, the authors probably didn't bother to
tell you the complicated ways that the verb "cost" changes its
inflectional prefix depending on the noun you're putting in the blank.
The trader in the marketplace will still understand what you're saying if
you say "how much do these potatoes cost?" with the wrong
inflectional prefix on "cost". After all, I<you> can't speak proper Swahili,
I<you're> just a tourist. But while tourists can be stupid, computers
are supposed to be smart; the computer should be able to fill in the
blank, and still have the results be grammatical.

In other words, a phrasebook entry takes some values as parameters
(the things that you fill in the blank or blanks), and provides a value
based on these parameters, where the way you get that final value from
the given values can, properly speaking, involve an arbitrarily
complex series of operations. (In the case of Chinese, it'd be not at
all complex, at least in cases like the examples at the beginning of
this article; whereas in the case of Russian it'd be a rather complex
series of operations. And in some languages, the
complexity could be spread around differently: while the act of
putting a number-expression in front of a noun phrase might not be
complex by itself, it may change how you have to, for example, inflect
a verb elsewhere in the sentence. This is what in syntax is called
"long-distance dependencies".)
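
The contrast between the Chinese and the Russian cases can be sketched
in Perl. This is only an illustration of the rules as the Russian
translator described them in the horror story, not code from any
module: the names C<russian_number_category>, C<scanned_simple> and
C<scanned_russian> are hypothetical, the English strings with tags like
"-ACC.SG" stand in for the real inflected words, and the count is
assumed to be a non-negative integer.

```perl
use strict;
use warnings;

# Hypothetical helper: which form a Russian number-phrase forces on an
# inanimate noun when the sentence wants nominative or accusative.
# Assumes $n is a non-negative integer.
sub russian_number_category {
    my ($n) = @_;
    my ($mod10, $mod100) = ($n % 10, $n % 100);
    return 'accusative singular' if $mod10 == 1 && $mod100 != 11;
    return 'genitive singular'
        if $mod10 >= 2 && $mod10 <= 4
        && !($mod100 >= 12 && $mod100 <= 14);
    return 'genitive plural';
}

# Phrase for a language without grammatical number (Chinese, say):
# the only operation is interpolating the count.
sub scanned_simple {
    my ($count) = @_;
    return "I scanned $count directory.";   # stand-in for the one phrase
}

# English stand-ins for the three Russian endings.
my %FORM = (
    'accusative singular' => 'directory-ACC.SG',
    'genitive singular'   => 'directory-GEN.SG',
    'genitive plural'     => 'directory-GEN.PL',
);

# Phrase for Russian: same parameter, far more work to get the value.
sub scanned_russian {
    my ($count) = @_;
    my $noun = $FORM{ russian_number_category($count) };
    return "I scanned $count $noun.";
}

print scanned_simple($_), "   ", scanned_russian($_), "\n"
    for 1, 3, 5, 11, 21, 112;
```

Both subs take the same parameter and return a finished sentence; the
only difference is how much work happens in between.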

This talk of parameters and arbitrary complexity is just another way
to say that an entry in a phrasebook is what in a programming language
would be called a "function". Just so you don't miss it, this is the
crux of this article: I<A phrase is a function; a phrasebook is a
bunch of functions.>

The reason that using gettext runs into walls (as in the above
second-person horror story) is that you're trying to use a string (or
worse, a choice among a bunch of strings) to do what you really need a
function for -- which is futile. Performing (s)printf interpolation
on the strings which you get back from gettext does allow you to do I<some>
common things passably well... sometimes... sort of; but, to paraphrase
what some people say about C<csh> script programming, "it fools you
into thinking you can use it for real things, but you can't, and you
don't discover this until you've already spent too much time trying,
and by then it's too late."

=head2 Replacing gettext

So, what needs to replace gettext is a system that supports lexicons
of functions instead of lexicons of strings. An entry in a lexicon
from such a system should I<not> look like this:

  "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

[\xE9 is e-acute in Latin-1. Some pod renderers would
scream if I used the actual character here. -- SB]

but instead like this, bearing in mind that this is just a first stab:

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = sprintf("%g %s", $files,
      $files == 1 ? 'fichier' : 'fichiers');
    $dirs = sprintf("%g %s", $dirs,
      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

Now, there's no particularly obvious way to store anything but strings
in a gettext lexicon; so it looks like we just have to start over and
make something better, from scratch.
I call my shot at a
gettext-replacement system "Maketext", or, in CPAN terms,
Locale::Maketext.

When designing Maketext, I chose to plan its main features in terms of
"buzzword compliance". And here are the buzzwords:

=head2 Buzzwords: Abstraction and Encapsulation

The complexity of the language you're trying to output a phrase in is
entirely abstracted inside (and encapsulated within) the Maketext module
for that interface. When you call:

  print $lang->maketext("You have [quant,_1,piece] of new mail.",
                       scalar(@messages));

you don't know (and in fact can't easily find out) whether this will
involve lots of figuring, as in Russian (if $lang is a handle to the
Russian module), or relatively little, as in Chinese. That kind of
abstraction and encapsulation may encourage other pleasant buzzwords
like modularization and stratification, depending on what design
decisions you make.

=head2 Buzzword: Isomorphism

"Isomorphism" means "having the same structure or form"; in discussions
of program design, the word takes on the special, specific meaning that
your implementation of a solution to a problem I<has the same
structure> as, say, an informal verbal description of the solution, or
maybe of the problem itself. Isomorphism is, all things considered,
a good thing -- it's what problem-solving (and solution-implementing)
should look like.

What's wrong with gettext-using code like this...

  printf( $file_count == 1 ?
    ( $directory_count == 1 ?
     "Your query matched %g file in %g directory." :
     "Your query matched %g file in %g directories." ) :
    ( $directory_count == 1 ?
     "Your query matched %g files in %g directory." :
     "Your query matched %g files in %g directories."
    ),
   $file_count, $directory_count,
  );

is first off that it's not well abstracted -- these ways of testing
for grammatical number (as in the expressions like C<foo == 1 ?
singular_form : plural_form>) should be abstracted to each language
module, since how you get grammatical number is language-specific.

But second off, it's not isomorphic -- the "solution" (i.e., the
phrasebook entries) for Chinese maps from these four English phrases to
the one Chinese phrase that fits for all of them. In other words, the
informal solution would be "The way to say what you want in Chinese is
with the one phrase 'For your question, in Y directories you would
find X files'" -- and so the implemented solution should be,
isomorphically, just a straightforward way to spit out that one
phrase, with numerals properly interpolated. It shouldn't have to map
from the complexity of other languages to the simplicity of this one.

=head2 Buzzword: Inheritance

There's a great deal of reuse possible for sharing of phrases between
modules for related dialects, or for sharing of auxiliary functions
between related languages. (By "auxiliary functions", I mean
functions that don't produce phrase-text, but which, say, return an
answer to "does this number require a plural noun after it?". Such
auxiliary functions would be used in the internal logic of functions
that actually do produce phrase-text.)

In the case of sharing phrases, consider that you have an interface
already localized for American English (probably by having been
written with that as the native locale, but that's incidental).
Localizing it for UK English should, in practical terms, be just a
matter of running it past a British person with the instructions to
indicate what few phrases would benefit from a change in spelling or
possibly minor rewording.
In that case, you should be able to put in
the UK English localization module I<only> those phrases that are
UK-specific, and for all the rest, I<inherit> from the American
English module. (And I expect this same situation would apply with
Brazilian and Continental Portuguese, possibly with some I<very>
closely related languages like Czech and Slovak, and possibly with the
slightly different "versions" of written Mandarin Chinese, as I hear exist in
Taiwan and mainland China.)

As to sharing of auxiliary functions, consider the problem of Russian
numbers from the beginning of this article; obviously, you'd want to
write only once the hairy code that, given a numeric value, would
return some specification of which case and number a given quantified
noun should use. But suppose that you discover, while localizing an
interface for, say, Ukrainian (a Slavic language related to Russian,
spoken by several million people, many of whom would be relieved to
find that your Web site's or software's interface is available in
their language), that the rules in Ukrainian are the same as in Russian
for quantification, and probably for many other grammatical functions.
While there may well be no phrases in common between Russian and
Ukrainian, you could still choose to have the Ukrainian module inherit
from the Russian module, just for the sake of inheriting all the
various grammatical methods. Or, probably better organizationally,
you could move those functions to a module called C<_E_Slavic> or
something, which Russian and Ukrainian could inherit useful functions
from, but which would (presumably) provide no lexicon.

=head2 Buzzword: Concision

Okay, concision isn't a buzzword.
But it should be, so I decree that
as a new buzzword, "concision" means that simple common things should
be expressible in very few lines (or maybe even just a few characters)
of code -- call it a special case of "making simple things easy and
hard things possible", and see also the role it played in the
MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].

Consider our first stab at an entry in our "phrasebook of functions":

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = sprintf("%g %s", $files,
      $files == 1 ? 'fichier' : 'fichiers');
    $dirs = sprintf("%g %s", $dirs,
      $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

You may sense that a lexicon (to use a non-committal catch-all term for a
collection of things you know how to say, regardless of whether they're
phrases or words) consisting of functions I<expressed> as above would
make for rather long-winded and repetitive code -- even if you wisely
rewrote this to have quantification (as we call adding a number
expression to a noun phrase) be a function called like:

  sub I_found_X1_files_in_X2_directories {
    my( $files, $dirs ) = @_[0,1];
    $files = quant($files, "fichier");
    $dirs =  quant($dirs,  "r\xE9pertoire");
    return "J'ai trouv\xE9 $files dans $dirs.";
  }

And you may also sense that you do not want to bother your translators
with having to write Perl code -- you'd much rather that they spend
their I<very costly time> on just translation. And this is to say
nothing of the near impossibility of finding a commercial translator
who would know even simple Perl.
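
For reference, the C<quant> function that the rewritten entry above
calls could be as small as the following sketch. It is an assumption
for illustration, not the code of any module: it keeps the article's
C<== 1> test for the singular, guesses a regular plural in "s" unless
an explicit plural is passed as a hypothetical third argument, and uses
"%g" for the number as the earlier examples do.

```perl
use strict;
use warnings;

# Hypothetical helper: join a number with the singular or plural form
# of a noun. The optional third argument supplies an irregular plural.
sub quant {
    my ($num, $singular, $plural) = @_;
    $plural = $singular . 's' unless defined $plural;
    return sprintf('%g %s', $num, $num == 1 ? $singular : $plural);
}

# The phrase function from the text, with all number handling delegated
# to quant().
sub I_found_X1_files_in_X2_directories {
    my ($files, $dirs) = @_[0, 1];
    $files = quant($files, 'fichier');
    $dirs  = quant($dirs,  "r\xE9pertoire");
    return "J'ai trouv\xE9 $files dans $dirs.";
}

print I_found_X1_files_in_X2_directories(10, 4), "\n";
```

Even with this helper, every lexicon entry is still a hand-written
Perl sub, which is the burden on translators that the paragraph above
is worried about.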

In a first-hack implementation of Maketext, each language-module's
lexicon looked like this:

 %Lexicon = (
   "I found %g files in %g directories"
   => sub {
      my( $files, $dirs ) = @_[0,1];
      $files = quant($files, "fichier");
      $dirs =  quant($dirs,  "r\xE9pertoire");
      return "J'ai trouv\xE9 $files dans $dirs.";
    },
  ... and so on with other phrase => sub mappings ...
 );

but I immediately went looking for some more concise way to basically
denote the same phrase-function -- a way that would also serve to
concisely denote I<most> phrase-functions in the lexicon for I<most>
languages. After much time and even some actual thought, I decided on
this system:

* Where a value in a %Lexicon hash is a contentful string instead of
an anonymous sub (or, conceivably, a coderef), it would be interpreted
as a sort of shorthand expression of what the sub does. When accessed
for the first time in a session, it is parsed, turned into Perl code,
and then eval'd into an anonymous sub; then that sub replaces the
original string in that lexicon. (That way, the work of parsing and
evaling the shorthand form for a given phrase is done no more than
once per session.)

* Calls to C<maketext> (as Maketext's main function is called) happen
thru a "language session handle", notionally very much like an IO
handle, in that you open one at the start of the session, and use it
for "sending signals" to an object in order to have it return the text
you want.

So, this:

  $lang->maketext("You have [quant,_1,piece] of new mail.",
                 scalar(@messages));

basically means this: look in the lexicon for $lang (which may inherit
from any number of other lexicons), and find the function that we
happen to associate with the string "You have [quant,_1,piece] of new
mail" (which is, and should be, a functioning "shorthand" for this
function in the native locale -- English in this case). If you find
such a function, call it with $lang as its first parameter (as if it
were a method), and then a copy of scalar(@messages) as its second,
and then return that value. If that function was found, but was in
string shorthand instead of being a fully specified function, parse it
and make it into a function before calling it the first time.

* The shorthand uses code in brackets to indicate method calls that
should be performed. A full explanation is not in order here, but a
few examples will suffice:

 "You have [quant,_1,piece] of new mail."

The above code is shorthand for, and will be interpreted as,
this:

 sub {
    my $handle = $_[0];
    my(@params) = @_;
    return join '',
      "You have ",
      $handle->quant($params[1], 'piece'),
      " of new mail.";
 }

where "quant" is the name of a method you're using to quantify the
noun "piece" with the number $params[1].

A string with no brackety calls, like this:

 "Your search expression was malformed."

is somewhat of a degenerate case, and just gets turned into:

 sub { return "Your search expression was malformed." }

However, not everything you can write in Perl code can be written in
the above shorthand system -- not by a long shot. For example, consider
the Italian translator from the beginning of this article, who wanted
the Italian for "I didn't find any files" as a special case, instead
of "I found 0 files".
That couldn't be specified (at least not easily
or simply) in our shorthand system, and it would have to be written
out in full, like this:

 sub {  # pretend the English strings are in Italian
    my($handle, $files, $dirs) = @_[0,1,2];
    return "I didn't find any files" unless $files;
    return join '',
      "I found ",
      $handle->quant($files, 'file'),
      " in ",
      $handle->quant($dirs,  'directory'),
      ".";
 }

Next to a lexicon full of shorthand code, that sort of sticks out like a
sore thumb -- but this I<is> a special case, after all; and at least
it's possible, if not as concise as usual.

As to how you'd implement the Russian example from the beginning of
the article, well, There's More Than One Way To Do It, but it could be
something like this (using English words for Russian, just so you know
what's going on):

 "I [quant,_1,directory,accusative] scanned."

This shifts the burden of complexity off to the quant method. That
method's parameters are: the numeric value it's going to use to
quantify something; the Russian word it's going to quantify; and the
parameter "accusative", which you're using to mean that this
sentence's syntax wants a noun in the accusative case there, although
that quantification method may have to overrule, for grammatical
reasons you may recall from the beginning of this article.

Now, the Russian quant method here is responsible not only for
implementing the strange logic necessary for figuring out how Russian
number-phrases impose case and number on their noun-phrases, but also
for inflecting the Russian word for "directory".
How that inflection
is to be carried out is no small issue, and among the solutions I've
seen, some (like variations on a simple lookup in a hash where all
possible forms are provided for all necessary words) are
straightforward but I<can> become cumbersome when you need to inflect
more than a few dozen words; and other solutions (like using
algorithms to model the inflections, storing only root forms and
irregularities) I<can> involve more overhead than is justifiable for
all but the largest lexicons.

Mercifully, this design decision becomes crucial only in the hairiest
of inflected languages, of which Russian is by no means the I<worst> case
scenario, but is worse than most. Most languages have simpler
inflection systems; for example, in English or Swahili, there are
generally no more than two possible inflected forms for a given noun
("error/errors"; "kosa/makosa"), and the
rules for producing these forms are fairly simple -- or at least,
simple rules can be formulated that work for most words, and you can
then treat the exceptions as just "irregular", at least relative to
your ad hoc rules. A simpler inflection system (simpler rules, fewer
forms) means that design decisions are less crucial to maintaining
sanity, whereas the same decisions could incur
overhead-versus-scalability problems in languages like Russian. It
may I<also> be likely that code (possibly in Perl, as with
Lingua::EN::Inflect, for English nouns) has already
been written for the language in question, whether simple or complex.

Moreover, a third possibility may even be simpler than anything
discussed above: "Just require that all possible (or at least
applicable) forms be provided in the call to the given language's quant
method, as in:"

 "I found [quant,_1,file,files]."

That way, quant just has to choose which form it needs, without having
to look up or generate anything.
While possibly not optimal for
Russian, this should work well for most other languages, where
quantification is not as complicated an operation.

=head2 The Devil in the Details

There's plenty more to Maketext than described above -- for example,
there are the details of how language tags ("en-US", "i-pwn", "fi",
etc.) or locale IDs ("en_US") interact with actual module naming
("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there are the
details of how to record (and possibly negotiate) what character
encoding Maketext will return text in (UTF-8? Latin-1? KOI8?). There's
the interesting fact that Maketext is for localization, but doesn't
actually have a "C<use locale;>" anywhere in it. For the curious,
there are the somewhat frightening details of how I actually
implement something like data inheritance so that searches across
modules' %Lexicon hashes can parallel how Perl implements method
inheritance.

And, most importantly, there are all the practical details of how to
actually go about deriving from Maketext so you can use it for your
interfaces, and the various tools and conventions for starting out and
maintaining individual language modules.

That is all covered in the documentation for Locale::Maketext and the
modules that come with it, available on CPAN. Now that you have read
this article, which covers the why's of Maketext, the documentation,
which covers the how's of it, should be quite straightforward.

=head2 The Proof in the Pudding: Localizing Web Sites

Maketext and gettext have a notable difference: gettext is in C,
accessible through C library calls, whereas Maketext is in Perl, and
really can't work without a Perl interpreter (although I suppose
something like it could be written for C).
Accidents of history (and
not necessarily lucky ones) have made C++ the most common language for
the implementation of applications like word processors, Web browsers,
and even many in-house applications like custom query systems. Current
conditions make it somewhat unlikely that the next one of any of these
kinds of applications will be written in Perl, albeit clearly more for
reasons of custom and inertia than out of consideration of what is the
right tool for the job.

However, other accidents of history have made Perl a well-accepted
language for design of server-side programs (generally in CGI form)
for Web site interfaces. Localization of static pages in Web sites is
trivial, feasible either with simple language-negotiation features in
servers like Apache, or with some kind of server-side inclusions of
language-appropriate text into layout templates. However, I think
that the localization of Perl-based search systems (or other kinds of
dynamic content) in Web sites, be they public or access-restricted,
is where Maketext will see the greatest use.

I presume that it would be only the exceptional Web site that gets
localized for English I<and> Chinese I<and> Italian I<and> Arabic
I<and> Russian, to recall the languages from the beginning of this
article -- to say nothing of German, Spanish, French, Japanese,
Finnish, and Hindi, to name a few languages that benefit from large
numbers of programmers or Web viewers or both.

However, the ever-increasing internationalization of the Web (whether
measured in terms of amount of content, of numbers of content writers
or programmers, or of size of content audiences) makes it increasingly
likely that the interface to the average Web-based dynamic content
service will be localized for two or maybe three languages.
It is my
hope that Maketext will make that task as simple as possible, and will
remove previous barriers to localization for languages dissimilar to
English.

  __END__

Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics
from Northwestern University; he specializes in language technology.
Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of
Linguistics at the University of New Mexico; he specializes in
morphology and pedagogy of North American native languages.

=head2 References

Alvestrand, Harald Tveit. 1995. I<RFC 1766: Tags for the
Identification of Languages.>
C<ftp://ftp.isi.edu/in-notes/rfc1766.txt>
[Now see RFC 3066.]

Callon, Ross, editor. 1996. I<RFC 1925: The Twelve
Networking Truths.>
C<ftp://ftp.isi.edu/in-notes/rfc1925.txt>

Drepper, Ulrich, Peter Miller,
and FranE<ccedil>ois Pinard. 1995-2001. GNU
C<gettext>. Available in C<ftp://prep.ai.mit.edu/pub/gnu/>, with
extensive docs in the distribution tarball. [Since
I wrote this article in 1998, I now see that the
gettext docs are now trying more to come to terms with
plurality. Whether useful conclusions have come from it
is another question altogether. -- SMB, May 2001]

Forbes, Nevill. 1964. I<Russian Grammar.> Third Edition, revised
by J. C. Dumbreck. Oxford University Press.

=cut

#End