@node Message Translation
@chapter Message Translation

A program's interface with its user should be designed to make the
user's task easier.  One way to do this is to print messages in
whatever language the user prefers.

Printing messages in different languages can be implemented in different
ways.  One could add all the different languages in the source code and
choose among the variants every time a message has to be printed.  This
is certainly not a good solution since extending the set of languages is
difficult (the code must be changed) and the code itself can become
really big with dozens of message sets.

A better solution is to keep the message sets for each language in
separate files which are loaded at runtime depending on the language
selection of the user.

The GNU C Library provides two different sets of functions to support
message translation.  The problem is that neither of the interfaces is
officially defined by the POSIX standard.  The @code{catgets} family of
functions is defined in the X/Open standard but this is derived from
industry decisions and therefore not necessarily based on reasonable
decisions.

As mentioned above the message catalog handling provides easy
extensibility by using external data files which contain the message
translations.  I.e., these files contain for each of the messages used
in the program a translation for the appropriate language.  So the tasks
of the message handling functions are

@itemize @bullet
@item
locate the external data file with the appropriate translations.
@item
load the data and make it possible to address the messages
@item
map a given key to the translated message
@end itemize

The two approaches mainly differ in the implementation of this last
step.  The design decisions made for it influence everything else.

@menu
* Message catalogs a la X/Open::  The @code{catgets} family of functions.
* The Uniforum approach::         The @code{gettext} family of functions.
@end menu


@node Message catalogs a la X/Open
@section X/Open Message Catalog Handling

The @code{catgets} functions are based on the simple scheme:

@quotation
Associate every message to translate in the source code with a unique
identifier.  To retrieve a message from a catalog file solely the
identifier is used.
@end quotation

This means for the author of the program that s/he will have to make
sure the meaning of the identifier in the program code and in the
message catalogs is always the same.

Before a message can be translated the catalog file must be located.
The user of the program must be able to guide the responsible function
to find whatever catalog the user wants.  This is separated from what
the programmer had in mind.

All the types, constants and functions for the @code{catgets} functions
are defined/declared in the @file{nl_types.h} header file.

@menu
* The catgets Functions::      The @code{catgets} function family.
* The message catalog files::  Format of the message catalog files.
* The gencat program::         How to generate message catalog files which
                                can be used by the functions.
* Common Usage::               How to use the @code{catgets} interface.
@end menu


@node The catgets Functions
@subsection The @code{catgets} function family

@comment nl_types.h
@comment X/Open
@deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
The @code{catopen} function tries to locate the message data file named
@var{cat_name} and loads it when found.  The return value is of an
opaque type and can be used in calls to the other functions to refer to
this loaded catalog.

The return value is @code{(nl_catd) -1} in case the function failed and
no catalog was loaded.  The global variable @var{errno} contains a code
for the error causing the failure.  But even if the function call
succeeded this does not mean that all messages can be translated.

Locating the catalog file must happen in a way which lets the user of
the program influence the decision.  It is up to the user to decide
about the language to use and sometimes it is useful to use alternate
catalog files.  All this can be specified by the user by setting some
environment variables.

The first problem is to find out where all the message catalogs are
stored.  Every program could have its own place to keep all the
different files but usually the catalog files are grouped by languages
and the catalogs for all programs are kept in the same place.

@cindex NLSPATH environment variable
To tell the @code{catopen} function where the catalog for the program
can be found the user can set the environment variable @code{NLSPATH} to
a value which describes her/his choice.  Since this value must be usable
for different languages and locales it cannot be a simple string.
Instead it is a format string (similar to @code{printf}'s).  An example
is

@smallexample
/usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
@end smallexample

First one can see that more than one directory can be specified (with
the usual syntax of separating them by colons).  The next things to
observe are the format elements, @code{%L} and @code{%N} in this case.
The @code{catopen} function knows about several of them and each is
replaced differently.

@table @code
@item %N
This format element is substituted with the name of the catalog file.
This is the value of the @var{cat_name} argument given to
@code{catopen}.

@item %L
This format element is substituted with the name of the currently
selected locale for translating messages.  How this is determined is
explained below.

@item %l
(This is the lowercase ell.) This format element is substituted with the
language element of the locale name.  The string describing the selected
locale is expected to have the form
@code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
first part @var{lang}.

@item %t
This format element is substituted by the territory part @var{terr} of
the name of the currently selected locale.  See the explanation of the
format above.

@item %c
This format element is substituted by the codeset part @var{codeset} of
the name of the currently selected locale.  See the explanation of the
format above.

@item %%
Since @code{%} is used as a meta character there must be a way to
express the @code{%} character in the result itself.  Using @code{%%}
does this just like it works for @code{printf}.
@end table
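
The substitution rules in the table above can be illustrated with a
small C sketch.  It expands a single @code{NLSPATH} entry for a given
locale name and catalog name.  The function name
@code{expand_nlspath_entry} and its error handling (unknown format
elements are rejected) are assumptions made for this illustration.  This
is not the code the GNU C Library uses in @code{catopen}, and splitting
the colon-separated list into entries is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append LEN bytes of SRC to OUT (capacity CAP, current length *POS).
   Returns 0 on success and -1 if the buffer is too small.  */
static int
append (char *out, size_t cap, size_t *pos, const char *src, size_t len)
{
  if (*pos + len >= cap)
    return -1;
  memcpy (out + *pos, src, len);
  *pos += len;
  out[*pos] = '\0';
  return 0;
}

/* Hypothetical helper: expand one NLSPATH entry TMPL.  LOCALE has the
   form lang[_terr[.codeset]] and NAME is the catalog name.  */
int
expand_nlspath_entry (const char *tmpl, const char *locale,
                      const char *name, char *out, size_t cap)
{
  assert (tmpl != NULL && locale != NULL && name != NULL && out != NULL);
  if (cap == 0)
    return -1;

  /* Split the locale name into language, territory and codeset.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *rest = locale + lang_len;
  const char *terr = "";
  size_t terr_len = 0;
  const char *codeset = "";
  if (*rest == '_')
    {
      terr = rest + 1;
      terr_len = strcspn (terr, ".");
      rest = terr + terr_len;
    }
  if (*rest == '.')
    codeset = rest + 1;

  size_t pos = 0;
  out[0] = '\0';
  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      int rc;
      if (*p != '%')
        rc = append (out, cap, &pos, p, 1);
      else
        switch (*++p)
          {
          case 'N': rc = append (out, cap, &pos, name, strlen (name)); break;
          case 'L': rc = append (out, cap, &pos, locale, strlen (locale)); break;
          case 'l': rc = append (out, cap, &pos, locale, lang_len); break;
          case 't': rc = append (out, cap, &pos, terr, terr_len); break;
          case 'c': rc = append (out, cap, &pos, codeset, strlen (codeset)); break;
          case '%': rc = append (out, cap, &pos, "%", 1); break;
          default: return -1;  /* Unknown element or trailing '%'.  */
          }
      if (rc != 0)
        return -1;
    }
  return 0;
}
```

With the entry @code{/usr/share/locale/%L/LC_MESSAGES/%N}, the locale
name @code{de_DE.UTF-8} and the catalog name @code{hello.cat} this
sketch produces
@file{/usr/share/locale/de_DE.UTF-8/LC_MESSAGES/hello.cat}.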


Using @code{NLSPATH} allows arbitrary directories to be searched for
message catalogs while still allowing different languages to be used.
If the @code{NLSPATH} environment variable is not set the default value
is

@smallexample
@var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
@end smallexample

@noindent
where @var{prefix} is given to @code{configure} while installing the GNU
C Library (this value is in many cases @code{/usr} or the empty string).

The remaining problem is to decide which locale must be used.  Its name
determines the substitution of the format elements mentioned above.
First of all the user can specify a path in the message catalog name
(i.e., the name contains a slash character).  In this situation the
@code{NLSPATH} environment variable is not used.  The catalog must exist
as specified in the program, perhaps relative to the current working
directory.  This situation is not desirable and catalog names should
never be written this way.  Besides, this behaviour is not portable
to all other platforms providing the @code{catgets} interface.

@cindex LC_ALL environment variable
@cindex LC_MESSAGES environment variable
@cindex LANG environment variable
Otherwise the values of environment variables from the standard
environment are examined (@pxref{Standard Environment}).  Which
variables are examined is decided by the @var{flag} parameter of
@code{catopen}.  If the value is @code{NL_CAT_LOCALE} (which is defined
in @file{nl_types.h}) then the @code{catopen} function examines the
environment variables @code{LC_ALL}, @code{LC_MESSAGES}, and @code{LANG}
in this order.  The first variable which is set in the current
environment will be used.

If @var{flag} is zero only the @code{LANG} environment variable is
examined.  This is a left-over from the early days of this function
where the other environment variables were not known.

In any case the environment variable should have a value of the form
@code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.  If
no environment variable is set the @code{"C"} locale is used which
prevents any translation.
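
The selection order described above can be written down as a short C
sketch.  The helper @code{message_locale_for_flag} is a hypothetical
name used only for this illustration, and treating an empty value like
an unset variable is an assumption.  The sketch mirrors the description
in this section, not the actual implementation of @code{catopen}.

```c
#define _XOPEN_SOURCE 700   /* nl_types.h is an X/Open interface.  */
#include <assert.h>
#include <nl_types.h>
#include <stdlib.h>

/* Return the value of the first variable in NAMES that is set to a
   non-empty value, or NULL if there is none.  */
static const char *
first_set (const char *const *names, size_t n)
{
  assert (names != NULL);
  for (size_t i = 0; i < n; ++i)
    {
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}

/* Hypothetical helper: the locale name that would be used for the
   given catopen FLAG according to the rules in the text.  */
const char *
message_locale_for_flag (int flag)
{
  static const char *const with_locale[] =
    { "LC_ALL", "LC_MESSAGES", "LANG" };
  static const char *const lang_only[] = { "LANG" };
  const char *value;

  if (flag == NL_CAT_LOCALE)
    value = first_set (with_locale, 3);
  else
    value = first_set (lang_only, 1);

  /* Without any setting the "C" locale is used.  */
  return value != NULL ? value : "C";
}
```

With @code{LANG} set to @code{fr_FR} and @code{LC_MESSAGES} set to
@code{de_DE}, the sketch selects @code{de_DE} for @code{NL_CAT_LOCALE}
and @code{fr_FR} for a @var{flag} value of zero.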

The return value of the @code{catgets} function (described below) is in
any case a valid string.  Either it is a translation from a message
catalog or it is the same as the @var{string} parameter.  So a piece of
code to decide whether a translation actually happened must look like
this:

@smallexample
@{
  char *trans = catgets (desc, set, msg, input_string);
  if (trans == input_string)
    @{
      /* Something went wrong.  */
    @}
@}
@end smallexample

@noindent
When an error occurred the global variable @var{errno} is set to

@table @var
@item EBADF
The catalog does not exist.
@item ENOMSG
The set/message tuple does not name an existing element in the
message catalog.
@end table

While it sometimes can be useful to test for errors programs normally
will avoid any test.  If the translation is not available it is no big
problem if the original, untranslated message is printed.  Either the
user understands this as well or s/he will look for the reason why the
messages are not translated.
@end deftypefun

Please note that the currently selected locale does not depend on a call
to the @code{setlocale} function.  It is not necessary that the locale
data files for this locale exist and calling @code{setlocale} succeeds.
The @code{catopen} function directly reads the values of the environment
variables.


@deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
The function @code{catgets} has to be used to access the message catalog
previously opened using the @code{catopen} function.  The
@var{catalog_desc} parameter must be a value previously returned by
@code{catopen}.

The next two parameters, @var{set} and @var{message}, reflect the
internal organization of the message catalog files.  This will be
explained in detail below.  For now it is interesting to know that a
catalog can consist of several sets and the messages in each set are
individually numbered.  Neither the set numbers nor the message numbers
need to be consecutive.  They can be arbitrarily chosen.  But each
message (unless equal to another one) must have its own unique pair of
set and message number.

Since it is not guaranteed that the message catalog for the language
selected by the user exists the last parameter @var{string} helps to
handle this case gracefully.  If no matching string can be found
@var{string} is returned.  This means for the programmer that

@itemize @bullet
@item
the @var{string} parameters should contain reasonable text (this also
helps to understand the program, since otherwise there would be no hint
about the string which is expected to be returned).
@item
all @var{string} arguments should be written in the same language.
@end itemize
@end deftypefun

It is somewhat uncomfortable to write a program using the @code{catgets}
functions if no supporting functionality is available.  Since each
set/message number tuple must be unique the programmer must keep lists
of the messages at the same time the code is written.  And the work
between several people working on the same project must be coordinated.
In @ref{Common Usage} we will see how some of these problems can be
relaxed a bit.

@deftypefun int catclose (nl_catd @var{catalog_desc})
The @code{catclose} function can be used to free the resources
associated with a message catalog which previously was opened by a call
to @code{catopen}.  If the resources can be successfully freed the
function returns @code{0}.  Otherwise it returns @code{@minus{}1} and the
global variable @var{errno} is set.  Errors can occur if the catalog
descriptor @var{catalog_desc} is not valid in which case @var{errno} is
set to @code{EBADF}.
@end deftypefun
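
The three functions are normally used together.  The following C sketch
wraps @code{catgets} and @code{catclose} in two small helpers which
handle a failed @code{catopen} explicitly.  The helper names
@code{lookup_message} and @code{close_catalog} are hypothetical, and
guarding against an invalid descriptor is a defensive assumption for
implementations that might not accept one.  As described above, the
functions of the GNU C Library report such a descriptor through
@var{errno}.

```c
#define _XOPEN_SOURCE 700   /* nl_types.h is an X/Open interface.  */
#include <assert.h>
#include <nl_types.h>
#include <stddef.h>

/* Hypothetical helper: look up message MSG in set SET.  FALLBACK is
   returned unchanged when no catalog is open or no translation is
   found.  */
const char *
lookup_message (nl_catd cat, int set, int msg, const char *fallback)
{
  assert (fallback != NULL);
  if (cat == (nl_catd) -1)
    return fallback;
  return catgets (cat, set, msg, fallback);
}

/* Hypothetical helper: close CAT only if catopen succeeded.  Returns
   0 on success and -1 on failure, like catclose.  */
int
close_catalog (nl_catd cat)
{
  if (cat == (nl_catd) -1)
    return 0;
  return catclose (cat);
}
```

A program would call @code{catopen} once at startup, pass the returned
descriptor to @code{lookup_message} for every message it prints, and
call @code{close_catalog} before it exits.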


@node The message catalog files
@subsection  Format of the message catalog files

The only reasonable way to translate all the messages of a program and
store the result in a message catalog file which can be read by the
@code{catopen} function is to hand all the message text to the
translator and let her/him translate them all.  I.e., we must have a
file with entries which associate the set/message tuple with a specific
translation.  This file format is specified in the X/Open standard and
is as follows:

@itemize @bullet
@item
Lines containing only whitespace characters or empty lines are ignored.

@item
Lines which contain as the first non-whitespace character a @code{$}
followed by a whitespace character are comments and are also ignored.

@item
If a line contains as the first non-whitespace characters the sequence
@code{$set} followed by a whitespace character an additional argument
is required to follow.  This argument can either be:

@itemize @minus
@item
a number.  In this case the value of this number determines the set
to which the following messages are added.

@item
an identifier consisting of alphanumeric characters plus the underscore
character.  In this case the set is automatically assigned a number.
This value is one more than the largest set number which has appeared
so far.

How to use the symbolic names is explained in section @ref{Common Usage}.

It is an error if a symbol name appears more than once.  All following
messages are placed in a set with this number.
@end itemize

@item
If a line contains as the first non-whitespace characters the sequence
@code{$delset} followed by a whitespace character an additional argument
is required to follow.  This argument can either be:

@itemize @minus
@item
a number.  In this case the value of this number determines the set
which will be deleted.

@item
an identifier consisting of alphanumeric characters plus the underscore
character.  This symbolic identifier must match a name for a set which
previously was defined.  It is an error if the name is unknown.
@end itemize

In both cases all messages in the specified set will be removed.  They
will not appear in the output.  But if this set is later again selected
with a @code{$set} command again messages could be added and these
messages will appear in the output.

@item
If a line contains after leading whitespace the sequence
@code{$quote}, the quoting character used for this input file is
changed to the first non-whitespace character following the
@code{$quote}.  If no non-whitespace character is present before the
line ends quoting is disabled.

By default no quoting character is used.  In this mode strings are
terminated with the first unescaped line break.  If there is a
@code{$quote} sequence present newlines need not be escaped.  Instead a
string is terminated with the first unescaped appearance of the quote
character.

A common usage of this feature would be to set the quote character to
@code{"}.  Then any appearance of the @code{"} in the strings must
be escaped using the backslash (i.e., @code{\"} must be written).

@item
Any other line must start with a number or an alphanumeric identifier
(with the underscore character included).  The following characters
(starting at the first non-whitespace character) will form the string
which gets associated with the currently selected set and the message
number represented by the number or identifier respectively.

If the start of the line is a number the message number is obvious.  It
is an error if the same message number already appeared for this set.

If the leading token was an identifier the message number gets
automatically assigned.  The value is the current maximum message
number for this set plus one.  It is an error if the identifier was
already used for a message in this set.  It is ok to reuse the
identifier for a message in another set.  How to use the symbolic
identifiers will be explained below (@pxref{Common Usage}).  There is
one limitation with the identifier: it must not be @code{Set}.  The
reason will be explained below.

Please note that you must use a quoting character if a message contains
leading whitespace.  Since one cannot guarantee this never happens it is
probably a good idea to always use quoting.

The text of the messages can contain escape characters.  The usual bunch
of characters known from the @w{ISO C} language are recognized
(@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
@code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
a character code).
@end itemize
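
To make the rules above more concrete, here is a C sketch which
classifies a single line of a message catalog source file.  The names
@code{line_kind} and @code{classify_catalog_line} are hypothetical and
the sketch only looks at the start of the line; it does not validate
arguments, handle quoting or continuation lines, and it is not taken
from the @code{gencat} sources.  Treating a lone @code{$} at the end of
a line as a comment is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum line_kind
{
  LINE_IGNORED,   /* Empty or whitespace only.  */
  LINE_COMMENT,   /* "$" followed by whitespace.  */
  LINE_SET,       /* "$set" directive.  */
  LINE_DELSET,    /* "$delset" directive.  */
  LINE_QUOTE,     /* "$quote" directive.  */
  LINE_MESSAGE    /* Anything else: number or identifier plus text.  */
};

/* True if S starts with WORD followed by whitespace or the end of
   the string.  */
static int
starts_with_word (const char *s, const char *word)
{
  size_t n = strlen (word);
  return strncmp (s, word, n) == 0
         && (s[n] == '\0' || isspace ((unsigned char) s[n]));
}

/* Hypothetical helper: decide what kind of line LINE is.  */
enum line_kind
classify_catalog_line (const char *line)
{
  assert (line != NULL);

  /* Leading whitespace is never significant for the decision.  */
  while (isspace ((unsigned char) *line))
    ++line;

  if (*line == '\0')
    return LINE_IGNORED;
  if (starts_with_word (line, "$set"))
    return LINE_SET;
  if (starts_with_word (line, "$delset"))
    return LINE_DELSET;
  /* The quote character may follow "$quote" directly.  */
  if (strncmp (line, "$quote", 6) == 0)
    return LINE_QUOTE;
  if (starts_with_word (line, "$"))
    return LINE_COMMENT;
  return LINE_MESSAGE;
}
```

A complete reader would keep the current set and quote character as
state and update them whenever a @code{$set} or @code{$quote} line is
seen.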

@strong{Important:} The handling of identifiers instead of numbers for
the set and messages is a GNU extension.  Systems strictly following the
X/Open specification do not have this feature.  An example for a message
catalog file is this:

@smallexample
$ This is a leading comment.
$quote "

$set SetOne
1 Message with ID 1.
two "   Message with ID \"two\", which gets the value 2 assigned"

$set SetTwo
$ Since the last set got the number 1 assigned this set has number 2.
4000 "The numbers can be arbitrary, they need not start at one."
@end smallexample

This small example shows various aspects:
@itemize @bullet
@item
Lines 1 and 9 are comments since they start with @code{$} followed by
a whitespace.
@item
The quoting character is set to @code{"}.  Otherwise the quotes in the
message definition would have to be left out and in this case the
message with the identifier @code{two} would lose its leading whitespace.
@item
Mixing numbered messages with messages having symbolic names is no
problem and the numbering happens automatically.
@end itemize


While this file format is pretty easy it is not the best possible for
use in a running program.  The @code{catopen} function would have to
parse the file and handle syntactic errors gracefully.  This is not so
easy and the whole process is pretty slow.  Therefore the @code{catgets}
functions expect the data in another more compact and ready-to-use file
format.  There is a special program @code{gencat} which is explained in
detail in the next section.

Files in this other format are not human readable.  To be easy to use by
programs it is a binary file.  But the format is byte order independent
so translation files can be shared by systems of arbitrary architecture
(as long as they use the GNU C Library).

Details about the binary file format are not important to know since
these files are always created by the @code{gencat} program.  The
sources of the GNU C Library also provide the sources for the
@code{gencat} program and so the interested reader can look through
these source files to learn about the file format.


@node The gencat program
@subsection Generate Message Catalog files

@cindex gencat
The @code{gencat} program is specified in the X/Open standard and the
GNU implementation follows this specification and so can process
all correctly formed input files.  Additionally some extensions are
implemented which help to work in a more reasonable way with the
@code{catgets} functions.

The @code{gencat} program can be invoked in two ways:

@example
gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}]
@end example

This is the interface defined in the X/Open standard.  If no
@var{Input-File} parameter is given input will be read from standard
input.  Multiple input files will be read as if they are concatenated.
If @var{Output-File} is also missing, the output will be written to
standard output.  To provide the interface one is used to from other
programs, a second interface is provided.

@smallexample
gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{}
@end smallexample

The option @samp{-o} is used to specify the output file and all file
arguments are used as input files.

Besides this one can use @file{-} or @file{/dev/stdin} for
@var{Input-File} to denote the standard input.  Correspondingly one can
use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
standard output.  Using @file{-} as a file name is allowed in X/Open
while using the device names is a GNU extension.

The @code{gencat} program works by concatenating all input files and
then @strong{merging} the resulting collection of message sets with a
possibly existing output file.  This is done by removing all messages
with set/message number tuples matching any of the generated messages
from the output file and then adding all the new messages.  To
regenerate a catalog file while ignoring the old contents therefore
requires removing the output file if it exists.  If the output is
written to standard output no merging takes place.
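
The merge rule can be sketched in C on an in-memory list of messages.
The @code{struct message} type and the function @code{merge_messages}
are hypothetical and exist only for this illustration; the real
@code{gencat} program works on the binary catalog format and its
sources should be consulted for details.

```c
#include <assert.h>
#include <stddef.h>

/* One catalog entry, identified by its set/message number tuple.  */
struct message
{
  int set;
  int number;
  const char *text;
};

/* Hypothetical helper: merge the FRESH_N messages in FRESH with the
   OLD_N messages in OLD.  OUT must have room for OLD_N + FRESH_N
   entries.  Returns the number of entries written to OUT.  */
size_t
merge_messages (const struct message *old, size_t old_n,
                const struct message *fresh, size_t fresh_n,
                struct message *out)
{
  assert (out != NULL);
  size_t n = 0;

  /* Keep only those old messages whose tuple is not redefined.  */
  for (size_t i = 0; i < old_n; ++i)
    {
      int replaced = 0;
      for (size_t j = 0; j < fresh_n; ++j)
        if (old[i].set == fresh[j].set && old[i].number == fresh[j].number)
          {
            replaced = 1;
            break;
          }
      if (!replaced)
        out[n++] = old[i];
    }

  /* Then add all the new messages.  */
  for (size_t j = 0; j < fresh_n; ++j)
    out[n++] = fresh[j];

  return n;
}
```

Merging an old catalog containing the tuples (1,1) and (1,2) with new
input defining (1,2) and (2,1) gives three messages: the old (1,1) and
the new (1,2) and (2,1).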

@noindent
The following table shows the options understood by the @code{gencat}
program.  The X/Open standard does not specify any option for the
program so all of these are GNU extensions.

@table @samp
@item -V
@itemx --version
Print the version information and exit.
@item -h
@itemx --help
Print a usage message listing all available options, then exit successfully.
@item --new
Never merge the new messages from the input files with the old content
of the output file.  The old content of the output file is discarded.
@item -H
@itemx --header=name
This option is used to emit the symbolic names given to sets and
messages in the input files for use in the program.  Details about how
to use this are given in the next section.  The @var{name} parameter to
this option specifies the name of the output file.  It will contain a
number of C preprocessor @code{#define}s to associate a name with a
number.

Please note that the generated file only contains the symbols from the
input files.  If the output is merged with the previous content of the
output file the possibly existing symbols from the file(s) which
generated the old output files are not in the generated header file.
@end table


@node Common Usage
@subsection How to use the @code{catgets} interface

The @code{catgets} functions can be used in two different ways: by
strictly following the X/Open specification and not relying on the
extensions, or by using the GNU extensions.  We will take a look at the
former method first to understand the benefits of extensions.

@subsubsection Not using symbolic names

Since the X/Open format of the message catalog files does not allow
symbol names we have to work with numbers all the time.  When we start
writing a program we have to replace all appearances of translatable
strings with something like

@smallexample
catgets (catdesc, set, msg, "string")
@end smallexample

@noindent
@var{catdesc} is retrieved from a call to @code{catopen} which is
normally done once at the program start.  The @code{"string"} is the
string we want to translate.  The problems start with the set and
message numbers.

In a bigger program several programmers usually work at the same time on
the program and so coordinating the number allocation is crucial.
Though no two different strings must be indexed by the same tuple of
numbers it is highly desirable to reuse the numbers for equal strings
with equal translations (please note that there might be strings which
are equal in one language but have different translations due to
different contexts).

The allocation process can be relaxed a bit by using different set
numbers for different parts of the program.  So the number of developers
who have to coordinate the allocation can be reduced.  But lists must
still be kept to track the allocation and errors can easily happen.
These errors cannot be discovered by the compiler or the @code{catgets}
functions.  Only the user of the program might see wrong messages
printed.  In the worst cases the messages are so irritating that they
cannot be recognized as wrong.  Think about the translations for
@code{"true"} and @code{"false"} being exchanged.  This could result in
a disaster.


@subsubsection Using symbolic names

The problems mentioned in the last section derive from the fact that:

@enumerate
@item
the numbers are allocated once and due to the possibly frequent use of
them it is difficult to change a number later.
@item
the numbers do not allow one to guess anything about the string and
therefore collisions can easily happen.
@end enumerate

By consistently using symbolic names and by providing a method which
maps the string content to a symbolic name (however this will happen)
one can prevent both problems above.  The cost of this is that the
programmer has to write a complete message catalog file while s/he is
writing the program itself.

This is necessary since the symbolic names must be mapped to numbers
before the program sources can be compiled.  In the last section it was
described how to generate a header containing the mapping of the names.
E.g., for the example message file given in the last section we could
call the @code{gencat} program as follows (assume @file{ex.msg} contains
the sources).

@smallexample
gencat -H ex.h -o ex.cat ex.msg
@end smallexample

@noindent
This generates a header file with the following content:

@smallexample
#define SetTwoSet 0x2   /* ex.msg:8 */

#define SetOneSet 0x1   /* ex.msg:4 */
#define SetOnetwo 0x2   /* ex.msg:6 */
@end smallexample

As can be seen the various symbols given in the source file are mangled
to generate unique identifiers and these identifiers get numbers
assigned.  Reading the source file and knowing about the rules will
allow one to predict the content of the header file (it is
deterministic) but this is not necessary.  The @code{gencat} program can
take care of everything.  All the programmer has to do is to put the
generated header file in the dependency list of the source files of
her/his project and to add a rule to regenerate the header if any of the
input files changes.

One word about the symbol mangling.  Every symbol consists of two parts:
the name of the message set plus the name of the message or the special
string @code{Set}.  So @code{SetOnetwo} means this macro can be used to
access the translation with identifier @code{two} in the message set
@code{SetOne}.

The other names denote the names of the message sets.  The special
string @code{Set} is used in the place of the message identifier.
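
The mangling rule just described amounts to a simple concatenation, as
the following C sketch shows.  The function name
@code{header_macro_name} is hypothetical and the sketch only restates
the rule from the text; it is not the code used by @code{gencat}.  It
also suggests why a message identifier presumably must not be
@code{Set}: the resulting macro would collide with the one for the set
itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: build the macro name for the message MSG_NAME
   in the set SET_NAME.  Pass "Set" as MSG_NAME to get the macro which
   names the set itself.  Returns 0 on success and -1 if OUT (capacity
   CAP) is too small.  */
int
header_macro_name (const char *set_name, const char *msg_name,
                   char *out, size_t cap)
{
  assert (set_name != NULL && msg_name != NULL && out != NULL);

  size_t set_len = strlen (set_name);
  size_t msg_len = strlen (msg_name);
  if (set_len + msg_len >= cap)
    return -1;

  /* The macro name is the set name directly followed by the message
     name (or the special string "Set").  */
  memcpy (out, set_name, set_len);
  memcpy (out + set_len, msg_name, msg_len + 1);
  return 0;
}
```

For the set @code{SetOne} and the message @code{two} the sketch yields
@code{SetOnetwo}; for the same set and the special string @code{Set} it
yields @code{SetOneSet}.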

If in the code the second string of the set @code{SetOne} is used the C
code should look like this:

@smallexample
catgets (catdesc, SetOneSet, SetOnetwo,
         "   Message with ID \"two\", which gets the value 2 assigned")
@end smallexample

Writing the function call this way allows the message number and even
the set number to be changed without requiring any change in the C
source code.  (The text of the string is normally not the same; this is
only for this example.)


@subsubsection How this helps during development

To illustrate the usual way to work with the symbolic names here is a
little example.  Assume we want to write the very complex and famous
greeting program.  We start by writing the code as usual:

@smallexample
#include <stdio.h>
int
main (void)
@{
  printf ("Hello, world!\n");
  return 0;
@}
@end smallexample

Now we want to internationalize the message and therefore replace the
fixed string with a lookup of whatever the user wants to see.

@smallexample
#include <nl_types.h>
#include <stdio.h>
#include "msgnrs.h"
int
main (void)
@{
  nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
  printf (catgets (catdesc, MainSet, MainHello, "Hello, world!\n"));
  catclose (catdesc);
  return 0;
@}
@end smallexample

We see how the catalog object is opened and the returned descriptor used
in the other function calls.  It is not really necessary to check for
failure of any of the functions since even in these situations the
functions will behave reasonably.  They will simply return the
untranslated default string.

What remains unspecified here are the constants @code{MainSet} and
@code{MainHello}.  These are the symbolic names describing the
message.  To get the actual definitions which match the information in
the catalog file we have to create the message catalog source file and
process it using the @code{gencat} program.

@smallexample
$ Messages for the famous greeting program.
$quote "

$set Main
Hello "Hallo, Welt!\n"
@end smallexample

Now we can start building the program (assume the message catalog source
file is named @file{hello.msg} and the program source file @file{hello.c}):

@smallexample
@cartouche
% gencat -H msgnrs.h -o hello.cat hello.msg
% cat msgnrs.h
#define MainSet 0x1     /* hello.msg:4 */
#define MainHello 0x1   /* hello.msg:5 */
% gcc -o hello hello.c -I.
% cp hello.cat /usr/share/locale/de/LC_MESSAGES
% echo $LC_ALL
de
% ./hello
Hallo, Welt!
%
@end cartouche
@end smallexample

The call of the @code{gencat} program creates the missing header file
@file{msgnrs.h} as well as the message catalog binary.  The former is
used in the compilation of @file{hello.c} while the latter is placed in a
directory in which the @code{catopen} function will try to locate it.
Please check the @code{LC_ALL} environment variable and the default path
for @code{catopen} presented in the description above.


@node The Uniforum approach
@section The Uniforum approach to Message Translation

Sun Microsystems tried to standardize a different approach to message
translation in the Uniforum group.  A real standard was never defined,
but the interface was nevertheless used in Sun's operating systems.
Since this approach fits better in the development process of free
software it is also used throughout the GNU project, and the GNU
@file{gettext} package provides support for this outside the GNU C
Library.

The code of the @file{libintl} from GNU @file{gettext} is the same as
the code in the GNU C Library.  So the documentation in the GNU
@file{gettext} manual is also valid for the functionality here.  The
following text will describe the library functions in detail.  But the
numerous helper programs are not described in this manual.  Instead
people should read the GNU @file{gettext} manual
(@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
We will only give a short overview.

Though the @code{catgets} functions are available by default on more
systems, the @code{gettext} interface is at least as portable as the
former.  The GNU @file{gettext} package can be used wherever the
functions are not available.


@menu
* Message catalogs with gettext::  The @code{gettext} family of functions.
* Helper programs for gettext::    Programs to handle message catalogs
                                    for @code{gettext}.
@end menu


@node Message catalogs with gettext
@subsection The @code{gettext} family of functions

The paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, but
the basic functionality is equivalent.  There are functions of the
following categories:

@menu
* Translation with gettext::    What has to be done to translate a message.
* Locating gettext catalog::    How to determine which catalog to be used.
* Using gettextized software::  The possibilities of the user to influence
                                 the way @code{gettext} works.
@end menu

@node Translation with gettext
@subsubsection What has to be done to translate a message?

The @code{gettext} functions have a very simple interface.  The most
basic function just takes the string which shall be translated as the
argument and it returns the translation.  This is fundamentally
different from the @code{catgets} approach where an extra key is
necessary and the original string is only used for the error case.

If the string which has to be translated is the only argument this of
course means the string itself is the key.  I.e., the translation will
be selected based on the original string.  The message catalogs must
therefore contain the original strings plus one translation for any such
string.  The task of the @code{gettext} function is to compare the
argument string with the available strings in the catalog and return the
appropriate translation.  Of course this process is optimized so that
it is not more expensive than an access using an atomic key
like in @code{catgets}.

The @code{gettext} approach has some advantages but also some
disadvantages.  Please see the GNU @file{gettext} manual for a detailed
discussion of the pros and cons.

All the definitions and declarations for @code{gettext} can be found in
the @file{libintl.h} header file.  On systems where these functions are
not part of the C library they can be found in a separate library named
@file{libintl.a} (or accordingly different for shared libraries).

@deftypefun {char *} gettext (const char *@var{msgid})
The @code{gettext} function searches the currently selected message
catalogs for a string which is equal to @var{msgid}.  If there is such a
string available it is returned.  Otherwise the argument string
@var{msgid} is returned.

Please note that although the return value is @code{char *} the
returned string must not be changed.  This broken type results from the
history of the function and does not reflect the way the function should
be used.

Please note that above we wrote ``message catalogs'' (plural).  This is
a speciality of the GNU implementation of these functions and we will
say more about this in @ref{Locating gettext catalog}, when we
talk about the ways message catalogs are selected.

The @code{gettext} function does not modify the value of the global
@var{errno} variable.  This is necessary to make it possible to write
something like

@smallexample
  printf (gettext ("Operation failed: %m\n"));
@end smallexample

Here the @var{errno} value is used in the @code{printf} function while
processing the @code{%m} format element and if the @code{gettext}
function changed this value (it is called before @code{printf} is
called) we would get a wrong message.

So there is no easy way to detect a missing message catalog besides
comparing the argument string with the result.  But it is normally the
task of the user to react to missing catalogs.  The program cannot guess
when a message catalog is really necessary since a user who speaks
the language the program was developed in does not need any translation.
@end deftypefun

The remaining two functions to access the message catalog add some
functionality to select a message catalog which is not the default one.
This is important if parts of the program are developed independently.
Every part can have its own message catalog and all of them can be used
at the same time.  The C library itself is an example: internally it
uses the @code{gettext} functions but since it must not depend on a
currently selected default message catalog it must specify all ambiguous
information.

@deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
The @code{dgettext} function acts just like the @code{gettext}
function.  It only takes an additional first argument @var{domainname}
which guides the selection of the message catalogs which are searched
for the translation.  If the @var{domainname} parameter is the null
pointer the @code{dgettext} function is exactly equivalent to
@code{gettext} since the default value for the domain name is used.

As for @code{gettext} the return value type is @code{char *} which is an
anachronism.  The returned string must never be modified.
@end deftypefun

@deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes.  This argument @var{category} specifies the last
piece of information needed to locate the message catalog.  I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).

The @code{dgettext} function can be expressed in terms of
@code{dcgettext} by using

@smallexample
dcgettext (domain, string, LC_MESSAGES)
@end smallexample

@noindent
instead of

@smallexample
dgettext (domain, string)
@end smallexample

This also shows which values are expected for the third parameter.  One
has to use the available selectors for the categories available in
@file{locale.h}.  Normally the available values are @code{LC_CTYPE},
@code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
must not be used and even though the names might suggest this, there is
no relation to the environment variables of the same names.

The @code{dcgettext} function is only implemented for compatibility with
other systems which have @code{gettext} functions.  There is not really
any situation where it is necessary (or useful) to use a value other
than @code{LC_MESSAGES} for the @var{category} parameter.  We are
dealing with messages here and any other choice can only be irritating.

As for @code{gettext} the return value type is @code{char *} which is an
anachronism.  The returned string must never be modified.
@end deftypefun

When using the three functions above in a program, the @var{msgid}
argument is frequently a constant string.  So it is worthwhile to
optimize this case.  Thinking briefly about this, one will realize that
as long as no new message catalog is loaded the translation of a message
will not change.  I.e., the algorithm to determine the translation is
deterministic.

This is exactly what the optimizations implemented in the
@file{libintl.h} header use.  Whenever a program is compiled with
the GNU C compiler, optimization is selected, and the @var{msgid}
argument to @code{gettext}, @code{dgettext} or @code{dcgettext} is a
constant string, the actual function call will only be done the first
time the message is used and afterwards only if a new message catalog
was loaded, so that the result of the translation lookup might be
different.  See the @file{libintl.h} header file for details.  For the
user it is only important to know that the result is always the same,
independent of the compiler or compiler options in use.


@node Locating gettext catalog
@subsubsection How to determine which catalog to be used

The functions to retrieve the translations for a given message have a
remarkably simple interface.  But to give the user of the program the
opportunity to select exactly the translation s/he wants, and to give
the programmer the possibility to influence where the catalog files are
searched for, there is a quite complicated underlying mechanism which
controls all this.  The code is complicated, but its use is easy.

Basically we have two different tasks to perform which can also be
performed by the @code{catgets} functions:

@enumerate
@item
Locate the set of message catalogs.  There are a number of files for
different languages which all belong to the package.  Usually they
are all stored in the filesystem below a certain directory.

There can be arbitrarily many packages installed and they can follow
different guidelines for the placement of their files.

@item
Relative to the location specified by the package the actual translation
files must be searched, based on the wishes of the user.  I.e., for each
language the user selects the program should be able to locate the
appropriate file.
@end enumerate

This is the functionality required by the specifications for
@code{gettext} and this is also what the @code{catgets} functions are
able to do.  But there are some problems unresolved:

@itemize @bullet
@item
The language to be used can be specified in several different ways.
There is no generally accepted standard for this and the user always
expects the program to understand what s/he means.  E.g., to select the
German translation one could write @code{de}, @code{german}, or
@code{deutsch} and the program should always react the same.

@item
Sometimes the specification of the user is too detailed.  If s/he, e.g.,
specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
coded using the @w{ISO 8859-1} character set there is the possibility
that a message catalog matching this exactly is not available.  But
there could be a catalog matching @code{de} and if the character set
used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used.  (We call this @dfn{message
inheritance}.)

@item
If a catalog for a wanted language is not available it is not always the
second best choice to fall back on the language of the developer and
simply not translate any message.  Instead a user might be better able
to read the messages in another language and so the user of the program
should be able to define a precedence order of languages.
@end itemize

We can divide the configuration actions into two parts: the one is
performed by the programmer, the other by the user.  We will start with
the functions the programmer can use since the user configuration will
be based on this.

As the descriptions of the functions in the previous sections already
mention, separate sets of messages can be selected by a @dfn{domain
name}.  This is a simple string which should be unique for each program
part which uses a separate domain.  It is possible to use arbitrarily
many domains in one program at the same time.  E.g., the GNU C Library
itself uses a domain named @code{libc} while the program using the C
Library could use a domain named @code{foo}.  The important point is
that at any time exactly one domain is active.  This is controlled with
the following function.

@deftypefun {char *} textdomain (const char *@var{domainname})
The @code{textdomain} function sets the default domain, which is used in
all future @code{gettext} calls, to @var{domainname}.  Please note that
@code{dgettext} and @code{dcgettext} calls are not influenced if the
@var{domainname} parameter of these functions is not the null pointer.

Before the first call to @code{textdomain} the default domain is
@code{messages}.  This is the name specified in the specification of
the @code{gettext} API.  This name is as good as any other name.  No
program should ever really use a domain with this name since this can
only lead to problems.

The function returns the value which is from now on taken as the default
domain.  If the system ran out of memory the returned value is
@code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
Despite the return value type being @code{char *} the returned string
must not be changed.  It is allocated internally by the
@code{textdomain} function.

If the @var{domainname} parameter is the null pointer no new default
domain is set.  Instead the currently selected default domain is
returned.

If the @var{domainname} parameter is the empty string the default domain
is reset to its initial value, the domain with the name @code{messages}.
Using this possibility is questionable since the domain @code{messages}
really never should be used.
@end deftypefun

@deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
The @code{bindtextdomain} function can be used to specify the directory
which contains the message catalogs for domain @var{domainname} for the
different languages.  To be correct, this is the directory where the
hierarchy of directories is expected.  Details are explained below.

For the programmer it is important to note that the translations which
come with the program have to be placed in a directory hierarchy starting
at, say, @file{/foo/bar}.  Then the program should make a
@code{bindtextdomain} call to bind the domain for the current program to
this directory.  This makes sure the catalogs are found.  A correctly
running program does not depend on the user setting an environment
variable.

The @code{bindtextdomain} function can be used several times and if the
@var{domainname} argument is different the previously bound domains
will not be overwritten.

If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
returns the currently selected directory for the domain with the name
@var{domainname}.

The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory.  The string is
allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
@end deftypefun


@node Using gettextized software
@subsubsection User influence on @code{gettext}

The last sections described what the programmer can do to
internationalize the messages of the program.  But it is finally up to
the user to select the messages s/he wants to see.  S/He must understand
them.

The POSIX locale model uses the environment variables @code{LC_COLLATE},
@code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC},
and @code{LC_TIME} to select the locale which is to be used.  This way
the user can influence lots of functions.  As we mentioned above the
@code{gettext} functions also take advantage of this.

To understand how this happens it is necessary to take a look at the
various components of the filename which gets computed to locate a
message catalog.  It is composed as follows:

@smallexample
@var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
@end smallexample

The default value for @var{dir_name} is system specific.  It is computed
from the value given as the prefix while configuring the C library.
This value normally is @file{/usr} or @file{/}.  For the former the
complete @var{dir_name} is:

@smallexample
/usr/share/locale
@end smallexample

We can use @file{/usr/share} since the @file{.mo} files containing the
message catalogs are system independent, so all systems can use the same
files.  If the program executed the @code{bindtextdomain} function for
the message domain that is currently handled, the @code{dir_name}
component is exactly the value which was given to the function as
the second parameter.  I.e., @code{bindtextdomain} allows overriding
the only system dependent and fixed value to make it possible to
address files everywhere in the filesystem.

The @var{category} is the name of the locale category which was selected
in the program code.  For @code{gettext} and @code{dgettext} this is
always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
value of the third parameter.  As said above, using a category other
than @code{LC_MESSAGES} should be avoided.

The @var{locale} component is computed based on the category used.  Just
like for the @code{setlocale} function, this is where the user's
selection comes into play.  Some environment variables are examined in a
fixed order and the first environment variable set determines the return
value of the lookup process.  In detail, for the category @code{LC_xxx}
the following variables in this order are examined:

@table @code
@item LANGUAGE
@item LC_ALL
@item LC_xxx
@item LANG
@end table

This looks very familiar.  With the exception of the @code{LANGUAGE}
environment variable this is exactly the lookup order the
@code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
variable?

The reason is that the syntax of the values these variables can have is
different from what is expected by the @code{setlocale} function.  If we
set @code{LC_ALL} to a value following the extended syntax, the
@code{setlocale} function would no longer be able to use the value of
this variable.  An additional variable removes this problem, plus we
can select the language independently of the locale setting, which
sometimes is useful.

While for the @code{LC_xxx} variables the value should consist of
exactly one specification of a locale the @code{LANGUAGE} variable's
value can consist of a colon separated list of locale names.  The
attentive reader will realize that this is the way we manage to
implement one of our additional demands above: we want to be able to
specify an ordered list of languages.

Back to the constructed filename we have only one component missing.
The @var{domain_name} part is the name which was either registered using
the @code{textdomain} function or which was given to @code{dgettext} or
@code{dcgettext} as the first parameter.  Now it becomes obvious that a
good choice for the domain name in the program code is a string which is
closely related to the program/package name.  E.g., for the GNU C
Library the domain name is @code{libc}.

@noindent
A little piece of example code should show how the programmer is supposed
to work:

@smallexample
@{
  textdomain ("test-package");
  bindtextdomain ("test-package", "/usr/local/share/locale");
  puts (gettext ("Hello, world!"));
@}
@end smallexample

At the program start the default domain is @code{messages}.  The
@code{textdomain} call changes this to @code{test-package}.  The
@code{bindtextdomain} call specifies that the message catalogs for the
domain @code{test-package} can be found below the directory
@file{/usr/local/share/locale}.

If the user now sets the variable @code{LANGUAGE} in her/his environment
to @code{de}, the @code{gettext} function will try to use the
translations from the file

@smallexample
/usr/local/share/locale/de/LC_MESSAGES/test-package.mo
@end smallexample

From the above descriptions it should be clear which component of this
filename is determined by which source.

@c Describe:
@c * message inheritance
@c * locale aliasing
@c * character set dependence


@node Helper programs for gettext
@subsection Programs to handle message catalogs for @code{gettext}

@c Describe:
@c * msgfmt
@c * xgettext
@c Mention:
@c * other programs from GNU gettext