Diffstat (limited to 'manual/charset.texi')
-rw-r--r-- manual/charset.texi 136
1 files changed, 75 insertions, 61 deletions
diff --git a/manual/charset.texi b/manual/charset.texi
index deae7af08a..89a54d8e13 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -15,7 +15,7 @@ limitations of this approach became more apparent as more people
 grappled with non-Roman character sets, where not all the characters
 that make up a language's character set can be represented by @math{2^8}
 choices.  This chapter shows the functionality which was added to the C
-library to correctly support multiple character sets.
+library to support multiple character sets.
 
 @menu
 * Extended Char Intro::              Introduction to Extended Characters.
@@ -46,13 +46,13 @@ through whatever communication channel.  Examples of external
 representations include files lying in a directory that are going to be
 read and parsed.
 
-Traditionally there was no difference between the two representations.
-It was equally comfortable and useful to use the same one-byte
+Traditionally there has been no difference between the two representations.
+It was equally comfortable and useful to use the same single-byte
 representation internally and externally.  This changes with more and
 larger character sets.
 
 One of the problems to overcome with the internal representation is
-handling text which is externally encoded using different character
+handling text that is externally encoded using different character
 sets.  Assume a program which reads two texts and compares them using
 some metric.  The comparison can be usefully done only if the texts are
 internally kept in a common format.
@@ -69,14 +69,28 @@ than four bytes seem not to be necessary).
 As shown in some other part of this manual,
 @c !!! Ahem, wide char string functions are not yet covered -- drepper
 there exists a completely new family of functions which can handle texts
-of this kind in memory.  The most commonly used character set for such
-internal wide character representations are Unicode and @w{ISO 10646}.
-The former is a subset of the latter and used when wide characters are
-chosen to by 2 bytes (@math{= 16} bits) wide.  The standard names of the
-@cindex UCS2
-@cindex UCS4
-encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
-(@math{= 32} bits).
+of this kind in memory.  The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}
+(also known as UCS for Universal Character Set). Unicode was originally
+planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
+be a 31-bit large code space. The two standards are practically identical.
+They have the same character repertoire and code table, but Unicode specifies
+added semantics.  At the moment, only characters in the first @code{0x10000}
+code positions (the so-called Basic Multilingual Plane, BMP) have been
+assigned, but the assignment of more specialized characters outside this
+16-bit space is already in progress. A number of encodings have been
+defined for Unicode and @w{ISO 10646} characters:
+@cindex UCS-2
+@cindex UCS-4
+@cindex UTF-8
+@cindex UTF-16
+UCS-2 is a 16-bit word that can only represent characters
+from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
+and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
+ASCII characters are represented by ASCII bytes and non-ASCII characters
+by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
+of UCS-2 in which pairs of certain UCS-2 words can be used to encode
+non-BMP characters up to @code{0x10ffff}.
 
 To represent wide characters the @code{char} type is not suitable.  For
 this reason the @w{ISO C} standard introduces a new type which is
@@ -93,18 +107,18 @@ for multibyte character strings.  The type is defined in @file{stddef.h}.
 
 The @w{ISO C90} standard, where this type was introduced, does not say
 anything specific about the representation.  It only requires that this
-type is capable to store all elements of the basic character set.
+type is capable of storing all elements of the basic character set.
 Therefore it would be legitimate to define @code{wchar_t} as
 @code{char}.  This might make sense for embedded systems.
 
 But for GNU systems this type is always 32 bits wide.  It is therefore
-capable to represent all UCS4 value therefore covering all of @w{ISO
-10646}.  Some Unix systems define @code{wchar_t} as a 16 bit type and
+capable of representing all UCS-4 values and therefore covering all of
+@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16-bit type and
 thereby follow Unicode very strictly.  This is perfectly fine with the
 standard but it also means that to represent all characters from Unicode
-and @w{ISO 10646} one has to use surrogate character which is in fact a
-multi-wide-character encoding.  But this contradicts the purpose of the
-@code{wchar_t} type.
+and @w{ISO 10646} one has to use UTF-16 surrogate characters, which is in
+fact a multi-wide-character encoding.  But this contradicts the purpose
+of the @code{wchar_t} type.
 @end deftp
 
 @comment wchar.h
@@ -119,8 +133,8 @@ defined as @code{char} the type @code{wint_t} must be defined as
 @code{int} due to the parameter promotion.
 
 @pindex wchar.h
-This type is defined in @file{wchar.h} and got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h} and got introduced in
+@w{Amendment 1} to @w{ISO C90}.
 @end deftp
 
 As there are for the @code{char} data type there also exist macros
@@ -133,7 +147,7 @@ type @code{wchar_t}.
 The macro @code{WCHAR_MIN} evaluates to the minimum value representable
 by an object of type @code{wint_t}.
 
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
 @end deftypevr
 
 @comment wchar.h
@@ -142,7 +156,7 @@ This macro got introduced in the second amendment to @w{ISO C90}.
 The macro @code{WCHAR_MIN} evaluates to the maximum value representable
 by an object of type @code{wint_t}.
 
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
 @end deftypevr
 
 Another special wide character value is the equivalent to @code{EOF}.
@@ -180,7 +194,7 @@ are used.
 @end smallexample
 
 @pindex wchar.h
-This macro was introduced in the second amendment to @w{ISO C90} and is
+This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
 defined in @file{wchar.h}.
 @end deftypevr
 
@@ -198,7 +212,7 @@ oriented character set.
 @cindex multibyte character
 @cindex EBCDIC
    For all the above reasons, an external encoding which is different
-from the internal encoding is often used if the latter is UCS2 or UCS4.
+from the internal encoding is often used if the latter is UCS-2 or UCS-4.
 The external encoding is byte-based and can be chosen appropriately for
 the environment and for the texts to be handled.  There exist a variety
 of different character sets which can be used for this external
@@ -215,7 +229,7 @@ system calls have to be converted first anyhow.
 
 @itemize @bullet
 @item
-The simplest character sets are one-byte character sets.  There can be
+The simplest character sets are single-byte character sets.  There can be
 only up to 256 characters (for @w{8 bit} character sets) which is not
 sufficient to cover all languages but might be sufficient to handle a
 specific text.  Another reason to choose this is because of constraints
@@ -240,7 +254,7 @@ big advantage that whenever one can identify the beginning of the byte
 sequence of a character one can interpret a text correctly.  Examples of
 character sets using this policy are the various EUC character sets
 (used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
-or SJIS (Shift JIS, a Japanese encoding).
+or SJIS (Shift-JIS, a Japanese encoding).
 
 But there are also character sets using a state which is valid for more
 than one character and has to be changed by another byte sequence.
@@ -257,23 +271,23 @@ acute accent, following by lower-case `a') to get the ``small a with
 acute'' character.  To get the acute accent character on its on one has
 to write @code{0xc2 0x20} (the non-spacing acute followed by a space).
 
-This type of characters sets is quite frequently used in embedded
-systems such as video text.
+This type of character set is used in some embedded systems such as
+teletex.
 
 @item
 @cindex UTF-8
-Instead of converting the Unicode or @w{ISO 10646} text used internally
+Instead of converting the Unicode or @w{ISO 10646} text used internally,
 it is often also sufficient to simply use an encoding different than
-UCS2/UCS4.  The Unicode and @w{ISO 10646} standards even specify such an
+UCS-2/UCS-4.  The Unicode and @w{ISO 10646} standards even specify such an
 encoding: UTF-8.  This encoding is able to represent all of @w{ISO
-10464} 31 bits in a byte string of length one to seven.
+10646} 31 bits in a byte string of length one to six.
 
 @cindex UTF-7
 There were a few other attempts to encode @w{ISO 10646} such as UTF-7
 but UTF-8 is today the only encoding which should be used.  In fact,
-UTF-8 will hopefully soon be the only external which has to be
+UTF-8 will hopefully soon be the only external encoding that has to be
 supported.  It proves to be universally usable and the only disadvantage
-is that it favor Roman languages very much by making the byte string
+is that it favors Roman languages by making the byte string
 representation of other scripts (Cyrillic, Greek, Asian scripts) longer
 than necessary if using a specific character set for these scripts.
 Methods like the Unicode compression scheme can alleviate these
@@ -324,7 +338,7 @@ developing libraries (as opposed to applications).
 The second family of functions got introduced in the early Unix standards
 (XPG2) and is still part of the latest and greatest Unix standard:
 @w{Unix 98}.  It is also the most powerful and useful set of functions.
-But we will start with the functions defined in the second amendment to
+But we will start with the functions defined in @w{Amendment 1} to
 @w{ISO C90}.
 
 @node Restartable multibyte conversion
@@ -377,7 +391,7 @@ We already said above that the currently selected locale for the
 by the functions we are about to describe.  Each locale uses its own
 character set (given as an argument to @code{localedef}) and this is the
 one assumed as the external multibyte encoding.  The wide character
-character set always is UCS4, at least on GNU systems.
+character set always is UCS-4, at least on GNU systems.
 
 A characteristic of each multibyte character set is the maximum number
 of bytes which can be necessary to represent one character.  This
@@ -456,8 +470,8 @@ about the @dfn{shift state} needed from one call to a conversion
 function to another.
 
 @pindex wchar.h
-This type is defined in @file{wchar.h}.  It got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h}.  It got introduced in
+@w{Amendment 1} to @w{ISO C90}.
 @end deftp
 
 To use objects of this type the programmer has to define such objects
@@ -495,7 +509,7 @@ object is in the initial state the return value is nonzero.  Otherwise
 it is zero.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -559,7 +573,7 @@ which the state information is taken and the function also does not use
 any static state.
 
 @pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -608,7 +622,7 @@ value of this function is this character.  Otherwise the return value is
 @code{EOF}.
 
 @pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -655,7 +669,7 @@ a valid multibyte character also no value is stored, the global variable
 @code{(size_t) -1}.  The conversion state is afterwards undefined.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -733,7 +747,7 @@ object pointed to by @var{ps}.  If @var{ps} is a null pointer, a state
 object local to @code{mbrlen} is used.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
@@ -839,7 +853,7 @@ character.  So the caller has to make sure that there is enough space
 available, otherwise buffer overruns can occur.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
 @end deftypefun
 
@@ -977,7 +991,7 @@ byte in the input string was reached) or the address of the byte
 following the last converted multibyte character.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
 @end deftypefun
 
@@ -1058,7 +1072,7 @@ the initial shift state in case the terminating NUL wide character was
 converted.
 
 @pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
 declared in @file{wchar.h}.
 @end deftypefun
 
@@ -1231,8 +1245,8 @@ file_mbsrtowcs (int input, int output)
 @node Non-reentrant Conversion
 @section Non-reentrant Conversion Function
 
-The functions described in the last chapter are defined in the second
-amendment to @w{ISO C90}.  But the original @w{ISO C90} standard also
+The functions described in the last chapter are defined in
+@w{Amendment 1} to @w{ISO C90}.  But the original @w{ISO C90} standard also
 contained functions for character set conversion.  The reason that they
 are not described in the first place is that they are almost entirely
 useless.
@@ -1369,8 +1383,8 @@ The function @code{mblen} is declared in @file{stdlib.h}.
 
 For convenience reasons the @w{ISO C90} standard defines also functions
 to convert entire strings instead of single characters.  These functions
-suffer from the same problems as their reentrant counterparts from the
-second amendment to @w{ISO C90}; see @ref{Converting Strings}.
+suffer from the same problems as their reentrant counterparts from
+@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
 
 @comment stdlib.h
 @comment ISO
@@ -1513,7 +1527,7 @@ common that they operate on character sets which are not directly
 specified by the functions.  The multibyte encoding used is specified by
 the currently selected locale for the @code{LC_CTYPE} category.  The
 wide character set is fixed by the implementation (in the case of GNU C
-library it always is UCS4 encoded @w{ISO 10646}.
+library it always is UCS-4 encoded @w{ISO 10646}).
 
 This has of course several problems when it comes to general character
 conversion:
@@ -1806,12 +1820,12 @@ file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
   int result = 0;
   iconv_t cd;
 
-  cd = iconv_open ("UCS4", charset);
+  cd = iconv_open ("UCS-4", charset);
   if (cd == (iconv_t) -1)
     @{
       /* @r{Something went wrong.}  */
       if (errno == EINVAL)
-        error (0, 0, "conversion from `%s' to `UCS4' no available",
+        error (0, 0, "conversion from '%s' to 'UCS-4' not available",
                charset);
       else
         perror ("iconv_open");
@@ -2024,7 +2038,7 @@ will succeed but how to find @math{@cal{B}}?
 
 Unfortunately, the answer is: there is no general solution.  On some
 systems guessing might help.  On those systems most character sets can
-convert to and from UTF8 encoded @w{ISO 10646} or Unicode text.
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text.
 Beside this only some very system-specific methods can help.  Since the
 conversion functions come from loadable modules and these modules must
 be stored somewhere in the filesystem, one @emph{could} try to find them
@@ -2082,7 +2096,7 @@ wanted conversion.
 
 @cindex triangulation
 This is achieved by providing for each character set a conversion from
-and to UCS4 encoded @w{ISO 10646}.  Using @w{ISO 10646} as an
+and to UCS-4 encoded @w{ISO 10646}.  Using @w{ISO 10646} as an
 intermediate representation it is possible to @dfn{triangulate}, i.e.,
 converting with an intermediate representation.
 
@@ -2210,15 +2224,15 @@ ending with @code{//}.  There often is a character set named
 @code{INTERNAL} mentioned.  From the discussion above and the chosen
 name it should have become clear that this is the name for the
 representation used in the intermediate step of the triangulation.  We
-have said that this is UCS4 but actually it is not quite right.  The
-UCS4 specification also includes the specification of the byte ordering
-used.  Since a UCS4 value consists of four bytes a stored value is
+have said that this is UCS-4 but actually it is not quite right.  The
+UCS-4 specification also includes the specification of the byte ordering
+used.  Since a UCS-4 value consists of four bytes a stored value is
 effected by byte ordering.  The internal representation is @emph{not}
-the same as UCS4 in case the byte ordering of the processor (or at least
-the running process) is not the same as the one required for UCS4.  This
+the same as UCS-4 in case the byte ordering of the processor (or at least
+the running process) is not the same as the one required for UCS-4.  This
 is done for performance reasons as one does not want to perform
 unnecessary byte-swapping operations if one is not interested in actually
-seeing the result in UCS4.  To avoid trouble with endianess the internal
+seeing the result in UCS-4.  To avoid trouble with endianness the internal
 representation consistently is named @code{INTERNAL} even on big-endian
 systems where the representations are identical.
 
@@ -2570,7 +2584,7 @@ One interesting thing is the initialization of the @code{__min_} and
 character can consist of one to four bytes.  Therefore the
 @code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
 this way.  The output is always the @code{INTERNAL} character set (aka
-UCS4) and therefore each character consists of exactly four bytes.  For
+UCS-4) and therefore each character consists of exactly four bytes.  For
 the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
 account that escape sequences might be necessary to switch the character
 sets.  Therefore the @code{__max_needed_to} element for this direction
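
Note on the encodings named in the patched text (UCS-4, UTF-8, UTF-16): the following is a minimal sketch, not part of the patch and not code from the C library. It assumes a code point is already available as a 32-bit UCS-4 value and shows how it maps to a UTF-8 byte sequence and to UTF-16 words. The function names utf8_encode and utf16_encode are hypothetical. The UTF-8 routine follows the 31-bit form with sequences of one to six bytes that the patched text describes; the later definition in RFC 3629 restricts UTF-8 to four bytes and to code points up to 0x10ffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode the UCS-4 value CP as UTF-8 in the
   original 1-6 byte form.  BUF must have room for six bytes.
   Returns the number of bytes written, or 0 if CP needs more than
   31 bits.  */
size_t
utf8_encode (uint32_t cp, unsigned char *buf)
{
  size_t len, i;

  assert (buf != NULL);

  if (cp < 0x80)
    {
      /* ASCII characters are represented by themselves.  */
      buf[0] = (unsigned char) cp;
      return 1;
    }

  if (cp < 0x800)
    len = 2;
  else if (cp < 0x10000)
    len = 3;
  else if (cp < 0x200000)
    len = 4;
  else if (cp < 0x4000000)
    len = 5;
  else if (cp <= 0x7fffffff)
    len = 6;
  else
    return 0;

  /* Continuation bytes have the form 10xxxxxx; fill them from the
     end, six payload bits at a time.  */
  for (i = len - 1; i > 0; --i)
    {
      buf[i] = (unsigned char) (0x80 | (cp & 0x3f));
      cp >>= 6;
    }

  /* The lead byte has LEN one-bits, a zero bit, and the remaining
     payload bits.  */
  buf[0] = (unsigned char) ((0xff00 >> len) | cp);
  return len;
}

/* Hypothetical helper: encode CP as UTF-16.  Returns the number of
   16-bit words written to OUT (1 or 2), or 0 if CP is a surrogate
   value or lies above 0x10ffff.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp < 0x10000)
    {
      /* BMP character: one word, but the surrogate range itself
         does not encode characters.  */
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;
      out[0] = (uint16_t) cp;
      return 1;
    }

  if (cp > 0x10ffff)
    return 0;

  /* Non-BMP character: split the 20-bit offset into a high and a
     low surrogate.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xd800 | (cp >> 10));
  out[1] = (uint16_t) (0xdc00 | (cp & 0x3ff));
  return 2;
}
```

A caller would pass a buffer of six bytes (or two 16-bit words) and use the return value as the length of the encoded sequence. The second function also illustrates the point made in the wchar_t hunk: with 16-bit units, a non-BMP character needs two units.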