Diffstat (limited to 'manual/charset.texi')
-rw-r--r-- | manual/charset.texi | 68 |
1 files changed, 34 insertions, 34 deletions
diff --git a/manual/charset.texi b/manual/charset.texi index 68aecd3f1e..147d9c579a 100644 --- a/manual/charset.texi +++ b/manual/charset.texi @@ -31,7 +31,7 @@ library to support multiple character sets. @node Extended Char Intro @section Introduction to Extended Characters -A variety of solutions is available to overcome the differences between +A variety of solutions are available to overcome the differences between character sets with a 1:1 relation between bytes and characters and character sets with ratios of 2:1 or 4:1. The remainder of this section gives a few examples to help understand the design decisions @@ -202,7 +202,7 @@ defined in @file{wchar.h}. @end deftypevr -These internal representations present problems when it comes to storing +These internal representations present problems when it comes to storage and transmittal. Because each single wide character consists of more than one byte, they are affected by byte-ordering. Thus, machines with different endianesses would see different values when accessing the same @@ -389,7 +389,7 @@ the conversion is necessary take a look at the @code{iconv} functions @subsection Selecting the conversion and its properties We already said above that the currently selected locale for the -@code{LC_CTYPE} category decides about the conversion that is performed +@code{LC_CTYPE} category decides the conversion that is performed by the functions we are about to describe. Each locale uses its own character set (given as an argument to @code{localedef}) and this is the one assumed as the external multibyte encoding. The wide character @@ -549,7 +549,7 @@ necessary output code (@pxref{Converting Strings}). Please note that with @theglibc{} it is not necessary to perform this extra action for the conversion from multibyte text to wide character text since the wide character encoding is not stateful. 
But there is nothing mentioned in -any standard that prohibits making @code{wchar_t} using a stateful +any standard that prohibits making @code{wchar_t} use a stateful encoding. @node Converting a Character @@ -559,7 +559,7 @@ The most fundamental of the conversion functions are those dealing with single characters. Please note that this does not always mean single bytes. But since there is very often a subset of the multibyte character set that consists of single byte sequences, there are -functions to help with converting bytes. Frequently, ASCII is a subpart +functions to help with converting bytes. Frequently, ASCII is a subset of the multibyte character set. In such a scenario, each ASCII character stands for itself, and all other characters have at least a first byte that is beyond the range @math{0} to @math{127}. @@ -596,7 +596,7 @@ and is declared in @file{wchar.h}. Despite the limitation that the single byte value is always interpreted in the initial state, this function is actually useful most of the time. Most characters are either entirely single-byte character sets or they -are extension to ASCII. But then it is possible to write code like this +are extensions to ASCII. But then it is possible to write code like this (not that this specific example is very useful): @smallexample @@ -643,7 +643,7 @@ value of this function is this character. Otherwise the return value is is declared in @file{wchar.h}. @end deftypefun -There are more general functions to convert single character from +There are more general functions to convert single characters from multibyte representation to wide characters and vice versa. These functions pose no limit on the length of the multibyte representation and they also do not require it to be in the initial state. @@ -731,7 +731,7 @@ bytes is adjusted. The only non-obvious thing about @code{mbrtowc} might be the way memory is allocated for the result. 
The above code uses the fact that there -can never be more wide characters in the converted results than there are +can never be more wide characters in the converted result than there are bytes in the multibyte input string. This method yields a pessimistic guess about the size of the result, and if many wide character strings have to be constructed this way or if the strings are long, the extra @@ -813,7 +813,7 @@ Therefore, the @code{mbrlen} function will never read invalid memory. Now that this function is available (just to make this clear, this function is @emph{not} part of @theglibc{}) we can compute the -number of wide character required to store the converted multibyte +number of wide characters required to store the converted multibyte character string @var{s} using @smallexample @@ -879,7 +879,7 @@ multibyte'') converts a single wide character into a multibyte string corresponding to that wide character. If @var{s} is a null pointer, the function resets the state stored in -the objects pointed to by @var{ps} (or the internal @code{mbstate_t} +the object pointed to by @var{ps} (or the internal @code{mbstate_t} object) to the initial state. This can also be achieved by a call like this: @@ -1020,7 +1020,7 @@ extensions that can help in some important situations. @deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) @safety{@prelim{}@mtunsafe{@mtasurace{:mbsrtowcs/!ps}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsmem{} @acsfd{}}} The @code{mbsrtowcs} function (``multibyte string restartable to wide -character string'') converts a NUL-terminated multibyte character +character string'') converts the NUL-terminated multibyte character string at @code{*@var{src}} into an equivalent wide character string, including the NUL wide character at the end. 
The conversion is started using the state information from the object pointed to by @var{ps} or @@ -1061,7 +1061,7 @@ declared in @file{wchar.h}. The definition of the @code{mbsrtowcs} function has one important limitation. The requirement that @var{dst} has to be a NUL-terminated string provides problems if one wants to convert buffers with text. A -buffer is normally no collection of NUL-terminated strings but instead a +buffer is not normally a collection of NUL-terminated strings but instead a continuous collection of lines, separated by newline characters. Now assume that a function to convert one line from a buffer is needed. Since the line is not NUL-terminated, the source pointer cannot directly point @@ -1078,7 +1078,7 @@ guess. @cindex stateful There is still a problem with the method of NUL-terminating a line right after the newline character, which could lead to very strange results. -As said in the description of the @code{mbsrtowcs} function above the +As said in the description of the @code{mbsrtowcs} function above, the conversion state is guaranteed to be in the initial shift state after processing the NUL byte at the end of the input string. But this NUL byte is not really part of the text (i.e., the conversion state after @@ -1110,7 +1110,7 @@ multibyte string'') converts the NUL-terminated wide character string at stores the result in the array pointed to by @var{dst}. The NUL wide character is also converted. The conversion starts in the state described in the object pointed to by @var{ps} or by a state object -locally to @code{wcsrtombs} in case @var{ps} is a null pointer. If +local to @code{wcsrtombs} in case @var{ps} is a null pointer. If @var{dst} is a null pointer, the conversion is performed as usual but the result is not available. If all characters of the input string were successfully converted and if @var{dst} is not a null pointer, the @@ -1123,13 +1123,13 @@ variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. 
Another reason for a premature stop is if @var{dst} is not a null pointer and the next converted character would require more than @var{len} bytes in total to the array @var{dst}. In this case (and if -@var{dest} is not a null pointer) the pointer pointed to by @var{src} is +@var{dst} is not a null pointer) the pointer pointed to by @var{src} is assigned a value pointing to the wide character right after the last one successfully converted. Except in the case of an encoding error the return value of the @code{wcsrtombs} function is the number of bytes in all the multibyte -character sequences stored in @var{dst}. Before returning the state in +character sequences stored in @var{dst}. Before returning, the state in the object pointed to by @var{ps} (or the internal object in case @var{ps} is a null pointer) is updated to reflect the state after the last conversion. The state is the initial shift state in case the @@ -1158,11 +1158,11 @@ This new parameter specifies how many bytes at most can be used from the multibyte character string. In other words, the multibyte character string @code{*@var{src}} need not be NUL-terminated. But if a NUL byte is found within the @var{nmc} first bytes of the string, the conversion -stops here. +stops there. This function is a GNU extension. It is meant to work around the problems mentioned above. Now it is possible to convert a buffer with -multibyte character text piece for piece without having to care about +multibyte character text piece by piece without having to care about inserting NUL bytes and the effect of NUL bytes on the conversion state. @end deftypefun @@ -1603,7 +1603,7 @@ common that they operate on character sets that are not directly specified by the functions. The multibyte encoding used is specified by the currently selected locale for the @code{LC_CTYPE} category. The wide character set is fixed by the implementation (in the case of @theglibc{} -it is always UCS-4 encoded @w{ISO 10646}. 
+it is always UCS-4 encoded @w{ISO 10646}). This has of course several problems when it comes to general character conversion: @@ -1681,7 +1681,7 @@ This data type is an abstract type defined in @file{iconv.h}. The user must not assume anything about the definition of this type; it must be completely opaque. -Objects of this type can get assigned handles for the conversions using +Objects of this type can be assigned handles for the conversions using the @code{iconv} functions. The objects themselves need not be freed, but the conversions for which the handles stand for have to. @end deftp @@ -1716,7 +1716,7 @@ returns @code{(iconv_t) -1}. In this case the global variable @item EMFILE The process already has @code{OPEN_MAX} file descriptors open. @item ENFILE -The system limit of open file is reached. +The system limit of open files is reached. @item ENOMEM Not enough memory to carry out the operation. @item EINVAL @@ -1778,7 +1778,7 @@ the @code{iconv_open} function. If the function call was successful the return value is @math{0}. Otherwise it is @math{-1} and @code{errno} is set appropriately. -Defined error are: +Defined errors are: @table @code @item EBADF @@ -1847,7 +1847,7 @@ stop is that the output buffer is full. And the third reason is that the input contains invalid characters. In all of these cases the buffer pointers after the last successful -conversion, for input and output buffer, are stored in @var{inbuf} and +conversion, for the input and output buffers, are stored in @var{inbuf} and @var{outbuf}, and the available room in each buffer is stored in @var{inbytesleft} and @var{outbytesleft}. @@ -2087,7 +2087,7 @@ possibilities. This does not mean 200 different character sets are supported; for example, conversions from one character set to a set of 10 others might count as 10 conversions. Together with the other direction this makes 20 conversion possibilities used up by one character set. One -can imagine the thin coverage these platform provide. 
Some Unix vendors +can imagine the thin coverage these platforms provide. Some Unix vendors even provide only a handful of conversions, which renders them useless for almost all uses. @@ -2133,7 +2133,7 @@ will succeed, but how to find @math{@cal{B}}? Unfortunately, the answer is: there is no general solution. On some systems guessing might help. On those systems most character sets can -convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Beside +convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides this only some very system-specific methods can help. Since the conversion functions come from loadable modules and these modules must be stored somewhere in the filesystem, one @emph{could} try to find them @@ -2143,7 +2143,7 @@ and whether there is an indirect route from @math{@cal{A}} to This example shows one of the design errors of @code{iconv} mentioned above. It should at least be possible to determine the list of available -conversion programmatically so that if @code{iconv_open} says there is no +conversions programmatically so that if @code{iconv_open} says there is no such conversion, one could make sure this also is true for indirect routes. @@ -2235,7 +2235,7 @@ achieve the same result as when using the real character set name. This is quite important as a character set has often many different names. There is normally an official name but this need not correspond to -the most popular name. Beside this many character sets have special +the most popular name. Besides this many character sets have special names that are somehow constructed. For example, all character sets specified by the ISO have an alias of the form @code{ISO-IR-@var{nnn}} where @var{nnn} is the registration number. This allows programs that @@ -2371,7 +2371,7 @@ itself). @itemx const char *__modname @itemx int __counter All these elements of the structure are used internally in the C library -to coordinate loading and unloading the shared. 
One must not expect any +to coordinate loading and unloading the shared object. One must not expect any of the other elements to be available or initialized. @item const char *__from_name @@ -2438,7 +2438,7 @@ These elements specify the output buffer for the conversion step. The @code{__outbuf} element points to the beginning of the buffer, and @code{__outbufend} points to the byte following the last byte in the buffer. The conversion function must not assume anything about the size -of the buffer but it can be safely assumed the there is room for at +of the buffer but it can be safely assumed there is room for at least one complete character in the output buffer. Once the conversion is finished, if the conversion is the last step, the @@ -2673,7 +2673,7 @@ Next, a data structure, which contains the necessary information about which conversion is selected, is allocated. The data structure @code{struct iso2022jp_data} is locally defined since, outside the module, this data is not used at all. Please note that if all four -conversions this modules supports are requested there are four data +conversions this module supports are requested there are four data blocks. One interesting thing is the initialization of the @code{__min_} and @@ -2686,7 +2686,7 @@ the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into account that escape sequences might be necessary to switch the character sets. Therefore the @code{__max_needed_to} element for this direction gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the -two bytes needed for the escape sequences to single the switching. The +two bytes needed for the escape sequences to signal the switching. 
The asymmetry in the maximum values for the two directions can be explained easily: when reading ISO-2022-JP text, escape sequences can be handled alone (i.e., it is not necessary to process a real character since the @@ -2694,7 +2694,7 @@ effect of the escape sequence can be recorded in the state information). The situation is different for the other direction. Since it is in general not known which character comes next, one cannot emit escape sequences to change the state in advance. This means the escape -sequences that have to be emitted together with the next character. +sequences have to be emitted together with the next character. Therefore one needs more room than only for the character itself. The possible return values of the initialization function are: @@ -2740,7 +2740,7 @@ conversion function. @comment gconv.h @comment GNU @deftypevr {Data type} int {(*__gconv_fct)} (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int) -The conversion function can be called for two basic reason: to convert +The conversion function can be called for two basic reasons: to convert text or to reset the state. From the description of the @code{iconv} function it can be seen why the flushing mode is necessary. What mode is selected is determined by the sixth argument, an integer. This @@ -2817,7 +2817,7 @@ therefore will look similar to this: But this is not yet all. Once the function call returns the conversion function might have some more to do. If the return value of the function is @code{__GCONV_EMPTY_INPUT}, more room is available in the output -buffer. Unless the input buffer is empty the conversion, functions start +buffer. Unless the input buffer is empty, the conversion functions start all over again and process the rest of the input buffer. If the return value is not @code{__GCONV_EMPTY_INPUT}, something went wrong and we have to recover from this. |