about summary refs log tree commit diff
diff options
context:
space:
mode:
-rw-r--r--ChangeLog11
-rw-r--r--manual/charset.texi71
-rw-r--r--manual/examples/mbstouwcs.c49
3 files changed, 88 insertions, 43 deletions
diff --git a/ChangeLog b/ChangeLog
index 9b30d0ce3b..c36b56a8e9 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,16 @@
 2018-04-05  Florian Weimer  <fweimer@redhat.com>
 
+	* manual/examples/mbstouwcs.c (mbstouwcs): Fix loop termination,
+	integer overflow, memory leak on error, and indeterminate errno
+	value.  Add a null wide character to terminate the result string.
+	* manual/charset.texi (Converting a Character): Mention embedded
+	null bytes in the mbrtowc input string.  Explain what happens in
+	the -2 result case.  Do not claim that mbrtowc is simple or
+	obvious to use.  Adjust the description of the code example.  Use
+	@code, not @var, for concrete variables.
+
+2018-04-05  Florian Weimer  <fweimer@redhat.com>
+
 	* manual/examples/mbstouwcs.c: New file.
 	* manual/charset.texi (Converting a Character): Include it.
 
diff --git a/manual/charset.texi b/manual/charset.texi
index 6831ebec27..a63d67045f 100644
--- a/manual/charset.texi
+++ b/manual/charset.texi
@@ -643,8 +643,8 @@ and they also do not require it to be in the initial state.
 @cindex stateful
 The @code{mbrtowc} function (``multibyte restartable to wide
 character'') converts the next multibyte character in the string pointed
-to by @var{s} into a wide character and stores it in the wide character
-string pointed to by @var{pwc}.  The conversion is performed according
+to by @var{s} into a wide character and stores it in the location
+pointed to by @var{pwc}.  The conversion is performed according
 to the locale currently selected for the @code{LC_CTYPE} category.  If
 the conversion for the character set used in the locale requires a state,
 the multibyte string is interpreted in the state represented by the
@@ -652,7 +652,7 @@ object pointed to by @var{ps}.  If @var{ps} is a null pointer, a static,
 internal state variable used only by the @code{mbrtowc} function is
 used.
 
-If the next multibyte character corresponds to the NUL wide character,
+If the next multibyte character corresponds to the null wide character,
 the return value of the function is @math{0} and the state object is
 afterwards in the initial state.  If the next @var{n} or fewer bytes
 form a correct multibyte character, the return value is the number of
@@ -665,50 +665,59 @@ by @var{pwc} if @var{pwc} is not null.
 If the first @var{n} bytes of the multibyte string possibly form a valid
 multibyte character but there are more than @var{n} bytes needed to
 complete it, the return value of the function is @code{(size_t) -2} and
-no value is stored.  Please note that this can happen even if @var{n}
-has a value greater than or equal to @code{MB_CUR_MAX} since the input
-might contain redundant shift sequences.
+no value is stored in @code{*@var{pwc}}.  The conversion state is
+updated and all @var{n} input bytes are consumed and should not be
+submitted again.  Please note that this can happen even if @var{n} has a
+value greater than or equal to @code{MB_CUR_MAX} since the input might
+contain redundant shift sequences.
 
 If the first @code{n} bytes of the multibyte string cannot possibly form
 a valid multibyte character, no value is stored, the global variable
 @code{errno} is set to the value @code{EILSEQ}, and the function returns
 @code{(size_t) -1}.  The conversion state is afterwards undefined.
 
+As specified, the @code{mbrtowc} function could deal with multibyte
+sequences which contain embedded null bytes (which happens in Unicode
+encodings such as UTF-16), but @theglibc{} does not support such
+multibyte encodings.  When encountering a null input byte, the function
+will either return zero, or return @code{(size_t) -1)} and report a
+@code{EILSEQ} error.  The @code{iconv} function can be used for
+converting between arbitrary encodings.  @xref{Generic Conversion
+Interface}.
+
 @pindex wchar.h
 @code{mbrtowc} was introduced in @w{Amendment 1} to @w{ISO C90} and
 is declared in @file{wchar.h}.
 @end deftypefun
 
-Use of @code{mbrtowc} is straightforward.  A function that copies a
-multibyte string into a wide character string while at the same time
-converting all lowercase characters into uppercase could look like this
-(this is not the final version, just an example; it has no error
-checking, and sometimes leaks memory):
+A function that copies a multibyte string into a wide character string
+while at the same time converting all lowercase characters into
+uppercase could look like this:
 
 @smallexample
 @include mbstouwcs.c.texi
 @end smallexample
 
-The use of @code{mbrtowc} should be clear.  A single wide character is
-stored in @code{@var{tmp}[0]}, and the number of consumed bytes is stored
-in the variable @var{nbytes}.  If the conversion is successful, the
-uppercase variant of the wide character is stored in the @var{result}
-array and the pointer to the input string and the number of available
-bytes is adjusted.
-
-The only non-obvious thing about @code{mbrtowc} might be the way memory
-is allocated for the result.  The above code uses the fact that there
-can never be more wide characters in the converted result than there are
-bytes in the multibyte input string.  This method yields a pessimistic
-guess about the size of the result, and if many wide character strings
-have to be constructed this way or if the strings are long, the extra
-memory required to be allocated because the input string contains
-multibyte characters might be significant.  The allocated memory block can
-be resized to the correct size before returning it, but a better solution
-might be to allocate just the right amount of space for the result right
-away.  Unfortunately there is no function to compute the length of the wide
-character string directly from the multibyte string.  There is, however, a
-function that does part of the work.
+In the inner loop, a single wide character is stored in @code{wc}, and
+the number of consumed bytes is stored in the variable @code{nbytes}.
+If the conversion is successful, the uppercase variant of the wide
+character is stored in the code{result} array and the pointer to the
+input string and the number of available bytes is adjusted.  If the
+@code{mbrtowc} function returns zero, the null input byte has not been
+converted, so it must be stored explicitly in the result.
+
+The above code uses the fact that there can never be more wide
+characters in the converted result than there are bytes in the multibyte
+input string.  This method yields a pessimistic guess about the size of
+the result, and if many wide character strings have to be constructed
+this way or if the strings are long, the extra memory required to be
+allocated because the input string contains multibyte characters might
+be significant.  The allocated memory block can be resized to the
+correct size before returning it, but a better solution might be to
+allocate just the right amount of space for the result right away.
+Unfortunately there is no function to compute the length of the wide
+character string directly from the multibyte string.  There is, however,
+a function that does part of the work.
 
 @deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *@var{ps})
 @standards{ISO, wchar.h}
diff --git a/manual/examples/mbstouwcs.c b/manual/examples/mbstouwcs.c
index 5d223da2ae..c94e1fa790 100644
--- a/manual/examples/mbstouwcs.c
+++ b/manual/examples/mbstouwcs.c
@@ -1,3 +1,4 @@
+#include <stdbool.h>
 #include <stdlib.h>
 #include <string.h>
 #include <wchar.h>
@@ -7,22 +8,46 @@
 wchar_t *
 mbstouwcs (const char *s)
 {
-  size_t len = strlen (s);
-  wchar_t *result = malloc ((len + 1) * sizeof (wchar_t));
+  /* Include the null terminator in the conversion.  */
+  size_t len = strlen (s) + 1;
+  wchar_t *result = reallocarray (NULL, len, sizeof (wchar_t));
+  if (result == NULL)
+    return NULL;
+
   wchar_t *wcp = result;
-  wchar_t tmp[1];
   mbstate_t state;
-  size_t nbytes;
-
   memset (&state, '\0', sizeof (state));
-  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
+
+  while (true)
     {
-      if (nbytes >= (size_t) -2)
-        /* Invalid input string.  */
-        return NULL;
-      *wcp++ = towupper (tmp[0]);
-      len -= nbytes;
-      s += nbytes;
+      wchar_t wc;
+      size_t nbytes = mbrtowc (&wc, s, len, &state);
+      if (nbytes == 0)
+        {
+          /* Terminate the result string.  */
+          *wcp = L'\0';
+          break;
+        }
+      else if (nbytes == (size_t) -2)
+        {
+          /* Truncated input string.  */
+          errno = EILSEQ;
+          free (result);
+          return NULL;
+        }
+      else if (nbytes == (size_t) -1)
+        {
+          /* Some other error (including EILSEQ).  */
+          free (result);
+          return NULL;
+        }
+      else
+        {
+          /* A character was converted.  */
+          *wcp++ = towupper (wc);
+          len -= nbytes;
+          s += nbytes;
+        }
     }
   return result;
 }