0001007: iconv function not allowed to fail to convert valid sequences

ID	Project	Category	View Status	Date Submitted	Last Update

0001007	1003.1(2013)/Issue7+TC1	Base Definitions and Headers	public	2015-11-16 20:37	2024-06-11 09:02

Reporter	sebor	Assigned To	ajosey
Priority	normal	Severity	Objection	Type	Omission
Status	Closed	Resolution	Accepted As Marked

Name	Martin Sebor
Organization
User Reference
Section	iconv function
Page Number	1109-1110
Line Number
Interp Status	---
Final Accepted Text	0001007:0003330


Summary	0001007: iconv function not allowed to fail to convert valid sequences
Description	The specification for the iconv() function assumes that every input sequence that is valid in the source codeset is convertible to some sequence in the destination codeset. In particular, the specification doesn't allow the function to fail when a valid sequence in the source codeset cannot be represented in the destination codeset. As an example where this assumption doesn't hold, consider a conversion from UTF-8 to ISO-8859 where a large number of source characters don't have equivalents in the destination codeset. A survey of a subset of existing implementations shows that they fail with EILSEQ in such cases, despite the specification defining the error condition as "Input conversion stopped due to an input byte that does not belong to the input codeset."
Desired Action	The standard needs to be corrected to require the iconv function to fail when "a valid sequence of characters in the source codeset doesn't have a corresponding valid sequence of characters in the destination codeset." Ideally, the function would make it possible to distinguish this conversion failure from other kinds of failure, including EILSEQ, by setting errno to a value that's distinct from all the currently used error values. Unfortunately, that doesn't seem to be how existing implementations behave.
Tags	issue8
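
The situation in the Description can be probed with a short C helper. This is only a sketch: `try_convert` is a hypothetical name, not part of any standard API, and the codeset names "UTF-8" and "ASCII" are assumed to be accepted by the local iconv_open(). It performs one iconv() call and reports the return value, the saved errno, and how many input bytes were consumed, so the caller can see whether a valid but unrepresentable character was reported as an error or counted as a non-identical conversion.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only).
 * Converts the NUL-terminated string src from fromcode to tocode into out.
 * Returns the iconv() return value; on failure returns (size_t)-1 and
 * stores the errno value in *err. *consumed receives the number of input
 * bytes that were consumed before iconv() returned. */
static size_t try_convert(const char *tocode, const char *fromcode,
                          const char *src, char *out, size_t outsz,
                          size_t *consumed, int *err)
{
    assert(src != NULL && out != NULL && outsz > 0);
    *consumed = 0;
    *err = 0;

    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1) {
        *err = errno;
        out[0] = '\0';
        return (size_t)-1;
    }

    char *in = (char *)src;        /* iconv() takes char **; input is not modified */
    size_t inleft = strlen(src);
    size_t total = inleft;
    char *outp = out;
    size_t outleft = outsz - 1;    /* keep one byte for the terminator */

    size_t rv = iconv(cd, &in, &inleft, &outp, &outleft);
    if (rv == (size_t)-1)
        *err = errno;              /* save before iconv_close() can change it */

    *outp = '\0';
    *consumed = total - inleft;
    iconv_close(cd);
    return rv;
}
```

Calling it with "hall\xc3\xb6" "chen" (UTF-8 for 'hallöchen') and a tocode of "ASCII" either fails with EILSEQ or succeeds with a non-zero count, depending on the implementation, which is exactly the divergence this report is about. The sketch does not flush shift state, so it is only suitable for stateless target codesets.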

steffen 2015-11-16 22:38 reporter bugnote:0002965	The description does say: "If iconv( ) encounters a character in the input buffer that is valid, but for which an identical character does not exist in the target codeset, iconv( ) shall perform an implementation-defined conversion on this character." It would be nice if the "implementation-defined conversion" could be made configurable, so that a user could choose the action himself, e.g., "stop-before", "stop-after", "replace" and "skip", or whatever. For the "stop" cases an error would be reported. Unfortunately iconv doesn't take a struct that could be extended so that "stop-before" and "stop-after" could be united and/or report the number of input bytes that form the character in question. At least UTF-8 is self-synchronizing. When you look around, really odd things happen, like placing reset sequences on error etc.; how could that help by itself? Simply skipping over malformed input with a second converter and continuing with the old state is just as weird.

shware_systems 2015-11-17 21:39 reporter bugnote:0002968 Last edited: 2015-11-17 21:41	EILSEQ is also specified in mbrtowc() as what is to be returned in errno when an input sequence doesn't have a defined wchar representation, as a subcase of the -1 return code. I expect the versions of iconv() you tested return it to be consistent with that usage. I agree it can be argued that this improperly overloads the code for iconv(), since it isn't specified what non-zero value, like -1, to return in inbytesleft to announce the difference. It can also be thought of as a quality of implementation issue, as that hook is implicitly there to be taken advantage of. Whether this is permitted as a substitution of nothing or an input-ending \0, rather than a conversion to a graphic char like '?', also looks open to interpretation. Optional behaviors are expected to be turned on by encoding them in the strings passed to iconv_open(). That is the 'structure' provided, anyways, as all the valid values are implementation-defined.

sebor 2015-11-18 00:03 reporter bugnote:0002969	I don't think mbtowc can (or is expected to) fail to convert a valid sequence into the internal wchar_t encoding. The wchar_t encoding is either implementation-defined or Unicode (when the __STDC_ISO_10646__ macro is defined). Unicode is a superset of all other encodings so a conversion error should never occur. For other internal encodings, I believe it's up to the implementation to choose one such that this holds as well. Unlike mbtowc, the iconv function lets the user specify both the source and the destination encodings, both of which can be and often are multibyte and without a 1-to-1 mapping. iconv can then easily fail to convert some valid sequences. Since this is a common condition, the function should be allowed (and IMO required) to report such failures. The problem is that the specification doesn't make reporting this basic condition possible (there's no provision for it in the spec) or portably detectable (no error condition describes it).

milanbv 2016-02-07 11:07 reporter bugnote:0003068	The issue description is a good summary of the situation IMHO. Unfortunately, while some implementations violate the standard (GNU libiconv, Mac OS X, glibc), others follow it (the BSDs), which makes it impossible to rely on any behavior or to standardize it.
A good way forward to make implementations consistent would be to standardize on a function like GNU libiconv's iconvctl(): https://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.13/iconvctl.3.html
This solution would preserve backward compatibility with the behaviours of existing (and inconsistent) implementations, but would allow enabling a different behaviour for the currently problematic cases:
- whether to raise an error or continue on invalid conversion
- if the former, which error code (to allow distinguishing from EILSEQ, which is supposed to be used for invalid input only)
- if the latter, what character to use as a fallback (including nothing, for the "discard" or "ignore" behaviour)
- optionally, whether to use transliteration

shware_systems 2016-07-28 23:14 reporter bugnote:0003311	Re: 2969, as a sidebar, mbrtowc() is expected to fail with any locale created by localedef that uses a user-provided charmap file where a code point has a label not listed in the standard. For the labels the standard lists, a generic mapping is possible but not required; for any other label no mapping is possible without platform-specific extensions. This is due to the charmap format not having a means to map by label or value(s) a code point definition to the encoding of wchar the platform uses. This is one of the corollaries of XBD 6 leaving the wchar encoding explicitly unspecified, not implementation-defined, in V6TC1. I brought it up back then; my alternative resolution that would allow mapping by label wasn't agreed to. It applies to Unicode also, which is a superset of many, certainly, but by no means all encodings. Any range definition in a charmap file has no Unicode counterpart, for one. There are also code sets, representable by a charmap file, that use code points Unicode doesn't, and won't, include.

geoffclare 2016-07-29 09:13 reporter bugnote:0003313	Solaris 11 and glibc both support an extension to iconv_open() to control conversion behaviour, by appending "//" and a special string to a codeset name. According to https://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.13/iconv_open.3.html glibc only supports appending to tocode and it only supports the strings "//TRANSLIT" and "//IGNORE". Solaris 11 supports appending to both fromcode and tocode and supports many more strings. We should consider adding a minimal subset of these strings as a way to solve the problem reported here. This could be done on its own or in tandem with adding the iconvctl() function, but adding iconvctl() would require a sponsor.
A few points of detail regarding the strings:
1. The Solaris 11 man page says that //IGNORE is shorthand for //NON_IDENTICAL_DISCARD//ILLEGAL_DISCARD. It's not clear whether appending //IGNORE to tocode activates //ILLEGAL_DISCARD or not, since that controls the handling of invalid characters in the input. Hopefully //ILLEGAL_DISCARD is only activated by //IGNORE when appended to fromcode, in which case the behaviour when appended to tocode should be the same as glibc (i.e. an invalid character in the input still causes EILSEQ).
2. The Solaris 11 man page says that when characters are discarded because there is no identical character in the target codeset, they are still counted in the return value of iconv(). We will need to check whether glibc does the same.
3. Solaris does not appear to have a // string that can be used to make iconv() return an error when an input character has no identical character in the target codeset.
4. When //TRANSLIT is in effect, there can still presumably be input characters for which no suitable transliteration to the target codeset is possible. It's not clear what happens in this case.
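
As an illustration of the "//" extension described in this note, the following C sketch builds a tocode of the form "codeset//SUFFIX" and falls back to the plain codeset name if iconv_open() rejects it. `open_with_suffix` is a hypothetical helper name, and the approach rests on an assumption: an implementation that does not support a suffix is assumed to fail the open, which is not guaranteed, since an implementation may accept a suffix string and silently not honour it.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (illustration only).
 * Tries to open a conversion to "<tocode>//<suffix>"; if that is rejected,
 * opens a conversion to plain tocode instead. *used_suffix is set to 1 when
 * the suffixed name was accepted and to 0 otherwise. Returns (iconv_t)-1
 * if neither open succeeds. */
static iconv_t open_with_suffix(const char *tocode, const char *fromcode,
                                const char *suffix, int *used_suffix)
{
    char name[128];

    assert(tocode != NULL && fromcode != NULL && suffix != NULL);
    *used_suffix = 0;

    int n = snprintf(name, sizeof name, "%s//%s", tocode, suffix);
    if (n > 0 && (size_t)n < sizeof name) {
        iconv_t cd = iconv_open(name, fromcode);
        if (cd != (iconv_t)-1) {
            *used_suffix = 1;   /* accepted; whether it is honoured is another matter */
            return cd;
        }
    }
    /* Suffix rejected (or name too long): fall back to the bare codeset. */
    return iconv_open(tocode, fromcode);
}
```

A caller might use `open_with_suffix("ASCII", "UTF-8", "TRANSLIT", &used)` and treat `used == 0` as a signal that it has to handle non-identical characters itself.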

steffen 2016-07-30 13:56 reporter bugnote:0003314	Good afternoon. For item 4. it seems they use a replacement character if no transliteration is possible. I haven't checked whether they use the Unicode replacement character if available in the target character set, but i only know of the question mark ? being used for that purpose. If i run the program below in a UTF-8 locale on a current Linux (GLibC 2.23-5) i get:
  ?0[steffen@wales tmp]$ ./i1007 -f utf8 -s IGNORE -t ASCII 'hallöchen'
  Input (utf8): 10 bytes: hallöchen
  EILSEQ after 10/0 ()
  Output (ASCII//IGNORE): 8 bytes: hallchen
  Accumulated iconv(3) return value: 0
  ?1[steffen@wales tmp]$ ./i1007 -f utf8 -s TRANSLIT -t ASCII 'hallöchen'
  Input (utf8): 10 bytes: hallöchen
  Output (ASCII//TRANSLIT): 9 bytes: hall?chen
  Accumulated iconv(3) return value: 1
  ?0[steffen@wales tmp]$ ./i1007 -f utf8 -s TRANSLIT//IGNORE -t ASCII 'hallöchen'
  Input (utf8): 10 bytes: hallöchen
  Output (ASCII//TRANSLIT//IGNORE): 9 bytes: hall?chen
  Accumulated iconv(3) return value: 1
  ?0[steffen@wales tmp]$ ./i1007 -f utf8 -s IGNORE//TRANSLIT -t ASCII 'hallöchen'
  Input (utf8): 10 bytes: hallöchen
  EILSEQ after 10/0 ()
  Output (ASCII//IGNORE//TRANSLIT): 8 bytes: hallchen
  Accumulated iconv(3) return value: 0
  ?1[steffen@wales tmp]$ ./i1007 -f utf8 -t ASCII 'hallöchen'
  Input (utf8): 10 bytes: hallöchen
  EILSEQ after 4/6 (öchen)
  EILSEQ after 5/5 (�chen)
  Output (ASCII): 8 bytes: hallchen
  Accumulated iconv(3) return value: 0
  -- >8 -- 8< --
  /*@ i1007.c; http://austingroupbugs.net/view.php?id=1007 */
  #include <errno.h>
  #include <getopt.h>
  #include <iconv.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  /* sysexits.h */
  #define a_EX_USAGE 64
  #define a_EX_DATAERR 65
  #define a_EX_UNAVAILABLE 69

  int
  main(int argc, char **argv){
     char tobuf[80], oubuf[80], *inb, *oub;
     size_t inl, inl_orig, oul, totret;
     iconv_t ic;
     char const *tocode, *fromcode, *suffix;
     int i;

     /* Parse options */
     tocode = fromcode = suffix = NULL;

     while((i = getopt(argc, argv, "f:s:t:")) >= 0)
        switch(i){
        case 'f': fromcode = optarg; break;
        case 's': suffix = optarg; break;
        case 't': tocode = optarg; break;
        default: i = a_EX_USAGE; goto jleave;
        }
     argc -= optind;
     argv += optind;

     /* Validate arguments; build "tocode//suffix" if a suffix was given */
     i = a_EX_USAGE;
     if(argc != 1){
        fprintf(stderr, "Synopsis: i1007 "
           "[-f romcode] [-s uffix] [-t ocode] INPUT\n");
        goto jleave;
     }
     if(tocode == NULL){
        fprintf(stderr, "Need a -t[ocode]\n");
        goto jleave;
     }
     if(fromcode == NULL){
        fprintf(stderr, "Need a -f[romcode]\n");
        goto jleave;
     }
     if(suffix != NULL){
        i = snprintf(tobuf, sizeof tobuf, "%s//%s", tocode, suffix);
        if((size_t)i >= sizeof tobuf){
           fprintf(stderr, "-t/-s option combination too long\n");
           i = a_EX_DATAERR;
           goto jleave;
        }
        tocode = tobuf;
     }

     /* Convert, skipping one input byte after each EILSEQ */
     if((ic = iconv_open(tocode, fromcode)) == (iconv_t)-1){
        fprintf(stderr, "Cannot convert from %s to %s: %s\n",
           fromcode, tocode, strerror(errno));
        i = a_EX_UNAVAILABLE;
        goto jleave;
     }

     i = EXIT_SUCCESS;
     inl_orig = inl = strlen(inb = *argv);
     printf("Input (%s): %zu bytes: %s\n", fromcode, inl, inb);
     memset(oub = oubuf, '\0', sizeof oubuf);
     oul = sizeof oubuf;

     for(totret = 0; inl > 0;){
        size_t j;

        j = iconv(ic, &inb, &inl, &oub, &oul);
        if(j == (size_t)-1){
           i = EXIT_FAILURE;
           switch(errno){
           case EILSEQ:
              fprintf(stderr, "EILSEQ after %zu/%zu (%s)\n",
                 inl_orig - inl, inl, inb);
              if(inl > 0){
                 --inl;
                 ++inb;
              }else
                 *oub = '\0';
              break;
           case E2BIG:
           case EINVAL:
              fprintf(stderr, "E2BIG/EINVAL, forced stop\n");
              *oub = '\0';
              inl = 0; /* Force stop */
              break;
           }
           continue;
        }
        totret += j;
     }

     oul = strlen(oub = oubuf);
     printf("Output (%s): %zu bytes: %s\n", tocode, oul, oub);
     printf("Accumulated iconv(3) return value: %zu\n", totret);

     iconv_close(ic);
  jleave:
     return i;
  }

geoffclare 2016-08-02 16:09 reporter bugnote:0003323	I have run Steffen's program on Solaris 11 and got some answers to points 1 and 4 in my previous bugnote.
1. Appending //IGNORE to tocode does activate //ILLEGAL_DISCARD. This is in an ISO8859-1 terminal session (so the ö is not valid UTF-8):
  $ ./agbug1007 -f UTF-8 -s IGNORE -t ASCII 'hallöchen'
  Input (UTF-8): 9 bytes: hallöchen
  Output (ASCII//IGNORE): 8 bytes: hallchen
  Accumulated iconv(3) return value: 0
  $ ./agbug1007 -f UTF-8 -s NON_IDENTICAL_DISCARD -t ASCII 'hallöchen'
  Input (UTF-8): 9 bytes: hallöchen
  EILSEQ after 4/5 (öchen)
  Output (ASCII//NON_IDENTICAL_DISCARD): 8 bytes: hallchen
  Accumulated iconv(3) return value: 0
However, I did the same thing on Linux and discovered that glibc also ignores invalid characters in the input with //IGNORE appended to tocode. So the behaviour is the same after all, it's just not what the glibc manpage implied.
4. With //TRANSLIT in effect, input characters for which no suitable transliteration to the target codeset exists are converted the same way they are without //TRANSLIT. This is in a UTF-8 terminal session:
  $ ./agbug1007 -f UTF-8 -s TRANSLIT -t ASCII 'hallöchen'
  Input (UTF-8): 10 bytes: hallöchen
  Output (ASCII//TRANSLIT): 9 bytes: hall?chen
  Accumulated iconv(3) return value: 1
  $ ./agbug1007 -f UTF-8 -t ASCII 'hallöchen'
  Input (UTF-8): 10 bytes: hallöchen
  Output (ASCII): 9 bytes: hall?chen
  Accumulated iconv(3) return value: 1
I can also see an answer on point 2 in Steffen's output: when //IGNORE is used, glibc does not count the discarded characters in the return value of iconv(). However, there is something odd about that output: why does iconv() give an EILSEQ error at the end of the string?

steffen 2016-08-03 11:45 reporter bugnote:0003324	| However, there is something odd about that output: why does iconv() give an EILSEQ error at the end of the string?
That seems to be a bug in the GNU C library on a current (last week) ArchLinux. The same works just smoothly on "SunOS unstable11s 5.11 11.2 sun4u sparc SUNW,SPARC-Enterprise".
In my opinion appending "indicators" to "fromcode" is not a good idea, so i understand why GNU seems to solely use "tocode" for this purpose: the input is constant and any possible modification happens on the output side.
TRANSLIT seems to have the nice effect of correctly replacing invalid sequences with replacement characters, and the right number thereof. In the code that i use to deal with iconv(3) as it is standardized right now, the non-convertible UTF-8 sequence in the example i gave would be replaced with two replacement characters instead of only one, because i think i cannot skip over a complete multibyte sequence with standard tools. So having TRANSLIT would be a real improvement!
The Solaris 11 manual says for TRANSLIT (an alias):
  For non-identical characters, if applicable, the iconv() code conversion transliterates each of such characters into one or more characters of the target codeset best resembling the input character. The number of such non-identical characters are counted and returned as the return value of iconv().
GNU simply says:
  The iconv() function returns the number of characters converted in a nonreversible way during this call; reversible conversions are not counted.
I like the former more, but the latter has been written in spare time. :)
Returning to the example i gave, and in respect to TRANSLIT and "transliterates" and "best resembling the input character", i wonder what this means given that "ö" should really be "transliterated" to "oe"; this is LATIN1 and thus a really basic, and very widely used character set. That is to say that "proper conversion of non-convertible byte sequences to replacement characters" would be a better description, and would then result in predictable and reproducible output.
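
The point above about skipping a complete multibyte sequence can be illustrated for the UTF-8 case specifically. The C sketch below is an assumption-laden illustration, not something iconv() provides: `utf8_skip_len` is a hypothetical helper that inspects the lead byte at the point where iconv() stopped and returns how many bytes belong to that character, so that a caller could emit one replacement character per character instead of one per byte. It only applies when the source codeset is known to be UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (illustration only, UTF-8 input assumed).
 * Returns the number of bytes to skip at p: the length announced by the
 * lead byte, limited to the continuation bytes actually present and to
 * the avail bytes remaining. A stray continuation byte or an invalid
 * lead byte is skipped on its own (returns 1). */
static size_t utf8_skip_len(const unsigned char *p, size_t avail)
{
    assert(p != NULL && avail > 0);

    size_t want;
    if (p[0] < 0x80)
        want = 1;                       /* ASCII */
    else if ((p[0] & 0xE0) == 0xC0)
        want = 2;                       /* 110xxxxx */
    else if ((p[0] & 0xF0) == 0xE0)
        want = 3;                       /* 1110xxxx */
    else if ((p[0] & 0xF8) == 0xF0)
        want = 4;                       /* 11110xxx */
    else
        return 1;                       /* continuation or invalid lead byte */

    size_t n = 1;
    while (n < want && n < avail && (p[n] & 0xC0) == 0x80)
        n++;                            /* count continuation bytes that follow */
    return n;
}
```

For the 'ö' in the example (bytes 0xC3 0xB6) the helper returns 2, so a skip loop would advance past the whole character in one step rather than hitting a second EILSEQ on the continuation byte.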

~~Don Cragun~~ 2016-08-04 15:48 viewer bugnote:0003325	This issue was discussed during The Austin Group 2016-08-04 conference call. We agree that much of what is being suggested would be good to add to the standard, but before we can do that we need actual text specifying what should be added to the standard, a sponsor for any new interface(s), and general agreement from the parties who would need to implement the additional requirements that the suggested text would pass balloting. We propose that such text be based on a subset of the current Solaris 11 implementation as described by their current iconv_open(3c), iconvctl(3c), iconvstr(3c), and iconv(3c) man pages. These man pages can be found here:
- iconv_open(3c): https://docs.oracle.com/cd/E23824_01/html/821-1465/iconv-open-3c.html#scrolltoc
- iconvctl(3c): https://docs.oracle.com/cd/E23824_01/html/821-1465/iconvctl-3c.html#scrolltoc
- iconvstr(3c): https://docs.oracle.com/cd/E23824_01/html/821-1465/iconvstr-3c.html#scrolltoc
- iconv(3c): https://docs.oracle.com/cd/E23824_01/html/821-1465/iconv-3c.html#REFMAN3Aiconv-3c

eblake 2016-08-04 20:57 manager bugnote:0003327	I've created glibc bug https://sourceware.org/bugzilla/show_bug.cgi?id=20437 to see if we can get any preferences from glibc maintainers on how much of a common subset is worth standardizing

geoffclare 2016-08-10 14:58 reporter bugnote:0003330 Last edited: 2016-08-10 15:02	These are the changes I believe are needed to add the common subset of Solaris 11 and glibc iconv_open() extensions (//IGNORE and //TRANSLIT), plus Solaris's //NON_IDENTICAL_DISCARD. This is intended as a starting point to which further changes can be added if the group so decides.
On page 1109 line 37310 section iconv()
Change:
  If a sequence of input bytes does not form a valid character in the specified codeset, conversion shall stop after the previous successfully converted character.
to:
  If a sequence of input bytes does not form a valid character or shift sequence in the input codeset:
  - If the //IGNORE indicator suffix was specified when the conversion descriptor cd was opened and the byte sequence is immediately followed by a valid character or shift sequence, the sequence of bytes shall be discarded and conversion shall continue from the immediately following valid character or shift sequence. This shall not be treated as an error.
  - If the //IGNORE indicator suffix was not specified when the conversion descriptor cd was opened, conversion shall stop after the previous successfully converted character or shift sequence.
On page 1109 line 37323 section iconv()
Change:
  If iconv() encounters a character in the input buffer that is valid, but for which an identical character does not exist in the target codeset, iconv() shall perform an implementation-defined conversion on this character.
to:
  If iconv() encounters a character in the input buffer that is valid, but for which an identical character does not exist in the output codeset:
  - If either the //IGNORE or the //NON_IDENTICAL_DISCARD indicator suffix was specified when the conversion descriptor cd was opened, the character shall be discarded but shall still be counted in the return value of the iconv() call.
  - If the //TRANSLIT indicator suffix was specified when the conversion descriptor cd was opened, an implementation-defined transliteration shall be performed, if possible, to convert the character into one or more characters of the output codeset that best resemble the input character. The character shall be counted as one character in the return value of the iconv() call, regardless of the number of output characters.
  - If no indicator suffix was specified when the conversion descriptor cd was opened, or the //TRANSLIT indicator suffix was specified but no transliteration of the character is possible, iconv() shall perform an implementation-defined conversion on the character and it shall be counted in the return value of the iconv() call.
On page 1109 line 37327 section iconv()
Change:
  The iconv() function shall update the variables pointed to by the arguments to reflect the extent of the conversion and return the number of non-identical conversions performed. If the entire string in the input buffer is converted, the value pointed to by inbytesleft shall be 0.
to:
  The iconv() function shall update the variables pointed to by the arguments to reflect the extent of the conversion and shall return the number of input characters that could not be converted to an identical output character. If the entire string in the input buffer is converted, except for any byte sequences discarded as a result of the //IGNORE indicator suffix, the value pointed to by inbytesleft shall be 0.
On page 1110 line 37364 section iconv()
Add a new paragraph to APPLICATION USAGE:
  When the //IGNORE indicator suffix was used to open the conversion descriptor, iconv() does not provide any indication of whether any invalid input byte sequences were discarded. Applications which need to detect the discarding of invalid input byte sequences can open the conversion descriptor without using //IGNORE and then call iconv() in a loop such that if it returns an EILSEQ error, the application increments the variable pointed to by inbuf and decrements the variable pointed to by inbytesleft before the next call. This technique can also be used by applications which need to use //TRANSLIT but also discard invalid input byte sequences.
On page 1113 line 37421 section iconv_open()
Change:
  Settings of fromcode and tocode and their permitted combinations are implementation-defined.
to:
  The codeset names that can be specified in fromcode and tocode and their permitted combinations are implementation-defined. Any one of the following indicator suffixes can be appended to the codeset name in tocode:
  //NON_IDENTICAL_DISCARD
    Discard input characters for which an identical character does not exist in the output codeset.
  //IGNORE
    Discard input bytes that do not form a valid character or shift sequence, and discard input characters for which an identical character does not exist in the output codeset.
  //TRANSLIT
    Transliterate input characters for which an identical character does not exist in the output codeset into one or more characters of the output codeset that best resemble the input character.
  See the description of iconv() for details of how these indicator suffixes alter the conversion performed by iconv(). Additional implementation-defined indicator suffixes may be supported.
On page 1113 line 37434 section iconv_open()
Change the EINVAL description from:
  The conversion specified by fromcode and tocode is not supported by the implementation.
to:
  Conversion from the codeset specified in fromcode to the codeset specified in tocode is not supported by the implementation, or the codeset name in tocode is followed by an indicator suffix that is unrecognized or not supported.
On page 1113 line 37445 section iconv_open()
Add a new paragraph to APPLICATION USAGE:
  Some implementations of iconv_open() allow appending multiple indicator suffixes to the codeset name in tocode, and some allow appending an indicator suffix (or suffixes) in both fromcode and tocode. Portable applications should append at most one indicator suffix, and append it only in tocode.
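
The loop described in the proposed APPLICATION USAGE paragraph for iconv() can be sketched in C roughly as follows. `convert_skipping` is a hypothetical helper name, and the sketch assumes a conversion descriptor opened without //IGNORE and a stateless target codeset (it does not flush shift state). On each EILSEQ it advances the input pointer by one byte, decrements the remaining count, and tallies the skipped byte, which gives the application the indication that //IGNORE alone would not provide.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only).
 * Converts the NUL-terminated string src using cd into out, skipping one
 * input byte each time iconv() fails with EILSEQ. out is always
 * NUL-terminated. Returns the number of input bytes skipped, or
 * (size_t)-1 if iconv() failed for another reason (e.g. E2BIG, EINVAL). */
static size_t convert_skipping(iconv_t cd, const char *src,
                               char *out, size_t outsz)
{
    assert(cd != (iconv_t)-1 && src != NULL && out != NULL && outsz > 0);

    char *inp = (char *)src;       /* iconv() takes char **; input is not modified */
    size_t inleft = strlen(src);
    char *outp = out;
    size_t outleft = outsz - 1;    /* keep one byte for the terminator */
    size_t skipped = 0;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                 /* all remaining input was converted */
        if (errno != EILSEQ) {
            skipped = (size_t)-1;  /* out of space or truncated input: give up */
            break;
        }
        inp++;                     /* step over the offending byte ... */
        inleft--;                  /* ... and account for it */
        skipped++;
    }
    *outp = '\0';
    return skipped;
}
```

On implementations that report EILSEQ for valid but unrepresentable characters, this loop also skips those characters one byte at a time, so the count it returns is of bytes rather than characters.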

shware_systems 2016-08-11 14:36 reporter bugnote:0003334	The following, as an element of //IGNORE, I feel should be included also, and should be simple for glibc to add. An application may not care about illegal sequences, but may want to pop up a prompt for how to transliterate non-identical sequences so would need to get notified when they occur. Having just //IGNORE doesn't permit this usage. //ILLEGAL_DISCARD When specified, during subsequent iconv() code conversion, a sequence of illegal input bytes that does not form a valid character in the codeset specified by the fromcode argument is silently discarded as if there are no such illegal bytes in the input buffer and the conversion continues.

Date Modified	Username	Field	Change
2015-11-16 20:37	sebor	New Issue
2015-11-16 20:37	sebor	Status	New => Under Review
2015-11-16 20:37	sebor	Assigned To	=> ajosey
2015-11-16 20:37	sebor	Name	=> Martin Sebor
2015-11-16 20:37	sebor	Section	=> iconv function
2015-11-16 20:37	sebor	Page Number	=> 1109-1110
2015-11-16 22:38	steffen	Note Added: 0002965
2015-11-17 17:02	geoffclare	Project	1003.1(2008)/Issue 7 => 1003.1(2013)/Issue7+TC1
2015-11-17 21:39	shware_systems	Note Added: 0002968
2015-11-17 21:41	shware_systems	Note Edited: 0002968
2015-11-18 00:03	sebor	Note Added: 0002969
2016-02-07 11:07	milanbv	Note Added: 0003068
2016-07-28 23:14	shware_systems	Note Added: 0003311
2016-07-29 09:13	geoffclare	Note Added: 0003313
2016-07-30 13:56	steffen	Note Added: 0003314
2016-08-02 16:09	geoffclare	Note Added: 0003323
2016-08-03 11:45	steffen	Note Added: 0003324
2016-08-04 15:48	~~Don Cragun~~	Note Added: 0003325
2016-08-04 20:57	eblake	Note Added: 0003327
2016-08-10 14:58	geoffclare	Note Added: 0003330
2016-08-10 15:02	geoffclare	Note Edited: 0003330
2016-08-11 14:36	shware_systems	Note Added: 0003334
2022-07-25 15:41	geoffclare	Interp Status	=> ---
2022-07-25 15:41	geoffclare	Final Accepted Text	=> 0001007:0003330
2022-07-25 15:41	geoffclare	Status	Under Review => Resolved
2022-07-25 15:41	geoffclare	Resolution	Open => Accepted As Marked
2022-07-25 15:41	geoffclare	Tag Attached: issue8
2022-08-19 15:28	geoffclare	Status	Resolved => Applied
2023-03-06 16:35	nick	Relationship added	related to 0001635
2024-06-11 09:02	agadmin	Status	Applied => Closed
