0001635: iconv: please be more explicit in input-not-convertible case

Notes
(0006164) steffen (reporter) 2023-02-21 18:20	P.S.: to elaborate some more, any implementation which wants to conform to the upcoming POSIX standard needs to pass the configured state (//IGNORE, //TRANSLIT, ...) along its internal code path, so that this addition would be an "easy" one. Adding this very helpful addition to make the behaviour of the most widely used (and standards-incompatible) implementation explicitly addressable allows future programmers to know what they are doing (by means other than performing compile-time tests and fixing behaviour according to the compile-time-tested environment). I truly believe in the honourable Bruno Haible's approach, if it is addressable like that.

The only other implementation that allows programmers to realize their own handling of output conversion failures is that of AIX (looking at gnulib), because that uses the NUL byte as an implementation-defined conversion, but that, of course, will fail for data which contains embedded NULs (which for one is otherwise graspable by iconv(3), and second will fail for all multi-byte as in 16-bit or 32-bit encodings per se). In general, the real-life situation is very bad; just look at the efforts of character set conversion in the widely used and widely available libarchive. This addition could make things a bit better.

(0006812) bhaible (reporter) 2024-06-11 23:42	Here's how the various implementations behave in the case "when a valid sequence in the source codeset cannot be represented in the destination codeset":

* GNU libc, GNU libiconv, and win-iconv (https://github.com/win-iconv/win-iconv):
  - They fail the conversion with EILSEQ when the to_codeset did not have a //TRANSLIT or //IGNORE suffix.
  - If the to_codeset had a //IGNORE suffix, the character is discarded, i.e. it produces 0 bytes in the output.
  - If the to_codeset had a //TRANSLIT suffix, then a transliteration is attempted. It may do substitutions such as ½ → 1/2 or å → aa. Transliterations between scripts (e.g. from Cyrillic to Latin script) are generally not done.
* musl libc produces a '*' character in the output.
* FreeBSD and NetBSD produce a '?' character in the output.
* Solaris attempts a transliteration if enabled, otherwise it produces a '?' character in the output.
* IRIX produces a NUL character in the output.
* macOS 14 iconv always does transliteration,
  - regardless of whether a //TRANSLIT suffix was present in to_codeset or not,
  - regardless of whether a //IGNORE suffix was present in to_codeset or not,
  - regardless of whether iconvctl ICONV_SET_TRANSLITERATE was done on the conversion descriptor,
  - regardless of whether iconvctl ICONV_SET_DISCARD_ILSEQ was done on the conversion descriptor,
  - regardless of whether iconvctl ICONV_SET_ILSEQ_INVALID was done on the conversion descriptor.
  The transliteration result depends on the input character. In some cases, the result is merely a '?' character. And the return value (count of "non-identical conversions") is always 0.
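The survey above can be checked on a given system with a small probe. The following is a minimal sketch, not part of the original note: the names probe_result and probe_unconvertible are made up for illustration, and the input is assumed to be UTF-8. It converts the single character ½ (U+00BD) to a caller-chosen codeset and records what iconv() returned, which errno it set, and which bytes it wrote.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Outcome of converting one valid but possibly unrepresentable character.
 * (Hypothetical helper type, introduced only for this sketch.) */
struct probe_result {
    int    opened;   /* 0 if iconv_open() failed */
    size_t ret;      /* return value of iconv() */
    int    err;      /* errno if ret == (size_t)-1, else 0 */
    size_t out_len;  /* number of bytes written to out[] */
    char   out[16];  /* converted bytes, not NUL-terminated */
};

/* Convert U+00BD (VULGAR FRACTION ONE HALF), given as UTF-8, to to_codeset. */
static struct probe_result probe_unconvertible(const char *to_codeset)
{
    struct probe_result r;
    memset(&r, 0, sizeof r);
    assert(to_codeset != NULL);

    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return r;               /* conversion pair not supported */
    r.opened = 1;

    char in[] = "\xC2\xBD";     /* UTF-8 encoding of U+00BD */
    char *inp = in, *outp = r.out;
    size_t inleft = 2, outleft = sizeof r.out;

    errno = 0;
    r.ret = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (r.ret == (size_t)-1)
        r.err = errno;
    r.out_len = sizeof r.out - outleft;

    iconv_close(cd);
    return r;
}
```

Calling probe_unconvertible() with "ASCII", "ASCII//TRANSLIT" and "ASCII//IGNORE" and printing the fields should show which of the behaviours listed above the local implementation follows. The suffixes are only meaningful on implementations that accept them; elsewhere iconv_open() may simply fail.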

(0006813) bhaible (reporter) 2024-06-11 23:47	The documentation for GNU libc had not been up-to-date for many years, but has been extended in 2023:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=470b97d59797b040e28103a0ba0f616d95f0ed93
https://man.archlinux.org/man/iconv.3

(0006814) bhaible (reporter) 2024-06-12 00:16	Let me explain
* why in GNU libc, GNU libiconv, and win-iconv, when //TRANSLIT or //IGNORE is not specified, the iconv() conversion stops when it encounters an unconvertible character,
* why this would also be a useful behaviour for the other iconv() implementations.

The output of iconv() conversion is generally not piped to /dev/null, but instead becomes visible to a user in one form or the other. Therefore, applications that make use of iconv must not use low-quality conversions, only high-quality conversions. For example, the string "Frédéric", when mapped to ASCII, could become "Frederic" (humanly acceptable, high quality) or "Frdric" or "Fr?d?ric" (not acceptable, because too low quality).

An application that wants to use iconv() for character set conversion therefore needs an easy way to be alerted by the iconv() implementation that some "non-identical conversion" has been performed, so that it can test whether that non-identical conversion is of high or of low quality.

With an implementation that, like GNU libc or GNU libiconv, reports the situation by stopping the conversion loop and leaving *inbuf pointing at the unconvertible character, the application can call iconv() on the entire input at once, and profit from the efficient conversion loop inside the iconv() engine.

With an implementation that, as POSIX currently specifies, always returns the conversion result for the entire input, the application does get an indication that some non-identical conversion was performed (through the return value of iconv()), but
- it does not get an indication where the unconvertible character is,
- it cannot back up to that point, because in the presence of stateful encodings you cannot just "jump back" to an arbitrary input pointer.

The only way the application has is to feed the input bytes to iconv() slowly (the first byte, then the first 2 bytes, then the first 3 bytes, and so on), until iconv() converts a character. This is the only way to do a conversion and check its quality as it goes. Of course, as Steffen Nurpmeso noted, this is inefficient. Implementations like the one in FreeBSD are optimized for converting as much as possible in one swoop, and invoking iconv() as many times as there are bytes in the input does not make use of this optimized implementation.

The net result is that only for conversions to UTF-8 or UCS, the application may make use of the optimized implementation, and for the other ones (conversions to any encoding that is not full Unicode) the optimizations in the implementation are not used.

Conclusion: For all iconv implementations, it would be useful to stop when a "non-identical conversion" is about to happen, when neither //TRANSLIT nor //IGNORE has been specified.
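To illustrate the stop-at-the-unconvertible-character model described in this note, here is a minimal sketch of an application-side loop. It is an illustration under stated assumptions, not code from any of the implementations discussed: the function name convert_with_fallback is hypothetical, the input is assumed to be UTF-8, the destination is assumed to be a stateless, ASCII-compatible codeset, and the implementation is assumed to fail with EILSEQ while leaving *inbuf at the offending character (the GNU behaviour). Note that EILSEQ is also raised for invalid input, so this sketch replaces invalid bytes in the same way.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/*
 * Convert in[0..inlen) from UTF-8 to to_codeset, handing the whole input
 * to iconv() and writing the application-chosen string repl for every
 * character that iconv() rejects with EILSEQ.
 * Returns the number of output bytes, or (size_t)-1 on any other error.
 * *replaced receives the number of substitutions that were made.
 */
static size_t convert_with_fallback(const char *to_codeset,
                                    const char *in, size_t inlen,
                                    char *out, size_t outsize,
                                    const char *repl, size_t *replaced)
{
    assert(in != NULL && out != NULL && repl != NULL && replaced != NULL);
    *replaced = 0;

    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;     /* iconv() does not modify the input bytes */
    char *outp = out;
    size_t inleft = inlen, outleft = outsize;
    size_t repl_len = strlen(repl);

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;              /* the rest of the input was converted */
        if (errno != EILSEQ || repl_len > outleft) {
            iconv_close(cd);    /* E2BIG, EINVAL, or no room for repl */
            return (size_t)-1;
        }
        /* inp points at the rejected character: substitute, then skip it. */
        memcpy(outp, repl, repl_len);
        outp += repl_len;
        outleft -= repl_len;
        ++*replaced;
        do {                    /* skip one UTF-8 sequence */
            ++inp;
            --inleft;
        } while (inleft > 0 && ((unsigned char)*inp & 0xC0) == 0x80);
    }

    iconv_close(cd);
    return outsize - outleft;
}
```

On an implementation that substitutes or transliterates on its own instead of stopping, the loop body never runs and *replaced stays 0, so the application gets whatever replacement the implementation chose. That is the loss of control this note argues against.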

(0006815) bhaible (reporter) 2024-06-12 00:24	The reporter of this defect is right regarding defect 1007: The intent of defect 1007 was to allow applications to make a more efficient use of the iconv() engine, while distinguishing high-quality conversions from low-quality conversions. But the resolution of defect 1007, to include the facilities that are present in Solaris iconv(), does not resolve the problem: It does not allow iconv() to fail upon non-identical conversions, and thus it does not allow the application to make use of an optimized conversion loop, while distinguishing high-quality conversions from low-quality conversions.
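For contrast, the slow path that remains when iconv() never stops at a non-identical conversion can be sketched as follows. This is a hedged illustration of the "first byte, then the first 2 bytes" technique from note 0006814, with a hypothetical function name (first_lossy_offset) and UTF-8 assumed as the source codeset. It feeds one character at a time and inspects the result of each call, so the implementation's bulk conversion loop is never used.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Return the byte offset of the first character in in[0..inlen) that is
 * rejected, converted non-identically, or truncated when converting from
 * UTF-8 to to_codeset. Returns inlen if every character converts
 * identically, or (size_t)-1 if the conversion cannot be opened.
 */
static size_t first_lossy_offset(const char *to_codeset,
                                 const char *in, size_t inlen)
{
    assert(to_codeset != NULL && in != NULL);

    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    size_t start = 0;   /* offset of the character currently being fed */
    size_t n = 1;       /* number of its bytes offered to iconv() so far */

    while (start < inlen) {
        char buf[64];   /* scratch output; only the status is inspected */
        char *inp = (char *)in + start, *outp = buf;
        size_t inleft = n, outleft = sizeof buf;

        size_t ret = iconv(cd, &inp, &inleft, &outp, &outleft);
        if (ret == 0 && inleft == 0) {
            start += n;         /* identical conversion: next character */
            n = 1;
        } else if (ret == (size_t)-1 && errno == EINVAL && start + n < inlen) {
            ++n;                /* incomplete sequence: offer one more byte */
        } else {
            break;              /* EILSEQ, ret > 0, or truncated input */
        }
    }

    iconv_close(cd);
    return start;
}
```

The sketch issues at least one iconv() call per input byte, which is the inefficiency the notes describe. It also cannot help on an implementation that, like macOS 14 according to note 0006812, always reports 0 non-identical conversions.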

(0006817) eblake (manager) 2024-06-13 16:37	Based on the discussions of the 2024-06-13 call, the Austin Group understands the desire to have a means for an iconv() implementation that stops early when a transliteration is not possible, despite recognizing valid characters in the input. Would it work to utilize a different errno in this sequence, perhaps ENOTSUP or EPROTO, to make it easier for applications to distinguish between a stop because of unrecognized input (EILSEQ) vs unrepresentable output (the new errno)?

Note: 0006812 mentioned ICONV_SET_* flags on macOS; that appears to be used with a non-standard interface iconvctl(), and while it may be possible to standardize that interface and a new ICONV_SET_* flag as a means for opting into the new errno value behavior, it seems like a much bigger request at this time.

Regarding the different defaults, we could add //NOTRANSLIT and make it unspecified whether //TRANSLIT or //NOTRANSLIT is the default.
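How an application might consume such a distinct errno can be sketched as below. This is purely illustrative: no implementation mentioned in this issue is known to behave this way, the macro ICONV_E_UNREPRESENTABLE and the names stop_reason and classify_stop are invented for the sketch, and ENOTSUP is used only because it is one of the two candidates floated in the note.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* HYPOTHETICAL: errno a future iconv() might set for valid input that
 * cannot be represented in the destination codeset. Not current behaviour. */
#define ICONV_E_UNREPRESENTABLE ENOTSUP

enum stop_reason {
    STOP_NONE,             /* iconv() did not fail */
    STOP_BAD_INPUT,        /* EILSEQ: invalid sequence in the source */
    STOP_UNREPRESENTABLE,  /* proposed: valid input, no target mapping */
    STOP_TRUNCATED_INPUT,  /* EINVAL: incomplete sequence at end of input */
    STOP_OUTPUT_FULL,      /* E2BIG: output buffer exhausted */
    STOP_OTHER
};

/* Classify why an iconv() call stopped, given its return value and errno. */
static enum stop_reason classify_stop(size_t ret, int err)
{
    /* The proposal only helps if the two conditions get distinct values. */
    assert(ICONV_E_UNREPRESENTABLE != EILSEQ);

    if (ret != (size_t)-1)
        return STOP_NONE;
    switch (err) {
    case EILSEQ:                  return STOP_BAD_INPUT;
    case ICONV_E_UNREPRESENTABLE: return STOP_UNREPRESENTABLE;
    case EINVAL:                  return STOP_TRUNCATED_INPUT;
    case E2BIG:                   return STOP_OUTPUT_FULL;
    default:                      return STOP_OTHER;
    }
}
```

With today's GNU behaviour both the bad-input and the unrepresentable case arrive as EILSEQ, so an application cannot tell them apart without inspecting the input itself.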

Issue History
Date Modified	Username	Field	Change
2023-02-21 00:14	steffen	New Issue
2023-02-21 00:14	steffen	Name	=> steffen
2023-02-21 00:14	steffen	Section	=> iconv
2023-02-21 00:14	steffen	Page Number	=> 1123
2023-02-21 00:14	steffen	Line Number	=> 38014
2023-02-21 18:20	steffen	Note Added: 0006164
2023-03-06 16:35	nick	Relationship added	related to 0001007
2024-06-11 23:42	bhaible	Note Added: 0006812
2024-06-11 23:47	bhaible	Note Added: 0006813
2024-06-12 00:16	bhaible	Note Added: 0006814
2024-06-12 00:24	bhaible	Note Added: 0006815
2024-06-13 16:37	eblake	Note Added: 0006817
