|Viewing Issue Simple Details|
|ID||Category||Severity||Type||Date Submitted||Last Update|
|0001635||[1003.1(2016/18)/Issue7+TC2] Base Definitions and Headers||Editorial||Clarification Requested||2023-02-21 00:14||2023-03-06 16:35|
|Final Accepted Text|
|Summary||0001635: iconv: please be more explicit in input-not-convertible case|
Issue 1007 resolves this to:
If iconv() encounters a character in the input buffer that is valid, but for which an identical character does not exist in the output codeset:
If either the //IGNORE or the //NON_IDENTICAL_DISCARD indicator suffix was specified when the conversion descriptor cd was opened, the character shall be discarded but shall still be counted in the return value of the iconv() call.
If the //TRANSLIT indicator suffix was specified when the conversion descriptor cd was opened, an implementation-defined transliteration shall be performed, if possible, to convert the character into one or more characters of the output codeset that best resemble the input character. The character shall be counted as one character in the return value of the iconv() call, regardless of the number of output characters.
If no indicator suffix was specified when the conversion descriptor cd was opened, or the //TRANSLIT indicator suffix was specified but no transliteration of the character is possible, iconv() shall perform an implementation-defined conversion on the character and it shall be counted in the return value of the iconv() call.
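To make the quoted wording concrete, here is a minimal sketch of how a caller would open a descriptor with an indicator suffix and read the count of non-identical conversions from the return value. The helper name convert_counting and its parameters are hypothetical (not part of any standard or library), and whether a given suffix is accepted by iconv_open() depends on the implementation.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Converts the NUL-terminated string `in` from codeset `from` to
   `to_with_suffix` (for example "ISO-8859-1//TRANSLIT").
   On success: returns 0, NUL-terminates `out`, and stores the sum of the
   iconv() return values (non-identical conversions) in *nonidentical.
   On failure: returns -1 with errno set by iconv_open() or iconv(). */
static int
convert_counting (const char *to_with_suffix, const char *from,
                  const char *in, char *out, size_t outcap,
                  size_t *nonidentical)
{
  assert (to_with_suffix != NULL && from != NULL && in != NULL);
  assert (out != NULL && outcap > 0 && nonidentical != NULL);

  iconv_t cd = iconv_open (to_with_suffix, from);
  if (cd == (iconv_t) -1)
    return -1;                  /* e.g. EINVAL: conversion or suffix unsupported */

  char *inptr = (char *) in;    /* iconv() takes char **, the bytes are not modified */
  size_t inleft = strlen (in);
  char *outptr = out;
  size_t outleft = outcap - 1;  /* keep room for the terminating NUL */
  int rc = 0;

  size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (res == (size_t) -1)
    rc = -1;
  else
    {
      *nonidentical = res;
      /* Flush a possible shift state of the output codeset. */
      res = iconv (cd, NULL, NULL, &outptr, &outleft);
      if (res == (size_t) -1)
        rc = -1;
      else
        *nonidentical += res;
    }

  int saved_errno = errno;      /* iconv_close() must not clobber the error */
  iconv_close (cd);
  errno = saved_errno;

  if (rc == 0)
    *outptr = '\0';
  return rc;
}
```

With the resolved text, a call such as convert_counting("ISO-8859-1//TRANSLIT", "UTF-8", ...) would be expected to succeed and to report each transliterated or discarded character once in *nonidentical; the actual output bytes remain implementation-defined.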
However, as Martin Sebor stated in the issue description,
The specification for the iconv() function assumes that every input sequence that is valid in the source codeset is convertible to some sequence in the destination codeset. In particular, the specification doesn't allow the function to fail when a valid sequence in the source codeset cannot be represented in the destination codeset. As an example where this assumption doesn't hold, consider a conversion from UTF-8 to ISO-8859 where a large number of source characters don't have equivalents in the destination codeset.
A survey of a subset of existing implementations shows that they fail with EILSEQ in such cases, despite the specification defining the error condition as "Input conversion stopped due to an input byte that does not belong to the input codeset."
And this is true: the GNU C library and GNU libiconv seem to fail the output conversion immediately, with the same EILSEQ error that denotes invalid input data.
(A much more drastic error, is it not?!)
Please be more explicit and note that implementations exist which behave like the GNU C library iconv and GNU libiconv.
That is to say that "implementation-defined conversion" may mean no conversion at all, but an immediate stop.
It would be tremendous if the standard could define something that programmers can react to, because, due to the restrictions of the iconv interface, it is impossible to decide what the error was.
A programmer knows nothing about the input or the output character set: how many bytes may make up a character, how many were consumed or produced, or whether conversion replacements were stored. (In practice, all other implementations known to me do place some character and continue.)
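The ambiguity can be observed with a small probe. This is only a sketch: the helper name probe_iconv_errno is hypothetical, and the result for the unconvertible character is exactly the implementation-dependent part this report is about.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical probe, for illustration only.
   Runs a single iconv() call over `len` input bytes and returns 0 on
   success, or the errno value of the failing iconv_open()/iconv() call. */
static int
probe_iconv_errno (const char *to, const char *from,
                   const char *bytes, size_t len)
{
  assert (to != NULL && from != NULL && bytes != NULL);

  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return errno;

  char outbuf[64];
  char *inptr = (char *) bytes; /* iconv() takes char **, the bytes are not modified */
  size_t inleft = len;
  char *outptr = outbuf;
  size_t outleft = sizeof outbuf;

  size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  int err = (res == (size_t) -1) ? errno : 0;

  iconv_close (cd);
  return err;
}
```

Called as probe_iconv_errno("ASCII", "UTF-8", "\xff", 1), the input byte does not belong to UTF-8 and EILSEQ is the documented error. Called with the valid sequence "\xe2\x82\xac" (U+20AC), which has no ASCII equivalent, an implementation that behaves as described above for the GNU ones returns the same EILSEQ, while an implementation that substitutes a character returns 0. The caller cannot tell the two EILSEQ cases apart from errno alone.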
This refers to GNU library bug report
where the honourable author of GNU iconv, and YES!, the GNU approach has lots of merits!, but it should be possible to differentiate between the errors.
Even better would be an explicit //CONVERR-STOP-WITH-ENODATA modifier.
refers to gnulib source files where the same approach is implemented portably, it seems, and the cost is tremendous, because of all the shortcomings of the iconv interface!
Like cautiously approaching byte-by-byte until a conversion succeeds!
  for (insize = 1; inptr + insize <= inptr_end; insize++)
    {
      res = iconv (cd,
                   (ICONV_CONST char **) &inptr, &insize,
                   &outptr, &outsize);
      if (!(res == (size_t)(-1) && errno == EINVAL))
        break;
      /* iconv can eat up a shift sequence but give EINVAL while attempting
         to convert the first character.  E.g. libiconv does this.  */
      if (inptr > inptr_before)
        {
          res = 0;
          break;
        }
    }
This is ridiculous!
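To show what this costs a caller, here is a self-contained sketch of the same idea. It is not the gnulib code; the function name first_char_len and its signature are hypothetical. It feeds growing prefixes to iconv() until the call stops reporting an incomplete character, which is the only way to learn where the first character ends without knowing the codeset.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Tries prefixes of 1, 2, ... bytes of `in` until iconv() no longer fails
   with EINVAL (incomplete character).  Returns the number of input bytes
   consumed by that call and stores 0 or the errno value in *err.
   If even the whole input is incomplete, returns 0 with *err == EINVAL. */
static size_t
first_char_len (iconv_t cd, const char *in, size_t inlen, int *err)
{
  assert (in != NULL && err != NULL);

  char outbuf[32];              /* the converted character itself is discarded */

  for (size_t n = 1; n <= inlen; n++)
    {
      char *inptr = (char *) in;  /* iconv() takes char **, bytes are not modified */
      size_t inleft = n;
      char *outptr = outbuf;
      size_t outleft = sizeof outbuf;

      size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);

      /* Incomplete character and nothing consumed: retry with one more byte.
         If something was consumed (e.g. a shift sequence), stop and report. */
      if (res == (size_t) -1 && errno == EINVAL && inptr == in)
        continue;

      *err = (res == (size_t) -1) ? errno : 0;
      return (size_t) (inptr - in);
    }

  *err = EINVAL;
  return 0;
}
```

A caller would open the descriptor once, for example with iconv_open("UTF-8", "UTF-8"), and then call first_char_len() repeatedly while advancing through the buffer. That is one iconv() call per input byte in the worst case, and on an EILSEQ result the caller still does not know whether the character was invalid or merely unconvertible.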
|Tags||No tags attached.|
P.S.: to elaborate some more, any implementation which wants to conform to the upcoming POSIX standard needs to pass the configured state (//IGNORE, //TRANSLIT, ...) along its internal code path, so this addition would be an "easy" one.
Making the behaviour of the most widely used (and standards-incompatible) implementation explicitly addressable would allow future programmers to know what they are doing.
(By means other than performing compile-time tests and fixing the behaviour according to the environment that was tested at compile time.)
I truly believe in the honourable Bruno Haible's approach, if it is addressable like that.
The only other implementation that allows programmers to implement their own handling of output conversion failures is that of AIX (judging from gnulib), because it uses the NUL byte as its implementation-defined conversion. That, of course, fails for data which contains embedded NULs (which iconv(3) can otherwise handle), and it fails per se for all multi-byte encodings with 16-bit or 32-bit units.
In general, the real-life situation is very bad; just look at the effort spent on character set conversion in the widely used and widely available libarchive.
This addition could make things a bit better.
|2023-02-21 00:14||steffen||New Issue|
|2023-02-21 00:14||steffen||Name||=> steffen|
|2023-02-21 00:14||steffen||Section||=> iconv|
|2023-02-21 00:14||steffen||Page Number||=> 1123|
|2023-02-21 00:14||steffen||Line Number||=> 38014|
|2023-02-21 18:20||steffen||Note Added: 0006164|
|2023-03-06 16:35||nick||Relationship added||related to 0001007|