Austin Group Defect Tracker

Aardvark Mark IV


Viewing Issue Simple Details
ID 0001635
Category [1003.1(2016/18)/Issue7+TC2] Base Definitions and Headers
Severity Editorial
Type Clarification Requested
Date Submitted 2023-02-21 00:14
Last Update 2023-03-06 16:35
Reporter steffen View Status public  
Assigned To
Priority normal Resolution Open  
Status New  
Name steffen
Organization
User Reference
Section iconv
Page Number 1123
Line Number 38014
Interp Status ---
Final Accepted Text
Summary 0001635: iconv: please be more explicit in input-not-convertible case
Description Issue 1007 resolves this to:

    If iconv() encounters a character in the input buffer that is valid, but for which an identical character does not exist in the output codeset:

        If either the //IGNORE or the //NON_IDENTICAL_DISCARD indicator suffix was specified when the conversion descriptor cd was opened, the character shall be discarded but shall still be counted in the return value of the iconv() call.

        If the //TRANSLIT indicator suffix was specified when the conversion descriptor cd was opened, an implementation-defined transliteration shall be performed, if possible, to convert the character into one or more characters of the output codeset that best resemble the input character. The character shall be counted as one character in the return value of the iconv() call, regardless of the number of output characters.

        If no indicator suffix was specified when the conversion descriptor cd was opened, or the //TRANSLIT indicator suffix was specified but no transliteration of the character is possible, iconv() shall perform an implementation-defined conversion on the character and it shall be counted in the return value of the iconv() call.
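The three cases quoted above can be observed from the caller's side roughly as follows. This is a minimal sketch, not part of the standard text: `convert_once` is a hypothetical helper name, the codeset names "ASCII" and "UTF-8" are assumed to be accepted by the implementation, and whether a suffix such as "//TRANSLIT" or "//IGNORE" is honoured by iconv_open() depends on the implementation.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from codeset
   `from` to codeset `to` (which may carry a suffix such as "//TRANSLIT").
   Returns 0 on success and stores the number of output bytes in *outlen and
   iconv()'s return value (the count of non-identical conversions) in
   *nonident. On failure returns the errno value and leaves both untouched. */
int convert_once(const char *to, const char *from, const char *in,
                 char *out, size_t outcap, size_t *outlen, size_t *nonident)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return errno;               /* conversion or suffix not supported */

    char *inptr = (char *)in;       /* iconv() wants char **, input is not modified */
    size_t inleft = strlen(in);
    char *outptr = out;
    size_t outleft = outcap;
    int err = 0;

    size_t res = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (res == (size_t)-1) {
        err = errno;                /* save before iconv_close() can change it */
    } else {
        /* Flush any pending shift sequence for stateful encodings. */
        if (iconv(cd, NULL, NULL, &outptr, &outleft) == (size_t)-1)
            err = errno;
    }
    if (err == 0) {
        *outlen = outcap - outleft;
        *nonident = res;
    }
    iconv_close(cd);
    return err;
}
```

Calling it with to set to "ASCII", "ASCII//TRANSLIT" and "ASCII//IGNORE" on input that contains a non-ASCII character shows which of the three branches a given implementation takes, or whether it stops with an error instead.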

However, as Martin Sebor stated in the issue description,

     The specification for the iconv() function assumes that every input sequence that is valid in the source codeset is convertible to some sequence in the destination codeset. In particular, the specification doesn't allow the function to fail when a valid sequence in the source codeset cannot be represented in the destination codeset. As an example where this assumption doesn't hold, consider a conversion from UTF-8 to ISO-8859 where a large number of source characters don't have equivalents in the destination codeset.

     A survey of a subset of existing implementations shows that they fail with EILSEQ in such cases, despite the specification defining the error condition as "Input conversion stopped due to an input byte that does not belong to the input codeset."

And this is true: the GNU C library and GNU libiconv seem to fail the output conversion immediately, with the same EILSEQ error that denotes invalid input data.
(A much more drastic error, is it not?)
Desired Action Please be more explicit and note that implementations exist which behave like GNU C library iconv / GNU libiconv.
That is to say, "implementation-defined conversion" may mean no conversion at all, but an immediate stop.

It would be tremendous if the standard could define hints that programmers can react to, because, due to the restrictions of the iconv interface, it is impossible to decide what the error was.
A programmer knows nothing about the input or output character set: how many bytes may make up a character, how many were consumed or produced, or whether conversion replacements were stored. (In practice, all other implementations known to me do place some character and continue.)
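To illustrate why the error cannot be told apart, here is a small sketch of what a caller is able to observe. The enum and the function `classify` are hypothetical names introduced for this example; the codeset names "ASCII" and "UTF-8" are assumed to be accepted by the implementation.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Outcome categories a caller can actually observe through iconv(). */
enum conv_outcome {
    CONV_OK,      /* all input consumed */
    CONV_EILSEQ,  /* invalid input, or (on some implementations) valid
                     input with no counterpart in the output codeset */
    CONV_EINVAL,  /* incomplete multibyte sequence at end of input */
    CONV_E2BIG,   /* output buffer full */
    CONV_OTHER    /* iconv_open() failed or unexpected errno */
};

enum conv_outcome classify(const char *to, const char *from,
                           const char *bytes, size_t len)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return CONV_OTHER;

    char buf[64];
    char *inptr = (char *)bytes;    /* input is not modified by iconv() */
    char *outptr = buf;
    size_t inleft = len, outleft = sizeof buf;

    enum conv_outcome oc = CONV_OK;
    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1) {
        switch (errno) {
        case EILSEQ: oc = CONV_EILSEQ; break;
        case EINVAL: oc = CONV_EINVAL; break;
        case E2BIG:  oc = CONV_E2BIG;  break;
        default:     oc = CONV_OTHER;  break;
        }
    }
    iconv_close(cd);
    return oc;
}
```

With this, classify("ASCII", "UTF-8", "\xff", 1) (a byte that is never valid in UTF-8) yields CONV_EILSEQ. On an implementation that behaves as described for GNU, classify("ASCII", "UTF-8", "\xc3\xa9", 2) (a valid "é" that ASCII cannot represent) yields the very same CONV_EILSEQ, so the caller has nothing to distinguish the two situations by.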

This refers to GNU library bug report

  https://sourceware.org/bugzilla/show_bug.cgi?id=29913

where the honourable author of GNU iconv refers to gnulib source files in which the same approach is implemented portably, it seems. (And yes, the GNU approach has lots of merits! But it should be possible to differentiate between the errors.

  Better still would be an explicit //CONVERR-STOP-WITH-ENODATA modifier.)

The cost is tremendous because of all the shortcomings of the iconv interface, like cautiously approaching byte by byte until a conversion succeeds:

      for (insize = 1; inptr + insize <= inptr_end; insize++)
        {
          res = iconv (cd,
                       (ICONV_CONST char **) &inptr, &insize,
                       &outptr, &outsize);
          if (!(res == (size_t)(-1) && errno == EINVAL))
            break;
          /* iconv can eat up a shift sequence but give EINVAL while attempting
             to convert the first character. E.g. libiconv does this. */
          if (inptr > inptr_before)
            {
              res = 0;
              break;
            }
        }

This is ridiculous!
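For readers without the gnulib sources at hand, the stepping technique in the fragment above can be put into a self-contained form along these lines. This is a sketch under assumptions, not the gnulib implementation: `first_failure_offset` is a hypothetical name, the scratch buffer size is arbitrary, and the codeset names passed in are assumed to be accepted by iconv_open().

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Feed the input to iconv() one character at a time by growing the input
   window byte by byte while iconv() reports EINVAL (incomplete sequence).
   Returns the byte offset of the first character iconv() refuses, `len` if
   everything converted, or (size_t)-1 if iconv_open() failed. */
size_t first_failure_offset(const char *to, const char *from,
                            const char *in, size_t len)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inptr = (char *)in;       /* input is not modified by iconv() */
    const char *end = in + len;

    while (inptr < end) {
        char buf[32];               /* scratch output, discarded */
        char *before = inptr;
        size_t res = (size_t)-1;

        for (size_t insize = 1; inptr + insize <= end; insize++) {
            char *outptr = buf;
            size_t outsize = sizeof buf;
            size_t n = insize;
            res = iconv(cd, &inptr, &n, &outptr, &outsize);
            if (!(res == (size_t)-1 && errno == EINVAL))
                break;              /* converted, or failed for another reason */
            if (inptr > before) {   /* a shift sequence was consumed */
                res = 0;
                break;
            }
        }
        if (res == (size_t)-1)
            break;                  /* stop at the first refused character */
    }
    iconv_close(cd);
    return (size_t)(inptr - in);
}
```

Even with the offset in hand, the caller still only sees EILSEQ (or a trailing EINVAL) and cannot tell from the interface whether the character at that offset was invalid or merely not representable in the output codeset.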
Tags No tags attached.
Attached Files

- Relationships
related to 0001007 Applied ajosey 1003.1(2013)/Issue7+TC1 iconv function not allowed to fail to convert valid sequences

-  Notes
(0006164)
steffen (reporter)
2023-02-21 18:20

P.S.: To elaborate some more: any implementation that wants to conform to the upcoming POSIX standard needs to pass the configured state (//IGNORE, //TRANSLIT, ...) along its internal code path, so this addition would be an "easy" one.

Adding this very helpful addition, which makes the behaviour of the most widely used (and standards-incompatible) implementation explicitly addressable, allows future programmers to know what they are doing.
(By means other than performing compile-time tests and fixing the behaviour according to the compile-time-tested environment.)

I truly believe in the honourable Bruno Haible's approach, if it is addressable like that.

The only other implementation that allows programmers to implement their own handling of output conversion failures is that of AIX (judging from gnulib), because it uses the NUL byte as its implementation-defined conversion. That, of course, fails for data which contains embedded NULs (which, for one, iconv(3) can otherwise handle, and which, second, fails per se for all multi-byte encodings with 16-bit or 32-bit units).

In general, the real-life situation is very bad; just look at the effort spent on character set conversion in the widely used and widely available libarchive.
This addition could make things a bit better.

- Issue History
Date Modified Username Field Change
2023-02-21 00:14 steffen New Issue
2023-02-21 00:14 steffen Name => steffen
2023-02-21 00:14 steffen Section => iconv
2023-02-21 00:14 steffen Page Number => 1123
2023-02-21 00:14 steffen Line Number => 38014
2023-02-21 18:20 steffen Note Added: 0006164
2023-03-06 16:35 nick Relationship added related to 0001007

