0001635: iconv: please be more explicit in input-not-convertible case - Austin Group Defect Tracker

ID: 0001635
Category: [1003.1(2016/18)/Issue7+TC2] Base Definitions and Headers
Severity: Editorial
Type: Clarification Requested
Date Submitted: 2023-02-21 00:14
Last Update: 2023-03-06 16:35
Reporter: steffen
View Status: public
Assigned To:
Priority: normal
Resolution: Open
Status: New
Name: steffen
Organization:
User Reference:
Section: iconv
Page Number: 1123
Line Number: 38014
Interp Status: ---
Final Accepted Text:

Summary

0001635: iconv: please be more explicit in input-not-convertible case

Description

Issue 1007 resolves this to:

    If iconv() encounters a character in the input buffer that is valid, but for which an identical character does not exist in the output codeset:

        If either the //IGNORE or the //NON_IDENTICAL_DISCARD indicator suffix was specified when the conversion descriptor cd was opened, the character shall be discarded but shall still be counted in the return value of the iconv() call.

        If the //TRANSLIT indicator suffix was specified when the conversion descriptor cd was opened, an implementation-defined transliteration shall be performed, if possible, to convert the character into one or more characters of the output codeset that best resemble the input character. The character shall be counted as one character in the return value of the iconv() call, regardless of the number of output characters.

        If no indicator suffix was specified when the conversion descriptor cd was opened, or the //TRANSLIT indicator suffix was specified but no transliteration of the character is possible, iconv() shall perform an implementation-defined conversion on the character and it shall be counted in the return value of the iconv() call.

However, as Martin Sebor stated in the issue description,

     The specification for the iconv() function assumes that every input sequence that is valid in the source codeset is convertible to some sequence in the destination codeset. In particular, the specification doesn't allow the function to fail when a valid sequence in the source codeset cannot be represented in the destination codeset. As an example where this assumption doesn't hold, consider a conversion from UTF-8 to ISO-8859 where a large number of source characters don't have equivalents in the destination codeset.

     A survey of a subset of existing implementations shows that they fail with EILSEQ in such cases, despite the specification defining the error condition as "Input conversion stopped due to an input byte that does not belong to the input codeset."

And this is true: the GNU C library and GNU libiconv seem to fail the output conversion immediately, with the same EILSEQ error that denotes invalid input data.
(Which is a much more drastic error, is it not?)
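
The ambiguity can be observed with a small probe. The following is a minimal sketch, not taken from the report: try_convert() is a hypothetical helper, and the codeset names "UTF-8" and "ISO-8859-1" are assumed to be accepted by the local iconv_open().

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `in` from codeset `from` to codeset `to`
   into a scratch buffer. Returns 0 on success, otherwise the errno value
   reported by iconv_open() or iconv(). */
static int try_convert(const char *from, const char *to,
                       const char *in, size_t inlen)
{
    assert(from != NULL && to != NULL && in != NULL);

    char out[64];
    char *inptr = (char *)in;   /* iconv() takes char **, the bytes are not modified */
    char *outptr = out;
    size_t outleft = sizeof out;
    int err = 0;

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return errno;

    if (iconv(cd, &inptr, &inlen, &outptr, &outleft) == (size_t)-1)
        err = errno;            /* the only error detail the interface offers */

    iconv_close(cd);
    return err;
}
```

try_convert("UTF-8", "ISO-8859-1", "\xff", 1) passes a byte that is not valid UTF-8, while "\xe2\x82\xac" (the euro sign) is valid UTF-8 without a counterpart in ISO-8859-1. With the behaviour described above for the GNU implementations, both calls would report EILSEQ; an implementation that stores a replacement character would return 0 for the second call.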

Desired Action

Please be more explicit and note that implementations exist which behave like the GNU C library iconv and GNU libiconv.
That is to say, "implementation-defined conversion" may mean no conversion at all, but an immediate stop.

It would be tremendous if the standard could define something that programmers can react to, because, due to the restrictions of the iconv interface, it is impossible to decide what the error was.
A programmer knows nothing about the input or output character set: how many bytes may make up a character, how many were consumed or produced, or whether conversion replacements were stored. (In practice, all other implementations known to me place some character and continue.)
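
To illustrate how much extra work a caller has to do today, here is one partial workaround. It is a sketch under assumptions and not something the report proposes: after an EILSEQ, the same input is decoded a second time into a Unicode codeset to see whether the input itself was valid. input_looks_valid() is a hypothetical helper, and it assumes that the local iconv_open() accepts "UTF-32" and that every character of the source codeset is representable there.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `in` decodes cleanly from codeset `from`,
   0 if iconv() reports EILSEQ, and -1 if the probe itself cannot run.
   Assumption: every character of `from` is representable in UTF-32. */
static int input_looks_valid(const char *from, const char *in, size_t inlen)
{
    assert(from != NULL && in != NULL);

    iconv_t cd = iconv_open("UTF-32", from);
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;   /* iconv() takes char **, the bytes are not modified */
    int result = 1;

    while (inlen > 0) {
        char scratch[256];
        char *outptr = scratch;
        size_t outleft = sizeof scratch;

        if (iconv(cd, &inptr, &inlen, &outptr, &outleft) != (size_t)-1)
            break;              /* whole input consumed */
        if (errno == E2BIG)
            continue;           /* scratch full: discard the output, keep going */
        if (errno == EILSEQ)
            result = 0;         /* invalid input byte sequence */
        else if (errno != EINVAL)
            result = -1;        /* unexpected failure */
        break;                  /* EINVAL (truncated tail) is not counted as invalid */
    }

    iconv_close(cd);
    return result;
}
```

If the probe returns 1 although the real conversion failed with EILSEQ, the failure was presumably caused by a character the output codeset cannot represent. The price is a second descriptor and a second pass over the data, which is exactly the kind of cost an explicit error indication would avoid.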

This refers to GNU library bug report

  https://sourceware.org/bugzilla/show_bug.cgi?id=29913

where the honourable author of GNU iconv refers to gnulib source files in which the same approach is implemented portably, it seems. And yes, the GNU approach has lots of merits, but it should be possible to differentiate between the errors:

  Better even would be an explicit //CONVERR-STOP-WITH-ENODATA modifier.

The cost of the gnulib approach is tremendous because of all the shortcomings of the iconv interface, like cautiously approaching byte by byte until a conversion succeeds:

      for (insize = 1; inptr + insize <= inptr_end; insize++)
        {
          res = iconv (cd,
                       (ICONV_CONST char **) &inptr, &insize,
                       &outptr, &outsize);
          if (!(res == (size_t)(-1) && errno == EINVAL))
            break;
          /* iconv can eat up a shift sequence but give EINVAL while attempting
             to convert the first character. E.g. libiconv does this. */
          if (inptr > inptr_before)
            {
              res = 0;
              break;
            }
        }

This is ridiculous!
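
For readers without the gnulib sources at hand, the loop above can be wrapped into a self-contained sketch. convert_one_char() is a hypothetical helper modelled on the excerpt, not gnulib's actual function; it assumes that iconv() leaves the input pointer at the start of an incomplete character when it fails with EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert exactly one character by offering iconv()
   1, 2, 3, ... input bytes until it stops reporting EINVAL (incomplete
   input). Returns the number of input bytes consumed, or 0 on failure
   with errno describing the reason. */
static size_t convert_one_char(iconv_t cd, const char *in, size_t inlen,
                               char **outptr, size_t *outleft)
{
    assert(in != NULL && outptr != NULL && outleft != NULL);

    char *inptr = (char *)in;   /* iconv() takes char **, the bytes are not modified */

    for (size_t insize = 1; insize <= inlen; insize++) {
        size_t left = insize;   /* iconv() decrements this as it consumes bytes */

        if (iconv(cd, &inptr, &left, outptr, outleft) != (size_t)-1)
            return (size_t)(inptr - in);    /* one character converted */
        if (errno != EINVAL)
            return 0;                       /* EILSEQ, E2BIG, ...: give up */
        if (inptr > in)
            return (size_t)(inptr - in);    /* only a shift sequence was consumed */
    }

    errno = EINVAL;             /* ran out of input inside a character */
    return 0;
}
```

Every character costs up to as many iconv() calls as it has bytes, which makes the overhead complained about above concrete.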

Tags

No tags attached.

Relationships

related to 0001007: iconv function not allowed to fail to convert valid sequences (Applied, ajosey, 1003.1(2013)/Issue7+TC1)

Notes
(0006164) steffen (reporter) 2023-02-21 18:20

P.S.: To elaborate some more: any implementation which wants to conform to the upcoming POSIX standard needs to pass the configured state (//IGNORE, //TRANSLIT, ..) along its internal code path, so this addition would be an "easy" one. Making the behaviour of the most widely used (and standards-incompatible) implementation explicitly addressable would allow future programmers to know what they are doing (by means other than performing compile-time tests and fixing behaviour according to the compile-time-tested environment). I truly believe in the honourable Bruno Haible's approach, if it is addressable like that.

The only other implementation that allows programmers to implement their own handling of output conversion failures is that of AIX (looking at gnulib), because it uses the NUL byte as its implementation-defined conversion. But that, of course, will fail for data which contains embedded NULs (which, for one, iconv(3) can otherwise handle, and which, second, will fail per se for all multi-byte encodings, as in 16-bit or 32-bit ones). In general, the real-life situation is very bad; just look at the efforts spent on character set conversion in the widely used and widely available libarchive. This addition could make things a bit better.

Issue History
Date Modified	Username	Field	Change
2023-02-21 00:14	steffen	New Issue
2023-02-21 00:14	steffen	Name	=> steffen
2023-02-21 00:14	steffen	Section	=> iconv
2023-02-21 00:14	steffen	Page Number	=> 1123
2023-02-21 00:14	steffen	Line Number	=> 38014
2023-02-21 18:20	steffen	Note Added: 0006164
2023-03-06 16:35	nick	Relationship added	related to 0001007
