Austin Group Defect Tracker

ID 0001008
Category [1003.1(2013)/Issue7+TC1] System Interfaces
Severity Objection
Type Clarification Requested
Date Submitted 2015-11-16 22:24
Last Update 2024-06-11 09:02
Reporter steffen
View Status public
Assigned To ajosey
Priority normal
Resolution Accepted As Marked
Status Closed
Name steffen
Organization
User Reference
Section Vol.2, System Interfaces, iconv
Page Number 1109
Line Number 37302 ff.
Interp Status ---
Final Accepted Text Note: 0003326
Summary 0001008: 1. clarify iconv(3) reset usage; 2. truly support Unicode character input
Description For the (iconv_t, NULL, NULL, &OBUF, &OBUF_LEN) usage case, POSIX says

  When iconv( ) is called in this way,..
  [it] shall place, into the output buffer, the byte sequence to change the output buffer to its initial shift state.

POSIX states at a different place (Vol. 1, Base Definitions, 6.4.1 State-Dependent Character Encodings, 2.; p. 133, l. 3830 ff.)

  A utility that divides, truncates, or extracts substrings from statefully encoded data shall produce output that contains locking shifts at the beginning or end of the resulting data, if appropriate, to retain correct state information.

Effectively, a string must be "atomic" with regard to its locking shift state; otherwise it could not be used by itself. Therefore a reset sequence has to be emitted if "normal US-ASCII" text is about to follow (e.g., after an RFC 2047 encoded word in an e-mail header that uses a stateful encoding).

I wonder whether placing this reset sequence shouldn't be a mandatory task before iconv_close(), since only then

  $ cat file1 file2 > file3

would work according to the above wording if file1 has been created from strings that were converted via iconv(); POSIX doesn't say that a newline character causes a locking shift state reset.

Making the (iconv_t, NULL, NULL, &OBUF, &OBUF_LEN) case mandatory before iconv_close() would enable character set conversion to reliably detect and compose ISO 10646 / Unicode constructs like decomposed character sequences and even grapheme cluster boundaries.
These "techniques" are basic concepts of Unicode and their understanding may be mandatory in order to be able to perform a correct input charset to output charset conversion.
On the mailing list examples have been given, one replication of which can be found in [1].

  [1] http://austingroupbugs.net/view.php?id=249#c2923

It has to be noted that "hacks" exist today to work around the absence of the envisaged new requirement; e.g., many iconv implementations ship with a special "UTF-8-MAC" character set that I think does nothing but support decomposed characters. It seems, however, that Apple has chosen not to honour the Unicode standard completely and leaves some character ranges undecomposed due to some internal compatibility problems [2]. Apart from that, it is decomposed Unicode.

  [2] https://developer.apple.com/library/mac/qa/qa1173/_index.html
Desired Action 1. It should be clarified whether it is necessary to explicitly place a reset sequence after input processing is complete, before iconv_close(). Since iconv() doesn't know that the end of the input has been reached, it could otherwise not ensure that the resulting data is valid according to the POSIX specification.

2. If the above is true and the clarification is applied in the envisaged way, POSIX should enhance the iconv description so that it not only talks about state-dependent encodings but also considers Unicode / ISO 10646 text processing requirements. Character composition for the output character set may be possible only after applying Unicode composition and grapheme cluster boundary detection to the input data, which may require holding back output until text processing detects a true boundary that can be emitted to the output character set.
Tags tc3-2008
Attached Files

- Relationships

-  Notes
(0002966)
steffen (reporter)
2015-11-17 16:59

1. Excuse me please, this should have been committed to Issue7+TC1; I searched before I started editing, and obviously I forgot to switch the form back. I don't know how this could be changed except by opening another issue.

2. I also think, apart from the above, that

37302 For state-dependent encodings, the conversion descriptor cd is placed into its initial shift state by
37303 a call for which inbuf is a null pointer, or for which inbuf points to a null pointer. When iconv( ) is
37304 called in this way, and if outbuf is not a null pointer or a pointer to a null pointer, and outbytesleft
37305 points to a positive value, iconv( ) shall place, into the output buffer, the byte sequence to change
37306 the output buffer to its initial shift state.

should be changed to

  For state-dependent encodings, the conversion descriptor cd is placed into its initial shift state by a call for which inbuf is a null pointer, or for which inbuf points to a null pointer. When iconv( ) is called in this way, and if outbuf is not a null pointer or a pointer to a null pointer, and outbytesleft points to a positive value, iconv( ) shall place, into the output buffer, the byte sequence to change the output buffer to its initial shift state, if the former state of the conversion descriptor cd mandates so.
(0003326)
geoffclare (manager)
2016-08-04 16:35
edited on: 2016-08-11 15:22

Add to application usage as a new paragraph after P1110, L37364:

It is the responsibility of the application to ensure that, if the output codeset has a locking-shift encoding, the output buffer is returned to its initial shift state when conversion is completed. This can be accomplished by calling iconv() with inbuf as a null pointer, or with inbuf pointing to a null pointer, before calling iconv_close(). Since the standard does not provide a way to query whether a codeset has a locking-shift encoding, it is recommended that applications always call iconv() in this way before calling iconv_close().
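
(Illustration, not part of the accepted text.) The recommended call order might be sketched as follows. The helper name convert_and_reset, its parameters, and its single-shot error handling are assumptions made for this example; a fuller version would loop on E2BIG and grow the output buffer. Codeset names passed to iconv_open() are implementation-defined.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert inlen bytes of `in` from codeset `from`
 * to codeset `to`, writing at most outcap bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure.
 */
size_t convert_and_reset(const char *to, const char *from,
                         const char *in, size_t inlen,
                         char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;      /* iconv() takes char **, hence the cast */
    size_t inleft = inlen;
    char *outp = out;
    size_t outleft = outcap;
    size_t written = (size_t)-1;

    /* Step 1: convert the input (one call only in this sketch). */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1) {
        /*
         * Step 2: inbuf is a null pointer, so iconv() appends whatever
         * byte sequence returns the output to its initial shift state.
         * Done unconditionally, since there is no standard way to ask
         * whether the output codeset has a locking-shift encoding.
         */
        if (iconv(cd, NULL, NULL, &outp, &outleft) != (size_t)-1)
            written = outcap - outleft;
    }

    /* Step 3: only now release the conversion descriptor. */
    iconv_close(cd);
    return written;
}
```

For a stateless output codeset the reset call is expected to add no bytes, so calling it unconditionally costs nothing; for a locking-shift output codeset it is what makes the result safe to concatenate with other data.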

(0003335)
geoffclare (manager)
2016-08-11 15:23

Note: 0003326 was edited during the 2016-08-11 teleconference to add the last sentence.

- Issue History
Date Modified Username Field Change
2015-11-16 22:24 steffen New Issue
2015-11-16 22:24 steffen Status New => Under Review
2015-11-16 22:24 steffen Assigned To => ajosey
2015-11-16 22:24 steffen Name => steffen
2015-11-16 22:24 steffen Section => Vol.2, System Interfaces, iconv
2015-11-16 22:24 steffen Page Number => 1109
2015-11-16 22:24 steffen Line Number => 37302 ff.
2015-11-17 16:59 steffen Note Added: 0002966
2015-11-17 17:03 geoffclare Project 1003.1(2008)/Issue 7 => 1003.1(2013)/Issue7+TC1
2016-08-04 16:35 geoffclare Note Added: 0003326
2016-08-04 16:35 geoffclare Interp Status => ---
2016-08-04 16:35 geoffclare Final Accepted Text => Note: 0003326
2016-08-04 16:35 geoffclare Status Under Review => Resolved
2016-08-04 16:35 geoffclare Resolution Open => Accepted As Marked
2016-08-04 16:36 geoffclare Tag Attached: tc3-2008
2016-08-11 15:22 geoffclare Note Edited: 0003326
2016-08-11 15:23 geoffclare Note Added: 0003335
2019-10-21 13:42 geoffclare Status Resolved => Applied
2024-06-11 09:02 agadmin Status Applied => Closed

