View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000601 | 1003.1(2008)/Issue 7 | System Interfaces | public | 2012-08-15 15:57 | 2019-06-10 08:55 |
Reporter | nick | Assigned To | ajosey | ||
Priority | normal | Severity | Comment | Type | Clarification Requested |
Status | Closed | Resolution | Accepted As Marked | ||
Name | Nick Stoughton | ||||
Organization | USENIX | ||||
User Reference | nms-mbsnrtowcs-001 | ||||
Section | mbsnrtowcs | ||||
Page Number | 1277 | ||||
Line Number | 41975 | ||||
Interp Status | Approved | ||||
Final Accepted Text | 0000601:0001568 | ||||
Summary | 0000601: mbsnrtowcs clarification | ||||
Description | In austin-group-l:archive/latest/17532 Matthew Dempsky posed the following question: On Ubuntu 10.04, the code below prints "0 2". This is the behavior that I think logically makes sense (and that I was intending to implement for OpenBSD). However, my reading of mbsnrtowcs() description in Issue 7 is that the correct output (assuming "en_US.UTF-8" is a valid UTF-8 based locale) should be "0 0". Issue 7 says: """ If dst is not a null pointer, the pointer object pointed to by src shall be assigned either a null pointer (if conversion stopped due to reaching a terminating null character) or the address just past the last character converted (if any). """ However, in my test program, mbs+2 is in the *middle* of a [multi-byte] character, not "just past" a [multi-byte] character. Ubuntu 10.04's behavior would be consistent if the description was "just past the last input byte consumed". Am I misunderstanding something? Or is there a bug in either Ubuntu 10.04's implementation or the POSIX wording? #include <wchar.h> #include <locale.h> #include <string.h> #include <stdio.h> wchar_t wcs[100]; char mbs[100]; int main() { setlocale(LC_CTYPE, "en_US.UTF-8"); memcpy(mbs, "\xe7\x95\x8c", 4); const char *s = mbs; printf("%u ", (unsigned)mbsnrtowcs(wcs, &s, 2, 100, NULL)); printf("%u\n", (unsigned)(s - mbs)); } Further discussion noted that 'C99 does in fact state that mbstate_t's conversion state includes tracking "the position within a multibyte character", so multibyte character string inputs do not necessarily need to be processed exclusively at multibyte character boundaries. E.g., it's okay to call mbrtowc() to process one byte at a time of a multibyte string.' But more importantly, do any implementations of mbsnrtowcs() print "0 0"? Glibc, FreeBSD, and OS X all print "0 2". 
If no implementation actually prints "0 0", then I think it makes sense to revise the wording for mbsnrtowcs() to "just past the last byte processed" instead of "just past the last multibyte character converted". --- Given that a number of implementations do not follow the apparent requirements of the standard to process the src string character by character rather than byte by byte, I believe a formal interpretation is required. | ||||
Desired Action | EITHER: Add explicit permission for mbsnrtowcs to advance the src pointer to the middle of a multibyte character when the nmc value does not include all the bytes of the final character. (this matches existing behavior on at least Glibc, FreeBSD, and OS X) OR: Make a partial conversion (i.e. where (*src)[nmc] is in the middle of a multi-byte character) into an illegal sequence (as if the src string was actually truncated at this point). OR: further explain why the current wording is correct in the light of several implementations that implement something else. This is change 1 above, and the preferred approach: At page 1277 line 41975-41979 change: If dst is not a null pointer, the pointer object pointed to by src shall be assigned either a null pointer (if conversion stopped due to reaching a terminating null character) or the address just past the last character converted (if any). If conversion stopped due to reaching a terminating null character, and if dst is not a null pointer, the resulting state described shall be the initial conversion state. to If dst is not a null pointer, the pointer object pointed to by src shall be assigned either a null pointer (if conversion stopped due to reaching a terminating null character) or the address just past the last byte converted (if any). If conversion stopped due to reaching a terminating null character or nmc bytes into src, and if dst is not a null pointer, the resulting state described shall be the initial conversion state. | ||||
Tags | c99, tc2-2008 |
|
"If conversion stopped due to reaching a terminating null character or nmc bytes into src, and if dst is not a null pointer, the resulting state described shall be the initial conversion state." Hm, I don't agree with this change. I think the conversion state should be left as is if the conversion stops after nmc bytes. |
|
Initial Interpretation response (superseded by 0000601:0001568) ------------------------------- The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor. Rationale: ------------- None. Notes to the Editor (not part of this interpretation): ------------------------------------------------------- At page 1277 line 41977 change: past the last character converted (if any) to: past the last byte processed (if any) At page 1277 line 41986 change: ... limited to at most nmc bytes (the size of the input buffer). to (all within the CX shading): ... limited to at most nmc bytes (the size of the input buffer). If the input buffer ends with an incomplete character, it is unspecified whether conversion stops at the end of the previous character (if any), or at the end of the input buffer. In the latter case, a subsequent call to mbsnrtowcs() with an input buffer that starts with the remainder of the incomplete character shall correctly complete the conversion of that character. At page 1278 line 42008 change FUTURE DIRECTIONS from: None. to: A future version may require that when the input buffer ends with an incomplete character, conversion stops at the end of the input buffer. |
|
Minor nit: Is saying "past the last byte converted" the same as "past the last byte processed"? I ask because other wording in the standard (e.g., mbrtowc()'s (size_t)-2 return value description) uses "processed" to refer to input bytes that have been consumed for conversion but don't yield a complete character, but "converted" only seems to be used to refer to complete characters. |
|
I agree that it should say "processed", and I believe this will easily achieve consensus, so I have updated 0000601:0001381 accordingly. I have also updated the desired action in 0000616 to match. |
|
Interpretation Proposed 29 Mar 2013 |
|
Antoine Leca pointed out in comp.std.c that the page 1277 line 41977 change from "past the last character converted (if any)" to "past the last byte processed (if any)" applies to mbsrtowcs() as well as mbsnrtowcs(), and therefore is potentially introducing a conflict with the C Standard. Perhaps we should leave that line alone, and deal with this in the later CX shaded text so that it only applies to mbsnrtowcs(). Note that if we do this, we should also change 0000616 to match. |
|
I think adding CX shading is fine to be careful, but my interpretation is that saying "past the last byte processed" is the same as "past the last character converted" for mbsrtowcs(). The conversions are done by mbrtowc(), and my reading of mbrtowc() is that if it fails with EILSEQ, then the bytes have only been inspected, not processed. |
|
Interpretation response ------------------------ The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor. Rationale: ------------- None. Notes to the Editor (not part of this interpretation): ------------------------------------------------------- At page 1277 line 41986 change: except that the conversion of characters pointed to by src is limited to at most nmc bytes (the size of the input buffer). to (all within the CX shading): except that the conversion of characters indirectly pointed to by src is limited to at most nmc bytes (the size of the input buffer), and under conditions where mbsrtowcs() would assign the address just past the last character converted (if any) to the pointer object pointed to by src, mbsnrtowcs() shall instead assign the address just past the last byte processed (if any) to that pointer object. If the input buffer ends with an incomplete character, it is unspecified whether conversion stops at the end of the previous character (if any), or at the end of the input buffer. In the latter case, a subsequent call to mbsnrtowcs() with an input buffer that starts with the remainder of the incomplete character shall correctly complete the conversion of that character. At page 1278 line 42008 change FUTURE DIRECTIONS from: None. to: A future version may require that when the input buffer ends with an incomplete character, conversion stops at the end of the input buffer. |
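The two permitted behaviors in the response above can be told apart at run time. The following is a minimal sketch, not part of the interpretation: `use_utf8_locale()` and `probe_partial_policy()` are hypothetical helper names, the locale names tried are assumptions about what is installed, and the check on the converted value assumes `wchar_t` holds Unicode code points.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: select any UTF-8 locale that happens to be installed. */
static int use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/*
 * Returns 0 if conversion stopped at the end of the previous character
 * (nothing consumed), 1 if it stopped at the end of the input buffer and a
 * second call completed the character into *out, and -1 if no UTF-8 locale
 * was found or the implementation did something else.
 */
static int probe_partial_policy(wchar_t *out)
{
    static const char mbs[] = "\xe7\x95\x8c";   /* U+754C, three bytes */
    wchar_t wcs[4];
    mbstate_t st;
    const char *s = mbs;

    assert(out != NULL);
    if (!use_utf8_locale())
        return -1;
    memset(&st, 0, sizeof st);                  /* initial conversion state */

    /* Offer only the first two bytes of the three-byte character. */
    if (mbsnrtowcs(wcs, &s, 2, 4, &st) != 0)
        return -1;
    if (s == mbs)
        return 0;                               /* stopped before the fragment */
    if (s != mbs + 2)
        return -1;

    /* The partial character is held in st; supply the final byte. */
    if (mbsnrtowcs(wcs, &s, 1, 4, &st) != 1 || wcs[0] != 0x754C)
        return -1;
    *out = wcs[0];
    return 1;
}
```

A caller would pass the address of a `wchar_t` and branch on the result; either 0 or 1 is acceptable under the text proposed above.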
|
Just a note, for the example cited: If the function is behaving as if calls to mbrtowc() are being done, the return values should be similar... In this case it should have returned (size_t) -2 to indicate incomplete character, not 2 to indicate how many partially processed chars were done, or the value reserved for such indication. A successive call with the same parameters would return 1 to indicate three characters, i.e. nmc from first call + num<=nmc in second, contributed to the one wide character which was successfully converted and stored and src would be set to null to indicate a terminating '\0' was found before nmc bytes exhausted in the second call. (size_t)-2 is not listed as one of the possible return values, however, which is more the bug in this case for POSIX as it's a lack of consistency. mbrtowc() needs this value because it isn't allowed to modify src to indicate '\0' found, it uses the 0 return value for this. However, glibc, et al, should be returning 0 as the substitute for (size_t)-2, even though it isn't documented for this purpose explicitly, on the first call to indicate that no complete wide character has been processed, but a successive call still might produce a wide char, since (size_t)-1 not warranted. Returning > 0 indicates, falsely in this case, that at least one wide char was successfully stored, as that is how mbrtowc() interprets it. Returning >0 from these functions is supposed to mean 1 or more wchars was stored and that IS explicit. The src pointer should still be pointing to the first byte after the first call and the fact that ps has been modified would be used to advance the pointer to try for a successful conversion on that subsequent call. This is consistent with the wording that "Conversion shall stop early in either of the following cases: • A sequence of bytes is encountered that does not form a valid character." 
A partial character is not fully valid yet, so the conversion should leave src at the start of the sequence, with ps set to continue the conversion, not at the end of the buffer pointing at a '\0'. This leaves src so it can be immediately copied to the head of that same buffer and the remaining chars making up that wide char can be appended from elsewhere to continue the conversion. src gets reset to the buffer head position value before attempting completing that conversion and doing any others. Also, a restart as the 'r' in mbrtowc() stands for implies start pos stays the same. Otherwise it would be 'c' for continuable, most likely. This would force the extra step of adding the bytes processed returned to the buffer head pointer to get back to the start of the fragment. Both mbsnr and mbsr should return 0 on such a fragment. So, both are nominally buggy, but POSIX less so than the others, as consistent behavior can be inferred by cross referencing to mbrtowc() in this way. I think this would be enough to make a conformance distinction too, as is, though the language should be cleaned up to specify that is the function of 0 as return value and how this relates to ending position of src, rather than having to infer it. mbrtowc() could be considered buggy in that a successful conversion should always return from 1 to MB_CUR_MAX, not 1 to n, after all bytes processed so that a wchar could be stored, whether in a single call or multiple calls were required, to show how much of src was used so it can be advanced properly to the next char beginning a possible wchar by the caller. Otherwise the application has to do the same num of chars partially processed managed by ps already when (size_t)-2 is returned, since src is not advanced by the interface, which to me doesn't make sense as a requirement. It complicates the implementation of these two functions, at least, as shown above and the use by applications directly. 
Also, the return value (size_t)-2 should have text explicitly indicating that it is the value to be returned when a null '\0' byte is encountered and more chars are needed to finish the current pending wchar. This condition is an interruption of the conversion process, whether n bytes reached or not, so neither 0 nor (size_t)-1 particularly appropriate. 0 is appropriate when '\0' is the first byte examined, after a possible shift code to return ps to init state. Otherwise should just mean 'more data required' as for the tail fragments of mbs(n)r. (size_t)-1 should only be returned when the encoding specification has assigned a particular byte sequence as invalid explicitly. |
|
The correct return value for mbrtowc when the mbstate_t object has recorded the position in a partially decoded character (*) and the next byte to be processed cannot continue or complete that character is (size_t)-1. This is true whether the next byte to be processed is '\0' or something else. If the next byte to be processed is '\0', returning (size_t)-2 is non-conforming because ISO C forbids the null byte from appearing as part of any multibyte character except itself. A return value of (size_t)-2 would indicate that the null byte is contributing to a complete character but the character has not yet been completed; this is forbidden. The poster of note 1703 should be aware that none of this is relevant to this issue report at hand. Moreover, this tracker is for the work of the Austin Group in clarifying issues that arise attempting to interpret the standard and maintaining/developing the standard, not for use as a personal blog. I do not speak for the group, but I believe I'm not being unreasonable to ask you to stop posting long off-topic ramblings here, especially when the information contained in your posts is often of questionable accuracy and posted as fact rather than as asking for an interpretation. I believe it is distracting from the standards process myself and others are trying to participate in, and confusing to others who are trying to follow the Austin Group's work. (*) Note that position within a multibyte character is different from a shift state. See C99 7.24.6 Extended multibyte/wide character conversion utilities, paragraph 4. |
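The return-value rule stated in this note can be exercised directly. The sketch below is illustrative only: `use_utf8_locale()` and `feed_after_partial()` are hypothetical names, and the locale names tried are assumptions about the host system. It feeds two bytes of a three-byte UTF-8 character to mbrtowc(), then one further byte using the same mbstate_t.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: select any UTF-8 locale that happens to be installed. */
static int use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/*
 * Feed the first two bytes of U+754C (E7 95 8C) to mbrtowc(), then feed the
 * single byte `next` with the same conversion state.  Returns the result of
 * the second call, or 0 if the first call did not report (size_t)-2.
 */
static size_t feed_after_partial(char next, wchar_t *out)
{
    mbstate_t st;

    assert(out != NULL);
    memset(&st, 0, sizeof st);                  /* initial conversion state */
    if (mbrtowc(out, "\xe7\x95", 2, &st) != (size_t)-2)
        return 0;                               /* unexpected: not a partial character */
    return mbrtowc(out, &next, 1, &st);
}
```

In a UTF-8 locale, passing `'\x8c'` should complete the character and return 1, while passing `'\0'` should return (size_t)-1, since the null byte cannot continue the pending character.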
|
Excuse Me? This was left as "No conformance distinction can be made" and I'm trying to show that such distinctions can be made so implementations are more reliably portable. I fail to see how that's not on-topic. That the logic chains to reconcile discrepancies may get long-winded is regrettable, but I'm trying to be proactive, not "blogging". What I am saying in this case is that the '\0' is indicating 'end-of-buffer', not participating in the decoding process. This allows the mbsrtowcs() function to stop at the last fragment, as it's supposed to, not report every string with a trailing fragment as (size_t)-1 EILSEQ. Does this mean that other interfaces that currently require only -1 be returned should be examined as being under-specified? Probably. I'm not proposing normative wording though because the issue has been closed, and giving possible reasons the group as a whole may want to reopen it. If you're not one of them, fine, but don't get all holier-than-thou either. You've shown some of your own misconceptions already. |
|
OK, I understand better what you're trying to say now, but it's wrong and contrary to the normative requirements of ISO C. If it were correct, it would make it impossible for ISO C's multibyte facilities to support UTF-8, because they could not detect encoding errors or distinguish the strings "abc\xc2" and "abc"; both would convert to the same wide character sequence with no indication that the first contains an encoding error. I already cited the C requirements the return value of mbrtowc so I don't think I need to cite them again. Your claim of relevance is also incorrect. The issue at hand is the mbsnrtowcs function, not mbsrtowcs. Per ISO C, mbsrtowcs is required to return (size_t)-1 if the string ends with an incomplete character, which is an encoding error. The mbsrtowcs function is for processing complete strings; it cannot process fragments. POSIX added the mbsnrtowcs function for the purpose of processing fragments. The interpretation response that "No conformance distinction can be made" does not indicate inaction. It means that the existing text in the standard was insufficient to resolve the issue, and that textual changes (which follow) are being proposed to resolve the issue. As noted in the proposed "future directions", a future version of the standard may require processing of a final partial character, but the text fixing the ambiguity allows either historical behavior so as not to make incompatible changes in a technical corrigendum. So for now, conforming applications which use mbsnrtowcs to process fragments must keep any unprocessed input bytes (of which there may be zero, if the fragment does not end with a partial character or if the implementation takes the option to process these bytes and store the partial character in the mbstate_t object) and prepend them to the next fragment when calling mbsnrtowcs. 
This has no negative impact on application portability, and in fact applications were already doing this anyway, since (as I understand it) most implementations do not process a final partial character. The possible change mentioned in "future directions" has nothing to do with the ability to write portable programs; all it would achieve is allowing slightly simpler and more efficient program logic (especially for programs which may be processing text from read-only buffers). |
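The carry-over technique described in this note (keep any unprocessed bytes and prepend them to the next fragment) might look like the following. This is a sketch under stated assumptions, not code from any implementation: `struct frag_conv`, `frag_feed()`, `demo_split()` and `use_utf8_locale()` are hypothetical names, the 64-byte buffer is arbitrary, fragments are assumed to contain no null bytes, and the final comparison assumes `wchar_t` holds Unicode code points.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: select any UTF-8 locale that happens to be installed. */
static int use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical fragment converter: buf holds bytes not yet consumed. */
struct frag_conv {
    mbstate_t st;
    char buf[64];
    size_t len;
};

/*
 * Append n bytes to the pending input and convert what is possible into dst.
 * Returns the number of wide characters stored, or (size_t)-1 on an encoding
 * error or if the bytes do not fit in buf.
 */
static size_t frag_feed(struct frag_conv *fc, const char *data, size_t n,
                        wchar_t *dst, size_t dstlen)
{
    const char *src;
    size_t r, left;

    assert(fc != NULL && data != NULL && dst != NULL);
    if (n > sizeof fc->buf - fc->len)
        return (size_t)-1;
    memcpy(fc->buf + fc->len, data, n);
    fc->len += n;
    if (fc->len == 0)
        return 0;

    src = fc->buf;
    r = mbsnrtowcs(dst, &src, fc->len, dstlen, &fc->st);
    if (r == (size_t)-1)
        return r;
    if (src == NULL) {              /* terminating null reached; nothing pending */
        fc->len = 0;
        return r;
    }
    /* Keep whatever was left unprocessed; this may be zero bytes if the
     * implementation stored the partial character in fc->st instead. */
    left = fc->len - (size_t)(src - fc->buf);
    memmove(fc->buf, src, left);
    fc->len = left;
    return r;
}

/* Converts 'a', U+754C, 'b' with the three-byte character split across two
 * fragments.  Returns 3 on success, -1 without a UTF-8 locale, -2 otherwise. */
static long demo_split(void)
{
    struct frag_conv fc;
    wchar_t out[8];
    size_t r1, r2;

    if (!use_utf8_locale())
        return -1;
    memset(&fc, 0, sizeof fc);      /* empty buffer, initial conversion state */
    r1 = frag_feed(&fc, "a\xe7\x95", 3, out, 8);
    if (r1 == (size_t)-1)
        return -2;
    r2 = frag_feed(&fc, "\x8c" "b", 2, out + r1, 8 - r1);
    if (r2 == (size_t)-1 || r1 + r2 != 3)
        return -2;
    if (out[0] != L'a' || out[1] != 0x754C || out[2] != L'b')
        return -2;
    return 3;
}
```

Because the leftover length is computed from where src ends up, the same logic should work whether the implementation stops before the partial character or consumes it into the mbstate_t object.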
|
"This has no negative impact on application portability, and in fact applications were already doing this anyway, since (as I understand it) most implementations do not process a final partial character." My recollection is the exact opposite actually: that all existing mbsnrtowcs() implementations *do* process the final partial character, which was contrary to the current standard's wording, hence this defect report. |
|
Apologies, there was something I overlooked in the wording on mbrtowc( ), so I withdraw objection about language of (size_t)-2 return. It can happen properly in a robust implementation as stands. The standard does say: If s is not a null pointer, the mbrtowc( ) function shall inspect at most n bytes beginning at the byte pointed to by s to determine the number of bytes needed to complete the next character... This 'at most' implies it can also do its own strlen( ) of s internally and use the lesser of this len or n as the effective count to try and process. This wouldn't stop a check for '\0' if after this effective count, less than n, was processed and ps in initial state so 0 can be returned also. When count = n and ps is holding a shift back to initial state, i.e. the '\0' is at position n+1, I think the intent is (size_t)-2 should still be returned. I believe this describes how (size_t)-1 can be avoided on ending fragments given the current language. When '\0' is the initial char pointed at by s, and ps is neither in initial state or a pending return to initial state, then would be appropriate to return (size_t)-1. So it can be done, I think, but the language doesn't specify this strlen( ) should be done also. That it can be done does give more point to the intent is that mbsn( ) and mbsnr( ) should be stopping at the beginning of a trailing fragment, though, as this is what best matches the semantics of mbrtowc( ), and for mbsnr, an initial fragment when n<MB_CUR_MAX should be returning 0. It also allows, as an extension, and possibly a normative option, CX or XSI, a singleton '\0' to be examined at the start of a passed in buffer if ps isn't in initial state and a code set is specified as having a nul as a possible intermediate or trailing character. 
For something like this the description could be reworded in the vendor's documentation of extensions to require returning with (size_t)-2 after the one '\0' byte, rather than (size_t)-1, to prevent buffer overruns from corrupted data. It wouldn't prevent GIGO, but it would give the application a chance to use a different routine to validate the '\0' was legitimate as a non-buffer delimiter. An extension might define an ismbslastnul(mbstate_t ps) interface to facilitate checking for this and keep mbstate_t opaque. This would be, I think, consistent with how it's supposed to check for nul anyways to complete a pending 'return to initial state' shift sequence and return 0 as 'end of mbs' as a special case. I think some additional language would be needed to support it as a normative option, though, and this extra interface makes it unsuitable for TC scope. Unless is___( ) type interfaces are permitted. I forget offhand. |
|
Note 1708 presumes something false. The null byte can never be a possible intermediate or trailing character. This is specified in ISO C, to which POSIX defers. See C99 5.2.1.2 Multibyte characters: "A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character." Note, however, that such a byte can occur after bytes that represent a change in shift state. This would not be an encoding error, but would discard the shift state and result in the null wide character. One might ask the question "Why must UTF-8 lead bytes be partial characters and not shift states?" In fact, as far as I could tell, it would be possible to make an implementation in which there is a locale that behaves similar to a UTF-8 based locale, but for which lead bytes are treated as shift states rather than partial characters. There would be observable differences in the behavior of mbrtowc. The important thing to realize is that this would not be UTF-8. While such an implementation would be implementing _a_ multibyte encoding, the encoding it would implement does not conform to the requirements of ISO-10646 or Unicode regarding the interpretation of UTF-8, and thus it would not be correct to call it UTF-8. In particular, "abc\xc3" and "abc" would both convert to the string L"abc". |
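The distinction between "abc\xc3" and "abc" drawn in this note can be observed with mbsrtowcs(). A minimal sketch follows, with hypothetical helper names `use_utf8_locale()` and `converts_cleanly()`; the locale names tried are assumptions about the host system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: select any UTF-8 locale that happens to be installed. */
static int use_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/*
 * Returns 1 if the whole null-terminated string converts, 0 if mbsrtowcs()
 * reports an encoding error by returning (size_t)-1.
 */
static int converts_cleanly(const char *s)
{
    mbstate_t st;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* initial conversion state */
    /* With a null dst, mbsrtowcs() only counts; the length argument is ignored. */
    return mbsrtowcs(NULL, &s, 0, &st) != (size_t)-1;
}
```

In a real UTF-8 locale, `converts_cleanly("abc")` should yield 1 and `converts_cleanly("abc\xc3")` should yield 0, because the lead byte 0xC3 followed by the terminating null is an encoding error rather than a discarded shift state.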
|
Interpretation Proposed 6 Sep 2013 |
|
Interpretation approved 14 October 2013 |
|
Per the mailing list, I still object; it's creating a new function, not clarifying the issue... The implementations are non-conforming, that's all! |
Date Modified | Username | Field | Change |
---|---|---|---|
2012-08-15 15:57 | nick | New Issue | |
2012-08-15 15:57 | nick | Status | New => Under Review |
2012-08-15 15:57 | nick | Assigned To | => ajosey |
2012-08-15 15:57 | nick | Name | => Nick Stoughton |
2012-08-15 15:57 | nick | Organization | => USENIX |
2012-08-15 15:57 | nick | User Reference | => nms-mbsnrtowcs-001 |
2012-08-15 15:57 | nick | Section | => mbsnrtowcs |
2012-08-15 15:57 | nick | Page Number | => 1277 |
2012-08-15 15:57 | nick | Line Number | => 41975 |
2012-08-15 15:57 | nick | Interp Status | => --- |
2012-08-15 17:01 | mdempsky | Note Added: 0001327 | |
2012-09-26 15:10 | nick | Desired Action Updated | |
2012-09-26 15:38 | geoffclare | Note Added: 0001381 | |
2012-09-26 15:39 | geoffclare | Interp Status | --- => Pending |
2012-09-26 15:39 | geoffclare | Final Accepted Text | => 0000601:0001381 |
2012-09-26 15:39 | geoffclare | Status | Under Review => Interpretation Required |
2012-09-26 15:39 | geoffclare | Resolution | Open => Accepted As Marked |
2012-09-26 15:39 | geoffclare | Tag Attached: tc2-2008 | |
2012-09-26 15:41 | geoffclare | Note Edited: 0001381 | |
2012-09-26 15:47 | nick | Issue cloned: 0000616 | |
2012-09-26 15:47 | nick | Relationship added | parent of 0000616 |
2012-09-26 16:50 | mdempsky | Note Added: 0001383 | |
2012-09-27 07:25 | geoffclare | Note Edited: 0001381 | |
2012-09-27 07:27 | geoffclare | Note Added: 0001384 | |
2013-03-29 08:06 | ajosey | Interp Status | Pending => Proposed |
2013-03-29 08:06 | ajosey | Note Added: 0001526 | |
2013-04-26 14:03 | geoffclare | Note Added: 0001552 | |
2013-04-26 18:53 | mdempsky | Note Added: 0001553 | |
2013-05-02 15:30 | nick | Tag Attached: c99 | |
2013-05-03 09:42 | geoffclare | Note Added: 0001568 | |
2013-05-09 15:17 | geoffclare | Note Edited: 0001568 | |
2013-05-09 15:19 | geoffclare | Note Edited: 0001381 | |
2013-05-09 15:19 | geoffclare | Interp Status | Proposed => Pending |
2013-05-09 15:19 | geoffclare | Final Accepted Text | 0000601:0001381 => 0000601:0001568 |
2013-08-06 22:05 | shware_systems | Note Added: 0001703 | |
2013-08-07 00:37 | dalias | Note Added: 0001704 | |
2013-08-07 03:32 | shware_systems | Note Added: 0001705 | |
2013-08-07 04:24 | dalias | Note Added: 0001706 | |
2013-08-07 05:23 | mdempsky | Note Added: 0001707 | |
2013-08-08 08:35 | shware_systems | Note Added: 0001708 | |
2013-08-08 08:52 | shware_systems | Note Edited: 0001708 | |
2013-08-08 15:38 | dalias | Note Added: 0001709 | |
2013-09-06 08:31 | ajosey | Interp Status | Pending => Proposed |
2013-09-06 08:31 | ajosey | Note Added: 0001821 | |
2013-10-14 13:03 | ajosey | Interp Status | Proposed => Approved |
2013-10-14 13:03 | ajosey | Note Added: 0001883 | |
2013-10-14 14:59 | shware_systems | Note Added: 0001902 | |
2019-06-10 08:55 | agadmin | Status | Interpretation Required => Closed |