ID | Category | Severity | Type | Date Submitted | Last Update | ||
0000948 | [1003.1(2013)/Issue7+TC1] Base Definitions and Headers | Objection | Error | 2015-05-11 15:38 | 2020-04-08 15:31 | ||
Reporter | geoffclare | View Status | public | ||||
Assigned To | |||||||
Priority | normal | Resolution | Accepted | ||||
Status | Applied | ||||||
Name | Geoff Clare | ||||||
Organization | The Open Group | ||||||
User Reference | |||||||
Section | 7.3.2, 9.3.5 | ||||||
Page Number | 147, 150, 184 | ||||||
Line Number | 4393, 4503, 5963, and more | ||||||
Interp Status | --- | ||||||
Final Accepted Text | |||||||
Summary | 0000948: Collation issues in XBD (changes for Issue 8) | ||||||
Description |
A discussion on the mailing list identified some issues related to collation for locales that do not define a collation sequence with a total ordering of all characters. It is proposed that these issues are addressed in Issue 8 by requiring implementation-provided locales that do not have an '@' modifier in their name to define a collation sequence that has a total ordering of all characters (thus reducing the problem to "special" locales and user-defined locales), and by modifying the requirements for regular expressions and affected utilities so that they cope better with such locales. As an intermediate step, it is proposed that the new requirements slated for Issue 8 are recommended (or at least allowed) in TC2. The necessary changes will be split across four Mantis bugs, targeting XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the changes proposed for XBD in Issue 8. |
||||||
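As a rough illustration of what "a total ordering of all characters" means in practice, the sketch below checks a sample of strings for distinct pairs that strcoll() reports as equal. The helper name collation_is_total is hypothetical, and a sampled check like this can only find counterexamples; it cannot prove that a locale defines a total ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every pair of byte-wise distinct strings
 * in items[] collates unequally under the current LC_COLLATE setting, and 0
 * if some distinct pair collates as equal (evidence of a non-total ordering). */
static int collation_is_total(const char *const items[], size_t n)
{
    assert(items != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (strcmp(items[i], items[j]) != 0 &&
                strcoll(items[i], items[j]) == 0) {
                return 0; /* distinct strings that collate equally */
            }
        }
    }
    return 1;
}
```

To try it against a particular locale, call setlocale(LC_COLLATE, ...) first and pass strings that the locale might treat as ties.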
Desired Action |
After applying the bug 0000938 changes at each of the following locations, make further changes to the new text as noted below.

On Page: 147 Line: 4393 Section: 7.3.2 LC_COLLATE

In the new paragraph after the numbered list, change from:

    All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) should define ...

to:

    All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) shall define ...

and delete the first of the new small-font notes:

    <small>Note: a future version of this standard may require these locales to define a collation sequence that has a total ordering of all characters (by changing "should" to "shall").</small>

On Page: 150 Line: 4503 Section: 7.3.2.4 Collation Order

In the new paragraph, change from:

    Weights should be assigned such that the collation sequence ...

to:

    Weights shall be assigned such that the collation sequence ...

and delete the small-font note:

    <small>Note: a future version of this standard may require a total ordering of all characters for implementation-provided locales that do not have an '@' modifier in the locale name. See [xref to 7.3.2].</small>

On Page: 150 Line: 4517 Section: 7.3.2.4 Collation Order

In the updated text, change from:

    If the collation order has only one weight level, these characters should be assigned unique primary weights, equal to the relative order of their character in the character collation sequence, but may be assigned the same primary weight.

to:

    If the collation order has only one weight level, these characters shall be assigned unique primary weights, equal to the relative order of their character in the character collation sequence.

and delete the small-font note:

    <small>Note: a future version of this standard may require these characters to be assigned unique primary weights if the collation order has only one weight level.</small>

On Page: 184 Line: 5963 Section: 9.3.5 RE Bracket Expression

In the updated list item 2, change from:

    An ordinary character in the list should only match that character, but may match any single character that collates equally with that character; for example, "[abc]" is an RE that should only match one of the characters 'a', 'b', or 'c'.

to:

    An ordinary character in the list shall only match that character; for example, "[abc]" is an RE that only matches one of the characters 'a', 'b', or 'c'.

and delete the small-font note:

    <small>Note: a future version of this standard may require that an ordinary character in the list only matches that character.</small>

On Page: 184 Line: 5970 Section: 9.3.5 RE Bracket Expression

In the updated list item 3, change from:

    For example, if the RE "[abc]" only matches 'a', 'b', or 'c', then "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'.

to:

    For example, since the RE "[abc]" only matches 'a', 'b', or 'c', it follows that "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'.

Cross-volume changes to XRAT ...

On Page: 3490 Line: 117820 Section: A.7.3.2 LC_COLLATE

In the new paragraph, change from:

    This standard recommends (by the use of "should" in the normative text) that ...

to:

    This standard requires that ...
|
||||||
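The proposed bracket-expression wording can be exercised with regcomp() and regexec(). The sketch below is illustrative only: the helper name re_matches is hypothetical, and the patterns are anchored with ^ and $ so that a single character is tested against the whole bracket expression.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if regexec() finds a match for the basic RE
 * `pattern` in `text`, else 0. Matching is case-sensitive (no REG_ICASE). */
static int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_NOSUB);
    assert(rc == 0); /* the demonstration patterns are expected to compile */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In the POSIX locale, re_matches("^[abc]$", "b") returns 1 and re_matches("^[^abc]$", "b") returns 0. The open question in this bug is how the same calls behave in a locale where another character collates equally with 'b'.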
Tags | issue8, UTF-8_Locale | ||||||
Attached Files | |||||||
(0002697) eblake (manager) 2015-06-04 18:02 |
Is this proposed wording still accurate, in light of 0000872 documenting how REG_ICASE affects range expressions?

    An ordinary character in the list shall only match that character; for example, "[abc]" is an RE that only matches one of the characters 'a', 'b', or 'c'.
|
(0002698) shware_systems (reporter) 2015-06-04 18:38 |
It nominally is, as 'upper' and 'lower' determination is independent of collation order and bug 872 relates to equality testing only, not ordering, but a clarification 'when REG_ICASE has not been specified' inserted somewhere in there wouldn't hurt either, imo. |
(0002699) shware_systems (reporter) 2015-06-04 20:58 edited on: 2015-06-04 21:00 |
Where the wording changes introduce a possible ambiguity is with the strxfrm() interface. The C standard just states that the interface shall refer to the LC_COLLATE category, but is not explicit about when items are copied from a source string verbatim and when a transformed substitute must be stored, so the change from should to shall may break some existing implementations. If the collation weightings are set up so that after COLL_WEIGHTS_MAX weights have been examined two elements can still compare as equal, though their binary values differ, it is not specified which element has primacy for storage so that strcmp() is deterministic. Using case-insensitive collation of letters as an example: does the lower case or upper case version of a character get copied or always stored, or is it the first member of the given weight class pulled from the LC_COLLATE category that gets stored, which may be upper case where the input was lower case? I can see the latter being the intent, but some implementations may prefer a particular case as a last determining factor, to match the expectations of various standards the transformed strings are routinely used with. Maybe I'm off, but I don't see the language precluding such a preference being implemented. |
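For readers less familiar with the interface, the relationship under discussion is that strcmp() on two strxfrm() outputs should have the same sign as strcoll() on the original strings. The sketch below checks that for one pair; xfrm_dup, sign, and xfrm_agrees_with_strcoll are hypothetical helper names, and nothing here resolves which representative an implementation stores when two distinct elements tie.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc()ed strxfrm() transformation of s, or NULL if allocation
 * fails. The caller frees the result. */
static char *xfrm_dup(const char *s)
{
    size_t need = strxfrm(NULL, s, 0) + 1; /* length query plus terminator */
    char *out = malloc(need);
    if (out != NULL)
        strxfrm(out, s, need);
    return out;
}

static int sign(int v) { return (v > 0) - (v < 0); }

/* Returns 1 if strcmp() on the transformed strings agrees in sign with
 * strcoll() on the originals, 0 if it does not, -1 on allocation failure. */
static int xfrm_agrees_with_strcoll(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    char *ta = xfrm_dup(a);
    char *tb = xfrm_dup(b);
    int ok = -1;
    if (ta != NULL && tb != NULL)
        ok = (sign(strcmp(ta, tb)) == sign(strcoll(a, b)));
    free(ta);
    free(tb);
    return ok;
}
```

If strcoll() returns 0 for two strings whose bytes differ, agreement requires their transformed forms to be byte-for-byte identical, which is where the choice of stored representative raised above comes in.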
(0002700) geoffclare (manager) 2015-06-05 09:38 |
(Response to Note: 0002697) I don't see any problem with the new wording as regards REG_ICASE. The way XBD chapter 9 is structured is that the details in 9.3 and 9.4 describe the normal case-sensitive matching process and the variation needed for case-insensitive matching is covered by this statement in 9.2: "When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without regard to the case (uppercase or lowercase) of either data or patterns, then when each character in the string is matched against the pattern, not only the character, but also its case counterpart (if any), shall be matched." |
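A small sketch of the 9.2 rule as seen through regcomp(): the same bracket expression is compiled with and without REG_ICASE and applied to an uppercase character. The helper name re_matches_flags is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if regexec() finds a match for the basic RE
 * `pattern` in `text` when compiled with `cflags` (0 or REG_ICASE), else 0. */
static int re_matches_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags | REG_NOSUB);
    assert(rc == 0); /* the demonstration patterns are expected to compile */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In the POSIX locale, re_matches_flags("^[abc]$", "B", 0) returns 0, while the same call with REG_ICASE returns 1 because the case counterpart of 'B' is matched as well.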