0000948: Collation issues in XBD (changes for Issue 8)

ID	Project	Category	View Status	Date Submitted	Last Update

0000948	1003.1(2013)/Issue7+TC1	Base Definitions and Headers	public	2015-05-11 15:38	2024-06-11 09:02

Reporter	geoffclare	Assigned To
Priority	normal	Severity	Objection	Type	Error
Status	Closed	Resolution	Accepted

Name	Geoff Clare
Organization	The Open Group
User Reference
Section	7.3.2, 9.3.5
Page Number	147, 150, 184
Line Number	4393, 4503, 5963, and more
Interp Status	---
Final Accepted Text


Summary	0000948: Collation issues in XBD (changes for Issue 8)
Description	A discussion on the mailing list identified some issues related to collation for locales that do not define a collation sequence with a total ordering of all characters. It is proposed that these issues are addressed in Issue 8 by requiring implementation-provided locales that do not have an '@' modifier in their name to define a collation sequence that has a total ordering of all characters (thus reducing the problem to "special" locales and user-defined locales), and by modifying the requirements for regular expressions and affected utilities so that they cope better with such locales. As an intermediate step, it is proposed that the new requirements slated for Issue 8 are recommended (or at least allowed) in TC2. The necessary changes will be split across four Mantis bugs, targeting XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the changes proposed for XBD in Issue 8.
Desired Action	After applying the bug 0000938 changes at each of the following locations, make further changes to the new text as noted below.

On Page: 147 Line: 4393 Section: 7.3.2 LC_COLLATE
In the new paragraph after the numbered list, change from:
    All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) should define ...
to:
    All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) shall define ...
and delete the first of the new small-font notes:
    <small>Note: a future version of this standard may require these locales to define a collation sequence that has a total ordering of all characters (by changing "should" to "shall").</small>

On Page: 150 Line: 4503 Section: 7.3.2.4 Collation Order
In the new paragraph, change from:
    Weights should be assigned such that the collation sequence ...
to:
    Weights shall be assigned such that the collation sequence ...
and delete the small-font note:
    <small>Note: a future version of this standard may require a total ordering of all characters for implementation-provided locales that do not have an '@' modifier in the locale name. See [xref to 7.3.2].</small>

On Page: 150 Line: 4517 Section: 7.3.2.4 Collation Order
In the updated text, change from:
    If the collation order has only one weight level, these characters should be assigned unique primary weights, equal to the relative order of their character in the character collation sequence, but may be assigned the same primary weight.
to:
    If the collation order has only one weight level, these characters shall be assigned unique primary weights, equal to the relative order of their character in the character collation sequence.
and delete the small-font note:
    <small>Note: a future version of this standard may require these characters to be assigned unique primary weights if the collation order has only one weight level.</small>

On Page: 184 Line: 5963 Section: 9.3.5 RE Bracket Expression
In the updated list item 2, change from:
    An ordinary character in the list should only match that character, but may match any single character that collates equally with that character; for example, "[abc]" is an RE that should only match one of the characters 'a', 'b', or 'c'.
to:
    An ordinary character in the list shall only match that character; for example, "[abc]" is an RE that only matches one of the characters 'a', 'b', or 'c'.
and delete the small-font note:
    <small>Note: a future version of this standard may require that an ordinary character in the list only matches that character.</small>

On Page: 184 Line: 5970 Section: 9.3.5 RE Bracket Expression
In the updated list item 3, change from:
    For example, if the RE "[abc]" only matches 'a', 'b', or 'c', then "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'.
to:
    For example, since the RE "[abc]" only matches 'a', 'b', or 'c', it follows that "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'.

Cross-volume changes to XRAT ...

On Page: 3490 Line: 117820 Section: A.7.3.2 LC_COLLATE
In the new paragraph, change from:
    This standard recommends (by the use of "should" in the normative text) that ...
to:
    This standard requires that ...
Tags	issue8, UTF-8_Locale
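
For readers unfamiliar with what "a total ordering of all characters" means in practice, the following is a minimal sketch, not part of the proposed standard text. It assumes that a missing total ordering shows up as two byte-wise different strings for which strcoll() returns 0. The function name has_collation_tie is hypothetical and introduced only for illustration; note that it changes the process-wide LC_COLLATE setting as a side effect.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Selects locale_name for LC_COLLATE (a process-wide side effect) and
 * reports whether a and b differ as byte strings yet collate equally.
 * Returns 1 if such a tie is observed, 0 if not, and -1 if the locale
 * could not be selected. */
int has_collation_tie(const char *locale_name, const char *a, const char *b)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    return strcmp(a, b) != 0 && strcoll(a, b) == 0;
}
```

A result of 1 for any pair of strings would indicate that the selected locale does not define a total ordering. Under the proposed "shall" wording, implementation-provided locales without an '@' modifier in their name would not be expected to produce such a result.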

eblake 2015-06-04 18:02 manager bugnote:0002697	Is this proposed wording still accurate, in light of 0000872 documenting how REG_ICASE affects range expressions? An ordinary character in the list shall only match that character; for example, "[abc]" is an RE that only matches one of the characters 'a', 'b', or 'c'.

shware_systems 2015-06-04 18:38 reporter bugnote:0002698	It nominally is, as 'upper' and 'lower' determination is independent of collation order and bug 872 relates to equality testing only, not ordering, but a clarification 'when REG_ICASE has not been specified' inserted somewhere in there wouldn't hurt either, imo.

shware_systems 2015-06-04 20:58 reporter bugnote:0002699 Last edited: 2015-06-04 21:00	Where the wording changes introduce a possible ambiguity is with the strxfrm() interface. The C standard just states the interface shall refer to the LC_COLLATE category, but is not explicit about when items are copied from a source string verbatim and when a transformed substitute must be stored, so the change from should to shall may break some existing implementations. If the collation weightings are set up so that after COLL_WEIGHTS_MAX weights have been examined two elements can still compare as equal, though their binary value differs, it is not specified which element has primacy for storage so that strcmp() is deterministic. Using case-insensitive comparison of letters as an example, does the lowercase or uppercase version of a character get copied or always stored, or is it the first member of the given weight class pulled from the LC_COLLATE category that gets stored, which may be uppercase where the input was lowercase? I can see the latter being the intent, but some implementations may prefer a particular case as a last determining factor, to match the expectations of various standards the transformed strings are routinely used with. Maybe I'm off, but I don't see the language precluding such a preference being implemented.
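
The relationship this note relies on can be sketched as follows: comparing two strxfrm() results with strcmp() is expected to give the same sign as calling strcoll() on the original strings. This is a minimal illustration under the current LC_COLLATE setting, not a statement about how any implementation stores transformed elements. The names sign, xfrm_compare, and agrees_with_strcoll are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0, or 1. */
static int sign(int x)
{
    return (x > 0) - (x < 0);
}

/* Hypothetical helper: transform a and b with strxfrm() and compare the
 * results with strcmp(). Returns -1, 0, or 1, or -2 if memory could not
 * be allocated. */
int xfrm_compare(const char *a, const char *b)
{
    /* With n == 0 the destination may be a null pointer; the return
     * value is the length needed, excluding the terminating null. */
    size_t na = strxfrm(NULL, a, 0) + 1;
    size_t nb = strxfrm(NULL, b, 0) + 1;
    char *ta = malloc(na);
    char *tb = malloc(nb);
    int result = -2;

    if (ta != NULL && tb != NULL) {
        strxfrm(ta, a, na);
        strxfrm(tb, b, nb);
        result = sign(strcmp(ta, tb));
    }
    free(ta);
    free(tb);
    return result;
}

/* Returns 1 if the strxfrm()/strcmp() result has the same sign as
 * strcoll() for this pair, 0 otherwise (including allocation failure). */
int agrees_with_strcoll(const char *a, const char *b)
{
    int r = xfrm_compare(a, b);
    return r != -2 && r == sign(strcoll(a, b));
}
```

If two different strings collate equally, xfrm_compare() returning 0 means both transformed to the same byte sequence, which is the situation where the question of "which element gets stored" arises.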

geoffclare 2015-06-05 09:38 manager bugnote:0002700	(Response to 0000948:0002697) I don't see any problem with the new wording as regards REG_ICASE. The way XBD chapter 9 is structured is that the details in 9.3 and 9.4 describe the normal case-sensitive matching process and the variation needed for case-insensitive matching is covered by this statement in 9.2: "When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without regard to the case (uppercase or lowercase) of either data or patterns, then when each character in the string is matched against the pattern, not only the character, but also its case counterpart (if any), shall be matched."
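
The distinction discussed in this exchange can be tried out with regcomp() and regexec(). The sketch below assumes the default "C" locale and compiles the pattern as a basic regular expression unless other flags are passed. The function name re_matches is hypothetical, and the anchors in the usage patterns are added only so that a single-character input is tested as a whole.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern with the given regcomp() flags
 * and test it against text. Returns 1 on a match, 0 if regexec() does
 * not report a match, and -1 if the pattern fails to compile. */
int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only the match/no-match result is needed. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, re_matches("^[abc]$", "A", 0) is expected to return 0, while re_matches("^[abc]$", "A", REG_ICASE) is expected to return 1, because the case counterpart is also matched as described in the statement from 9.2 quoted above.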

Date Modified	Username	Field	Change
2015-05-11 15:38	geoffclare	New Issue
2015-05-11 15:38	geoffclare	Name	=> Geoff Clare
2015-05-11 15:38	geoffclare	Organization	=> The Open Group
2015-05-11 15:38	geoffclare	Section	=> 7.3.2, 9.3.5
2015-05-11 15:38	geoffclare	Page Number	=> 147, 150, 184
2015-05-11 15:38	geoffclare	Line Number	=> 4393, 4503, 5963, and more
2015-05-11 15:38	geoffclare	Interp Status	=> ---
2015-05-11 15:39	geoffclare	Relationship added	related to 0000938
2015-06-04 18:02	eblake	Note Added: 0002697
2015-06-04 18:38	shware_systems	Note Added: 0002698
2015-06-04 20:58	shware_systems	Note Added: 0002699
2015-06-04 21:00	shware_systems	Note Edited: 0002699
2015-06-05 09:38	geoffclare	Note Added: 0002700
2015-07-30 17:08	rhansen	Tag Attached: issue8
2015-07-30 17:08	rhansen	Tag Attached: UTF-8_Locale
2016-02-04 16:18	nick	Relationship added	related to 0000963
2016-02-04 16:33	Don Cragun	Status	New => Resolved
2016-02-04 16:33	Don Cragun	Resolution	Open => Accepted
2016-02-04 16:33	Don Cragun	Desired Action Updated
2016-08-25 11:12	geoffclare	Relationship added	related to 0001070
2020-04-08 15:31	geoffclare	Status	Resolved => Applied
2024-06-11 09:02	agadmin	Status	Applied => Closed

Relationships

related to	0000938	Closed	Collation issues in XBD (changes for TC2)
related to	0000963	Closed	Collation issues in XCU (changes for TC2)
related to	0001070	Closed	Collation issues in XCU (changes for Issue 8)