0000938: Collation issues in XBD (changes for TC2) - Austin Group Defect Tracker

ID

0000938

Category

[1003.1(2013)/Issue7+TC1] Base Definitions and Headers

Severity

Objection

Type

Error

Date Submitted

2015-04-22 15:56

Last Update

2019-06-10 08:54

Reporter

geoffclare

View Status

public

Assigned To

Priority

normal

Resolution

Accepted

Status

Closed

Name

Geoff Clare

Organization

The Open Group

User Reference

Section

7.3.2, 9.3.5

Page Number

147, 150-151, 184

Line Number

4366, 4503, 5944, and more

Interp Status

---

Final Accepted Text

Summary

0000938: Collation issues in XBD (changes for TC2)

Description

A discussion on the mailing list identified some issues related to
collation for locales that do not define a collation sequence with
a total ordering of all characters. It is proposed that these issues
are addressed in Issue 8 by requiring implementation-provided locales
that do not have an '@' modifier in their name to define a collation
sequence that has a total ordering of all characters (thus reducing
the problem to "special" locales and user-defined locales), and by
modifying the requirements for regular expressions and affected
utilities so that they cope better with such locales. As an
intermediate step, it is proposed that the new requirements slated
for Issue 8 are recommended (or at least allowed) in TC2.

The necessary changes will be split across four Mantis bugs, targeting
XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the
changes proposed for XBD in TC2.

Desired Action

On Page: 147 Line: 4366 Section: 7.3.2 LC_COLLATE

Change from:

(sort, uniq, and so on)

to:

(ls, sort, and so on)

On Page: 147 Line: 4393 Section: 7.3.2 LC_COLLATE

Add a new paragraph and two small-font notes after the numbered list:

All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) should define a collation sequence that has a total ordering of all characters unless the locale name has an '@' modifier indicating that it has a special collation sequence (for example, <tt>@icase</tt> could indicate that each upper- and lower-case character pair collates equally).

<small>Note: a future version of this standard may require these locales to define a collation sequence that has a total ordering of all characters (by changing "should" to "shall").</small>

<small>Note: users installing their own locales should ensure that they define a collation sequence with a total ordering of all characters unless an '@' modifier in the locale name (such as @icase) indicates that it has a special collation sequence.</small>
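As an illustration of what "a total ordering of all characters" means here, the following minimal Python sketch checks whether any two distinct characters share the same collation weights. The weight functions `icase_weights` and `total_weights` and the helper `has_total_ordering` are hypothetical names invented for this sketch; they do not correspond to any real locale definition or API.

```python
from itertools import combinations


def has_total_ordering(chars, weight_of):
    """Return True if no two distinct characters have equal weight tuples."""
    return all(weight_of(a) != weight_of(b) for a, b in combinations(chars, 2))


def icase_weights(ch):
    # Hypothetical "@icase"-style weights: case is ignored at every level,
    # so 'a' and 'A' collate equally.
    return (ch.lower(),)


def total_weights(ch):
    # Hypothetical weights with a second level that breaks the case tie.
    return (ch.lower(), ch.isupper())


chars = "aAbB"
icase_total = has_total_ordering(chars, icase_weights)     # False
tiebreak_total = has_total_ordering(chars, total_weights)  # True
```

Under these assumptions, only the second weighting would satisfy the recommendation for a locale without an '@' modifier.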

On Page: 150 Line: 4503 Section: 7.3.2.4 Collation Order

Add a new paragraph and a small-font note:

Weights should be assigned such that the collation sequence has a total ordering of all characters unless an '@' modifier in the locale name indicates that it has a special collation sequence.

<small>Note: a future version of this standard may require a total ordering of all characters for implementation-provided locales that do not have an '@' modifier in the locale name. See [xref to 7.3.2].</small>

On Page: 150 Line: 4517 Section: 7.3.2.4 Collation Order

Change from:

Characters specified via an explicit or implicit UNDEFINED special symbol shall by default be assigned the same primary weight (that is, they belong to the same equivalence class).

to:

Characters specified via an explicit or implicit UNDEFINED special symbol shall by default be assigned the same primary weight (that is, they belong to the same equivalence class) if the collation order has more than one weight level. If the collation order has only one weight level, these characters should be assigned unique primary weights, equal to the relative order of their character in the character collation sequence, but may be assigned the same primary weight.

<small>Note: a future version of this standard may require these characters to be assigned unique primary weights if the collation order has only one weight level.</small>
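The distinction between the two cases can be sketched in Python as follows. This is only an illustration of the recommended behavior, not an implementation of localedef; the function name `assign_undefined_primary` and its parameters (`levels`, `next_weight`) are assumptions made for this sketch.

```python
def assign_undefined_primary(undefined_chars, levels, next_weight):
    """Map each UNDEFINED character to a primary weight.

    undefined_chars: characters in character collation sequence order.
    levels: number of weight levels in the collation order.
    next_weight: first weight value available for these characters
                 (a hypothetical parameter of this sketch).
    """
    if levels > 1:
        # Multi-level order: all UNDEFINED characters share one primary
        # weight, i.e. they belong to the same equivalence class.
        return {ch: next_weight for ch in undefined_chars}
    # Single-level order (recommended behavior): unique primary weights
    # that follow the relative order of the characters.
    return {ch: next_weight + i for i, ch in enumerate(undefined_chars)}


multi = assign_undefined_primary(["x", "y", "z"], levels=2, next_weight=100)
single = assign_undefined_primary(["x", "y", "z"], levels=1, next_weight=100)
```

With a single weight level, sharing one primary weight would leave nothing to distinguish the characters, which is why the unique assignment is the recommended one.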

On Page: 151 Line: 4539 Section: 7.3.2.4 Collation Order

Delete the first entry in the example collation order:

<tt>UNDEFINED IGNORE;IGNORE</tt>

On Page: 151 Line: 4552 Section: 7.3.2.4 Collation Order

Add the following as the last entry in the example collation order:

<tt>UNDEFINED IGNORE;...</tt>

On Page: 151 Line: 4555 Section: 7.3.2.4 Collation Order

Delete item 1 in the numbered list:

The UNDEFINED means that all characters not specified in this definition (explicitly or via the ellipsis) shall be ignored for collation purposes.

and renumber items 2-4 to be 1-3.

On Page: 151 Line: 4563 Section: 7.3.2.4 Collation Order

Add a new item 4 to the numbered list:

The UNDEFINED means that all characters not specified in this definition (explicitly or via the ellipsis) shall be ignored when comparing primary weights, and have individual secondary weights based on their ordinal encoded values.
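A rough Python sketch of the effect of <tt>UNDEFINED IGNORE;...</tt> on string comparison is given below. The `PRIMARY` table and the `sort_key` function are hypothetical, and giving every defined character a secondary weight of 0 is a simplification made for this sketch (the example collation order in the standard assigns real secondary weights to defined characters).

```python
# Hypothetical primary weights for the characters the locale defines.
PRIMARY = {"a": 1, "b": 2, "c": 3}


def sort_key(s):
    """Build a two-level comparison key for the string s."""
    # Level 1: undefined characters are IGNOREd and contribute nothing.
    primary = [PRIMARY[ch] for ch in s if ch in PRIMARY]
    # Level 2: undefined characters get individual weights based on their
    # ordinal encoded value; defined characters get 0 (simplification).
    secondary = [0 if ch in PRIMARY else ord(ch) for ch in s]
    return (primary, secondary)


ordered = sorted(["a_b", "ab", "a-b"], key=sort_key)
```

The three strings compare equal at the primary level, and the secondary level then separates them instead of leaving them as equal-collating strings.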

On Page: 184 Line: 5944 Section: 9.3.5 RE Bracket Expression

Change from:

A bracket expression (an expression enclosed in square brackets, "[ ]") is an RE that shall match a single collating element contained in the non-empty set of collating elements represented by the bracket expression.

to:

A bracket expression (an expression enclosed in square brackets, "[ ]") is an RE that shall match a specific set of single characters, and may match a specific set of multi-character collating elements, based on the non-empty set of list expressions contained in the bracket expression.

On Page: 184 Line: 5949 Section: 9.3.5 RE Bracket Expression

In list item 1, change from:

It consists of one or more expressions: collating elements, collating symbols, ...

to:

It consists of one or more expressions: ordinary characters, collating elements, collating symbols, ...

On Page: 184 Line: 5963 Section: 9.3.5 RE Bracket Expression

In list item 2, change from:

A matching list expression specifies a list that shall match any single-character collating element in any of the expressions represented in the list. The first character in the list shall not be the <circumflex>; for example, "[abc]" is an RE that matches any of the characters 'a', 'b', or 'c'.

to:

A matching list expression specifies a list that shall match any single character that is matched by one of the expressions represented in the list. The first character in the list can not be the <circumflex>. An ordinary character in the list should only match that character, but may match any single character that collates equally with that character; for example, "[abc]" is an RE that should only match one of the characters 'a', 'b', or 'c'.

<small>Note: a future version of this standard may require that an ordinary character in the list only matches that character.</small>
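The difference between the recommended ("should") and the permitted ("may") behavior for an ordinary character in a matching list can be sketched in Python as follows. The function `bracket_matches`, its `strict` flag, and the case-folding key are all hypothetical; this is not how any particular regex implementation is structured.

```python
def bracket_matches(ch, listed, collation_key=None, strict=True):
    """Decide whether ch is matched by a list of ordinary characters.

    strict=True models the recommended behavior: a listed character only
    matches itself. strict=False models the permitted behavior: a listed
    character may also match anything that collates equally with it.
    """
    if strict or collation_key is None:
        return ch in listed
    return any(collation_key(ch) == collation_key(m) for m in listed)


# Hypothetical locale in which upper- and lower-case pairs collate equally.
icase_key = str.lower

strict_result = bracket_matches("A", "abc", icase_key, strict=True)   # False
loose_result = bracket_matches("A", "abc", icase_key, strict=False)   # True
```

In a locale with a total ordering of all characters the two modes give the same answer, since no other character collates equally with a listed one.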

On Page: 184 Line: 5970 Section: 9.3.5 RE Bracket Expression

In list item 3, change from:

For example, "[^abc]" is an RE that matches any character except the characters 'a', 'b', or 'c'.

to:

For example, if the RE "[abc]" only matches 'a', 'b', or 'c', then "[^abc]" is an RE that matches any character except 'a', 'b', or 'c'.

(Note that no change is needed to the first sentence in list item 3 because it is already modified suitably by 0000872.)

On Page: 185 Line: 5995 Section: 9.3.5 RE Bracket Expression

In list item 6, change from:

The set of single-character collating elements whose characters belong to the character class

to:

The set of single characters that belong to the character class

Cross-volume changes to XRAT ...

On Page: 3490 Line: 117820 Section: A.7.3.2 LC_COLLATE

Add a new paragraph:

This standard recommends (by the use of "should" in the normative text) that all implementation-provided locales define a collation sequence that has a total ordering of all characters unless the locale name has an '@' modifier indicating that it has a special collation sequence. Defining locales in this way eliminates unexpected behavior when non-identical strings can collate equally (for example, <tt>sort -u</tt> and <tt>sort | uniq</tt> are not equivalent). The exception for locales with a suitable '@' modifier in the name allows implementations to supply locales which do not have a total ordering of all characters provided that they draw attention to it in the modifier name. For example, <tt>@icase</tt> could indicate that each upper- and lower-case character pair collates equally. Even with an '@' modifier, total ordering is preferred when possible; for example, characters that are "ignored" in dictionary order need not be completely ignored (by using IGNORE for all collation weights), but can instead be given a unique weight after one or more IGNORE weights.
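The <tt>sort -u</tt> versus <tt>sort | uniq</tt> discrepancy mentioned above can be modeled with a small Python sketch. It assumes that <tt>sort -u</tt> keeps one line per set of lines that collate equally, and that <tt>uniq</tt> only drops adjacent lines that are identical; the functions `sort_u` and `sort_then_uniq` and both key functions are hypothetical stand-ins, not the utilities themselves.

```python
def sort_u(lines, key):
    """Model of `sort -u`: keep one line from each equal-collating set."""
    out = []
    for line in sorted(lines, key=key):
        if not out or key(out[-1]) != key(line):
            out.append(line)
    return out


def sort_then_uniq(lines, key):
    """Model of `sort | uniq`: drop only adjacent identical lines."""
    out = []
    for line in sorted(lines, key=key):
        if not out or out[-1] != line:
            out.append(line)
    return out


lines = ["abc", "ABC", "abc"]

# Hypothetical collation where case pairs collate equally (no total ordering).
icase = str.lower
u_icase = sort_u(lines, icase)
uniq_icase = sort_then_uniq(lines, icase)


# Hypothetical collation with a tie-breaker, giving a total ordering.
def total(s):
    return (s.lower(), s)


u_total = sort_u(lines, total)
uniq_total = sort_then_uniq(lines, total)
```

In this model the two pipelines disagree under the case-insensitive key, because the equal-collating lines are not grouped by identity, and agree once the collation has a total ordering.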

Tags

tc2-2008, UTF-8_Locale

Relationships

related to	0000872	Closed		REG_ICASE regex matching and negated bracket expr
related to	0000948	Closed		Collation issues in XBD (changes for Issue 8)
related to	0000963	Closed		Collation issues in XCU (changes for TC2)

Notes
(0002664) geoffclare (manager) 2015-05-11 15:33 edited on: 2015-06-04 16:24	In the desired action the line number for the 9.3.5 list item 3 change is wrong; it should be 5970. Update: the line number has been corrected in the desired action.

Issue History
Date Modified	Username	Field	Change
2015-04-22 15:56	geoffclare	New Issue
2015-04-22 15:56	geoffclare	Name	=> Geoff Clare
2015-04-22 15:56	geoffclare	Organization	=> The Open Group
2015-04-22 15:56	geoffclare	Section	=> 7.3.2, 9.3.5
2015-04-22 15:56	geoffclare	Page Number	=> 147, 150-151, 184
2015-04-22 15:56	geoffclare	Line Number	=> 4366, 4503, 5944, and more
2015-04-22 15:56	geoffclare	Interp Status	=> ---
2015-05-11 15:33	geoffclare	Note Added: 0002664
2015-05-11 15:39	geoffclare	Relationship added	related to 0000948
2015-06-04 16:23	geoffclare	Desired Action Updated
2015-06-04 16:24	geoffclare	Note Edited: 0002664
2015-06-04 16:24	Don Cragun	Status	New => Resolved
2015-06-04 16:24	Don Cragun	Resolution	Open => Accepted
2015-06-04 16:24	Don Cragun	Tag Attached: tc2-2008
2015-06-04 16:25	eblake	Relationship added	related to 0000872
2015-06-24 15:24	geoffclare	Relationship added	related to 0000963
2015-07-30 17:06	rhansen	Tag Attached: UTF-8_Locale
2019-06-10 08:54	agadmin	Status	Resolved => Closed
