0001070: Collation issues in XCU (changes for Issue 8)

ID	Project	Category	View Status	Date Submitted	Last Update

0001070	1003.1(2013)/Issue7+TC1	Shell and Utilities	public	2016-08-25 11:11	2024-06-11 08:55

Reporter	geoffclare	Assigned To
Priority	normal	Severity	Objection	Type	Error
Status	Closed	Resolution	Accepted

Name	Geoff Clare
Organization	The Open Group
User Reference
Section	2.13.3, awk, comm, localedef, ls, sort, uniq
Page Number	2356, 2459, 2559, 2874, 2888, 3210, 3309, and more
Line Number	75082, 78745, 82755, 94650, 95164, 107544, 111067, and more
Interp Status	---
Final Accepted Text


Summary	0001070: Collation issues in XCU (changes for Issue 8)
Description	A discussion on the mailing list identified some issues related to collation for locales that do not define a collation sequence with a total ordering of all characters. It is proposed that these issues are addressed in Issue 8 by requiring implementation-provided locales that do not have an '@' modifier in their name to define a collation sequence that has a total ordering of all characters (thus reducing the problem to "special" locales and user-defined locales), and by modifying the requirements for regular expressions and affected utilities so that they cope better with such locales. As an intermediate step, it is proposed that the new requirements slated for Issue 8 are recommended (or at least allowed) in TC2. The necessary changes will be split across four Mantis bugs, targeting XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the changes proposed for XCU in Issue 8.
Desired Action	After applying the bug 0000963 changes at each of the following locations, make further changes to the new text as noted below. (There is also a change to localedef inserted among the changes derived from bug 963.)

On Page: 2356 Line: 75082 Section: 2.13.3 Patterns Used for Filename Expansion

In the updated list item 3, change from:

    any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

    any filenames or pathnames that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.

and delete the small-font note:

    <small>Note: a future version of this standard may require the byte-by-byte further comparison described above.</small>

On Page: 2459 Line: 78745 Section: awk

In the updated text, change from:

    For the "!=" and "==" operators, the strings should be compared to check if they are identical but may be compared using the locale-specific collation sequence to check if they collate equally.

to:

    For the "!=" and "==" operators, the strings shall be compared to check if they are identical (not to check if they collate equally).
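As an illustration of the awk change above, the following is a minimal sketch (not part of the proposed standard text) that reports both the identity check performed by "==" and a collation-based check built from "<=" and ">=". The function name compare_strings is hypothetical. LC_ALL=C is forced only to make the output deterministic; in the C locale the two checks always agree, so the difference would only show up in a locale whose collating sequence lacks a total ordering.

```shell
# Hypothetical helper: show how awk compares two strings.
# Note: awk -v interprets backslash escapes and treats numeric-looking
# values as numbers, so this sketch assumes plain non-numeric strings.
compare_strings() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        identical = (a == b)                 # identity check
        collate_equal = (a <= b && a >= b)   # collation-based check
        printf "identical=%d collate_equal=%d\n", identical, collate_equal
    }'
}

compare_strings abc abc
compare_strings abc abd
```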

On Page: 2478 Line: 79587 Section: awk

Change the two new APPLICATION USAGE paragraphs from:

    On implementations where the "==" operator checks if strings collate equally, applications needing to check whether strings are identical can use:

        length(a) == length(b) && index(a,b) == 1

    On implementations where the "==" operator checks if strings are identical, applications needing to check whether strings collate equally can use:

        a <= b && a >= b

to:

    Since the "==" operator checks whether strings are identical, not whether they collate equally, applications needing to check whether strings collate equally can use:

        a <= b && a >= b

On Page: 2486 Line: 79914 Section: awk

Change the updated FUTURE DIRECTIONS section from:

    A future version of this standard may require the "!=" and "==" operators to perform string comparisons by checking if the strings are identical (and not by checking if they collate equally).

to:

    None.

On Page: 2559 Line: 82755 Section: comm

Change the new DESCRIPTION paragraph from:

    If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, comm should treat them as different lines but may treat them as being the same. If it treats them as different, comm should expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale and if they are not ordered in this way, the output of comm can identify such lines as being both unique to file1 and unique to file2 instead of being in both files.

to:

    If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, comm shall treat them as different lines and shall expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale; if they are not ordered in this way, the output of comm can identify such lines as being both unique to file1 and unique to file2 instead of being in both files.

On Page: 2560 Line: 82810 Section: comm

In the updated text, change from:

    If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, and comm treated them as different lines, then lines written that collate equally but are not identical should be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

to:

    If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical shall be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

On Page: 2561 Line: 82825 Section: comm

Change the new APPLICATION USAGE paragraphs from:

    If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behaviour of comm in the following ways:

    * If comm treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to file1 and unique to file2.

    * If comm treats lines as being the same if they collate equally and a line from file1 collates equally with a line from file2 but is not identical to it, one of the lines is misleadingly identified as being in both files and the other is not written to the output at all.

    Such problems can be avoided by forcing the use of the POSIX locale, for example the following identifies lines in both file1 and file2:

        LC_ALL=POSIX sort file1 > file1.posix
        LC_ALL=POSIX sort file2 > file2.posix
        LC_ALL=POSIX comm -12 file1.posix file2.posix | sort

    The final sort re-sorts the output of comm according to the collating sequence of the original locale. Doing this might be difficult if more than one column is output and leading blanks cannot be ignored.

to:

    If the collating sequence of the current locale does not have a total ordering of all characters, since comm treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to file1 and unique to file2 if lines that collate equally but are not identical are not ordered in the way that comm expects. If the input does not come from utilities (such as ls and sort) which provide this ordering, the problem can be avoided by pre-sorting the input files using sort.

On Page: 2561 Line: 82842 Section: comm

Change the updated FUTURE DIRECTIONS section from:

    A future version of this standard may require that if any lines from the input files collate equally but are not identical, then comm treats them as different lines and expects them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

    A future version of this standard may require that if the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical are ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

to:

    None.

On Page: 2874 Line: 94650 Section: localedef

Add a new paragraph to the DESCRIPTION section:

    If the LC_COLLATE category defines a collation sequence that does not have a total ordering of all characters, localedef shall write a warning message to standard error and, if the exit status would otherwise have been zero, shall exit with status 1.

On Page: 2888 Line: 95164 Section: ls

In the new DESCRIPTION paragraph change from:

    any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

    any filenames or pathnames that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 2896 Line: 95520 Section: ls

In the FUTURE DIRECTIONS section, delete the new paragraph:

    A future version of this standard may require that if the collating sequence for the current locale does not have a total ordering of all characters, any filenames or pathnames that collate equally are further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3210 Line: 107544 Section: sort

In the updated text, change from:

    any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

    any lines of input that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.
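Because sort is required to apply the byte-by-byte tie-break, its output has the ordering that comm expects. The following is a minimal sketch of the pre-sorting approach mentioned in the comm APPLICATION USAGE text above; it is an illustration, not text from the standard. The function name common_lines is hypothetical, and the availability of mktemp is an assumption about the host system.

```shell
# Hypothetical helper: print the lines present in both input files,
# pre-sorting each file so that comm sees the ordering it expects.
# Assumes mktemp is available for creating temporary files.
common_lines() {
    sorted1=$(mktemp) || return 1
    sorted2=$(mktemp) || { rm -f "$sorted1"; return 1; }
    # sort and comm run in the same locale, so their orderings agree.
    sort "$1" > "$sorted1" &&
        sort "$2" > "$sorted2" &&
        comm -12 "$sorted1" "$sorted2"
    status=$?
    rm -f "$sorted1" "$sorted2"
    return "$status"
}

# Usage: common_lines file1 file2
```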

On Page: 3214 Line: 107719 Section: sort

In the updated APPLICATION USAGE text, change from:

    If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behavior of sort in the following ways:

    * As <tt>sort -u</tt> suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.

    * The output of sort (without -u) can contain identical lines that are not adjacent, if it does not implement the recommended further byte-by-byte comparison of lines that collate equally. This affects the use of sort with comm and uniq; see the APPLICATION USAGE for those utilities.

to:

    If the collating sequence of the current locale does not have a total ordering of all characters, since <tt>sort -u</tt> suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.

On Page: 3215 Line: 107783 Section: sort

In the new RATIONALE paragraph change from:

    Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see [xref to XBD 7.3.2]). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.

to:

    The required further byte-by-byte comparison of lines that collate equally may have an impact on efficiency, but this can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since implementation-supplied locales without an '@' modifier have a total ordering of all characters - see [xref to XBD 7.3.2] - and localedef users are warned to follow the same convention). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.

On Page: 3215 Line: 107785 Section: sort

Change the updated FUTURE DIRECTIONS section from:

    A future version of this standard may require that if the collating sequence of the current locale does not have a total ordering of all characters, any lines of input that collate equally when comparing them as whole lines are further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

    None.

On Page: 3310 Line: 111099 Section: uniq

In the updated APPLICATION USAGE section, change from:

    If the collating sequence of the current locale has a total ordering of all characters, the sort utility can be used to cause repeated lines to be adjacent in the input file. If the collating sequence does not have a total ordering of all characters, the sort utility should still do this but it might not.

    To ensure that all duplicate lines are eliminated, and have the output sorted according to the collating sequence of the current locale, applications should use:

        LC_ALL=C sort -u | sort

    instead of:

        sort | uniq

    To remove duplicate lines based on whether they collate equally instead of whether they are identical, applications should use:

        sort -u

    instead of:

        sort | uniq

to:

    The sort utility can be used to cause repeated lines to be adjacent in the input file. If the collating sequence of the current locale does not have a total ordering of all characters, the behavior of <tt>sort | uniq</tt> differs from <tt>sort -u</tt>, as uniq treats lines as duplicates only if they are identical, whereas <tt>sort -u</tt> treats lines as duplicates if they collate equally.
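The difference between the two pipelines can be sketched without a special locale. The example below is only an analogy and rests on an assumption: the -f option of sort (fold lowercase to uppercase for comparison) is used as a stand-in for a collating sequence without a total ordering, because it makes "a" and "A" compare equal as keys although the lines are not identical. The function names are hypothetical.

```shell
# Hypothetical helpers: count the lines that survive each pipeline.
# LC_ALL=C keeps the demo deterministic; -f stands in for a locale in
# which distinct lines can collate equally.
count_sort_uniq() {
    # uniq drops a line only if it is identical to the previous one.
    LC_ALL=C sort -f | uniq | awk 'END { print NR }'
}

count_sort_u() {
    # sort -u drops lines whose keys compare equal.
    LC_ALL=C sort -f -u | awk 'END { print NR }'
}

printf 'a\nA\na\n' | count_sort_uniq
printf 'a\nA\na\n' | count_sort_u
```

With a sort that applies the byte-by-byte tie-break to lines whose keys compare equal, the first pipeline keeps both "A" and "a", while the second keeps a single line.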
Tags	issue8

Date Modified	Username	Field	Change
2016-08-25 11:11	geoffclare	New Issue
2016-08-25 11:11	geoffclare	Name	=> Geoff Clare
2016-08-25 11:11	geoffclare	Organization	=> The Open Group
2016-08-25 11:11	geoffclare	Section	=> 2.13.3, awk, comm, localedef, ls, sort, uniq
2016-08-25 11:11	geoffclare	Page Number	=> 2356, 2459, 2559, 2874, 2888, 3210, 3309, and more
2016-08-25 11:11	geoffclare	Line Number	=> 75082, 78745, 82755, 94650, 95164, 107544, 111067, and more
2016-08-25 11:11	geoffclare	Interp Status	=> ---
2016-08-25 11:11	geoffclare	Relationship added	related to 0000963
2016-08-25 11:12	geoffclare	Relationship added	related to 0000948
2016-08-25 11:18	geoffclare	Desired Action Updated
2018-01-04 17:07	Don Cragun	Status	New => Resolved
2018-01-04 17:07	Don Cragun	Resolution	Open => Accepted
2018-01-04 17:07	Don Cragun	Tag Attached: issue8
2020-04-21 13:35	geoffclare	Status	Resolved => Applied
2024-06-11 08:55	agadmin	Status	Applied => Closed

Relationships

related to	0000963	Closed		Collation issues in XCU (changes for TC2)
related to	0000948	Closed		Collation issues in XBD (changes for Issue 8)