Austin Group Defect Tracker

Aardvark Mark IV


ID 0001477
Category [1003.1(2016/18)/Issue7+TC2] Base Definitions and Headers
Severity Objection
Type Error
Date Submitted 2021-05-20 11:16
Last Update 2021-12-13 15:17
Reporter geoffclare
View Status public
Assigned To
Priority normal
Resolution Accepted
Status Applied
Name Geoff Clare
Organization The Open Group
User Reference
Section 7.1, 8.2
Page Number 135, 176
Line Number 3967, 5768
Interp Status ---
Final Accepted Text
Summary 0001477: Consequences of specifying locale categories with different charsets
Description
XBD 7.1 says:
If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined.

XBD 8.2 says:
If these variables specify locale categories that are not based upon the same underlying codeset, the results are unspecified.

There are several problems with these statements:

1. They say different things. I suggest the statement in 8.2 should be replaced by a cross-reference to 7.1, since the latter is more precise.

2. The second sentence of the 7.1 statement contradicts requirements elsewhere, e.g. it says the behaviour is undefined in situations where individual function descriptions require the functions to return EILSEQ errors. It is also unclear what purpose the reference to "the codeset assumed when the locale was created" serves.

3. The statement in 7.1 uses "current locale" but should be inclusive of the *_l() functions.

4. Both statements are overly restrictive. If I set:

LANG=en_US.utf8
LC_TIME=POSIX

on a system where the codeset for the POSIX locale is ASCII and the codeset for en_US.utf8 is UTF-8, all of the characters used in the LC_TIME locale data exist, with the same encoding, in the codeset used for LC_CTYPE (via LANG), so there is no reason for the behaviour to be unspecified/undefined.
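
The claim in point 4 can be checked with a minimal sketch. It assumes that Python's "ascii" and "utf-8" codecs are acceptable stand-ins for the codesets of the POSIX and en_US.utf8 locales on such a system; the variable names are illustrative only and do not come from the standard.

```python
# Every ASCII code point (0..127) as a single byte.
ascii_bytes = [bytes([value]) for value in range(128)]

# Each byte must decode to the same character under both codesets
# for ASCII data to have "the same encoding" in UTF-8.
same_encoding = all(
    b.decode("ascii") == b.decode("utf-8") for b in ascii_bytes
)
print(same_encoding)  # True

# A sample LC_TIME string: the POSIX locale's default d_t_fmt value.
d_t_fmt = "%a %b %e %H:%M:%S %Y"
print(d_t_fmt.encode("ascii") == d_t_fmt.encode("utf-8"))  # True
```

Because every ASCII byte means the same character in UTF-8, LC_TIME data drawn only from ASCII cannot conflict with a UTF-8 LC_CTYPE.
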
Desired Action
On page 135 line 3967 section 7.1, change:
If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined.
to:
If incompatible character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Two locale categories have incompatible character sets if one of the categories is LC_CTYPE and the locale data associated with the other category includes at least one character that either is not in the character set used by LC_CTYPE or has a different encoding than the same character in the character set used by LC_CTYPE.

Likewise, unless specified otherwise, if different codesets are used by a particular category of the selected locale and by the data being processed by an interface whose behavior is dependent on that category of the selected locale, the results are undefined.
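
The proposed definition of incompatible character sets can be read as a per-character test. The following is a rough sketch of that reading, not an implementation of any POSIX interface: incompatible_with_ctype and encode_or_none are hypothetical helpers, Python codec names stand in for locale codesets, and the category's locale data is represented simply as a string of the characters it contains.

```python
def encode_or_none(ch, codeset):
    """Return the bytes for ch in codeset, or None if ch is not in it."""
    try:
        return ch.encode(codeset)
    except UnicodeEncodeError:
        return None


def incompatible_with_ctype(category_chars, category_codeset, ctype_codeset):
    """Hypothetical check: True if any character in the category's locale
    data is missing from the LC_CTYPE character set or is encoded
    differently there."""
    for ch in category_chars:
        in_ctype = encode_or_none(ch, ctype_codeset)
        in_category = encode_or_none(ch, category_codeset)
        if in_ctype is None or in_ctype != in_category:
            return True
    return False


# ASCII-only data under a UTF-8 LC_CTYPE: compatible.
print(incompatible_with_ctype("Sunday", "ascii", "utf-8"))  # False
# "é" is byte 0xE9 in ISO 8859-1 but two bytes in UTF-8: incompatible.
print(incompatible_with_ctype("février", "iso-8859-1", "utf-8"))  # True
# "é" does not exist in ASCII at all: incompatible.
print(incompatible_with_ctype("février", "utf-8", "ascii"))  # True
```

The test is deliberately asymmetric: only LC_CTYPE is used as the reference, matching the wording that one of the two categories must be LC_CTYPE.
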

On page 176 line 5768 section 8.2, change:
If these variables specify locale categories that are not based upon the same underlying codeset, the results are unspecified.
to:
See [xref to Section 7.1] for the consequences of setting these variables to locales with different character sets.

Move the following paragraph from page 3537 line 119926 section A.8.2
to after page 3525 line 119377 section A.7.1:
The locale settings of individual categories cannot be truly independent and still guarantee correct results. For example, when collating two strings, characters must first be extracted from each string (governed by LC_CTYPE) before being mapped to collating elements (governed by LC_COLLATE) for comparison. That is, if LC_CTYPE is causing parsing according to the rules of a large, multi-byte code set (potentially returning 20 000 or more distinct character codeset values), but LC_COLLATE is set to handle only an 8-bit codeset with 256 distinct characters, meaningful results are obviously impossible.
and add the following new paragraph after it:
Earlier versions of this standard stated that if different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. This was felt to be overly restrictive. For example, when setting:
LANG=en_US.utf8
LC_TIME=POSIX
on a system where the codeset for the POSIX locale is ASCII and the codeset for en_US.utf8 is UTF-8, all of the characters used in the LC_TIME locale data exist, with the same encoding, in the codeset used for LC_CTYPE (via LANG), so there is no reason for the behavior to be undefined in this case. This standard now has more precise requirements in this area.
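
The collation example in the rationale paragraph moved from A.8.2 can be illustrated with a small sketch. It assumes a made-up 8-bit collation table in which each of the 256 ISO 8859-1 characters has its code value as its weight. collation_weights and collation_keys are hypothetical names, and real LC_COLLATE data is considerably richer than this.

```python
# Hypothetical 8-bit collation table: one weight per ISO 8859-1 character.
collation_weights = {chr(code): code for code in range(256)}


def collation_keys(data, ctype_codeset):
    """Extract characters from data (the step governed by LC_CTYPE), then
    map each one to a collating weight (the step governed by LC_COLLATE).
    Characters the table does not cover map to None."""
    chars = data.decode(ctype_codeset)
    return [collation_weights.get(ch) for ch in chars]


# Characters inside the 8-bit table collate as expected.
print(collation_keys(b"abc", "utf-8"))  # [97, 98, 99]
# Multi-byte characters parsed as UTF-8 have no weight in the table.
print(collation_keys("日本".encode("utf-8"), "utf-8"))  # [None, None]
```

The second call shows the failure mode the rationale describes: character extraction succeeds, but the collation data has nothing to say about the characters it produced.
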

Tags tc3-2008

There are no notes attached to this issue.

- Issue History
Date Modified Username Field Change
2021-05-20 11:16 geoffclare New Issue
2021-05-20 11:16 geoffclare Name => Geoff Clare
2021-05-20 11:16 geoffclare Organization => The Open Group
2021-05-20 11:16 geoffclare Section => 7.1, 8.2
2021-05-20 11:16 geoffclare Page Number => 135, 176
2021-05-20 11:16 geoffclare Line Number => 3967, 5768
2021-05-20 11:16 geoffclare Interp Status => ---
2021-11-18 17:03 Don Cragun Tag Attached: tc3-2008
2021-11-18 17:07 Don Cragun Status New => Resolved
2021-11-18 17:07 Don Cragun Resolution Open => Accepted
2021-12-13 15:17 geoffclare Status Resolved => Applied

