0001477: Consequences of specifying locale categories with different charsets

ID	Project	Category	View Status	Date Submitted	Last Update

0001477	1003.1(2016/18)/Issue7+TC2	Base Definitions and Headers	public	2021-05-20 11:16	2024-06-11 09:07

Reporter	geoffclare	Assigned To
Priority	normal	Severity	Objection	Type	Error
Status	Closed	Resolution	Accepted

Name	Geoff Clare
Organization	The Open Group
User Reference
Section	7.1, 8.2
Page Number	135, 176
Line Number	3967, 5768
Interp Status	---
Final Accepted Text


Summary	0001477: Consequences of specifying locale categories with different charsets
Description	XBD 7.1 says:

    If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined.

XBD 8.2 says:

    If these variables specify locale categories that are not based upon the same underlying codeset, the results are unspecified.

There are several problems with these statements:

1. They say different things. I suggest the statement in 8.2 should be replaced by a cross-reference to 7.1, since the latter is more precise.

2. The second sentence of the 7.1 statement contradicts requirements elsewhere, e.g. it says the behaviour is undefined in situations where individual function descriptions require them to return EILSEQ errors. It's also not clear what the point of referring to "the codeset assumed when the locale was created" is.

3. The statement in 7.1 uses "current locale" but should be inclusive of the *_l() functions.

4. Both statements are overly restrictive. If I set:

    LANG=en_US.utf8
    LC_TIME=POSIX

on a system where the codeset for the POSIX locale is ASCII and the codeset for en_US.utf8 is UTF-8, all of the characters used in the LC_TIME locale data exist, with the same encoding, in the codeset used for LC_CTYPE (via LANG), so there is no reason for the behaviour to be unspecified/undefined.
Desired Action	On page 135 line 3967 section 7.1, change:

    If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined.

to:

    If incompatible character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Two locale categories have incompatible character sets if one of the categories is LC_CTYPE and the locale data associated with the other category includes at least one character that either is not in the character set used by LC_CTYPE or has a different encoding than the same character in the character set used by LC_CTYPE. Likewise, unless specified otherwise, if different codesets are used by a particular category of the selected locale and by the data being processed by an interface whose behavior is dependent on that category of the selected locale, the results are undefined.

On page 176 line 5768 section 8.2, change:

    If these variables specify locale categories that are not based upon the same underlying codeset, the results are unspecified.

to:

    See [xref to Section 7.1] for the consequences of setting these variables to locales with different character sets.

Move the following paragraph from page 3537 line 119926 section A.8.2 to after page 3525 line 119377 section A.7.1:

    The locale settings of individual categories cannot be truly independent and still guarantee correct results. For example, when collating two strings, characters must first be extracted from each string (governed by LC_CTYPE) before being mapped to collating elements (governed by LC_COLLATE) for comparison. That is, if LC_CTYPE is causing parsing according to the rules of a large, multi-byte code set (potentially returning 20 000 or more distinct character codeset values), but LC_COLLATE is set to handle only an 8-bit codeset with 256 distinct characters, meaningful results are obviously impossible.

and add the following new paragraph after it:

    Earlier versions of this standard stated that if different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. This was felt to be overly restrictive. For example, when setting:

        LANG=en_US.utf8
        LC_TIME=POSIX

    on a system where the codeset for the POSIX locale is ASCII and the codeset for en_US.utf8 is UTF-8, all of the characters used in the LC_TIME locale data exist, with the same encoding, in the codeset used for LC_CTYPE (via LANG), so there is no reason for the behavior to be undefined in this case. This standard now has more precise requirements in this area.
Tags	tc3-2008
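
Illustrative note (not part of the submitted issue): the proposed definition of "incompatible character sets" can be sketched as a small check. This is only a rough model under stated assumptions: Python codec names ("ascii", "utf-8", "latin-1") stand in for the codesets of real locales, the function name incompatible_characters and the sample strings are hypothetical, and the French month names are an invented example rather than data taken from any installed locale.

```python
def incompatible_characters(locale_chars, category_codec, ctype_codec):
    """Return the characters that would make a category incompatible with LC_CTYPE.

    locale_chars   -- characters appearing in the other category's locale data
                      (assumed to be representable in category_codec)
    category_codec -- codec standing in for that category's codeset
    ctype_codec    -- codec standing in for the codeset used by LC_CTYPE
    """
    offending = []
    for ch in sorted(set(locale_chars)):
        try:
            ctype_bytes = ch.encode(ctype_codec)
        except UnicodeEncodeError:
            # The character is not in the character set used by LC_CTYPE.
            offending.append(ch)
            continue
        if ch.encode(category_codec) != ctype_bytes:
            # Same character, but a different encoding than under LC_CTYPE.
            offending.append(ch)
    return offending


# Hypothetical stand-in for the LC_TIME data of the POSIX locale.
POSIX_LC_TIME_DATA = (
    "Sunday Monday Tuesday Wednesday Thursday Friday Saturday "
    "January February March April May June July "
    "August September October November December AM PM"
)

# LANG=en_US.utf8 with LC_TIME=POSIX: every character has the same
# encoding in ASCII and UTF-8, so nothing is reported.
mixed_ok = incompatible_characters(POSIX_LC_TIME_DATA, "ascii", "utf-8")

# Latin-1 month names under a UTF-8 LC_CTYPE: accented letters are
# encoded differently, so they are reported.
mixed_bad = incompatible_characters("février août décembre", "latin-1", "utf-8")
```

Under this model the LANG=en_US.utf8, LC_TIME=POSIX combination from the description reports no offending characters, while the Latin-1 example reports the accented letters, which is the distinction the new 7.1 wording is meant to draw.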

Date Modified	Username	Field	Change
2021-05-20 11:16	geoffclare	New Issue
2021-05-20 11:16	geoffclare	Name	=> Geoff Clare
2021-05-20 11:16	geoffclare	Organization	=> The Open Group
2021-05-20 11:16	geoffclare	Section	=> 7.1, 8.2
2021-05-20 11:16	geoffclare	Page Number	=> 135, 176
2021-05-20 11:16	geoffclare	Line Number	=> 3967, 5768
2021-05-20 11:16	geoffclare	Interp Status	=> ---
2021-11-18 17:03	Don Cragun	Tag Attached: tc3-2008
2021-11-18 17:07	Don Cragun	Status	New => Resolved
2021-11-18 17:07	Don Cragun	Resolution	Open => Accepted
2021-12-13 15:17	geoffclare	Status	Resolved => Applied
2024-06-11 09:07	agadmin	Status	Applied => Closed
