0001477: Consequences of specifying locale categories with different charsets

Austin Group Defect Tracker

ID: 0001477
Category: [1003.1(2016/18)/Issue7+TC2] Base Definitions and Headers
Severity: Objection
Type: Error
Date Submitted: 2021-05-20 11:16
Last Update: 2021-12-13 15:17

Reporter: geoffclare
View Status: public
Assigned To:
Priority: normal
Resolution: Accepted
Status: Applied

Name: Geoff Clare
Organization: The Open Group
User Reference:
Section: 7.1, 8.2
Page Number: 135, 176
Line Number: 3967, 5768
Interp Status: ---
Final Accepted Text:

Summary

0001477: Consequences of specifying locale categories with different charsets

Description

XBD 7.1 says:

If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined.

XBD 8.2 says:

If these variables specify locale categories that are not based upon the same underlying codeset, the results are unspecified.

There are several problems with these statements:

1. They say different things. I suggest the statement in 8.2 should be replaced by a cross-reference to 7.1, since the latter is more precise.

2. The second sentence of the 7.1 statement contradicts requirements elsewhere. For example, it says the behaviour is undefined in situations where individual function descriptions require the functions to return EILSEQ errors. It is also unclear what purpose the reference to "the codeset assumed when the locale was created" serves.

3. The statement in 7.1 refers only to the "current locale", but it should also cover the *_l() functions, which operate on a locale object passed as an argument.

4. Both statements are overly restrictive. If I set:

LANG=en_US.utf8
LC_TIME=POSIX

on a system where the codeset for the POSIX locale is ASCII and the codeset for en_US.utf8 is UTF-8, all of the characters used in the LC_TIME locale data exist, with the same encoding, in the codeset used for LC_CTYPE (via LANG), so there is no reason for the behaviour to be unspecified/undefined.
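
The claim in point 4 can be checked mechanically. The following is an illustrative sketch only and is not part of the defect report: it assumes the English day and month names that the POSIX locale defines for LC_TIME, and uses Python's `ascii` and `utf-8` codecs as stand-ins for the codesets of the POSIX and en_US.utf8 locales. The helper name `same_encoding` is hypothetical.

```python
# Day and month names from the POSIX locale's LC_TIME data (a subset of that data).
POSIX_LC_TIME = [
    "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def same_encoding(text, codec_a, codec_b):
    """Return True if text encodes to identical bytes under both codecs."""
    try:
        return text.encode(codec_a) == text.encode(codec_b)
    except UnicodeEncodeError:
        # The text contains a character that one of the codecs cannot represent.
        return False


# Every LC_TIME string has the same byte encoding in ASCII and in UTF-8.
all_same = all(same_encoding(s, "ascii", "utf-8") for s in POSIX_LC_TIME)
print(all_same)
```

Because every string encodes to the same bytes in both codesets, an LC_CTYPE based on UTF-8 interprets the POSIX LC_TIME data exactly as intended.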

Desired Action

On page 135 line 3967 section 7.1, change:

If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined.

to:

If incompatible character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Two locale categories have incompatible character sets if one of the categories is LC_CTYPE and the locale data associated with the other category includes at least one character that either is not in the character set used by LC_CTYPE or has a different encoding than the same character in the character set used by LC_CTYPE.

Likewise, unless specified otherwise, if different codesets are used by a particular category of the selected locale and by the data being processed by an interface whose behavior is dependent on that category of the selected locale, the results are undefined.
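
The definition of incompatible character sets in the proposed wording can be sketched as a small check. This is a hedged illustration, not text from the standard: the function name `incompatible_characters` is hypothetical, Python codec names stand in for the codesets of the two categories, and Unicode code points are assumed as the way to decide that two encoded characters are "the same character".

```python
def incompatible_characters(locale_data, category_codec, ctype_codec):
    """Return the characters in locale_data that make a category
    incompatible with LC_CTYPE: characters that are missing from the
    LC_CTYPE character set, or that are encoded differently there."""
    bad = []
    for ch in sorted(set(locale_data)):
        encoded = ch.encode(category_codec)
        try:
            if ch.encode(ctype_codec) != encoded:
                bad.append(ch)  # present in LC_CTYPE, but with a different encoding
        except UnicodeEncodeError:
            bad.append(ch)      # not in the LC_CTYPE character set at all
    return bad


# ASCII month abbreviation with a UTF-8 LC_CTYPE: compatible.
print(incompatible_characters("Mar", "ascii", "utf-8"))
# ISO 8859-1 month name with a UTF-8 LC_CTYPE: 'ä' is encoded differently.
print(incompatible_characters("März", "iso-8859-1", "utf-8"))
# UTF-8 month name with an ASCII LC_CTYPE: 'ä' is not in the character set.
print(incompatible_characters("März", "utf-8", "ascii"))
```

An empty result means the pair of categories falls outside the undefined case described in the first paragraph of the proposed text. A non-empty result lists the characters that trigger it.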

On page 176 line 5768 section 8.2, change:

If these variables specify locale categories that are not based upon the same underlying codeset, the results are unspecified.

to:

See [xref to Section 7.1] for the consequences of setting these variables to locales with different character sets.

Move the following paragraph from page 3537 line 119926 section A.8.2 to after page 3525 line 119377 section A.7.1:

The locale settings of individual categories cannot be truly independent and still guarantee correct results. For example, when collating two strings, characters must first be extracted from each string (governed by LC_CTYPE) before being mapped to collating elements (governed by LC_COLLATE) for comparison. That is, if LC_CTYPE is causing parsing according to the rules of a large, multi-byte code set (potentially returning 20 000 or more distinct character codeset values), but LC_COLLATE is set to handle only an 8-bit codeset with 256 distinct characters, meaningful results are obviously impossible.

and add the following new paragraph after it:

Earlier versions of this standard stated that if different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. This was felt to be overly restrictive. For example, when setting:
LANG=en_US.utf8
LC_TIME=POSIX
on a system where the codeset for the POSIX locale is ASCII and the codeset for en_US.utf8 is UTF-8, all of the characters used in the LC_TIME locale data exist, with the same encoding, in the codeset used for LC_CTYPE (via LANG), so there is no reason for the behavior to be undefined in this case. This standard now has more precise requirements in this area.
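
The collation example in the paragraph being moved to A.7.1 can be made concrete. The sketch below is an illustration under stated assumptions, not part of the proposed changes: Python's `utf-8` codec stands in for a multi-byte LC_CTYPE and `iso-8859-1` for an 8-bit codeset. It shows only the character extraction step that precedes the mapping to collating elements.

```python
# One character, 'é', stored as two bytes in UTF-8.
raw = "é".encode("utf-8")

# Character extraction as a UTF-8 LC_CTYPE would perform it.
as_utf8 = raw.decode("utf-8")

# The same bytes parsed under an 8-bit codeset: each byte is its own character.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), len(as_latin1))
print(as_utf8 == as_latin1)
```

The same byte string yields one character under one codeset and two unrelated characters under the other, so a collation table built for the 8-bit codeset would receive input it was never designed to order.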

Issue History
Date Modified	Username	Field	Change
2021-05-20 11:16	geoffclare	New Issue
2021-05-20 11:16	geoffclare	Name	=> Geoff Clare
2021-05-20 11:16	geoffclare	Organization	=> The Open Group
2021-05-20 11:16	geoffclare	Section	=> 7.1, 8.2
2021-05-20 11:16	geoffclare	Page Number	=> 135, 176
2021-05-20 11:16	geoffclare	Line Number	=> 3967, 5768
2021-05-20 11:16	geoffclare	Interp Status	=> ---
2021-11-18 17:03	Don Cragun	Tag Attached: tc3-2008
2021-11-18 17:07	Don Cragun	Status	New => Resolved
2021-11-18 17:07	Don Cragun	Resolution	Open => Accepted
2021-12-13 15:17	geoffclare	Status	Resolved => Applied
