(0003380)
Don Cragun (manager)
2016-09-16 18:35
|
The characters included in character classes digit and xdigit do not vary from one locale to another, but the codeset used does vary by locale. A locale based on an EBCDIC codeset; a locale based on an ASCII, an ISO 8859-*, or a UTF-8 codeset; a locale based on a UTF-16 codeset; and a locale based on a UTF-32 codeset all have different byte encodings for the characters in both of these character classes. Therefore, the locale is required by all of these functions to determine the byte encoding of the characters in the class. |
(0003381)
eblake (manager)
2016-09-16 18:50
|
I recall a past analysis concluding that it is not possible to have a system with simultaneous EBCDIC and ASCII-derived locales (you can have a system with support for both encodings only if there is some non-standard way to switch which mode you are in, but within a given mode, standard-conforming apps see either that all available locales are EBCDIC-based, or that all available locales are ASCII-based). In part, this is because the standard requires that all locales use the same single-byte representation for the various digits, since isdigit() really is locale-independent and returns non-zero for the same set of 10 bytes regardless of what else is going on in the locale.
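The claim that isdigit() accepts the same 10 bytes everywhere can be checked with a small sketch like the one below. The helper names count_digit_bytes() and digits_unchanged_in() are hypothetical, invented for this illustration; only isdigit() and setlocale() come from the standard. The sketch assumes nothing about which locales are installed, and it reports an unavailable locale separately rather than guessing.

```c
#include <ctype.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>

/* Count the unsigned char values for which isdigit() returns non-zero
   in the current locale. */
int count_digit_bytes(void)
{
    int count = 0;
    for (int c = 0; c <= UCHAR_MAX; c++) {
        if (isdigit(c))
            count++;
    }
    return count;
}

/* Hypothetical helper: switch to the named locale and report whether
   isdigit() accepts exactly the characters '0' through '9'.
   Returns 1 if so, 0 if not, -1 if the locale is not available.
   Note that this changes the process-wide locale as a side effect. */
int digits_unchanged_in(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    if (count_digit_bytes() != 10)
        return 0;
    for (int c = '0'; c <= '9'; c++) {
        if (!isdigit(c))
            return 0;
    }
    return 1;
}
```

Calling digits_unchanged_in("C") should return 1 on a conforming implementation. Calling it with the name of any other installed locale is one way to look for an implementation that adds characters to the digit class.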
The standard forbids a locale with UTF-16 or UTF-32 codesets (the all-zero NUL byte is required to represent the NUL character across ALL locales, and is not permitted to appear within a multi-byte character encoding). That does not mean that the standard forbids processing of UTF-16 or UTF-32 data (that would be a job for iconv) but merely that your locale is never encoded as UTF-16 or UTF-32.
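As a rough illustration of why the zero-byte rule excludes UTF-16, the sketch below encodes a code point from the Basic Multilingual Plane as two UTF-16LE bytes and checks whether either byte is zero. The function names utf16le_encode_bmp() and has_zero_byte() are made up for this example, and the encoder deliberately ignores surrogate pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one BMP code point (no surrogate handling) as two UTF-16LE bytes. */
void utf16le_encode_bmp(uint16_t cp, unsigned char out[2])
{
    out[0] = (unsigned char)(cp & 0xFF); /* low-order byte first */
    out[1] = (unsigned char)(cp >> 8);   /* high-order byte second */
}

/* Return 1 if any byte in buf[0..len-1] is zero, otherwise 0. */
int has_zero_byte(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0)
            return 1;
    }
    return 0;
}
```

For U+0041 (the letter A) the encoded bytes are 0x41 0x00, so a zero byte appears inside the encoding of a character other than NUL, which is exactly what the standard does not permit in a locale's codeset.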
Furthermore, while the standard does permit multibyte encodings where one byte of a multibyte character happens to have the same value as a single-byte letter character, it is fairly explicit that at least the portable filename character set (which includes all of the characters recognized by isxdigit()) must be single-byte characters.
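To make the byte-reuse case concrete, here is a minimal sketch that uses Shift_JIS purely as an assumed example; the note above does not name a specific encoding. In Shift_JIS the katakana character U+30A2 is encoded as the two bytes 0x83 0x41, and 0x41 on its own is the letter A. The helper names are hypothetical, and the lead-byte test covers only the commonly documented ranges 0x81 to 0x9F and 0xE0 to 0xFC.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumed Shift_JIS lead-byte ranges: 0x81..0x9F and 0xE0..0xFC. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Naive scan: apply isxdigit() to every byte, ignoring character boundaries. */
size_t count_xdigit_bytewise(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (isxdigit(s[i]))
            n++;
    }
    return n;
}

/* Boundary-aware scan: skip both bytes of a two-byte character and
   apply isxdigit() only to single-byte characters. */
size_t count_xdigit_sjis_aware(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (sjis_is_lead_byte(s[i]) && i + 1 < len) {
            i++; /* skip the trail byte as well */
            continue;
        }
        if (isxdigit(s[i]))
            n++;
    }
    return n;
}
```

For the three bytes 0x83 0x41 '7', the naive scan counts two hexadecimal digits while the boundary-aware scan counts one. The single-byte encoding of A is the same in both cases; only the decision about where characters begin and end depends on the encoding.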
So about the only reason the locale could even play a role is in determining whether you have a locale where some multibyte character encodings happen to reuse a byte that can also be one of the characters in question if it appears on its own; but not whether the characters in question can have a different single-byte encoding. |
(0003946)
geoffclare (manager)
2018-04-05 16:42
edited on: 2018-04-05 16:46
|
Interpretation response
------------------------
The standard states the requirements for isdigit() and isxdigit(), and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.
Rationale:
-------------
Allowing additional characters in the digit and xdigit classes is a conflict with the requirements of ISO C for isdigit() and isxdigit().
Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
(Page and line numbers are for the 2016 edition)
On page 139 line 4123 delete:
These only need to be specified if the character values (that is, encoding) differ from the implementation default values.
On page 140 line 4152, change:
In the POSIX locale, only:
to:
In all locales, only:
On page 141 line 4195, change:
In the POSIX locale, only:
to:
In all locales, only:
On page 141 line 4198, change:
In a locale definition file, only the characters defined for the class digit shall be specified, in contiguous ascending sequence by numerical value, followed by one or more sets of six characters representing the hexadecimal digits 10 to 15 inclusive, with each set in ascending order (for example, <A>, < B>, <C>, <D>, <E>, <F>, <a>, < b>, <c>, <d>, <e>, <f>). The digits <zero> to <nine>, the uppercase letters <A> to <F>, and the lowercase letters <a> to <f> of the portable character set are automatically included in this class.
to:
In a locale definition file, only the characters defined for the class digit shall be specified, in contiguous ascending sequence by numerical value, followed by two sets, in either order, of six characters representing the hexadecimal digits corresponding to the decimal numbers 10 to 15 inclusive, with each set in ascending order: <A>, < B>, <C>, <D>, <E>, <F> and <a>, < b>, <c>, <d>, <e>, <f>. The digits <zero> to <nine>, the uppercase letters <A> to <F>, and the lowercase letters <a> to <f> of the portable character set are automatically included in this class.
(Note that the space in < b> and < B> should be omitted; it is there to prevent Mantis from interpreting it as a bold specifier.)
|