Austin Group Defect Tracker

Aardvark Mark IV


Viewing Issue Simple Details
ID Category Severity Type Date Submitted Last Update
0001078 [1003.1(2013)/Issue7+TC1] System Interfaces Editorial Clarification Requested 2016-09-16 17:54 2024-06-11 08:54
Reporter Florian Weimer View Status public  
Assigned To
Priority normal Resolution Accepted As Marked  
Status Closed  
Name Florian Weimer
Organization Red Hat
User Reference
Section isdigit, isxdigit
Page Number unknown
Line Number unknown
Interp Status Approved
Final Accepted Text Note: 0003946
Summary 0001078: isdigit, isxdigit locale dependence
Description ISO C99 and C11 are very explicit about the set of decimal digits and hexadecimal digits (0123456789 and 0123456789ABCDEFabcdef). There is no expectation that the result of these functions is locale-dependent, unlike isalpha (for example). Any locale-dependence of the results would thus violate C99/C11 semantics.

isalnum is affected by this as well because it is specified as the union of isalpha and isdigit in ISO C. This is probably a defect in ISO C.
Desired Action “The isdigit function returns true if the argument refers to a character in the range from '0' to '9' (inclusive).”

“The isxdigit function returns true if the argument is a decimal digit according to the isdigit function, or if the argument refers to one of the characters 'A', 'B', 'C', 'D', 'E', 'F', 'a', 'b', 'c', 'd', 'e', 'f'.”

If so desired, any restrictions on the digit and xdigit character classes could be lifted, and applications could query them using the isdigit_l and isxdigit_l functions.
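The requested wording can be read as the following minimal sketch in C. The names is_c_digit and is_c_xdigit are hypothetical helpers for illustration only; they are not part of ISO C or POSIX, and this is not how any particular C library implements isdigit() or isxdigit().

```c
#include <assert.h>
#include <stdbool.h>

/* ISO C requires the decimal digits to have consecutive values in the
   execution character set, so this compile-time check should hold. */
static_assert('9' - '0' == 9, "decimal digits must be contiguous");

/* Hypothetical helper: true only for '0'..'9'. */
static bool is_c_digit(int c)
{
    return c >= '0' && c <= '9';
}

/* Hypothetical helper: true for a decimal digit or one of A-F / a-f.
   The letters are listed one by one because ISO C does not guarantee
   that they have consecutive values. */
static bool is_c_xdigit(int c)
{
    switch (c) {
    case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
    case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
        return true;
    default:
        return is_c_digit(c);
    }
}
```

Neither helper consults the current locale, which is the behaviour the report argues ISO C already requires.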
Tags tc3-2008
Attached Files

- Relationships

-  Notes
(0003380)
Don Cragun (manager)
2016-09-16 18:35

The characters included in character classes digit and xdigit do not vary from one locale to another, but the codeset used does vary by locale. A locale based on an EBCDIC codeset; a locale based on an ASCII, an ISO 8859-*, or a UTF-8 codeset; a locale based on a UTF-16 codeset; and a locale based on a UTF-32 codeset all have different byte encodings for the characters in both of these character classes. Therefore, the locale is required by all of these functions to determine the byte encoding of the characters in the class.
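As a rough illustration of the codeset point in this note, the sketch below classifies a raw byte as a decimal digit under two codeset families. The function names are hypothetical. The byte ranges are the usual ones: 0x30 to 0x39 for ASCII-derived codesets and 0xF0 to 0xF9 for common EBCDIC code pages.

```c
#include <stdbool.h>

/* Hypothetical helper: ASCII-derived codesets (ASCII, ISO 8859-*, UTF-8)
   encode '0'..'9' as the bytes 0x30..0x39. */
static bool ascii_byte_is_digit(unsigned char b)
{
    return b >= 0x30 && b <= 0x39;
}

/* Hypothetical helper: common EBCDIC code pages encode '0'..'9'
   as the bytes 0xF0..0xF9. */
static bool ebcdic_byte_is_digit(unsigned char b)
{
    return b >= 0xF0 && b <= 0xF9;
}
```

The set of characters is the same in both cases; only the byte values that represent them differ.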
(0003381)
eblake (manager)
2016-09-16 18:50

I recall a past analysis concluding that it is not possible to have a system with simultaneous EBCDIC and ASCII-derived locales (you can have a system with support for both encodings only if there is some non-standard way to switch which mode you are in, but within a given mode, standard-conforming apps see either that all available locales are EBCDIC-based, or that all available locales are ASCII-based). In part, this is because the standard requires that all locales use the same single-byte representation for the various digits: isdigit() really is locale-independent, and returns non-zero for the same set of 10 bytes regardless of what else is going on in the locale.

The standard forbids a locale with UTF-16 or UTF-32 codesets (the all-zero NUL byte is required to represent the NUL character across ALL locales, and is not permitted to appear within a multi-byte character encoding). That does not mean that the standard forbids processing of UTF-16 or UTF-32 data (that would be a job for iconv) but merely that your locale is never encoded as UTF-16 or UTF-32.
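The zero-byte restriction can be seen with a small sketch. The helper name and the sample arrays are hypothetical; the byte sequences are the little-endian encodings of the digit '7' (U+0037).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the encoded bytes contain an all-zero byte. */
static bool has_zero_byte(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}

/* The digit '7' (U+0037) in three encodings. */
static const unsigned char seven_utf8[]    = { 0x37 };
static const unsigned char seven_utf16le[] = { 0x37, 0x00 };
static const unsigned char seven_utf32le[] = { 0x37, 0x00, 0x00, 0x00 };
```

Only the UTF-8 form is free of zero bytes. The UTF-16 and UTF-32 forms embed a zero byte inside the encoding of an ordinary character, which is what rules them out as locale codesets.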

Furthermore, while the standard does permit multibyte encodings where one byte of a multi-byte character happens to also have the same value as a single-byte letter character, it is fairly explicit that at least the portable filename character set (which includes all of the characters tested by isxdigit()) must be single-byte characters.

So about the only reason the locale could even play a role is in determining whether you have a locale where some multibyte character encodings happen to reuse a byte that can also be one of the characters in question if it appears on its own; but not whether the characters in question can have a different single-byte encoding.
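The note does not name a specific encoding. As one assumed illustration, Shift_JIS encodes the katakana character U+30A2 as the two bytes 0x83 0x41, and the trail byte 0x41 is also the single-byte encoding of 'A'. The sketch below uses hypothetical helper names to compare a byte-by-byte count of hexadecimal digits with a count that skips two-byte characters.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed Shift_JIS lead-byte ranges for two-byte characters:
   0x81..0x9F and 0xE0..0xFC. */
static bool sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hex digit test on one byte, using ASCII byte values. */
static bool byte_is_xdigit(unsigned char b)
{
    return (b >= 0x30 && b <= 0x39) ||
           (b >= 0x41 && b <= 0x46) ||
           (b >= 0x61 && b <= 0x66);
}

/* Looks at every byte, so trail bytes can be miscounted as hex digits. */
static size_t count_xdigits_bytewise(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (byte_is_xdigit(s[i]))
            n++;
    return n;
}

/* Walks character by character and skips two-byte characters whole. */
static size_t count_xdigits_sjis(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (sjis_is_lead(s[i])) {
            i++; /* skip the trail byte as well */
            continue;
        }
        if (byte_is_xdigit(s[i]))
            n++;
    }
    return n;
}
```

For the three bytes 0x83 0x41 0x41 (the katakana character followed by 'A'), the byte-wise count is 2 and the character-aware count is 1. The single-byte encoding of 'A' itself is unchanged; only the interpretation of the trail byte depends on the encoding.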
(0003946)
geoffclare (manager)
2018-04-05 16:42
edited on: 2018-04-05 16:46

Interpretation response
------------------------
The standard states the requirements for isdigit() and isxdigit(), and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.

Rationale:
-------------
Allowing additional characters in the digit and xdigit classes is a conflict with the requirements of ISO C for isdigit() and isxdigit().

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------

(Page and line numbers are for the 2016 edition)
    
On page 139 line 4123 delete:
These only need to be specified if the character values (that is, encoding) differ from the implementation default values.

On page 140 line 4152, change:
In the POSIX locale, only:
to:
In all locales, only:

On page 141 line 4195, change:
In the POSIX locale, only:
to:
In all locales, only:

On page 141 line 4198, change:
In a locale definition file, only the characters defined for the class digit shall be specified, in contiguous ascending sequence by numerical value, followed by one or more sets of six characters representing the hexadecimal digits 10 to 15 inclusive, with each set in ascending order (for example, <A>, <B>, <C>, <D>, <E>, <F>, <a>, <b>, <c>, <d>, <e>, <f>). The digits <zero> to <nine>, the uppercase letters <A> to <F>, and the lowercase letters <a> to <f> of the portable character set are automatically included in this class.
to:
In a locale definition file, only the characters defined for the class digit shall be specified, in contiguous ascending sequence by numerical value, followed by two sets, in either order, of six characters representing the hexadecimal digits corresponding to the decimal numbers 10 to 15 inclusive, with each set in ascending order: <A>, <B>, <C>, <D>, <E>, <F> and <a>, <b>, <c>, <d>, <e>, <f>. The digits <zero> to <nine>, the uppercase letters <A> to <F>, and the lowercase letters <a> to <f> of the portable character set are automatically included in this class.

(0004141)
ajosey (manager)
2018-09-30 18:41

Interpretation Proposed: 30 September 2018
(0004163)
ajosey (manager)
2018-11-12 19:47

Interpretation approved: 12 November 2018

- Issue History
Date Modified Username Field Change
2016-09-16 17:54 Florian Weimer New Issue
2016-09-16 17:54 Florian Weimer Name => Florian Weimer
2016-09-16 17:54 Florian Weimer Organization => Red Hat
2016-09-16 17:54 Florian Weimer Section => isdigt, isxdigit
2016-09-16 17:54 Florian Weimer Page Number => unknown
2016-09-16 17:54 Florian Weimer Line Number => unknown
2016-09-16 18:35 Don Cragun Note Added: 0003380
2016-09-16 18:50 eblake Note Added: 0003381
2018-04-05 16:42 geoffclare Note Added: 0003946
2018-04-05 16:44 geoffclare Note Edited: 0003946
2018-04-05 16:44 geoffclare Note Edited: 0003946
2018-04-05 16:46 geoffclare Note Edited: 0003946
2018-04-05 16:48 geoffclare Interp Status => Pending
2018-04-05 16:48 geoffclare Final Accepted Text => Note: 0003946
2018-04-05 16:48 geoffclare Status New => Interpretation Required
2018-04-05 16:48 geoffclare Resolution Open => Accepted As Marked
2018-04-05 16:48 geoffclare Tag Attached: tc3-2008
2018-09-30 18:41 ajosey Interp Status Pending => Proposed
2018-09-30 18:41 ajosey Note Added: 0004141
2018-11-12 19:47 ajosey Interp Status Proposed => Approved
2018-11-12 19:47 ajosey Note Added: 0004163
2019-10-28 10:40 geoffclare Status Interpretation Required => Applied
2024-06-11 08:54 agadmin Status Applied => Closed

