ID | Category | Severity | Type | Date Submitted | Last Update
0000795 | [1003.1(2013)/Issue7+TC1] Base Definitions and Headers | Objection | Enhancement Request | 2013-11-15 22:46 | 2022-07-28 16:29
Reporter | steffen | View Status | public
Assigned To | ajosey
Priority | normal | Resolution | Rejected
Status | Closed
Name | steffen
Organization |
User Reference |
Section | XBD section 3 (Definitions) and section 7 (Locale)
Page Number | 93, 140-146, 167
Line Number | 2586, 4047, 4065, 4072, 4081, 4090, 4093, 4156, 4174, 4177, 4215, 4271, 4278, 4295-4297, 4330, 4332, 4360, 4362, 5302
Interp Status | ---
Final Accepted Text |
Summary | 0000795: Addition of a new «symbol» character class
Description |
With the possible support of the Universal Character Set via a POSIX.UTF-8 locale, thousands of characters need to be classified with the restricted set of POSIX character classes. This almost inevitably results in problematic classifications. Unicode introduces additional major categories, and further subdivides them, to get around these limitations. One major category that is entirely missing from POSIX is "Symbol", which this issue tries to introduce. Historically (in the portable character set) POSIX classifies symbols as punctuation characters, and current implementations of a UTF-8 locale extend this classification to the entire UCS range. Since portability concerns forbid changing all those implementations, the symbol character class adds support for a small further subdivision.
Desired Action |

- Vol 1: Base Definitions, Chapter 3, «Definitions».

On page 93, line 2586 add (insert before)

  3.374 Symbol
  One of the characters included in the symbol character classification of the LC_CTYPE category in the current locale.
  Note: The LC_CTYPE category is defined in detail in Section 7.3.1 (on page 139).

- Vol 1: Base Definitions, Chapter 7, «Locale», 7.3.1, «LC_CTYPE».

On page 140, line 4047 ff. (alpha) change

  In a locale definition file, no character specified for the keywords cntrl, digit, punct, or space shall be specified.

to

  In a locale definition file, no character specified for the keywords cntrl, digit, punct, symbol or space shall be specified.

On page 140, line 4065 ff. (space) change

  In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, graph, or xdigit shall be specified.

to

  In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, graph, symbol or xdigit shall be specified.

On page 140, line 4072 ff. (cntrl) change

  In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, punct, graph, print, or xdigit shall be specified.

to

  In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, punct, graph, print, symbol or xdigit shall be specified.

On page 140, line 4081 ff. (graph) change

  In the POSIX locale, all characters in classes alpha, digit, and punct shall be included; no characters in class cntrl shall be included.
  In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, and punct are automatically included in this class. No character specified for the keyword cntrl shall be specified.

to

  In the POSIX locale, all characters in classes alpha, digit, punct and symbol shall be included; no characters in class cntrl shall be included.
  In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct and symbol are automatically included in this class. No character specified for the keyword cntrl shall be specified.

On page 141, line 4090 ff. (print) change

  In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct, graph, and the <space> are automatically included in this class. No character specified for the keyword cntrl shall be specified.

to

  In a locale definition file, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct, symbol, graph, and the <space> are automatically included in this class. No character specified for the keyword cntrl shall be specified.

On page 141, line 4093 add (insert before)

  symbol
  Define characters to be classified as characters representing symbols.
  In the POSIX locale, only: <dollar-sign>, <plus-sign>, <less-than-sign>, <equals-sign>, <greater-than-sign>, <circumflex>, <grave-accent>, <vertical-line> and <tilde> shall be included.
  In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, punct or as the <space> shall be specified. The <dollar-sign>, <plus-sign>, <less-than-sign>, <equals-sign>, <greater-than-sign>, <circumflex>, <grave-accent>, <vertical-line> and <tilde> of the portable character class may be included in this class even if they belong to the punct character class.

On page 142, line 4156 ff., change the table to

  Table 7-1 Valid Character Class Combinations

           | Can Also Belong To
  In Class | upper | lower | alpha | digit | space | cntrl | punct | symbol | graph | print | xdigit | blank
  upper    |       |   -   |   A   |   x   |   x   |   x   |   x   |   x    |   A   |   A   |   -    |   x
  lower    |   -   |       |   A   |   x   |   x   |   x   |   x   |   x    |   A   |   A   |   -    |   x
  alpha    |   -   |   -   |       |   x   |   x   |   x   |   x   |   x    |   A   |   A   |   -    |   x
  digit    |   x   |   x   |   x   |       |   x   |   x   |   x   |   x    |   A   |   A   |   A    |   x
  space    |   x   |   x   |   x   |   x   |       |   -   |   *   |   x    |   *   |   *   |   x    |   -
  cntrl    |   x   |   x   |   x   |   x   |   -   |       |   x   |   x    |   x   |   x   |   x    |   -
  punct    |   x   |   x   |   x   |   x   |   -   |   x   |       |   *    |   A   |   A   |   x    |   -
  symbol   |   x   |   x   |   x   |   x   |   x   |   x   |   *   |        |   A   |   A   |   x    |   x
  graph    |   -   |   -   |   -   |   -   |   -   |   x   |   -   |   -    |       |   A   |   -    |   -
  print    |   -   |   -   |   -   |   -   |   -   |   x   |   -   |   -    |   -   |       |   -    |   -
  xdigit   |   -   |   -   |   -   |   -   |   x   |   x   |   x   |   x    |   A   |   A   |        |   x
  blank    |   x   |   x   |   x   |   x   |   A   |   -   |   *   |   x    |   *   |   *   |   x    |

On page 143, line 4174 ff. change

  2. The <space>, which is part of the space and blank classes, cannot belong to punct or graph, but shall automatically belong to the print class. Other space or blank characters can be classified as any of punct, graph, or print.

to

  2. The <space>, which is part of the space and blank classes, cannot belong to punct, symbol or graph, but shall automatically belong to the print class. Other space or blank characters can be classified as any of punct, graph, or print.

On page 143, line 4177 add (insert before)

  3. Historically the <dollar-sign>, <plus-sign>, <less-than-sign>, <equals-sign>, <greater-than-sign>, <circumflex>, <grave-accent>, <vertical-line> and <tilde> symbol characters were included in the punct character class.

On page 144, line 4215 add (insert before)

  #
  symbol  <dollar-sign>;<plus-sign>;<less-than-sign>;\
          <equals-sign>;<greater-than-sign>;<circumflex>;\
          <grave-accent>;<vertical-line>;<tilde>

On page 145, line 4271 change

  <dollar-sign> punct, print, graph

to

  <dollar-sign> punct, symbol, print, graph

On page 145, line 4278 change

  <plus-sign> punct, print, graph

to

  <plus-sign> punct, symbol, print, graph

On page 145, lines 4295-4297 change

  <less-than-sign> punct, print, graph
  <equals-sign> punct, print, graph
  <greater-than-sign> punct, print, graph

to

  <less-than-sign> punct, symbol, print, graph
  <equals-sign> punct, symbol, print, graph
  <greater-than-sign> punct, symbol, print, graph

On page 146, line 4330 change

  <circumflex> punct, print, graph

to

  <circumflex> punct, symbol, print, graph

On page 146, line 4332 change

  <grave-accent> punct, print, graph

to

  <grave-accent> punct, symbol, print, graph

On page 146, line 4360 change

  <vertical-line> punct, print, graph

to

  <vertical-line> punct, symbol, print, graph

On page 146, line 4362 change

  <tilde> punct, print, graph

to

  <tilde> punct, symbol, print, graph

- Vol 1: Base Definitions, Chapter 7, «Locale», 7.4.2, «Locale Grammar».

On page 167, line 5302 add (insert before)

  | 'symbol'
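The historical overlap that the desired action relies on can be checked with a short C sketch. It lists the nine portable characters proposed for the symbol class and counts how many of them ispunct() already accepts in the current locale. The names proposed_symbols, is_proposed_symbol and count_proposed_symbols_in_punct are hypothetical helpers invented for this illustration; no symbol class or issymbol() function exists in the standard.

```c
#include <ctype.h>
#include <string.h>

/* The nine portable characters the proposal lists for the symbol class.
   Illustration only: this array is an assumption of the sketch, not
   text from the standard. */
static const char proposed_symbols[] = "$+<=>^`|~";

/* Hypothetical predicate: nonzero if c is one of the nine characters. */
int is_proposed_symbol(int c)
{
    /* strchr() would also match the terminating NUL, so exclude it. */
    return c != '\0' && strchr(proposed_symbols, c) != NULL;
}

/* Count how many of the nine characters ispunct() classifies as punct
   in the current LC_CTYPE locale. */
int count_proposed_symbols_in_punct(void)
{
    int n = 0;
    for (const char *p = proposed_symbols; *p != '\0'; p++) {
        if (ispunct((unsigned char)*p))
            n++;
    }
    return n;
}
```

In the "C" locale, which is in effect at program start, the count is expected to be nine, since all nine characters are currently classified as punct there.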
Tags | UTF-8_Locale
Attached Files |
(0001987) Don Cragun (manager) 2013-11-16 07:15
This bug was originally filed against the Rationale section of the 2008-TC1 project and the page and line number for the 1st suggested change did not match any version of the standard (the page number was off by one from the 2013 edition). That mistaken page number has been corrected and this bug has been moved to Category Base Definitions and Headers in the Project 1003.1(2013)/Issue7+TC1.

If these changes are made to the POSIX Locale, don't we also need to add issymbol() and iswsymbol() functions to XSH?

If we add a new character class to the POSIX Locale, is it still appropriate to require that the C Locale and the POSIX Locale be synonyms for the same locale?
(0001988) steffen (reporter) 2013-11-16 15:51
Addition: XBD section 7.3.1, page 139, lines 4023-4025 change

  The character classes digit, xdigit, lower, upper, and space have a set of automatically included characters.

to

  The character classes digit, xdigit, lower, upper, symbol, blank and space have a set of automatically included characters.

[Note that "blank" doesn't belong to this issue, but oh dear]
(0001989) steffen (reporter) 2013-11-16 16:27
Addition: XRAT A.7.3.1 «LC_CTYPE», page 3488, line 117720 change

  The character classes digit, xdigit, lower, upper, and space have a set of automatically included

to

  The character classes digit, xdigit, lower, upper, symbol, blank and space have a set of automatically included

[Again, "blank" doesn't belong to this issue]
(0001990) steffen (reporter) 2013-11-16 21:08
Note: 0001987:

> If these changes are made to the POSIX Locale, don't we also need to add issymbol() and iswsymbol() functions to XSH?

I've opened 0000797.

> If we add a new character class to the POSIX Locale, is it still appropriate to require that the C Locale and the POSIX Locale be synonyms for the same locale?

It seems to me that POSIX already defines itself (the POSIX locale) to be a (compatible) superset of what ISO C requires for the "C" locale. I've opened 0000796. The character class itself however would be an extension (at the time of this writing).
(0001991) shware_systems (reporter) 2013-11-16 23:32
RE: #0001987

I'm more into pestering the 'C' standard folks to add issymbol() and iswsymbol() first, as something that should have been added to C11. What else may be considered 'missing' is a separate gripe. This in particular has been a known issue since before C89: some common code sets are mostly symbols, in the linguistic sense, and not any of the other ctype keywords.

Until that happens I think extra ctype keywords like this, and even locale categories, should be a testable option an implementation supports. That's all I see that can be done while deferring to 'C' as the base behavior. As it is, most of the proposed changes here would need to be shaded CX until 'C' incorporates them. They could be shaded LCX for a Locale Extensions Option. Such an Option could specify additional standardized locales that would incorporate this and other changes without burdening embedded systems that only need the base POSIX locale as it is now, whether localedef is also supported or not. This would be consistent with how other extensions that are now part of the base have been introduced to the standard.

Another possibility is that 'symbol' becomes a reserved charclass-name for the charclass keyword, with defined behavior if it is specified as a charclass argument, similar to the elective Environment Variables of XBD Section 8. That extension mechanism is already in place, and supporting it there might make this a TC2 candidate, as it does apply to other code sets besides Unicode. Making it a non-charclass optional keyword could be left to Issue 8 as part of a full Unicode proposal, with modifying the 'C' locale left to Issue 9.
(0001992) geoffclare (manager) 2013-11-18 10:23 |
There is a major discrepancy between the description and the desired action, in that the description says symbols are currently classified as punctuation characters and portability issues will forbid changing that, but the desired action does exactly that. In any case, this appears to be invention and therefore cannot be standardized until it has been implemented in at least one widely used system. (The related 0000797 adding new is[w]symbol[_l]() interfaces would also need a sponsor.) |
(0001993) steffen (reporter) 2013-11-18 13:30
Reply to Note: 0001992.

> There is a major discrepancy [.]
> the desired action does exactly that.

Oh, a Freudian slip! Correction:

On page 141, line 4093 add (insert before)

  symbol
  Define characters to be classified as characters representing symbols.
  In the POSIX locale, only: <dollar-sign>, <plus-sign>, <less-than-sign>, <equals-sign>, <greater-than-sign>, <circumflex>, <grave-accent>, <vertical-line> and <tilde> shall be included.
  In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, punct or as the <space> shall be specified. Historically the <dollar-sign>, <plus-sign>, <less-than-sign>, <equals-sign>, <greater-than-sign>, <circumflex>, <grave-accent>, <vertical-line> and <tilde> of the portable character class are also included in the punct class.

> this appears to be invention

POSIX already allows implementations to add additional [:class:]es. IBM ICU / Unicode make extensive use of that and standardize extensions ([1]) which are covered via \p{PROPERTY} in engines like the one from perl(1) (and compatible), e.g., [:script=greek:]. (Note they also extend the syntax with something that is direly missing: class intersections and subtractions; e.g., [[:punct:]--[:symbol:]], which means the set of all punctuation characters that are not also symbols.)

So yes, it is absent (to the best of my knowledge) from any POSIX implementation, but it is standardized, supported and even syntactically extended by the leading industry standard, Unicode. There is an online example ([2]) which can be used to test character classes like [:ASCII:], [:symbol:] and [:mark:]. The latter is also the second major category that is entirely missing from the POSIX standard. (And testing it via [2] shows a result set of no fewer than 1,645 code points.)

[1] http://www.unicode.org/reports/tr18/
[2] http://unicode.org/cldr/utility/list-unicodeset.jsp

> adding new is[w]symbol[_l]() interfaces would also need a sponsor

I'm afraid so; as a last resort these issues could be left open until they come in via the C standard? And what about «mark», isw?mark(_l)?() and [:mark:]?
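The class subtraction [[:punct:]--[:symbol:]] mentioned in this note has no POSIX bracket-expression equivalent, but its meaning can be sketched in C for the ASCII range. Because no symbol class exists in POSIX, the sketch hard-codes the nine characters from the proposal; the array and both function names are hypothetical and exist only for this illustration.

```c
#include <ctype.h>
#include <string.h>

/* Assumption of this sketch: the "symbol" set is the nine portable
   characters named in the proposal, since POSIX defines no such class. */
static const char proposed_symbols[] = "$+<=>^`|~";

/* Emulates membership in [[:punct:]--[:symbol:]] for ASCII characters:
   true for punctuation characters that are not also proposed symbols. */
int is_punct_not_symbol(int c)
{
    if (c <= 0 || c > 127)
        return 0;                      /* outside the range covered here */
    return ispunct(c) != 0 && strchr(proposed_symbols, c) == NULL;
}

/* Counts the ASCII characters in the difference set for the current
   LC_CTYPE locale. */
int count_punct_not_symbol(void)
{
    int n = 0;
    for (int c = 1; c <= 127; c++) {
        if (is_punct_not_symbol(c))
            n++;
    }
    return n;
}
```

In the "C" locale the 32 ASCII punctuation characters minus the nine proposed symbols leave 23 characters in the difference set.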
(0001994) shware_systems (reporter) 2013-11-18 14:37 |
They get added as charclass keyword names and the generic iswctype() is used directly or in wrapper functions by REs and tr. This is already required behavior. These additions can overlap other classes, as xdigit combines digits and some upper and lower, so with 'symbol' defined that way a code point can be both punct and symbol for current code sets. Restricting it to non-alpha, non-xdigit, non-NUL should suffice to keep backwards compatibility. For some code sets allowing a code point to be both symbol and blank, cntrl, or punct is warranted. |
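The mechanism described in this note, a charclass name looked up at run time and tested through the generic iswctype(), might look roughly like the following C sketch. wctype() and iswctype() are the standard interfaces; wc_in_class() and my_iswsymbol() are hypothetical wrapper names chosen for this illustration, and whether a class named "symbol" exists depends entirely on the locale in use.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if wc belongs to the named class in the
   current LC_CTYPE locale, 0 if it does not, and -1 if the locale does
   not define a class with that name. */
int wc_in_class(wint_t wc, const char *class_name)
{
    assert(class_name != NULL);

    /* wctype() returns 0 when the name is not a valid class here. */
    wctype_t desc = wctype(class_name);
    if (desc == (wctype_t)0)
        return -1;

    return iswctype(wc, desc) != 0;
}

/* Hypothetical wrapper in the spirit of the proposed iswsymbol():
   reports false when the locale has no "symbol" charclass at all. */
int my_iswsymbol(wint_t wc)
{
    return wc_in_class(wc, "symbol") > 0;
}
```

A caller that needs to distinguish "not a symbol" from "this locale has no symbol class" would use wc_in_class() directly and check for -1.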
(0005919) Don Cragun (manager) 2022-07-28 16:29 |
This was discussed during the 2022-07-28 conference call. There is no existing practice for this change, so this bug is rejected. And, the proposed changes conflict with existing character class requirements. These conflicts should be addressed as part of 0001548, not here. |