0000795: Addition of a new «symbol» character class

Notes
(0001987) Don Cragun (manager) 2013-11-16 07:15	This bug was originally filed against the Rationale section of the 2008-TC1 project and the page and line number for the 1st suggested change did not match any version of the standard (the page number was off by one from the 2013 edition). That mistaken page number has been corrected and this bug has been moved to Category Base Definitions and Headers in the Project 1003.1(2013)/Issue7+TC1. If these changes are made to the POSIX Locale, don't we also need to add issymbol() and iswsymbol() functions to XSH? If we add a new character class to the POSIX Locale, is it still appropriate to require that the C Locale and the POSIX Locale be synonyms for the same locale?

(0001988) steffen (reporter) 2013-11-16 15:51	Addition: XBD section 7.3.1, page 139, lines 4023-4025, change
    The character classes digit, xdigit, lower, upper, and space have a set of automatically included characters.
to
    The character classes digit, xdigit, lower, upper, symbol, blank and space have a set of automatically included characters.
[Note that "blank" doesn't belong to this issue, but oh dear]

(0001989) steffen (reporter) 2013-11-16 16:27	Addition: XRAT A.7.3.1 «LC_CTYPE», page 3488, line 117720, change
    The character classes digit, xdigit, lower, upper, and space have a set of automatically included
to
    The character classes digit, xdigit, lower, upper, symbol, blank and space have a set of automatically included
[Again, "blank" doesn't belong to this issue]

(0001990) steffen (reporter) 2013-11-16 21:08	Note: 0001987:
> If these changes are made to the POSIX Locale, don't we also need to add issymbol() and iswsymbol() functions to XSH?
I've opened 0000797.
> If we add a new character class to the POSIX Locale, is it still appropriate to require that the C Locale and the POSIX Locale be synonyms for the same locale?
It seems to me that POSIX already defines itself (the POSIX locale) to be a (compatible) superset of what ISO C requires for the "C" locale. I've opened 0000796. The character class itself however would be an extension (at the time of this writing).

(0001991) shware_systems (reporter) 2013-11-16 23:32	RE: #0001987 I'm more into pestering the 'C' standard folks to add issymbol() and iswsymbol() first, as something that should have been added to C11. What else may be considered 'missing' is a separate gripe. This in particular has been a known issue since before C89: some common code sets are mostly symbols, in the linguistic sense, and fit none of the other ctype keywords. Until that happens I think extra ctype keywords like this, and even locale categories, should be a testable option an implementation supports. That's all I see that can be done while deferring to 'C' as the base behavior. As it is, most of the proposed changes here would need to be shaded CX until 'C' incorporates them. They could be shaded LCX for a Locale Extensions Option. Such an Option could specify additional standardized locales that would incorporate this and other changes without burdening embedded systems that only need the base POSIX locale as it is now, whether or not localedef is also supported. This would be consistent with how other extensions that are now part of the base have been introduced to the standard. Another possibility is that 'symbol' becomes a reserved charclass-name for the charclass keyword, with defined behavior if it is specified as a charclass argument, similar to the elective Environment Variables of XBD Section 8. That extension mechanism is already in place, and supporting it there might make this a TC2 candidate, as it does apply to other code sets besides Unicode. Making it a non-charclass optional keyword could be left to Issue 8 as part of a full Unicode proposal, with modifying the 'C' locale left to Issue 9.

(0001992) geoffclare (manager) 2013-11-18 10:23	There is a major discrepancy between the description and the desired action, in that the description says symbols are currently classified as punctuation characters and portability issues will forbid changing that, but the desired action does exactly that. In any case, this appears to be invention and therefore cannot be standardized until it has been implemented in at least one widely used system. (The related 0000797 adding new is[w]symbol[_l]() interfaces would also need a sponsor.)

(0001993) steffen (reporter) 2013-11-18 13:30	Reply to Note: 0001992.
> There is a major discrepancy [.]
> the desired action does exactly that.
Oh, a Freudian slip! Correction: On page 141, line 4093 add (insert before) symbol Define characters to be classified as characters representing symbols. In the POSIX locale, only: <dollar-sign>, <plus-sign>, <less-than-sign>, <equals-sign>, <greater-than-sign>, <circumflex>, <grave-accent>, <vertical-line> and <tilde> shall be included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, punct or as the <space> shall be specified. Historically the <dollar-sign>, <plus-sign>, <less-than-sign>, <equals-sign>, <greater-than-sign>, <circumflex>, <grave-accent>, <vertical-line> and <tilde> of the portable character class are also included in the punct class.
> this appears to be invention
POSIX already allows implementations to add additional [:class:]es. IBM ICU / Unicode make extensive use of that and standardize extensions ([1]) which are covered via \p{PROPERTY} in engines like the one from perl(1) (and compatible), e.g., [:script=greek:]. (Note they also extend the syntax with something that is direly missing: class intersections and subtractions; e.g., [[:punct:]--[:symbol:]], which means the set of all punctuation characters that are not also symbols.) So yes, it is absent (to the best of my knowledge) from any POSIX implementation, but it is standardized, supported, and even syntactically extended by the leading industry standard, Unicode. There is an online example ([2]) which can be used to test character classes like [:ASCII:], [:symbol:] and [:mark:]. The latter is also the second major category that is entirely missing from the POSIX standard. (And testing it via [2] shows a result set of not less than 1,645 Code Points.)
[1] http://www.unicode.org/reports/tr18/
[2] http://unicode.org/cldr/utility/list-unicodeset.jsp
> adding new is[w]symbol[_l]() interfaces would also need a sponsor
I'm afraid so; as a last resort these issues could be left open until they come in via the C standard? And what about «mark», isw?mark(_l)?() and [:mark:]?
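
As an illustration of the proposal in this note, the following is a minimal sketch in C, assuming the default "C" locale at program startup. It hard-codes the nine characters proposed for the symbol class and emulates the [[:punct:]--[:symbol:]] subtraction for 7-bit characters. The names proposed_symbols, is_proposed_symbol, is_punct_not_symbol, count_punct_not_symbol and check_historical_overlap are hypothetical helpers invented for this sketch; no standard interface provides a symbol class.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* The nine characters proposed in note 0001993 for "symbol" in the
 * POSIX locale. Hypothetical table, not part of any standard API. */
static const char proposed_symbols[] = "$+<=>^`|~";

/* True if c is one of the nine proposed symbol characters. */
static bool is_proposed_symbol(int c)
{
    /* strchr() also matches the terminating NUL, so reject c == 0. */
    return c > 0 && c <= 127 && strchr(proposed_symbols, c) != NULL;
}

/* Emulates [[:punct:]--[:symbol:]] for 7-bit characters: punctuation
 * in the current locale that is not in the proposed symbol set. */
static bool is_punct_not_symbol(int c)
{
    return c >= 0 && c <= 127 && ispunct(c) && !is_proposed_symbol(c);
}

/* Counts the 7-bit characters left in punct after the subtraction. */
static int count_punct_not_symbol(void)
{
    int n = 0;
    for (int c = 0; c <= 127; c++)
        if (is_punct_not_symbol(c))
            n++;
    return n;
}

/* Checks the historical overlap the note mentions: every proposed
 * symbol character is currently classified as punct. */
static void check_historical_overlap(void)
{
    for (const char *p = proposed_symbols; *p != '\0'; p++)
        assert(ispunct((unsigned char)*p));
}
```

In the "C" locale, ispunct() accepts 32 characters, so count_punct_not_symbol() leaves 23 once the nine proposed symbols are removed. The sketch only models the overlap; it says nothing about how a locale definition file would have to resolve it.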

(0001994) shware_systems (reporter) 2013-11-18 14:37	They get added as charclass keyword names and the generic iswctype() is used directly or in wrapper functions by REs and tr. This is already required behavior. These additions can overlap other classes, as xdigit combines digits and some upper and lower, so with 'symbol' defined that way a code point can be both punct and symbol for current code sets. Restricting it to non-alpha, non-xdigit, non-NUL should suffice to keep backwards compatibility. For some code sets allowing a code point to be both symbol and blank, cntrl, or punct is warranted.
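
The mechanism this note refers to, looking up a class by its charclass name and testing membership with the generic iswctype(), can be sketched roughly as follows. wctype() and iswctype() are the standard interfaces; the wrappers class_is_defined and in_named_class are hypothetical names for this sketch, and whether a name such as "symbol" resolves at all depends entirely on the locale in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* True if the current LC_CTYPE locale defines a class with this name.
 * wctype() returns zero for names the locale does not know. */
static bool class_is_defined(const char *class_name)
{
    assert(class_name != NULL);
    return wctype(class_name) != (wctype_t)0;
}

/* True if wc belongs to the named class in the current locale.
 * An undefined class (for example "symbol" in a locale that lacks it)
 * yields false instead of passing a zero descriptor to iswctype(). */
static bool in_named_class(wint_t wc, const char *class_name)
{
    assert(class_name != NULL);
    wctype_t desc = wctype(class_name);
    if (desc == (wctype_t)0)
        return false;
    return iswctype(wc, desc) != 0;
}
```

A caller could probe class_is_defined("symbol") first and fall back to "punct" when the locale does not provide the extra class, which is one way an application could treat the class as a testable option.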

(0005919) Don Cragun (manager) 2022-07-28 16:29	This was discussed during the 2022-07-28 conference call. There is no existing practice for this change, so this bug is rejected. And, the proposed changes conflict with existing character class requirements. These conflicts should be addressed as part of 0001548, not here.

Issue History
Date Modified	Username	Field	Change
2013-11-15 22:46	steffen	New Issue
2013-11-15 22:46	steffen	Status	New => Under Review
2013-11-15 22:46	steffen	Assigned To	=> ajosey
2013-11-15 22:46	steffen	Name	=> steffen
2013-11-15 22:46	steffen	Section	=> Vol 1. 3., Vol 1. 7.,
2013-11-15 22:46	steffen	Page Number	=> 92, 140-146, 167
2013-11-15 22:46	steffen	Line Number	=> 2586, 4047, 4065, 4072, 4081, 4090, 4093, 4156, 4174, 4177, 4215, 4271, 4278, 4295-4297, 4330, 4332, 4360, 4362, 5302
2013-11-16 07:04	Don Cragun	Project	2008-TC1 => 1003.1(2013)/Issue7+TC1
2013-11-16 07:15	Don Cragun	Section	Vol 1. 3., Vol 1. 7., => XBD section 3 (Definitions) and section 7 (Locale)
2013-11-16 07:15	Don Cragun	Page Number	92, 140-146, 167 => 93, 140-146, 167
2013-11-16 07:15	Don Cragun	Interp Status	=> ---
2013-11-16 07:15	Don Cragun	Note Added: 0001987
2013-11-16 07:15	Don Cragun	Severity	Editorial => Objection
2013-11-16 07:15	Don Cragun	Category	Rationale => Base Definitions and Headers
2013-11-16 07:15	Don Cragun	Desired Action Updated
2013-11-16 15:51	steffen	Note Added: 0001988
2013-11-16 16:27	steffen	Note Added: 0001989
2013-11-16 21:08	steffen	Note Added: 0001990
2013-11-16 23:32	shware_systems	Note Added: 0001991
2013-11-18 10:23	geoffclare	Note Added: 0001992
2013-11-18 10:24	geoffclare	Relationship added	related to 0000797
2013-11-18 10:24	geoffclare	Relationship added	related to 0000798
2013-11-18 13:30	steffen	Note Added: 0001993
2013-11-18 14:37	shware_systems	Note Added: 0001994
2014-01-16 16:45	Don Cragun	Tag Attached: UTF-8_Locale
2022-07-28 16:22	eblake	Relationship added	related to 0001548
2022-07-28 16:29	Don Cragun	Note Added: 0005919
2022-07-28 16:29	Don Cragun	Status	Under Review => Closed
2022-07-28 16:29	Don Cragun	Resolution	Open => Rejected
