0000554: Five(?) clarifications of the informal regex descriptions

ID	Project	Category	View Status	Date Submitted	Last Update

0000554	1003.1(2008)/Issue 7	Base Definitions and Headers	public	2012-04-10 20:04	2019-06-10 08:55

Reporter	dubiousjim	Assigned To	ajosey
Priority	normal	Severity	Editorial	Type	Clarification Requested
Status	Closed	Resolution	Accepted As Marked

Name	Jim Pryor
Organization
User Reference
Section	XBD Chapter 9
Page Number	181-191
Line Number	see below
Interp Status	---
Final Accepted Text	See 0000554:0001308, 0000554:0001309, 0000554:0001310, 0000554:0001311, 0000554:0001315


Summary	0000554: Five(?) clarifications of the informal regex descriptions
Description	[Edited to correct page/line numbers to the 2008 base document -- Mark Brown]

I hope not to be wasting anyone's time with this. I realize this Draft is well advanced and many eyes have already looked at this material; however, I've recently noticed some unclarities in the informal summaries of BRE and ERE grammar that might arguably benefit from being tightened up.

(1) p. 182, lines 5842-5848, of POSIX.1-2008 say:

| In the regular expression processing
| described in POSIX.1-2008, the <newline> is regarded as an ordinary character and both a
| <period> and a non-matching list can match one. The Shell and Utilities volume of
| POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing
| regular expressions whether they permit matching of <newline> characters; if not stated
| otherwise, the use of literal <newline> characters or any escape sequence equivalent produces
| undefined results.

One awkwardness here is that the first sentence can be read as defining the default behavior, but the last clause of the second sentence says the behavior when not explicitly stated is undefined. Is the intent that the first sentence not be stating default behavior, but only what's assumed in examples in the standard? (Are there any examples where this is relied on?) Or is the intent rather that the first sentence is stating default behavior outside of the Shell and Utilities, and the second sentence is meant to be understood as restricted to that context? (In that case, I'd recommend making the restriction explicit in the second clause.)

A second awkwardness here is that the second clause of the second sentence leaves it ambiguous whether it's the use of newlines in text being matched which is undefined, or the use of newlines in patterns, or both.

(2) p. 184, line 5894, and p. 189, line 6106 of POSIX.1-2008 say that "^" is a special character when it occurs in [^...] position. It's unclear what the point of designating it as "special" in that position is: "-" and "]" aren't designated as "special" when they occur in bracket constructions (nor are "." or ":" or "=", though the sequences "[." and so on are called "special").

The role of "special" in the grammar seems motivated by two things: (i) sometimes, to indicate that the character in that position isn't representing itself; and (ii) to mark those characters the \-escaping of which has a special definition. (i) is not uniformly applied; but perhaps it is what motivates calling the circumflex in "[^...]" "special." If we understand (ii) to be the criterion for "specialness", then presumably we shouldn't count any characters (or character-sequences) inside brackets as "special". "[\^...]" doesn't escape the "^"; it's just a bracket expression with a non-special occurrence of "\" followed by a non-special occurrence of "^".

(3) p. 188, lines 6081-6083, of POSIX.1-2008 say:

| An ordinary character is an ERE that matches itself. An ordinary character is any character in the
| supported character set, except for the ERE special characters listed in Section 9.4.3. The
| interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

This seems to say that the interpretation of "a" in "[\a]" is undefined. Contrast p. 183, lines 5873-5879, which explicitly exclude occurrences of these characters inside brackets:

| An ordinary character is a BRE that matches itself: any character in the supported character set,
| except for the BRE special characters listed in Section 9.3.3.
| The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined,
| except for: ...
| -- A character inside a bracket expression

(4) p. 188, lines 6095-6099, of POSIX.1-2008 say:

| *+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except
| when used in a bracket expression (see Section 9.3.5, on page 184). Any of the following uses produce undefined results:
| -- If these characters appear first in an ERE, or immediately following a <vertical-
| line>, <circumflex>, or <left-parenthesis>

This seems to say that "*" is not special in the pattern "a^*"; and indeed, in EREs (as opposed to BREs), "^" is anchoring even when it occurs in the middle of the pattern. So we understand "a^*" to be a valid pattern of a literal "a", an anchor, and a literal "*": a valid pattern, but one that will never match any text.

It's curious that nothing is said about the pattern "$*z". If you want to require implementations to accept "a^*" as valid, instead of an error, why wouldn't you want to do so for "$*z" too? Or if you instead want to permit implementations latitude to reject the latter as an error, why wouldn't you want to do so for the former too?

(5) p. 183, lines 5886-5888, of POSIX.1-2008 say:

| An expression containing
| a '[' that is not preceded by a <backslash> and is not part of a bracket expression
| produces undefined results.

This is in the description of "BRE Special Characters". There is no analogous statement for an unescaped, unmatched '[' in EREs. The implementations I checked (FreeBSD egrep, GNU egrep, BusyBox egrep) all rejected the pattern "[" as an error.
Desired Action	[Edited to correct page/line numbers to the 2008 base document -- Mark Brown]

(1) I don't understand what is intended by seeming to specify the default interpretation, and then saying the default is undefined, so I can't offer a recommendation there. As to the issue of whether it was newlines in patterns or newlines in matched text that was meant, I recommend just adding one of the phrases "in patterns", "in matched text", or "in either patterns or matched text" at the ___ indicated:

| if not stated
| otherwise, the use of literal <newline> characters or any escape sequence equivalent ___ produces
| undefined results.

(p. 182, lines 5846-5850)

(2) Decide whether the criterion for being a "special" occurrence of a character is (a) not representing yourself, or (b) the way you interact with "\". If (b), then the circumflex in "[^...]" is not "special"; and neither are sequences like "[:". If (a), then various other characters also merit being called "special" (at least "-" and "]" when they occur after "["); but this will take some more careful review of the Chapter and is arguably not worth the time and effort.

(3) Change p. 188, lines 6082-6083 from:

| The
| interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

to:

| The
| interpretation of an ordinary character preceded by a <backslash> ('\\'), outside of a bracket expression, is undefined.

(4) On p. 188, change lines 6098-6099 from:

| -- If these characters appear first in an ERE, or immediately following a <vertical-
| line>, <circumflex>, or <left-parenthesis>

to:

| -- If these characters appear first in an ERE, or immediately following a <vertical-
| line>, <circumflex>, <dollar>, or <left-parenthesis>

This counts quantifier characters following anchors as everywhere valid (but interpreted literally); to count them as everywhere an error, instead omit the items "<circumflex>, <dollar>".

(5) On p. 188, add after line 6092:

"A <left-square-bracket> that is not preceded by a <backslash> and is not part of a bracket expression also produces undefined results."
Tags	tc2-2008

dubiousjim 2012-04-10 20:18 reporter bugnote:0001194	Some further observations: Re item (2), note that in the formal grammars, no characters inside bracket constructions are counted as part of SPEC_CHAR. "^", "-", and "]" after a not-yet-closed "[" are instead all counted as META_CHAR. Re item (4), the "<vertical-line>", "<left-parenthesis>" and so on should be prefixed with "unescaped".

nick 2012-04-12 15:36 manager bugnote:0001197	Need to update the page and line numbers to match base document, rather than the D4 merged document.

dubiousjim 2012-04-12 15:56 reporter bugnote:0001199	I don't think I have access to a version of the base document with line numbers. I've registered at the OpenGroup site, with the Austin site (and was able to get the D4 merged document there), and at this BugTracker. But when I attempt to download the PDF from <https://www2.opengroup.org/ogsys/jsp/publications/PublicationDetails.jsp?catalogno=c082>, I'm told "You do not have the correct privileges to be able to view this item. You can try logging in as a different user using the links on the left." Will delete this note when resolved, not sure where else to respond.

msbrown 2012-07-12 16:25 manager bugnote:0001305	Edited to correct page/line numbers in Description and Desired Action to the 2008 base document.

nick 2012-07-19 15:20 manager bugnote:0001308 Last edited: 2012-07-19 15:33	In response to item 1:

Change p. 182, lines 5842-5848, of POSIX.1-2008 from:

In the regular expression processing described in POSIX.1-2008, the <newline> is regarded as an ordinary character and both a <period> and a non-matching list can match one. The Shell and Utilities volume of POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of <newline> characters; if not stated otherwise, the use of literal <newline> characters or any escape sequence equivalent produces undefined results.

to:

In the functions processing regular expressions described in the System Interfaces volume of POSIX.1-2008, the <newline> is regarded as an ordinary character and both a <period> and a non-matching list can match one. The Shell and Utilities volume of POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of <newline> characters; if not stated otherwise, the use of literal <newline> characters or any escape sequence equivalent in either patterns or matched text produces undefined results.

nick 2012-07-19 15:29 manager bugnote:0001309 Last edited: 2012-08-02 15:37	In response to item 2:

At both p. 184, lines 5894-5896 AND p. 189, lines 6106-6109 of POSIX.1-2008, change:

The <circumflex> shall be special when used as:
- An anchor (see Section XXX on page YYY)
- The first character of a bracket expression (see Section AAA on page BBB)

to:

The <circumflex> shall be special when used as an anchor (see Section XXX, on page YYY). The <circumflex> shall signify a non matching list expression when it occurs first in a list, immediately following a <left-square-bracket> (see Section AAA on page BBB).

At p 192 line 6233, delete the phrase "or when first in a bracket expression".

nick 2012-07-19 15:59 manager bugnote:0001310 Last edited: 2012-07-26 15:37	In response to item 3:

Change p. 188, lines 6082-6083 from:

The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

to:

The interpretation of an ordinary character preceded by an unescaped <backslash> ('\\') is undefined, except in the context of a bracket expression (see XREF to ERE bracket expression).

Also for BRE (page 183 line 5875), change:

The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined, except for:

to:

The interpretation of an ordinary character preceded by an unescaped <backslash> ('\\') is undefined, except for:

Also, at page 195, line 6381, change:

ORD_CHAR preceded by a <backslash> character

to:

ORD_CHAR preceded by an unescaped <backslash> character
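The amended ERE wording in this note can be read as a scanning rule: a backslash matters only when it is itself unescaped and sits outside a bracket expression. The following is a minimal sketch of that reading in Python, not a description of any real regex implementation. The names `ERE_SPECIAL`, `skip_bracket`, and `undefined_escapes` are hypothetical, and the set of special characters is assumed to be the one listed in XBD 9.4.3.

```python
ERE_SPECIAL = set(".[\\()*+?{|^$")  # assumed ERE special characters (XBD 9.4.3)


def skip_bracket(pattern, i):
    """Given pattern[i] == '[', return the index just past the closing ']',
    or -1 if the bracket expression is never closed."""
    j = i + 1
    if j < len(pattern) and pattern[j] == "^":   # non-matching list marker
        j += 1
    if j < len(pattern) and pattern[j] == "]":   # a leading ']' is literal
        j += 1
    while j < len(pattern):
        if pattern[j] == "[" and pattern[j + 1:j + 2] in (".", ":", "="):
            # collating symbol, character class, or equivalence class
            close = pattern.find(pattern[j + 1] + "]", j + 2)
            if close == -1:
                return -1
            j = close + 2
        elif pattern[j] == "]":
            return j + 1
        else:
            j += 1
    return -1


def undefined_escapes(ere):
    """Return indices of unescaped backslashes that precede an ordinary
    character outside a bracket expression."""
    found = []
    i = 0
    while i < len(ere):
        if ere[i] == "\\":
            if i + 1 < len(ere) and ere[i + 1] not in ERE_SPECIAL:
                found.append(i)
            i += 2          # the escaped character is consumed as well
        elif ere[i] == "[":
            end = skip_bracket(ere, i)
            if end == -1:
                break       # unterminated bracket: outside this sketch's scope
            i = end
        else:
            i += 1
    return found


print(undefined_escapes(r"\a"))    # [0]
print(undefined_escapes(r"[\a]"))  # []
print(undefined_escapes(r"\\a"))   # []
```

The third call reflects the point raised in bugnote 0001313: in `\\a` the second backslash is escaped by the first, so the `a` is not preceded by an unescaped backslash.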

nick 2012-07-19 16:13 manager bugnote:0001311 Last edited: 2012-07-26 15:40	In response to item 4:

On p. 188, change lines 6098-6099 from:

-- If these characters appear first in an ERE, or immediately following a <vertical-line>, <circumflex>, or <left-parenthesis>

to:

-- If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>

Also at page 195, lines 6379-6380, change:

The ERE grammar does not permit several constructs that previous sections specify as having undefined results:

to:

The ERE grammar does not permit several constructs that previous sections specify as having undefined results. Additionally, there are some constructs which the grammar permits but which still give undefined results:

At p 195 line 6383, add '$' to the list of chars.
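As an illustration of the amended bullet, the sketch below flags each duplication symbol that appears first in an ERE or immediately after an unescaped `|`, `^`, `$`, or `(`. It is a hedged reading of the text above, written in Python for convenience. The helper names `skip_bracket` and `undefined_dupl_positions` are hypothetical, and the sketch covers only this one bullet, not the other undefined uses listed in the standard.

```python
DUPL_SYMBOLS = "*+?{"   # characters the bullet is about
TRIGGERS = "|^$("       # unescaped characters they must not directly follow


def skip_bracket(pattern, i):
    """Given pattern[i] == '[', return the index just past the closing ']',
    or -1 if the bracket expression is never closed."""
    j = i + 1
    if j < len(pattern) and pattern[j] == "^":   # non-matching list marker
        j += 1
    if j < len(pattern) and pattern[j] == "]":   # a leading ']' is literal
        j += 1
    while j < len(pattern):
        if pattern[j] == "[" and pattern[j + 1:j + 2] in (".", ":", "="):
            close = pattern.find(pattern[j + 1] + "]", j + 2)
            if close == -1:
                return -1
            j = close + 2
        elif pattern[j] == "]":
            return j + 1
        else:
            j += 1
    return -1


def undefined_dupl_positions(ere):
    """Return indices of duplication symbols that are first in the ERE or
    immediately follow an unescaped '|', '^', '$', or '('."""
    found = []
    after_trigger = True    # the start of the ERE counts as a trigger
    i = 0
    while i < len(ere):
        c = ere[i]
        if c == "\\":
            i += 2                  # an escaped character is never a trigger
            after_trigger = False
        elif c == "[":
            end = skip_bracket(ere, i)
            if end == -1:
                break               # unterminated bracket: not handled here
            i = end
            after_trigger = False
        else:
            if c in DUPL_SYMBOLS and after_trigger:
                found.append(i)
            after_trigger = c in TRIGGERS
            i += 1
    return found


print(undefined_dupl_positions("a^*"))    # [2]
print(undefined_dupl_positions("$*z"))    # [1]
print(undefined_dupl_positions(r"a\|*"))  # []
```

The last call shows why the word "unescaped" was added: an escaped vertical line is a literal character, so a following `*` is an ordinary duplication.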

eblake 2012-07-19 16:35 manager bugnote:0001312 Last edited: 2012-07-26 15:41	In response to 0000554:0001311 (item 4), we also have to worry about the grammar. Line 6361 states:

ERE_expression : one_char_or_coll_elem_ERE
               | '^'
               | '$'
               | '(' extended_reg_exp ')'
               | ERE_expression ERE_dupl_symbol
               ;

which means the grammar requires the parser to accept both 'a^*' and '$*z' as valid expressions. The existing text at line 6098 said that even though 'a^*' parses, it still produces undefined results; and our addition of <dollar-sign> means that '$*z' is now in the same category. But line 6382 claims that the ERE grammar does not even accept the use of ERE_dupl_symbol after '^'; either we change the grammar to make that statement true (in which case, we should also expand the statement to cover '$'), or we need to correct that statement to drop '^' (the grammar as-is forbids ERE_dupl_symbol after '|' or '(', but not after '^'). (0000554:0001311 edited in place)

eblake 2012-07-19 17:11 manager bugnote:0001313 Last edited: 2012-07-26 15:20	In response to 0000554:0001310 (item 3), I think we need the phrase 'an unescaped <backslash>', since '\\a' is a well-defined regex. In that case, line 5875 of BRE Ordinary Characters needs a similar change. (note was edited in place to take this into account)

eblake 2012-07-26 15:19 manager bugnote:0001314 Last edited: 2012-07-26 15:44	In response to 0000554:0001309 (item 2), we also need to fix the grammar at line 6232, to distinguish between when ^ is special (as an anchor) vs. an alternate interpretation but not special (leading a bracket expression). (0000554:0001309 edited in place)

nick 2012-07-26 15:49 manager bugnote:0001315 Last edited: 2012-08-02 15:39	In response to item 5:

On p. 188, add after line 6092:

"A <left-square-bracket> that is unescaped and is not part of a bracket expression also produces undefined results."

Also at p 183 line 5887, change "that is not preceded by a <backslash>" to "that is unescaped".
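One way to picture "unescaped and not part of a bracket expression" is a scanner that skips escaped characters and well-formed bracket expressions, then reports the first `[` that never finds its closing `]`. The Python sketch below takes that approach under stated assumptions: the names `skip_bracket` and `stray_left_bracket` are hypothetical, and the bracket rules modelled are only the leading `^`, the literal leading `]`, and the `[. .]`, `[: :]`, `[= =]` constructs.

```python
def skip_bracket(pattern, i):
    """Given pattern[i] == '[', return the index just past the closing ']',
    or -1 if the bracket expression is never closed."""
    j = i + 1
    if j < len(pattern) and pattern[j] == "^":   # non-matching list marker
        j += 1
    if j < len(pattern) and pattern[j] == "]":   # a leading ']' is literal
        j += 1
    while j < len(pattern):
        if pattern[j] == "[" and pattern[j + 1:j + 2] in (".", ":", "="):
            close = pattern.find(pattern[j + 1] + "]", j + 2)
            if close == -1:
                return -1
            j = close + 2
        elif pattern[j] == "]":
            return j + 1
        else:
            j += 1
    return -1


def stray_left_bracket(regex):
    """Return the index of the first unescaped '[' that does not begin a
    complete bracket expression, or None if there is none."""
    i = 0
    while i < len(regex):
        if regex[i] == "\\":
            i += 2              # skip the escaped character
        elif regex[i] == "[":
            end = skip_bracket(regex, i)
            if end == -1:
                return i
            i = end
        else:
            i += 1
    return None


print(stray_left_bracket("["))       # 0
print(stray_left_bracket(r"a\[bc"))  # None
print(stray_left_bracket("[]]"))     # None
```

A result of `None` only means this particular construct is absent; the pattern may still be undefined or invalid for other reasons.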

Date Modified	Username	Field	Change
2012-04-10 20:04	dubiousjim	New Issue
2012-04-10 20:04	dubiousjim	Status	New => Under Review
2012-04-10 20:04	dubiousjim	Assigned To	=> ajosey
2012-04-10 20:04	dubiousjim	Name	=> Jim Pryor
2012-04-10 20:04	dubiousjim	Section	=> XBD Chapter 9
2012-04-10 20:04	dubiousjim	Page Number	=> 181-191
2012-04-10 20:04	dubiousjim	Line Number	=> see below
2012-04-10 20:18	dubiousjim	Note Added: 0001194
2012-04-12 15:35	msbrown	Project	2008-TC1 => 1003.1(2008)/Issue 7
2012-04-12 15:36	nick	Note Added: 0001197
2012-04-12 15:56	dubiousjim	Note Added: 0001199
2012-07-12 16:24	msbrown	Interp Status	=> ---
2012-07-12 16:24	msbrown	Description Updated
2012-07-12 16:24	msbrown	Desired Action Updated
2012-07-12 16:25	msbrown	Note Added: 0001305
2012-07-12 16:26	msbrown	Description Updated
2012-07-19 15:20	nick	Note Added: 0001308
2012-07-19 15:21	nick	Note Edited: 0001308
2012-07-19 15:29	nick	Note Added: 0001309
2012-07-19 15:33	nick	Note Edited: 0001308
2012-07-19 15:34	nick	Note Edited: 0001309
2012-07-19 15:44	nick	Note Edited: 0001309
2012-07-19 15:46	nick	Note Edited: 0001309
2012-07-19 15:53	nick	Note Edited: 0001309
2012-07-19 15:59	nick	Note Added: 0001310
2012-07-19 16:13	nick	Note Added: 0001311
2012-07-19 16:18	nick	Note Edited: 0001311
2012-07-19 16:22	nick	Note Edited: 0001311
2012-07-19 16:35	eblake	Note Added: 0001312
2012-07-19 17:11	eblake	Note Added: 0001313
2012-07-19 17:11	eblake	Note Edited: 0001312
2012-07-19 17:11	eblake	Note Edited: 0001313
2012-07-20 15:20	eblake	Relationship added	related to 0000595
2012-07-26 15:09	nick	Note Edited: 0001310
2012-07-26 15:19	eblake	Note Added: 0001314
2012-07-26 15:19	nick	Note Edited: 0001310
2012-07-26 15:20	nick	Note Edited: 0001313
2012-07-26 15:20	eblake	Note Edited: 0001314
2012-07-26 15:37	nick	Note Edited: 0001310
2012-07-26 15:40	nick	Note Edited: 0001311
2012-07-26 15:41	nick	Note Edited: 0001312
2012-07-26 15:44	nick	Note Edited: 0001309
2012-07-26 15:44	nick	Note Edited: 0001314
2012-07-26 15:49	nick	Note Added: 0001315
2012-07-26 15:51	nick	Note Edited: 0001315
2012-07-26 15:53	nick	Final Accepted Text	=> See 0000554:0001308, 0000554:0001309, 0000554:0001310, 0000554:0001311, 0000554:0001315
2012-07-26 15:53	nick	Status	Under Review => Resolved
2012-07-26 15:53	nick	Resolution	Open => Accepted As Marked
2012-07-26 15:54	nick	Note Edited: 0001315
2012-07-26 15:54	nick	Tag Attached: tc2-2008
2012-08-02 15:37	geoffclare	Note Edited: 0001309
2012-08-02 15:39	geoffclare	Note Edited: 0001315
2019-06-10 08:55	agadmin	Status	Resolved => Closed
