Viewing Issue Simple Details
ID | Category | Severity | Type | Date Submitted | Last Update | ||
0000554 | [1003.1(2008)/Issue 7] Base Definitions and Headers | Editorial | Clarification Requested | 2012-04-10 20:04 | 2019-06-10 08:55 | ||
Reporter | dubiousjim | View Status | public | ||||
Assigned To | ajosey | ||||||
Priority | normal | Resolution | Accepted As Marked | ||||
Status | Closed | ||||||
Name | Jim Pryor | ||||||
Organization | |||||||
User Reference | |||||||
Section | XBD Chapter 9 | ||||||
Page Number | 181-191 | ||||||
Line Number | see below | ||||||
Interp Status | --- | ||||||
Final Accepted Text | See Note: 0001308, Note: 0001309, Note: 0001310, Note: 0001311, Note: 0001315 | ||||||
Summary | 0000554: Five(?) clarifications of the informal regex descriptions | ||||||
Description |
[Edited to correct page/line numbers to the 2008 base document -- Mark Brown]

I hope not to be wasting anyone's time with this. I realize this Draft is well advanced and many eyes have already looked at this material; however, I've recently noticed some unclarities in the informal summaries of BRE and ERE grammar that might arguably benefit from being tightened up.

(1) p. 182, lines 5842-5848, of POSIX.1-2008 say:

| In the regular expression processing
| described in POSIX.1-2008, the <newline> is regarded as an ordinary character and both a
| <period> and a non-matching list can match one. The Shell and Utilities volume of
| POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing
| regular expressions whether they permit matching of <newline> characters; if not stated
| otherwise, the use of literal <newline> characters or any escape sequence equivalent produces
| undefined results.

One awkwardness here is that the first sentence can be read as defining the default behavior, but the last clause of the second sentence says the behavior when not explicitly stated is undefined. Is the intent that the first sentence *not* be stating default behavior, but only what's assumed in examples in the standard? (Are there any examples where this is relied on?) Or is the intent rather that the first sentence is stating default behavior outside of the Shell and Utilities, and the second sentence is meant to be understood as restricted to that context? (In that case, I'd recommend making the restriction explicit in the second clause.)

A second awkwardness here is that the second clause of the second sentence leaves it ambiguous whether it's the use of newlines in *text being matched* which is undefined, or the use of newlines in *patterns*, or both.

(2) p. 184, line 5894, and p. 189, line 6106 of POSIX.1-2008 say that "^" is a special character when it occurs in [^...] position.

It's unclear what the point of designating it as "special" in that position is: "-" and "]" aren't designated as "special" when they occur in bracket constructions (nor are "." or ":" or "=", though the sequences "[." and so on are called "special"). The role of "special" in the grammar seems motivated by two things: (i) sometimes, to indicate that the character in that position isn't representing itself; and (ii) to mark those characters the \-escaping of which has a special definition. (i) is not uniformly applied, but perhaps it is what motivates calling the circumflex in "[^...]" "special." If we understand (ii) to be the criterion for "specialness", then presumably we shouldn't count *any* characters (or character-sequences) inside brackets as "special". "[\^...]" doesn't escape the "^"; it's just a bracket expression with a non-special occurrence of "\" followed by a non-special occurrence of "^".

(3) p. 188, lines 6081-6083, of POSIX.1-2008 say:

| An ordinary character is an ERE that matches itself. An ordinary character is any character in the
| supported character set, except for the ERE special characters listed in Section 9.4.3. The
| interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

This seems to say that the interpretation of "a" in "[\a]" is undefined. Contrast p. 183, lines 5873-5879, which explicitly exclude occurrences of these characters inside brackets:

| An ordinary character is a BRE that matches itself: any character in the supported character set,
| except for the BRE special characters listed in Section 9.3.3.
| The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined,
| except for: ...
| -- A character inside a bracket expression

(4) p. 188, lines 6095-6099, of POSIX.1-2008 say:

| *+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except
| when used in a bracket expression (see Section 9.3.5, on page 184). Any of the following uses produce undefined results:
| -- If these characters appear first in an ERE, or immediately following a <vertical-line>,
| <circumflex>, or <left-parenthesis>

This seems to say that "*" is not special in the pattern "a^*"; and indeed, in EREs (as opposed to BREs), "^" is anchoring even when it occurs in the middle of the pattern. So we understand "a^*" to be a valid pattern of a literal "a", an anchor, and a literal "*": a valid pattern, but one that will never match any text. It's curious that nothing is said about the pattern "$*z". If you want to require implementations to accept "a^*" as valid, instead of an error, why wouldn't you want to do so for "$*z" too? Or if you instead want to permit implementations latitude to reject the latter as an error, why wouldn't you want to do so for the former too?

(5) p. 183, lines 5886-5888, of POSIX.1-2008 say:

| An expression containing
| a '[' that is not preceded by a <backslash> and is not part of a bracket expression
| produces undefined results.

This is in the description of "BRE Special Characters". There is no analogous statement for an unescaped, unmatched '[' in EREs. The implementations I checked (FreeBSD egrep, GNU egrep, BusyBox egrep) all rejected the pattern "[" as an error. |
Desired Action |
[Edited to correct page/line numbers to the 2008 base document -- Mark Brown]

(1) I don't understand what is intended by seeming to specify the default interpretation and then saying the default is undefined, so I can't offer a recommendation there. As to whether it was newlines in patterns or newlines in matched text that was meant, I recommend just adding one of the phrases "in patterns", "in matched text", or "in either patterns or matched text" at the ___ indicated (p. 182, lines 5846-5850):

| if not stated
| otherwise, the use of literal <newline> characters or any escape sequence equivalent ___ produces
| undefined results.

(2) Decide whether the criterion for being a "special" occurrence of a character is (a) not representing itself, or (b) the way it interacts with "\". If (b), then the circumflex in "[^...]" is not "special", and neither are sequences like "[:". If (a), then various other characters also merit being called "special" (at least "-" and "]" when they occur after "["); but this will take some more careful review of the Chapter and is arguably not worth the time and effort.

(3) Change p. 188, lines 6082-6083 from:

| The
| interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

to:

| The
| interpretation of an ordinary character preceded by a <backslash> ('\\'), outside of a bracket expression, is undefined.

(4) On p. 188, change lines 6098-6099 from:

| -- If these characters appear first in an ERE, or immediately following a <vertical-line>,
| <circumflex>, or <left-parenthesis>

to:

| -- If these characters appear first in an ERE, or immediately following a <vertical-line>,
| <circumflex>, <dollar>, or <left-parenthesis>

This counts quantifier characters following anchors as everywhere valid (but interpreted literally); to count them as everywhere an error, instead omit the items "<circumflex>, <dollar>".

(5) On p. 188, add after line 6092: "A <left-square-bracket> that is not preceded by a <backslash> and is not part of a bracket expression also produces undefined results." |
Tags | tc2-2008 | ||||||
Attached Files | |||||||
|
Relationships | ||||||
|
Notes | |
(0001194) dubiousjim (reporter) 2012-04-10 20:18 |
Some further observations: Re item (2), note that in the formal grammars, no characters inside bracket constructions are counted as part of SPEC_CHAR. "^", "-", and "]" after a not-yet-closed "[" are instead all counted as META_CHAR. Re item (4), the "<vertical-line>", "<left-parenthesis>" and so on should be prefixed with "unescaped". |
(0001197) nick (manager) 2012-04-12 15:36 |
Need to update the page and line numbers to match base document, rather than the D4 merged document. |
(0001199) dubiousjim (reporter) 2012-04-12 15:56 |
I don't think I have access to a version of the base document with line numbers. I've registered at the OpenGroup site, with the Austin site (and was able to get the D4 merged document there), and at this BugTracker. But when I attempt to download the PDF from <https://www2.opengroup.org/ogsys/jsp/publications/PublicationDetails.jsp?catalogno=c082>, I'm told "You do not have the correct privileges to be able to view this item. You can try logging in as a different user using the links on the left." Will delete this note when resolved, not sure where else to respond. |
(0001305) msbrown (manager) 2012-07-12 16:25 |
Edited to correct page/line numbers in Description and Desired Action to the 2008 base document. |
(0001308) nick (manager) 2012-07-19 15:20 edited on: 2012-07-19 15:33 |
In response to item 1: Change p. 182, lines 5842-5848, of POSIX.1-2008 from:

In the regular expression processing described in POSIX.1-2008, the <newline> is regarded as an ordinary character and both a <period> and a non-matching list can match one. The Shell and Utilities volume of POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of <newline> characters; if not stated otherwise, the use of literal <newline> characters or any escape sequence equivalent produces undefined results.

to:

In the functions processing regular expressions described in the System Interfaces volume of POSIX.1-2008, the <newline> is regarded as an ordinary character and both a <period> and a non-matching list can match one. The Shell and Utilities volume of POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of <newline> characters; if not stated otherwise, the use of literal <newline> characters or any escape sequence equivalent in either patterns or matched text produces undefined results. |
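The behavior that the revised text assigns to the System Interfaces functions can be observed with regcomp() and regexec(). The following is a minimal sketch rather than anything proposed in this issue: the helper name re_matches is hypothetical, and the REG_NEWLINE flag comes from the regcomp() specification, not from the text above. Without REG_NEWLINE, a <period> and a non-matching list are each expected to match a <newline>; with it, neither should.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for `pattern`,
   0 if it does not, and -1 if regcomp() rejects the pattern. */
int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Default flags: <newline> is an ordinary character, so "." matches it. */
int period_matches_newline(void)
{
    return re_matches("a.b", "a\nb", 0);
}

/* Default flags: a non-matching list that does not name <newline>
   also matches it. */
int nonmatching_list_matches_newline(void)
{
    return re_matches("a[^x]b", "a\nb", 0);
}

/* With REG_NEWLINE, "." is no longer allowed to match <newline>. */
int period_matches_newline_with_flag(void)
{
    return re_matches("a.b", "a\nb", REG_NEWLINE);
}
```

This only covers the regcomp()/regexec() interface; whether a given utility permits matching <newline> is stated in that utility's own description.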
(0001309) nick (manager) 2012-07-19 15:29 edited on: 2012-08-02 15:37 |
In response to item 2: At both p. 184, lines 5894-5896 AND p. 189, lines 6106-6109 of POSIX.1-2008 change:

The <circumflex> shall be special when used as:
- An anchor (see Section XXX on page YYY)
- The first character of a bracket expression (see Section AAA on page BBB)

to:

The <circumflex> shall be special when used as an anchor (see Section XXX, on page YYY). The <circumflex> shall signify a non-matching list expression when it occurs first in a list, immediately following a <left-square-bracket> (see Section AAA on page BBB).

At p. 192, line 6233, delete the phrase "or when first in a bracket expression". |
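The two roles of the <circumflex> that this note separates can be illustrated with a few well-defined patterns. This is a sketch under assumptions: re_matches is a hypothetical helper, and the patterns are chosen only to show the anchor role, the non-matching list role, and an ordinary circumflex that is not first in a list.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for `pattern`,
   0 if it does not, and -1 if regcomp() rejects the pattern. */
int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* "^a": the circumflex is an anchor, so only a leading 'a' matches. */
int anchor_role(const char *text)
{
    return re_matches("^a", text, 0);
}

/* "[^a]": first in the list, the circumflex signifies a non-matching list. */
int nonmatching_list_role(const char *text)
{
    return re_matches("[^a]", text, 0);
}

/* "[a^]": not first in the list, the circumflex stands for itself. */
int ordinary_in_list(const char *text)
{
    return re_matches("[a^]", text, 0);
}
```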
(0001310) nick (manager) 2012-07-19 15:59 edited on: 2012-07-26 15:37 |
In response to item 3: Change p. 188, lines 6082-6083 from:

The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

to:

The interpretation of an ordinary character preceded by an unescaped <backslash> ('\\') is undefined, except in the context of a bracket expression (see XREF to ERE bracket expression).

Also for BRE (page 183, line 5875), change:

The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined, except for:

to:

The interpretation of an ordinary character preceded by an unescaped <backslash> ('\\') is undefined, except for:

Also, at page 195, line 6381, change:

ORD_CHAR preceded by a <backslash> character

to:

ORD_CHAR preceded by an unescaped <backslash> character |
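The two cases that the revised wording carves out can be checked through regcomp(): a <backslash> inside a bracket expression, and an escaped <backslash> followed by an ordinary character. This is a minimal sketch with a hypothetical helper name (ere_matches). Note the extra level of escaping needed in C string literals: the C string "[\\a]" is the ERE [\a], and "^\\\\a$" is the ERE ^\\a$.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for the ERE
   `pattern`, 0 if it does not, and -1 if regcomp() rejects the pattern. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* ERE [\a]: inside a bracket expression the backslash is an ordinary list
   member, so the expression matches either a backslash or an 'a'. */
int bracket_backslash_or_a(const char *text)
{
    return ere_matches("^[\\a]$", text);
}

/* ERE \\a: the first backslash escapes the second, so the 'a' that follows
   is not "preceded by an unescaped backslash"; the ERE matches the
   two-character text consisting of a backslash followed by 'a'. */
int escaped_backslash_then_a(const char *text)
{
    return ere_matches("^\\\\a$", text);
}
```

The ERE \a outside a bracket expression is deliberately not exercised here, since its interpretation is the case the text leaves undefined.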
(0001311) nick (manager) 2012-07-19 16:13 edited on: 2012-07-26 15:40 |
In response to item 4: On p. 188, change lines 6098-6099 from:

-- If these characters appear first in an ERE, or immediately following a <vertical-line>, <circumflex>, or <left-parenthesis>

to:

-- If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>

Also at page 195, lines 6379-6380, change:

The ERE grammar does not permit several constructs that previous sections specify as having undefined results:

to:

The ERE grammar does not permit several constructs that previous sections specify as having undefined results. Additionally, there are some constructs which the grammar permits but which still give undefined results:

At p. 195, line 6383, add '$' to the list of chars. |
(0001312) eblake (manager) 2012-07-19 16:35 edited on: 2012-07-26 15:41 |
In response to Note: 0001311 (item 4), we also have to worry about the grammar. Line 6361 states:

    ERE_expression : one_char_or_coll_elem_ERE
                   | '^'
                   | '$'
                   | '(' extended_reg_exp ')'
                   | ERE_expression ERE_dupl_symbol
                   ;

which means the grammar requires the parser to accept both 'a^*' and '$*z' as valid expressions. The existing text at line 6098 said that even though 'a^*' parses, it still produces undefined results; and our addition of <dollar-sign> means that '$*z' is now in the same category. But line 6382 claims that the ERE grammar does not even accept the use of ERE_dupl_symbol after '^'; either we change the grammar to make that statement true (in which case, we should also expand the statement to cover '$'), or we need to correct that statement to drop '^' (the grammar as-is forbids ERE_dupl_symbol after '|' or '(', but not after '^'). (Note: 0001311 edited in place) |
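Because these constructs give undefined results, the only portable thing a program can do with them is observe what a particular implementation chooses. The following sketch does that; the function names are hypothetical, and the output is intentionally not stated, since accepting or rejecting any of these patterns is permitted.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` as an ERE and return regcomp()'s
   result code (0 means this implementation accepted the pattern). */
int ere_compile_status(const char *pattern)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc == 0)
        regfree(&re);   /* only a successfully compiled regex is freed */
    return rc;
}

/* Report how this implementation treats duplication symbols that appear
   first in an ERE or right after '^', '$', '(' or '|'. The standard leaves
   every one of these undefined, so neither outcome is "correct". */
void report_undefined_eres(void)
{
    static const char *const patterns[] = { "a^*", "$*z", "*a", "(*a)", "a|*b" };
    size_t i;

    for (i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        int rc = ere_compile_status(patterns[i]);
        printf("%-6s -> %s\n", patterns[i], rc == 0 ? "accepted" : "rejected");
    }
}
```

Calling report_undefined_eres() on several systems is one way to see whether implementations actually diverge on 'a^*' and '$*z'.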
(0001313) eblake (manager) 2012-07-19 17:11 edited on: 2012-07-26 15:20 |
In response to Note: 0001310 (item 3), I think we need the phrase 'an unescaped <backslash>', since '\\a' is a well-defined regex. In that case, line 5875 of BRE Ordinary Characters needs a similar change. (note was edited in place to take this into account) |
(0001314) eblake (manager) 2012-07-26 15:19 edited on: 2012-07-26 15:44 |
In response to Note: 0001309 (item 2), we also need to fix the grammar at line 6232, to distinguish between when ^ is special (as an anchor) vs. an alternate interpretation but not special (leading a bracket expression). (Note: 0001309 edited in place) |
(0001315) nick (manager) 2012-07-26 15:49 edited on: 2012-08-02 15:39 |
In response to item 5, On p. 188, add after line 6092: "A <left-square-bracket> that is unescaped and is not part of a bracket expression also produces undefined results." Also at p 183 line 5887 change "that is not preceded by a <backslash>" to "that is unescaped" |
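For contrast with the undefined case, the two well-defined ways of matching a literal <left-square-bracket> are to escape it or to place it inside a bracket expression. This is a minimal sketch with a hypothetical helper (re_matches). A lone unescaped "[" is only probed, never relied upon, because its result is undefined (the reporter observed rejection in the egrep implementations tested).

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for `pattern`,
   0 if it does not, and -1 if regcomp() rejects the pattern. */
int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Escaped: the regex a\[b matches the text "a[b" as a BRE and as an ERE. */
int escaped_bracket(int cflags)
{
    return re_matches("a\\[b", "a[b", cflags);
}

/* Inside a bracket expression: "[[]" is a list whose only member is '['. */
int bracket_in_list(const char *text, int cflags)
{
    return re_matches("^[[]$", text, cflags);
}

/* Unescaped and unmatched: undefined results, so this merely reports what
   the implementation does (-1 rejected, 0 or 1 accepted). */
int probe_lone_bracket(int cflags)
{
    return re_matches("[", "[", cflags);
}
```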
Issue History | |||
Date Modified | Username | Field | Change |
2012-04-10 20:04 | dubiousjim | New Issue | |
2012-04-10 20:04 | dubiousjim | Status | New => Under Review |
2012-04-10 20:04 | dubiousjim | Assigned To | => ajosey |
2012-04-10 20:04 | dubiousjim | Name | => Jim Pryor |
2012-04-10 20:04 | dubiousjim | Section | => XBD Chapter 9 |
2012-04-10 20:04 | dubiousjim | Page Number | => 181-191 |
2012-04-10 20:04 | dubiousjim | Line Number | => see below |
2012-04-10 20:18 | dubiousjim | Note Added: 0001194 | |
2012-04-12 15:35 | msbrown | Project | 2008-TC1 => 1003.1(2008)/Issue 7 |
2012-04-12 15:36 | nick | Note Added: 0001197 | |
2012-04-12 15:56 | dubiousjim | Note Added: 0001199 | |
2012-07-12 16:24 | msbrown | Interp Status | => --- |
2012-07-12 16:24 | msbrown | Description Updated | |
2012-07-12 16:24 | msbrown | Desired Action Updated | |
2012-07-12 16:25 | msbrown | Note Added: 0001305 | |
2012-07-12 16:26 | msbrown | Description Updated | |
2012-07-19 15:20 | nick | Note Added: 0001308 | |
2012-07-19 15:21 | nick | Note Edited: 0001308 | |
2012-07-19 15:29 | nick | Note Added: 0001309 | |
2012-07-19 15:33 | nick | Note Edited: 0001308 | |
2012-07-19 15:34 | nick | Note Edited: 0001309 | |
2012-07-19 15:44 | nick | Note Edited: 0001309 | |
2012-07-19 15:46 | nick | Note Edited: 0001309 | |
2012-07-19 15:53 | nick | Note Edited: 0001309 | |
2012-07-19 15:59 | nick | Note Added: 0001310 | |
2012-07-19 16:13 | nick | Note Added: 0001311 | |
2012-07-19 16:18 | nick | Note Edited: 0001311 | |
2012-07-19 16:22 | nick | Note Edited: 0001311 | |
2012-07-19 16:35 | eblake | Note Added: 0001312 | |
2012-07-19 17:11 | eblake | Note Added: 0001313 | |
2012-07-19 17:11 | eblake | Note Edited: 0001312 | |
2012-07-19 17:11 | eblake | Note Edited: 0001313 | |
2012-07-20 15:20 | eblake | Relationship added | related to 0000595 |
2012-07-26 15:09 | nick | Note Edited: 0001310 | |
2012-07-26 15:19 | eblake | Note Added: 0001314 | |
2012-07-26 15:19 | nick | Note Edited: 0001310 | |
2012-07-26 15:20 | nick | Note Edited: 0001313 | |
2012-07-26 15:20 | eblake | Note Edited: 0001314 | |
2012-07-26 15:37 | nick | Note Edited: 0001310 | |
2012-07-26 15:40 | nick | Note Edited: 0001311 | |
2012-07-26 15:41 | nick | Note Edited: 0001312 | |
2012-07-26 15:44 | nick | Note Edited: 0001309 | |
2012-07-26 15:44 | nick | Note Edited: 0001314 | |
2012-07-26 15:49 | nick | Note Added: 0001315 | |
2012-07-26 15:51 | nick | Note Edited: 0001315 | |
2012-07-26 15:53 | nick | Final Accepted Text | => See Note: 0001308, Note: 0001309, Note: 0001310, Note: 0001311, Note: 0001315 |
2012-07-26 15:53 | nick | Status | Under Review => Resolved |
2012-07-26 15:53 | nick | Resolution | Open => Accepted As Marked |
2012-07-26 15:54 | nick | Note Edited: 0001315 | |
2012-07-26 15:54 | nick | Tag Attached: tc2-2008 | |
2012-08-02 15:37 | geoffclare | Note Edited: 0001309 | |
2012-08-02 15:39 | geoffclare | Note Edited: 0001315 | |
2019-06-10 08:55 | agadmin | Status | Resolved => Closed |