Austin Group Defect Tracker

Aardvark Mark IV


ID Category Severity Type Date Submitted Last Update
0000554 [1003.1(2008)/Issue 7] Base Definitions and Headers Editorial Clarification Requested 2012-04-10 20:04 2019-06-10 08:55
Reporter dubiousjim View Status public  
Assigned To ajosey
Priority normal Resolution Accepted As Marked  
Status Closed  
Name Jim Pryor
Organization
User Reference
Section XBD Chapter 9
Page Number 181-191
Line Number see below
Interp Status ---
Final Accepted Text See Note: 0001308, Note: 0001309, Note: 0001310, Note: 0001311, Note: 0001315
Summary 0000554: Five(?) clarifications of the informal regex descriptions
Description [Edited to correct page/line numbers to the 2008 base document -- Mark Brown]

I hope not to be wasting anyone's time with this. I realize this Draft is well advanced and many eyes have already looked at this material; however I've recently noticed some unclarities in the informal summaries of BRE and ERE grammar, that might arguably benefit from being tightened up.

(1) p. 182, lines 5842-5848, of POSIX.1-2008 say:

| In the regular expression processing
| described in POSIX.1-2008, the <newline> is regarded as an ordinary character and both a
| <period> and a non-matching list can match one. The Shell and Utilities volume of
| POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing
| regular expressions whether they permit matching of <newline> characters; if not stated
| otherwise, the use of literal <newline> characters or any escape sequence equivalent produces
| undefined results.

One awkwardness here is that the first sentence can be read as defining the default behavior, but the last clause of the second sentence says the behavior when not explicitly stated is undefined. Is the intent that the first sentence *not* be stating default behavior, but only what's assumed in examples in the standard? (Are there any examples where this is relied on?) Or is the intent rather that the first sentence is stating default behavior outside of the Shell and Utilities, and the second sentence is meant to be understood as restricted to that context? (In that case, I'd recommend making the restriction explicit in the second clause.)

A second awkwardness here is that the second clause of the second sentence leaves it ambiguous whether it's the use of newlines in *text being matched* which is undefined, or the use of newlines in *patterns*, or both.


(2) p. 184, line 5894, and p. 189, line 6106 of POSIX.1-2008 say that "^" is a special character when it occurs in [^...] position. It's unclear what the point of designating it as "special" in that position is: "-" and "]" aren't designated as "special" when they occur in bracket constructions (nor are "." or ":" or "=", though the sequences "[." and so on are called "special"). The role of "special" in the grammar seems motivated by two things: (i) sometimes, to indicate the character isn't in that position representing itself; and (ii) to mark those characters the \-escaping of which has a special definition. (i) is not uniformly applied; but perhaps it is what motivates calling the circumflex in "[^...]" "special." If we understand (ii) to be the criterion for "specialness", then presumably we shouldn't count *any* characters (or character-sequences) inside brackets as "special". "[\^...]" doesn't escape the "^", it's just a bracket expression with a non-special occurrence of "\" followed by a non-special occurrence of "^".


(3) p. 188, lines 6081-6083, of POSIX.1-2008 say:

| An ordinary character is an ERE that matches itself. An ordinary character is any character in the
| supported character set, except for the ERE special characters listed in Section 9.4.3. The
| interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

This seems to say that the interpretation of "a" in "[\a]" is undefined. Contrast p. 183, lines 5873-5879, which explicitly exclude occurrences of these characters inside brackets:

| An ordinary character is a BRE that matches itself: any character in the supported character set,
| except for the BRE special characters listed in Section 9.3.3.
| The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined,
| except for:
...
| -- A character inside a bracket expression


(4) p. 188, lines 6095-6099, of POSIX.1-2008 say:

| *+?{
| The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except
| when used in a bracket expression (see Section 9.3.5, on page 184). Any of the
| following uses produce undefined results:
| -- If these characters appear first in an ERE, or immediately following a <vertical-
| line>, <circumflex>, or <left-parenthesis>

This seems to say that "*" is not special in the pattern "a^*"; and indeed, in EREs (as opposed to BREs), "^" is anchoring even when it occurs in the middle of the pattern. So we understand "a^*" to be a valid pattern of a literal "a", an anchor, and a literal "*"---a valid pattern but one that will never match any text. It's curious that nothing is said about the pattern "$*z". If you want to require implementations to accept "a^*" as valid, instead of an error, why wouldn't you want to do so for "$*z" too? Or if you instead want to permit implementations latitude to reject the latter as an error, why wouldn't you want to do so for the former too?

(5) p. 183, lines 5886-5888, of POSIX.1-2008 say:

| An expression containing
| a '[' that is not preceded by a <backslash> and is not part of a bracket expression
| produces undefined results.

This is in the description of "BRE Special Characters". There is no analogous statement for an unescaped, unmatched '[' in EREs. The implementations I checked (FreeBSD egrep, GNU egrep, BusyBox egrep) all rejected the pattern "[" as an error.
Desired Action [Edited to correct page/line numbers to the 2008 base document -- Mark Brown]

(1) I don't understand what is intended by seeming to specify the default interpretation, and then saying the default is undefined, so I can't offer a recommendation there.
    As to the issue of whether it was newlines in patterns, or newlines in matched text that was meant, I recommend just adding one of the phrases: "in patterns", "in matched text", or "in either patterns or matched text" to the ___ indicated:

| if not stated
| otherwise, the use of literal <newline> characters or any escape sequence equivalent ___ produces
| undefined results. (p. 182, lines 5846-5850)


(2) Decide whether the criterion for being a "special" occurrence of a character is (a) not representing yourself, or (b) the way you interact with "\". If (b), then the circumflex in "[^...]" is not "special"; and neither are sequences like "[:". If (a), then various other characters also merit being called "special" (at least "-" and "]" when they occur after "["); but this will take some more careful review of the Chapter and is arguably not worth the time and effort.

(3) Change p. 188, lines 6082-6083 from:

| The
| interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

to:

| The
| interpretation of an ordinary character preceded by a <backslash> ('\\'), outside of a bracket expression, is undefined.


(4) On p. 188, change lines 6098-6099 from:
| -- If these characters appear first in an ERE, or immediately following a <vertical-
| line>, <circumflex>, or <left-parenthesis>

to:

| -- If these characters appear first in an ERE, or immediately following a <vertical-
| line>, <circumflex>, <dollar>, or <left-parenthesis>

This counts quantifier characters following anchors as everywhere valid (but interpreted literally); to count them as everywhere an error, instead omit the items "<circumflex>, <dollar>".


(5) On p. 188, add after line 6092:

"A <left-square-bracket> that is not preceded by a <backslash> and is not part of a bracket expression also produces undefined results."
Tags tc2-2008
Attached Files

- Relationships
related to 0000595 Closed ajosey regular expressions do not operate on lines

-  Notes
(0001194)
dubiousjim (reporter)
2012-04-10 20:18

Some further observations:

Re item (2), note that in the formal grammars, no characters inside bracket constructions are counted as part of SPEC_CHAR. "^", "-", and "]" after a not-yet-closed "[" are instead all counted as META_CHAR.

Re item (4), the "<vertical-line>", "<left-parenthesis>" and so on should be prefixed with "unescaped".
(0001197)
nick (manager)
2012-04-12 15:36

Need to update the page and line numbers to match base document, rather than the D4 merged document.
(0001199)
dubiousjim (reporter)
2012-04-12 15:56

I don't think I have access to a version of the base document with line numbers. I've registered at the OpenGroup site, with the Austin site (and was able to get the D4 merged document there), and at this BugTracker. But when I attempt to download the PDF from <https://www2.opengroup.org/ogsys/jsp/publications/PublicationDetails.jsp?catalogno=c082>, I'm told "You do not have the correct privileges to be able to view this item. You can try logging in as a different user using the links on the left."

Will delete this note when resolved, not sure where else to respond.
(0001305)
msbrown (manager)
2012-07-12 16:25

Edited to correct page/line numbers in Description and Desired Action to the 2008 base document.
(0001308)
nick (manager)
2012-07-19 15:20
edited on: 2012-07-19 15:33

In response to item 1:


Change p. 182, lines 5842-5848, of POSIX.1-2008 from:

In the regular expression processing
described in POSIX.1-2008, the <newline> is regarded as an ordinary character and both a
<period> and a non-matching list can match one. The Shell and Utilities volume of
POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing
regular expressions whether they permit matching of <newline> characters; if not stated
otherwise, the use of literal <newline> characters or any escape sequence equivalent produces
undefined results.

to

In the functions processing regular expressions
described in the System Interfaces volume of
POSIX.1-2008, the <newline> is regarded as an ordinary character and both a
<period> and a non-matching list can match one. The Shell and Utilities volume of
POSIX.1-2008 specifies within the individual descriptions of those standard utilities employing
regular expressions whether they permit matching of <newline> characters; if not stated
otherwise, the use of literal <newline> characters or any escape sequence equivalent in either patterns or matched text produces undefined results.

(0001309)
nick (manager)
2012-07-19 15:29
edited on: 2012-08-02 15:37

In response to item 2:

At both
p. 184, line 5894-5896 AND
p. 189, line 6106-6109 of POSIX.1-2008

change

The <circumflex> shall be special when used as:
- An anchor (see Section XXX on page YYY)
- The first character of a bracket expression (see Section AAA on page BBB)

to

The <circumflex> shall be special when used as an anchor (see Section XXX, on page YYY). The <circumflex> shall signify a non-matching list expression when it occurs first in a list, immediately following a <left-square-bracket> (see Section AAA on page BBB).

At p 192 line 6233 delete the phrase "or when first in a bracket expression"
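
The distinction drawn in this note can be sketched in code. The following Python is only an illustration, not text from the standard: the names bracket_end and classify_circumflexes are hypothetical, the scanner assumes ERE rules (where an unescaped "^" outside brackets is always an anchor), and its bracket handling is a simplification.

```python
def bracket_end(pattern, start):
    """Return the index just past the bracket expression that opens at
    `start`, or -1 if it is never closed. Simplified sketch."""
    i = start + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1  # leading circumflex: non-matching list
    if i < len(pattern) and pattern[i] == "]":
        i += 1  # a ']' in first position is a list member
    while i < len(pattern):
        if pattern[i] == "[" and i + 1 < len(pattern) and pattern[i + 1] in ".:=":
            # Skip over [. .], [: :] and [= =] constructs.
            close = pattern.find(pattern[i + 1] + "]", i + 2)
            if close == -1:
                return -1
            i = close + 2
        elif pattern[i] == "]":
            return i + 1
        else:
            i += 1
    return -1


def classify_circumflexes(ere):
    """Return (index, role) for each unescaped '^' in an ERE."""
    roles = []
    i = 0
    while i < len(ere):
        ch = ere[i]
        if ch == "\\" and i + 1 < len(ere):
            i += 2  # escaped character outside brackets: not classified
            continue
        if ch == "[":
            end = bracket_end(ere, i)
            if end != -1:
                for j in range(i + 1, end - 1):
                    if ere[j] == "^":
                        # Only the first position selects a non-matching list.
                        first = j == i + 1
                        roles.append((j, "non-matching list" if first else "list member"))
                i = end
                continue
        if ch == "^":
            roles.append((i, "anchor"))
        i += 1
    return roles
```

Under these assumptions, classify_circumflexes("[^a]") reports the circumflex at index 1 as a non-matching list marker, while classify_circumflexes(r"[\^a]") reports the one at index 2 as an ordinary list member, matching the reporter's observation that a backslash does not escape anything inside brackets.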

(0001310)
nick (manager)
2012-07-19 15:59
edited on: 2012-07-26 15:37

In response to item 3:

 Change p. 188, lines 6082-6083 from:

The interpretation of an ordinary character preceded by a <backslash> ('\\') is undefined.

to:

The interpretation of an ordinary character preceded by an unescaped <backslash> ('\\') is undefined, except in the context of a bracket expression (see XREF to ERE bracket expression).

Also for BRE (page 183 line 5875), change:


The interpretation of an ordinary character preceded by a <backslash> (’\\’) is undefined, except for:


to


The interpretation of an ordinary character preceded by an unescaped <backslash> (’\\’) is undefined, except for:

Also, at page 195, line 6381, change:


ORD_CHAR preceded by a <backslash> character

to


ORD_CHAR preceded by an unescaped <backslash> character
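
As a rough illustration of the amended ERE wording, the Python sketch below flags each unescaped backslash that precedes an ordinary character outside a bracket expression. The names bracket_end, undefined_escapes and ERE_SPECIAL are hypothetical, and treating exactly the characters in ERE_SPECIAL as special everywhere is a simplification of Section 9.4.3, which makes some of them special only in certain contexts.

```python
# Characters treated as ERE special by this sketch (an assumption).
ERE_SPECIAL = set(".[\\()*+?{|^$")


def bracket_end(pattern, start):
    """Return the index just past the bracket expression that opens at
    `start`, or -1 if it is never closed. Simplified sketch."""
    i = start + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1  # leading circumflex: non-matching list
    if i < len(pattern) and pattern[i] == "]":
        i += 1  # a ']' in first position is a list member
    while i < len(pattern):
        if pattern[i] == "[" and i + 1 < len(pattern) and pattern[i + 1] in ".:=":
            close = pattern.find(pattern[i + 1] + "]", i + 2)
            if close == -1:
                return -1
            i = close + 2
        elif pattern[i] == "]":
            return i + 1
        else:
            i += 1
    return -1


def undefined_escapes(ere):
    """Return indexes of unescaped backslashes that precede an ordinary
    character outside any bracket expression."""
    hits = []
    i = 0
    while i < len(ere):
        ch = ere[i]
        if ch == "[":
            end = bracket_end(ere, i)
            if end != -1:
                i = end  # backslashes inside brackets are ordinary
                continue
        if ch == "\\" and i + 1 < len(ere):
            if ere[i + 1] not in ERE_SPECIAL:
                hits.append(i)
            i += 2  # the escaped character cannot itself start an escape
            continue
        i += 1
    return hits
```

With this sketch, undefined_escapes(r"\a") returns [0], whereas both r"\\a" (an escaped backslash followed by "a") and r"[\a]" (a backslash inside a bracket expression) return an empty list, which is the distinction the phrase "an unescaped <backslash>" is meant to capture.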

(0001311)
nick (manager)
2012-07-19 16:13
edited on: 2012-07-26 15:40

In response to item 4:

On p. 188, change lines 6098-6099 from:
 -- If these characters appear first in an ERE, or immediately following a <vertical-
 line>, <circumflex>, or <left-parenthesis>

to:

 -- If these characters appear first in an ERE, or immediately following an unescaped <vertical-
 line>, <circumflex>, <dollar-sign>, or <left-parenthesis>


Also at page 195, line 6379-6380, change


The ERE grammar does not permit several constructs that previous sections specify as having
undefined results:


to


The ERE grammar does not permit several constructs that previous sections specify as having
undefined results. Additionally, there are some constructs which the grammar permits but which still give undefined results:

At p 195 line 6383 add '$' to the list of chars.
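
The amended rule can be read as a simple scan over the pattern. The Python below is a minimal sketch of that reading, not an implementation of any regex library: bracket_end and undefined_quantifiers are hypothetical names, and the bracket handling is simplified. It reports quantifier characters that appear first in an ERE or immediately after an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>.

```python
QUANTIFIERS = "*+?{"
OPENERS = "|^$("


def bracket_end(pattern, start):
    """Return the index just past the bracket expression that opens at
    `start`, or -1 if it is never closed. Simplified sketch."""
    i = start + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1  # leading circumflex: non-matching list
    if i < len(pattern) and pattern[i] == "]":
        i += 1  # a ']' in first position is a list member
    while i < len(pattern):
        if pattern[i] == "[" and i + 1 < len(pattern) and pattern[i + 1] in ".:=":
            close = pattern.find(pattern[i + 1] + "]", i + 2)
            if close == -1:
                return -1
            i = close + 2
        elif pattern[i] == "]":
            return i + 1
        else:
            i += 1
    return -1


def undefined_quantifiers(ere):
    """Return indexes of quantifier characters in undefined positions."""
    hits = []
    after_opener = True  # the start of the ERE counts as such a position
    i = 0
    while i < len(ere):
        ch = ere[i]
        if ch == "\\" and i + 1 < len(ere):
            i += 2  # an escaped character is neither opener nor quantifier
            after_opener = False
            continue
        if ch == "[":
            end = bracket_end(ere, i)
            if end != -1:
                i = end  # quantifier characters are not special in brackets
                after_opener = False
                continue
        if ch in QUANTIFIERS and after_opener:
            hits.append(i)
        after_opener = ch in OPENERS
        i += 1
    return hits
```

Under these assumptions, both of the reporter's examples are flagged: undefined_quantifiers("a^*") returns [2] and undefined_quantifiers("$*z") returns [1], while an escaped opener such as in r"a\|*b" is not treated as one.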

(0001312)
eblake (manager)
2012-07-19 16:35
edited on: 2012-07-26 15:41

In response to Note: 0001311 (item 4), we also have to worry about the grammar. Line 6361 states:

ERE_expression     : one_char_or_coll_elem_ERE
                   | '^'
                   | '$'
                   | '(' extended_reg_exp ')'
                   | ERE_expression ERE_dupl_symbol
                   ;

which means the grammar requires the parser to accept both 'a^*' and '$*z' as valid expressions. The existing text at line 6098 said that even though 'a^*' parses, it still produces undefined results; and our addition of <dollar-sign> means that '$*z' is now in the same category. But line 6382 claims that the ERE grammar does not even accept the use of ERE_dupl_symbol after '^'; either we change the grammar to make that statement true (in which case, we should also expand the statement to cover '$'), or we need to correct that statement to drop '^' (the grammar as-is forbids ERE_dupl_symbol after '|' or '(', but not after '^').


(Note: 0001311 edited in place)

(0001313)
eblake (manager)
2012-07-19 17:11
edited on: 2012-07-26 15:20

In response to Note: 0001310 (item 3), I think we need the phrase 'an unescaped <backslash>', since '\\a' is a well-defined regex. In that case, line 5875 of BRE Ordinary Characters needs a similar change.

(note was edited in place to take this into account)

(0001314)
eblake (manager)
2012-07-26 15:19
edited on: 2012-07-26 15:44

In response to Note: 0001309 (item 2), we also need to fix the grammar at line 6232, to distinguish between when ^ is special (as an anchor) vs. an alternate interpretation but not special (leading a bracket expression).

(Note: 0001309 edited in place)

(0001315)
nick (manager)
2012-07-26 15:49
edited on: 2012-08-02 15:39

In response to item 5,
On p. 188, add after line 6092:

"A <left-square-bracket> that is unescaped and is not part of a bracket expression also produces undefined results."

Also at p 183 line 5887 change "that is not preceded by a <backslash>" to "that is unescaped"
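
A pattern checker could apply the new sentence roughly as follows. This Python is a hedged sketch only: bracket_end and stray_left_brackets are hypothetical names, and "part of a bracket expression" is approximated as "opens a bracket expression that is eventually closed", with simplified handling of the "[.", "[:" and "[=" constructs.

```python
def bracket_end(pattern, start):
    """Return the index just past the bracket expression that opens at
    `start`, or -1 if it is never closed. Simplified sketch."""
    i = start + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1  # leading circumflex: non-matching list
    if i < len(pattern) and pattern[i] == "]":
        i += 1  # a ']' in first position is a list member
    while i < len(pattern):
        if pattern[i] == "[" and i + 1 < len(pattern) and pattern[i + 1] in ".:=":
            close = pattern.find(pattern[i + 1] + "]", i + 2)
            if close == -1:
                return -1
            i = close + 2
        elif pattern[i] == "]":
            return i + 1
        else:
            i += 1
    return -1


def stray_left_brackets(pattern):
    """Return indexes of unescaped '[' that do not open a complete
    bracket expression."""
    hits = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            i += 2  # an escaped '[' is a literal, not a stray opener
            continue
        if ch == "[":
            end = bracket_end(pattern, i)
            if end == -1:
                hits.append(i)
                i += 1
            else:
                i = end
            continue
        i += 1
    return hits
```

For the reporter's test pattern, stray_left_brackets("[") returns [0]; r"\[" and "[abc]" return an empty list. Note that "[]" is also reported, because a ']' in first position is a list member and so the expression is never closed.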


- Issue History
Date Modified Username Field Change
2012-04-10 20:04 dubiousjim New Issue
2012-04-10 20:04 dubiousjim Status New => Under Review
2012-04-10 20:04 dubiousjim Assigned To => ajosey
2012-04-10 20:04 dubiousjim Name => Jim Pryor
2012-04-10 20:04 dubiousjim Section => XBD Chapter 9
2012-04-10 20:04 dubiousjim Page Number => 181-191
2012-04-10 20:04 dubiousjim Line Number => see below
2012-04-10 20:18 dubiousjim Note Added: 0001194
2012-04-12 15:35 msbrown Project 2008-TC1 => 1003.1(2008)/Issue 7
2012-04-12 15:36 nick Note Added: 0001197
2012-04-12 15:56 dubiousjim Note Added: 0001199
2012-07-12 16:24 msbrown Interp Status => ---
2012-07-12 16:24 msbrown Description Updated
2012-07-12 16:24 msbrown Desired Action Updated
2012-07-12 16:25 msbrown Note Added: 0001305
2012-07-12 16:26 msbrown Description Updated
2012-07-19 15:20 nick Note Added: 0001308
2012-07-19 15:21 nick Note Edited: 0001308
2012-07-19 15:29 nick Note Added: 0001309
2012-07-19 15:33 nick Note Edited: 0001308
2012-07-19 15:34 nick Note Edited: 0001309
2012-07-19 15:44 nick Note Edited: 0001309
2012-07-19 15:46 nick Note Edited: 0001309
2012-07-19 15:53 nick Note Edited: 0001309
2012-07-19 15:59 nick Note Added: 0001310
2012-07-19 16:13 nick Note Added: 0001311
2012-07-19 16:18 nick Note Edited: 0001311
2012-07-19 16:22 nick Note Edited: 0001311
2012-07-19 16:35 eblake Note Added: 0001312
2012-07-19 17:11 eblake Note Added: 0001313
2012-07-19 17:11 eblake Note Edited: 0001312
2012-07-19 17:11 eblake Note Edited: 0001313
2012-07-20 15:20 eblake Relationship added related to 0000595
2012-07-26 15:09 nick Note Edited: 0001310
2012-07-26 15:19 eblake Note Added: 0001314
2012-07-26 15:19 nick Note Edited: 0001310
2012-07-26 15:20 nick Note Edited: 0001313
2012-07-26 15:20 eblake Note Edited: 0001314
2012-07-26 15:37 nick Note Edited: 0001310
2012-07-26 15:40 nick Note Edited: 0001311
2012-07-26 15:41 nick Note Edited: 0001312
2012-07-26 15:44 nick Note Edited: 0001309
2012-07-26 15:44 nick Note Edited: 0001314
2012-07-26 15:49 nick Note Added: 0001315
2012-07-26 15:51 nick Note Edited: 0001315
2012-07-26 15:53 nick Final Accepted Text => See Note: 0001308, Note: 0001309, Note: 0001310, Note: 0001311, Note: 0001315
2012-07-26 15:53 nick Status Under Review => Resolved
2012-07-26 15:53 nick Resolution Open => Accepted As Marked
2012-07-26 15:54 nick Note Edited: 0001315
2012-07-26 15:54 nick Tag Attached: tc2-2008
2012-08-02 15:37 geoffclare Note Edited: 0001309
2012-08-02 15:39 geoffclare Note Edited: 0001315
2019-06-10 08:55 agadmin Status Resolved => Closed

