Austin Group Defect Tracker

Aardvark Mark IV


ID 0001546
Category [1003.1(2016/18)/Issue7+TC2] Base Definitions and Headers
Severity Editorial
Type Enhancement Request
Date Submitted 2022-01-08 03:48
Last Update 2024-06-11 09:07
Reporter calestyo
View Status public
Assigned To
Priority normal
Resolution Accepted As Marked
Status Closed
Name Christoph Anton Mitterer
Organization
User Reference
Section 9.3 Basic Regular Expressions
Page Number N/A
Line Number N/A
Interp Status ---
Final Accepted Text Note: 0005755
Summary 0001546: BREs: reserve \? \+ and \|
Description At least two implementations use \? \+ and \| as special characters, namely GNU's sed and grep as well as BusyBox's sed and grep.
They are therefore in use on countless systems.

There, \? \+ and \| act in BREs like their unquoted ERE counterparts (? + and |), bringing the same functionality to BREs.
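
To make the correspondence concrete, here is a minimal Python sketch that rewrites a GNU-style BRE into the ERE-like syntax of Python's `re` module by swapping which of ? + | ( ) { } need a backslash. The name `gnu_bre_to_python` is made up for illustration. The sketch assumes the GNU semantics described above and ignores BRE corner cases such as a leading `*`, backslashes inside bracket expressions, and POSIX character classes.

```python
import re

# Characters whose special/literal roles are swapped between a GNU-style BRE
# and an ERE-like syntax such as Python's `re`.
SWAPPED = set("?+|(){}")


def gnu_bre_to_python(bre: str) -> str:
    """Translate a GNU-style BRE into a Python `re` pattern (simplified)."""
    out = []
    i = 0
    while i < len(bre):
        c = bre[i]
        if c == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \? \+ \| \( \) \{ \} are special in the BRE: emit them bare.
            out.append(nxt if nxt in SWAPPED else "\\" + nxt)
            i += 2
        elif c == "[":
            # Copy a bracket expression verbatim (assumption: no classes).
            j = i + 1
            if j < len(bre) and bre[j] == "^":
                j += 1
            if j < len(bre) and bre[j] == "]":
                j += 1
            while j < len(bre) and bre[j] != "]":
                j += 1
            out.append(bre[i:j + 1])
            i = j + 1
        elif c in SWAPPED:
            # Bare ? + | ( ) { } are ordinary in a BRE: escape them.
            out.append("\\" + c)
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(gnu_bre_to_python(r"a\+b\|c\?"))
```

With this mapping, the BRE `a\+b\|c\?` and the ERE `a+b|c?` describe the same language, which is the behaviour the request wants to reserve.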

POSIX, as of now (https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html), says:
The interpretation of an ordinary character preceded by an unescaped <backslash> ( '\\' ) is undefined, except for:
- The characters ')', '(', '{', and '}'
- The digits 1 to 9 inclusive (see BREs Matching Multiple Characters)
- A character inside a bracket expression

so a conforming implementation either didn't support \? \+ and \| at all, or already defined its own semantics, as e.g. GNU did.
Desired Action I would propose that POSIX doesn't standardise \? \+ and \| ... but reserves these for that purpose.

I.e. not adding the functionality used by GNU as a requirement for conforming implementations, but mandating that *if* an implementation chooses to use \? \+ and \| in some way, they have to be the counterparts of ERE's ? + and | .


The actual effect of this on the world would be small. Implementations that don't already support it wouldn't need to add support for it.
But it would prevent any conforming implementation from using these as special characters for other purposes.


GNU (and I guess others) support further such uses beyond POSIX, e.g.:
\b \B \< \> \w \W \s \S (GNU's sed and grep)
and:
\` \' (GNU's sed)

I personally would rather tend not to reserve those in POSIX.
POSIX isn't GNU, and the main reason why I propose the reservation of \? \+ and \| is that these are already special in EREs.
But the others from above aren't.
Tags issue8
Attached Files

- Relationships
has duplicate 0000773 Closed ajosey 1003.1(2008)/Issue 7 Summary: Add \+, \?, and \| to Basic Regular Expressions (BREs)

-  Notes
(0005636)
mirabilos (reporter)
2022-01-28 11:21

What’s the use?

If POSIX says they are unspecified, they are already reserved. If POSIX is going to use them, they’re (hopefully) looking at existing implementations.

By the way, the BSDs use [[:<:]] and [[:>:]] for GNU’s \< and \> which fit nicer into the framework. (mksh now supports them on nōn-BSDs as well.)
(0005639)
calestyo (reporter)
2022-01-28 23:10

As I've said above: "But it would prevent that any conforming implementation uses these as special characters for other purposes."

I don't see why unspecified should mean "reserved", neither reserved in the sense "something may not be used at all" nor in the sense "something may be used only for a certain purpose".


"they’re (hopefully) looking at existing implementations."
One should hope that, but there's absolutely no guarantee that this is done.
(0005731)
Don Cragun (manager)
2022-03-03 06:56
edited on: 2022-03-03 07:03

Re Note: 0005636:
I don't understand what you mean when you say that BSD's use of [[:<:]] and [[:>:]] fit into the framework.

In POSIX "[[:class:]]" is an RE that matches a single character in the character class named "class". And the only valid character class names are 1 to {CHARCLASS_NAME_MAX} bytes from the portable filename character set where the 1st byte is not a digit. Neither '<' nor '>' is in the portable filename character set.

(0005738)
geoffclare (manager)
2022-03-10 17:09

Page and line numbers are for Issue 8 draft 2.1

After P165 line 5720, add:

escape sequence
An escape sequence, when not in a bracket expression, is defined as the escape character <backslash> ('\\') followed by any single character.

After P167 line 5806, add:
  • The '?', '+', and '|' characters; it is implementation-defined whether "\?", "\+", and "\|" each match the literal character '?', '+', or '|', respectively, or behave as described for the ERE special characters '?', '+', and '|', respectively (see [xref to 9.4.3]).
(0005740)
calestyo (reporter)
2022-03-10 17:44

Sounds good to me.


Though for the first part (escape sequence) I would more clearly mention that "escape character <backslash>" means that it's a backslash that is not escaped itself.

While this may already be perfectly clear to anyone who is very deep into the language of the standard, it may not be that obvious to others...
(0005744)
calestyo (reporter)
2022-03-12 21:12
edited on: 2022-03-12 21:13

I've just noted David Wheeler's request #773.

Since this one here seems to have been accepted (?) in the sense of Geoff's wording above... it might be worth to at least *consider* whether one should go further, and simply implement 773?

- The above change would already break compatibility if anyone used "\?", "\+", and "\|" for any purpose other than those special meanings or literal value.

- Previously, "\?", "\+", and "\|" would have been undefined behavior.

Is it known whether any implementations ever used "\?", "\+", and "\|" in the sense of literal meaning (or at least allowed that)?
Because if there were none, one could also consider going straight down the road and defining them only to have the special meaning.

(0005748)
geoffclare (manager)
2022-03-14 10:08

Reopening because we forgot that the BRE grammar needs updating (David Wheeler proposed a change to the grammar in bug 0000773).

Re Note: 0005744 There are implementations where "\+", "\?" and "\|" match a literal '+', '?', or '|' (I tested Solaris and HP-UX; I would expect AIX is the same).

The people on the March 10 teleconference were not aware of any implementation which behaves differently than the two behaviours we decided to allow.
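
The two permitted behaviours can be told apart with one pattern and two inputs. The Python sketch below is only an illustration: `classify_backslash_plus` and `grep_matches` are hypothetical names, the probe covers only "\+", and the hook takes any caller-supplied matcher so the logic can be tried without access to Solaris or HP-UX.

```python
import subprocess
from typing import Callable


def classify_backslash_plus(matches: Callable[[str, str], bool]) -> str:
    """Classify how an implementation treats "\\+" in a BRE.

    `matches(bre, line)` is a caller-supplied hook that reports whether
    the BRE matches the given line.
    """
    pattern = r"^a\+$"
    as_literal = matches(pattern, "a+")  # true only if \+ is a literal '+'
    as_repeat = matches(pattern, "aa")   # true only if \+ means "one or more"
    if as_literal and not as_repeat:
        return "literal"
    if as_repeat and not as_literal:
        return "ere-like"
    return "other"


def grep_matches(bre: str, line: str) -> bool:
    """Hook that asks the system's grep (assumes grep is on PATH)."""
    result = subprocess.run(
        ["grep", "-q", "-e", bre], input=line + "\n", text=True
    )
    return result.returncode == 0
```

Calling `classify_backslash_plus(grep_matches)` would probe the local grep; the result depends on the implementation, which is exactly the point of the discussion above. A result of "other" would indicate a third behaviour that the proposed wording does not allow.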
(0005749)
calestyo (reporter)
2022-03-14 13:50

And I assume these refer to reasonably current versions of such implementations? I mean it would probably not make much sense to standardise both ways if one of them was only used 30 years ago.
(0005750)
geoffclare (manager)
2022-03-14 14:31

The Solaris I tested is the latest (11.4). The HP-UX is a little on the old side, but I believe it is very unlikely to have had any changes to fundamentals like REs since the version I have.
(0005755)
geoffclare (manager)
2022-03-18 09:32
edited on: 2022-04-07 15:08

New proposed resolution that adds the grammar changes and changes to the anchoring description in 9.3.8. I also took the opportunity to improve consistency between the BRE grammar and ERE grammar.

Page and line numbers are for Issue 8 draft 2.1

After page 165 line 5720 section 9.1, add:

escape sequence
An escape sequence is defined as the escape character followed by any single character, which is thereby ``escaped''. The escape character is a <backslash> that is neither in a bracket expression nor itself escaped.

After page 167 line 5806 section 9.3.2, add:
  • The '?', '+', and '|' characters; it is implementation-defined whether "\?", "\+", and "\|" each match the literal character '?', '+', or '|', respectively, or behave as described for the ERE special characters '?', '+', and '|', respectively (see [xref to 9.4.3]).

    Note: A future version of this standard may require "\?", "\+", and "\|" to behave as described for the ERE special characters '?', '+', and '|', respectively.

After page 168 line 5818 section 9.3.3, add a new list item:
Immediately following a "\|" escape sequence (after an initial '^', if any), if the implementation does not match the escape sequence "\|" to the literal character '|'.

On page 172 line 6015 section 9.3.8, change:
A <circumflex> ('^') shall be an anchor when used as the first character of an entire BRE. The implementation may treat the <circumflex> as an anchor when used as the first character of a subexpression.
to:
A <circumflex> ('^') shall be an anchor when used as the first character of an entire BRE and, if the implementation does not match the escape sequence "\|" to the literal character '|', when used immediately following a "\|" escape sequence that is not inside a subexpression. The implementation may also treat a <circumflex> as an anchor when used inside a subexpression; in this case it shall be an anchor only when either of the following is true:

  • It is the first character of the subexpression.

  • It immediately follows a "\|" escape sequence and the implementation does not match the escape sequence "\|" to the literal character '|'.

On page 172 line 6023 section 9.3.8, change:
A <dollar-sign> ('$') shall be an anchor when used as the last character of an entire BRE. The implementation may treat a <dollar-sign> as an anchor when used as the last character of a subexpression.
to:
A <dollar-sign> ('$') shall be an anchor when used as the last character of an entire BRE and, if the implementation does not match the escape sequence "\|" to the literal character '|', when used immediately preceding a "\|" escape sequence that is not inside a subexpression. The implementation may also treat a <dollar-sign> as an anchor when used inside a subexpression; in this case it shall be an anchor only when either of the following is true:

  • It is the last character of the subexpression.

  • It immediately precedes a "\|" escape sequence and the implementation does not match the escape sequence "\|" to the literal character '|'.

On page 176 line 6191 section 9.5.1, change:
The character '^' when it appears as the first character of a basic regular expression and when not QUOTED_CHAR.
to:
The character '^' when it appears either as the first character of a basic regular expression or, if the implementation does not match the escape sequence "\|" to the literal character '|', when used immediately following a "\|" escape sequence that is not inside a subexpression, and when not QUOTED_CHAR.

After page 176 line 6197 section 9.5.1, add:
On implementations where the escape sequences "\?", "\+", and "\|" match the literal characters '?', '+', and '|', respectively, QUOTED_CHAR shall also include:
\?    \+    \|

On page 177 line 6201 section 9.5.1, change:
The character '$' when it appears as the last character of a basic regular expression and when not QUOTED_CHAR.
to:
The character '$' when it appears either as the last character of a basic regular expression or, if the implementation does not match the escape sequence "\|" to the literal character '|', when used immediately preceding a "\|" escape sequence that is not inside a subexpression, and when not QUOTED_CHAR.

After page 177 line 6228 section 9.5.2, add:
/* The following shall be tokens on implementations where
   \?, \+, and \| are not included in QUOTED_CHAR */

%token Back_qm Back_plus Back_bar
/*       \?      \+        \|     */

On page 178 line 6244 section 9.5.2, change:
basic_reg_exp  :          RE_expression
               | L_ANCHOR
               |                        R_ANCHOR
               | L_ANCHOR               R_ANCHOR
               | L_ANCHOR RE_expression
               |          RE_expression R_ANCHOR
               | L_ANCHOR RE_expression R_ANCHOR
               ;
RE_expression  :               simple_RE
               | RE_expression simple_RE
               ;
simple_RE      : nondupl_RE
               | nondupl_RE RE_dupl_symbol
               ;
nondupl_RE     : one_char_or_coll_elem_RE
               | Back_open_paren RE_expression Back_close_paren
               | BACKREF
               ;
one_char_or_coll_elem_RE : ORD_CHAR
               | QUOTED_CHAR
               | '.'
               | bracket_expression
               ;
RE_dupl_symbol : '*'
               | Back_open_brace DUP_COUNT               Back_close_brace
               | Back_open_brace DUP_COUNT ','           Back_close_brace
               | Back_open_brace DUP_COUNT ',' DUP_COUNT Back_close_brace
               ;
to:
basic_reg_exp   :                        BRE_branch
                | basic_reg_exp Back_bar BRE_branch /* if Back_bar
                                                       is a token */
                ;
BRE_branch      :            BRE_expression
                | BRE_branch BRE_expression 
                ;
BRE_expression  :          simple_BRE
                | L_ANCHOR
                |                     R_ANCHOR
                | L_ANCHOR            R_ANCHOR
                | L_ANCHOR simple_BRE
                |          simple_BRE R_ANCHOR
                | L_ANCHOR simple_BRE R_ANCHOR
                ;
simple_BRE      : nondupl_BRE
                | nondupl_BRE BRE_dupl_symbol
                ;
nondupl_BRE     : one_char_or_coll_elem_BRE
                | Back_open_paren basic_reg_exp Back_close_paren
                | BACKREF
                ;
one_char_or_coll_elem_BRE  : ORD_CHAR
                | QUOTED_CHAR
                | '.'
                | bracket_expression
                ;
BRE_dupl_symbol : '*'
                | Back_qm   /* if Back_qm is a token */
                | Back_plus /* if Back_plus is a token */
                | Back_open_brace DUP_COUNT               Back_close_brace
                | Back_open_brace DUP_COUNT ','           Back_close_brace
                | Back_open_brace DUP_COUNT ',' DUP_COUNT Back_close_brace
                ;

On page 179 line 6313 section 9.5.2, change:
The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside "\(" and "\)" (which implies that '^' and '$' are ordinary characters). This reflects the semantic limits on the application, as noted in [xref to 9.3.8]. Implementations are permitted to extend the language to interpret '^' and '$' as anchors in these locations, and as such, conforming applications cannot use unescaped '^' and '$' in positions inside "\(" and "\)" that might be interpreted as anchors.
to:
Note that although the BRE grammar appears always to permit L_ANCHOR or R_ANCHOR inside "\(" and "\)", the lexical conventions (see [xref to 9.5.1]) imply that '^' and '$' may be ordinary characters there. This reflects the semantic limits on the application, as noted in [xref to 9.3.8]. Since it is an implementation option whether to interpret '^' and '$' as anchors in these locations, conforming applications cannot use unescaped '^' and '$' in positions inside "\(" and "\)" that might be interpreted as anchors.
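
As a reading aid for the revised anchoring text in 9.3.8, the following Python sketch lists the positions where '^' and '$' shall be anchors, i.e. the mandatory cases outside subexpressions. It is an interpretation of the wording above, not part of the resolution; `tokens`, `skip_bracket` and `required_anchors` are illustrative names, and the optional anchoring inside "\(" and "\)" is deliberately not modelled.

```python
def skip_bracket(p: str, i: int) -> int:
    """Return the index just past the bracket expression opening at p[i]."""
    j = i + 1
    if j < len(p) and p[j] == "^":
        j += 1
    if j < len(p) and p[j] == "]":
        j += 1  # a ']' in first position is an ordinary list member
    while j < len(p):
        if p[j] == "[" and j + 1 < len(p) and p[j + 1] in ".:=":
            end = p.find(p[j + 1] + "]", j + 2)  # skip [.x.] [:c:] [=x=]
            if end == -1:
                return len(p)
            j = end + 2
        elif p[j] == "]":
            return j + 1
        else:
            j += 1
    return len(p)


def tokens(bre: str):
    """Yield (start, text, depth); depth counts enclosing \\( ... \\) pairs."""
    i, depth = 0, 0
    while i < len(bre):
        if bre[i] == "\\" and i + 1 < len(bre):
            text = bre[i:i + 2]
            if text == r"\)":
                depth = max(depth - 1, 0)
            yield i, text, depth
            if text == r"\(":
                depth += 1
            i += 2
        elif bre[i] == "[":
            j = skip_bracket(bre, i)
            yield i, bre[i:j], depth
            i = j
        else:
            yield i, bre[i], depth
            i += 1


def required_anchors(bre: str, bar_is_alternation: bool = True):
    """Return [(index, char)] where '^' or '$' shall be an anchor."""
    toks = list(tokens(bre))

    def is_bar(k: int) -> bool:
        return bar_is_alternation and 0 <= k < len(toks) and toks[k][1] == r"\|"

    found = []
    for k, (start, text, depth) in enumerate(toks):
        if depth != 0:
            continue  # inside a subexpression anchoring is optional
        if text == "^" and (k == 0 or is_bar(k - 1)):
            found.append((start, "^"))
        elif text == "$" and (k == len(toks) - 1 or is_bar(k + 1)):
            found.append((start, "$"))
    return found
```

Passing `bar_is_alternation=False` models an implementation where "\|" matches a literal '|', in which case only the first and last characters of the entire BRE can be mandatory anchors.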


(0005765)
calestyo (reporter)
2022-03-25 22:41

off-topic: Did the Austin Group ever consider handling such changes via git? It would make following/reviewing changes *so* much easier than having to build up the final picture by fiddling around with pages, line numbers and what ought to be changed to what.


Regarding https://austingroupbugs.net/view.php?id=1546#c5755

Some comments from my side:
1) I'm still a bit unhappy with the definition of "escape sequence".
It does not directly include the information that the escape character \ itself must not be quoted by another '\'.
Even if you would have 2.2.1 Escape Character (Backslash) in mind:
- this seems more or less defined for chapter 2 - not generally
- and even if one assumes it generally, then 2.2.1 would say "A <backslash> that is not quoted" - where quoting however means *any* shell quotation style (\… '…' "…" and in the future $'…') - however for REs, only the \… style would apply.


2) Perhaps it would also be better to write instead of:
"An escape sequence, when not in a bracket expression, is defined as the escape character <backslash> ('\\') followed by any single character."

something like the following:

"An escape sequence is defined as the escape character <backslash> ('\\') followed by any single character, not within a bracket expression."

The problem with "An escape sequence, when not in a bracket expression" is IMO that this makes no sense, as there are never escape sequences within a bracket expression - except perhaps for the open question (I've raised in some other ticket) of whether an escaped sed delimiter (like in 's.a[\.]b.X.') inside a bracket expression is the undelimited special character (special '.' in the example), the undelimited literal character (literal '.' in the example) or the literal two characters ('\' and '.' in the example).


Apart from that,... the draft in 5755 looks good, though I haven't checked the grammar in detail.
(0005766)
calestyo (reporter)
2022-03-25 22:57

May I also add the following:

a) Not sure whether https://austingroupbugs.net/view.php?id=773 is really to be considered a duplicate of this.

This issue was merely about preventing "\?", "\+", and "\|" from getting a meaning completely different from the ERE one.

While #773 rather asked for defining them to ONLY have the special meaning (not either special or literal - as the above draft resolves it).




b) While the draft in 5755 prevents (a) from happening (e.g. \+ couldn't suddenly mean something completely different) I'm still a bit sceptical on whether that's the best way to go.

It merely records the current situation of implementations - but with that also statically codifies the two different interpretations (literal ? + | vs. special).

So we'd get non-portable behaviour even deeper into the standard than it was before, where from the standard's PoV any use of "\?", "\+", and "\|" was simply undefined.


I realise that it would be problematic to follow the spirit of #773 and define "\?", "\+", and "\|" to have ONLY the special meaning, when there are implementations (you've mentioned Solaris, HP-UX, AIX) that already behave differently.

OTOH, if an implementation assigns its own semantics to undefined parts of a standard, it must always assume that sooner or later it could get compatibility issues.

The question arises: Do Solaris/HP-UX/AIX really give "\?", "\+", and "\|" the literal meaning on purpose (or is it perhaps just an undocumented side-effect of the implementation),... is it even used there,... would they be willing to adapt?

Depending on the answers, it could make sense to follow a different path.


And even if that doesn't seem feasible right now, one could perhaps add something to "future directions", indicating that a future standard might require "\?", "\+", and "\|" to have ONLY the special meaning.

This would encourage any new implementations to follow that. And as long as any implementations with the literal meaning exist (possibly even forever),... the "future direction" would never need to become true, so no one should be sad.
(0005768)
geoffclare (manager)
2022-03-31 15:08

In the March 31st teleconference, the resolution in Note: 0005755 was updated to change the "escape sequence" definition and add the future directions note.
(0005770)
calestyo (reporter)
2022-04-01 22:22

"An escape sequence is defined as the escape character <backslash> ('\\'), when neither in a bracket expression nor itself escaped, followed by any single character."

Would do the job, I guess, but reads a bit unusual...?
The escape character <backslash> mustn't be in a bracket expression nor itself escaped... but it's rather the whole escape sequence that mustn't be in a BE?

What about:
"An escape sequence is defined as the escape character <backslash> ('\\') followed by any single character, when neither in a bracket expression nor when the escape character itself is escaped."


Other than that... seems good to me.

I like the future directions note. It gives POSIX the freedom to keep everything as it is (should that be necessary) or to unify the behaviour (should that become possible).
(0005773)
kre (reporter)
2022-04-02 08:17
edited on: 2022-04-02 09:52

Re the proposed text in Note: 0005770

No, absolutely not, that's worse than what is currently proposed, though
I agree that could do with a minor improvement.

You seem to be treating <backslash> and <escape character> as if they're
the same thing (because they are both written as a '\' perhaps).

They aren't. Escape chars are escape chars, and backslashes are backslashes,
they are not the same thing at all.

The common way to define this kind of thing is with a pair of mutually
referential definitions, something like

    An escape sequence is the escape character, followed by any single character.
    The escape character is a backslash which is not the second character of
    an escape sequence, and is not within a bracket expression.

Next, it is not that an escape sequence *mustn't* be in a bracket expression,
but that it cannot be, which is because no escape character can occur in a
bracket expression, and an escape sequence requires a leading escape character.
Nothing needs be said about that at all, this can be deduced from the definition.
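
The mutually referential definition can be read as a single left-to-right scan. The Python sketch below is one possible reading, not text from the standard: `escape_positions` and `skip_bracket` are illustrative names, and an unterminated bracket expression is simply assumed to run to the end of the pattern.

```python
def skip_bracket(p: str, i: int) -> int:
    """Return the index just past the bracket expression opening at p[i]."""
    j = i + 1
    if j < len(p) and p[j] == "^":
        j += 1
    if j < len(p) and p[j] == "]":
        j += 1  # a ']' in first position is an ordinary list member
    while j < len(p):
        if p[j] == "[" and j + 1 < len(p) and p[j + 1] in ".:=":
            # [.x.], [:class:] and [=x=] may themselves contain ']'
            end = p.find(p[j + 1] + "]", j + 2)
            if end == -1:
                return len(p)
            j = end + 2
        elif p[j] == "]":
            return j + 1
        else:
            j += 1
    return len(p)


def escape_positions(pattern: str) -> list[int]:
    """Indices of the backslashes that act as the escape character."""
    positions = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\":
            positions.append(i)
            i += 2  # the escaped character can never start a new sequence
        elif c == "[":
            i = skip_bracket(pattern, i)  # no escape character in brackets
        else:
            i += 1
    return positions
```

Skipping two characters after each escape character is what makes the definition "recursive": a backslash is an escape character exactly when the scan arrives at it, so no separate rule about preceding backslashes is needed.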

(0005776)
calestyo (reporter)
2022-04-02 16:06

Well, the difference between escape character and backslash is clear to me,... that's why I've asked for the further clarification (in Note 5765) which Geoff made then.

But I agree, that both, Geoff’s original proposal and my correction, still do it wrong and use the wording escape sequence/character for a place where it's already clear that it's actually not an escape sequence/character.


I like your proposed text in Note 5773.

Its "recursive" wording also solves the issue of e.g. " \\\n", where the 3rd \ would be preceded by another backslash (the 2nd),... but since that in turn is preceded by yet another backslash (the 1st, which has a preceding " ")... the 3rd \ is actually an escape character.


Geoff, if you agree, then please make a new post with your proposal from 5755 and with Robert's improvement.
(Updating the existing post makes it IMO much harder to understand the history and to review).

I'd do it myself... but I have no idea how to make the markup. O:-)
(0005779)
geoffclare (manager)
2022-04-04 08:46

Re: Note: 0005776 The wording in note 5755 was agreed in a teleconference, so I can't change it without getting agreement in another teleconference. Personally I don't see that there is a problem with the current wording. To me, saying the escape character is backslash just means that the character that is used for escaping is backslash. In situations where backslash loses its special meaning, it doesn't stop being the character that is used for escaping - it just can't perform that function in that situation.

When a resolution is very long we prefer to make small changes by editing in place rather than creating another bug note that is otherwise identical. Your point about understanding the history is the reason for adding a separate note (5768) explaining what was changed.
(0005781)
calestyo (reporter)
2022-04-05 01:48

@Geoff:

I think the main problem with the wording in (the current) note #5755:
"An escape sequence is defined as the escape character <backslash> ('\\'), when neither in a bracket expression nor itself escaped, followed by any single character."

is that it applies the "neither in a bracket expression" only to the escape character (the "followed by any single character" comes after that subclause).

And that simply feels a bit strange, like ruling out that '[\]n' is an escape sequence.
I mean sure it isn't - but what we really want to rule out is that '[\n]' is one.

Robert's wording seems to solve that in a nice manner.
(0005783)
geoffclare (manager)
2022-04-07 15:09

In the April 7th teleconference, the resolution in Note: 0005755 was updated to change the "escape sequence" definition (again).
(0005788)
calestyo (reporter)
2022-04-07 23:01

AFAICS, you just changed the definition of escape sequence... and there it seems you took the proposal from the mailing list with the addition of ", which is thereby ``escaped''." (which fits nicely, IMO), right?

From my side this issue would be fixed now,... let's set a timer for 10 years to re-visit the question whether the literal meaning choice can be skipped O;-)

Thanks for your efforts in this matter! :-)

- Issue History
Date Modified Username Field Change
2022-01-08 03:48 calestyo New Issue
2022-01-08 03:48 calestyo Name => Christoph Anton Mitterer
2022-01-08 03:48 calestyo Section => 9.3 Basic Regular Expressions
2022-01-08 03:48 calestyo Page Number => N/A
2022-01-08 03:48 calestyo Line Number => N/A
2022-01-28 11:21 mirabilos Note Added: 0005636
2022-01-28 23:10 calestyo Note Added: 0005639
2022-03-03 06:56 Don Cragun Note Added: 0005731
2022-03-03 07:03 Don Cragun Note Edited: 0005731
2022-03-10 17:09 geoffclare Note Added: 0005738
2022-03-10 17:09 geoffclare Interp Status => ---
2022-03-10 17:09 geoffclare Final Accepted Text => Note: 0005738
2022-03-10 17:09 geoffclare Status New => Resolved
2022-03-10 17:09 geoffclare Resolution Open => Accepted As Marked
2022-03-10 17:10 geoffclare Tag Attached: issue8
2022-03-10 17:44 calestyo Note Added: 0005740
2022-03-12 21:01 Don Cragun Relationship added related to 0000773
2022-03-12 21:12 calestyo Note Added: 0005744
2022-03-12 21:13 calestyo Note Edited: 0005744
2022-03-14 10:08 geoffclare Note Added: 0005748
2022-03-14 10:08 geoffclare Status Resolved => Under Review
2022-03-14 10:08 geoffclare Resolution Accepted As Marked => Reopened
2022-03-14 13:50 calestyo Note Added: 0005749
2022-03-14 14:31 geoffclare Note Added: 0005750
2022-03-18 09:32 geoffclare Note Added: 0005755
2022-03-24 15:29 geoffclare Final Accepted Text Note: 0005738 => Note: 0005755
2022-03-24 15:29 geoffclare Status Under Review => Resolved
2022-03-24 15:29 geoffclare Resolution Reopened => Accepted As Marked
2022-03-24 15:34 geoffclare Relationship replaced has duplicate 0000773
2022-03-25 22:41 calestyo Note Added: 0005765
2022-03-25 22:57 calestyo Note Added: 0005766
2022-03-31 15:06 geoffclare Note Edited: 0005755
2022-03-31 15:07 geoffclare Note Edited: 0005755
2022-03-31 15:08 geoffclare Note Added: 0005768
2022-04-01 22:22 calestyo Note Added: 0005770
2022-04-02 08:17 kre Note Added: 0005773
2022-04-02 08:19 kre Note Edited: 0005773
2022-04-02 08:22 kre Note Edited: 0005773
2022-04-02 08:23 kre Note Edited: 0005773
2022-04-02 09:52 kre Note Edited: 0005773
2022-04-02 16:06 calestyo Note Added: 0005776
2022-04-04 08:46 geoffclare Note Added: 0005779
2022-04-05 01:48 calestyo Note Added: 0005781
2022-04-07 15:08 geoffclare Note Edited: 0005755
2022-04-07 15:09 geoffclare Note Added: 0005783
2022-04-07 23:01 calestyo Note Added: 0005788
2022-05-19 09:15 geoffclare Status Resolved => Applied
2024-06-11 09:07 agadmin Status Applied => Closed

