Austin Group Defect Tracker

ID 0001551
Category [Issue 8 drafts] Shell and Utilities
Severity Objection
Type Clarification Requested
Date Submitted 2022-01-14 05:39
Last Update 2022-01-18 21:18
Reporter calestyo
View Status public
Assigned To
Priority normal
Resolution Open
Status New
Product Version Draft 2.1
Name Christoph Anton Mitterer
Organization
User Reference
Section Utilities, sed
Page Number 3132, ff. (in the draft)
Line Number see below
Final Accepted Text
Summary 0001551: sed: ambiguities in how BREs/EREs are parsed/interpreted between delimiters (especially when these are special characters)
Description Hey.

First of all, I've already asked/reported all this on the mailing list:
"sed and delimiters that are also special characters to REs"
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33587&limit=100&offset=0&sid=
(unfortunately there seems to be no thread view)

So far, no one could really answer the core questions (or if I just didn't understand, then my apologies for writing again here).


I was looking into using BREs/EREs within delimiters, which as far as POSIX is concerned should be only sed, and in:
- context addresses (e.g. /RE/ or \xREx with x being another delimiter, of which the 1st needs to be quoted if not / )
- s-command


(I made another ticket (https://www.austingroupbugs.net/view.php?id=1550) with respect to clarifications/ambiguities about context addresses and delimiters, which may be somewhat related.)


This ticket covers presumed ambiguities in:
When BREs/EREs are used within delimiters...
AND
... the delimiter is a special character (or a character that would be special if quoted with a \).
(in the above 3 cases, though in my examples I use only the s-command).



As far as I can see, the documentation says with respect to the delimiters and their literal use in REs (or replacements) only:

[i] »If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the RE. For example, in the context address "\xabc\xdefx", the second x stands for itself, so that the RE is "abcxdef".«
(line 106088 et seq., in the draft)

[ii] »Within the RE and the replacement, the RE delimiter itself can be used
as a literal character if it is preceded by a <backslash>.«
(line 106204 et seq., in the draft)

[iii] for the s-command:
»Any character other than <backslash> or <newline> can be used instead of a <slash> to delimit the RE and the replacement. Within the RE and the replacement, the RE delimiter itself can be used as a literal character if it is preceded by a <backslash>.«
(line 106202 et seq., in the draft)



IMO, that leaves open a number of questions and ambiguities:



1) How are strings/commands with delimiters actually parsed (or split up)?

Consider the following example:
s(\\((X(


There are IMO at least two ways to parse that:

a) two stages
- 1st: splitting the command into the RE and replacement parts by going through the string and looking for any delimiter which is not immediately preceded by \ (the first such delimiter here is the 3rd ( ).
- 2nd: taking the two parts (RE and replacement) and unquoting any quoted delimiter \(
  RE-part = \\(
  replacement-part = X
  unquoted:
  RE-part = \( (here the \( became a ( with respect to the RE)
  replacement-part = X

now parse the RE \( as usual... assuming a BRE we'd end up with \( as the sequence starting a sub-pattern.

So effectively here we'd get:
s/\(/X/
=> would as such be an error, but there could of course have been an 'abc\)' in the RE, making it valid.


b) one stage
going from left to right applying the varying rules (for REs and delimiters), whichever comes first, resulting in:
s( ah, an s command with ( as delimiter
  \\ parser first sees these, makes them a literal \
    (( ah, the 2nd and 3rd delimiter
      X( flags to the s command
=> would likely be an error, given the unknown flags


I couldn't find any place, where it really says clearly (or unclearly) how the parsing is to be done.
Just because (b) seems the more logical way to do it, doesn't make it mandatory.

Especially, (a) doesn't seem to be ruled out, take [i], [ii], [iii] which all effectively say that if the delimiter is preceded by \ it's taken literally... that wording points IMO actually more towards (a), because (b) would require a wording, that says something like:
"if the delimiter is preceded by \ THAT BY ITSELF IS NOT ESCAPED"

And both ways (and there might be more crazy ways to do it ;-) )... produce quite different results.
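
To make the difference between (1a) and (1b) concrete, here is a minimal sketch in POSIX sh that splits an s-command under each reading. The function names and the RE=<...> output format are made up for illustration; nothing here is taken from POSIX or from any sed implementation. For brevity, the (1a) variant unquotes escaped delimiters in the same loop instead of in a separate second pass, which gives the same result.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Reading (1a): a delimiter ends a field unless the character immediately
# before it is a backslash; an escaped delimiter is then unquoted.
split_two_stage() {
    ts_rest=${1#s}                        # drop the command letter
    ts_d=${ts_rest%"${ts_rest#?}"}        # first character is the delimiter
    ts_rest=${ts_rest#?}
    ts_re= ts_repl= ts_field= ts_prev= ts_n=0
    while [ -n "$ts_rest" ] && [ "$ts_n" -lt 2 ]; do
        ts_c=${ts_rest%"${ts_rest#?}"}    # next character
        ts_rest=${ts_rest#?}
        if [ "x$ts_c" = "x$ts_d" ] && [ "x$ts_prev" = 'x\' ]; then
            ts_field=${ts_field%?}$ts_c   # \delim -> delim (unquote)
        elif [ "x$ts_c" = "x$ts_d" ]; then
            ts_n=$((ts_n + 1))
            if [ "$ts_n" -eq 1 ]; then ts_re=$ts_field; else ts_repl=$ts_field; fi
            ts_field=
        else
            ts_field=$ts_field$ts_c
        fi
        ts_prev=$ts_c
    done
    printf 'RE=<%s> REPL=<%s> FLAGS=<%s>\n' "$ts_re" "$ts_repl" "$ts_rest"
}

# Reading (1b): scan left to right; a backslash always consumes the next
# character as a pair, so "\\" is a literal backslash and cannot escape
# the delimiter that follows it.
split_one_stage() {
    os_rest=${1#s}
    os_d=${os_rest%"${os_rest#?}"}
    os_rest=${os_rest#?}
    os_re= os_repl= os_field= os_n=0
    while [ -n "$os_rest" ] && [ "$os_n" -lt 2 ]; do
        os_c=${os_rest%"${os_rest#?}"}
        os_rest=${os_rest#?}
        if [ "x$os_c" = 'x\' ] && [ -n "$os_rest" ]; then
            os_next=${os_rest%"${os_rest#?}"}
            os_rest=${os_rest#?}
            os_field=$os_field$os_c$os_next   # keep the pair as written
        elif [ "x$os_c" = "x$os_d" ]; then
            os_n=$((os_n + 1))
            if [ "$os_n" -eq 1 ]; then os_re=$os_field; else os_repl=$os_field; fi
            os_field=
        else
            os_field=$os_field$os_c
        fi
    done
    printf 'RE=<%s> REPL=<%s> FLAGS=<%s>\n' "$os_re" "$os_repl" "$os_rest"
}

split_two_stage 's(\\((X('    # prints: RE=<\(> REPL=<X> FLAGS=<>
split_one_stage 's(\\((X('    # prints: RE=<\\> REPL=<> FLAGS=<X(>
```

The two readings disagree on where the RE ends, which is the divergence described in (1a) and (1b) above.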




2) What if the delimiter is a special character (assuming BREs here).

[i], [ii], [iii] all effectively say, that if the delimiter is preceded by \ it's taken literally.


a) What does »literally« mean here?
- It's not taken as a delimiter, but "directly" used as RE?
So if one has:
s.\..X.
it would be used as:
s/./X/
(btw, this is what GNU sed does:
   $ printf '%s\n' '.' | sed 's.\..X.'
   X
   $ printf '%s\n' 'v' | sed 's.\..X.'
   X
)

or:

- It's not taken as a delimiter AND in RE context it would also be literal, even if it would normally be a special character:
So if one has:
s.\..X.
it would be used as:
s/\./X/
(btw, this is what BusyBox sed does:
   $ printf '%s\n' '.' | busybox sed 's.\..X.'
   X
   $ printf '%s\n' 'v' | busybox sed 's.\..X.'
   v
)

And again, as above, the standard says "if the delimiter is preceded by \" ... it does not say, that the \ is by itself NOT escaped, which links to (1), that is: how the parsing is actually done?!

=> The standard should clarify this ambiguity. That two widely used implementations (GNU vs. BusyBox) already behave differently shows that there is something fishy.
And if it's undefined, the standard should also say so (and probably illustrate it with an example) and warn against using any special characters as delimiters.
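
The two possible meanings of »literally« from (2a) can be written down as a small model. This is only a sketch in POSIX sh with a made-up function name. It labels the first reading UD (the backslash is dropped and the character keeps whatever meaning it has in a BRE) and the second TR (the character is forced to be an ordinary character). The list of characters treated as "special when alone" is an assumption that covers only BREs.

```shell
#!/bin/sh
# Hypothetical model: $1 = delimiter, $2 = raw RE text between the delimiters,
# $3 = UD or TR. Prints the BRE that the matcher would be handed.
re_seen_by_matcher() {
    m_d=$1 m_rest=$2 m_mode=$3 m_out=
    while [ -n "$m_rest" ]; do
        m_c=${m_rest%"${m_rest#?}"}           # next character
        m_rest=${m_rest#?}
        if [ "x$m_c" != 'x\' ] || [ -z "$m_rest" ]; then
            m_out=$m_out$m_c                  # ordinary character
            continue
        fi
        m_next=${m_rest%"${m_rest#?}"}        # character after the backslash
        m_rest=${m_rest#?}
        if [ "x$m_next" != "x$m_d" ]; then
            m_out=$m_out$m_c$m_next           # not the delimiter: leave untouched
        elif [ "$m_mode" = TR ]; then
            case $m_d in
                '.'|'['|'*'|'^'|'$') m_out=$m_out'\'$m_d ;;  # special alone: keep it escaped
                *)                   m_out=$m_out$m_d ;;     # already ordinary
            esac
        else
            m_out=$m_out$m_d                  # UD: just drop the backslash
        fi
    done
    printf '%s\n' "$m_out"
}

re_seen_by_matcher . '\.' UD    # prints: .
re_seen_by_matcher . '\.' TR    # prints: \.
```

Under UD the command s.\..X. behaves like s/./X/, under TR like s/\./X/, which corresponds to the GNU and BusyBox transcripts shown above.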


b) Depending on the answer to (2a), there is no mention of whether one can get back the special meaning or, respectively, the literal character.

In both cases, the question would arise:
Other than using a more sane delimiter ;-) ...

- »literally« means, it's no longer a delimiter, but other than that goes directly into the RE and may be special there:
then: s.\..X. would be effectively s/./X/
... can I get the literal . here and if so how?

- »literally« means, literal even with respect to the RE:
then: s.\..X. would be effectively s/\./X/
... can I get the special meaning . here and if so how?

=> even if it's simply not possible to get the other meaning (whichever is actually the "right" one), the standard should explicitly mention that.


c) (2a) and (2b) also affect characters that get their special meaning (in the RE) only when preceded by \ .

Consider:
s(\((X(

Unlike above in (1), where s(\\((X( was used, there is no parsing ambiguity here, and the command should effectively be the same as:
s/<something>/X/

Again, for <something> the question from (2a) comes up:
What does the RE see?
- ( (the literal ( )
(btw, this is what GNU sed does:
   $ printf '%s\n' '(' | sed 's(\((X('
   X
   $ printf '%s\n' 'v' | sed 's(\((X('
   v
)

or:
- \( (the sequence \( which starts a subpattern)
(I know the resulting RE would lack a closing '\)' and something within the
subexpression ... but that could be easily added.)
(btw, this is what BusyBox sed does:
   $ printf '%s\n' 'anything' | busybox sed 's(\((X('
   sed: bad regex '\(': Unmatched ( or \(
)


So as one can see, the same questions as in (2a) and (2b) pop up for such characters that get their special meaning only when preceded by \ .



d) I found that e.g. GNU sed (which, per (2a), uses the quoted delimiter that is a special character AS a special character in the RE) allows the following workaround (to get the literal character):
s.[.].X.
which seems then to be used (by GNU sed) as:
s/[.]/X/

But again, at least to me the standard seems to be ambiguous with
respect to how the original form should be parsed (see point (1) above).

While the bracket expression itself is defined to take the . inside
literally, POSIX nowhere seems to say that this is even to be seen as a
'.' for the RE and not as the 2nd delimiter.

Instead, [i], [ii] and [iii] rather seem to imply that because the (2nd) . is not preceded by \ it *IS* taken as a delimiter.

(GNU sed:
   $ printf '%s\n' '.' | sed 's.[.].X.'
   X
   $ printf '%s\n' 'v' | sed 's.[.].X.'
   v
)

(BusyBox sed also works like that:
   $ printf '%s\n' '.' | busybox sed 's.[.].X.'
   X
   $ printf '%s\n' 'v' | busybox sed 's.[.].X.'
   v
but since BusyBox sed anyway seems to treat the quoted delimiter \. as a literal . in the RE (unlike GNU sed):
   $ printf '%s\n' '.' | busybox sed 's.\..X.'
   X
   $ printf '%s\n' 'v' | busybox sed 's.\..X.'
   v
the "trick" is not really a workaround to get the "other meaning" (which here would be the special meaning of . ).
)
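
Whether s.[.].X. works at all depends on whether the scanner that looks for the next delimiter knows about bracket expressions. The following POSIX sh sketch contrasts a naive scanner with a bracket-aware one. The function name and output format are invented for this example, and the bracket handling is deliberately simplistic: it ends at the first ] and ignores backslashes, a leading ] or ^], and character classes such as [[:alpha:]].

```shell
#!/bin/sh
# Hypothetical helper: $1 = s-command, $2 = "aware" to treat a delimiter
# inside [...] in the RE as part of the bracket expression, anything else
# for a naive scan.
split_s() {
    b_rest=${1#s} b_aware=$2
    b_d=${b_rest%"${b_rest#?}"}           # first character is the delimiter
    b_rest=${b_rest#?}
    b_re= b_repl= b_field= b_n=0 b_in=0
    while [ -n "$b_rest" ] && [ "$b_n" -lt 2 ]; do
        b_c=${b_rest%"${b_rest#?}"}
        b_rest=${b_rest#?}
        if [ "$b_in" -eq 1 ]; then
            b_field=$b_field$b_c          # inside [...]: copy verbatim
            if [ "x$b_c" = 'x]' ]; then b_in=0; fi
        elif [ "$b_aware" = aware ] && [ "$b_n" -eq 0 ] && [ "x$b_c" = 'x[' ]; then
            b_field=$b_field$b_c          # bracket expression starts (RE only)
            b_in=1
        elif [ "x$b_c" = "x$b_d" ]; then
            b_n=$((b_n + 1))
            if [ "$b_n" -eq 1 ]; then b_re=$b_field; else b_repl=$b_field; fi
            b_field=
        else
            b_field=$b_field$b_c
        fi
    done
    printf 'RE=<%s> REPL=<%s> FLAGS=<%s>\n' "$b_re" "$b_repl" "$b_rest"
}

split_s 's.[.].X.' naive    # prints: RE=<[> REPL=<]> FLAGS=<X.>
split_s 's.[.].X.' aware    # prints: RE=<[.]> REPL=<X> FLAGS=<>
```

The naive scan matches the reading that [i], [ii] and [iii] seem to imply; the bracket-aware scan matches what both GNU sed and BusyBox sed appear to do in the transcripts above.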




3) Probably just a bug:

Not really an issue with POSIX, but just as an example how confusing things apparently are:


(GNU sed:
   $ printf '%s\n' '9+' | sed 's+9\++X+'
   X
   $ printf '%s\n' '99+' | sed 's+9\++X+'
   9X
   $ printf '%s\n' '999+' | sed 's+9\++X+'
   99X
)
These results are IMO fine, regardless of my other questions above.

In BREs, + alone is never special, and whether one parses all at once
from left to right (as in (1b))... or first looks for unquoted delimiter characters
and splits the command there (as in (1a))...
... the RE should always be effectively the string 9+ ... which is (in
BREs) the literal 9 followed by the literal + .


However...


(BusyBox sed:
   $ printf '%s\n' '9+' | busybox sed 's+9\++X+'
   X+
   $ printf '%s\n' '99+' | busybox sed 's+9\++X+'
   X+
   $ printf '%s\n' '999+' | busybox sed 's+9\++X+'
   X+
)
somehow does both:
- makes the \+ a non-delimiter
- transforms the (wrt BRE) non-special character + into a special one.

Which I think is generally (regardless of the interpretation or any
ambiguities in POSIX) a bug (I'll report it there).


All the above should also at least partially apply to context addresses.

I'd guess none of it applies to the y-command, though, but I haven't really looked at it.
Desired Action see above, clarify things and resolve ambiguities


Thanks,
Chris.
Tags No tags attached.
Attached Files summary-of-literal-behaviour-gnu-vs-busybox.txt (373 bytes) 2022-01-14 22:15

- Relationships
related to 0001550 (New): clarifications/ambiguities in the description of context addresses and their delimiters for sed

-  Notes
(0005602)
Don Cragun (manager)
2022-01-14 06:51

This was originally filed against the Issue 7 + TC2 project, but the page and line numbers are from Issue 8 draft 2.1. It has been moved to the Issue 8 project.
(0005607)
calestyo (reporter)
2022-01-14 15:52

Some additions, which I was pointed to by Oğuz in:
https://collaboration.opengroup.org/austin/plato/protected/mailarch.php?soph=N&action=show&archive=austin-group-l&num=33636&limit=100&offset=0&sid=



with respect to (2a) and (2b):

Not only special characters of the BRE/ERE are affected, but also special characters of the s-command's replacement (i.e. & ).

Consider:
s&x&_\&_&

which, depending on what "literal" means, could be either effectively:
s/x/_&_/
or:
s/x/_\&_/

(both GNU and BusyBox sed seem to do the latter
   $ printf '%s\n' 'x' | sed 's&x&_\&_&'
   _&_
   $ printf '%s\n' 'a' | sed 's&x&_\&_&'
   a
   $ printf '%s\n' 'x' | busybox sed 's&x&_\&_&'
   _&_
   $ printf '%s\n' 'a' | busybox sed 's&x&_\&_&'
   a
)




with respect to (2c):

And just as above, but this time for characters which get their special meaning only when preceded by a \ :

Consider:
s1\(x\)1\11

which, depending on what "literal" means, could be either effectively:
s/\(x\)/\1/
(BusyBox sed seems to do this:
   $ printf '%s\n' 'oxo' | busybox sed 's1\(x\)1\11'
   oxo
   $ printf '%s\n' 'owo' | busybox sed 's1\(x\)1\11'
   owo
)

or:

s/\(x\)/1/
(GNU sed seems to do this:
   $ printf '%s\n' 'oxo' | sed 's1\(x\)1\11'
   o1o
   $ printf '%s\n' 'owo' | sed 's1\(x\)1\11'
   owo
)
(0005610)
calestyo (reporter)
2022-01-14 20:36

with respect to (2c):

The ambiguity with respect to what "literal" means AND when a delimiter is used that gets its special meaning only when preceded by \ goes even a bit further.

In the note above, I gave the example of using one of the digits 1-9 in the s-command's replacement.

But for BREs only (unless there are implementations which provide this also for EREs) the same may even happen in the RE part.

Consider:
s1\(xo\)\11X1

which, depending on what "literal" means, could be either effectively:
s/\(xo\)\1/X/
(BusyBox sed seems to do this:
   $ printf '%s\n' 'xoxo' | busybox sed 's1\(xo\)\11X1'
   X
   $ printf '%s\n' 'xo1' | busybox sed 's1\(xo\)\11X1'
   xo1
)

or:

s/\(xo\)1/X/
(GNU sed seems to do this:
   $ printf '%s\n' 'xoxo' | sed 's1\(xo\)\11X1'
   xoxo
   $ printf '%s\n' 'xo1' | sed 's1\(xo\)\11X1'
   X
)
(0005611)
calestyo (reporter)
2022-01-14 21:40

with respect to (3):

For GNU sed: \w has a special meaning that extends POSIX:
»Matches any "word" character. A "word" character is any letter or digit or the underscore character.«

(similar to how it gives \+ a special meaning in BREs, see the example s+9\++X+ above)

Considering:
sw9\wwXw

(GNU sed seems to do this:
   $ printf '%s\n' '99' | sed 'sw9\wwXw'
   99
   $ printf '%s\n' '9w' | sed 'sw9\wwXw'
   X
)
so it effectively takes the \w ... makes it a non-delimiter (and removes the escaping), being just left with the character w which is by itself literal.


However...


(BusyBox sed:
   $ printf '%s\n' '9' | busybox sed 'sw9\wwXw'
   9
   $ printf '%s\n' '99' | busybox sed 'sw9\wwXw'
   X
   $ printf '%s\n' '999' | busybox sed 'sw9\wwXw'
   X9
)
somehow does both:
- makes the \w a non-delimiter
- but still sees \w with respect to the RE, and gives it special meaning.

Which I think is generally (regardless of the interpretation or any
ambiguities in POSIX) a bug (I'll report it there).



So why do I bring up the same example again, just with \w instead of \+ ?

Because that shows how problematic and dangerous this whole set of ambiguities is:

One might argue that people using weird characters (like . or ( ) as delimiters don't deserve any better when they get unexpected results.

But this shows that even a plain letter 'x', which never has any special meaning (with or without a preceding \ ), could cause such issues if the implementation chooses to give special meaning to \x and does the un-delimitering in a bad way.
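
Until this is clarified, the only approach that sidesteps the question entirely seems to be a delimiter that does not occur in the RE or the replacement at all, so that no \delimiter sequence is ever needed. A minimal POSIX sh sketch follows; the helper name and the candidate list are arbitrary choices for this example, not anything POSIX prescribes. The helper only deals with the delimiter; it does not escape &, \ or other special characters in the operands.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate delimiter that occurs in
# none of the arguments; return 1 if every candidate is taken.
pick_delim() {
    for pd_c in / , '#' % @ : =; do
        pd_ok=1
        for pd_s in "$@"; do
            case $pd_s in *"$pd_c"*) pd_ok=0 ;; esac   # candidate occurs in operand
        done
        if [ "$pd_ok" -eq 1 ]; then
            printf '%s\n' "$pd_c"
            return 0
        fi
    done
    return 1
}

re='a/b.c'
repl='x,y'
d=$(pick_delim "$re" "$repl") || exit 1
# / occurs in the RE and , in the replacement, so # is chosen here and
# the command that runs is: sed 's#a/b.c#x,y#'
printf '%s\n' 'a/b.c' | sed "s${d}${re}${d}${repl}${d}"
```

Since the chosen character never appears between the delimiters, it makes no difference which of the interpretations discussed in this ticket an implementation follows.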
(0005612)
calestyo (reporter)
2022-01-14 21:48
edited on: 2022-01-14 22:09

Just some summary on the different examples I gave so far in the notes above:


                GNU-sed         BusyBox-sed
s.\..X.         UD              TR
s(\((X(         UD[0a]          error+bug?[1a]
s.[.].X.        UD?[2a]         TR?[2b]
s+9\++X+        UD[0b]          bug?[1b]
s&x&_\&_&       TR              TR
s1\(x\)1\11     UD[0c]          bug?[1c]
s1\(xo\)\11X1   UD[0d]          bug?[1d]
sw9\wwXw        UD[0e]          bug?[1e]


Legend:
UD = un-delimitered, i.e. when the delimiter was . then a \. is seen by the RE as . WITH (if any!) special meaning kept
(un-delimitered meaning: the escaped delimiter character was made a non-delimiter and the escaping removed)

TR = truly literal, i.e. when the delimiter was . then a \. is seen by the RE as literal . with (if any!) special meaning REMOVED



[0]
a) I would count this as UD, because with GNU sed, the RE "sees" effectively ( ... that is the un-delimitered \( ... and that just also happens to be truly literal.
b) Same as (a) right above, GNU sed's RE "sees" effectively + ... that is the un-delimitered \+ ... and that just happens to be already literal
c) Same as (a)... just on the replacement side.... GNU sed's replacement "sees" effectively 1 ... that is the un-delimitered \1 .... and that is again already literal
d) Same as (c), just not on the replacement side, but on the RE side of a BRE (where \1 is backreference)
e) Same as (a), just with a letter character that should never be special by itself and usable as delimiter without any worries.


[1]
a) BusyBox sed's RE apparently sees here \( in the RE, which means that it *was* un-delimitered, but it's still kept as \( ... and BB sed complains about the closing \) ... IMO likely a bug.
b) Same as (a) right above, BB sed, seems to un-delimiter the \+ ... but still see \+ afterwards and uses it with special meaning. IMO a bug.
c) Same as (a) right above, BB sed makes the \1 a non-delimiter ... and the replacement still "sees" a \1 (i.e. the back-reference)
d) Same as (c), just not on the replacement side, but on the RE side of a BRE (where \1 is backreference)
e) Same as (a), just with a letter character that should never be special by itself and usable as a delimiter without any worries (but can't be, with busybox).


[2]
a) I'd also count that as UD, i.e. GNU sed sees the special character . just that it has a TR meaning within the bracket expression
b) For BusyBox sed I'd count it as TR ... i.e. it might see the already literal character . which is again literal within the bracket expression



Conclusion:
GNU sed:
- seems to mostly follow the philosophy that the quoted delimiter character is just "un-delimitered" (with the preceding \ removed) but then treated special if it would normally be special ... and not treated special if it would normally (that is without preceding \ ) not be special.
- BUT... there is at least one deviation from that philosophy: \& in the replacement when & is the delimiter

BusyBox sed:
- seems to follow the philosophy of doing both: un-delimitering AND making the result non-special
- EXCEPT for any cases where the character ALONE would not be special, but WITH a preceding \ it would be. There it un-delimiters (effectively NOT removing the preceding \ ) yet still keeps the special meaning.
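
To re-check the table against another sed implementation or version, a small probe script along these lines could be used. It is only a sketch: the SED environment variable and the output format are inventions of this example, and no expected results are encoded because they depend on the implementation under test.

```shell
#!/bin/sh
# Usage (hypothetical): SED='busybox sed' sh probe.sh
# SED is left unquoted below on purpose so that "busybox sed" splits
# into a command and its first argument.
SED=${SED:-sed}

# probe INPUT SCRIPT: run one sed script on one input line and report
# either its output or the error message.
probe() {
    if p_out=$(printf '%s\n' "$1" | $SED "$2" 2>&1); then
        printf '%s | %s -> %s\n' "$2" "$1" "$p_out"
    else
        printf '%s | %s -> error: %s\n' "$2" "$1" "$p_out"
    fi
}

# One distinguishing input per command from the summary table.
probe v    's.\..X.'
probe '('  's(\((X('
probe v    's.[.].X.'
probe 99+  's+9\++X+'
probe x    's&x&_\&_&'
probe oxo  's1\(x\)1\11'
probe xoxo 's1\(xo\)\11X1'
probe 9w   'sw9\wwXw'
```

Each output line shows the script, the input and what the sed under test produced, which can then be compared with the transcripts in the notes above.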

(0005613)
calestyo (reporter)
2022-01-14 22:17

sorry if you cannot properly read the table in the comment above... stupid Mantis folds multiple spaces and I had to use Unicode EN SPACE to get at least some readability.

I've attached the file summary-of-literal-behaviour-gnu-vs-busybox.txt which should render properly in any normal text editor.
(0005627)
calestyo (reporter)
2022-01-18 21:18

btw: in https://austingroupbugs.net/view.php?id=1556#c5626 I made a (hypothetical) example of what I personally would consider a proper definition of how things are to be parsed and interpreted.

- Issue History
Date Modified Username Field Change
2022-01-14 05:39 calestyo New Issue
2022-01-14 05:39 calestyo Name => Christoph Anton Mitterer
2022-01-14 05:39 calestyo Section => Utilities, sed
2022-01-14 05:39 calestyo Page Number => 3132, ff. (in the draft)
2022-01-14 05:39 calestyo Line Number => see below
2022-01-14 06:34 Don Cragun Relationship added related to 0001550
2022-01-14 06:48 Don Cragun Project 1003.1(2016/18)/Issue7+TC2 => Issue 8 drafts
2022-01-14 06:51 Don Cragun Note Added: 0005602
2022-01-14 06:51 Don Cragun version => Draft 2.1
2022-01-14 15:52 calestyo Note Added: 0005607
2022-01-14 20:36 calestyo Note Added: 0005610
2022-01-14 21:40 calestyo Note Added: 0005611
2022-01-14 21:48 calestyo Note Added: 0005612
2022-01-14 22:07 calestyo Note Edited: 0005612
2022-01-14 22:08 calestyo Note Edited: 0005612
2022-01-14 22:09 calestyo Note Edited: 0005612
2022-01-14 22:15 calestyo File Added: summary-of-literal-behaviour-gnu-vs-busybox.txt
2022-01-14 22:17 calestyo Note Added: 0005613
2022-01-18 21:18 calestyo Note Added: 0005627

