Austin Group Defect Tracker

Aardvark Mark III


ID 0001233
Category [1003.1(2016)/Issue7+TC2] Shell and Utilities
Severity Objection
Type Enhancement Request
Date Submitted 2019-03-08 17:32
Last Update 2019-03-11 07:53
Reporter stephane View Status public  
Assigned To
Priority normal Resolution Open  
Status New  
Name Stephane
Organization
User Reference
Section 9.3.5
Page Number 184
Line Number 6087
Interp Status ---
Final Accepted Text
Summary 0001233: backslash needs to be escaped portably inside bracket expressions
Description (line numbers are for ISBN#1-947754-05-08 (2018 edition, c181.pdf), not 1003.1(2016))

The spec says that backslash shall lose its special meaning within bracket expressions.

For fnmatch(), it says backslash escapes special characters, but it's not clear whether that is still the case inside bracket expressions.

For shell wildcards (in globs and in case statements), backslash is a quoting operator like '...' and "..." and quoting removes the special meaning of special characters (including the ^, ], - inside bracket expressions as recently clarified). The first paragraph of 2.13.1 also suggests backslash has a second meaning specific to pattern matching and separate from quoting (see bug:1190) though that doesn't match current practice in most shells. I'll cover that in a separate bug but that has some relevance here.

In practice backslash is still special inside bracket expressions in a number of utility implementations, either to remove the special meaning of ], ^, -, [:x:] or [.x.], or for ANSI C escapes like \n (whether that should happen in sed is not clear in the spec), \t..., or for extended operators like \d, \s as shorthands for [:digit:] or [:space:].

Most modern regular expressions (perl, python, tcl, php, .net...) treat backslash specially within bracket expressions. Not allowing it for POSIX RE locks them in the past and contributes to making them obsolete.

[\t] will not match on \ and t in:

- in shell globs, where \ is just quoting the t
- in fnmatch() and find -name/-path as backslash is replacement for shell quoting
- in awk in text ~ "[\t]" where that \t is expanded to a TAB inside double quotes
- in awk in text ~ /[\t]/ where either \t is expanded to TAB, or \t is meant to match a TAB
- in some awk implementations in P='[\t]' awk '$0 ~ ENVIRON["P"]' (as now required by POSIX)
- in GNU sed (unless POSIXLY_CORRECT is in the environment) or busybox sed (where it matches a TAB instead)
- in vim (vi/ex) when not in compatibility mode
- as mentioned above, modern regexps.

[\^a] doesn't match on \ in the above and also in

p='[\!a]'; case "\\" in $p) echo match; esac in bash and some ash-based shells (that's different in those ash-based shells when the pattern is used for globbing instead)
Desired Action Specify that within bracket expressions, backslash must be escaped or quoted for it to be matched literally. Application authors should write:

- ['\'] or [\\] in sh
- p='[\\]'; case $var in $p) ...;; esac in sh
- fnmatch(): find . -name '[\\]*' (already covered, but could be made less ambiguous)
- $0 ~ /[\\]/ in awk
- $0 ~ "[\\\\]" in awk
- BRE: grep '[\\]'
- ERE: grep -E '[\\]'

Unspecified otherwise (except for the other cases that are already specified to mean something special like awk's /[\t]/).
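
The two sh forms listed above can be sketched as follows. This is only an illustration, assuming a POSIX-style sh; the function names is_backslash and is_backslash_var are hypothetical and do not come from the report.

```shell
# Hypothetical helper: succeed if $1 is exactly one backslash.
# Inside the bracket expression the backslash is itself quoted with a
# backslash, so it is matched literally.
is_backslash() {
  case $1 in
    [\\]) return 0 ;;
    *)    return 1 ;;
  esac
}

# Same check with the pattern stored in a variable (p is a global here,
# for brevity). The unquoted $p is used as a pattern whose text is [\\].
is_backslash_var() {
  p='[\\]'
  case $1 in
    $p) return 0 ;;
    *)  return 1 ;;
  esac
}

is_backslash '\' && echo "literal pattern: match"
is_backslash_var '\' && echo "stored pattern: match"
```

Both forms should behave the same whether the shell treats the inner backslash as an escape or as a second literal backslash, which is why the doubled spelling is the one suggested for scripts.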

Also please clarify whether s/[\n]// matches a newline or not. If not, make it unspecified instead of requiring it to match on \ and n. In practice few implementations match a newline there, which is unfortunate: in those that don't, one can't do things like s/[^\n]/x/ other than with something convoluted like: y/\n./.\n/;s/[^.]/x/;y/\n./.\n/
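
For illustration, the y/s/y workaround quoted above can be exercised like this. It is a sketch that assumes a sed which accepts \n for newline in the y command; the function name first_non_newline_to_x and the leading N (used only to get a newline into the pattern space) are additions for the demo, not part of the report.

```shell
# Replace the first non-newline character of a two-line pattern space
# with "x" without writing [^\n] in a bracket expression.
# N appends the next input line; the first y swaps newline and ".",
# the s works on "not a dot", and the second y swaps them back.
first_non_newline_to_x() {
  sed 'N;y/\n./.\n/;s/[^.]/x/;y/\n./.\n/'
}

# Feeds the lines "a" and "b"; the result should be the lines "x" and "b".
printf 'a\nb\n' | first_non_newline_to_x
```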

-  Notes
(0004291)
geoffclare (manager)
2019-03-08 18:18

The awk examples in the desired action don't match what you're asking for.
Since awk is specified as processing C-style escapes before the ERE is
interpreted as specified in XBD chapter 9, if XBD chapter 9 is changed to
say that backslash needs to be escaped in bracket expressions, then for
awk you would need:

   $0 ~ /[\\\\]/
   $0 ~ "[\\\\\\\\]"

So the behaviour of awk in practice is actually an argument against
changing XBD chapter 9 the way you are asking for.
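
As a rough illustration of the layering described above (awk consumes one level of backslashes before the ERE is interpreted), the following sketch runs the two forms from the Desired Action against a current awk. The helper names are hypothetical, and the claim that both forms match is an observation about common implementations rather than something the note states.

```shell
# Hypothetical helper: print input lines containing a backslash, with
# the ERE written as a /.../ token. awk turns \\ into one backslash.
awk_lines_with_backslash() {
  awk '$0 ~ /[\\]/'
}

# Same test with the ERE given as a string: the string lexer first
# reduces the four backslashes to two, before any ERE processing.
awk_lines_with_backslash_str() {
  awk '$0 ~ "[\\\\]"'
}

printf '%s\n' 'a\b' 'plain' | awk_lines_with_backslash
printf '%s\n' 'a\b' 'plain' | awk_lines_with_backslash_str
```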

Another reason not to change XBD chapter 9 is that it specifies the
behaviour of regcomp() and regexec(). Are you claiming that there
are implementations of those functions which treat backslash as
special in bracket expressions? There shouldn't be, since those
functions were invented by POSIX.

So I believe XBD chapter 9 should remain as is, and instead changes
should be considered for individual use cases. For example, we could
consider specifying that sed may process C-style escapes the way awk
does.
(0004292)
stephane (reporter)
2019-03-08 20:29

Re: Note: 0004291

> The awk examples in the desired action don't match what you're asking for.
> Since awk is specified as processing C-style escapes before the ERE is
> interpreted as specified in XBD chapter 9, if XBD chapter 9 is changed to
> say that backslash needs to be escaped in bracket expressions, then for
> awk you would need:
>
> $0 ~ /[\\\\]/
> $0 ~ "[\\\\\\\\]"

My point is that since several commands including awk treat \ specially within [...] (and it's not only about ANSI C escape sequences, /[\]]/ also matches on ] only in most awk implementations), the regexp syntax should mandate \ to be escaped inside bracket expressions to match a backslash literally.

I'm fine if instead of making the change to the regexp syntax you want to make it to all the standard utilities that use regexps.

An advantage of making the change to the regexp syntax is that it allows implementations to add extensions in a consistent fashion (as opposed to now, with different utilities adding their own independently).
(0004295)
kre (reporter)
2019-03-11 02:36

The most I believe is needed to deal with this is a note in some
application usage type section (perhaps one might need to be added)
noting that some applications have extended the use of \ as an escape
character to allow it to work inside [] expressions in matching, and
to advise using \\ whenever this is a possibility.

For applications that don't treat the \ in [] as anything other than
a character, this is harmless - having any character twice in a bracket
expression changes nothing - either it is there, and that character in
the string being matched against will match, or it is not there, and the
[] expression will not match if the character in question appears at the
relevant position in the string.
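
A small sketch of that argument, using grep: the doubled backslash in the bracket expression is either a repeated literal (strict reading) or one escaped backslash (extended reading), and the set of matched characters is the same either way. The helper names are hypothetical.

```shell
# Hypothetical helper: count lines on stdin that contain a backslash (BRE).
count_backslash_lines() {
  grep -c '[\\]'
}

# Same check using an ERE.
count_backslash_lines_ere() {
  grep -E -c '[\\]'
}

# Two of the three input lines contain a backslash.
printf '%s\n' 'a\b' 'plain' 'c\d' | count_backslash_lines
printf '%s\n' 'a\b' 'plain' 'c\d' | count_backslash_lines_ere
```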

For applications where \n \t (etc) are treated differently than as being
the two characters, '\' and 'n' or 't' inside a bracket expression,
explicitly writing \\n or \\t when that is the intent is harmless, and
works in both cases. On the other hand, when a newline or tab is to
be matched, it should be entered literally, rather than simply assuming
that \n or \t will work (and yes, I know some apps do not allow literal
newlines in patterns, so this is not always easy - but on the other hand,
sed excepted, most of those (grep, ...) also never attempt a match on a
string that can contain a newline, so it doesn't matter).

That the description notes:
   - in GNU sed (unless POSIXLY_CORRECT is in the environment)
makes it clear that they know that they are making an extension to
standard regular expressions, and consequently a method is provided
to return to the standard behaviour.

So, I wouldn't go as far as making it unspecified whether [\n] might be
a match against a newline - in a standard regular expression it is not.
But adding a note for application (script) writers that writing [\\n] is
safe, and avoids issues, is not an unreasonable request.
(0004302)
stephane (reporter)
2019-03-11 07:53

Let's not lose sight of the fact that POSIX is about specifying a portable API. When it comes to shell and utilities, it's about what people can safely put in their script so it can work everywhere. POSIX also constrains implementations.

Here by saying that \x is unspecified but [\x] or [x\] match on \ and x everywhere, first we're telling *applications* a lie because it's not the case in practice.

And then we're constraining *implementations* in a very counter-productive way. We're telling them, you can use \x as an extension, but not [^\x]. Note that it's not only about \n, \b ANSI sequences. See also the \d, \s of perl (shortcut for [[:digit:]] [[:space:]]) and of course \]

Those \s, \d are now found in many standard utility implementations (grep/sed/vi at least), but because POSIX requires [^\s] to match characters other than \ and s, those \s are often not recognised inside bracket expressions, which is an annoying difference from perl. I tend to use grep -P where available because that's IMO a saner regexp syntax.

Relaxing that POSIX requirement (and for what? so users can use [\/] instead of [\\/] which they won't do anyway as it often doesn't work like in shells, awk, fnmatch) would allow utilities (and regcomp()) to implement more useful extensions to the regex syntax.

- Issue History
Date Modified Username Field Change
2019-03-08 17:32 stephane New Issue
2019-03-08 17:32 stephane Name => Stephane
2019-03-08 17:32 stephane Section => 9.3.5
2019-03-08 17:32 stephane Page Number => 184
2019-03-08 17:32 stephane Line Number => 6087
2019-03-08 18:18 geoffclare Note Added: 0004291
2019-03-08 20:29 stephane Note Added: 0004292
2019-03-11 02:36 kre Note Added: 0004295
2019-03-11 07:53 stephane Note Added: 0004302

