Austin Group Defect Tracker

ID 0001234
Category [1003.1(2016)/Issue7+TC2] Shell and Utilities
Severity Editorial
Type Enhancement Request
Date Submitted 2019-03-08 23:58
Last Update 2019-03-12 22:37
Reporter stephane
View Status public
Assigned To
Priority normal
Resolution Open
Status New
Name Stephane Chazelas
Organization
User Reference
Section 2.13.1
Page Number 2382
Line Number 76212-76215
Interp Status ---
Final Accepted Text
Summary 0001234: in most shells, backslash doesn't have two meaning wrt pattern matching
Description That's a follow-up on bug:1190 where Geoff claims backslash has
two meanings in shell, one as a quoting operator and another one
as an "escaping" pattern matching operator.

The relevant part of the spec in 2.13.1:

> A <backslash> character shall escape the following character.
> The escaping <backslash> shall be discarded. If a pattern ends
> with an unescaped <backslash>, it is unspecified whether the
> pattern does not match anything or the pattern is treated as
> invalid.

That text seems to me to be referring to what happens in
fnmatch() where backslash is used as a substitute for shell
quoting.

In most shells, beside quoting, backslash doesn't have that
"second" meaning for shell globbing and "case" pattern matching.

See also the last paragraph in that same section:

> When pattern matching is used where shell quote removal is not
> performed (such as in the argument to the find -name primary
> when find is being called using one of the exec functions as
> defined in the System Interfaces volume of POSIX.1-2017, or in
> the pattern argument to the fnmatch( ) function), special
> characters can be escaped to remove their special meaning by
> preceding them with a <backslash> character. This escaping
> <backslash> is discarded. The sequence "\\" represents one
> literal <backslash>. All of the requirements and effects of
> quoting on ordinary, shell special, and special pattern
> characters shall apply to escaping in this context.

That paragraph makes little sense if like Geoff we interpret the
first one above as meaning backslash has a special meaning
independent of quoting in all cases including shell and
fnmatch().

There is one and only one shell however that implemented Geoff's
interpretation: bash. And it's done partly (but differently) in
ksh93, some ash-derived shells and zsh. In all other shells:
Thompson sh, Mashey sh, Bourne sh, ksh88, Almquist sh, Forsyth
sh, pdksh, mksh, yash, backslash is just a quoting operator like
'...' and "..." in that regard (and quoting removes the special
meaning of wildcard operators), backslash doesn't have a special
meaning.

glob came from Unix V1 where the shell would invoke a /etc/glob
helper to expand globs. That /etc/glob didn't treat backslash
specially. sh would convey to glob what characters were quoted
by setting the 8th bit on them. /etc/glob was moved inside the
shell in the Mashey shell, and the Bourne shell fixed most of
the issues (and added some) but worked on the same principle:
quoting marks characters as "quoted" internally which prevents
them from being treated as wildcards by the globbing routine.

The second meaning, if any, only reveals itself when the pattern
comes from some expansion, as in:

   files='\?*.ext'; ls -d -- $files

In all shells but bash and zsh (in sh emulation), that expands
to the file names that start with \ followed by one or more
characters and ends in .ext as \ is not a wildcard operator
there, just a quoting one and that \ is not used as a quoting
operator.

In bash and zsh, that expands to the file names that start with
? and end in .ext.
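
A quick way to see which of the two behaviours a given shell has is a
small probe such as the sketch below. The function name
probe_backslash_glob and the scratch file names are made up for
illustration, and it assumes a mktemp that supports -d.

```sh
# Hypothetical probe: report how the running shell treats a backslash that
# reaches pathname expansion through an unquoted parameter expansion.
probe_backslash_glob() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    : > '?x.ext'       # name starts with a literal ?
    : > '\ax.ext'      # name starts with a literal backslash
    files='\?*.ext'
    set -- $files      # deliberately unquoted: pathname expansion applies
    case $1 in
        '?x.ext')  result=escape ;;   # the \ escaped the ?
        '\ax.ext') result=literal ;;  # the \ matched itself, ? was a wildcard
        *)         result=other ;;    # nothing matched
    esac
    cd / && rm -rf "$dir"
    echo "$result"
)

probe_backslash_glob
```

The probe only distinguishes the two outcomes described above; it says
nothing about which one is correct.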

zsh doesn't implement Geoff's interpretation of the POSIX
requirement fully as the \ above is only considered as an escape
operator before wildcard operators. It's not removed when
followed by normal ones. You'll see the difference in:

   $ touch ab '\a\b' 'a*'
   $ p='\a\b' bash -c 'ls -d -- $p'
   \a\b
   $ p='\a\b' bash -c 'ls -d -- $p*'
   ab
   $ (p='\a\b' exec -a sh zsh -c 'ls -d -- $p*')
   \a\b

Where it gets a bit silly is in:

   $ p='a\*' bash -c 'printf "%s\n" $p'
   a\*

That * is escaped (by a wildcard operator?) but since the pattern
doesn't contain wildcard operators anymore (so the answer is "no"),
it's left as is and doesn't match the a* file (see how it differs in
"case" statements below).

In dash and busybox sh (but not other ash derivatives like
NetBSD sh or FreeBSD sh) and ksh93, while backslash doesn't have
a second meaning for globs, it seems to have one in case
statements.

In those (and in bash)

   p='a\*'; case "a*" in $p) echo match; esac
   p='\a'; case "a" in $p) echo match; esac

outputs "match" twice (note the difference with globs for the
first one in bash and zsh).
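
The "case" behaviour can be checked with a small helper like the one
below. This is only a sketch; the function name matches is made up for
illustration. The bracket form behaves the same everywhere, while the
result of the backslash form depends on the shell, as described above.

```sh
# matches PATTERN STRING: succeed if STRING matches PATTERN, where the
# pattern reaches "case" through an unquoted parameter expansion.
matches() {
    case $2 in
        $1) return 0 ;;
    esac
    return 1
}

# Bracket form: no backslash involved.
matches 'a[*]' 'a*' && printf '%s\n' 'a[*] matches a*'
matches 'a[*]' 'ab' || printf '%s\n' 'a[*] does not match ab'

# Backslash form: the result depends on the shell.
if matches 'a\*' 'a*'; then
    printf '%s\n' 'backslash from an expansion acts as an escape in case'
else
    printf '%s\n' 'backslash from an expansion is literal in case'
fi
```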
Desired Action I don't think that second meaning was intended in the POSIX
specification. If it was, it was poorly worded. Still, some
shells (very few) have implemented it, if only partially.

I would recommend suggestion of that second meaning be removed
from the specification, but that shells that implemented it be
accommodated.

So remove the first quoted text above, undo the changes in
bug:1190 that suggest that backslash has a second meaning in
shell wildcards, keep the last paragraph about find/fnmatch()
(possibly clarified as per bug:1233), but make it clear to
application writers, that in a wildcard pattern that comes from
an unquoted parameter expansion or command substitution, a
backslash should be matched by [\\] (the double backslash as per
bug:1233), and a wildcard operator ([, *, ?) by [[], [*], [?]
respectively.

Maybe clarified with some examples:

- files='[?]*'; ls -d -- $files for the list of files that start
  with ?
- files='[\\]*'; ls -d -- $files for the list of files that start
  with \
- p=*; ls -d -- '\'$p for the list of files that start with \
- files='\?*'; ls -d -- $files unspecified
- files='\a*'; ls -d -- $files unspecified
- files='*\'; ls -d -- $files unspecified
- pattern='\a'; case $var in $pattern) ...; esac unspecified

Now for:

- files='\a'; ls -d -- $files
- files='a\*'; ls -d -- $files

(globs that contain backslashes but no unescaped wildcard) we
can make it unspecified for consistency with the "case"
equivalent above, or since even bash appears to not treat the \
specially there, require that it be expanded as is.
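
The bracket forms recommended above can be tried out with a sketch like
this one. The helper name expand_in_tmp and the file names are made up
for illustration, and it assumes a mktemp that supports -d.

```sh
# expand_in_tmp PATTERN: print, one per line, what an unquoted expansion of
# PATTERN produces in a scratch directory holding the files ?a, \a and xa.
expand_in_tmp() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    : > '?a'; : > '\a'; : > 'xa'
    files=$1
    printf '%s\n' $files    # deliberately unquoted
    cd / && rm -rf "$dir"
)

expand_in_tmp '[?]*'     # files that start with ?
expand_in_tmp '[\\]*'    # files that start with \
```

Neither pattern relies on a backslash being treated as an escape outside
a bracket expression, which is why they are expected to give the same
result whichever interpretation the shell implements.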
Tags No tags attached.
Attached Files

- Relationships
related to 0001190 Interpretation Required backslash has two special meanings in the shell and only loses one of them in bracket expressions

-  Notes
(0004296)
kre (reporter)
2019-03-11 02:42
edited on: 2019-03-11 03:16

I regard this:

    In most shells, beside quoting, backslash doesn't have that
    "second" meaning for shell globbing and "case" pattern matching.

as an indication that most shells have had bugs in the matching code.

That is, few even really considered what happens with
    ls $var

when var contains pattern matching characters. Everyone agrees
that pathname expansion is performed on the result, so if var='*.c'
the ls would list all files with names ending in .c

Similarly, all agree that if one wants to avoid pathname expansion,
the parameter expansion just gets quoted
    ls "$var"
will only list the file '*.c'
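
The difference between the two forms can be seen with a sketch like the
following. The function name count_words and the file names are made up
for illustration, and it assumes a mktemp that supports -d.

```sh
# Count the words produced by $var unquoted and quoted, in a scratch
# directory holding a.c, b.c and a file literally named *.c.
count_words() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    : > a.c; : > b.c; : > '*.c'
    var='*.c'
    set -- $var;   unquoted=$#   # pathname expansion: every *.c file
    set -- "$var"; quoted=$#     # one literal word, no expansion
    echo "unquoted=$unquoted quoted=$quoted"
    cd / && rm -rf "$dir"
)

count_words
```

With those three files this should print unquoted=3 quoted=1.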

But it was not historic practice to correctly deal with the possibility
that var might contain both characters that are to be treated specially,
and those that are not. All shells correctly handle

    ls \**.c

to list all files with names starting with a '*' and ending with '.c'
but (until relatively recently) few provided any method to allow such
a pattern to be stored in a variable, and accessed using parameter
expansion.

Quoting the parameter expansion cannot work - that always disables pathname
expansion (as does "set -f"). So there *must be* a method to allow a
parameter expansion to contain both wildcard characters, and those characters
treated as literal characters.

The way this is done in all other contexts is by using \ as the (one and
only) escape character. For example:
      find -name '\**.c' ...
works, and \ used this way also works in regular expressions. Everywhere but
in broken shells.

I cannot treat this defect as anything other than a bug in those shells,
however common it is for this bug to exist.

Wrt:
        files='\?*.ext'; ls -d -- $files

    In all shells but bash and zsh (in sh emulation), that expands
    to the file names that start with \ followed by one or more
    characters and ends in .ext as \ is not a wildcard operator
    there, just a quoting one and that \ is not used as a quoting
    operator.

First, it would be wrong to ever consider a \ as a wildcard operator.
It is not, never was, never will be. It is a quoting character,
and should work there in the shell as it does in other applications.

Further, contrary to what that says, in the (fixed) NetBSD sh (fixed
since late last year, and will be in both 8.1 - as it is a bug fix, and
9 when that appears) this would expand to files whose names start with
a '?' and end in ".ext" - the ? is quoted, and so matches literally,
exactly the same as would happen if the contents of the var were listed
on the command line, or if one did

     eval command "${files}"

but without the side issues that has if files contains other quoting
characters (' or ") which have no effect on matching in any case, and
are simply characters.

Consequently I object to the desired action. I would tolerate some
words added noting that this is a common bug - but a bug is all this
one is, working the (various different) ways that the various shells
operate has no specifically designed benefit - this case simply has not
been properly considered in their implementations (and is not all that
hard to fix - I know, I have done that.)

There is a real need to be able to put generic patterns in variables,
and have them work, and work consistently in all 3 uses of patterns in
the shell (pathname expansion, case patterns, and substring matching in
parameter expansion) so that the same results occur in all 3 cases.

If the price for that is to have some shells listed as non-conformant
until they fix their bugs, then so be it.

The conformance test suite should be explicitly testing for these cases,
and failing shells that do not comply.

On the other hand, warning script writers that this is an area that is
fraught with danger, currently, is reasonable - even if there really is
no good workaround available currently, with a broken shell, there are
simply some things that cannot be accomplished, which should be possible.

(0004298)
kre (reporter)
2019-03-11 03:23

In note 4296 I wrote
   Quoting the parameter expansion cannot work - that always disables pathname
   expansion

That is loose (ok, incorrect) wording, pathname expansion isn't disabled
by quoting, rather it is just ineffective - as all of the quoted characters
are literals, so if the whole word is quoted (as in ls "${var}") pathname
expansion might as well be disabled, as only one pathname can possibly
result - that which is the exact contents of $var (with everything treated
literally). Still this is not really "disabling" pathname expansion, and I
should have been more precise in my wording. This glitch has no effect on
the substance of the note however.
(0004300)
stephane (reporter)
2019-03-11 07:24

Re: Note: 0004296

> All shells correctly handle
>
> ls \**.c
>
> to list all files with names starting with a '*' and ending with '.c'

As a historical note, that didn't work until V7. In the Thompson or even the Mashey shell (where the /etc/glob functionality was moved into the shell), that 8th bit on quoted characters was removed just before calling exec(), so was still there at the filename matching stage, so you couldn't have globs with quoted characters. Even "a"*.c didn't work.

You would do [*]*.c then, just like you'd do

   files='[*]*.c'
   rm $files

now (except that it would delete a literal [*]*.c if there was no matching file, a misfeature/bug the Thompson/Mashey shells didn't have that was introduced by the Bourne shell (and no (cf 1235), it's not easy to work around and almost never worked around in real life scripts, but it's too late to fix now))

There is no need to add that confusing second meaning to backslash.

In awk (where different implementations handle backslash differently) and more generally in tools where I'm not sure how escaping works, I tend to use [*] instead of \* as well to be on the safe side, as that always works (the fish shell being an exception as it doesn't have the [...] wildcard operator).
(0004301)
stephane (reporter)
2019-03-11 07:31

Re: Note: 0004298

I would say that yes quoting a variable, or more generally making sure that a word in list context doesn't have any unquoted wildcard disables pathname expansion.

That's how it was done in the original glob implementation where /etc/glob would only do pathname expansion on arguments that contained unquoted */?/[.

"echo foo" would output foo, but "echo [f]oo" would fail with a "No match" error if there was no matching file. If it did output "No match" in the "echo foo" case, that would clearly be a bug.

Modern shells that still fail commands when globs don't match do the same.

And in those that don't, you wouldn't want echo foo/bar/baz to cause the shell to read the content of ., foo and foo/bar to realise there's only at most one match and then leave the result as is.
(0004303)
geoffclare (manager)
2019-03-11 09:41

This bug touches on many of the same issues that came up during discussion of bug 0000985. It may be best to close it as a duplicate of that bug and put our efforts into resolving the issues there instead of having a separate discussion here.
(0004306)
kre (reporter)
2019-03-12 03:24
edited on: 2019-03-12 12:14

Lacking the ability to write a pattern in anything other than original
shell code reliably is too big a loss to suffer - there must be a reliable
way to access a variable pattern, reliably, and use it (fully functional).

I'm not aware of any shell claiming that their broken way is correct, the
most I have seen is "posix doesn't mandate it" (which always was questionable)
"so we do not need to fix it".

What the Thompson (6th edn and earlier) shell, or the Mashey (PWD) shell
did, just as what csh (and derivatives) do is of no relevance whatever.
As I recall the Thompson shell (and probably the Mashey one too, though I
used that very little) didn't even have variables for this to be an issue.

And while the [] form of quoting works, most of the time, as an alternative
to \ quoting, it is needlessly different for no good reason, I can do

var='\**.c'; find /wherever -name "$var" -print

and that works as expected, but if I later do

ls -l $var

it does not? In your other issues you are arguing for consistency
across multiple different applications (extending as far away from sh
as perl) yet here you want things to remain different? Why?

Further, given that (for historical reasons) both ! and ^ are (or
might be) special when used as the first character of a bracket
expression, without allowing \ quoting, there's no way to write a
pattern that matches either of those chars (if one wanted to write
a pattern to look for glob patterns for example). One of the two
must be first, and thus have a special meaning which is unwanted.
Nothing else can be inserted into the bracket expression, as that
would result in a match against 3 chars instead of just the two.
On the other hand, if \ quoting works, then [\^!] or [\!^] work
just fine. Those do work when entered on the shell command line,
shouldn't they work in a variable as well?

Wrt Note: 0004301 ... whether pathname expansion is "disabled" or just ineffective
makes no difference (nor does the optimisation in the Thompson shell to not
exec /etc/glob when it would obviously change nothing). For most purposes this
is just semantics. But for the purposes of the standard, pathname expansion
is only disabled when done so via "set -f" (or "set -o noglob" of course).
In some situations it is not performed, and in others it is performed but
does nothing, and in others it produces a set of new words to replace the
original.

The details of how pathname expansion is performed - given that it must comply
with the constraint that 'r' permission is not to be required on any directory
component unless there is a wildcard operator in the source word component to
match (and one might even argue whether a usage like [a] is such, given the
only possible match is 'a') - is all up to the implementation, but as soon as
it starts looking at a word breaking it into pathname components and looking
to see if unquoted wildcard characters exist, it is performing pathname expansion (no other function in the shell requires such inspection) - and it
would not be doing any of that if pathname expansion were disabled.

And last, wrt Note: 0004303, I will go and look (again) at bug 0000985, though it
seems to me that when a bug is that old, and not yet resolved, it takes something
like this to get it moved back into being under active consideration
rather than simply stagnating.

(0004307)
kre (reporter)
2019-03-12 03:37
edited on: 2019-03-12 12:16

Re Note: 0004303 again...

I suspect that you added that note to the wrong bug report. Bug 0001235
falls squarely into what is/was being discussed in bug 0000985 whereas
this one is not.

This is more related to bug 0001190, in which, in Note: 0004290, you explicitly
asked for new issues to be opened to discuss these side issues of the
original report.

(0004309)
Konrad_Schwarz (reporter)
2019-03-12 09:13

Re Note 4306:

why does

   eval ls -l $var

not suffice?
(0004310)
geoffclare (manager)
2019-03-12 09:58

Re Note: 0004307 I had misremembered where the relevant discussion occurred (it was on the mailing list, not initially in connection with bug 0000985), but I still believe the issue raised here about backslash in shell patterns should be dealt with in bug 985 because the proposed resolution for that bug puts the detail of pattern matching into fnmatch() and then rewrites 2.13 to refer to fnmatch(). The effect of the aforementioned discussion on how to rewrite 2.13 is the reason bug 985 was reopened (see Note: 0003947 and Note: 0003948).
(0004313)
kre (reporter)
2019-03-12 12:15
edited on: 2019-03-12 14:56

Re: Note: 0004309

You probably mean
    eval ls -l "$var"
without the quotes you'd get pathname expansion on $var (files that start
with a '*' for example - however that is encoded in var) and then the eval
would do pathname expansion again on each of the results from the first time.
That's rarely going to be what anyone wants.

But for the form with quotes in it, imagine var="[']*.c"

then what the eval sees is

     eval ls -l [']*.c

That isn't going to work at all. There's often (but not always) a way
to code around this, but then it means that the variable needs to be
different to be used in glob expansions like this, and in case patterns,
or parameter expansion substring operators - and certainly as an arg to
find. The same thing ought to be able to be used, with the same meaning
in all of those contexts.
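
The failure mode with eval can be reproduced with a sketch like the one
below. The function name eval_vs_glob and the file names are made up for
illustration, and it assumes a mktemp that supports -d. The eval is run
in a subshell because a syntax error inside eval may terminate a
non-interactive shell.

```sh
# Compare direct pathname expansion of var="[']*.c" with passing the same
# text through eval, in a scratch directory holding 'x.c and y.c.
eval_vs_glob() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    : > "'x.c"; : > y.c
    var="[']*.c"
    set -- $var    # the ' comes from an expansion, so it is just a character
    echo "glob: $# match(es): $1"
    # eval re-parses the text, so the lone ' opens an unterminated quote.
    if (eval "set -- $var") 2>/dev/null; then
        echo "eval: parsed"
    else
        echo "eval: failed to parse"
    fi
    cd / && rm -rf "$dir"
)

eval_vs_glob
```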

(0004314)
kre (reporter)
2019-03-12 12:23
edited on: 2019-03-12 14:58

Re: Note: 0004310

I disagree actually - I'd keep 0000985 to discuss
just how case should be handled, and move all the discussions of how patterns
ought to be processed elsewhere (eg: here) as patterns affect much more than case
statements.

(0004317)
Konrad_Schwarz (reporter)
2019-03-12 14:29

Re: Note: 0004313

(I think) I understand.

But as desirable as it may be, there is not going to be a way for $var to
mean the same thing to the shell globber and to find -name, because when a backslash arrives at the shell globber, it is literal,
whereas it is an escape for find -name.

See also http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13, 2.13.1 Patterns Matching a Single Character:

When pattern matching is used where shell quote removal is not performed (such as in the argument to the find -name primary when find is being called using one of the exec functions as defined in the System Interfaces volume of IEEE Std 1003.1-2001, or in the pattern argument to the fnmatch() function), special characters can be escaped to remove their special meaning by preceding them with a backslash character. This escaping backslash is discarded. The sequence "\\" represents one literal backslash. All of the requirements and effects of quoting on ordinary, shell special, and special pattern characters shall apply to escaping in this context.
(0004318)
kre (reporter)
2019-03-12 15:11

Re Note: 0004317

     "because when a backslash arrives at the shell globber, it is literal,
      whereas it is an escape for find -name."

That's exactly what the issue is. And what needs to change in order for
patterns in variables to work correctly. And here, "change" means "be
interpreted in a reasonable way", all of that is actually correct as it is.

Certainly a \ entered in a script (part of the shell input) is a quoting
character, and if it gets through to the pattern matching, it must have
been quoted (and as such be in a place where quote removal will be
performed, or at least would be, when used in a place where that happens).
That (quoted) \ which gets to the pattern matching code is certainly a
literal backslash, and simply matches itself.

But when we have
     var='\**.c'
(or similar) and then
     ls $var
(which becomes after parameter expansion)
     ls \**.c
that \**.c (as it is not the original text) is not subject to quote removal
(only quoting characters that were in the original script get removed by
quote removal) and is "When pattern matching is used where shell quote
removal is not performed" and the "special characters can be escaped"
(as in the first '*') "by preceding them with a backslash character."
which is exactly what has been done here.

This is another case where the standard as written is correct, and doesn't
really need any changes - we just need to actually believe what it says,
and interpret it correctly.
(0004319)
stephane (reporter)
2019-03-12 22:37

Re: Note: 0004306

> Lacking the ability to write a pattern in anything other than original
> shell code reliably is too big a loss to suffer - there must be a reliable
> way to access a variable pattern, reliably, and use it (fully functional).

"loss" implies something that you had and that you no longer have. Again, that "essential" feature you couldn't do without is hardly implemented by any shell, not even yours (at least not in the NetBSD 7.1.2 test VM I have here).

On the other hand, [*]* has been the way to match strings starting with * for almost 50 years (hence my "historical" reference to the Thompson shell, which the Mashey shell built upon, which the Bourne shell built upon, which the Korn shell built upon, a subset of which POSIX specifies as sh).

I avoid \ as a quoting operator whenever I can as \ is way too overloaded (as quoting operator (with several layers in shells with `...`), as escape sequence introducer, as line continuation, and again in most utilities called by the shell).

Rather than

    case $x in (\?*)

I prefer:

    case $x in ('?'*)

Should I argue that

    pattern="'?'*"
    case $x in ($pattern)

should match on strings that start with "?"?

How about

    pattern='$var*'

Should we ask for a second round of shell evaluation (which is what you're asking for \, hence Konrad's eval suggestion, why stop there?) on the content of a variable when that variable is used as a pattern (reminding me of the infamous wordexp(3))?

Again, fnmatch() doing backslash handling is its own implementation of quoting as a poor man's substitute for shell quoting (globs came from the shell, fnmatch() was an effort to bring it to other commands). It's only doing \, not "..." nor '...', possibly for simplicity and/or to align with regexps.

Note that I'm not asking that POSIX prohibit that double backslash evaluation. It's too late, some shells are already doing it. Since bash, one can no longer do files='\x*'; ls -d $files and expect that to list the filenames that start with \x (not that anybody is likely to want to do that).

What I'm saying is: allow it if you want, but do not mandate it.

I'm not going to argue much longer, that's enough of a corner case, that I don't really care. If POSIX decides to mandate it, that p='\**'; echo $p should match files starting with *, OK (though I'll probably keep using the more portable [*]*), but at least leave p='\x*'; echo $p (that is where \ is not followed by a glob or \) unspecified so implementations can keep matching on '\x'* files and break fewer existing scripts.

- Issue History
Date Modified Username Field Change
2019-03-08 23:58 stephane New Issue
2019-03-08 23:58 stephane Name => Stephane Chazelas
2019-03-08 23:58 stephane Section => 2.13.1
2019-03-08 23:58 stephane Page Number => 2382
2019-03-08 23:58 stephane Line Number => 76212-76215
2019-03-11 02:42 kre Note Added: 0004296
2019-03-11 02:58 kre Note Edited: 0004296
2019-03-11 03:07 kre Note Edited: 0004296
2019-03-11 03:10 kre Note Added: 0004297
2019-03-11 03:16 kre Note Edited: 0004296
2019-03-11 03:23 kre Note Added: 0004298
2019-03-11 07:24 stephane Note Added: 0004300
2019-03-11 07:31 stephane Note Added: 0004301
2019-03-11 09:41 geoffclare Note Added: 0004303
2019-03-12 03:24 kre Note Added: 0004306
2019-03-12 03:37 kre Note Added: 0004307
2019-03-12 03:38 kre Note Deleted: 0004297
2019-03-12 09:13 Konrad_Schwarz Note Added: 0004309
2019-03-12 09:58 geoffclare Note Added: 0004310
2019-03-12 12:13 Don Cragun Note Edited: 0004306
2019-03-12 12:14 Don Cragun Note Edited: 0004306
2019-03-12 12:15 kre Note Added: 0004313
2019-03-12 12:16 Don Cragun Note Edited: 0004307
2019-03-12 12:23 kre Note Added: 0004314
2019-03-12 12:24 kre Note Edited: 0004314
2019-03-12 14:29 Konrad_Schwarz Note Added: 0004317
2019-03-12 14:56 kre Note Edited: 0004313
2019-03-12 14:58 kre Note Edited: 0004314
2019-03-12 15:11 kre Note Added: 0004318
2019-03-12 22:37 stephane Note Added: 0004319
2019-03-14 15:58 Don Cragun Relationship added related to 0001190

