Austin Group Defect Tracker

ID: 0001105
Category: [1003.1(2016)/Issue7+TC2] Shell and Utilities
Severity: Editorial
Type: Enhancement Request
Date Submitted: 2016-12-05 21:52
Last Update: 2018-11-12 19:47
Reporter: stephane
View Status: public
Assigned To:
Priority: normal
Resolution: Accepted As Marked
Status: Interpretation Required
Name: Stéphane Chazelas
Organization:
User Reference:
Section: awk
Page Number:
Line Number:
Interp Status: Approved
Final Accepted Text: Note: 0004019
Summary: 0001105: problems with backslashes in awk strings and EREs
Description:
In awk's string constants ("string"), \ is used to quote itself and "
but also to introduce escape sequences like \n, \b, \56...

That is specified in the spec.

The spec also specifies that the same happens for
awk -v var='string' and awk '{code}' var='string', though I wish
it made it clearer and for instance added a warning along the
lines of "applications wishing to pass arbitrary strings by way
of -v var="$value" or var="$value", should make sure backslashes
are escaped or use the ENVIRON or ARGV methods instead" as that's
a source of many bugs.
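
As a minimal sketch of the difference, the following passes the same string by each of the three methods. The names value, VALUE, via_v, via_env and via_argv are made up for illustration. Only the -v form is subject to escape-sequence processing:

```shell
#!/bin/sh
# Hypothetical input: a backslash followed by "t", not a TAB.
value='a\tb'

# -v assignment: awk processes escape sequences, so \t becomes a TAB.
via_v=$(awk -v "var=$value" 'BEGIN { print var }')

# ENVIRON: the string reaches awk verbatim.
via_env=$(VALUE="$value" awk 'BEGIN { print ENVIRON["VALUE"] }')

# ARGV: also verbatim. With only a BEGIN rule awk reads no input,
# so the operand is never opened as a file.
via_argv=$(awk 'BEGIN { print ARGV[1] }' "$value")

printf '%s\n' "$via_v" "$via_env" "$via_argv"
```

The first line printed contains a TAB; the other two contain the original backslash and "t".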

Now, the spec gives the same escaping rules for "...", /ERE/, -v
var=... and var=...

In particular, it says \/ means a slash in all of them and not only in
/ERE/, and that \" means a " in all of them and not only in "...".

IOW, it says -v a='\@' for instance is unspecified, but that
-v a='\"' is specified and means the same as -v a='"'

In practice that's not the case.

Solaris 10:
$ /usr/xpg4/bin/awk -v 'a=\"' 'BEGIN {print a}'
\"

$ gawk 'BEGIN{print "\/"}'
gawk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
/

When it comes to regexps and escape sequences, it gets worse.

It looks like the spec specifies (to some extent) some accident
of implementation of the historical awk implementation.

It requires for instance that both:

"\f" ~ "\f"

and

"\f" ~ "\\f"

be true. The first one matches against an ERE that consists of
a formfeed character. The second matches against a "\f" ERE that
is meant to match the formfeed character.

IOW, the backslash-f sequence is understood as a RE operator.

And /ERE/ is not a RE quoting operator that allows for escape
sequences like "...", but the /.../ are like strong quotes and
it's up to the regexp engine to interpret those \f.

If I understand correctly, it requires "x" ~ /\56/ (on ASCII
systems where \56 is ".") to return false.

While that's what I observe with bwk's awk (or FreeBSD awk based
on it), that's not what I observe with many other modern (even
certified) awk implementations:

$ /usr/xpg4/bin/awk 'BEGIN{print "\f" ~ "\\f"}'
0
$ /usr/xpg4/bin/awk 'BEGIN{print "x" ~ /\56/}'
1
$ gawk 'BEGIN{print "\f" ~ "\\f"}'
1
$ gawk 'BEGIN{print "x" ~ /\56/}'
1
$ busybox awk 'BEGIN{print "\f" ~ "\\f"}'
0
$ busybox awk 'BEGIN{print "x" ~ /\56/}'
1
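
Given that divergence, one way to sidestep the question when a literal period is wanted is to use a bracket expression instead of an octal escape, since [.] matches only a period in POSIX EREs. This is a suggested workaround and not something the report itself proposes; the variable names are made up:

```shell
#!/bin/sh
# [.] is a bracket expression containing only a period, so the result
# does not depend on how the implementation expands \56.
x_matches=$(awk 'BEGIN { print ("x" ~ /[.]/) }')
dot_matches=$(awk 'BEGIN { print ("." ~ /[.]/) }')

printf 'x: %s, dot: %s\n' "$x_matches" "$dot_matches"
```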

Also, with the confusion between "\" as a RE quoting operator,
as the " and / escaping operator and its role in ANSI C escape
sequence expansion, it is unclear how we should understand \
inside bracket expressions.

It says awk EREs should be the POSIX EREs (where \ is not
special within [...]), and also that the ANSI C escape sequences
should be supported inside bracket expressions. \\ is listed as
a C escape expansion. Anything other than the list given is
unspecified, so it's even unclear whether /\./ is specified or
whether /\\./ means the \. or \\. ERE (if it were not for the example
section that suggests the latter). In an ERE, [\xy] matches \, x
or y; what should it be in awk?

Note that I can find awk implementations (like nawk on Solaris)
where "[]]" or "[^]]" can't be used to match a "]" or non-].

Most awk implementations I've tried work with [\]] or [^\]], the
only exception I've found being Solaris 10 awk.

Also it doesn't make it clear what "\777" should expand to.
Historical implementations expand to the same as "\377" but some
expand to the same as "\0777" (taken as \77 followed by 7).
Desired Action:
It should be possible to have the spec cover most of the
existing behaviour without affecting functionality too much as
implementations currently are non-compliant mostly in areas that
are corner cases unlikely to be used in practice. IOW, I think
the spec should stop requiring behaviours that could be regarded
as accidents of implementation of historic nawk
implementations.

When it comes to \", \/, they should only be specified to escape
the corresponding delimiter for /ERE/ and "string".

"\/" -> unspecified
-v a='\/' -> unspecified
-v a='\"' -> unspecified
/\"/ -> unspecified
/\// -> specified
"\"" -> specified

Don't make the \a, \b... part of the ERE syntax. awk EREs to be
POSIX EREs except that within bracket expressions "\" has to be
doubled ($0 ~ "[\\x]" (or /[\x]/) unspecified, has to be
$0 ~ "[\\\\x]" (or /[\\x]/) to match \ or x)

In the ERE syntax, \a, \b... to be unspecified like in normal EREs
($0 ~ "\\a" unspecified), but the /ERE/ syntax allows for them
(/\a/ or /[\a]/ shall match the BEL character).

In /ERE/ make it unspecified whether those expansions happen
before or as part of the regex matching (unspecified whether /\56/
on ASCII-based systems matches any character or just "." or if
[a\55z] is like [a-z] or [-az]).

However, make it clear that outside of bracket expressions /\X/,
where X is an ERE operator (including \ itself), means the \X
ERE (which matches the X character literally).

Since POSIX only supports 8-bit bytes, and to allow
implementations to treat \456 as \0456, make the behaviour
unspecified for \ddd where ddd > 377 (octal).

Please also consider adding a note about -v var=value or
assignment arguments having problems with backslashes and the
safer ENVIRON/ARGV alternatives.

Features we'd lose with that change:

PATTERN='\f' awk '$0 ~ ENVIRON["PATTERN"]'
currently required to match a formfeed would cease to be
specified. But in practice, that already fails with some
implementations like Solaris /usr/xpg4/bin/awk or busybox awk.

LC_ALL=C awk '$0 ~ /[\123-\234]/'
could not be used reliably for some values in place of 123 or
234 (like the code point for "^" or "]" instead). Again, in
practice that's already the case with several implementations.
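
For the first of these, a sketch that avoids relying on \f being interpreted by the ERE engine is to put the formfeed character itself into the variable. PATTERN is the name used above; the sample input and the matches variable are made up for illustration:

```shell
#!/bin/sh
# Store an actual formfeed character rather than the two characters \ and f.
# Command substitution only strips trailing newlines, so the formfeed survives.
PATTERN=$(printf '\f')
export PATTERN

# Two sample lines; only the first contains a formfeed.
matches=$(printf 'a\fb\nc\n' |
    awk '$0 ~ ENVIRON["PATTERN"] { n++ } END { print n + 0 }')

echo "matching lines: $matches"
```

Here the ERE is a single formfeed character, which needs no escape processing at all.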
Tags: tc3-2008

Notes
(0003999)
McDutchie (reporter)
2018-04-25 22:27

> [...] for instance added a warning along the
> lines of "applications wishing to pass arbitrary strings by way
> of -v var="$value" or var="$value", should make sure backslashes
> are escaped or use the ENVIRON or ARGV methods instead" as that's
> a source of many bugs.

I second that.

> When it comes to regexps and escape sequences, it gets worse.

This also affects the replacement argument in sub() and gsub(). One of my awk things outputs shell-quoted arguments for the shell to eval. To shell-quote, on Solaris awk we need

    gsub(/'/, "'\\\\''");
    print ("'")($0)("'");

but on every other awk I've come across we need

    gsub(/'/, "'\\''");
    print ("'")($0)("'");

It seems obvious on the surface that the Solaris behaviour is a bug, but the standard may be read to specify the unique Solaris behaviour. At the sub(ere,repl[,in]) spec, it says for the "repl" argument: "An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character". But that is also specified for double-quoted strings, so by the literal text of the standard, it seems like double-escaping the backslashes is needed so Solaris is in the right and everyone else in the wrong.
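
One way to avoid depending on how the repl argument treats backslashes is to not use gsub() for this at all. The sketch below is an alternative, not the code from this note: it splits on the single quote and rejoins the pieces. shellquote and STR are hypothetical names, and \047 is assumed to be the octal escape for ' (true on ASCII systems):

```shell
#!/bin/sh
# Hypothetical helper: print its first argument quoted for the shell.
shellquote() {
    # Pass the string through ENVIRON so awk does not process its backslashes.
    STR=$1 awk '
    BEGIN {
        s = ENVIRON["STR"]
        # Split on the single quote (octal 047 in ASCII).
        n = split(s, parts, "\047")
        out = "\047" parts[1]
        # Rejoin the pieces with the four characters: quote, backslash, quote, quote.
        for (i = 2; i <= n; i++)
            out = out "\047\\\047\047" parts[i]
        print out "\047"
    }'
}

quoted=$(shellquote "it's")
printf '%s\n' "$quoted"
```

Only string concatenation is involved, so the number of backslashes needed does not change with the sub()/gsub() implementation.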
(0004000)
joerg (reporter)
2018-04-26 08:53

Did you use the POSIX /usr/xpg4/bin/awk or the non POSIX but SVID-3 awk?
(0004014)
geoffclare (manager)
2018-04-30 09:59

When this was discussed in the April 26th teleconference, one point that was made is that the statement in the bug description:

| And /ERE/ is not a RE quoting operator that allows for escape
| sequences like "...", but the /.../ are like strong quotes and
| it's up to the regexp engine to interpret those \f.

is incorrect. See "Lexical Conventions" item 6 (on page 2507). For the ERE token /\56/ the ERE seen by the regexp engine contains a <period> (assuming ASCII encoding).
(0004015)
stephane (reporter)
2018-04-30 10:39
edited on: 2018-04-30 12:05

Re: Note: 0004014

I must admit I don't remember how I reached that conclusion as it's been over a year since I looked into that. But from what I read of that "item 6", in itself, it would mean that

awk '/\./' would be unspecified and awk '/\\./' would match on literal "." (it would yield the "\." ERE which matches a literal dot), and /.../ would be no different from $0 ~ "...", which is not true in any awk implementation (and one of the main points of having a /.../ so we can express EREs less awkwardly than with "...").

(0004016)
McDutchie (reporter)
2018-04-30 11:06
edited on: 2018-04-30 12:28

Re: Note: 0004000

I used the POSIX /usr/xpg4/bin/awk, and that is where I observed the behaviour described in Note: 0003999.

(0004017)
geoffclare (manager)
2018-04-30 11:26

Re: Note: 0004015 We agreed with your other points and are going to make changes to address them. This will include a fix for the /\./ issue (which consequently will mean /pat/ is different from $0 ~ "pat").

/\\./ is a new one which will need further consideration.
(0004018)
stephane (reporter)
2018-04-30 12:00
edited on: 2018-04-30 12:05

re: Note: 0003999

Yes, I came across that recently as well (though I didn't find the POSIX spec ambiguous then) at https://unix.stackexchange.com/questions/439752/replace-pattern-each-time-with-a-different-string-taken-from-external-file/439754#439754

I had to use

gsub(/[&\\]/, "\\\\&", repl)

to "escape" the & and \ characters in "repl" (so it can later be used verbatim in another sub()), and for gawk, I had to enable the POSIX mode for that "\\\\&" to mean a literal backslash followed by the matched string. I noticed it didn't work in heirloom awk_su3 (presumably the same as Solaris awk), I assumed a bug.

I agree that if the spec is not already clear there, it would be worth addressing as part of that same bug.

(0004019)
geoffclare (manager)
2018-05-03 15:54
edited on: 2018-05-17 15:16

Interpretation response
------------------------
The standard states the requirements for handling of <backslash> in awk, and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.

Rationale:
-------------
The standard does not match historical and current practice.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------

On page 2491 line 80125 section awk change:
The awk utility shall make use of the extended regular expression notation (see [xref to XBD 9.4]) except that it shall allow the use of C-language conventions for escaping special characters within the EREs, as specified in the table in XBD Chapter 5 (on page 121) (<tt>'\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v'</tt>) and the following table; these escape sequences shall be recognized both inside and outside bracket expressions.
to:
The awk utility shall make use of the extended regular expression notation (see [xref to XBD 9.4]) except that it shall allow the use of C-language conventions for escaping special characters within the EREs, as specified in the table in XBD Chapter 5 (on page 121) for <tt>'\a', '\b', '\f', '\n', '\r', '\t', '\v'</tt> and in the following table for other sequences; these escape sequences shall be recognized both inside and outside bracket expressions.

On page 2492 line 80136 section awk, change the Meaning column from:
<quotation-mark> character
to:
In the lexical token STRING, <quotation-mark> character. Otherwise undefined.

On page 2492 line 80137 section awk, change the Meaning column from:
<slash> character
to:
In the lexical token ERE, <slash> character. Otherwise undefined.

On page 2492 line 80143 section awk, add to the Description column:
If the digits produce a value greater than octal 377, the behavior is undefined.

On page 2492 line 80144 section awk, add two new rows to the table:
\., \[, \(, \*, \+,| A <backslash> character followed  | In the lexical token ERE when
\?, \{, \|, \^, \$ | by a character that has a special | not inside a bracket expression,
                   | meaning in EREs (see [xref to XBD | the sequence shall represent
                   | 9.4.3]), other than <backslash>.  | itself. Otherwise undefined.
-------------------+-----------------------------------+---------------------------------
\\                 | Two <backslash> characters.       | In the lexical token ERE,
                   |                                   | the sequence shall represent
                   |                                   | itself. In the lexical token
                   |                                   | STRING, it shall
                   |                                   | represent a single <backslash>.

On page 2492 line 80161 section awk, change:
Note that these same escape conventions shall also be applied ...
to:
Note that these escape conventions shall also be applied ...

On page 2508 line 80877 section awk add a new first paragraph to APPLICATION USAGE:
Since <backslash> has a special meaning both in the <assignment> option-argument to the -v option and in the assignment operand, applications that need to pass strings to awk without special interpretation of <backslash> should not use these methods but should instead make use of the ARGV or ENVIRON array.


(0004022)
geoffclare (manager)
2018-05-04 15:25

Reopening as there is still an issue with backslash in bracket expressions that we need to address.
(0004036)
geoffclare (manager)
2018-05-17 15:17

Note: 0004019 has been updated to add "when not inside a bracket expression" to the line 80144 change, and to add the line 80161 change.
(0004140)
ajosey (manager)
2018-09-30 18:39

Interpretation Proposed: 30 September 2018
(0004162)
ajosey (manager)
2018-11-12 19:47

Interpretation approved: 12 November 2018

Issue History
Date Modified Username Field Change
2016-12-05 21:52 stephane New Issue
2016-12-05 21:52 stephane Name => Stéphane Chazelas
2016-12-05 21:52 stephane Section => awk
2018-04-25 22:27 McDutchie Note Added: 0003999
2018-04-26 08:53 joerg Note Added: 0004000
2018-04-30 09:59 geoffclare Note Added: 0004014
2018-04-30 10:39 stephane Note Added: 0004015
2018-04-30 11:06 McDutchie Note Added: 0004016
2018-04-30 11:26 geoffclare Note Added: 0004017
2018-04-30 12:00 stephane Note Added: 0004018
2018-04-30 12:05 stephane Note Edited: 0004018
2018-04-30 12:05 stephane Note Edited: 0004015
2018-04-30 12:28 McDutchie Note Edited: 0004016
2018-05-03 15:54 geoffclare Note Added: 0004019
2018-05-03 15:57 geoffclare Note Edited: 0004019
2018-05-03 15:59 geoffclare Interp Status => Pending
2018-05-03 15:59 geoffclare Final Accepted Text => Note: 0004019
2018-05-03 15:59 geoffclare Status New => Interpretation Required
2018-05-03 15:59 geoffclare Resolution Open => Accepted As Marked
2018-05-03 15:59 geoffclare Tag Attached: tc3-2008
2018-05-03 16:00 geoffclare Note Added: 0004020
2018-05-03 16:00 geoffclare Note Edited: 0004020
2018-05-04 08:43 geoffclare Note Edited: 0004019
2018-05-04 08:47 geoffclare Note Edited: 0004019
2018-05-04 08:47 geoffclare Note Deleted: 0004020
2018-05-04 15:25 geoffclare Note Added: 0004022
2018-05-04 15:25 geoffclare Status Interpretation Required => Under Review
2018-05-04 15:25 geoffclare Resolution Accepted As Marked => Reopened
2018-05-17 15:11 geoffclare Note Edited: 0004019
2018-05-17 15:12 geoffclare Note Edited: 0004019
2018-05-17 15:14 geoffclare Note Edited: 0004019
2018-05-17 15:16 geoffclare Note Edited: 0004019
2018-05-17 15:17 geoffclare Note Added: 0004036
2018-05-17 15:17 geoffclare Status Under Review => Interpretation Required
2018-05-17 15:17 geoffclare Resolution Reopened => Accepted As Marked
2018-09-30 18:39 ajosey Interp Status Pending => Proposed
2018-09-30 18:39 ajosey Note Added: 0004140
2018-11-12 19:47 ajosey Interp Status Proposed => Approved
2018-11-12 19:47 ajosey Note Added: 0004162

