ID | Category | Severity | Type | Date Submitted | Last Update | ||
0001105 | [1003.1(2016/18)/Issue7+TC2] Shell and Utilities | Editorial | Enhancement Request | 2016-12-05 21:52 | 2024-06-11 09:09 | ||
Reporter | stephane | View Status | public | ||||
Assigned To | |||||||
Priority | normal | Resolution | Accepted As Marked | ||||
Status | Closed | ||||||
Name | Stéphane Chazelas | ||||||
Organization | |||||||
User Reference | |||||||
Section | awk | ||||||
Page Number | |||||||
Line Number | |||||||
Interp Status | Approved | ||||||
Final Accepted Text | Note: 0004019 | ||||||
Summary | 0001105: problems with backslashes in awk strings and EREs | ||||||
Description |
In awk's string constants ("string"), \ is used to quote itself and " but also to introduce escape sequences like \n, \b, \56... That is specified in the spec.

The spec also specifies that the same happens for awk -v var='string' and awk '{code}' var='string', though I wish it made that clearer and, for instance, added a warning along the lines of "applications wishing to pass arbitrary strings by way of -v var="$value" or var="$value", should make sure backslashes are escaped or use the ENVIRON or ARGV methods instead", as that's a source of many bugs.

Now, the spec gives the same escaping rules for "...", /ERE/, -v var=... and var=... In particular, it says \/ means a slash for all and not only for /ERE/, and that \" means a " for all and not only for "...". IOW, it says -v a='\@' for instance is unspecified, but that -v a='\"' is specified and means the same as -v a='"'.

In practice that's not the case.

Solaris 10:

    $ /usr/xpg4/bin/awk -v 'a=\"' 'BEGIN {print a}'
    \"
    $ gawk 'BEGIN{print "\/"}'
    gawk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
    /

When it comes to regexps and escape sequences, it gets worse. It looks like the spec specifies (to some extent) some accident of implementation of the historical awk implementation. It requires for instance that both:

    "\f" ~ "\f"

and

    "\f" ~ "\\f"

be true. The first one matches against an ERE that consists of a formfeed character. The second matches against a "\f" ERE that is meant to match the formfeed character. IOW, the backslash-f sequence is understood as a RE operator. And /ERE/ is not a RE quoting operator that allows for escape sequences like "...", but the /.../ are like strong quotes and it's up to the regexp engine to interpret those \f.

If I understand correctly, it requires "x" ~ /\56/ (on ASCII systems where \56 is ".") to return false.

While that's what I observe with bwk's awk (or FreeBSD awk based on it), that's not what I observe with many other modern (even certified) awk implementations:

    $ /usr/xpg4/bin/awk 'BEGIN{print "\f" ~ "\\f"}'
    0
    $ /usr/xpg4/bin/awk 'BEGIN{print "x" ~ /\56/}'
    1
    $ gawk 'BEGIN{print "\f" ~ "\\f"}'
    1
    $ gawk 'BEGIN{print "x" ~ /\56/}'
    1
    $ busybox awk 'BEGIN{print "\f" ~ "\\f"}'
    0
    $ busybox awk 'BEGIN{print "x" ~ /\56/}'
    1

Also, with the confusion between "\" as a RE quoting operator, as the " and / escaping operator and its role in ANSI C escape sequence expansion, it is unclear how we should understand \ inside bracket expressions. It says awk EREs should be the POSIX EREs (where \ is not special within [...]), and also that the ANSI C escape sequences should be supported inside bracket expressions. \\ is listed as a C escape expansion. Anything other than the list given is unspecified, so it's even unclear whether /\./ is specified or if /\\./ means the \. or \\. ERE (if it were not for the example section that suggests the latter).

In ERE, [\xy] matches \, x or y, what should it be in awk?

Note that I can find awk implementations (like nawk on Solaris) where "[]]" or "[^]]" can't be used to match a "]" or non-]. Most awk implementations I've tried work with [\]] or [^\]], the only exception I've found being Solaris 10 awk.

Also it doesn't make it clear what "\777" should expand to. Historical implementations expand to the same as "\377" but some expand to the same as "\0777" (taken as \77 followed by 7). |
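The two expressions compared above can be checked against any locally installed awk with a small probe. This is only a sketch: the function name `probe_awk_escapes` and the `AWK` variable are hypothetical, and no expected output is given because, as the report shows, the results differ between implementations (some may also print warnings on stderr).

```shell
#!/bin/sh
# Hypothetical probe: set AWK to the implementation under test (default: awk).
probe_awk_escapes() {
    "${AWK:-awk}" 'BEGIN {
        # Formfeed matched against the dynamic regex built from the string "\\f".
        print "formfeed-vs-dynamic-regex", ("\f" ~ "\\f")
        # Letter x matched against an ERE token containing an octal escape.
        print "x-vs-octal-56-ere", ("x" ~ /\56/)
    }'
}

probe_awk_escapes
```

Each line ends in 1 (match) or 0 (no match); for example `AWK=gawk probe_awk_escapes` probes gawk, if it is installed.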
Desired Action |
It should be possible to have the spec cover most of the existing behaviour without affecting functionality too much, as implementations are currently non-compliant mostly in areas that are corner cases unlikely to be used in practice. IOW, I think the spec should stop requiring behaviours that could be regarded as accidents of implementation of historic nawk implementations.

When it comes to \", \/, they should only be specified to escape the corresponding delimiter for /ERE/ and "string".

    "\/"       -> unspecified
    -v a='\/'  -> unspecified
    -v a='\"'  -> unspecified
    /\"/       -> unspecified
    /\//       -> specified
    "\""       -> specified

Don't make the \a, \b... part of the ERE syntax. awk EREs to be POSIX EREs except that within bracket expressions "\" has to be doubled ($0 ~ "[\\x]" (or /[\x]/) unspecified, has to be $0 ~ "[\\\\x]" (or /[\\x]/) to match \ or x).

In the ERE syntax, \a, \b... to be unspecified like in normal EREs ($0 ~ "\\a" unspecified), but the /ERE/ syntax allows for them (/\a/ or /[\a]/ shall match the BEL character).

In /ERE/ make it unspecified whether those expansions happen before or as part of the regex matching (unspecified whether /\56/ on ASCII-based systems matches any character or just "." or if [a\55z] is like [a-z] or [-az]).

However make it clear that outside of bracket expressions /\X/ where X is an ERE operator (including \ itself) means the \X ERE (matches the X character literally).

Since POSIX only supports 8-bit bytes, and to allow implementations to treat \456 as \0456, behaviour unspecified for \ddd where ddd > 377.

Please also consider adding a note about -v var=value or assignment arguments having problems with backslashes and the safer ENVIRON/ARGV alternatives.

Features we'd lose with that change:

    PATTERN='\f' awk '$0 ~ ENVIRON["PATTERN"]'

currently required to match a formfeed, would cease to be specified. But in practice, that already fails with some implementations like Solaris /usr/xpg4/bin/awk or busybox awk.

    LC_ALL=C awk '$0 ~ /[\123-\234]/'

could not be used reliably for some values in place of 123 or 234 (like the code point for "^" or "]" instead). Again, in practice that's already the case with several implementations. |
Tags | tc3-2008 | ||||||
Attached Files | |||||||
Relationships | ||||||
Notes | |
(0003999) McDutchie (reporter) 2018-04-25 22:27 |
> [...] for instance added a warning along the
> lines of "applications wishing to pass arbitrary strings by way
> of -v var="$value" or var="$value", should make sure backslashes
> are escaped or use the ENVIRON or ARGV methods instead" as that's
> a source of many bugs.

I second that.

> When it comes to regexps and escape sequences, it gets worse.

This also affects the replacement argument in sub() and gsub(). One of my awk things outputs shell-quoted arguments for the shell to eval. To shell-quote, on Solaris awk we need

    gsub(/'/, "'\\\\''"); print ("'")($0)("'");

but on every other awk I've come across we need

    gsub(/'/, "'\\''"); print ("'")($0)("'");

It seems obvious on the surface that the Solaris behaviour is a bug, but the standard may be read to specify the unique Solaris behaviour. At the sub(ere,repl[,in]) spec, it says for the "repl" argument: "An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character". But that is also specified for double-quoted strings, so by the literal text of the standard, it seems like double-escaping the backslashes is needed so Solaris is in the right and everyone else in the wrong. |
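One way to sidestep the disagreement over the replacement argument is to avoid sub() and gsub() altogether and build the quoted string with split() and concatenation, which involve only string-constant escapes. The following is a minimal sketch under that assumption, not the code from the note; the function name `shell_quote` is hypothetical, and `\047` is the octal escape for a single quote.

```shell
#!/bin/sh
# Hypothetical helper: shell-quote each input line without relying on
# the replacement-string rules of sub()/gsub().
shell_quote() {
    awk '
    {
        # Split the line at every single quote ("\047").
        n = split($0, parts, "\047")
        out = parts[1]
        # Rejoin the pieces with the four characters: quote, backslash, quote, quote.
        for (i = 2; i <= n; i++)
            out = out "\047\\\047\047" parts[i]
        # Wrap the whole result in single quotes.
        print "\047" out "\047"
    }'
}

printf '%s\n' "it's a test" | shell_quote
```

The output can be handed back to the shell, for instance with `eval "set -- $(printf '%s\n' "it's a test" | shell_quote)"`, after which `$1` holds the original line.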
(0004000) joerg (reporter) 2018-04-26 08:53 |
Did you use the POSIX /usr/xpg4/bin/awk or the non POSIX but SVID-3 awk? |
(0004014) geoffclare (manager) 2018-04-30 09:59 |
When this was discussed in the April 26th teleconference, one point that was made is that the statement in the bug description:

| And /ERE/ is not a RE quoting operator that allows for escape
| sequences like "...", but the /.../ are like strong quotes and
| it's up to the regexp engine to interpret those \f.

is incorrect. See "Lexical Conventions" item 6 (on page 2507). For the ERE token /\56/ the ERE seen by the regexp engine contains a <period> (assuming ASCII encoding). |
(0004015) stephane (reporter) 2018-04-30 10:39 edited on: 2018-04-30 12:05 |
Re: Note: 0004014 I must admit I don't remember how I reached that conclusion as it's been over a year since I looked into that. But from what I read of that "item 6", in itself, it would mean that awk '/\./' would be unspecified and awk '/\\./' would match on literal "." (it would yield the "\." ERE which matches a literal dot), and /.../ would be no different from $0 ~ "...", which is not true in any awk implementation (one of the main points of having /.../ being that we can express EREs less awkwardly than with "..."). |
(0004016) McDutchie (reporter) 2018-04-30 11:06 edited on: 2018-04-30 12:28 |
Re: Note: 0004000 I used the POSIX /usr/xpg4/bin/awk, and that is where I observed the behaviour described in Note: 0003999. |
(0004017) geoffclare (manager) 2018-04-30 11:26 |
Re: Note: 0004015 We agreed with your other points and are going to make changes to address them. This will include a fix for the /\./ issue (which consequently will mean /pat/ is different from $0 ~ "pat"). /\\./ is a new one which will need further consideration. |
(0004018) stephane (reporter) 2018-04-30 12:00 edited on: 2018-04-30 12:05 |
Re: Note: 0003999 Yes, I came across that recently as well (though I didn't find the POSIX spec ambiguous then) at https://unix.stackexchange.com/questions/439752/replace-pattern-each-time-with-a-different-string-taken-from-external-file/439754#439754 I had to use gsub(/[&\\]/, "\\\\&", repl) to "escape" the & and \ characters in "repl" (so it can later be used verbatim in another sub()), and for gawk, I had to enable the POSIX mode for that "\\\\&" to mean a literal backslash followed by the matched string. I noticed it didn't work in heirloom awk_su3 (presumably the same as Solaris awk), which I assumed was a bug. I agree that if the spec is not already clear there, it would be worth addressing as part of that same bug. |
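When both the search text and the replacement are meant literally, an alternative to escaping & and \ for sub() is to do the substitution with index() and substr(), which apply no escape processing at all. This is a different technique from the gsub() call quoted above, shown only as a sketch; the function name `literal_replace` and the environment variable names `FIND` and `REPL` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace every occurrence of the literal string $1
# with the literal string $2 on each input line.
literal_replace() {
    FIND=$1 REPL=$2 awk '
    BEGIN {
        # ENVIRON delivers both strings without backslash processing.
        find = ENVIRON["FIND"]
        repl = ENVIRON["REPL"]
        flen = length(find)
    }
    {
        rest = $0
        out = ""
        # Copy the text before each match, then the replacement, verbatim.
        while (flen > 0 && (pos = index(rest, find)) > 0) {
            out = out substr(rest, 1, pos - 1) repl
            rest = substr(rest, pos + flen)
        }
        print out rest
    }'
}

printf '%s\n' 'a.b.c' | literal_replace . '\&'
```

Here the "." is matched as a plain character rather than as an ERE operator, and the `\&` replacement is inserted as two literal characters.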
(0004019) geoffclare (manager) 2018-05-03 15:54 edited on: 2018-05-17 15:16 |
Interpretation response
------------------------
The standard states the requirements for handling of <backslash> in awk, and conforming implementations must conform to this. However, concerns have been raised about this which are being referred to the sponsor.

Rationale:
-------------
The standard does not match historical and current practice.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
On page 2491 line 80125 section awk change:

    The awk utility shall make use of the extended regular expression notation (see [xref to XBD 9.4]) except that it shall allow the use of C-language conventions for escaping special characters within the EREs, as specified in the table in XBD Chapter 5 (on page 121) (<tt>'\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v'</tt>) and the following table; these escape sequences shall be recognized both inside and outside bracket expressions.

to:

    The awk utility shall make use of the extended regular expression notation (see [xref to XBD 9.4]) except that it shall allow the use of C-language conventions for escaping special characters within the EREs, as specified in the table in XBD Chapter 5 (on page 121) for <tt>'\a', '\b', '\f', '\n', '\r', '\t', '\v'</tt> and in the following table for other sequences; these escape sequences shall be recognized both inside and outside bracket expressions.

On page 2492 line 80136 section awk, change the Meaning column from:

    <quotation-mark> character

to:

    In the lexical token STRING, <quotation-mark> character. Otherwise undefined.

On page 2492 line 80137 section awk, change the Meaning column from:

    <slash> character

to:

    In the lexical token ERE, <slash> character. Otherwise undefined.

On page 2492 line 80143 section awk, add to the Description column:

    If the digits produce a value greater than octal 377, the behavior is undefined.

On page 2492 line 80144 section awk, add two new rows to the table:

    \., \[, \(, \*, \+, | A <backslash> character followed   | In the lexical token ERE when
    \?, \{, \|, \^, \$  | by a character that has a special  | not inside a bracket expression,
                        | meaning in EREs (see [xref to XBD  | the sequence shall represent
                        | 9.4.3]), other than <backslash>.   | itself. Otherwise undefined.
    --------------------+------------------------------------+---------------------------------
    \\                  | Two <backslash> characters.        | In the lexical token ERE,
                        |                                    | the sequence shall represent
                        |                                    | itself. In the lexical token
                        |                                    | STRING, it shall
                        |                                    | represent a single <backslash>.

(Note that the extra blank line is a Mantis artifact.)

On page 2492 line 80161 section awk, change:

    Note that these same escape conventions shall also be applied ...

to:

    Note that these escape conventions shall also be applied ...

On page 2508 line 80877 section awk add a new first paragraph to APPLICATION USAGE:

    Since <backslash> has a special meaning both in the <assignment> option-argument to the -v option and in the assignment operand, applications that need to pass strings to awk without special interpretation of <backslash> should not use these methods but should instead make use of the ARGV or ENVIRON array. |
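The difference described in the new APPLICATION USAGE paragraph can be seen by passing the same string three ways. This is a minimal sketch: the function names `via_v`, `via_environ` and `via_argv`, the variable name `VALUE` and the sample value are all illustrative and do not come from the standard.

```shell
#!/bin/sh
# Illustrative value containing a backslash sequence.
value='a\tb'

# -v assignment: awk processes escape sequences, so \t becomes a <tab>.
via_v() { awk -v v="$1" 'BEGIN { print v }'; }

# ENVIRON: the string reaches awk without backslash interpretation.
via_environ() { VALUE=$1 awk 'BEGIN { print ENVIRON["VALUE"] }'; }

# ARGV: also passed through unmodified. A program with main rules would
# have to remove the operand from ARGV so it is not treated as a file name.
via_argv() { awk 'BEGIN { print ARGV[1] }' "$1"; }

via_v "$value"
via_environ "$value"
via_argv "$value"
```

The first call prints "a", a <tab> and "b"; the other two print the five characters of the value unchanged.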
(0004022) geoffclare (manager) 2018-05-04 15:25 |
Reopening as there is still an issue with backslash in bracket expressions that we need to address. |
(0004036) geoffclare (manager) 2018-05-17 15:17 |
Note: 0004019 has been updated to add "when not inside a bracket expression" to the line 80144 change, and to add the line 80161 change. |
(0004140) ajosey (manager) 2018-09-30 18:39 |
Interpretation Proposed: 30 September 2018 |
(0004162) ajosey (manager) 2018-11-12 19:47 |
Interpretation approved: 12 November 2018 |
Issue History | |||
Date Modified | Username | Field | Change |
2016-12-05 21:52 | stephane | New Issue | |
2016-12-05 21:52 | stephane | Name | => Stéphane Chazelas |
2016-12-05 21:52 | stephane | Section | => awk |
2018-04-25 22:27 | McDutchie | Note Added: 0003999 | |
2018-04-26 08:53 | joerg | Note Added: 0004000 | |
2018-04-30 09:59 | geoffclare | Note Added: 0004014 | |
2018-04-30 10:39 | stephane | Note Added: 0004015 | |
2018-04-30 11:06 | McDutchie | Note Added: 0004016 | |
2018-04-30 11:26 | geoffclare | Note Added: 0004017 | |
2018-04-30 12:00 | stephane | Note Added: 0004018 | |
2018-04-30 12:05 | stephane | Note Edited: 0004018 | |
2018-04-30 12:05 | stephane | Note Edited: 0004015 | |
2018-04-30 12:28 | McDutchie | Note Edited: 0004016 | |
2018-05-03 15:54 | geoffclare | Note Added: 0004019 | |
2018-05-03 15:57 | geoffclare | Note Edited: 0004019 | |
2018-05-03 15:59 | geoffclare | Interp Status | => Pending |
2018-05-03 15:59 | geoffclare | Final Accepted Text | => Note: 0004019 |
2018-05-03 15:59 | geoffclare | Status | New => Interpretation Required |
2018-05-03 15:59 | geoffclare | Resolution | Open => Accepted As Marked |
2018-05-03 15:59 | geoffclare | Tag Attached: tc3-2008 | |
2018-05-03 16:00 | geoffclare | Note Added: 0004020 | |
2018-05-03 16:00 | geoffclare | Note Edited: 0004020 | |
2018-05-04 08:43 | geoffclare | Note Edited: 0004019 | |
2018-05-04 08:47 | geoffclare | Note Edited: 0004019 | |
2018-05-04 08:47 | geoffclare | Note Deleted: 0004020 | |
2018-05-04 15:25 | geoffclare | Note Added: 0004022 | |
2018-05-04 15:25 | geoffclare | Status | Interpretation Required => Under Review |
2018-05-04 15:25 | geoffclare | Resolution | Accepted As Marked => Reopened |
2018-05-17 15:11 | geoffclare | Note Edited: 0004019 | |
2018-05-17 15:12 | geoffclare | Note Edited: 0004019 | |
2018-05-17 15:14 | geoffclare | Note Edited: 0004019 | |
2018-05-17 15:16 | geoffclare | Note Edited: 0004019 | |
2018-05-17 15:17 | geoffclare | Note Added: 0004036 | |
2018-05-17 15:17 | geoffclare | Status | Under Review => Interpretation Required |
2018-05-17 15:17 | geoffclare | Resolution | Reopened => Accepted As Marked |
2018-09-30 18:39 | ajosey | Interp Status | Pending => Proposed |
2018-09-30 18:39 | ajosey | Note Added: 0004140 | |
2018-11-12 19:47 | ajosey | Interp Status | Proposed => Approved |
2018-11-12 19:47 | ajosey | Note Added: 0004162 | |
2019-07-31 14:34 | geoffclare | Relationship added | related to 0001277 |
2019-10-30 11:43 | geoffclare | Status | Interpretation Required => Applied |
2024-06-11 09:09 | agadmin | Status | Applied => Closed |