0001556: clarify meaning of \n used in a bracket expression in a sed context address or s-command

Notes
(0005621) geoffclare (manager) 2022-01-18 09:41	I believe the text is already clear what is required here. When it says "The escape sequence '\n' shall match a <newline> embedded in the pattern space", this clearly only applies when the backslash is special (otherwise it isn't an "escape sequence"). So \n does not match newline when it appears within \\n for example. The general rule for BREs is that backslash loses its special meaning in a bracket expression, therefore [\n] matches backslash or 'n', it does not match newline. GNU sed has the required behaviour when POSIXLY_CORRECT is set in the environment:
$ printf 'a\nb\n' | (POSIXLY_CORRECT=1 sed -n 'N;/a[\n]b/p')
$
$ printf 'a\nb\n' | (unset POSIXLY_CORRECT; sed -n 'N;/a[\n]b/p')
a
b
$

(0005622) calestyo (reporter) 2022-01-18 13:26 edited on: 2022-04-02 02:38	I personally wouldn't agree with that. As described in 1551, it's nowhere really specified how the rules from REs themselves and what sed defines in addition work together, or which of them wins in case of conflict. E.g. sed also says that a delimiter preceded by \ would be taken literally, where it's also not clear what: snAAA[\n]nXXXn should be taken for: - \n being a literal n? - \n being a newline? (So that would actually be another point... which of those two "wins"?) "So \n does not match newline when it appears within \\n for example." => IMO that ends up with the ambiguity I tried to describe in 1551. There doesn't seem to be a general rule how the strings are parsed, and the text says just: "The escape sequence '\n' shall match a <newline> embedded in the pattern space" It doesn't say however: "… unless the preceding \ is itself quoted or the '\n' appears in a context (like bracket expression) where the \ is already taken literally" "The general rule for BREs is that backslash loses its special meaning in a bracket expression" => yes, but the BRE section also specifies: "The interpretation of an ordinary character preceded by an unescaped <backslash> ( '\\' ) is undefined, except for:" Which is already "overridden" by the rules in sed (i.e. that \n does have a meaning), so I don't think that one could argue that any "general rules for BREs" would have clear precedence over what sed says in addition.

(0005623) kre (reporter) 2022-01-18 16:30	Re Note: 0005622 I think you're misreading "The escape sequence '\n' shall ..." as: "The two character sequence backslash followed by n, which we call an escape sequence, shall ..." Whereas what it really is, is: "The escape sequence (which is an escape character, followed by some other defined sequence of characters [in sed, I think, but might be wrong, only ever one following character]) with an 'n' as its second and final character (that is '\n') shall ..." If you read it that way, you can see there is no need for anything to ever say anything like "… unless the preceding \ is itself quoted or ..." as in those situations there is no \n escape sequence. With \\n there is an escape sequence, \\ which produces a literal \ in the output, and there is a plain ordinary n. In [\n] there is no escape sequence at all, as there's no escape character there. That \ happens to look like the escape character, in situations where it isn't, is immaterial: either we have an escape char, or a backslash; they aren't the same thing, whatever they look like.
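
The reading in Note 0005623 can be illustrated with a small tokenizer. This is only a sketch of that reading, not sed's actual parser: the function names `tokenize` and `newline_escapes` are hypothetical, and bracket expressions are simplified (character classes such as `[:alpha:]` and collating elements are not handled).

```python
def tokenize(bre):
    """Split a BRE into (kind, text) tokens under the reading of Note 0005623.

    Assumption: a backslash outside a bracket expression is an escape
    character that always consumes exactly one following character; inside
    a bracket expression it is an ordinary member of the set.
    """
    tokens, i = [], 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            # Escape character plus the character it applies to.
            tokens.append(("escape", bre[i:i + 2]))
            i += 2
        elif ch == "[":
            j = i + 1
            if bre[j:j + 1] == "^":   # leading '^' negates the set
                j += 1
            if bre[j:j + 1] == "]":   # ']' placed first is a literal member
                j += 1
            end = bre.find("]", j)
            if end == -1:
                raise ValueError("unterminated bracket expression")
            # The whole bracket expression is one token; no escapes inside.
            tokens.append(("bracket", bre[i:end + 1]))
            i = end + 1
        else:
            tokens.append(("char", ch))
            i += 1
    return tokens


def newline_escapes(bre):
    """Count the '\\n' escape sequences that would match a <newline>."""
    return sum(1 for kind, text in tokenize(bre)
               if kind == "escape" and text == "\\n")


for pattern in (r"a\nb", r"a\\nb", r"a[\n]b"):
    print(repr(pattern), newline_escapes(pattern), tokenize(pattern))
```

Under this model only the first of the three patterns contains a `\n` escape sequence; in `a\\nb` the escape sequence is `\\`, and in `a[\n]b` the backslash is just a member of the set.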

(0005624) calestyo (reporter) 2022-01-18 17:12	Well one can read it like that, but it feels that requires quite some imagination and is not really obvious. Plus that concept doesn't seem to be used throughout the standard: In most places \ is rather regarded as "special character", e.g. \1 or \2. Or you have phrases like "The interpretation of an ordinary character preceded by an unescaped <backslash> ( '\\' ) is undefined, except for:" => If the standard would always use the semantics as you imply, there should be no need here to mention that the backslash needs to be unescaped, because it would be "clear" that it needs to be so, in order to have an escaping meaning on the following character. Or "Within the BRE and the replacement, the BRE delimiter itself can be used as a literal character if it is preceded by a <backslash>." => there it should (probably?) also mean the escape sequence \ delimiter ... but it doesn't say that... it merely says delimiter preceded by backslash. >in [\n] there is no escape sequence at all, >as there's no escape character there. => Cannot really agree, that this is so obviously clear. If I just follow what's written in 9. Regular Expressions, then yes, it would be clear. But with what's written in "Regular Expressions in sed", I don't think it's really 100% clear what's meant. In the first bullet point here: »In a context address, the construction "\cBREc", where c is any character other than <backslash> or <newline>, shall be identical to "/BRE/". If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the BRE. For example, in the context address "\xabc\xdefx", the second x stands for itself, so that the BRE is "abcxdef".« It doesn't use any such meaning as "escape sequence" in the sense that it only applies when it really is one. In the 2nd bullet point there: »The escape sequence '\n' shall match a <newline> embedded in the pattern space. 
A literal <newline> shall not be used in the BRE of a context address or in the substitute function.« it does... but it's in principle still open, whether that rule is applied before or (from left-to-right) at the same time with the normal rules for REs. Similar to above, what would the following be: snAAA\nnXXXn There is no [...], so it would be a perfect escape sequence in your sense. But it also fits the "If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the BRE."

(0005625) shware_systems (reporter) 2022-01-18 18:41	To me the controlling language for the 's' command is at page 3137, line 106222, Draft 2.1, to wit: Any <backslash> used to alter the default meaning of a subsequent character shall be discarded from the RE or the replacement before evaluating the RE or using the replacement. From this the conforming behavior is \n is converted to <newline>, even inside bracket expressions. For context addresses it is ambiguous, as far as I can see, whether the 2nd bullet at line 106092 supersedes 9.3.5. For consistency with the 's' command it should, in my opinion, be explicitly converted before any other evaluation of the RE commences. The letter <n> should be added to <backslash> and <newline> as disallowed alternate delimiters, because it does have a meaning other than as a literal <n>. I.e. at line 106087-8 and 106203 change: character other than <backslash> to: character other than <n>, <backslash>
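
The "convert before any other evaluation" reading in Note 0005625 amounts to a pre-pass over the RE text. The sketch below is one possible model of that reading, not a statement of what any sed does; the function name `preconvert_newlines` is hypothetical, and the choice to scan escape pairs left to right (so that `\\n` is left alone) is an assumption the note does not spell out.

```python
def preconvert_newlines(re_text):
    """Replace every backslash-n pair with a real newline before RE parsing.

    Assumption: the text is scanned left to right and a backslash always
    pairs with the next character, so in backslash-backslash-n the pair is
    the two backslashes and the 'n' stays an ordinary character. Bracket
    expressions get no special treatment, which is the point of this reading.
    """
    out, i = [], 0
    while i < len(re_text):
        if re_text[i] == "\\" and i + 1 < len(re_text):
            pair = re_text[i:i + 2]
            out.append("\n" if pair == "\\n" else pair)
            i += 2
        else:
            out.append(re_text[i])
            i += 1
    return "".join(out)


# The bracket expression now contains a real <newline> character.
print(repr(preconvert_newlines(r"a[\n]b")))
# An escaped backslash followed by 'n' is left untouched.
print(repr(preconvert_newlines(r"a\\nb")))
```

With this pre-pass, `[\n]` reaches the RE parser as a bracket expression holding a <newline>, which is the outcome the note argues for and the opposite of the reading in Note 0005621.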

(0005626) calestyo (reporter) 2022-01-18 21:17 edited on: 2022-04-02 22:00	That: »Any <backslash> used to alter the default meaning of a subsequent character shall be discarded from the RE or the replacement before evaluating the RE or using the replacement.« (page 3137, line 106222, Draft 2.1) Hmm that's quite "hidden" between paragraphs only dealing with the replacement. But IMO it merely says, that the escaping \ is "removed" (after it has done it's job)... which - for a change - I think was in fact already clear. One could, however, indeed follow from that, that [\n] is in fact a bracket expression containing a newline, cause the \ needs to be discarded BEFORE evaluating the RE/replacement. But even then... things are so scattered over many different places... and quite ambiguously written... And that would e.g. mean that GNU sed does it just right withOUT POSIXLY_CORRECT, and just wrong WITH). And it still wouldn't explain, whether \n in a sed command is newline or the delimiter n, if the delimiter was n. What I would have kinda wanted is a clear algorithm like the following (just a hypothetical one): » When REs and/or replacements are used in context addresses respectively the s-command, the following applies: The string is parsed from left to right, with the rules for REs and the specific rules for their use within delimiters being applied at the same time according to the following precedence: 1) A '\' (that is itself not escaped with '\') followed by a delimiter character, causes (the '\' to be removed and) the delimiter character not to be interpreted as delimiter but as normal part of the RE respectively the replacement. This also means, that if the delimiter is a RE/replacement special character, that it will have the special meaning with respect to the RE respectively the replacement and that it won't be possible to get its literal meaning (with that delimiter). For example 's.\..x.' is the same as 's/./x/' and 's/\./x/' cannot be obtained. 
It further means, that if the delimiter is a character, that would get its RE/replacement special meaning only when preceded by a `\` (that is itself not escaped with '\'), that this character always retains its literal meaning with respect to the RE/replacement (and that its special meaning cannot be gained with that delimiter). For example 's(\((x(' is the same as 's/(/x/' and 's/\(/x/' cannot be obtained. [Depending on how it should work:] This is also the case when inside a RE bracket expression. 2) In the RE (but not the replacement), when the character 'n' is preceded by '\' (that is itself not escaped with '\') AND when rule (1) didn't apply (that is: when the delimiter is not 'n'), it shall be interpreted as a newline character. [So in this example, \n being a delimiter would win over \n being a newline] [Depending on how it should work:] This is also the case when inside a RE bracket expression. 3) If neither (1) nor (2) applied, the rules for RE (see chapter...) respectively the replacement shall apply. « (beware that I haven't spent too much time on this, so it might not actually cover all cases or not even those I've brought up in #0001551) - Placing (3) as 3rd would also already make clear (in that example), that s/[\n]/x/ would in fact be a bracket expression with a newline in it, because the newline rule (2) comes before (3) (which "contains" the rule that everything in a BE is literal). - Whereas placing (2) after (1), would make clear that sn\nnxn is effectively s/n/x/ and not s/\n/x/ . - And (1) would make clear, what "literal" means when a delimiter is escaped by \ ... here that it retains its special meaning, when it would have one, respectively wouldn't gain a special meaning when it wouldn't have one. E.g. s.\..x. would be s/./x/ (and not s/\./x/) ... and s($x$(x( would be s/(x\)/x/ and not s/$x$/x/ . Of course one could also define all that differently (e.g. that [\n] would NOT be a newline in a BE)... 
but the above is IMO what a proper definition would look like, without having multiple pieces of text that could be part of the definition scattered over n places, where one needs to guess about the meant context (like as in "escape sequence" means by its context that it cannot be escaped itself). And even if something of it couldn't be clearly specified (because of already incompatible major implementations), it should clearly say which behaviour is undefined.
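
The hypothetical precedence rules (1) to (3) from Note 0005626 can be sketched as a single left-to-right pass. This models only the proposal in that note, with the "also inside a bracket expression" option switched on; it is not what the standard requires or what any sed implements. The function name `split_s_command` is hypothetical, and an unescaped delimiter always ends the current field here, which is a simplification.

```python
def split_s_command(script):
    """Split 's<d>RE<d>replacement<d>flags' using rules (1)-(3) of the note.

    Returns (re_text, replacement, flags). Escaped delimiters lose their
    backslash; backslash-n in the RE becomes a real newline unless 'n' is
    the delimiter; every other escape pair is passed through unchanged.
    """
    if len(script) < 2 or script[0] != "s":
        raise ValueError("not an s command")
    delim = script[1]
    if delim in ("\\", "\n"):
        raise ValueError("<backslash> and <newline> cannot be delimiters")
    parts = [[], []]          # parts[0] is the RE, parts[1] the replacement
    field, i = 0, 2
    while i < len(script) and field < 2:
        ch = script[i]
        if ch == "\\" and i + 1 < len(script):
            nxt = script[i + 1]
            if nxt == delim:                 # rule (1): escaped delimiter
                parts[field].append(nxt)
            elif nxt == "n" and field == 0:  # rule (2): newline, RE only
                parts[field].append("\n")
            else:                            # rule (3): left to the RE rules
                parts[field].append(ch + nxt)
            i += 2
        elif ch == delim:                    # unescaped delimiter ends field
            field += 1
            i += 1
        else:
            parts[field].append(ch)
            i += 1
    if field < 2:
        raise ValueError("unterminated s command")
    return "".join(parts[0]), "".join(parts[1]), script[i:]


for cmd in (r"s.\..x.", r"s(\((x(", r"sn\nnxn", r"s/[\n]/x/"):
    print(repr(cmd), split_s_command(cmd))
```

Because rule (1) is checked before rule (2), `sn\nnxn` yields the RE `n` rather than a newline, and because rule (2) is checked before the ordinary RE rules, `s/[\n]/x/` yields a bracket expression containing a <newline>. Swapping the order of the branches would model the alternative definitions the note mentions.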

(0005762) geoffclare (manager) 2022-03-25 16:37	In my opinion the only point raised in this bug that had any need to be addressed is the part of Note: 0005622 that gave the example of snAAA[\n]nXXXn but only because of the ambiguity pointed out in bug 0001550 regarding delimiter escaping. I believe the changes proposed in Note: 0005761 for the s command resolve that ambiguity and therefore no further changes are needed for this bug.

(0005778) calestyo (reporter) 2022-04-03 01:59	Geoff, I've looked at your proposal at https://austingroupbugs.net/view.php?id=1550#c5761 [^] and with respect to this ticket I'd say the following: Then obvious consensus seems to be, that regardless of context addresses or s-command.... in any RE (BRE respectively ERE), anything within a bracket expression (except for the characters special to that itself, like a leading '^' or a '-' that is part of a range) is considered to be literal. Thus '\' never serves as an escape character, thus there are no escape sequences with a RE bracket expression. Right?! a) It was said in previous note, that this would have already been clear, because the paragraph says "The escape sequence '\n'…",... and in a RE bracket expression, this would simply not be an escape sequence as the '\' is literal. In a way that feels temptingly reasonable, especially with the upcoming changes that Geoff mentioned in https://austingroupbugs.net/view.php?id=1546#c5768 [^] (with which the RE chapter now clearly defines what an escape sequence is). However... - There are many kinds of escape sequences in POSIX. It's IMO not absolutely clear by the term itself, that the escape sequence '\n' defined on page 3134 line 106092 in the sed chapter, is necessarily of the same kind than those of the REs. After all, the escape sequence defined by the sed chapter ('\n' and '\c' with c being a delimiter) could be interpreted as a completely different layer of escape sequences, e.g. parsed in a first step, while the RE resulting from that first pass is only parsed afterwards. - It does in fact get more clear, when one realises the strong weight of the sentence "Both BREs and EREs shall also support the following additions:" which supposedly kinda means.. »the following points (including the definition of \n and \c) are an additional part of the RE language, with respect to sed«. So in a way you're right,... strictly speaking it's already there,... 
but IMO that really requires to think around several edges: * first realising: this (\n and \c) is not just some sed specific thing, but that in the context og sed it's considered to be part of the RE language * then realising: in REs, it's (since recently) clear that the terms "escape sequence/character" always means those, that really act as such... and don't just define the string/character. * then realising: in REs, there's also the rule that <backslash> is literal within bracket expressions * then following, that e.g. '[\n]' is the same than [n\], i.e. either 'n' or <backslash> IMO, the fact that GNU sed breaks POSIX here so drastically (whereas in most other cases it would do it much more gracefully or rather just extend POSIX behaviour) seems already like an indicator, that it's not that obviously clear, even for experts. (Unless of course, GNU deliberately chose that way.) Also, the standard isn't just read by experts who know it more or less by hard ;-) ... and immediately understand every phrasing and it's deeper meaning. => For that reason, I still think, that it wouldn't harm if the definition in line 106092 would be amended by a small hint like "(note that within the bracket expression of a RE, <backslash> is always taken literally and not as escape character and thus there cannot be any escape sequences within such bracket expression)". (For '\c', as escaped delimiter, you also mention that - in the current proposal in "Addresses in sed" and in the s-command. You even say "unescaped <backslash> (except inside a bracket expression)"... so why not the same for '\n'?) b) If it would be more obviously clear, that in the context of sed, the escape sequence '\n' is not just some sed speciality, but an addition to the RE language, things like: "So \n does not match newline when it appears within \\n for example." would also become more obvious. That is, it would become clear, that sed cannot just e.g. 
first look for any '\n' and replace them with <newline> and only then does the RE parsing on the result. => So perhaps, instead of "Both BREs and EREs shall also support the following additions:" something like: "In the context of sed, REs should be considered as if they had been defined with the following additions (with other rules from REs also applying to them, too):" ? c) I've written in several places, e.g. also in ticket #1551, that it's IMO not definitely clear how sed parses scripts (context addresses or s-command) that include RE: - it could e.g. first look for any sed specialities, like '\n' (for newline) or '\c' (for a escaped delimiter), process those and only then parse the RE (exactly according to the RE rules) - or it could do all in one pass from left to right, considering sed specialities ('\n' and '\c') as integral part of the RE I guess it's informally clear, that it's the latter. The proposed change in (b) would make this IMO already quite clear, at least if the part that describe these extensions all move back to "Regular Expressions in sed". I mean the parts which describe how the escape sequence '\c' (with c the delimiter) is processed, in the current proposal: - in "Addresses in sed" - and in the s-command => Perhaps one should additionally also add something like: "The delimiters, RE and any extensions to regular expressions from sed (like '\n' and \c) are processed in one pass." d) If: - the upcoming changes that Geoff mentioned in https://austingroupbugs.net/view.php?id=1546#c5768 [^] - what I proposed above in (a), (b), (c) were adopted,... ... then I'd agree with what KRE said above in note #5623 and my objections from note #5624 would become moot, except for one point perhaps: In the current proposal, there is still: 'In a context address, any character other than <backslash> or <newline> can be specified for use as the delimiter by means of the construction "\cREc", where c is the chosen delimiter character.' 
For that very first \c (which isn't an escaped delimiter in the sense of "the RE should see it as literal") it's never mentioned that this is an escape sequence, but it's rather called "construction". Only, if one again thinks around some edges, one realises: - '\\cREc' wouldn't be something strange like, a literal \ followed by a delimiter c of a context address... it would be an attempt to make '\' the delimiter and 'c' would actually be part of the RE. - However, having '\' as delimiter is forbidden. Thus there's no strict need to mention that the '\c' must be an escape sequence respectively mustn't be preceded by an unescaped <backslash>. I still wonder whether we should hint that somehow, like a »('\c' is always an escape sequence as its escape character cannot be validly escaped)«. But I wouldn't insist on that (not that my opinion would matter much anyway ;-) ). e) shware_systems mentioned in note #5625 above, the page 3137, line 106222 which says: "Any <backslash> used to alter the default meaning of a subsequent character shall be discarded from the RE or the replacement before evaluating the RE or using the replacement." => I would start a new paragraph before that. It doesn't seem to be related to the other sentence ("A substitution shall be considered to have been performed") of that paragraph. Second, what does it mean? - Any <backslash> used to alter the default meaning of the following character... That would include: - '\n' - '\c' (with c being the delimiter) - with the changes proposed in https://austingroupbugs.net/view.php?id=1546, [^] depending on the implementation '\+' - in principle even the special chars from BRE/ERE and those that become special when preceded by backslash - And it's discarded before the RE is evaluated, respectively the replacement is used. As shware_systems says... that could e.g. be interpreted that [\n] would be a bracket expression with <newline>. => Doesn't that sentence need to be clarified? 
f) Page 167, line 5802 says: "When not inside a bracket expression, the interpretation of an ordinary character preceded by an unescaped <backslash> is undefined, except for..." and page 173, line 6044 says: "When not inside a bracket expression, the interpretation of an ordinary character preceded by an unescaped <backslash> is undefined, except for…" The additions by sed ('\c' and '\n') obviously touch this. Shall we do anything about that? Like e.g. adding a note to the RE chapter, that for REs in the context of sed, there are more definitions? I wouldn't strongly insist on that (not that my opinion would count much anyway ^^)... but since implementations do extend the RE language (like GNU's '\s', etc.) it might be reasonable to note, that in the context of sed, the escape sequence '\n' is already used and '\c' with c being the delimiter is also special. g) The last unclear point, with respect to this ticket, is IMO: What the following means: snAAA\nnXXXn Is '\n' a: - escaped delimiter and thus the command equivalent to: s/AAAn/XXX/ or a: - newline and thus the command equivalent to: s/AAA<newline>/XXX/ ?? The following is tricky: snAAA[\n]nXXXn It would IMO be already clear, if: - the upcoming changes that Geoff mentioned in https://austingroupbugs.net/view.php?id=1546#c5768 [^] - what I proposed above in (a), (b), (c) were adopted. I.e. '\n' (for newline) and '\c' (which is here also '\n') would need to be considered part of the RE language (in the context of sed). Only then, it would be fully clear, that the s-command above is equivalent to: 's/AAA[n/]/XXX/' which is the same as 's%AAA[n/]%XXX%' h) Unrelated to this ticket, but since it just fits in: What in the standard makes it clear that in 's/AAA[n/]/XXX/' the 2nd '/' (the one in the bracket expression) is not a delimiter? It's the non-escaped delimiter, which (unlike \c) is not defined to be part of the RE language... 
so without what I propose in (c) above (which also includes non-escaped delimiters) one could still think of a hypothetical 2-pass parsing that gives: RE: AAA[n replacement: ] flags: XXX/ The RE rules on page 168, line 5840 merely say that '\' and the BRE respectively ERE special characters lose their special meanings (in bracket expressions of BREs respectively EREs). '/' however isn't special to any of them... so one could argue, that it retains the special meaning (as delimiter).
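
The two readings contrasted in point (h) can be compared side by side. This is a hedged sketch, not a description of any real sed: the function name `split_fields` and its `bracket_aware` switch are hypothetical, bracket expressions are simplified (no `[:class:]`, `[.x.]` or `[=x=]` handling), and input is assumed to start with `s` followed by a delimiter.

```python
def split_fields(script, bracket_aware):
    """Split an s command into (RE, replacement, rest).

    With bracket_aware=True the scan is a single pass in which a delimiter
    inside an RE bracket expression is an ordinary set member. With
    bracket_aware=False every unescaped delimiter ends a field, which models
    the hypothetical two-pass parsing from point (h).
    """
    delim = script[1]
    fields = ["", ""]
    field, i, in_bracket = 0, 2, False
    while i < len(script) and field < 2:
        ch = script[i]
        if in_bracket:
            fields[field] += ch
            if ch == "]":                 # closing bracket ends the set
                in_bracket = False
            i += 1
        elif ch == "\\" and i + 1 < len(script):
            fields[field] += script[i:i + 2]   # keep escape pairs intact
            i += 2
        elif ch == delim:
            field += 1
            i += 1
        elif ch == "[" and bracket_aware and field == 0:
            in_bracket = True
            fields[field] += ch
            i += 1
            # A leading '^' and a ']' right after it are part of the set.
            for special in ("^", "]"):
                if script[i:i + 1] == special:
                    fields[field] += special
                    i += 1
        else:
            fields[field] += ch
            i += 1
    return fields[0], fields[1], script[i:]


print(split_fields("s/AAA[n/]/XXX/", bracket_aware=True))
print(split_fields("s/AAA[n/]/XXX/", bracket_aware=False))
```

The bracket-aware pass keeps `AAA[n/]` together as the RE, while the naive pass splits the command into the RE `AAA[n`, the replacement `]` and the trailing text `XXX/`, which is the split described in point (h).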

(0005826) geoffclare (manager) 2022-04-28 15:16	This is being closed as a duplicate of bug 0001550 as Note: 0005816 includes changes to address it.

Issue History
Date Modified	Username	Field	Change
2022-01-18 01:07	calestyo	New Issue
2022-01-18 01:07	calestyo	Name	=> Christoph Anton Mitterer
2022-01-18 01:07	calestyo	Section	=> Utilities, sed / 9.3.5 RE Bracket Expression
2022-01-18 01:07	calestyo	Page Number	=> -
2022-01-18 01:07	calestyo	Line Number	=> -
2022-01-18 09:41	geoffclare	Note Added: 0005621
2022-01-18 13:26	calestyo	Note Added: 0005622
2022-01-18 16:30	kre	Note Added: 0005623
2022-01-18 17:12	calestyo	Note Added: 0005624
2022-01-18 18:41	shware_systems	Note Added: 0005625
2022-01-18 21:17	calestyo	Note Added: 0005626
2022-01-18 21:19	calestyo	Note Edited: 0005626
2022-03-25 16:37	geoffclare	Note Added: 0005762
2022-03-31 16:00	nick	Relationship added	related to 0001550
2022-04-02 02:38	calestyo	Note Edited: 0005622
2022-04-02 22:00	calestyo	Note Edited: 0005626
2022-04-03 01:59	calestyo	Note Added: 0005778
2022-04-28 15:16	geoffclare	Note Added: 0005826
2022-04-28 15:16	geoffclare	Status	New => Closed
2022-04-28 15:16	geoffclare	Resolution	Open => Duplicate

Aardvark Mark IV