View Issue Details

ID: 0001084
Project: 1003.1(2016/18)/Issue7+TC2
Category: Shell and Utilities
View Status: public
Last Update: 2018-05-10 15:52
Reporter: Mark_Galeck
Assigned To:
Priority: normal
Severity: Editorial
Type: Error
Status: Resolved
Resolution: Duplicate
Name: Mark Galeck
Organization:
User Reference:
Section: 2.3 Token Recognition
Page Number: 2347-2348
Line Number: 74761-74780
Interp Status: ---
Final Accepted Text:
Summary: 0001084: rules 3, 4, and 5 do not say that a token is started, if needed
Description: Line 74749 says to "break its input into tokens by applying the first applicable rule", but if rule 3, 4, or 5 is the first applicable rule and a token is intended to start at the current character, then that token is never started.
Desired Action: Add to the end of rule 3: "In addition to this rule, continue to the next applicable rule for the current character, to decide if the current character starts a new token or is discarded."

Add to the ends of rules 4 and 5: "If there is no current token, then the current character starts a token."
Tags: No tags attached.

Relationships

duplicate of 0001085 Closed "token shall be from the current position in the input" is incorrect 
related to 0001100 Closed Rewrite of Section 2.10 Shell Grammar, of the Shell Standard, to fix previous reports, fix new issues, and improve presentation. 

Activities

shware_systems

2016-10-12 23:29

reporter   bugnote:0003408

This is handled by Rule 10, as the rule of last resort.

Mark_Galeck

2016-10-13 01:44

reporter   bugnote:0003409

No, it's not. That is my whole point. I repeat, line 74749 says "applying the _first_ applicable rule". If rule 3, 4, or 5 applies, that is it; there is no "rule of last resort".

shware_systems

2016-10-14 22:16

reporter   bugnote:0003416

Rules 3, 4, and 5 basically apply once a token is started, which isn't the case for the circumstance cited. As I read it, Rule 10 establishes that a word token is expected, as the first fully applicable rule; then rule 3, 4, or 5 applies with the same current char and with the token started as (possibly) a word. They may force a previous token to terminate, but they do not force the start of a token, because a sequence like "" amounts to empty text that will subsequently be discarded during quote removal as not materially part of the token, and an implementation employing limited look-ahead can safely pre-evaluate and skip over such sequences as an optimization. On some platforms such optimizations aren't practical, for architectural reasons.
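The point about "" being discarded during quote removal can be checked directly in a POSIX shell; this is just observable behavior, not text from the standard:

```shell
# "" opens and closes a quoting context without contributing any
# characters; after quote removal the word ""x is just x.
printf '[%s]\n' ""x
# A standalone "" still yields an (empty) field:
printf '[%s]\n' ""
```

The empty quotes still matter for the existence of a field, which is why they are only transparently skippable when attached to other characters.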

Granted, as stated this isn't immediately intuitive, but it has to cover recursive evaluation of tokens inside tokens too, as well as potential side effects of alias processing. A precondition/character-class/postcondition matrix would have many more rows than the rules do, however, especially if the additional rules incorporated by reference from 2.6 are added for the various substitution types.

It is the last resort because the other rules handle whether:
  the char starts an operator token,
  is text that should be ignored without starting a token, or
  is an interruption of the recognition context of a word token that
    already has non-discardable parts,
as higher-priority evaluations that may terminate preceding tokens as side effects. Rule 1 is first, I believe, because it has the additional side effect of completing the top-level productions of the grammar, as the highest priority. In other words, if the char can't be one of those, then by default it must be the potential start of a word token.

Mark_Galeck

2016-10-15 01:31

reporter   bugnote:0003417

Before I reply, I want to clarify something. It is not my intention to argue on any of my reports - I will accept any reply and resolution, even if I disagree personally, so long as:

1. The reply answers questions that were left unanswered in the current standard, as explained in the report. This includes pointing to a place in the standard, if I missed it.

2. The reply correctly states the facts that are in the standard and does not contradict the standard.


So, you don't have to spend your valuable time arguing your points with me. I will take them as granted so long as they satisfy 1 and 2 above. All you have to do is satisfy 1 and 2 and that is it.


This reply IMHO does not satisfy condition 2. I will explain why:

>Rule 3, 4, 5 basically apply once a token is started
No, that is impossible; the whole thing would not work if that were the case.

Rule 3 must apply in a case like

>foobar

When the current character is "f", Rule 3 must apply. If it did not and, as you say, Rule 10 applied instead, then the previous token would not be delimited, because Rule 10 does not say to delimit it.
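The delimiting can be observed in practice; this is a small sketch assuming a POSIX sh and a `mktemp` utility for a scratch directory (common, but not mandated by the standard):

```shell
# In `echo hi>foobar`, seeing `f` must delimit the `>` operator token;
# `foobar` then becomes the redirection target, not part of the operator.
cd "$(mktemp -d)" || exit 1
echo hi>foobar
cat foobar    # the file foobar now contains: hi
```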

Rule 4 must apply in a case like

'foo'bar'baz'

If it did not, and only Rule 10 applied, then Rule 10 would start the word, and after that we would have a current token; so when the second ' is seen, Rule 4 would apply, and according to that rule the text 'bar' would be quoted, which is false.
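That the middle segment bar is unquoted is easy to demonstrate in a shell; a minimal sketch (the variable name b is just illustrative):

```shell
# 'foo' and 'baz' are single-quoted; the bar between them is not,
# so an expansion placed in that position is performed.
printf '%s\n' 'foo'bar'baz'
b=BAR
printf '%s\n' 'foo'$b'baz'
```

The first command prints foobarbaz; the second prints fooBARbaz, which could not happen if the middle segment were inside quotes.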

Rule 5 must apply in a case like

$$#

If Rule 10 applied at the first $, then Rule 5 would apply at the second $, identifying '$#' as a parameter expansion. Elsewhere in the standard it is explained that this would yield the string '0', and then we would have the simple command '$0'. But that is not the intent of the standard, and it is not how existing shells behave: they yield a simple command such as '7689#'.
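Existing shells can be checked directly; a small sketch assuming an sh on PATH:

```shell
# $$ expands to the shell's PID; the following # is mid-word, so it
# does not start a comment and stays literal.
sh -c 'echo $$#'
# prints something like 7689#
```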

shware_systems

2016-10-25 13:04

reporter   bugnote:0003456

I am explaining how the standard can be interpreted so that what's there does match existing behaviors, not arguing, per your point 1, as something questionable looking only at the normative text. I'm more right than not here, whether that's easy to believe or not, as someone who has been involved in the more recent changes to that text. I leave it as my opinion because it's easy enough to overlook nuances intended by the original authors of those sections, some of whom (if not all) are still involved with the list, so I do not speak for them.

Note the 'same current char' and 'may terminate preceding tokens' clauses I used. The basic loop isn't:

    while (*curchar++ != EOL) { apply a single rule; }

it's more of a:

    if (not empty input)
        do {
            apply rules, maybe recursively; at EOL apply the grammar to
            see if any io_here bodies need to be processed, then possibly
            reapply rules due to detected alias expansions, again
            potentially with recursion;
        } until (*curchar == EOI);

one. It is the individual rules that say when curchar++ may be executed, possibly as part of a sub-loop specific to that rule, or to do look-ahead checking, such as for on-the-fly line joining when *curchar == '\\'. The standard also leaves open that *curchar may reference an input buffer where line joining has been preprocessed as much as practical.

Rule 10 does not say to use the current char as the first character of a new token and then access a new char as the current char, just to use it as the first char.
 
Rule 3 does apply to terminate, per the above, the '>' token of '>foobar', but it does not necessarily start a new token. Rule 3 also applies for '>#foobar', terminating '>', but Rule 9 is what determines how the rest of that line is classified, with '#' as the current char, as beginning a comment.

The delimiting newline that Rule 9 says to look for and move past can also be overridden by Rule 1, if the comment is on the last line of a file that isn't terminated by a newline. The same applies if the last character is a NUL, when the source is a C string passed as an sh -c argument or through a system() interface call, or a Ctrl-Z if that's the interactive EOF control character; these are all variants of, symbolically, EOI.
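The end-of-input case is observable; a minimal sketch assuming an sh reading its script from standard input:

```shell
# The script ends in a comment with no trailing newline; the comment
# (and the command before it) are still terminated at end of input.
printf 'echo ok # trailing comment, no newline' | sh
# prints: ok
```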

For 'foo'bar'baz', the first ' gets classified by Rule 10, and then by Rule 4 as the beginning of a single-quoted string. A new current char, the 'f', is then accessed according to that rule.
 
For the last case, Rule 10 starts the token; $$ is a valid special parameter (the shell's numeric PID after evaluation) by Rule 5 and 2.5.2; and '#', by Rule 10 followed by Rule 9, again begins a comment.

No, it's not straightforward, but it is essentially correct as is in describing how various implementations process most scripts.

geoffclare

2018-05-10 15:52

manager   bugnote:0004028

This is being resolved as a duplicate of 0001085 because we believe the resolution of 1085 addresses this problem as well.

Issue History

Date Modified Username Field Change
2016-10-12 08:56 Mark_Galeck New Issue
2016-10-12 08:56 Mark_Galeck Name => Mark Galeck
2016-10-12 08:56 Mark_Galeck Section => 2.3 Token Recognition
2016-10-12 08:56 Mark_Galeck Page Number => 2347-2348
2016-10-12 08:56 Mark_Galeck Line Number => 74761-74780
2016-10-12 23:29 shware_systems Note Added: 0003408
2016-10-13 01:44 Mark_Galeck Note Added: 0003409
2016-10-14 22:16 shware_systems Note Added: 0003416
2016-10-15 01:31 Mark_Galeck Note Added: 0003417
2016-10-25 13:04 shware_systems Note Added: 0003456
2016-10-28 08:20 geoffclare Relationship added related to 0001100
2018-05-10 15:52 geoffclare Interp Status => ---
2018-05-10 15:52 geoffclare Note Added: 0004028
2018-05-10 15:52 geoffclare Status New => Resolved
2018-05-10 15:52 geoffclare Resolution Open => Duplicate
2018-05-10 15:52 geoffclare Relationship added duplicate of 0001085