Austin Group Defect Tracker

Aardvark Mark IV


ID 0001084
Category [1003.1(2016/18)/Issue7+TC2] Shell and Utilities
Severity Editorial
Type Error
Date Submitted 2016-10-12 08:56
Last Update 2018-05-10 15:52
Reporter Mark_Galeck
View Status public
Assigned To
Priority normal
Resolution Duplicate
Status Resolved
Name Mark Galeck
Organization
User Reference
Section 2.3 Token Recognition
Page Number 2347-2348
Line Number 74761-74780
Interp Status ---
Final Accepted Text
Summary 0001084: rule 3, 4, 5 do not say that a token is started, if needed
Description Line 74749 says "break its input into tokens by applying the first applicable rule", but if rule 3, 4, or 5 is the first applicable rule and a token is intended to start at the current character, then that token is never started.
Desired Action Add to the end of rule 3: "In addition to this rule, continue to the next applicable rule for the current character, to decide if the current character starts a new token or is discarded."

Add to the ends of rules 4 and 5: "If there is no current token, then the current character starts a token".
Tags No tags attached.
Attached Files

- Relationships
duplicate of 0001085 (Applied) "token shall be from the current position in the input" is incorrect
related to 0001100 (Closed) Rewrite of Section 2.10 Shell Grammar, of the Shell Standard, to fix previous reports, fix new issues, and improve presentation.

-  Notes
(0003408)
shware_systems (reporter)
2016-10-12 23:29

This is handled by Rule 10, as the rule of last resort.
(0003409)
Mark_Galeck (reporter)
2016-10-13 01:44

No, it's not. That is my whole point. I repeat, line 74749 says "applying the _first_ applicable rule". If rule 3, 4, or 5 applies, that is it; there is no "rule of last resort".
(0003416)
shware_systems (reporter)
2016-10-14 22:16

As I read it, Rules 3, 4, and 5 basically apply once a token is started, which isn't the case in the circumstance cited. So Rule 10, as the first fully applicable rule, establishes that a word token is expected; then Rule 3, 4, or 5 applies with the same current char, with the token being started as possibly a word. They may force a previous token to terminate but do not force the start of a token, because a sequence like "" amounts to empty text that will be discarded anyway during quote removal as not materially part of the token, and an implementation employing limited look-ahead can safely pre-evaluate and skip over sequences like that as an optimization. On some platforms such optimizations aren't practical either, for architectural reasons.
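
The effect of an empty quoted sequence can be observed in a running shell. This is a minimal sketch, assuming a POSIX-compatible shell; the names `with_empty`, `count_args`, and `standalone` are illustrative only. Inside a word, "" contributes nothing after quote removal, whereas a standalone "" is still expected to produce one (empty) argument:

```shell
# Inside a word, the empty quoted sequence vanishes during quote removal.
with_empty=foo""bar

# Hypothetical helper: prints how many arguments it received.
count_args() { echo $#; }

# A standalone "" is still passed as one empty argument.
standalone=$(count_args "")

printf '%s\n' "$with_empty" "$standalone"
```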

Granted, as stated this isn't immediately intuitive, but it has to cover recursive evaluation of tokens inside tokens too, and potential side effects of alias processing. A matrix of preconditions, character classes, and postconditions would have a lot more rows than there are rules, however, especially if the additional rules incorporated by reference to 2.6 are added for the various substitution types.

It is the rule of last resort because the other rules, as higher-priority evaluations that may terminate preceding tokens as side effects, handle whether the char:
  starts an operator token,
  is text that should be ignored without starting a token, or
  is an interruption of the recognition context of a word token that has
    non-discardable parts already.
Rule 1 is first, I believe, because it has the additional side effect of completing the top-level productions of the grammar, as highest priority. In other words, if the char can't be one of those, then by default it must be the potential start of a word token.
(0003417)
Mark_Galeck (reporter)
2016-10-15 01:31

Before I reply, I want to clarify something. It is not my intention to argue about any of my reports. I will accept any reply and resolution, even if I disagree personally, so long as:

1. The reply answers questions that were left unanswered in the current standard, as explained in the report. This includes pointing to a place in the standard, if I missed it.

2. The reply correctly states the facts that are in the standard and does not contradict the standard.


So, you don't have to spend your valuable time arguing your points with me. I will take them as granted so long as they satisfy 1 and 2 above. All you have to do is satisfy 1 and 2 and that is it.


This reply IMHO does not satisfy condition 2. I will explain why:

>Rule 3, 4, 5 basically apply once a token is started
No, that is impossible; the whole thing would not work if that were the case.

  Rule 3 must apply in a case like

>foobar

When the current character is "f", Rule 3 must apply. If it did not and, as you say, Rule 10 applied, then the previous token would not be delimited, because Rule 10 does not say to delimit it.

Rule 4 must apply in a case like

'foo'bar'baz'

If it did not, and only Rule 10 applied, then Rule 10 would start the word. After that we have a current token, so when the second ' is seen, Rule 4 would apply, and according to that rule the text 'bar' would be quoted, which is false.

Rule 5 must apply in a case like

$$#

If Rule 10 applied at the first $, then Rule 5 would apply at the second $, identifying '$#' as a parameter expansion. Elsewhere the standard explains that this would yield the string '0', and we would then have the simple command '$0'. But that is not the intent of the standard, and it is not how existing shells behave: they yield a simple command such as '7689#'.
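
The three inputs above can be checked against an actual shell. This is a minimal sketch, assuming a POSIX-compatible shell and the widely available (but not POSIX-mandated) `mktemp -d`; the variable names and the scratch directory are illustrative only:

```shell
# Scratch directory so the redirection test leaves nothing behind.
tmpdir=$(mktemp -d) || exit 1
cd "$tmpdir" || exit 1

# Case 1: in ">foobar", '>' is delimited as an operator and foobar is its target.
echo hi >foobar
redirected=$(cat foobar)

# Case 2: 'foo'bar'baz' is a single word; only foo and baz are quoted.
word='foo'bar'baz'

# Case 3: in "$$#", $$ expands to the shell's process ID and '#' stays
# literal, because '#' is not at the start of a word here.
pid_hash=$(echo $$#)

printf '%s\n' "$redirected" "$word" "$pid_hash"
cd / && rm -rf "$tmpdir"
```

The last line printed is a process ID followed by '#', so its digits vary from run to run.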
(0003456)
shware_systems (reporter)
2016-10-25 13:04

I am not arguing; per your point 1, I am explaining how the standard can be interpreted so that what's there does match existing behaviors, since it can look questionable from the normative text alone. I'm more right than not here, whether or not that's easy to believe, as someone who has been involved in the more recent changes to that text. I leave it as my opinion because it's easy enough to overlook nuances intended by the original authors of those sections, some of whom (if not all) are still involved with the list, so I do not speak for them.

Note the 'same current char' and 'may terminate preceding tokens' clauses I used. The basic loop isn't:
while (*curchar++!=EOL) {apply a single rule};

it's more like:
if (not empty input) do {
    apply rules, maybe recursively;
    at EOL, apply the grammar to see if maybe io_here bodies need to be processed;
    possibly reapply rules due to detected alias expansions, again potentially with recursion;
} until (*curchar==EOI);

It is the individual rules that say when curchar++ may be executed, possibly as part of a sub-loop specific to that rule or to do look-ahead checking, such as for on-the-fly line joining when *curchar=='\'. The standard also leaves open that *curchar may be referencing an input buffer in which line joining has already been preprocessed as far as practical.
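
The line joining mentioned above can be observed directly. This is a minimal sketch, assuming a POSIX-compatible shell is available as `sh`; the variable name `joined` is illustrative only:

```shell
# The inner shell receives "ec", a backslash, a newline, and "ho joined".
# The backslash-newline pair is removed before tokens are delimited,
# so the two physical lines form the single word "echo".
joined=$(sh -c 'ec\
ho joined')
printf '%s\n' "$joined"
```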

Rule 10 does not say to use the current char as the first character of a new token and then access a new char as the current char; it says only to use it as the first char.
 
Rule 3 does apply to terminate, per above, the '>' token of '>foobar', but it does not necessarily start a new token. Rule 3 applies for '>#foobar' also, as terminating '>', but Rule 9 is what determines how the rest of that line is classified, with '#' as the current char, as beginning a comment.
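
How a shell treats '>#foobar' can be probed by checking whether the command parses at all. This is a sketch, assuming `sh` is a POSIX-compatible shell; the variable name `result` and the use of `${TMPDIR:-/tmp}` as a scratch location are illustrative. If '#' begins a comment right after '>', the redirection is left without a target and the child shell is expected to report a syntax error:

```shell
# Run in a child shell so a syntax error does not abort this script.
# The cd keeps any stray file out of the current directory.
if sh -c 'cd "${TMPDIR:-/tmp}" && echo hi >#foobar' 2>/dev/null; then
    result=parsed
else
    result=rejected    # expected when '#foobar' is taken as a comment
fi
printf '%s\n' "$result"
```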

The delimiting newline that Rule 9 says to look for and move past can also be overridden by Rule 1 if the comment is on the last line of a file that isn't terminated by an EOL. The same applies if the last character is a NUL (when the source is a C string passed as a sh -c argument or through a system() interface call), or a Ctrl-Z if that's the interactive EOF control character; all of these are variants of what is symbolically EOI.
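
The sh -c case can be tried directly. This is a minimal sketch, assuming a POSIX-compatible shell is available as `sh`; the variable name `eoi_out` is illustrative only:

```shell
# The command string ends inside the comment, with no trailing newline;
# end of input terminates the comment just as a newline would.
eoi_out=$(sh -c 'echo ok # comment that runs to end of input')
printf '%s\n' "$eoi_out"
```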

For 'foo'bar'baz' the first ' gets classified by Rule 10, and then by Rule 4 as the beginning of a single quoted string. A new current char, the 'f', is then accessed according to that rule.
 
For the last case, Rule 10 starts the token, $$ is a valid special parameter (the shell's numeric pid after evaluation) by Rule 5 and 2.5.2, and '#' by Rule 10 followed by Rule 9 again begins a comment.

No, it's not straightforward, but it is essentially correct as is in describing how various implementations process most scripts.
(0004028)
geoffclare (manager)
2018-05-10 15:52

This is being resolved as a duplicate of 0001085 because we believe the resolution of 1085 addresses this problem as well.

- Issue History
Date Modified Username Field Change
2016-10-12 08:56 Mark_Galeck New Issue
2016-10-12 08:56 Mark_Galeck Name => Mark Galeck
2016-10-12 08:56 Mark_Galeck Section => 2.3 Token Recognition
2016-10-12 08:56 Mark_Galeck Page Number => 2347-2348
2016-10-12 08:56 Mark_Galeck Line Number => 74761-74780
2016-10-12 23:29 shware_systems Note Added: 0003408
2016-10-13 01:44 Mark_Galeck Note Added: 0003409
2016-10-14 22:16 shware_systems Note Added: 0003416
2016-10-15 01:31 Mark_Galeck Note Added: 0003417
2016-10-25 13:04 shware_systems Note Added: 0003456
2016-10-28 08:20 geoffclare Relationship added related to 0001100
2018-05-10 15:52 geoffclare Interp Status => ---
2018-05-10 15:52 geoffclare Note Added: 0004028
2018-05-10 15:52 geoffclare Status New => Resolved
2018-05-10 15:52 geoffclare Resolution Open => Duplicate
2018-05-10 15:52 geoffclare Relationship added duplicate of 0001085

