Austin Group Defect Tracker

Aardvark Mark IV


Viewing Issue Simple Details
ID 0001778
Category [Issue 8 drafts] Shell and Utilities
Severity Objection
Type Enhancement Request
Date Submitted 2023-10-02 13:58
Last Update 2024-06-11 09:12
Reporter kre
View Status public
Assigned To
Priority normal
Resolution Accepted As Marked
Status Closed
Product Version Draft 3
Name Robert Elz
Organization
User Reference
Section XCU 3/read
Page Number 3291-3294
Line Number 111869-111878, 111961-111963, 111946, 111979-111980
Final Accepted Text See Note: 0006592
Summary 0001778: The read utility needs field splitting updates/corrections (and a little more)
Description The description of how field splitting is used to process the input
and allocate it to the named variables has similar problems as those
with field splitting itself. The latter is being fixed in bug:1649.

The description of how this is done with read needs similar updates
(though less extensive) - whether or not bug:1649 is finally accepted.
 
In addition, starting on line 111961 the description of read says:
 
   (If IFS is not set to the null string this applies even when using -d "",
   because the field splitting performed by read is a character-based
   operation.)

which I am not sure remains true after the effects of bug:1560 (as
reinterpreted perhaps in bug:1649) are applied, but I will leave it
to others more knowledgeable about what is or was intended here, and
whether or not it still applies, to decide if anything needs to be
done with that sentence.
 
Also the RATIONALE says (starting line 111979)

    Since read affects the current shell execution environment,
    it is generally provided as a shell regular built-in.
   
which doesn't make a lot of sense to me, given that line 111946 says:

    This utility is required to be intrinsic. See Section 1.7 ...

but again I will leave it for others to decide what to do about that, and
the immediately following text, as the RATIONALE isn't exactly important.
Desired Action For the parts I can suggest a solution, see a note below (I prefer to do it
that way rather than include the text here, as that way I can edit it if it
turns out to look absurd - which it very well may do).
Tags applied_after_i8d3, issue8
Attached Files

- Relationships
related to 0001649 (Closed) Field splitting is woefully under specified, and in places, simply wrong
parent of 0001779 (Closed) Standard does not say how much read should do if a read-only variable is given

-  Notes
(0006500)
kre (reporter)
2023-10-02 14:00
edited on: 2023-10-02 14:44

Replace the paragraph beginning
The terminating logical line delimiter

at line 111869, and continuing down past the three bullet points
to line 111878 (inclusive) with the following:

The terminating logical line delimiter (if any) shall be
removed from the input. Then if the shell IFS variable
[xref XCU 2.5.3] is set, and its value is an empty string, the resulting
data shall be assigned to the first variable var named upon the
command line, with any other variables named being set to the empty string.
No other processing takes place.

If IFS is unset, or is set to any non-empty value,
then a modified version of the field splitting algorithm specified in
[xref XCU 2.6.5] shall be applied, with the modifications as follows:
  1. The input to the algorithm shall be the result obtained from
    the read, and shall be considered as a single initial field,
    all of which resulted from an expansion.

  2. The loop over the contents of that field shall cease when either
    the input is empty, or when n output fields have been generated,
    where n is one less than the number of variable name arguments to
    the read command. Any remaining input in the original field
    being processed shall be returned to the read command, unmodified,
    except that any leading or trailing IFS white space, as defined in
    XCU 2.6.5, shall be removed.

The number of output fields generated by the field splitting
algorithm shall be called m. The first m (which must be
less than or equal to n) variables named upon the read command
line are assigned the values from the ordered list of output fields, in the
order they were generated, starting with the first variable named, and
continuing sequentially until m variables have had values assigned.
If m = n then the remaining unsplit input returned from the
modified field splitting algorithm shall be assigned to the n+1'st
(final) variable named upon the command line.
Any variables to which no value has yet been assigned shall be assigned
an empty string.

Note that in the case where just one variable name was given
on the read command line, the algorithm in XCU 2.6.5 ceases after producing
zero (m) output fields, and simply returns the input field with
any leading and trailing IFS white space removed. No output fields
exist to assign to variables, so the remaining modified input is assigned
to the first (and only) variable named.
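
A minimal sketch of the three cases described above (empty IFS, several
variables, a single variable), assuming a shell whose read already behaves as
this text proposes. The helper names show_empty_ifs, show_three and show_one
are illustrative only; each runs read in a subshell so the caller's IFS is
left untouched.

```sh
# Empty IFS: no splitting, the whole line goes to the first variable.
show_empty_ifs() {
    printf '%s\n' "$1" | ( IFS=; read -r first second
        printf '[%s] [%s]\n' "$first" "$second" )
}

# Default splitting (IFS unset), three variables, so n = 2.
show_three() {
    printf '%s\n' "$1" | ( unset IFS; read -r x y z
        printf '[%s] [%s] [%s]\n' "$x" "$y" "$z" )
}

# One variable, so n = 0: only leading/trailing IFS white space is removed.
show_one() {
    printf '%s\n' "$1" | ( unset IFS; read -r only
        printf '[%s]\n' "$only" )
}

show_empty_ifs '  a b  c  '    # [  a b  c  ] []
show_three '  a b  c d  '      # [a] [b] [c d]
show_three 'a'                 # [a] [] []
show_one '  a b  '             # [a b]
```

In the second call m = n = 2, so the unsplit remainder goes to the third
variable; in the third call m = 1, so the two remaining variables are set to
the empty string.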


In addition to this, somewhere we should probably also add either:
If any variable name is specified more than once as an
argument to read the results are unspecified.

or, as all shells seem to implement it this way, even though it doesn't
seem even vaguely useful to me:
If any variable name is specified more than once as an
argument to read then the values are assigned in order as
specified above, with any later assignment replacing the earlier,
including when an empty string is assigned because no field data was
available.


(0006502)
kre (reporter)
2023-10-02 14:20
edited on: 2023-10-02 14:34

Re Note: 0006498 (a note added to 0001649), where it says:

   Where IFS=: read -r header rest fails to read the "rest" properly if it
   contains one and only one ":" and it's at the end of the line.

I'm afraid I don't understand the perceived problem. What is meant there?

If I have the following script (which is a bit more general than it needs
to be for this, as it was a modified version of another script I used for
testing just how shells process read(1)):

I=${IFS-$' \t\n'}

case "$1" in
-i*) I=${1#-i}; shift;;
-I) I=unset; shift;;
esac

printf %s\\n "$*" |
        {
                case "$I" in
                unset) unset IFS;;
                *) IFS="$I";;
                esac

                read -r var rest &&
                    printf %s\\n var="[${var-unset}] rest=[${rest-unset}]"
        }

and I call it /tmp/r2 and run it as

        /usr/pkg/bin/bash /tmp/r2 -i: line:

then every shell I tested produces:

var=[line] rest=[]

as the result. I'm not sure what different result you'd be expecting there.

Here "every shell" includes zsh, even when not in --emulate sh mode, and
NetBSD's crappy ancient pdksh implementation. That is, apart from ksh88,
which I don't have, just about every Bourne shell derivative
which is still in active use.

(0006503)
kre (reporter)
2023-10-02 14:30

Oops, I submitted this as if it applied to Issue 7 TC1 (of all things), it
is meant to be (and the page and line numbers are from) Issue 8, draft 3.

Could someone who is able please move it where it belongs? Apologies, I
should have noticed the missing "which draft" field on the form, but didn't!
(0006507)
geoffclare (manager)
2023-10-02 16:20

Re Note: 0006503 I have changed this to be an Issue 8 draft 3 bug, as requested.
(0006509)
geoffclare (manager)
2023-10-03 09:23

Re the last part of Note: 0006500, when there are duplicate variable names passed to read the standard already clearly requires the "assigned in order" behaviour. I don't see the need to add anything about it, except perhaps as rationale.
(0006510)
kre (reporter)
2023-10-03 11:59

Re Note: 0006509

Could you quote the text which you believe clearly requires that, as I see
nothing at all about the order in which the assignments take place.

The current version does say:
    the first field shall be assigned to the first variable var,
    the second field to the second variable var, and so on.

but that doesn't say anything about the order in which those
assignments are to occur. Just which field goes to which variable.

Further it also says:

   If there are fewer fields than there are var operands, the
   remaining vars shall be set to empty strings.

but that says nothing about the order in which that occurs relative
to the other assignments. One reasonable implementation might be
to simply start by setting all the vars to empty strings, and then
update those for which there is data from the read result. Nothing
here says that cannot be done.

While I'm here, the current text also doesn't say what to do if there's
an error making one of those assignments - certainly read exits with a
status > 1 in that case, that's clear, but is read expected to go ahead
and assign to the other variables, or may it abort processing as soon as
the error is detected? That's unstated as well as best I can tell.
As a hint: not all shells do the same thing here; there are actually 3
different behaviours I have detected. Further, not all shells do the >1
exit status either: some exit 1 in this case, and for some the shell (or
subshell) exits (treating this as a variable assignment error and doing as
specified by table 2.8.1, I presume).
(0006511)
kre (reporter)
2023-10-03 12:11

Further re Note: 0006509

I assumed in Note: 0006510 that your "already clearly states" meant
what is in Draft 3 (and earlier) already, rather than what is in
my proposed (but needs wordsmithing) text in Note: 0006500

In the latter I do specify left to right (or first to last) assignment
of the variables to which data is to be applied - as that's what shells
do, so it seemed reasonable - but there is a potential problem for the
duplicate var name case, it also says:

   Any variables to which no value has yet been assigned shall be assigned
   an empty string.

which would imply that the empty string assignments happen after the
others, but (even though I wrote that) is unclear as to what "variables
to which no value has yet been assigned" means in the case we have

    read a b a <<\!
    x y
    !

clearly a=x and b=y happens, but now are we supposed to also do a=''
for the duplicate use of a (as that's what shells do), or is a in that
case a variable to which a value has already been assigned?

This kind of thing needs to be bulletproof.
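
For anyone wanting to check their own shell, here is a small probe for the
duplicate name case. It is only a sketch: dup_probe is a made-up helper name,
and the expected results in the comments are what is reported above for
shells that assign in operand order.

```sh
# Print the final values of a and b after "read -r a b a" on the given line.
dup_probe() {
    printf '%s\n' "$1" | ( unset IFS; a=unset-a b=unset-b
        read -r a b a
        printf 'a=[%s] b=[%s]\n' "$a" "$b" )
}

dup_probe 'x y'      # assigning in operand order gives: a=[] b=[y]
dup_probe 'x y z'    # the second "a" gets the remainder:  a=[z] b=[y]
```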
(0006512)
geoffclare (manager)
2023-10-03 13:42

> Could you quote the text which you believe clearly requires that

Utility Description Defaults, para 3 under OPERANDS (in draft 3 it's P2449 L79351):
Unless otherwise stated, the standard utilities that accept operands shall process those operands in the order specified in the command line.
(0006513)
kre (reporter)
2023-10-04 14:42
edited on: 2023-10-05 15:41

Re Note: 0006512 .. OK, but that still leaves open the question of what it
means to "process those operands" in the case of read (1).

For utilities like ls, cp, mkdir, ... where there is a basic operation to
perform, and then that operation is performed upon each operand in turn,
the "process" is easy to understand, and doing everything in sequence is
natural.

Since read's operation isn't (really) to process each operand in turn (that
is, it is required to read the input up to the end delimiter before doing
anything else, not to read one field at a time for each var named), one
could argue that simply setting up the vars named, ready to be assigned to
once data is available, is all the processing that is required. Then the
actual order in which the assignments get made would not be specified. And yes, I know
that's something of a stretch, which is why I did not write the text to allow
that, Note: 0006500 requires that the fields be assigned in order, and probably
should be updated to ensure that all variable names (args on the command line)
for which no data is available do get assigned the empty string, even if the
same var name was previously the object of an assignment.

But there's another more specific (related) requirement in Note: 0006500 than
what is in the current text - my new version requires that the field splitting
complete before any variable assignments are made, whereas while the current
text suggests that, it isn't actually required. This makes a difference if
a variable assigned to can alter the way the field splitting works, for which
the obvious example is IFS - but (some of) the locale variables can have an
effect as well, as they can affect how IFS is interpreted.

What's more, when we do this we have some wild behaviours amongst shells,
a majority (all ash derived shells, yash, ksh93) behave as described in Note: 0006500
but several (older ksh's, bosh, and zsh) assign the variables as each field is
produced, such that altering IFS alters how the next field is delimited, and
bash seems to simply disable (further) field splitting if IFS is altered, and
the remainder of the input is assigned to the next variable. Or at least
that's the general behaviour, things get more strange if IFS white space is assigned to IFS during the process.

Given:
      IFS=: ; printf %s\\n "$input" | read -r IFS var rest

then with input=',:abc:def,ghi' I see as the final values of IFS var and rest

IFS=[,] var=[abc] rest=[def,ghi] (shells where vars are assigned late)
IFS=[,] var=[abc:def,ghi] rest=[] (bash)
IFS=[,] var=[abc:def] rest=[ghi] (shells which do field by field assignments)

With input=':abc:def,ghi' only the final set changes (and in a consistent way)

IFS=[] var=[abc:def,ghi] rest=[]

With input=' : abc:def,ghi '

IFS=[ ] var=[ abc] rest=[def,ghi ] (shells which field split first)
IFS=[ ] var=[ abc:def,ghi ] rest=[] (bash)
       Both of those are consistent with the first set of results, just
       with different input data

IFS=[ ] var=[abc:def,ghi] rest=[]
IFS=[ ] var=[] rest=[abc:def,ghi]
IFS=[ ] var=[] rest=[ abc:def,ghi]
      from the shells which do immediate updates to vars when fields are split
      (but only the first of those makes any real sense to me at all).

What we ought to do about this I am unwilling to say - keep Note: 0006500
as it is, or make it unspecified whether vars are assigned immediately each
output field is produced by field splitting, either of those would be possible.
Mandating immediate assignment would be unacceptable I believe.
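
The tests above can be repeated with something like the following sketch.
ifs_probe is an illustrative name, and the subshell is only there so that the
values survive the pipeline and the caller's IFS is not disturbed. No
expected output is given for var and rest, since that is exactly what
differs between shells.

```sh
# Report the final IFS, var and rest after "read -r IFS var rest",
# starting from IFS=: as in the command shown above.
ifs_probe() {
    printf '%s\n' "$1" | ( IFS=:
        read -r IFS var rest
        printf 'IFS=[%s] var=[%s] rest=[%s]\n' "$IFS" "$var" "$rest" )
}

ifs_probe ',:abc:def,ghi'
ifs_probe ':abc:def,ghi'
ifs_probe ' : abc:def,ghi '
```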

(0006514)
kre (reporter)
2023-10-04 17:36

There's another issue with read which I decided to do nothing about in the
Note: 0006500 text - partly to see if anyone would notice it, and partly
because I have no idea what (precisely) should be done about it.

Text in (draft 3 and earlier) of read (@line 111859... in draft 3) says:

    By default, unless the -r option is specified, <backslash> shall act
    as an escape character. An unescaped <backslash> shall preserve the
    literal value of the following character, with the exception of either
    <newline> or [...]

The "or" stuff at the end is new, but \<newline> (or its new alternative)
isn't the issue here, it is the:

    preserve the literal value of the following character,

which is a meaningless thing to say, when we're considering field splitting,
for two reasons:

First, the data being split is supposed to be (subsequent to 0001560) a
byte string, not characters -- and that's reinforced in read, at line 111905...
where it says:

    If the -d delim option is specified and delim is the null string,
    the standard input shall contain zero or more bytes (which need not
    form valid characters).

but the use of <escape> as affecting the following 'character' does not
depend upon the -d setting. (Also, behaviour is unspecified if the -d
(delim) arg is not a single byte character, or empty which also suggests
that random encoded chars in the input are not really intended to be handled,
or we'd need to allow applications to specify the delim to be any arbitrary
character as well, not make it unspecified if they do).

And while this issue is why I am unwilling to propose complete text to fix
this problem, as I have no idea what is actually supposed to happen in cases
like this, and no good way to test what shells do, this isn't the issue
which is most significant.


Second, "literal value" has no meaning with field splitting - doesn't as it is
proposed to be processed in 0001649 and didn't as field splitting was
specified before. For this, there's no question what is actually intended,
the idea is that the escaped character (or byte) can never be an IFS character,
it is always to be treated as field data, and so assigned to a variable.
But there isn't anything which actually says this - field splitting, as used
in the rest of the shell has no need for anything like this, what it deals
with is quoted, or unquoted, expansion results. Anything quoted by a \
can't be either of those, that's all shell quoting, and is handled by the
tokenisation of the input, and then quote removal.

What we have with read (and all shells implement it as intended I believe)
is something similar to the issue with \ in patterns - in patterns written
directly in the shell, there never was an issue, as the \ is either a quoting
character, or a (must have been quoted) ordinary character - and the normal
rules for interpreting quoting applied, and things just worked. But when
a \ appears in the result of an unquoted expansion then we need to handle it
specially, even though normal pattern processing wouldn't normally be
concerned with it at all. All this resulted in more text to handle \ in
patterns, when it does appear.

Since we're treating the data being obtained by read (after line joining
happens in this case, since this is only relevant when -r is not given)
as if it were the result of an unquoted expansion, we have a similar issue
with field splitting, we want the \ in there to be magic - but normal field
splitting has no need for any magic like that, so doesn't specify it.

Here, I believe, there is (aside from dealing with the first issue above,
which I will return to a little more below) a relatively simple solution
available.

Where Note: 0006500 says (bullet point 1)

    The input to the algorithm shall be the result obtained from
    the read, and shall be considered as a single initial field,
    all of which resulted from an expansion.

(which should probably have said "unquoted expansion") - we'd just
change that to be something like

    The input to the algorithm shall be the result obtained from
    the read, and shall be considered as a single initial field,
    all of which resulted from expansions, any data bytes specified
    above to be treated literally, and the preceding escape character
    shall be treated as if they were the results of a quoted expansion,
    all other data shall be treated as if it were (perhaps several)
    results from unquoted expansions.

Then field splitting will leave the \X sequences alone, and split only
when it finds IFS delimiters in the parts designated as unquoted.
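
A short illustration of the intended effect, as a sketch only (esc_probe and
raw_probe are hypothetical helper names): without -r an escaped IFS character
stays in the field as data, while with -r the backslash is ordinary data and
the colon after it still delimits.

```sh
# Without -r: backslash escapes the following byte, so "\:" does not split.
esc_probe() {
    printf '%s\n' "$1" | ( IFS=:; read x y
        printf 'x=[%s] y=[%s]\n' "$x" "$y" )
}

# With -r: backslash is plain data and every unescaped ":" may split.
raw_probe() {
    printf '%s\n' "$1" | ( IFS=:; read -r x y
        printf 'x=[%s] y=[%s]\n' "$x" "$y" )
}

esc_probe 'a\:b:c'   # x=[a:b] y=[c]
raw_probe 'a\:b:c'   # x=[a\] y=[b:c]
```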

But we still need to deal with the first issue, and this one is not at
all simple (but I guess lots more unspecified can be thrown around if
needed).

The problem is that the whole point of \ escaping is to hide a character
(which is, I suppose, why the current text says "character" and not byte)
which might be in IFS, but which is not intended to be a field delimiter.
It is fairly clear from the draft 3 text (and this was not altered in 0001649)
that IFS contains characters. If IFS contains a locale specific encoded
non-ascii character (let's say, a greek lower case alpha, in UTF-8 encoding)
then that's going to (usually) be 2 or more data bytes (2 in this case I
believe). If we don't want a specific greek alpha that appears in the input
data to be treated as a field delimiter, we need to escape that character
(\alpha). This means we need to be treating the data after the \ as a
character (which is what the text currently demands) but which then leaves
open the question of what happens when the data read has a \ followed by
bytes which do not properly form a character. I have no idea what shells
do with this, so I can't suggest any text to handle this issue. (Most shells
probably simply don't deal).
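
The byte count for that example character is easy to confirm. This is a
sketch with an illustrative function name, alpha_bytes; the octal escapes
are the UTF-8 encoding of U+03B1 (greek small letter alpha).

```sh
alpha_bytes() {
    # \316\261 is 0xCE 0xB1, the UTF-8 encoding of U+03B1.
    # The arithmetic expansion strips any padding that wc adds.
    echo $(( $(printf '\316\261' | wc -c) ))
}

alpha_bytes    # 2
```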


Note however, that if the suggested (type of) fix for this issue suggested
just above is adopted, then it becomes irrelevant that field splitting deals
with strings of arbitrary bytes, and we're trying somehow to deal with
characters in those bytes here - as by specifying the escaped text as the
result of a quoted expansion, field splitting will simply leave that section
alone, as bytes to go in a field, which is exactly what we want - the field
splitting algorithm doesn't need to know whether those bytes form chars or not.

So, what's left is to work out what the "data bytes specified above to be
treated literally" (in the proposed text in this note) actually is to be in
the various cases, without reverting to the earlier presumption that
everything obtained by read is necessarily well formed characters in all cases.
(0006515)
geoffclare (manager)
2023-10-05 08:59
edited on: 2023-10-10 11:06

Re Note: 0006514 [misinformation about backslash and field splitting removed - see Note: 0006517]

You have a point about character v. byte and we should address that.

(0006516)
kre (reporter)
2023-10-05 11:08
edited on: 2023-10-12 18:45

Re Note: 0006515

You might be right that I am misinterpreting that section, but if I am,
the current specification of how read works (including what is in Note: 0006500)
is even more broken than I thought.

You did note that I said in that note:

    all shells implement it as intended

by which I (hope obviously) meant as I considered it to be intended.

Try it yourself:

printf %s\\n ' a x:c\:c d\ d e\:\ e f ' |
    $SHELL -c 'IFS=": "; read a b c d e f g h ; printf "%s=<%s>\\n" a "$a" b "$b" c "$c" d "$d" e "$e" f "$f" g "$g" h "$h"'


The results I get from every shell I have to test are:

a=<a>
b=<x>
c=<c:c>
d=<d d>
e=<e: e>
f=<f>
g=<>
h=<>

If you're correct about how that paragraph in the text is supposed to be
interpreted, then what explains those results? Is every single shell broken?

But I don't believe your interpretation makes sense, if that was what was
intended, the obvious thing to do would have been to specify that any
escape chars be removed from the line (or whatever we call it these days)
that read fetched from stdin, after line joining, and before field splitting.

But it doesn't, it explicitly says that the escape chars are removed after
the fields have been split. The only reason for requiring that would be if
those escape characters were intended to apply (somehow) to the field splitting
operation. We just currently have the situation that whereas all shells
agree on how that should be done (at least for single byte char sets, no idea
about others) the standard doesn't say anything, and is simply assuming that
"treated literally" will have the desired effect. And there I agree with your
interpretation of that - those words don't actually say that. That was the
point of Note: 0006514

(0006517)
geoffclare (manager)
2023-10-05 14:53

Re Note: 0006516 Mea culpa. I have never seen backslash used to quote IFS characters in the input to read, so I have always assumed it doesn't work. I should have tested it before posting my note, so thank you for putting me right.
(0006519)
kre (reporter)
2023-10-05 15:38

No problem. I also finally looked here, and saw that I had forgotten
(yet again) that less-than b greater-than doesn't produce those 3 chars
in mantis, and after a few attempts I couldn't see how to fix it. So
I just changed the data (by editing the note - not rerunning) from 'b' to
'x' for what the b variable is set to.
(0006524)
chet_ramey (reporter)
2023-10-06 21:41

Re: kre's observations in 6513: that's pretty clearly a bug in bash, and the intent is to assign the variables `late'. I'll fix it.
(0006526)
geoffclare (manager)
2023-10-10 14:36

Suggested changes ...

On page 3291 line 111859 section read, change:
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of ...
to:
If the -r option is not specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of a following <backslash> and shall prevent a following byte (if any) from being used to split fields, with the exception of ...

On page 3291 line 111869 section read, replace the paragraph beginning
The terminating logical line delimiter
and continuing down past the three bullet points to line 111878 (inclusive) with the following:
The terminating logical line delimiter (if any) shall be removed from the input. Then if the shell variable IFS (see [xref XCU 2.5.3]) is set, and its value is an empty string, the resulting data shall be assigned to the variable named by the first var operand, and the variables named by other var operands (if any) shall be set to the empty string. No other processing shall be performed in this case.

If IFS is unset, or is set to any non-empty value, then a modified version of the field splitting algorithm specified in [xref XCU 2.6.5] shall be applied, with the modifications as follows:
  1. The input to the algorithm shall be the logical line (minus terminating delimiter) that was read from standard input, and shall be considered as a single initial field, all of which resulted from expansions, with any escaped byte and the preceding <backslash> escape character treated as if they were the result of a quoted expansion, and all other bytes treated as if they were the results of unquoted expansions.

  2. The loop over the contents of that initial field shall cease when either the input is empty or n output fields have been generated, where n is one less than the number of var operands passed to the read utility. Any remaining input in the original field being processed shall be returned to the read utility ``unsplit''; that is, unmodified except that any leading or trailing IFS white space, as defined in [xref to XCU 2.6.5], shall be removed.

The specified var operands shall be processed in the order they appear on the command line, and the output fields generated by the field splitting algorithm shall be used in the order they were generated, by repeating the following checks until neither is true:
  • If more than one var operand is yet to be processed and one or more output fields are yet to be used, the variable named by the first unprocessed var operand shall be assigned the value of the first unused output field.

  • If exactly one var operand is yet to be processed and there was some remaining unsplit input returned from the modified field splitting algorithm, the variable named by the unprocessed var operand shall be assigned the unsplit input.
If there are still one or more unprocessed var operands, each of the variables named by those operands shall be assigned an empty string.

Note that in the case where just one var operand is given on the read command line, the modified field splitting algorithm ceases after producing zero output fields and simply returns the original input field, with any leading and trailing IFS white space removed, as unsplit input. This unsplit input is assigned to the variable named by the var operand.

On page 3292 line 111900 section read (OPERANDS), after:
The name of an existing or nonexisting shell variable.
append:
If a var operand names the variable IFS, the behavior is unspecified.

On page 3292 line 111902 section read (STDIN), change:
If the -d delim option is not specified, or if it is specified and delim consists of one single-byte character, the standard input shall contain zero or more characters and shall not contain any null bytes.
to:
If the -d delim option is not specified, or if it is specified and delim is not the null string, the standard input shall contain zero or more bytes (which need not form valid characters) and shall not contain any null bytes.

On page 3293 line 111959 section read (APPLICATION USAGE), change:
When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input. (If IFS is not set to the null string this applies even when using -d "", because the field splitting performed by read is a character-based operation.) When reading a pathname it is also inadvisable ...
to:
When reading a pathname it is inadvisable ...

After page 3293 line 111967 section read (APPLICATION USAGE), add:
Since the var operands are processed in the order specified on the command line, if any variable name is specified more than once as a var operand, the last assignment made is the one that is in effect when read returns, including when an empty string is assigned because no field data was available.
(0006527)
kre (reporter)
2023-10-10 17:19

That (Note: 0006526) mostly looks OK, but there is one minor nit...

The change at line 111900 (unspecified what happens if IFS is one of the
vars) probably should also mention the locale vars (or at least LANG,
LC_CTYPE and LC_ALL) as changes to those can affect how field splitting
interprets what is in IFS.
(0006528)
geoffclare (manager)
2023-10-11 08:04

> changes to those can affect how field splitting interprets what is in IFS

The word "can" there is doing a lot of work. Given this text from the resolution of bug 0001649 ...
It is implementation defined whether other white-space characters which appear in the value of IFS are also considered as "IFS white space". The three characters above specified as IFS white-space bytes are always IFS white space, when they occur in the value of IFS, regardless of whether they are white-space characters in any relevant locale. For other locale specific white-space characters allowed by the implementation it is unspecified whether the character is considered as IFS white space if it is white space at the time it is assigned to the IFS variable, or if it is white space at the time field splitting occurs (the locale may have changed between those events).
... perhaps something like the following would work:
If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL and the new value assigned to the variable would change which characters in IFS are considered to be IFS white space (see [xref XCU 2.6.5]), it is unspecified what, if any, effect the change has on how read performs field splitting.
(0006529)
kre (reporter)
2023-10-11 17:13

Re: Note: 0006528

It is more than working out what is IFS white space - since IFS is defined
to contain characters, a change to the encoding (switching from UTF-8 to
8859-1 for example) can alter what characters appear there, a 2 byte encoded
UTF-8 character might now be treated as 2 characters instead.

Any change to whichever of those 3 locale vars actually applies (changes to
ones that are overridden by another of them are obviously harmless) has the
potential to affect the results that field splitting produces - if the
vars actually change while read is still splitting the fields.

Your text doesn't forbid that (whereas mine did) - and I understand that,
as several shells work that way, but the effect is that changes to anything
which can affect the results of field splitting need to result in
unspecified behaviour (as implementations aren't required to have the vars
updated immediately a field is delimited either).
(0006530)
geoffclare (manager)
2023-10-12 08:30

> changes to anything which can affect the results of field splitting
> needs to result in unspecified behaviour

I agree, but the difficulty is how to describe, precisely, which changes qualify for this treatment. We should not make the behaviour unspecified in cases where field splitting would definitely not be affected, for example:

* Any locale change if IFS is empty, unset, or has its default value of <space><tab><newline>.

* A locale change that does not change the encoding and would not change which characters in IFS are considered to be IFS white space.

* Changes to LANG or LC_CTYPE that do not change the locale (e.g. if LC_ALL is set)

Here's my previous attempt with an added condition related to encoding:
If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL and the new value assigned to the variable would change how the bytes in IFS form characters or which characters in IFS are considered to be IFS white space (see [xref XCU 2.6.5]), it is unspecified what, if any, effect the change has on how read performs field splitting.
(0006531)
kre (reporter)
2023-10-12 18:33

Re Note: 0006530

For the 3 bullet points, OK for the first, the second would be impossible
to explain accurately in any way almost anyone could comprehend, and for
the third, yes, of course.

But:
    We should not make the behaviour unspecified in cases where field splitting
    would definitely not be affected

Normally I'd agree with the sentiment behind that, less unspecified stuff
is better, when there isn't a very good reason for it.

But here?

Only a demented lunatic would use IFS or one of the locale vars, any of them,
including the ones (like LC_TIME) that can't possibly affect field splitting,
as the variables used in a read command.

I doubt anyone has ever even considered doing that in the 44 years (or whatever
it is) that Bourne type shells have existed, until I did (and you can draw
whatever conclusions you like about my mental state for attempting it.)

I had to fix a bug in the NetBSD shell (it still exists in FreeBSD's) to get
sane results from more complex tests using IFS than I showed here. Chet
has acknowledged a bug in bash in this area. All of these shells are very
old, and as best I can tell, no-one has ever reported any bug in this area
about any of them. The NetBSD one would have been detected by a use after
free type sanitizer, if anyone had used one and thought of trying with it
"read IFS xxx" (needed at least one more var after IFS, and data read to
assign to it, to trigger, was only externally visible with at least 2
more, and did not occur at all if IFS was initially unset, or of course if
it was initially set to "", as nothing is split in that case).

So, in this case I don't think it would do any harm to simply make doing
a read into any of these vars be unspecified (even add in 100% irrelevant
ones like OPTIND etc if desired). After all if someone really wanted to do:
       read LC_CTYPE
all they need do instead is
       read var && LC_CTYPE=$var

(or use ; instead of && if it is OK to set LC_CTYPE to "" in the
case that nothing is read, which the simple "read LC_CTYPE" would do).

It wouldn't exactly be a huge imposition.

Better application code would be to validate the value read before using
it for something that has consequences of course, doing that demands the
use of an intermediate variable.
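
A minimal sketch of that validate-then-assign pattern. The helper name set_ctype_from_stdin, the scratch variable candidate, and the character whitelist are all illustrative assumptions, not anything proposed in this bug; a real script would pick whatever validation suits its input.

```shell
# Hypothetical helper: read a locale name into a scratch variable,
# validate it, and only then assign it to LC_CTYPE.
set_ctype_from_stdin() {
    # IFS= and -r keep the line intact: no field splitting and no
    # backslash processing are applied to the input.
    IFS= read -r candidate || return 1
    case $candidate in
        '' | *[!A-Za-z0-9_.@-]*)
            # Reject empty values and anything outside the whitelist.
            # The pattern is evaluated in the locale that was in effect
            # before any assignment, which is the point of validating first.
            printf 'rejected locale name: %s\n' "$candidate" >&2
            return 1
            ;;
    esac
    # Only a validated value ever reaches the variable the shell uses.
    LC_CTYPE=$candidate
}
```

Because the rejected value is still held in the scratch variable, it remains available for an error message without touching the locale.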

So I'd just make the line 111900 change say

        If a var operand names any of the variables IFS, LC_ALL,
        LC_CTYPE, or LANG, the behavior is unspecified.

That's clear, can be understood by anyone, and in practice, affects no-one,
and doesn't even require shells with bugs in this area to fix them, unless
perhaps they cause crashes, and none I have encountered do that - just strange
results from read.

(Personally I'd spell it behaviour, but never mind! No expl'n req'd.)
(0006532)
kre (reporter)
2023-10-12 18:54

One more thing, in Note: 0006510, I ended with the following paragraph:

  While I'm here, the current text also doesn't say what to do if there's
  an error making one of those assignments - certainly read exits with a
  status > 1 in that case, that's clear, but is read expected to go ahead
  and assign to the other variables, or may it abort processing as soon as
  the error is detected? That's unstated as well as best I can tell.
  As a hint: not all shells do the same thing here, there are actually 3
  different behaviours I have detected. Further not all shells do the >1
  exit status either, some exit 1 in this case (and for some, the shell (or
  subshell) exits, treating this as a variable assignment error and doing as
  specified by table 2.8.1, I presume).

As best I can see, the proposed resolution in Note: 0006526 does not address
this issue. We should be fixing this one as well.
(0006533)
geoffclare (manager)
2023-10-16 09:02
edited on: 2023-10-16 09:03

> Only a demented lunatic would use IFS or one of the locale vars, any of them,
> including the ones (like LC_TIME) that can't possibly affect field splitting,
> as the variables used in a read command.

I agree for IFS but not for locale vars. I can imagine situations in which an application writer might choose to do that, and the choice would be reasonable. For example, suppose an application has an installation script that needs to validate format strings in a file structured as pairs of lines with the first of each pair containing a locale name and the second containing the format string for that locale. This could be done using code like the following:
unset LANG LC_ALL LC_CTYPE ...
export LANG=C
while IFS="" read -r LANG
do
    ... validate $LANG ...
    IFS="" read -r format || ...
    ... validate $format ...
done < file

> if someone really wanted to do:
> read LC_CTYPE
> all they need do instead is
> read var && LC_CTYPE=$var

True, but why force them to do that if there's no need?

Doesn't my suggestion at the end of Note: 0006530 adequately distinguish cases that need to be unspecified from those that don't?

(0006534)
geoffclare (manager)
2023-10-16 09:19

> the current text also doesn't say what to do if there's an error making one of those assignments

If there's a problem here, it should be the subject of a separate bug as it isn't related to field splitting, which is what this bug is about.

Currently this is what draft 3 has to say:
An error in setting any variable (such as if a var has previously been marked readonly) shall be considered an error of read processing, and shall result in a return value greater than one.
It is silent about when read exits (immediately or after processing later operands) so both behaviours are allowed.
(0006537)
kre (reporter)
2023-10-16 21:43

Re Note: 0006533

    I can imagine situations in which an application writer might choose
    to do that, and the choice would be reasonable.

I cannot, or not in any case where using IFS would not also be acceptable
(perhaps even more acceptable than altering the locale vars).

In the case in the example you gave

        IFS="" read -r LANG

there is no problem, as field splitting never happens in this case, and
so changes of locale cannot generate unspecified results, one could also do

       IFS= ; read -r IFS

where I made a separate assignment to IFS, rather than using a var-assign,
purely because the effects of a var-assign, and an assignment to the same
variable in a non-special builtin are not very well defined (many shells
will simply reset the var-assigned variable to its value in the shell env
from before the command was executed, nullifying any change the builtin
might have made to it) - that issue however has nothing to do with the
current one (in this example returning IFS to some other more appropriate
value for the commands that follow is assumed to be done elsewhere if
required).

Similarly the -r is irrelevant to this, whether \ escaping (and line joining)
happens is also immaterial.

Where things get weird is in a case like:

     unset IFS
     read [-r] LANG var1 var2 LANG var3 var4

   Doesn't my suggestion at the end of Note: 0006530 adequately
   distinguish cases

I have two issues with it. First it is too complex, and very few people
are going to be able to work out just when it applies, and worse, as a script
writer, how am I supposed to know whether the user supplied data being read
is going to:
   change how the bytes in IFS form characters or which characters in IFS are
   considered to be IFS white space
??

If the script writer knew what value was in the data, they wouldn't use
read to set it, they'd just do LANG=xxx and be done.

Beyond that, the Note: 0006526 text doesn't specify whether vars are
assigned as the fields are split, or after all the field splitting has
completed - which I understand, as shells exist which do it both ways,
so users cannot rely upon it doing anything useful at all for field
splitting during the read command, nor can they assume it won't (even
if the script writer somehow believes that the conditions in the Note: 0006530
text are met, and so no effect upon the field splitting results was
planned).

Last, for this, this technique

    while IFS="" read -r LANG
    do
        .. validate $LANG ...

is OK for most variables, where in the time between when the var
is set by read, and the time it is validated, we know the var is
not going to be used (as it is only used when expanded in the script).
For the variables the shell uses itself, this is a horrible, and
very unreliable technique, and should never even be hinted at as
acceptable practice. The "validate $LANG" (unspecified there
how it is done) is likely to use some comparisons using test, or
pattern matching in a case statement - for which the value of LANG
can affect the results (different collating sequences, different
char orders for [a-z] type expressions - lots of possible variations).

The only safe way to update any of the variables the shell actually
uses as part of its processing (PATH, IFS, all the locale vars, OPTIND,
etc), based upon externally supplied input (however it is supplied,
as a command line arg, via read, from the results of a command substitution
which is just returning user data (like $(sed 1q < file)) or anything else)
is to validate the data first, and if it is OK, assign it to the variable
afterwards.

So:
    True, but why force them to do that if there's no need?

There is a need. In your example, as you wrote it, what LC_MESSAGES file
is going to be used for any message from the "validate $LANG" code, if
that discovers there is a problem? I suppose the script is supposed to
reset things to LANG=C in that case. But then we'd have lost the value
that we consider faulty, which we'd want to include in the error message,
so we'd need to have another variable and do
     BADLANG=$LANG
     LANG=C
     printf whatever ... "${BADLANG}" ...

so in practice, we're saving nothing, the extra var is still needed, we're
making the validation process almost impossible to code, and making the
language in the standard almost incomprehensible, and all for what purpose?
(0006538)
kre (reporter)
2023-10-16 22:10

Re Note: 0006534

OK (though the practice of rolling in other changes that are noticed
when a bug is being fixed is not all that uncommon here) - 0001779
has now been filed for that issue.
(0006543)
geoffclare (manager)
2023-10-17 13:34

> as a script writer, how am I supposed to know whether the user supplied data being read is going to:
> change how the bytes in IFS form characters or which characters in IFS are
> considered to be IFS white space
> ??

Fair point. Here's a new attempt that doesn't try to cover the second bullet point from Note: 0006530 (you "okayed" the other bullet points):
If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL, the new value assigned to the variable would change the current locale, and IFS contains any characters other than <space>, <tab>, and <newline>, it is unspecified what, if any, effect the change has on how read performs field splitting.
(0006547)
kre (reporter)
2023-10-18 00:08

Re Note: 0006543

This proposed new text has the same problem as (one of the ones) in the
previous version:

    the new value assigned to the variable would change the current locale

again, that's unknowable to the script writer, so their only possible
response is to not read directly into whichever of those vars you
imagine that they're using (LANG in the example from your previous note).

It truly is not worth attempting to make this as specified as possible,
there is no rational safe way to do a read into IFS, PATH, any of the locale
vars (and more) directly - ones like PATH don't need mention here, as
using that (or anything similar) would be perfectly defined for how read
processes it, but that doesn't make it sane for an application to do (and
perhaps the APPLICATION USAGE section should make that point) - vars which
immediately affect how the shell works need to be validated before being
altered if the input comes from outside the script itself. That's just
sensible programming.

So, I still claim there is no possible point attempting to find language
to do other than simply specify that any attempt to read into any var
whose value is (or ever could be) used as part of the read processing
produces unspecified results (at least in cases when read processes the
input, if you want to exempt the case when IFS="" then that's fine - that's
simple enough - I'd also need to think about \ processing when -r isn't used,
not for line joining, that's simple enough, but for removing it afterwards
if the
    IFS='' read LANG
has managed to make a change such that the code point that represents \
is no longer the same after the read as it was before, since even though
no field splitting happens, the escape chars still need to be removed, and
it would be truly weird, if the char that enabled lines to be joined, and
the one later removed, were not the same one).
(0006548)
geoffclare (manager)
2023-10-19 14:05

> the new value assigned to the variable would change the current locale
>
> again, that's unknowable to the script writer

Under certain circumstances that might be true, e.g. if the script just runs with inherited environment variables, but the script writer can certainly control whether the new value assigned to the variable would change the current locale. For example:
export LC_ALL=C
read LANG < file
is guaranteed not to change it.

My concern is that there might be scripts in existence which, for some strange reason, pass LANG or LC_CTYPE (or even LC_ALL) as an operand to read in circumstances where there is no reason to make the behaviour unspecified. In which case, doing so would unnecessarily affect the conformance of those scripts.
(0006550)
kre (reporter)
2023-10-19 16:33
edited on: 2023-10-19 16:37

Re Note: 0006548

I mentioned earlier (not sure which note it was) that a change to an
env var which is being overridden by a higher priority one would be OK,
but to specify just that the language would have to be much more specific
than your last attempt, rather than "would change the current locale"
it would need to say something more like "is guaranteed to be unable to
alter the current locale" (and you could add that example somewhere in the
application usage showing one example of how to make that guarantee if that
would be useful).


WRT:
    My concern is that there might be scripts in existence [...]

When making something unspecified here (as distinct from when deciding to
specify some particular behaviour) this isn't an issue we need to worry about
at all. Explicitly making something unspecified is our way of telling
implementers that they do not need to change whatever they currently do.

If a script writer wants a script that will work on any POSIX compat shell,
they need to worry about this, but if someone has a script that is currently
working in the environment in which they use it, that they are using unspecified
behaviour is irrelevant - it works, there's no reason (from here) for the
implementation they're using to alter, so it will just keep on working.

But now, if they were to try to move the script to a different environment,
where things currently (and will continue to) work differently, we would be
avoiding arguments between users and implementations about "your shell is
broken because my script works fine with this other shell" since the "your
shell" maintainers can simply point to the standard to show that the script
is using something not guaranteed to work.

Note that the alternative to all of this concern about particular vars being
unspecified to use with read would have been to simply specify late binding
of the vars. All field splitting first, then all assignments after that, as
I did in my original suggestion.

But that one would have required some implementations to alter to remain
conformant, and that might have broken some scripts (as unlikely as it seems,
as there are really just these few variables where it makes any difference)
in which case your concern would have been valid.

So I'd still suggest opting for maximum simplicity here, along with
maximum clarity, and minimum conditional "unspecified unless" type
complexity. There really is no harm to that.

(0006592)
nick (manager)
2023-11-27 17:29

On page 3291 line 111859 section read, change:
    
By default, unless the -r option is specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of the following character, with the exception of ...

to:
    
If the -r option is not specified, <backslash> shall act as an escape character. An unescaped <backslash> shall preserve the literal value of a following <backslash> and shall prevent a following byte (if any) from being used to split fields, with the exception of ...
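
As a hedged illustration of the escape handling described above (the variable names and input line are made up for the example), the following sketch contrasts read with and without -r on the same input, using the default IFS:

```shell
unset IFS   # use the default field splitting on <space>, <tab>, <newline>

# Without -r: the backslash is removed and the escaped <space>
# is not used to split fields, so x receives "a b" and y receives "c".
read x y <<'EOF'
a\ b c
EOF
printf '[%s] [%s]\n' "$x" "$y"

# With -r: the backslash is ordinary data and the <space> splits as
# usual, so p receives "a\" and q receives "b c".
read -r p q <<'EOF'
a\ b c
EOF
printf '[%s] [%s]\n' "$p" "$q"
```

Here-documents are used rather than a pipeline so that read runs in the current shell execution environment and the variables stay visible afterwards.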

On page 3291 line 111869 section read, replace the paragraph beginning
  
The terminating logical line delimiter

and continuing down past the three bullet points to line 111878 (inclusive) with the following:
    
The terminating logical line delimiter (if any) shall be removed from the input. Then if the shell variable IFS (see [xref XCU 2.5.3]) is set, and its value is an empty string, the resulting data shall be assigned to the variable named by the first var operand, and the variables named by other var operands (if any) shall be set to the empty string. No other processing shall be performed in this case.

If IFS is unset, or is set to any non-empty value, then a modified version of the field splitting algorithm specified in [xref XCU 2.6.5] shall be applied, with the modifications as follows:
    

          
  1. The input to the algorithm shall be the logical line (minus terminating delimiter) that was read from standard input, and shall be considered as a single initial field, all of which resulted from expansions, with any escaped byte and the preceding <backslash> escape character treated as if they were the result of a quoted expansion, and all other bytes treated as if they were the results of unquoted expansions.

  2. The loop over the contents of that initial field shall cease when either the input is empty or n output fields have been generated, where n is one less than the number of var operands passed to the read utility. Any remaining input in the original field being processed shall be returned to the read utility ``unsplit''; that is, unmodified except that any leading or trailing IFS white space, as defined in [xref to XCU 2.6.5], shall be removed.

The specified var operands shall be processed in the order they appear on the command line, and the output fields generated by the field splitting algorithm shall be used in the order they were generated, by repeating the following checks until neither is true:

      
  • If more than one var operand is yet to be processed and one or more output fields are yet to be used, the variable named by the first unprocessed var operand shall be assigned the value of the first unused output field.

  • If exactly one var operand is yet to be processed and there was some remaining unsplit input returned from the modified field splitting algorithm, the variable named by the unprocessed var operand shall be assigned the unsplit input.

If there are still one or more unprocessed var operands, each of the variables named by those operands shall be assigned an empty string.

Note that in the case where just one var operand is given on the read command line, the modified field splitting algorithm ceases after producing zero output fields and simply returns the original input field, with any leading and trailing IFS white space removed, as unsplit input. This unsplit input is assigned to the variable named by the var operand.
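
The three cases covered by this replacement text can be sketched as follows. This is only an illustration under the default IFS; the variable names and the sample line are assumptions made for the example.

```shell
unset IFS   # default field splitting on <space>, <tab>, <newline>
line='  one  two   three  four  '

# Three operands: two fields are split off, and the rest is returned
# unsplit with only its leading and trailing IFS white space removed.
read -r a b rest <<EOF
$line
EOF
printf '[%s] [%s] [%s]\n' "$a" "$b" "$rest"

# One operand: zero fields are split off, and the whole line (minus
# leading and trailing IFS white space) is assigned to the variable.
read -r only <<EOF
$line
EOF
printf '[%s]\n' "$only"

# IFS set to the empty string: no processing at all. The first variable
# receives the line unchanged and the others are set to the empty string.
IFS= read -r first second <<EOF
$line
EOF
printf '[%s] [%s]\n' "$first" "$second"
```

Note that the interior white space in the unsplit remainder is preserved; only the edges are trimmed.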


On page 3292 line 111900 section read (OPERANDS), after:
    
The name of an existing or nonexisting shell variable.

append:
    
If a var operand names the variable IFS, the behavior is unspecified.
    
    If a var operand names one of the variables LANG, LC_CTYPE, or LC_ALL and the new value assigned to the variable would change how the bytes in IFS form characters, or which characters in IFS are considered to be IFS white space (see [xref XCU 2.6.5]), it is unspecified what effects, if any, the change has on how read performs field splitting.

    
On page 3292 line 111902 section read (STDIN), change:
    
If the -d delim option is not specified, or if it is specified and delim consists of one single-byte character, the standard input shall contain zero or more characters and shall not contain any null bytes.

to:
    
If the -d delim option is not specified, or if it is specified and delim is not the null string, the standard input shall contain zero or more bytes (which need not form valid characters) and shall not contain any null bytes.

On page 3293 line 111959 section read (APPLICATION USAGE), change:
    
When the current locale is not the C or POSIX locale, pathnames can contain bytes that do not form part of a valid character, and therefore portable applications need to ensure that the current locale is the C or POSIX locale when using read with arbitrary pathnames as input. (If IFS is not set to the null string this applies even when using -d "", because the field splitting performed by read is a character-based operation.) When reading a pathname it is also inadvisable ...

to:
    
When reading a pathname it is inadvisable ...

After page 3293 line 111967 section read (APPLICATION USAGE), add:
    
Since the var operands are processed in the order specified on the command line, if any variable name is specified more than once as a var operand, the last assignment made is the one that is in effect when read returns, including when an empty string is assigned because no field data was available.
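
A small sketch of what this paragraph describes, for a shell that assigns the var operands in command-line order as the accepted text requires (the variable names are illustrative):

```shell
unset IFS   # default field splitting

# The name "a" appears twice. It is assigned "1" first and "3" last,
# so "3" is the value in effect when read returns; b receives "2".
read -r a b a <<'EOF'
1 2 3
EOF
printf 'a=[%s] b=[%s]\n' "$a" "$b"

# With only one field available, the first "c" receives "1", but the
# later operands (d, then c again) are assigned the empty string,
# so c ends up empty.
read -r c d c <<'EOF'
1
EOF
printf 'c=[%s] d=[%s]\n' "$c" "$d"
```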


On page 3293, lines 111979-111980, change:
    
Since read affects the current shell execution environment, it is generally provided as a shell regular built-in.

to:
    
Since read affects the current shell execution environment, it is required to be intrinsic.

- Issue History
Date Modified Username Field Change
2023-10-02 13:58 kre New Issue
2023-10-02 13:58 kre Name => Robert Elz
2023-10-02 13:58 kre Section => XCU 3/read
2023-10-02 13:58 kre Page Number => 3291-3294
2023-10-02 13:58 kre Line Number => 111869-111878, 111961-111963, 11946, 11979-11980
2023-10-02 14:00 kre Note Added: 0006500
2023-10-02 14:02 kre Note Edited: 0006500
2023-10-02 14:20 kre Note Added: 0006502
2023-10-02 14:22 kre Note Edited: 0006502
2023-10-02 14:24 kre Note Edited: 0006502
2023-10-02 14:30 kre Note Added: 0006503
2023-10-02 14:33 kre Note Edited: 0006502
2023-10-02 14:34 kre Note Edited: 0006502
2023-10-02 14:44 kre Note Edited: 0006500
2023-10-02 16:18 geoffclare Project 1003.1(2013)/Issue7+TC1 => Issue 8 drafts
2023-10-02 16:19 geoffclare version => Draft 3
2023-10-02 16:20 geoffclare Note Added: 0006507
2023-10-02 16:21 geoffclare Relationship added related to 0001649
2023-10-03 09:23 geoffclare Note Added: 0006509
2023-10-03 11:59 kre Note Added: 0006510
2023-10-03 12:11 kre Note Added: 0006511
2023-10-03 13:42 geoffclare Note Added: 0006512
2023-10-04 14:42 kre Note Added: 0006513
2023-10-04 17:36 kre Note Added: 0006514
2023-10-05 08:59 geoffclare Note Added: 0006515
2023-10-05 11:08 kre Note Added: 0006516
2023-10-05 14:53 geoffclare Note Added: 0006517
2023-10-05 15:34 kre Note Edited: 0006516
2023-10-05 15:35 kre Note Edited: 0006516
2023-10-05 15:36 kre Note Edited: 0006516
2023-10-05 15:38 kre Note Added: 0006519
2023-10-05 15:41 kre Note Edited: 0006513
2023-10-06 07:58 geoffclare Note Edited: 0006515
2023-10-06 21:41 chet_ramey Note Added: 0006524
2023-10-10 11:06 geoffclare Note Edited: 0006515
2023-10-10 14:36 geoffclare Note Added: 0006526
2023-10-10 17:19 kre Note Added: 0006527
2023-10-11 08:04 geoffclare Note Added: 0006528
2023-10-11 17:13 kre Note Added: 0006529
2023-10-12 08:30 geoffclare Note Added: 0006530
2023-10-12 18:33 kre Note Added: 0006531
2023-10-12 18:45 kre Note Edited: 0006516
2023-10-12 18:54 kre Note Added: 0006532
2023-10-16 09:02 geoffclare Note Added: 0006533
2023-10-16 09:03 geoffclare Note Edited: 0006533
2023-10-16 09:19 geoffclare Note Added: 0006534
2023-10-16 21:43 kre Note Added: 0006537
2023-10-16 22:10 kre Note Added: 0006538
2023-10-17 13:34 geoffclare Note Added: 0006543
2023-10-18 00:08 kre Note Added: 0006547
2023-10-19 14:05 geoffclare Note Added: 0006548
2023-10-19 16:33 kre Note Added: 0006550
2023-10-19 16:37 kre Note Edited: 0006550
2023-11-27 17:29 nick Note Added: 0006592
2023-11-27 17:30 nick Final Accepted Text => See Note: 0006592
2023-11-27 17:30 nick Status New => Resolved
2023-11-27 17:30 nick Resolution Open => Accepted As Marked
2023-11-27 17:31 nick Tag Attached: issue8
2023-11-30 16:18 nick Relationship added child of 0001779
2023-11-30 16:20 nick Relationship replaced parent of 0001779
2023-12-07 14:20 geoffclare Status Resolved => Applied
2023-12-07 14:20 geoffclare Tag Attached: applied_after_i8d3
2024-06-11 09:12 agadmin Status Applied => Closed

