Austin Group Defect Tracker

Aardvark Mark III


ID 0000249
Category [1003.1(2008)/Issue 7] Shell and Utilities
Severity Objection
Type Enhancement Request
Date Submitted 2010-04-30 21:42
Last Update 2010-11-11 16:44
Reporter dwheeler
View Status public
Assigned To ajosey
Priority normal
Resolution Accepted As Marked
Status Resolved
Name David A. Wheeler
Organization
User Reference
Section 2.2 Quoting
Page Number 2298-2299
Line Number 72348-72401
Interp Status ---
Final Accepted Text See Note: 0000590
Summary 0000249: Add standard support for $'...' in shell
Description It is annoyingly difficult to set shell variables with certain characters, for example, to set a shell variable to a value that ends with "\n". This is particularly a problem when manipulating IFS, but the problem occurs in many other situations too. One reason is because command substitution consumes all trailing newlines, so constructs like this don't work:
 IFS=$(printf "\n")

It's possible to work around this (e.g., with clever parameter expansions), but the work-arounds are ugly and error-prone. It would be better to support simple, clear scripts when you want to insert characters like newline.
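The problem and one common work-around can be seen in a few lines. This is a minimal sketch, assuming a shell that already implements $'...' (for example bash, ksh93, or zsh); the variable names are illustrative only.

```shell
# Command substitution strips all trailing newlines, so this is empty.
stripped=$(printf '\n')
echo "stripped length: ${#stripped}"

# Work-around: append a sentinel character, then remove it again.
nl=$(printf '\nx')
nl=${nl%x}
echo "work-around length: ${#nl}"

# With $'...' the newline can be written directly.
direct=$'\n'
echo "direct length: ${#direct}"
```

The sentinel approach works in any POSIX shell, but it takes two steps and is easy to get wrong; the $'...' form states the intent in one assignment.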

Thus, many shells add the $'...' construct, which is *widely* supported. It is supported by ksh, zsh, bash, and the busybox "sh" command at *least*. This widespread support suggests that this is useful and easy to implement. The fact that even *busybox* supports it, even though they focus on being very small, suggests that this is *important* functionality.

As the Korn shell documentation says, "the purpose of $'...' is to solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all "..." strings handle ANSI-C escapes, but that would not be backwards compatible."
http://kornshell.com/doc/faq.html

I used bash's list of constructs supported in $'...'. I think supporting hex values is a good one. \cX constructs are less necessary, but since they're in bash, we may as well include them too (they are trivial to support).

Desired Action Page 2299, line 72401: At the end of section 2.2, add a new subsection, "2.2.4 Dollar-single-quotes", with this text:

A word beginning with dollar-single-quote ($'), continuing all the way to an unescaped matching single-quote ('), shall preserve the literal value of all characters within, with the exception of certain backslash escape sequences.
The defined backslash escape sequences are:
              \a alert (bell)
              \b backspace
              \e an escape character
              \f form feed
              \n new line
              \r carriage return
              \t horizontal tab
              \v vertical tab
              \\ backslash
              \' single quote
              \nnn the eight-bit character whose value is the octal value
                     nnn (one to three digits)
              \xHH the eight-bit character whose value is the hexadecimal
                     value HH (one or two hex digits)
              \cx a control-x character; \cA is 1, while \cZ is 26.
If a backslash is followed by any other sequence, the result is implementation-defined.

The expanded result is treated as if it was single-quoted (as if the dollar sign had not been present); in particular, it is *not* split using the IFS values.



On page 2298, after line 72385, add:
Within the string of characters from an enclosed "$'" to the matching "'", processing shall occur as described in section 2.2.4.

In 0000049, need to add to the end of the list:
      * a <single-quote>
See: http://austingroupbugs.net/view.php?id=49

Tags c99, issue8
Attached Files n1534.htm (1,668 bytes) 2010-11-05 14:54
n1534_original.htm (3,813 bytes) 2010-11-05 14:56

- Relationships
related to 0000322 (Closed, ajosey) 1003.1(2004)/Issue 6 Defect in XCU File Format Notation

-  Notes
(0000548)
nick (manager)
2010-09-16 16:17

The C standard also includes Universal Character constants, and these should also be considered.

  6.4.3 Universal character names
  Syntax
1 universal-character-name:
                   \u hex-quad
                   \U hex-quad hex-quad
           hex-quad:
                   hexadecimal-digit hexadecimal-digit
                                  hexadecimal-digit hexadecimal-digit
  Constraints
2 A universal character name shall not specify a character whose short identifier is less than
  00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through
  DFFF inclusive.
  Description
3 Universal character names may be used in identifiers, character constants, and string
  literals to designate characters that are not in the basic character set.
  Semantics
4 The universal character name \Unnnnnnnn designates the character whose eight-digit
  short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal
  character name \unnnn designates the character whose four-digit short identifier is nnnn
  (and whose eight-digit short identifier is 0000nnnn).
(0000560)
geoffclare (manager)
2010-10-01 12:48

So far, discussion of this bug has been about which escapes to include.
There are some problems with other aspects of the proposed text that
need to be sorted out. The ones I have spotted are:

Use of "A word beginning" is wrong, as the expansion can be used
within a word. E.g. a$'b'c expands to abc.

The sentence beginning "The expanded result is treated as if it was
single-quoted" needs to use "shall", and the rest of that sentence
should either be removed or reworked.

This addition to 2.2.3 Double-Quotes is not right:

    On page 2298, after line 72385, add:
    Within the string of characters from an enclosed "$'" to the
    matching "'", processing shall occur as described in section 2.2.4.

since ksh and bash do not expand $'...' within double quotes.
A possible alternative might be to append the following to line 72376:

    The <dollar-sign> shall not retain its special meaning introducing
    the dollar-single-quotes form of quoting (see [xref to 2.2.4]).

Finally an additional change is needed in 2.3 Token Recognition in
list item 4.
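The two points about placement can be checked directly. This is a small sketch, assuming a shell with the ksh/bash behaviour described in this note; the variable names are illustrative.

```shell
# Outside double quotes, $'...' may appear in the middle of a word.
unquoted=a$'b'c

# Inside double quotes, the characters $'b' are left as they are.
quoted="a$'b'c"

printf '%s\n' "$unquoted" "$quoted"
```

In such a shell the first line printed is abc, and the second still contains the dollar sign and both single-quotes.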
(0000565)
mirabilos (reporter)
2010-10-08 17:28

The following documents existing practice for $'…' in mksh, called “C style” there to distinguish from the printf builtin’s escape sequence parsing (where, for example, \123 is invalid and must be written as \0123).

Supported escapes:
\a \b \c# \E \e \f \n \r \t \v \### \U######## \u#### \x#… \' \\

Notes on some:

\c# where # is ? will yield 0x7F, every other octet is ANDed with 0x1F (using non-ASCII chars is undefined behaviour as I plan to let mksh use wide characters internally in $some $large time)

\E and \e yield 0x1B

\### looks for one, two or three octal digits and converts to octet, stops if encountering a non-digit octet

\U and \u look for up to eight/four hexadecimal digits and convert to UTF-8 (actually, mksh only supports the BMP though; may as well be CESU-8)

\x in “C style” mode gobbles up as many hexadecimal characters as it can (doesn’t just eat one or two) stopping at a non-digit octet and converts to octets if <= 0xFF, to UTF-8 (see above) otherwise

\' allows this to work: x=a$'\''b

\\ should be self-explanatory

Every other escaped character \X is replaced with itself X (although the amount of supported escape sequences may vary, so it’s in practice UD too). (The print builtin mode excludes e.g. \? from this for compatibility reasons.)

This means we have the following differences:

• \E is new (from GNU bash… although why POSIX (GNU’s Not Unix) would look at it is beyond me), which counts as UD, so is ok

• \x doesn’t terminate after two hex digits; I’d like to make this a user error and standardise only for the case where the third octet after \x is NOT in the range [0-9a-fA-F] for the sake of compatibility

• \u supports manual creation of surrogates in CESU-8 encoding, they’re not forbidden

• \U for values outside the BMP is undesirable in mksh (but this can be viewed as limitation of the operating environment; while mksh doesn’t use the OS’ implementation it shares MirBSD’s which uses an encoding based on CESU-8 internally and has a 16-bit wchar_t)

• in practice, \u/\U > 0xFFFD yield 0xFFFD (0xFFFE and 0xFFFF are invalid anyway)

• since this escape method is more generic than identifiers, all values from 0x0000 (inclusive) and up are supported (this may be desirable)

In mksh, a$'b'c expands to abc and "a$'b'c" expands to a$'b'c as suggested above as well.

There’s a sort of testsuite at https://www.mirbsd.org/cvs.cgi/src/bin/mksh/check.t?rev=HEAD – search for “dollar-quoted-strings” and follow it down, including “dollar-quotes-in-heredocs” (use <<$'…' delimiters) and “dollar-quotes-in-herestrings” (for the <<< extension). These are somewhat modelled after ksh93 while looking at GNU bash and zsh, with some real-life testing added.

---

So, to summarise, I’d like to get the following as implementation-defined behaviour:

When a sequence backslash 'x' is followed by two octets in the range [0-9a-fA-F], the behaviour if the next (third) octet is also in the range [0-9a-fA-F].

The rest is merely for consideration; it might be wise to not limit the u and U escapes to characters valid for identifiers.

From a lesson learned with JSON, it should be explicitly stated that the octet following the backslash character is matched case-sensitively. (Do we assume ASCII here?)
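Scripts can sidestep the \x ambiguity by closing the quoted string after two hex digits. The following is a sketch only, assuming an ASCII-based locale and a shell that implements $'...'; the variable name is illustrative.

```shell
# Goal: the byte 0x41 ('A') followed by a literal 'B', which is itself a hex digit.
# $'\x41B' would be read as \x41 + B by a two-digit shell and as \x41B by a greedy one.
# Ending the $'...' string after two digits gives the same result in both.
portable=$'\x41'B
printf '%s\n' "$portable"
```

The literal B sits outside the quotes, so it can never be taken as a third hex digit.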
(0000590)
Don Cragun (manager)
2010-10-25 06:17
edited on: 2010-12-09 16:12

The following is a complete replacement for the Desired Action to
be taken to resolve this issue. It is based on other notes in this
bug report, numerous e-mail messages on the austin-group-l
alias, and discussions held during Austin Group conference calls:

In XCU section 2.2 (Quoting) change:
        "The various quoting mechanisms are the escape character,
        single-quotes, and double-quotes."
on P2298, L72359 to:
        "The various quoting mechanisms are the escape character,
        single-quotes, double-quotes, and dollar-single-quotes."

Add the following new paragraph to XCU section 2.2.3 (Double-Quotes)
after P2298, L72385 (indented to the same level as the paragraph
on L72382-72385):
        "The <dollar-sign> shall not retain its special meaning
        introducing the dollar-single-quotes form of quoting (see
        [xref to 2.2.4])."

In XCU 2.2.3 (Double-Quotes) change:
        "A single-quoted or double-quoted string that begins, but
        does not end, within the "`...`" sequence"
on P2299, L72392-72393 to:
        "A quoted (single-quoted, double-quoted, or dollar-single-quoted)
        string that begins, but does not end, within the "`...`"
        sequence"

At the end of XCU section 2.2 after P2299, L72401 add a new subsection,
with this text:
   2.2.4 Dollar-Single-Quotes
        A sequence of characters starting with a <dollar-sign>
        immediately followed by a single-quote ($') shall preserve
        the literal value of all characters up to an unescaped
        terminating single-quote ('), with the exception of certain
        backslash escape sequences, as follows:
                \" yields a <quotation-mark> (double-quote) character.
                \' yields an <apostrophe> (single-quote) character.
                \\ yields a <backslash> character.
                \a yields an <alert> character.
                \b yields a <backspace> character.
                \e yields an <ESC> character.
                \f yields a <form-feed> character.
                \n yields a <newline> character.
                \r yields a <carriage-return> character.
                \t yields a <tab> character.
                \v yields a <vertical-tab> character.
                \cX yields the control character listed in the Value
                       column of Table 4.21 on page 3203 (xref to
                       XCU Table 4.21) in the Operands section of
                       the stty utility when X is one of the
                       characters listed in the ^c column of the
                       same table, except that \c\\ yields the
                       <FS> control character since the <backslash>
                       character must be escaped.
                \uXXXX yields the character specified by ISO/IEC 10646
                       whose four-digit short universal character
                       name is XXXX (and whose eight-digit short
                       universal character name is 0000XXXX).
                \UXXXXXXXX yields the character specified by ISO/IEC
                       10646 whose eight-digit short universal
                       character name is XXXXXXXX.
                \xXX yields the byte whose value is the hexadecimal
                       value XX (one or two hex digits)
                \XXX yields the byte whose value is the octal value XXX
                       (one to three octal digits)

        If a \xXX or \XXX escape sequence yields a byte whose value
        is 0, that byte and any following regular characters and
        escape sequences up to the terminating unescaped single-quote
        shall be evaluated and discarded.

        If a universal character name specified by \uXXXX or
        \UXXXXXXXX is less than 00A0 other than 0024 ($), 0040 (@),
        or 0060('), the results are undefined.  If a universal
        character name specified by \uXXXX or \UXXXXXXXX is in the
        range D800 through DFFF inclusive, the results are undefined.

        If more than two hexadecimal digits immediately follow \x or
        if the octal value specified by \XXX will not fit in a byte, the
        results are unspecified.

        If a backslash is followed by any sequence not listed above
        or if \e, \cX, \uXXXX, or \UXXXXXXXX yields a character
        that is not present in the current locale, the result is
        implementation-defined.

        If a backslash escape sequence represents a single-quote
        character (for example \u0060, \U00000060, or \'), that
        sequence shall not terminate the dollar-single-quote
        sequence.
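A few of the listed escapes can be exercised as follows. This is a minimal sketch, assuming an ASCII-based locale and a shell that implements $'...' as described above; the variable names are illustrative, and od is used only to make the bytes visible.

```shell
# Named escapes, an escaped single-quote, and an escaped backslash.
s=$'tab:\there\nquote:\'\\'

# Octal (\101) and two-digit hexadecimal (\x42) byte escapes.
ab=$'\101\x42'

# Dump the bytes of s so that the control characters can be inspected.
printf '%s' "$s" | od -An -c
printf '%s\n' "$ab"
```

The \' inside the string does not end the quoting; only the final unescaped single-quote does.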

In XCU section 2.3 (Token Recognition) point 4, change:
        "If the current character is <backslash>, single-quote, or
        double-quote and it is not quoted, it shall affect quoting
        for subsequent characters up to the end of the quoted text.
        The rules for quoting are as described in Section 2.2 (on
        page 2298). During token recognition no substitutions shall
        be actually performed, and the result token shall contain
        exactly the characters that appear in the input (except for
        <newline> joining), unmodified, including any embedded or
        enclosing quotes or substitution operators, between the
        <quotation-mark> and the end of the quoted text. The token
        shall not be delimited by the end of the quoted field."
on P2299-2300, L72425-72432 to:
        "If the current character is an unquoted <backslash>,
        single-quote, or double-quote or is the first character of
        an unquoted <dollar-sign> single-quote sequence, it shall
        affect quoting for subsequent characters up to the end of
        the quoted text.  The rules for quoting are as described
        in Section 2.2 (on page 2298). During token recognition no
        substitutions shall be actually performed, and the result
        token shall contain exactly the characters that appear in
        the input (except for <newline> joining), unmodified,
        including any embedded or enclosing quotes or substitution
        operators, between the start and the end of the quoted text.
        The token shall not be delimited by the end of the quoted
        field."

In XCU section 2.6, add:
        * a <single-quote>
to the end of the list created by the changes identified in Mantis
BugID 49.  (BugID 49 modifies text on P2305, L72669-72671 in XCU7.)

In XCU 2.6.3 (Command Substitution) change:
        "A single-quoted or double-quoted string that begins, but
        does not end, within the "`...`" sequence produces undefined
        results."
on P2309, L72827-72829 to:
        "A quoted string that begins, but does not end, within the
        "`...`" sequence produces undefined results."

In XCU section 2.6.7 (Quote Removal), change:
        "The quote characters (<backslash>, single-quote, and
        double-quote) that were present in the original word shall
        be removed unless they have themselves been quoted."
on P2311, L72904-72905 to:
        "The quote character sequence <dollar-sign> single-quote
        and the characters (<backslash>, single-quote, and
        double-quote) that were present in the original word shall
        be removed unless they have themselves been quoted."

In the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1, change:
        "Because at this point <quotation-mark> characters are
        retained in the token, quoted strings cannot be recognized
        as reserved words."
on P2325, L73436-73437 to:
        "Because at this point quoting characters (<backslash>,
        single-quote, <quotation-mark>, and the <dollar-sign>
        single-quote sequence) are retained in the token, quoted
        strings cannot be recognized as reserved words."

Add a new section in XRAT following P3649, L124068:
        <italic>C.2.2.4   Dollar-Single-Quotes</italic>
                The $'...' quoting construct has been implemented
                in several recent shells.  It is similar to
                character string literals ("...") in the C Standard
                with the following exceptions:
                1.  The \x escape sequence in C can be followed by an
                     arbitrary number of hexadecimal digits.  The
                     ksh93 implementation of $'...' also consumes
                     an arbitrary number of hexadecimal digits; bash
                     consumes at most two hexadecimal digits in this
                     case.  This standard leaves the result unspecified
                     if more than two hexadecimal digits follow \x.
                     (Note that a hexadecimal escape followed by a
                     literal hexadecimal character can always be
                     represented as $'\xXX'X.)

                2.  The \c escape sequence is not included in the C
                    standard.  There was also some disagreement in
                    shells that historically supported \c escape
                    sequences in $'...'.  These include:
                    A.  whether \cA through \cZ produced the byte values
                         1 through 26, respectively or supported the
                         codeset independent control character as
                         specified by the stty utility.  This standard
                         requires codeset independence.
                    B.  whether \c[, \c\\, \c], \c^, \c_, and \c? could
                         be used to yield the <ESC>, <FS>, <GS>,
                         <RS>, <US>, and <DEL> control characters,
                         respectively.  This standard requires support
                         for all of the control characters except
                         NULL (matching what is done in the stty
                         utility).
                    C.  whether \c\\ or \c\ was used to represent <FS>.
                         This standard requires \c\\ to make backslash
                         escape processing consistent.

                    The implementors of the most common shells that
                    implement $'\cX' agreed to convert to the
                    behavior specified in this standard.

                    Some shells also allow \c<arbitrary_control_character>
                    to act as an inverse function to \cX (i.e., \cm
                    and \cM yield <CR> and \c<CR> yields m or M).
                    This standard leaves this behavior
                    implementation-defined.

                3.  The \e escape sequence is not included in the C
                    Standard, but was provided by all historical
                    shells that supported $'...'.  Some also supported
                    \E as a synonym.  One member of the group
                    objected to adding \e because the <ESC> control
                    character is not required to be in the portable
                    character set.  The \e sequence is included
                    because many historical users of $'...' expect
                    it to be there.  The \E sequence is not included
                    in this standard because <backslash> escape
                    sequences that start with <backslash> followed
                    by an uppercase letter (except \U) are reserved
                    by the C Standard for implementation use.

                4.  The \XXX octal escape sequence and the \xXX
                    hexadecimal escape sequence can be used to
                    insert a null byte into a C Standard character
                    string literal and into a $'...' quoted word
                    in this standard.  In C, any characters specified
                    after that null byte (including escape sequences)
                    continue to be processed and added to the
                    expansion of the character string literal.  In
                    $'...' in the shell, however, this standard
                    requires the null byte and all remaining
                    characters up to the terminating unescaped
                    single-quote be evaluated and discarded.  (This
                    was historic practice in bash, but not in ksh93.)
                    This allows an escape sequence producing a null
                    byte to terminate the dollar-single-quoted
                    expansion, but not terminate the token in which
                    it appears if there are characters remaining
                    in the token.  For example: 
                        printf a$'b\0c\''d
                    is required by this standard to produce:
                        abd
                    while historic versions of ksh93 produced:
                        ab
                5.  The double-quote (") character can be used
                    literally, while the single-quote (') character must
                    be represented as an escape sequence.  In C,
                    single-quote can be used literally, while
                    double-quote requires an escape sequence.
                In C, if a character specified by \uXXXX or \UXXXXXXXX
                in "..." does not exist in the execution character
                set, the results are implementation-defined.
                Similarly, this standard makes the results
                implementation-defined if \e, \cX, \uXXXX, or
                \UXXXXXXXX  specifies a character that is not present
                in the current locale.

                Note that the escape sequences recognized by $'...',
                file format notation (see Table 5-1 on page 121),
                XSI-conforming implementations of the echo utility
                (see the echo utility's operands section on page
                2615), and the printf utility's format operand (see
                the printf utility's extended description section
                on page 3050) are not the same.  Some escape sequences
                are not recognized by all of the above; the \c
                escape sequence in echo is not at all like the \c
                escape sequence in $'...'; and octal escape sequences
                in some of the above accept one to four octal digits
                and require a leading zero while others accept one
                to three octal digits and do not require a leading
                zero.
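The octal difference mentioned above can be sketched as follows. This assumes an ASCII-based locale and a shell that implements $'...'; the variable names are illustrative.

```shell
# printf %b operand: octal escapes are written \0 followed by up to three digits.
from_printf=$(printf '%b' '\0101')

# $'...': one to three octal digits, with no leading zero required.
from_dollar=$'\101'

printf '%s %s\n' "$from_printf" "$from_dollar"
```

Both forms are meant to produce the same byte (octal 101) here, but the spellings are not interchangeable between the two contexts.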


(0000599)
nick (manager)
2010-11-04 16:06

The issue of adding \cX and \e was discussed at the WG14 meeting. A paper to add these character constants was requested and submitted (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm). However, since this paper was in response to a late liaison statement, objection to considering it was raised.

The committee agreed that POSIX was free to use these characters, and encouraged resubmission of N1534 as a liaison ballot comment, probably reworded to be an update to the future directions section. WG14 has indicated that POSIX use of the "Future Standardization" is satisfactory.
(0000601)
Don Cragun (manager)
2010-11-05 03:00
edited on: 2010-11-05 03:04

Note: 0000590 was updated today to account for another difference in the way bash and
ksh93 processed $'...'. If an escape sequence expanded to a null byte, bash terminated
the $'...' expansion, but still concatenated any following characters that appeared after
the $'...' expansion that were in the same lexical token. On the other hand, ksh93
terminated the entire token in this case. The ksh93 developers agreed to change to
match bash in this case if the standard requires this behavior.
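The behaviour described above can be sketched as follows, assuming a shell that follows the bash practice this note refers to; the variable name is illustrative.

```shell
# The \0 ends the $'...' expansion, so c and the escaped quote are discarded.
# The d after the closing quote is still part of the same word.
word=a$'b\0c\''d
printf '%s\n' "$word"
```

A shell with the historic ksh93 behaviour would instead end the whole word at the null byte.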

Also note that if the C1x standard adopts the changes suggested in the document
referenced in Note: 0000599, the "\e, " in the last paragraph of the newly
created 2.2.4 section can be removed (since the <ESC> character would be
moved into the basic character set in C1x). If this happens we should also consider
adding \e to XBD Table 5-1 and checking the places in the standard that refer to
that table to determine whether \e processing should also be required in those places.

(0000609)
nick (manager)
2010-11-05 14:52

This issue was presented to WG14 (C). The initial reaction was to request a paper to add \cX and \e as character constants (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm, attached). This paper was subsequently procedurally rejected as it was late and made fundamental differences, which might also impact C++.

Instead, the paper was revised to propose adding these characters to the Future Directions section, and a liaison comment on the CD ballot to ask for this would be appropriate.

The WG14 committee agreed that there was no problem in POSIX adding these characters to the shell, and did not object to the use of the namespace.

- Issue History
Date Modified Username Field Change
2010-04-30 21:42 dwheeler New Issue
2010-04-30 21:42 dwheeler Status New => Under Review
2010-04-30 21:42 dwheeler Assigned To => ajosey
2010-04-30 21:42 dwheeler Name => David A. Wheeler
2010-04-30 21:42 dwheeler Section => 2.2 Quoting
2010-04-30 21:42 dwheeler Page Number => 2298-2299
2010-04-30 21:42 dwheeler Line Number => 72348-72401
2010-09-16 16:17 nick Note Added: 0000548
2010-09-18 18:12 Don Cragun Relationship added related to 0000322
2010-10-01 12:48 geoffclare Note Added: 0000560
2010-10-06 01:26 nick Tag Attached: c99
2010-10-08 17:28 mirabilos Note Added: 0000565
2010-10-25 06:17 Don Cragun Note Added: 0000590
2010-10-25 14:51 Don Cragun Note Edited: 0000590
2010-10-25 15:55 Don Cragun Note Edited: 0000590
2010-10-26 06:44 Don Cragun Note Edited: 0000590
2010-10-26 20:39 Don Cragun Note Edited: 0000590
2010-10-26 20:40 Don Cragun Note Edited: 0000590
2010-10-26 20:40 Don Cragun Note Edited: 0000590
2010-10-26 20:45 Don Cragun Note Edited: 0000590
2010-10-26 21:04 Don Cragun Note Edited: 0000590
2010-10-27 03:29 Don Cragun Note Edited: 0000590
2010-11-04 16:07 nick Note Added: 0000599
2010-11-05 02:34 Don Cragun Note Edited: 0000590
2010-11-05 03:00 Don Cragun Note Added: 0000601
2010-11-05 03:04 Don Cragun Note Edited: 0000601
2010-11-05 14:52 nick Note Added: 0000609
2010-11-05 14:54 nick File Added: n1534.htm
2010-11-05 14:56 nick File Added: n1534_original.htm
2010-11-11 16:40 Don Cragun Note Edited: 0000590
2010-11-11 16:42 Don Cragun Note Edited: 0000590
2010-11-11 16:44 Don Cragun Interp Status => ---
2010-11-11 16:44 Don Cragun Final Accepted Text => See Note: 0000590
2010-11-11 16:44 Don Cragun Status Under Review => Resolved
2010-11-11 16:44 Don Cragun Resolution Open => Accepted As Marked
2010-11-11 16:44 Don Cragun Tag Attached: issue8
2010-12-09 16:12 Don Cragun Note Edited: 0000590

