Austin Group Defect Tracker

Aardvark Mark III


Viewing Issue Simple Details
ID 0000249
Category [1003.1(2008)/Issue 7] Shell and Utilities
Severity Objection
Type Enhancement Request
Date Submitted 2010-04-30 21:42
Last Update 2016-06-09 18:42
Reporter dwheeler
View Status public
Assigned To ajosey
Priority normal
Resolution Reopened
Status Under Review
Name David A. Wheeler
Organization
User Reference
Section 2.2 Quoting
Page Number 2298-2299
Line Number 72348-72401
Interp Status ---
Final Accepted Text Previously accepted text is in Note: 0000590.
Summary 0000249: Add standard support for $'...' in shell
Description It is annoyingly difficult to set shell variables with certain characters, for example, to set a shell variable to a value that ends with "\n". This is particularly a problem when manipulating IFS, but the problem occurs in many other situations too. One reason is because command substitution consumes all trailing newlines, so constructs like this don't work:
 IFS=$(printf "\n")

It's possible to work around this (e.g., with clever parameter expansions), but the work-arounds are ugly and error-prone. It would be better to support simple, clear scripts when you want to insert characters like newline.
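
To make the problem concrete, here is a minimal sketch of the failure and of one common workaround, using only features already in POSIX sh. The variable names (stripped, nl) and the sentinel character x are illustrative choices, not part of the report:

```shell
# Command substitution strips every trailing newline, so this ends up empty.
stripped=$(printf '\n')

# Workaround: emit a sentinel character after the newline, then strip the
# sentinel with a suffix-removal parameter expansion.
nl=$(printf '\nx')
nl=${nl%x}

printf 'stripped length: %s\n' "${#stripped}"
printf 'nl length: %s\n' "${#nl}"
```

The second form works, but it is exactly the kind of indirect construct the report calls ugly and error-prone.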

Thus, many shells add the $'...' construct, which is *widely* supported. It is supported by ksh, zsh, bash, and the busybox "sh" command at *least*. This widespread support suggests that this is useful and easy to implement. The fact that even *busybox* supports it, even though they focus on being very small, suggests that this is *important* functionality.

As the korn shell documentation says, "the purpose of $'...' is to solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all "..." strings handle ANSI-C escapes, but that would not be backwards compatible."
http://kornshell.com/doc/faq.html

I used bash's list of constructs supported in $'...'. I think supporting hex values is a good one. \cX constructs are less necessary, but since they're in bash, may as well include it too (it's trivial to support).

Desired Action Page 2299, line 72401: At the end of section 2.2, add a new subsection, "2.2.4 Dollar-single-quotes", with this text:

A word beginning with dollar-single-quote ($'), continuing all the way to an unescaped matching single-quote ('), shall preserve the literal value of all characters within, with the exception of certain backslash escape sequences.
The defined backslash escape sequences are:
              \a alert (bell)
              \b backspace
              \e an escape character
              \f form feed
              \n new line
              \r carriage return
              \t horizontal tab
              \v vertical tab
              \\ backslash
              \' single quote
              \nnn the eight-bit character whose value is the octal value
                     nnn (one to three digits)
              \xHH the eight-bit character whose value is the hexadecimal
                     value HH (one or two hex digits)
              \cx a control-x character; \cA is 1, while \cZ is 26.
If a backslash is followed by any other sequence, the result is implementation-defined.

The expanded result is treated as if it was single-quoted (as if the dollar sign had not been present); in particular, it is *not* split using the IFS values.
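
As a rough illustration of the proposed construct, the sketch below sets IFS to a single newline and splits a two-line value with it. It assumes a shell that already implements $'...' and performs field splitting on unquoted expansions (bash is the assumed target); the variable names are illustrative only:

```shell
# Assumes bash (or another shell that already implements $'...').
old_ifs=$IFS
IFS=$'\n'                 # IFS is now exactly one newline character

lines=$'first line\nsecond line'
set -f                    # disable pathname expansion while splitting
set -- $lines             # unquoted on purpose: split on newline only
set +f
IFS=$old_ifs

count=$#
first=$1
printf '%s fields; first field: %s\n' "$count" "$first"
```

The embedded spaces survive because only the newline is an IFS character while the split happens.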



On page 2298, after line 72385, add:
Within the string of characters from an enclosed "$'" to the matching "'", processing shall occur as described in section 2.2.4.

In 0000049, need to add to the end of the list:
      * a <single-quote>
See: http://austingroupbugs.net/view.php?id=49

Tags c99, issue8, UTF-8_Locale
Attached Files n1534.htm (1,668 bytes) 2010-11-05 14:54
n1534_original.htm (3,813 bytes) 2010-11-05 14:56

- Relationships
related to 0000322 (Closed, ajosey) 1003.1(2004)/Issue 6: Defect in XCU File Format Notation
related to 0000985 (Interpretation Required) 1003.1(2013)/Issue7+TC1: quote removal missing from case statement patterns and alternative expansions

-  Notes
(0000548)
nick (manager)
2010-09-16 16:17

The C standard also includes Universal Character constants, and these should also be considered.

  6.4.3 Universal character names
  Syntax
1 universal-character-name:
                   \u hex-quad
                   \U hex-quad hex-quad
           hex-quad:
                   hexadecimal-digit hexadecimal-digit
                                  hexadecimal-digit hexadecimal-digit
  Constraints
2 A universal character name shall not specify a character whose short identifier is less than
  00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through
  DFFF inclusive.
  Description
3 Universal character names may be used in identifiers, character constants, and string
  literals to designate characters that are not in the basic character set.
  Semantics
4 The universal character name \Unnnnnnnn designates the character whose eight-digit
  short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal
  character name \unnnn designates the character whose four-digit short identifier is nnnn
  (and whose eight-digit short identifier is 0000nnnn).
(0000560)
geoffclare (manager)
2010-10-01 12:48

So far, discussion of this bug has been about which escapes to include.
There are some problems with other aspects of the proposed text that
need to be sorted out. The ones I have spotted are:

Use of "A word beginning" is wrong, as the expansion can be used
within a word. E.g. a$'b'c expands to abc.

The sentence beginning "The expanded result is treated as if it was
single-quoted" needs to use "shall", and the rest of that sentence
should either be removed or reworked.

This addition to 2.2.3 Double-Quotes is not right:

    On page 2298, after line 72385, add:
    Within the string of characters from an enclosed "$'" to the
    matching "'", processing shall occur as described in section 2.2.4.

since ksh and bash do not expand $'...' within double quotes.
A possible alternative might be to append the following to line 72376:

    The <dollar-sign> shall not retain its special meaning introducing
    the dollar-single-quotes form of quoting (see [xref to 2.2.4]).

Finally an additional change is needed in 2.3 Token Recognition in
list item 4.
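
The two behaviours described above (expansion inside a word, no expansion inside double quotes) can be checked with a short sketch. It assumes bash, which the note names as one of the shells behaving this way; the variable names are illustrative:

```shell
# Assumes bash.
unquoted=a$'b'c        # $'b' acts as quoting inside a larger word
quoted="a$'b'c"        # inside double quotes, $' has no special meaning

printf '%s\n' "$unquoted"
printf '%s\n' "$quoted"
```

The first line printed is abc; the second keeps the dollar sign and both single quotes literally.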
(0000565)
mirabilos (reporter)
2010-10-08 17:28

The following documents existing practice for $'…' in mksh, called “C style” there to distinguish from the printf builtin’s escape sequence parsing (where, for example, \123 is invalid and must be written as \0123).

Supported escapes:
\a \b \c# \E \e \f \n \r \t \v \### \U######## \u#### \x#… \' \\

Notes on some:

\c# where # is ? will yield 0x7F, every other octet is ANDed with 0x1F (using non-ASCII chars is undefined behaviour as I plan to let mksh use wide characters internally in $some $large time)

\E and \e yield 0x1B

\### looks for one, two or three octal digits and converts to octet, stops if encountering a non-digit octet

\U and \u look for up to eight/four hexadecimal digits and convert to UTF-8 (actually, mksh only supports the BMP though; may as well be CESU-8)

\x in “C style” mode gobbles up as many hexadecimal characters as it can (doesn’t just eat one or two) stopping at a non-digit octet and converts to octets if <= 0xFF, to UTF-8 (see above) otherwise

\' allows this to work: x=a$'\''b

\\ should be self-explanatory

Every other escaped character \X is replaced with itself X (although the amount of supported escape sequences may vary, so it’s in practice UD too). (The print builtin mode excludes e.g. \? from this for compatibility reasons.)

This means we have the following differences:

• \E is new (from GNU bash… although why POSIX (GNU’s Not Unix) would look at it is beyond me), which counts as UD, so is ok

• \x doesn’t terminate after two hex digits; I’d like to make this a user error and standardise only for the case where the third octet after \x is NOT in the range [0-9a-fA-F] for the sake of compatibility

• \u supports manual creation of surrogates in CESU-8 encoding, they’re not forbidden

• \U for values outside the BMP is undesirable in mksh (but this can be viewed as limitation of the operating environment; while mksh doesn’t use the OS’ implementation it shares MirBSD’s which uses an encoding based on CESU-8 internally and has a 16-bit wchar_t)

• in practice, \u/\U > 0xFFFD yield 0xFFFD (0xFFFE and 0xFFFF are invalid anyway)

• since this escape method is more generic than identifiers, all values from 0x0000 (inclusive) and up are supported (this may be desirable)

In mksh, a$'b'c expands to abc and "a$'b'c" expands to a$'b'c as suggested above as well.

There’s a sort of testsuite at https://www.mirbsd.org/cvs.cgi/src/bin/mksh/check.t?rev=HEAD – search for “dollar-quoted-strings” and follow it down, including “dollar-quotes-in-heredocs” (use <<$'…' delimiters) and “dollar-quotes-in-herestrings” (for the <<< extension). These are somewhat modelled after ksh93 while looking at GNU bash and zsh, with some real-life testing added.

---

So, to summarise, I’d like to get the following as implementation-defined behaviour:

When a sequence backslash 'x' is followed by two octets in the range [0-9a-fA-F], the behaviour if the next (third) octet is also in the range [0-9a-fA-F].

The rest is merely for consideration; it might be wise to not limit the u and U escapes to characters valid for identifiers.

From a lesson learned with JSON, it should be explicitly stated that the octet following the backslash character is matched case-sensitively. (Do we assume ASCII here?)
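
A small sketch of the escapes discussed in this note, written so that it does not depend on how many hexadecimal digits \x consumes. It assumes bash and an ASCII-based codeset; the variable names are illustrative:

```shell
# Assumes bash and an ASCII-based codeset.
x=a$'\''b              # \' embeds a single-quote: a'b
octal=$'\101'          # one to three octal digits: A
hex=$'\x41''B'         # end the quote after two hex digits, then a literal B

printf '%s\n' "$x" "$octal" "$hex"
```

Closing the $'...' right after the two hexadecimal digits sidesteps the question of whether a third hexadecimal character would be consumed, so the last assignment should yield AB both in shells that stop at two digits and in shells that keep reading.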
(0000590)
Don Cragun (manager)
2010-10-25 06:17
edited on: 2015-08-20 15:25

NB: this note has been revised following the 2015-08-20 conference call:

The following is a complete replacement for the Desired Action to
be taken to resolve this issue. It is based on other notes in this
bug report, numerous e-mail messages on the austin-group-l
alias, and discussions held during Austin Group conference calls:

In XCU section 2.2 (Quoting) change:
        "The various quoting mechanisms are the escape character,
        single-quotes, and double-quotes."
on P2298, L72359 to:
        "The various quoting mechanisms are the escape character,
        single-quotes, double-quotes, and dollar-single-quotes."

Add the following new paragraph to XCU section 2.2.3 (Double-Quotes)
after P2298, L72385 (indented to the same level as the paragraph
on L72382-72385):
        "The <dollar-sign> shall not retain its special meaning
        introducing the dollar-single-quotes form of quoting (see
        [xref to 2.2.4])."

In XCU 2.2.3 (Double-Quotes) change:
        "A single-quoted or double-quoted string that begins, but
        does not end, within the "`...`" sequence"
on P2299, L72392-72393 to:
        "A quoted (single-quoted, double-quoted, or dollar-single-quoted)
        string that begins, but does not end, within the "`...`"
        sequence"

At the end of XCU section 2.2 after P2299, L72401 add a new subsection,
with this text:
   2.2.4 Dollar-Single-Quotes
        A sequence of characters starting with a <dollar-sign>
        immediately followed by a single-quote ($') shall preserve
        the literal value of all characters up to an unescaped
        terminating single-quote ('), with the exception of certain
        backslash escape sequences, as follows:
                \" yields a <quotation-mark> (double-quote) character.
                \' yields an <apostrophe> (single-quote) character.
                \\ yields a <backslash> character.
                \a yields an <alert> character.
                \b yields a <backspace> character.
                \e yields an <ESC> character.
                \f yields a <form-feed> character.
                \n yields a <newline> character.
                \r yields a <carriage-return> character.
                \t yields a <tab> character.
                \v yields a <vertical-tab> character.
                \cX yields the control character listed in the Value
                       column of Table 4.21 on page 3203 (xref to
                       XCU Table 4.21) in the Operands section of
                       the stty utility when X is one of the
                       characters listed in the ^c column of the
                       same table, except that \c\\ yields the
                       <FS> control character since the <backslash>
                       character must be escaped.
               \uXXXX yields the character specified by ISO/IEC 10646
                       where XXXX is two to four hexadecimal digits (with leading
                       zeros supplied for missing digits) whose four-digit short
                       universal character name is XXXX (and whose eight-digit
                       short universal character name is 0000XXXX).
               \UXXXXXXXX yields the character specified by ISO/IEC 10646 where
                       XXXXXXXX is two to eight hexadecimal digits (with
                       leading zeros supplied for missing digits) whose
                       eight-digit short universal character name is XXXXXXXX.
                \xXX yields the byte whose value is the hexadecimal
                       value XX (one or two hex digits)
                \XXX yields the byte whose value is the octal value XXX
                       (one to three octal digits)

        If a \xXX or \XXX escape sequence yields a byte whose value is 0,
        it is unspecified whether that nul byte is included in the result
        or if that byte and any following regular characters and escape
        sequences up to the terminating unescaped single-quote are evaluated
        and discarded.

        If a universal character name specified by \uXXXX or  \UXXXXXXXX is
        less than 0020 (<space>), the results are undefined.

        If more than two hexadecimal digits immediately follow \x or
        if the octal value specified by \XXX will not fit in a byte, the
        results are unspecified.

        If a backslash is followed by any sequence not listed above
        or if \e, \cX, \uXXXX, or \UXXXXXXXX yields a character
        that is not present in the current locale, the result is
        implementation-defined.

        If a backslash escape sequence represents a single-quote
        character (for example \u0027, \U00000027, or \'), that
        sequence shall not terminate the dollar-single-quote
        sequence.

In XCU section 2.3 (Token Recognition) point 4, change:
        "If the current character is <backslash>, single-quote, or
        double-quote and it is not quoted, it shall affect quoting
        for subsequent characters up to the end of the quoted text.
        The rules for quoting are as described in Section 2.2 (on
        page 2298). During token recognition no substitutions shall
        be actually performed, and the result token shall contain
        exactly the characters that appear in the input (except for
        <newline> joining), unmodified, including any embedded or
        enclosing quotes or substitution operators, between the
        <quotation-mark> and the end of the quoted text. The token
        shall not be delimited by the end of the quoted field."
on P2299-2300, L72425-72432 to:
        "If the current character is an unquoted <backslash>,
        single-quote, or double-quote or is the first character of
        an unquoted <dollar-sign> single-quote sequence, it shall
        affect quoting for subsequent characters up to the end of
        the quoted text.  The rules for quoting are as described
        in Section 2.2 (on page 2298). During token recognition no
        substitutions shall be actually performed, and the result
        token shall contain exactly the characters that appear in
        the input (except for <newline> joining), unmodified,
        including any embedded or enclosing quotes or substitution
        operators, between the start and the end of the quoted text.
        The token shall not be delimited by the end of the quoted
        field."

In XCU section 2.6, add:
        * a <single-quote>
to the end of the list created by the changes identified in Mantis
BugID 49.  (BugID 49 modifies text on P2305, L72669-72671 in XCU7.)

In XCU 2.6.3 (Command Substitution) change:
        "A single-quoted or double-quoted string that begins, but
        does not end, within the "`...`" sequence produces undefined
        results."
on P2309, L72827-72829 to:
        "A quoted string that begins, but does not end, within the
        "`...`" sequence produces undefined results."

In XCU section 2.6.7 (Quote Removal), change:
        "The quote characters (<backslash>, single-quote, and
        double-quote) that were present in the original word shall
        be removed unless they have themselves been quoted."
on P2311, L72904-72905 to:
        "The quote character sequence <dollar-sign> single-quote
        and the characters (<backslash>, single-quote, and
        double-quote) that were present in the original word shall
        be removed unless they have themselves been quoted."

In the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1, change:
        "Because at this point <quotation-mark> characters are
        retained in the token, quoted strings cannot be recognized
        as reserved words."
on P2325, L73436-73437 to:
        "Because at this point quoting characters (<backslash>,
        single-quote, <quotation-mark>, and the <dollar-sign>
        single-quote sequence) are retained in the token, quoted
        strings cannot be recognized as reserved words."

Add a new section in XRAT following P3649, L124068:
        <italic>C.2.2.4   Dollar-Single-Quotes</italic>
                The $'...' quoting construct has been implemented
                in several recent shells.  It is similar to
                character string literals ("...") in the C Standard
                with the following exceptions:
                1.  The \x escape sequence in C can be followed by an
                     arbitrary number of hexadecimal digits.  The
                     ksh93 implementation of $'...' also consumes
                     an arbitrary number of hexadecimal digits; bash
                     consumes at most two hexadecimal digits in this
                     case.  This standard leaves the result unspecified
                     if more than two hexadecimal digits follow \x.
                     (Note that a hexadecimal escape followed by a
                     literal hexadecimal character can always be
                     represented as $'\xXX'X)

                2.  The \c escape sequence is not included in the C
                    standard.  There was also some disagreement in
                    shells that historically supported \c escape
                    sequences in $'...'.  These include:
                    A.  whether \cA through \cZ produced the byte values
                         1 through 26, respectively or supported the
                         codeset independent control character as
                         specified by the stty utility.  This standard
                         requires codeset independence.
                    B.  whether \c[, \c\\, \c], \c^, \c_, and \c? could
                         be used to yield the <ESC>, <FS>, <GS>,
                         <RS>, <US>, and <DEL> control characters,
                         respectively.  This standard requires support
                         for all of the control characters except
                         NULL (matching what is done in the stty
                         utility).
                    C.  whether \c\\ or \c\ was used to represent <FS>.
                         This standard requires \c\\ to make backslash
                         escape processing consistent.

                    The implementors of the most common shells that
                    implement $'\cX' agreed to convert to the
                    behavior specified in this standard.

                    Some shells also allow \c<arbitrary_control_character>
                    to act as an inverse function to \cX (i.e., \cm
                    and \cM yield <CR> and \c<CR> yields m or M).
                    This standard leaves this behavior
                    implementation-defined.

                3.  The \e escape sequence is not included in the C
                    Standard, but was provided by all historical
                    shells that supported $'...'.  Some also supported
                    \E as a synonym.  One member of the group
                    objected to adding \e because the <ESC> control
                    character is not required to be in the portable
                    character set.  The \e sequence is included
                    because many historical users of $'...' expect
                    it to be there.  The \E sequence is not included
                    in this standard because <backslash> escape
                    sequences that start with <backslash> followed
                    by an uppercase letter (except \U) are reserved
                    by the C Standard for implementation use.

                4.  The \XXX octal escape sequence and the \xXX
                    hexadecimal escape sequence can be used to
                    insert a null byte into a C Standard character
                    string literal and into a $'...' quoted word
                    in this standard.
                    In C, any characters specified after that null byte
                    (including escape sequences) continue to be processed
                    and added to the expansion of the character string literal.
                    In $'...' in the shell this standard allows the
                    equivalent behavior but also allows the null byte and
                    all remaining characters up to the terminating unescaped
                    single-quote to be evaluated and discarded.
                    The latter (which was historic practice in bash, but not
                    in ksh93) allows an escape sequence producing a null
                    byte to terminate the dollar-single-quoted
                    expansion, but not terminate the token in which
                    it appears if there are characters remaining
                    in the token.  For example:
                        printf a$'b\0c\''d
                    is required by this standard to produce:
                        abd
                    while historic versions of ksh93 produced:
                        ab
                5.  The \uXXXX and \UXXXXXXXX escape sequences must have
                    exactly 4 and 8 hexadecimal digits, respectively, in C.
                    In $'...' in the shell, fewer digits are accepted.
                    If the character following either form is itself a
                    hexadecimal digit, the application shall ensure that
                    the full 4 or 8 character width (using leading zeros
                    as necessary) is used to disambiguate the interpretation.
                6.  The double-quote (") character can be used
                    literally, while the single-quote (') character must
                    be represented as an escape sequence.  In C,
                    single-quote can be used literally, while
                    double-quote requires an escape sequence.
                In C, if a character specified by \uXXXX or \UXXXXXXXX
                in "..." does not exist in the execution character
                set, the results are implementation-defined.
                Similarly, this standard makes the results
                implementation-defined if \e, \cX, \uXXXX, or
                \UXXXXXXXX  specifies a character that is not present
                in the current locale.

                Note that the escape sequences recognized by $'...',
                file format notation (see Table 5-1 on page 121),
                XSI-conforming implementations of the echo utility
                (see the echo utility's operands section on page
                2615), and the printf utility's format operand (see
                the printf utility's extended description section
                on page 3050) are not the same.  Some escape sequences
                are not recognized by all of the above; the \c
                escape sequence in echo is not at all like the \c
                escape sequence in $'...'; and octal escape sequences
                in some of the above accept one to four octal digits
                and require a leading zero while others accept one
                to three octal digits and do not require a leading
                zero.
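
A short sketch of how some of the escapes in the proposed 2.2.4 text can be inspected byte by byte. It assumes bash and an ASCII-based codeset; hex_of is a hypothetical helper introduced here for illustration and is not part of the proposal:

```shell
# Assumes bash and an ASCII-based codeset.
# hex_of is a hypothetical helper: print the bytes of its argument as hex.
hex_of() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

ctrl=$(hex_of $'\t\cA\e')           # <tab>, control-A, <ESC>
quotes=$'say "hi" and don\'t'       # " is literal here, ' must be escaped

printf 'control bytes: %s\n' "$ctrl"
printf '%s\n' "$quotes"
```

The second assignment mirrors rationale item 6: inside $'...' the double-quote needs no escape, while the single-quote does.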


(0000599)
nick (manager)
2010-11-04 16:06

The issue of adding \cX and \e was discussed at the WG14 meeting. A paper to add these character constants was requested and submitted (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm). However, since this paper was in response to a late liaison statement, objection to considering it was raised.

The committee agreed that POSIX was free to use these characters, and encouraged resubmission of N1534 as a liaison ballot comment, probably reworded to be an update to the future directions section. WG14 has indicated that POSIX use of the "Future Standardization" is satisfactory.
(0000601)
Don Cragun (manager)
2010-11-05 03:00
edited on: 2010-11-05 03:04

Note: 0000590 was updated today to account for another difference in the way bash and
ksh93 processed $'...'. If an escape sequence expanded to a null byte, bash terminated
the $'...' expansion, but still concatenated any following characters that appeared after
the $'...' expansion that were in the same lexical token. On the other hand, ksh93
terminated the entire token in this case. The ksh93 developers agreed to change to
match bash in this case if the standard requires this behavior.

Also note that if the C1x standard adopts the changes suggested in the document
referenced in Note: 0000599, the
"\e, "
in the last paragraph of the
newly created 2.2.4 section can be removed (since the <ESC> character would be
moved into the basic character set in C1X). If this happens we should also consider
adding \e to XBD Table 5-1 and checking the places in the standard that refer to
that table to determine whether \e processing should also be required in those places.
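
For anyone wanting to check which of the null-byte behaviours described at the top of this note a given shell follows, a minimal sketch is shown below. It assumes a shell that implements $'...' (bash is the assumed target) and uses od only to show the resulting bytes; no particular result is asserted here because the behaviour differs between shells:

```shell
# Assumes bash or another shell implementing $'...'.
# The token mixes a plain "a", a $'...' part containing \0, and a trailing "d".
out=$(printf '%s' a$'b\0c\''d | od -An -tx1 | tr -d ' \n')
printf 'bytes: %s\n' "$out"
```

Under the bash behaviour described above, the bytes for a, b and d would be expected; under the historic ksh93 behaviour, only those for a and b.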

(0000609)
nick (manager)
2010-11-05 14:52

This issue was presented to WG14 (C). The initial reaction was to request a paper to add \cX and \e as character constants (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm, attached). This paper was subsequently procedurally rejected as it was late and made fundamental differences, which might also impact C++.

Instead, the paper was revised to propose adding these characters to the Future Directions section, and a liaison comment on the CD ballot to ask for this would be appropriate.

The WG14 committee agreed that there was no problem in POSIX adding these characters to the shell, and did not object to the use of the namespace.
(0002793)
nick (manager)
2015-08-20 15:31
edited on: 2015-08-20 15:33

Note: 0000590 was updated today to allow for greater flexibility in the number of digits in \uXXXX and \UXXXXXXXX sequences, along with other minor tweaks to reflect existing practice:

The detailed changes recorded in the minutes are:
Proposed changes to bugnote #590:

Change:
If a \xXX or \XXX escape sequence yields a byte whose value is 0, that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote shall be evaluated and discarded.
to:
If a \xXX or \XXX escape sequence yields a byte whose value is 0, it is unspecified whether that nul byte is included in the result or if that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote are evaluated and discarded.

Change:
 \uXXXX yields the character specified by ISO/IEC 10646
whose four-digit short universal character name is XXXX (and
whose eight-digit short universal character name is 0000XXXX).
             \UXXXXXXXX yields the character specified by ISO/IEC
10646 whose eight-digit short universal character name is XXXXXXXX.
to:
 \uXXXX yields the character specified by ISO/IEC 10646 where XXXX is two
to four hexadecimal digits (with leading zeros supplied for missing digits)
whose four-digit short universal character name is XXXX (and
whose eight-digit short universal character name is 0000XXXX).
             \UXXXXXXXX yields the character specified by ISO/IEC 10646 where
XXXXXXXX is two to eight hexadecimal digits (with leading zeros
supplied for missing digits) whose eight-digit short universal character
name is XXXXXXXX.

Change:
If a universal character name specified by \uXXXX or \UXXXXXXXX is less than 00A0 other than 0024 ($), 0040 (@), or 0060('), the results are undefined.
to:
If a universal character name specified by \uXXXX or \UXXXXXXXX is less than 0020 (<space>), the results are undefined.

Change:
In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell, however, this standard requires the null byte and all remaining characters up to the terminating unescaped single-quote be evaluated and discarded. (This was historic practice in bash, but not in ksh93.) This allows ...
to:
In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell this standard allows the equivalent behavior but also allows the null byte and all remaining characters up to the terminating unescaped single-quote to be evaluated and discarded. The latter (which was historic practice in bash, but not in ksh93) allows ...

Change:
5. The double-quote (") character can be used ...
to:
5. The \uXXXX and \UXXXXXXXX escape sequences must have exactly 4 and 8 hexadecimal digits, respectively, in C. In $'...' in the shell, fewer digits are accepted. If the character following either form is itself a hexadecimal digit, the application shall ensure that the full 4 or 8 character width (using leading zeros as necessary) is used to disambiguate the interpretation.

6. The double-quote (") character can be used ...

(0002809)
rhansen (manager)
2015-09-03 16:40
edited on: 2015-09-10 20:27

The following is a work in progress, discussed during the 2015-09-03 telecon. The changes from Note: 0000590 are:
  • allow the \u and \U escape sequences to have 1 hexadecimal digit
  • refer to ISO/IEC 6429 for characters \u0 through \u1f (and \U0 through \U1F)
  • change "yields the character specified by" to "yields the character named by"
  • delete the paragraph that says that the results of characters less than 0020 (<space>) are unspecified
  • change "or if ... yields a character that is not present in the current locale" to "or if ... specifies a character that does not have an encoding in the locale in effect during token recognition"
  • add a new paragraph to the end of the new rationale section
At the end of the telecon, we were still discussing whether the behavior of characters not in the current locale should be more specified than just implementation-defined.

Note: The bug report is against the original version of Issue 7, but the following changes are against Issue 7 with TC1 integrated (page and line numbers refer to Issue 7 with TC1).

At page 2320 line 73594 (XCU section 2.2, Quoting) change:
The various quoting mechanisms are the escape character, single-quotes, and double-quotes.
to:
The various quoting mechanisms are the escape character, single-quotes, double-quotes, and dollar-single-quotes.

After page 2320 line 73620 (XCU section 2.2.3, Double-Quotes) insert a new paragraph indented to the same level as the paragraph at lines 73617-73620:
The <dollar-sign> shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4]).

At page 2321 lines 73626-73627 (XCU 2.2.3, Double-Quotes), change:
  • A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence
to:
  • A quoted (single-quoted, double-quoted, or dollar-single-quoted) string that begins, but does not end, within the "`...`" sequence

After page 2321 line 73635 (end of XCU section 2.2), insert a new subsection:
2.2.4 Dollar-Single-Quotes

A sequence of characters starting with a <dollar-sign> immediately followed by a single-quote ($') shall preserve the literal value of all characters up to an unescaped terminating single-quote ('), with the exception of certain backslash escape sequences, as follows:
  • <tt>\"</tt> yields a <quotation-mark> (double-quote) character.
  • <tt>\'</tt> yields an <apostrophe> (single-quote) character.
  • <tt>\\</tt> yields a <backslash> character.
  • <tt>\a</tt> yields an <alert> character.
  • <tt>\b</tt> yields a <backspace> character.
  • <tt>\e</tt> yields an <ESC> character.
  • <tt>\f</tt> yields a <form-feed> character.
  • <tt>\n</tt> yields a <newline> character.
  • <tt>\r</tt> yields a <carriage-return> character.
  • <tt>\t</tt> yields a <tab> character.
  • <tt>\v</tt> yields a <vertical-tab> character.
  • <tt>\cX</tt> yields the control character listed in the Value column of [xref to XCU Table 4.21] in the Operands section of the stty utility when X is one of the characters listed in the ^c column of the same table, except that \c\\ yields the <FS> control character since the <backslash> character must be escaped.
  • <tt>\uXXXX</tt> yields the character named by ISO/IEC 10646 or, for \u0 to \u1f, by ISO/IEC 6429 where XXXX is one to four hexadecimal digits (with leading zeros supplied for missing digits) whose four-digit short universal character name is XXXX (and whose eight-digit short universal character name is 0000XXXX).
  • <tt>\UXXXXXXXX</tt> yields the character named by ISO/IEC 10646 or, for \U0 to \U1f, by ISO/IEC 6429 where XXXXXXXX is one to eight hexadecimal digits (with leading zeros supplied for missing digits) whose eight-digit short universal character name is XXXXXXXX.
  • <tt>\xXX</tt> yields the byte whose value is the hexadecimal value XX (one or two hex digits).
  • <tt>\XXX</tt> yields the byte whose value is the octal value XXX (one to three octal digits).

If a \xXX or \XXX escape sequence yields a byte whose value is 0, it is unspecified whether that nul byte is included in the result or if that byte and any following regular characters and escape sequences up to the terminating unescaped single-quote are evaluated and discarded.

If more than two hexadecimal digits immediately follow \x or if the octal value specified by \XXX will not fit in a byte, the results are unspecified.

If a backslash is followed by any sequence not listed above or if \e, \cX, \uXXXX, or \UXXXXXXXX specifies a character that does not have an encoding in the locale in effect during token recognition, the result is implementation-defined.

If a backslash escape sequence represents a single-quote character (for example \u0027, \U00000027, or \'), that sequence shall not terminate the dollar-single-quote sequence.
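As a rough illustration of the rules proposed above, the following Python sketch models how a parser might find the end of a $'...' sequence and decode the single-character, \xXX and \XXX escapes. It is not taken from any shell. The name parse_dollar_single_quote is hypothetical, ordinary characters are assumed to be encoded as UTF-8, the \cX, \uXXXX and \UXXXXXXXX forms are deliberately left out, and a null byte is simply kept (one of the two behaviors the text permits).

```python
import re

# Single-character escapes from the proposed list (\cX, \u and \U are left out).
SIMPLE = {'"': 0x22, "'": 0x27, '\\': 0x5C, 'a': 0x07, 'b': 0x08, 'e': 0x1B,
          'f': 0x0C, 'n': 0x0A, 'r': 0x0D, 't': 0x09, 'v': 0x0B}

# \x takes one or two hex digits here; anything beyond two is unspecified in
# the proposal, and this sketch treats further digits as literal characters.
ESCAPE = re.compile(r'\\(?:(?P<simple>["\'\\abefnrtv])'
                    r'|x(?P<hex>[0-9A-Fa-f]{1,2})'
                    r'|(?P<oct>[0-7]{1,3}))')


def parse_dollar_single_quote(src, start=0):
    """Parse $'...' at src[start]; return (bytes, index after the closing quote)."""
    if not src.startswith("$'", start):
        raise ValueError("not a dollar-single-quote sequence")
    out = bytearray()
    pos = start + 2
    while pos < len(src):
        ch = src[pos]
        if ch == "'":                  # unescaped quote terminates the sequence
            return bytes(out), pos + 1
        if ch != '\\':                 # ordinary character keeps its literal value
            out += ch.encode('utf-8')  # assumption: UTF-8 locale
            pos += 1
            continue
        m = ESCAPE.match(src, pos)
        if m is None:                  # \cX, \u, \U or an implementation-defined escape
            raise ValueError("escape at offset %d is not modelled" % pos)
        if m.group('simple') is not None:
            out.append(SIMPLE[m.group('simple')])
        elif m.group('hex') is not None:
            out.append(int(m.group('hex'), 16))
        else:
            value = int(m.group('oct'), 8)
            if value > 0xFF:           # does not fit in a byte: unspecified
                raise ValueError("octal value does not fit in a byte")
            out.append(value)
        pos = m.end()
    raise ValueError("unterminated dollar-single-quote sequence")
```

In this model, parse_dollar_single_quote(r"$'b\'c'd") returns (b"b'c", 7): the escaped quote at offset 4 does not end the sequence, while the unescaped one at offset 6 does.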

At page 2321 lines 73658-73664 (XCU section 2.3 (Token Recognition) point 4), change:
4. If the current character is <backslash>, single-quote, or double-quote and it is not quoted, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in Section 2.2 (on page 2298). During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the <quotationmark> and the end of the quoted text. The token shall not be delimited by the end of the quoted field.
to:
4. If the current character is an unquoted <backslash>, single-quote, or double-quote or is the first character of an unquoted <dollar-sign> single-quote sequence, it shall affect quoting for subsequent characters up to the end of the quoted text. The rules for quoting are as described in [xref to Section 2.2]. During token recognition no substitutions shall be actually performed, and the result token shall contain exactly the characters that appear in the input (except for <newline> joining), unmodified, including any embedded or enclosing quotes or substitution operators, between the start and the end of the quoted text. The token shall not be delimited by the end of the quoted field.

After page 2327 line 73900 (XCU section 2.6, Word Expansions), insert a new bullet point:
  • a <single-quote>

At page 2331 lines 74071-74073 (XCU 2.6.3, Command Substitution), change:
A single-quoted or double-quoted string that begins, but does not end, within the "`...`" sequence produces undefined results.
to:
A quoted string that begins, but does not end, within the "`...`" sequence produces undefined results.

At page 2333 lines 74157-74158 (XCU section 2.6.7, Quote Removal), change:
The quote characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted.
to:
The quote character sequence <dollar-sign> single-quote and the characters (<backslash>, single-quote, and double-quote) that were present in the original word shall be removed unless they have themselves been quoted.

At page 2348 lines 74718-74719 (the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1), change:
Because at this point <quotation-mark> characters are retained in the token, quoted strings cannot be recognized as reserved words.
to:
Because at this point quoting characters (<backslash>, single-quote, <quotation-mark>, and the <dollar-sign> single-quote sequence) are retained in the token, quoted strings cannot be recognized as reserved words.

After page 3677 line 125685 (XRAT C.2.2), insert a new subsection:
C.2.2.4 Dollar-Single-Quotes

The $'...' quoting construct has been implemented in several recent shells. It is similar to character string literals ("...") in the C Standard with the following exceptions:
  1. The \x escape sequence in C can be followed by an arbitrary number of hexadecimal digits. The ksh93 implementation of $'...' also consumes an arbitrary number of hexadecimal digits; bash consumes at most two hexadecimal digits in this case. This standard leaves the result unspecified if more than two hexadecimal digits follow \x. (Note that a hexadecimal escape followed by a literal hexadecimal character can always be represented as $'\xXX'X.)
  2. The \c escape sequence is not included in the C standard. There was also some disagreement in shells that historically supported \c escape sequences in $'...'. These include:
    1. whether \cA through \cZ produced the byte values 1 through 26, respectively or supported the codeset independent control character as specified by the stty utility. This standard requires codeset independence.
    2. whether \c[, \c\\, \c], \c^, \c_, and \c? could be used to yield the <ESC>, <FS>, <GS>, <RS>, <US>, and <DEL> control characters, respectively. This standard requires support for all of the control characters except NULL (matching what is done in the stty utility).
    3. whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent.

    The implementors of the most common shells that implement $'\cX' agreed to convert to the behavior specified in this standard.

    Some shells also allow \c<arbitrary_control_character> to act as an inverse function to \cX (i.e., \cm and \cM yield <CR> and \c<CR> yields m or M). This standard leaves this behavior implementation-defined.
  3. The \e escape sequence is not included in the C Standard, but was provided by all historical shells that supported $'...'. Some also supported \E as a synonym. One member of the group objected to adding \e because the <ESC> control character is not required to be in the portable character set. The \e sequence is included because many historical users of $'...' expect it to be there. The \E sequence is not included in this standard because <backslash> escape sequences that start with <backslash> followed by an uppercase letter (except \U) are reserved by the C Standard for implementation use.
  4. The \XXX octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard. In C, any characters specified after that null byte (including escape sequences) continue to be processed and added to the expansion of the character string literal. In $'...' in the shell this standard allows the equivalent behavior but also allows the null byte and all remaining characters up to the terminating unescaped single-quote to be evaluated and discarded. The latter (which was historic practice in bash, but not in ksh93) allows an escape sequence producing a null byte to terminate the dollar-single-quoted expansion, but not terminate the token in which it appears if there are characters remaining in the token. For example:
        printf a$'b\0c\''d
    
    is required by this standard to produce:
        abd
    
    while historic versions of ksh93 produced:
        ab
    

  5. The \uXXXX and \UXXXXXXXX escape sequences must have exactly 4 and 8 hexadecimal digits, respectively, in C. In $'...' in the shell, fewer digits are accepted. If the character following either form is itself a hexadecimal digit, the application shall ensure that the full 4 or 8 character width (using leading zeros as necessary) is used to disambiguate the interpretation.
  6. The double-quote (") character can be used literally, while the single-quote (') character must be represented as an escape sequence. In C, single-quote can be used literally, while double-quote requires an escape sequence.
In C, if a character specified by \uXXXX or \UXXXXXXXX in "..." does not exist in the execution character set, the results are implementation-defined. Similarly, this standard makes the results implementation-defined if \e, \cX, \uXXXX, or \UXXXXXXXX specifies a character that is not present in the current locale.
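The null-byte outcomes contrasted in item 4 above can be modelled in a few lines. The sketch below is only an illustration in Python; join_token, its piece representation and the policy names 'end-quote' and 'end-token' are invented for this example and do not describe how any shell is implemented.

```python
def join_token(pieces, policy):
    """Concatenate the decoded pieces of one token.

    pieces: list of (is_dollar_single_quoted, decoded_bytes) tuples.
    policy: 'end-quote' drops the null byte and the rest of that quoted piece;
            'end-token' additionally drops everything after it in the token.
    """
    if policy not in ('end-quote', 'end-token'):
        raise ValueError('unknown policy: %r' % (policy,))
    out = bytearray()
    for is_dsq, data in pieces:
        # Only dollar-single-quoted pieces can contain a decoded null byte here.
        nul = data.find(b'\0') if is_dsq else -1
        if nul < 0:
            out += data
            continue
        out += data[:nul]          # keep what precedes the null byte
        if policy == 'end-token':
            break                  # nothing after the null byte survives
    return bytes(out)


# Pieces of the token  a$'b\0c\''d  from the example above, already decoded.
TOKEN = [(False, b'a'), (True, b"b\0c'"), (False, b'd')]
```

Here join_token(TOKEN, 'end-quote') gives b'abd', the output the text requires, and join_token(TOKEN, 'end-token') gives b'ab', the historic ksh93 output quoted above.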

Note that the escape sequences recognized by $'...', file format notation (see [xref to Table 5-1]), XSI-conforming implementations of the echo utility (see the echo utility's operands section on [xref to echo]), and the printf utility's format operand (see the printf utility's extended description on [xref to printf]) are not the same. Some escape sequences are not recognized by all of the above; the \c escape sequence in echo is not at all like the \c escape sequence in $'...'; and octal escape sequences in some of the above accept one to four octal digits and require a leading zero, while others accept one to three octal digits and do not require a leading zero.
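To make the \cX form of $'...' concrete (as opposed to the \c of echo, which suppresses the trailing newline), here is a small Python sketch of the mapping. It assumes an ASCII-based codeset, where the control character for X is the code of the uppercased X with bit 0x40 flipped; the proposed text instead names the control characters through the stty table so that the result is codeset independent. The function name control_char is hypothetical.

```python
import string

# Characters accepted after \c in this sketch: letters plus [ \ ] ^ _ and ?.
# NUL (which would be \c@) is intentionally excluded, as in the proposal.
ALLOWED = string.ascii_letters + '[\\]^_?'


def control_char(x):
    """Return the byte value of the control character selected by X.

    Assumption: an ASCII-based codeset, so the value is ord(X.upper()) ^ 0x40.
    """
    if len(x) != 1 or x not in ALLOWED:
        raise ValueError('no control character defined for %r' % (x,))
    return ord(x.upper()) ^ 0x40
```

For example, control_char('A') is 1 (<SOH>), control_char('m') is 13 (<CR>), control_char('[') is 0x1B (<ESC>) and control_char('?') is 0x7F (<DEL>).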

Some historical implementations of $'...' incorrectly translated \uXXXX sequences into UTF-8 encoded characters even when the current locale does not use UTF-8 as its codeset. When a Unicode character specified by \uXXXX names a character that is present in the codeset in the current locale, that character's encoding in the current locale's character set must be used, not the UTF-8 encoding of that character. If the named character is not in the current locale's character set, the result is implementation-defined.
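The difference between "the UTF-8 encoding" and "the encoding in the current locale" can be shown with a short Python sketch. It is a model only: expand_ucn is a hypothetical helper, the locale's codeset is passed in as a Python codec name, and a character that has no encoding in that codeset (implementation-defined in the proposed text) simply raises UnicodeEncodeError here.

```python
import re

# \u takes one to four hex digits, \U one to eight, as in the proposed text.
UCN = re.compile(r'\\(?:u([0-9A-Fa-f]{1,4})|U([0-9A-Fa-f]{1,8}))')


def expand_ucn(body, locale_codeset):
    """Replace \\u and \\U escapes in body and encode the result in locale_codeset."""
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))        # the character named by ISO/IEC 10646
    return UCN.sub(replace, body).encode(locale_codeset)
```

With this model, expand_ucn(r'\u20ac', 'utf-8') yields the three bytes E2 82 AC, while expand_ucn(r'\u20ac', 'iso8859-15') yields the single byte A4. The sketch also shows why the full width matters when a hexadecimal digit follows: r'\u411' names U+0411, whereas r'\u00411' is "A" followed by a literal "1".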


(0002824)
rhansen (manager)
2015-09-10 20:25
edited on: 2015-09-10 20:26

We made a bit more progress on this issue during today's teleconference. For the latest version of the proposed wording, see:

http://posix.rhansen.org:9001/p/bug249

(see the emailed minutes for the username and password)

Feel free to make changes, but please highlight your changes (e.g., use strikeout when deleting and color the changes red).

(0002893)
steffen (reporter)
2015-11-09 21:10

For the record i'm appending an item of mine from the mailing-list (austin-group-l:archive/latest/23158, 20150930122537.COnKa3bTi%sdaoden@yandex.com):

the ISO C11 draft i have states in 6.4.3, Universal
character names:

  The universal character name \Unnnnnnnn designates the character
  whose eight-digit short identifier (as specified by ISO/IEC 10646)
  is nnnnnnnn [.]

And even though ISO C forbids for unknown reasons \u65\u301, it
still supports ISO 10646 for the remains, which includes that
characters for which no precomposed character is available
a combination of multiple successive \u sequences must naturally
be usable. Simply because that is ISO 10646.
Furthermore ISO C _explicitly_ forbids many specific code ranges
for identifiers, and thus _implicitly_ allows them for other use
cases -- isn't that the usual way to treat it?

Therefore back to my initial message, i think the standard should
today say a word on whether software is allowed to perform the
normalization and show replacement characters for final grapheme
boundaries instead of individual bytes or "characters".

[...]

I suggested that software which is capable to understand that
\u65\u301 can actually be composed to \ue9 can do so if possible
in the locale or if replacement has to be performed. Looking at
the above i don't know whether that is the right approach for
a shell, but if we end up with something that is visual to the
user (even isatty(STDOUT_FILENO)) then i think it is a replacement
just as valid (maybe not for the shell).

Regarding \[Uu] in general: this escape carries an ISO 10646
character losslessly around whatever transport to the user, where
it is to be transformed to something in the users current locale.
\u65\u301 is \ue9 in ISO-8859-1 a.k.a. LATIN1. Period.

For other locales similar composition examples can be assumed, the
special thing about LATIN1 in this regard being simply that it has
been chosen for the first 256 entries of Unicode / ISO 10646,
likely because its first 128 entries are compatible to US-ASCII.

It is one of the typical ISO C annoyances that the ASCII range
plus C1 controls can't be used in \[Uu], real-life extensions
exist.

My opinion: whether software is capable to perform normalization
and composition of \u65\u301 or not, thus ending up with \ue9
a.k.a. "é", or otherwise "e?", "e\u0301" or any other replacement
is purely based upon its level of Unicode / ISO 10646
compatibility / compliance. It has nothing at all to do with
\u65\u301, which _is nothing but \ue9 in LATIN1_.

[...]

This was a followup of (austin-group-l:archive/latest/23150, 20150929092823.nia8y-PmH%sdaoden@yandex.com) in which i wrote:

it thus follows that those characters which have no precomposed code point
in Unicode must be representable, since they may have a mapping in
a given locale (even though i would assume that Unicode actually
contains precomposed code points for characters in most, if not
all well-known locales). Unicode (6.3) section 5.6,
Normalization:

  Compared to the number of possible combinations, only
  a relatively small number of pre-composed base character plus
  nonspacing marks have independent Unicode character values.

And more specific examples, e.g., section 8.2, Arabic:

  A basic Arabic letter plus any of these types of marks is never
  encoded as a separate, precomposed character, but must always be
  represented as a sequence of letter plus combining mark..

I'm pretty sure that a linguist or Unicode specialist would know
more such examples (if Arabic by itself would not be sufficient; [.]
(0002922)
shware_systems (reporter)
2015-11-13 19:19

Re:2893
From Chapter 2, p.12, of Unicode v7:
"Glyphs represent the shapes that characters can have when they are rendered or displayed. In contrast to characters, glyphs appear on the screen or paper as particular representations of one or more characters. A repertoire of glyphs makes up a font. Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard."

Such compositions, of combining sequence characters, are therefore left the province of final consumers with extended facilities for processing them, such as a font driver or Unicode-aware hardcopy output devices. Unicode provides code points that indicate which types of combining may be desired but no consumer of those characters is required to honor them. These types of devices are common, but still a subset of the devices supported by the General Terminal interface, directly or via emulation, so for POSIX those aspects of Unicode are optional. For compatibility with the POSIX locale, only the code points for precomposed Latn-script glyphs that use left to right advancing are required.

It is not the province of intermediary filtering utilities such as the shell to attempt such combining conversions unilaterally given the range of devices currently allowed. Historically, where preprocessing of a character sequence is required because a system operator knows a device does not have sufficient capabilities to represent a glyph expected of a sequence, and for that operator using a supported replacement character is acceptable, this has been accommodated by iconv() as a situational elective, not a locale based requirement on implementations for character element processing.

For backwards compatibility, code points for additional scripts may be compatible with the User Portability Option Group and be part of an extended repertoire. The remaining scripts, including the example Arabic, have to be part of a new Option Group that is a superset of the current User Portability option, as I see it, and will never be a candidate for the POSIX base because of the specialized device support required to get portably consistent results. It's certainly an XSI possibility for Issue 9, but the reference implementation of such a group for Issue 8 hasn't been done yet. The pieces exist, spread across various libraries like icu and freetype; a standardizable grouping of them that is suitably documented doesn't, that I'm aware of, that respects the POSIX extensions model as it exists.
(0002923)
steffen (reporter)
2015-11-13 21:22

@2922:
  Glyphs..are the responsibility of individual font

Yes - if you enter a decomposed Unicode grapheme that spans multiple Unicode code points, each of which is encoded in UTF-8, then it is up to the terminal / -driver to perform the composition to the glyph that is displayed.

  It is not the province of intermediary filtering utilities such as the shell to attempt such combining conversions

The problem is when the \u sequence is to be displayed in a locale that is not Unicode aware, that requires conversion of the \u sequence.
And ISO 10646 that POSIX refers to doesn't contain all characters in precomposed form.

  The remaining scripts, including the example Arabic..
  will never be a candidate for the POSIX base..
  specialized device support required..

Well it is a combination of base letters and combining marks.
I think iconv(3) has to be extended to be able to deal with that fact when converting from Unicode data to some other character set in which there is a precomposed character that corresponds to a given decomposed sequence.

Stéphane has given the example of his name on the mailing list several times: $'St\u65\u301phane'. This is representable in LATIN1, correct conversion provided. What else should be correct in LATIN1? "Ste?phane" can't be it, i'd say.
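The two outcomes being argued about here can be reproduced with Python's unicodedata module. This is only a sketch of the conversion step, not of any shell or iconv implementation; the helper name convert and the use of '?' as the replacement character are assumptions made for the example.

```python
import unicodedata

# The decomposed spelling from the example: "e" followed by U+0301
# (combining acute accent).
DECOMPOSED = 'St\u0065\u0301phane'


def convert(text, codeset, compose):
    """Encode text in codeset, optionally composing (NFC) first.

    Characters without an encoding in codeset are replaced by '?'.
    """
    if compose:
        text = unicodedata.normalize('NFC', text)
    return text.encode(codeset, errors='replace')
```

convert(DECOMPOSED, 'latin-1', True) gives b'St\xe9phane', since NFC turns the pair into U+00E9, which ISO 8859-1 can represent. Without the composition step, convert(DECOMPOSED, 'latin-1', False) gives b'Ste?phane'.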

But this issue is about $'' support in the shell, not iconv(3) nor any other improved text manipulation, including drop of strxfrm. I was suggesting that the forthcoming standard text says a word about character composition for \u sequences in $'' and Unicode grapheme boundaries in general. Because ISO 10646 doesn't offer precomposed entries for all possible characters --- not because code points are yet unassigned or due to private range extensions, but by design of ISO 10646, already today.

And therefore, if a utility that works in the LC_CTYPE locale offers the possibility to enter ISO 10646 / Unicode data -- which this issue is about to standardize for the shell via the $'' construct and the \Uu escape sequence --, it should be able to deal with it and try its best to ensure that the input ends up in the best possible LC_CTYPE form.
The standard should include wording that heads towards this goal.

(I don't think it is a show stopper that there is no string interface that deals with real language strings, because i think on the long run iconv(3) has to be capable to perform proper conversion, as above, and i guess that is therefore the natural choice for \uU since there is no wctomb_l(). And even if the existence of a Unicode locale would be a guess, and even if not then wchar_t is transparent and not necessarily in UTF-32. C11 brings c32rtomb(), but i fail to see how this helps regarding Unicode. It will, however and most likely, ensure that the iconv(3) available in the C library supports the Unicode character sets. Yay.)
(0002925)
steffen (reporter)
2015-11-13 21:24

erm, i said

  Yes - if you enter a decomposed Unicode grapheme that spans multiple Unicode code points, each of which is encoded in UTF-8, then it is up to the terminal / -driver to perform the composition to the glyph that is displayed.

I meant: if you are in a UTF-8 aware locale. Of course.
(0002929)
shware_systems (reporter)
2015-11-13 23:23

No, it does not require conversion. The entire U0300 block is illegal code points for a basic ISO-8859-1 based locale, and most other single_byte locales. An extended device that reports to the implementation that it does handle the U0300 block in a way that substitutes a precomposed Unicode glyph code point might be usable with an extended locale, as part of that new Option Group, which would also use 8859-1 as its single byte encoding. Then the substitution of \uA9 as a character element would have a direct map to \xA9 as a code point. Such processing requires the device support this capabilities reporting, which precludes both regular and character special file types being used with those strings. Adding the capabilities to these types requires changes to all existing implementations, afaik. Adding an Option Group doesn't.

Conversely, such a substitution should not be made if the device advertises it handles the U0300 block by the backspace/overprint method. What glyph is assigned to \uA9 may have an entirely different appearance than produced after overstrike. Yes, it would be silly for a font to do this, but it's not precluded either. If the shell substitutes \uA9 anyways, the results would be confusing visually, at least.

I agree where appropriate such mappings are desirable. I see appropriate as being in a separate shell work alike utility compiled to require the availability of the new group, not the sh utility directly.
(0002942)
steffen (reporter)
2015-11-14 14:26

No, it does not require conversion.

It should have the option to do so.

  The entire U0300 block is illegal code points for a basic ISO-8859-1 based locale

Performing a composition step may result in a valid one.

Looking back it roots in english centric problem solutions, surely also caused by the resource restrictions of earlier computers (even relative given the power of average desktop computers say in 1999, only sixteen years ago).
But other languages exist, and character encodings with state machines existed decades ago already to overcome the technologic shortcomings of the single-track=english environments, so that other people from different cultures (which still existed back then) had a chance to take part at modern technologies and keep up with the speed of the world trade etc. (Which, as a side note, i personally don't understand because of its negative side effects. But that's how sapiens sapiens acts. He doesn't like falling behind.)

And to what extent is Unicode with its grapheme boundaries different to those encodings with state machines? I would say it is because of the fact that this standard was designed with an integral, with a holistic point of view from the very beginning, in order to change that.

So more than two decades ago the ISO 10646 / Unicode effort started and that is where it ended up: not every "user-receivable character" has its own codepoint, but multiple such may combine with exactly defined rules to form such a "grapheme".
If that grapheme happens to have a representation in the charset of the currently active locale then a sensitive application will want to use that very representation, instead of using replacement characters of whatever form.

You cannot do anything sensitive except iconv(3) anyway, since the POSIX standard doesn't offer any other interface. And ISO C11 comes with a function but which surely is backed by the very same mechanism that iconv(3) uses internally.
But wording should ensure that multiple adjacent (at least \uU) escape sequences may together form a complete ISO 10646 character sequence. That is ISO 10646.

Those passed in conjunction to a truly Unicode-aware iconv(3) should return a normalized composed variant. I would think too many actual uses of iconv(3) in the wild prevent

  If the input buffer ends with an incomplete character or shift sequence, conversion shall stop after the previous successfully converted bytes.

to be extended to handle the case that a valid character at EOS is seen but which _may_ combine if further input is fed in.
So this is what you refer to?

Do you want me to open an issue to add niconv{,_open,_flush,_close}() that requires a _flush(), or without _flush() but with a niconv() requirement of a final call with inbuflen=0 to signal "EOF", before the output is truly complete?
For the future even POSIX / ISO C software needs to be capable to deal with base characters plus combining marks. An extended niconv may be an easy way to upgrade a lot of otherwise charset agnostic software in the most minimal intrusive way possible.
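Why such an interface would need a flush step (or a final empty call) can be seen in a small Python model. Everything here is hypothetical: ComposingConverter stands in for the suggested niconv(), and "a character with combining class 0 starts a new unit" is a simplification of real Unicode grapheme boundary rules.

```python
import unicodedata


class ComposingConverter:
    """Toy model of a composing, streaming charset converter."""

    def __init__(self, codeset):
        self.codeset = codeset
        self.pending = ''      # last base character plus any marks seen so far

    def feed(self, text):
        """Convert what can be safely emitted; hold back the trailing unit."""
        out = bytearray()
        for ch in text:
            # A non-combining character closes the unit that was pending.
            if self.pending and unicodedata.combining(ch) == 0:
                out += self._emit()
            self.pending += ch
        return bytes(out)

    def flush(self):
        """Signal end of input: emit whatever is still held back."""
        return self._emit()

    def _emit(self):
        composed = unicodedata.normalize('NFC', self.pending)
        self.pending = ''
        return composed.encode(self.codeset, errors='replace')
```

Feeding 'Ste' and then '\u0301phane' to a latin-1 converter yields b'St' and b'\xe9phan'; the final b'e' only appears after flush(), because until then a combining mark could still follow it.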
(0003247)
steffen (reporter)
2016-06-06 21:07

Re @2809 (the proposed text)

 |The \XXX octal escape sequence and the \xXX hexadecimal escape sequence can be used to insert a null byte into a C Standard character string literal and into a $'...' quoted word in this standard.

I think this should also be noted for \u and \U, since POSIX doesn't clamp the allowed range of characters and Unicode / ISO 10646 and therefore \u can yield any possible control character that \x and \0 can yield.


P.S.: currently implementing such a thing it.. doesn't amuse me.. that variable expansion hasn't been included in $''. We now have four (4!) different escape mechanisms with four (4!) different sets of rules, and not a single mechanism offers all the possibilities.
I would assume this is a nightmare for any normal user, and anyway it ends up nearly as cryptic as C format strings, though maybe not just as dangerous when misused.
I will implement a \$ escape that can be used to perform variable expansion.
Like that a user can well end up using single-quotes for literal input and dollar-single-quotes for anything else.
(0003248)
shware_systems (reporter)
2016-06-07 19:50
edited on: 2016-06-07 19:53

The \u and \U escapes for 1..1f may be mapped in the current locale to values other than 1 to 31, depending on what code point, single or multiple byte, the charmap assigns the referenced ISO-6429 functionality, if it supports the function to begin with. While few charmaps are expected to do any mapping of this nature, this possibility is why they weren't included. Such mapping does not occur with the \cX, \XXX or \xXX forms.

Variable and other expansions were not included because sequences like $'text'$var$'end' do the same thing due to string concatenation while tokenizing. Treating $'..' sequences as a form of expansion inside double quotes was also discussed, but consensus on how to make the behavior portable couldn't be reached. At most possibly a character or two of source input might be saved with an inordinate increase in complexity of implementing it in existing $'' code, in my opinion.

Folding the new escapes into double or single quotes was also discussed, so a new quoting form wasn't required, but it was felt this would break too many existing scripts that expect backslashes to have their current literal or escaping usage.

(0003249)
steffen (reporter)
2016-06-07 20:25

Re @3248: what has that to do with \u0 or \U0?
As you know well Unicode[0..256] = ISO 8859-1[0..127] = US-ASCII.
I was thinking it is nothing but an oversight that \000 = \x00 = \u0, yet wording is missing.

 | Variable and other expansions were not included because sequences like $'text'$var$'end'

Oh yes. Nice! Terribly cryptic.
Since $'' adds expansion of backslash escape sequences i have added a \$VAR (and \${VAR}) extension. Really, this is C and sh(1), not brainfuck or something. Sorry.

Btw. i've defined the implementation-defined \c@ a.k.a. NUL to mean the same as printf(1)s \c, since we have two (or, with \u0, three) incarnations of possibilities to end processing of the current quote, but nothing which extends beyond that, and i wonder how much time resources i should spend for implementing this. Granted: printf(1) i don't need.

 | Folding the new escapes into double or single quotes was also discussed

Ok, it is understood that this wasn't possible to not break backward compatibility.
Since at least bash(1) already added $"" with a meaning that i have not looked deeper into we've reached the end of the road. Yay.
(0003250)
steffen (reporter)
2016-06-07 20:27

Re @3249:
 | As you know well Unicode[0..256] = ISO 8859-1[0..127] = US-ASCII.

0..255.
(0003251)
shware_systems (reporter)
2016-06-08 06:51

Unicode[0..31] is Unassigned, with the C0 group of ISO-6429 being the recommended, not mandatory, usage assignments for interchange. This applies to UTF8 also. Everyone on the calls agreed this was the case. POSIX via C requires the \0 code point have a constant usage that differs markedly from ISO-6429, so it is special to begin with and why I didn't include it in the range. Sorry, I could have made that clearer. The other differences in definitions are relatively inconsequential. With Unicode having it as unassigned it was felt the standard should leave open a future version of Unicode might assign it a third interpretation as required usage, which may or may not map to a code point of existing charmaps. It is not exempt from ever being mapped; that is simply the expected ongoing practice. So the full range is \U0..31.

ISO-8859-1 just defines the G1 group [160..255] as a 96-char encoding; it defers to, and by convention most interchange uses, ISO-646-IRV as the default G0 group but any registered 94-char encoding used with it is technically a conforming variant. Variant announcement via ISO-2022 ESC sequences is expected if the default isn't used, and for alternate C0 and C1 groups than ISO-6429. POSIX currently assumes the announcement(s) is encoded in the locale label somehow and an application has some unspecified out of band means for getting the relevant label when context changes to use with setlocale(). The minimal C0 group usable as an alternate, per ISO-6429 deferring to ISO-2022, only requires 3 byte values have specific assignments, however, and \0 isn't one of them. So an 8859 variant may require mapping also to convert it to the C locale.

The $"..." format was looked at, and tabled as having too many platform-specific variations being possible to warrant inclusion at this time, related to infrastructure aspects the standard has as implementation defined or unspecified. I won't belabor which aspects; the expectation is a future bug report will address them.
(0003252)
steffen (reporter)
2016-06-08 10:43

Re @3251: I don't know what you are talking about. It is plainly untrue.
You seem to have been confused by your reading of ISO/IEC 10646:2012, section 11.
(0003254)
shware_systems (reporter)
2016-06-08 19:55

No, I'm not confused. Look at the UCD XML directly; neither 00..1F nor 7F..9F has "na=" long names assigned to the code points. The ISO-6429 names are "na2=" commentary, not normative. For Unicode 8 the relevant text is Chapter 23.1, not 11. The text there surprised some, but all agreed that it did reflect the UCD.

The code points are reserved for application usage, and it is up to the application producing Unicode text to use application-specific means to communicate what arbitrary assignments it uses those code points for. Saying an application should assume ISO-6429 as the recommended default if it does not implement a compatible receiver of what the producer communicates is a bug that just hasn't bitten many people yet, imo.
(0003256)
steffen (reporter)
2016-06-09 13:55

Re ..the proposed text:

I object to

 |whether \c\\ or \c\ was used to represent <FS>. This standard requires \c\\ to make backslash escape processing consistent.

and I don't know which current real-life implementation the standard bases this on.

 ?0[steffen@wales tmp]$ cat t.sh;for i in bash mksh; do echo $i:;$i t.sh|s-hex;done
 echo a$'\c\b'c
 bash:
 00000000 61 1c 62 63 0a |a.bc.|
 00000005
 mksh:
 00000000 61 1c 62 63 0a |a.bc.|
 00000005
 ?0[sdaoden@wales tmp]$

So in left-to-right parsing, \c is a function that takes an argument; if that argument is the backslash \, it is consumed, with no need to quote it. Just as "i" (and meaning the quote) would expect it to be.

Maybe the standard core developers ran into a bash bug:

 ?0[steffen@wales tmp]$ cat t.sh;for i in bash mksh; do echo $i:;$i t.sh|s-hex;done
 echo a$'\c\'c
 bash:
 t.sh: line 1: unexpected EOF while looking for matching `''
 t.sh: line 2: syntax error: unexpected end of file
 mksh:
 00000000 61 1c 63 0a |a.c.|
 00000004

 ?0[steffen@wales tmp]$ cat t.sh;for i in bash mksh; do echo $i:;$i t.sh|s-hex;done
 echo a$'\c\\'c
 bash:
 00000000 61 1c 63 0a |a.c.|
 00000004
 mksh:
 t.sh[2]: no closing quote

but that shouldn't be used as a template for a standard that has yet to be born. Imho.
(0003257)
stephane (reporter)
2016-06-09 14:10
edited on: 2016-06-09 14:26

ksh93 is the shell $'...' comes from (while $'\uxxxx' comes from zsh: http://www.zsh.org/mla/workers/2003/msg00223.html)

$ ksh93 -c "printf %s $'\cA'" | hd
00000000  01                                                |.|
00000001
$ ksh93 -c "printf %s $'\c\1'" | hd
00000000  41                                                |A|
00000001


So in ksh93 $'\c\x' seems to be a feature and the reverse of $'\cx'

Or more precisely, $'\cX' toggles bit 6 of the first byte of the expansion (\xxx expansion) of X
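
A minimal sketch of that toggle rule in portable sh, for illustration only. It assumes a single ASCII character as input; the helper names ctrl_toggle and hexbytes are hypothetical and are not part of ksh93 or of the proposed text. It uses only printf and od, so it does not itself depend on $'...' support.

```shell
# Hypothetical helper: emit the byte obtained by flipping bit 6 (0x40)
# of the character passed as $1 (assumed to be a single ASCII character).
ctrl_toggle() {
    # printf '%d' "'X" yields the numeric value of the character X.
    code=$(printf '%d' "'$1")
    # Flip bit 6, then emit the resulting byte via an octal escape.
    printf "\\$(printf '%03o' $((code ^ 64)))"
}

# Hypothetical helper: print stdin as hex byte values, no separators.
hexbytes() { od -An -tx1 | tr -d ' \n'; }

ctrl_toggle A | hexbytes; echo                    # 01, like $'\cA'
ctrl_toggle "$(printf '\001')" | hexbytes; echo   # 41, like $'\c\1'
ctrl_toggle '\' | hexbytes; echo                  # 1c, i.e. <FS>
```

Under this reading, <FS> falls out of the same rule: the backslash is byte 0x5c, and flipping bit 6 gives 0x1c.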

(0003258)
shware_systems (reporter)
2016-06-09 14:15

It may be invention, but the purpose is to allow constructs like \c\u41 as a synonym for \cA, currently, and \c\x01 for A if the converse extension is implemented. Using \\ makes it consistent with the other escapes so lookahead isn't required.
(0003259)
steffen (reporter)
2016-06-09 15:07

Re @3257:
 | So in ksh93 $'\c\x' seems to be a feature and the reverse of $'\cx'

Nice, but the standard core developers don't allow this in general; they _only_ and _explicitly_ require it for \c\\, a.k.a. <FS>.

Given the context shown in @3256, it could be deduced that this desire de facto originates in something that is nothing but a bug, of bash(1) to be exact.
(Luckily I just fixed one of my bugs, so I dare: ..that only looked for a single backslash at EOL to quote LF, instead of looking backwards whether the backslash is itself escaped (i.e., whether there is an odd number of backslashes before the LF): just in case you cannot parse left-to-right.)
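The backwards check described in that parenthesis can be sketched in portable sh as follows. This is an illustration under the assumption that only the parity of the trailing backslashes matters; the function name is hypothetical and the code is not taken from any of the shells discussed here.

```shell
# Hypothetical helper: succeeds when the line ends in an odd number of
# backslashes, i.e. the last one is itself unescaped and would quote a
# following LF. The body runs in a subshell so no variables leak out.
ends_with_unescaped_backslash() (
    line=$1
    count=0
    # Strip trailing backslashes one at a time, counting them.
    while :; do
        case $line in
            *\\) line=${line%?}; count=$((count + 1)) ;;
            *) break ;;
        esac
    done
    [ $((count % 2)) -eq 1 ]
)

for s in 'abc' 'abc\' 'abc\\' 'abc\\\'; do
    if ends_with_unescaped_backslash "$s"; then
        printf '%s -> LF is quoted\n' "$s"
    else
        printf '%s -> LF is not quoted\n' "$s"
    fi
done
```

A parser that consumes escapes strictly left-to-right never needs this check, because it has already paired each backslash with the character after it by the time it reaches the end of the line.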
(0003262)
stephane (reporter)
2016-06-09 18:42

Re @3259

yes, from a cursory look, I agree it does look like something like:

\c\<anything-other-than-\> -> unspecified

is missing from the proposed specification in @590 (as of 2016-06-09), but it's kind of implied. I would say you're free to implement $'\c\a' as either $'\x47' (as ksh93 does) or as $'\x1c'a or as "invalid control sequence".

- Issue History
Date Modified Username Field Change
2010-04-30 21:42 dwheeler New Issue
2010-04-30 21:42 dwheeler Status New => Under Review
2010-04-30 21:42 dwheeler Assigned To => ajosey
2010-04-30 21:42 dwheeler Name => David A. Wheeler
2010-04-30 21:42 dwheeler Section => 2.2 Quoting
2010-04-30 21:42 dwheeler Page Number => 2298-2299
2010-04-30 21:42 dwheeler Line Number => 72348-72401
2010-09-16 16:17 nick Note Added: 0000548
2010-09-18 18:12 Don Cragun Relationship added related to 0000322
2010-10-01 12:48 geoffclare Note Added: 0000560
2010-10-06 01:26 nick Tag Attached: c99
2010-10-08 17:28 mirabilos Note Added: 0000565
2010-10-25 06:17 Don Cragun Note Added: 0000590
2010-10-25 14:51 Don Cragun Note Edited: 0000590
2010-10-25 15:55 Don Cragun Note Edited: 0000590
2010-10-26 06:44 Don Cragun Note Edited: 0000590
2010-10-26 20:39 Don Cragun Note Edited: 0000590
2010-10-26 20:40 Don Cragun Note Edited: 0000590
2010-10-26 20:40 Don Cragun Note Edited: 0000590
2010-10-26 20:45 Don Cragun Note Edited: 0000590
2010-10-26 21:04 Don Cragun Note Edited: 0000590
2010-10-27 03:29 Don Cragun Note Edited: 0000590
2010-11-04 16:07 nick Note Added: 0000599
2010-11-05 02:34 Don Cragun Note Edited: 0000590
2010-11-05 03:00 Don Cragun Note Added: 0000601
2010-11-05 03:04 Don Cragun Note Edited: 0000601
2010-11-05 14:52 nick Note Added: 0000609
2010-11-05 14:54 nick File Added: n1534.htm
2010-11-05 14:56 nick File Added: n1534_original.htm
2010-11-11 16:40 Don Cragun Note Edited: 0000590
2010-11-11 16:42 Don Cragun Note Edited: 0000590
2010-11-11 16:44 Don Cragun Interp Status => ---
2010-11-11 16:44 Don Cragun Final Accepted Text => See Note: 0000590
2010-11-11 16:44 Don Cragun Status Under Review => Resolved
2010-11-11 16:44 Don Cragun Resolution Open => Accepted As Marked
2010-11-11 16:44 Don Cragun Tag Attached: issue8
2010-12-09 16:12 Don Cragun Note Edited: 0000590
2015-07-31 15:59 stephane Issue Monitored: stephane
2015-08-20 15:24 nick Note Edited: 0000590
2015-08-20 15:25 nick Note Edited: 0000590
2015-08-20 15:31 nick Note Added: 0002793
2015-08-20 15:32 nick Note Edited: 0002793
2015-08-20 15:33 nick Note Edited: 0002793
2015-09-03 16:40 rhansen Note Added: 0002809
2015-09-03 16:45 rhansen Note Edited: 0002809
2015-09-03 16:48 rhansen Note Edited: 0002809
2015-09-03 17:00 rhansen Note Edited: 0002809
2015-09-03 17:05 rhansen Note Edited: 0002809
2015-09-03 17:10 rhansen Tag Attached: UTF-8_Locale
2015-09-10 20:25 rhansen Note Added: 0002824
2015-09-10 20:26 rhansen Note Edited: 0002824
2015-09-10 20:26 rhansen Note Edited: 0002824
2015-09-10 20:27 rhansen Note Edited: 0002809
2015-10-08 06:43 Don Cragun Final Accepted Text See Note: 0000590 => Previously accepted text is in Note: 0000590.
2015-10-08 06:43 Don Cragun Resolution Accepted As Marked => Reopened
2015-10-08 16:32 Don Cragun Status Resolved => Under Review
2015-10-08 16:41 rhansen Relationship added related to 0000985
2015-11-09 21:10 steffen Note Added: 0002893
2015-11-13 19:19 shware_systems Note Added: 0002922
2015-11-13 21:22 steffen Note Added: 0002923
2015-11-13 21:24 steffen Note Added: 0002925
2015-11-13 23:23 shware_systems Note Added: 0002929
2015-11-14 14:26 steffen Note Added: 0002942
2016-06-06 21:07 steffen Note Added: 0003247
2016-06-07 19:50 shware_systems Note Added: 0003248
2016-06-07 19:53 shware_systems Note Edited: 0003248
2016-06-07 20:25 steffen Note Added: 0003249
2016-06-07 20:27 steffen Note Added: 0003250
2016-06-08 06:51 shware_systems Note Added: 0003251
2016-06-08 10:43 steffen Note Added: 0003252
2016-06-08 19:55 shware_systems Note Added: 0003254
2016-06-09 13:55 steffen Note Added: 0003256
2016-06-09 14:10 stephane Note Added: 0003257
2016-06-09 14:15 shware_systems Note Added: 0003258
2016-06-09 14:18 stephane Note Edited: 0003257
2016-06-09 14:26 stephane Note Edited: 0003257
2016-06-09 15:07 steffen Note Added: 0003259
2016-06-09 18:42 stephane Note Added: 0003262

