Viewing Issue Simple Details
| ID | Category | Severity | Type | Date Submitted | Last Update |
| 0000249 | [1003.1(2008)/Issue 7] Shell and Utilities | Objection | Enhancement Request | 2010-04-30 21:42 | 2010-11-11 16:44 |
| Reporter | dwheeler | View Status | public |
| Assigned To | ajosey |
| Priority | normal | Resolution | Accepted As Marked |
| Status | Resolved |
| Name | David A. Wheeler |
| Organization | |
| User Reference | |
| Section | 2.2 Quoting |
| Page Number | 2298-2299 |
| Line Number | 72348-72401 |
| Interp Status | --- |
| Final Accepted Text | See Note: 0000590 |
| Summary | 0000249: Add standard support for $'...' in shell |
| Description |
It is annoyingly difficult to set shell variables to values containing certain characters, for example a value that ends with "\n". This is particularly a problem when manipulating IFS, but it occurs in many other situations too. One reason is that command substitution removes all trailing newlines, so constructs like this don't work:

    IFS=$(printf "\n")

It's possible to work around this (e.g., with clever parameter expansions), but the work-arounds are ugly and error-prone. It would be better to support simple, clear scripts when you want to insert characters like newline.

Thus, many shells add the $'...' construct, which is *widely* supported: by ksh, zsh, bash, and the busybox "sh" command at *least*. This widespread support suggests that the construct is useful and easy to implement. The fact that even *busybox* supports it, despite its focus on being very small, suggests that this is *important* functionality.

As the Korn shell documentation says, "the purpose of $'...' is to solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all "..." strings handle ANSI-C escapes, but that would not be backwards compatible." (http://kornshell.com/doc/faq.html)

I used bash's list of constructs supported in $'...'. I think supporting hex values is a good idea. The \cX constructs are less necessary, but since they're in bash, they may as well be included too (they're trivial to support).
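The trailing-newline problem described above can be reproduced in a few lines. This is a minimal sketch, not part of the original report: the variable names are illustrative, and the last assignment assumes a shell that already implements $'...' (bash is assumed here).

```shell
#!/usr/bin/env bash
# Command substitution removes all trailing newlines, so this value is empty.
broken=$(printf '\n')

# A common workaround: append a sentinel character, then strip it again.
workaround=$(printf '\nx')
workaround=${workaround%x}

# With dollar-single-quotes the intent is stated directly.
# Assumption: the running shell implements $'...'.
direct=$'\n'

printf 'broken length:     %s\n' "${#broken}"
printf 'workaround length: %s\n' "${#workaround}"
printf 'direct length:     %s\n' "${#direct}"
```

The first length is 0 and the other two are 1, which is why the sentinel trick (or $'\n') is needed when setting IFS to a newline.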
| Desired Action |
Page 2299, line 72401: At the end of section 2.2, add a new subsection, "2.2.4 Dollar-single-quotes", with this text:

A word beginning with dollar-single-quote ($'), continuing all the way to an unescaped matching single-quote ('), shall preserve the literal value of all characters within, with the exception of certain backslash escape sequences. The defined backslash escape sequences are:

\a     alert (bell)
\b     backspace
\e     an escape character
\f     form feed
\n     new line
\r     carriage return
\t     horizontal tab
\v     vertical tab
\\     backslash
\'     single quote
\nnn   the eight-bit character whose value is the octal value nnn (one to three digits)
\xHH   the eight-bit character whose value is the hexadecimal value HH (one or two hex digits)
\cx    a control-x character; \cA is 1, while \cZ is 26.

If a backslash is followed by any other sequence, the result is implementation-defined.

The expanded result is treated as if it was single-quoted (as if the dollar sign had not been present); in particular, it is *not* split using the IFS values.

On page 2298, after line 72385, add:

Within the string of characters from an enclosed "$'" to the matching "'", processing shall occur as described in section 2.2.4.

In 0000049, need to add to the end of the list:

* a <single-quote>

See: http://austingroupbugs.net/view.php?id=49
| Tags | c99, issue8 |
| Attached Files |
Relationships
Notes
(0000548) nick (manager) 2010-09-16 16:17
The C standard also includes Universal Character constants, and these should also be considered.

6.4.3 Universal character names

Syntax
1 universal-character-name:
      \u hex-quad
      \U hex-quad hex-quad
  hex-quad:
      hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

Constraints
2 A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.

Description
3 Universal character names may be used in identifiers, character constants, and string literals to designate characters that are not in the basic character set.

Semantics
4 The universal character name \Unnnnnnnn designates the character whose eight-digit short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal character name \unnnn designates the character whose four-digit short identifier is nnnn (and whose eight-digit short identifier is 0000nnnn).
|
(0000560) geoffclare (manager) 2010-10-01 12:48
So far, discussion of this bug has been about which escapes to include. There are some problems with other aspects of the proposed text that need to be sorted out. The ones I have spotted are:

Use of "A word beginning" is wrong, as the expansion can be used within a word. E.g. a$'b'c expands to abc.

The sentence beginning "The expanded result is treated as if it was single-quoted" needs to use "shall", and the rest of that sentence should either be removed or reworked.

This addition to 2.2.3 Double-Quotes is not right:

    On page 2298, after line 72385, add: Within the string of characters from an enclosed "$'" to the matching "'", processing shall occur as described in section 2.2.4.

since ksh and bash do not expand $'...' within double quotes. A possible alternative might be to append the following to line 72376:

    The <dollar-sign> shall not retain its special meaning introducing the dollar-single-quotes form of quoting (see [xref to 2.2.4]).

Finally, an additional change is needed in 2.3 Token Recognition in list item 4.
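The two behaviours mentioned in this note can be checked directly. This is a minimal sketch, assuming bash (one of the shells named above); the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Assumption: run under bash, which implements $'...'.
inside_word=a$'b'c      # $'...' used in the middle of a word
in_dquotes="a$'b'c"     # inside double quotes the sequence stays literal

printf '%s\n' "$inside_word"
printf '%s\n' "$in_dquotes"
```

The first line printed is abc; the second keeps the dollar sign and both single quotes unchanged.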
|
(0000565) mirabilos (reporter) 2010-10-08 17:28
The following documents existing practice for $'…' in mksh, called “C style” there to distinguish it from the printf builtin’s escape sequence parsing (where, for example, \123 is invalid and must be written as \0123).

Supported escapes: \a \b \c# \E \e \f \n \r \t \v \### \U######## \u#### \x#… \' \\

Notes on some:
• \c# where # is ? will yield 0x7F; every other octet is ANDed with 0x1F (using non-ASCII chars is undefined behaviour as I plan to let mksh use wide characters internally in $some $large time)
• \E and \e yield 0x1B
• \### looks for one, two or three octal digits and converts to an octet, stopping at the first non-digit octet
• \U and \u look for up to eight/four hexadecimal digits and convert to UTF-8 (actually, mksh only supports the BMP though; may as well be CESU-8)
• \x in “C style” mode gobbles up as many hexadecimal characters as it can (it doesn’t just eat one or two), stopping at a non-digit octet, and converts to an octet if <= 0xFF, to UTF-8 (see above) otherwise
• \' allows this to work: x=a$'\''b
• \\ should be self-explanatory

Every other escaped character \X is replaced with itself X (although the amount of supported escape sequences may vary, so it’s in practice UD too). (The print builtin mode excludes e.g. \? from this for compatibility reasons.)

This means we have the following differences:
• \E is new (from GNU bash… although why POSIX (GNU’s Not Unix) would look at it is beyond me), which counts as UD, so is ok
• \x doesn’t terminate after two hex digits; I’d like to make this a user error and standardise only for the case where the third octet after \x is NOT in the range [0-9a-fA-F] for the sake of compatibility
• \u supports manual creation of surrogates in CESU-8 encoding, they’re not forbidden
• \U for values outside the BMP is undesirable in mksh (but this can be viewed as a limitation of the operating environment; while mksh doesn’t use the OS’ implementation it shares MirBSD’s, which uses an encoding based on CESU-8 internally and has a 16-bit wchar_t)
• in practice, \u/\U > 0xFFFD yield 0xFFFD (0xFFFE and 0xFFFF are invalid anyway)
• since this escape method is more generic than identifiers, all values from 0x0000 (inclusive) and up are supported (this may be desirable)

In mksh, a$'b'c expands to abc and "a$'b'c" expands to a$'b'c as suggested above as well.

There’s a sort of testsuite at https://www.mirbsd.org/cvs.cgi/src/bin/mksh/check.t?rev=HEAD – search for “dollar-quoted-strings” and follow it down, including “dollar-quotes-in-heredocs” (use <<$'…' delimiters) and “dollar-quotes-in-herestrings” (for the <<< extension). These are somewhat modelled after ksh93 while looking at GNU bash and zsh, with some real-life testing added.

---
So, to summarise, I’d like to get the following as implementation-defined behaviour: the behaviour when a backslash 'x' sequence is followed by two octets in the range [0-9a-fA-F] and the next (third) octet is also in that range.

The rest is merely for consideration; it might be wise to not limit the u and U escapes to characters valid for identifiers.

From a lesson learned with JSON, it should be explicitly stated that the octet following the backslash character is matched case-sensitively. (Do we assume ASCII here?)
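The \x difference only matters when a hexadecimal digit directly follows a two-digit escape. The sketch below shows one spelling that sidesteps the question by closing the quoted string after two digits. It is offered as an illustration rather than as mksh's documented practice, and it assumes a shell that implements $'...' (bash is assumed here) and an ASCII-based codeset.

```shell
#!/usr/bin/env bash
# $'\x41B' is ambiguous across shells: one that reads at most two hex digits
# sees 'A' followed by a literal 'B'; one that reads as many digits as it can
# sees the single value 0x41B. Ending the quoted string after two digits
# avoids the question. The variable name is illustrative.
portable=$'\x41'B

printf '%s\n' "$portable"
```

This prints AB regardless of how many digits the shell is willing to consume after \x.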
|
(0000590) Don Cragun (manager) 2010-10-25 06:17 edited on: 2010-12-09 16:12
The following is a complete replacement for the Desired Action to be taken to resolve this issue. It is based on other notes in this bug report, numerous e-mail messages on the austin-group-l alias, and discussions held during Austin Group conference calls:
In XCU section 2.2 (Quoting) change:
"The various quoting mechanisms are the escape character,
single-quotes, and double-quotes."
on P2298, L72359 to:
"The various quoting mechanisms are the escape character,
single-quotes, double-quotes, and dollar-single-quotes."
Add the following new paragraph to XCU section 2.2.3 (Double-Quotes)
after P2298, L72385 (indented to the same level as the paragraph
on L72382-72385):
"The <dollar-sign> shall not retain its special meaning
introducing the dollar-single-quotes form of quoting (see
[xref to 2.2.4])."
In XCU 2.2.3 (Double-Quotes) change:
"A single-quoted or double-quoted string that begins, but
does not end, within the "`...`" sequence"
on P2299, L72392-72393 to:
"A quoted (single-quoted, double-quoted, or dollar-single-quoted)
string that begins, but does not end, within the "`...`"
sequence"
At the end of XCU section 2.2 after P2299, L72401 add a new subsection,
with this text:
2.2.4 Dollar-Single-Quotes
A sequence of characters starting with a <dollar-sign>
immediately followed by a single-quote ($') shall preserve
the literal value of all characters up to an unescaped
terminating single-quote ('), with the exception of certain
backslash escape sequences, as follows:
\"          yields a <quotation-mark> (double-quote) character.
\'          yields an <apostrophe> (single-quote) character.
\\          yields a <backslash> character.
\a          yields an <alert> character.
\b          yields a <backspace> character.
\e          yields an <ESC> character.
\f          yields a <form-feed> character.
\n          yields a <newline> character.
\r          yields a <carriage-return> character.
\t          yields a <tab> character.
\v          yields a <vertical-tab> character.
\cX         yields the control character listed in the Value
            column of Table 4.21 on page 3203 (xref to
            XCU Table 4.21) in the Operands section of
            the stty utility when X is one of the
            characters listed in the ^c column of the
            same table, except that \c\\ yields the
            <FS> control character since the <backslash>
            character must be escaped.
\uXXXX      yields the character specified by ISO/IEC 10646
            whose four-digit short universal character
            name is XXXX (and whose eight-digit short
            universal character name is 0000XXXX).
\UXXXXXXXX  yields the character specified by ISO/IEC
            10646 whose eight-digit short universal
            character name is XXXXXXXX.
\xXX        yields the byte whose value is the hexadecimal
            value XX (one or two hex digits)
\XXX        yields the byte whose value is the octal value XXX
            (one to three octal digits)
If a \xXX or \XXX escape sequence yields a byte whose value
is 0, that byte and any following regular characters and
escape sequences up to the terminating unescaped single-quote
shall be evaluated and discarded.
If a universal character name specified by \uXXXX or
\UXXXXXXXX is less than 00A0 other than 0024 ($), 0040 (@),
or 0060 (`), the results are undefined. If a universal
character name specified by \uXXXX or \UXXXXXXXX is in the
range D800 through DFFF inclusive, the results are undefined.
If more than two hexadecimal digits immediately follow \x or
if the octal value specified by \XXX will not fit in a byte, the
results are unspecified.
If a backslash is followed by any sequence not listed above
or if \e, \cX, \uXXXX, or \UXXXXXXXX yields a character
that is not present in the current locale, the result is
implementation-defined.
If a backslash escape sequence represents a single-quote
character (for example \u0060, \U00000060, or \'), that
sequence shall not terminate the dollar-single-quote
sequence.
In XCU section 2.3 (Token Recognition) point 4, change:
"If the current character is <backslash>, single-quote, or
double-quote and it is not quoted, it shall affect quoting
for subsequent characters up to the end of the quoted text.
The rules for quoting are as described in Section 2.2 (on
page 2298). During token recognition no substitutions shall
be actually performed, and the result token shall contain
exactly the characters that appear in the input (except for
<newline> joining), unmodified, including any embedded or
enclosing quotes or substitution operators, between the
<quotation-mark> and the end of the quoted text. The token
shall not be delimited by the end of the quoted field."
on P2299-2300, L72425-72432 to:
"If the current character is an unquoted <backslash>,
single-quote, or double-quote or is the first character of
an unquoted <dollar-sign> single-quote sequence, it shall
affect quoting for subsequent characters up to the end of
the quoted text. The rules for quoting are as described
in Section 2.2 (on page 2298). During token recognition no
substitutions shall be actually performed, and the result
token shall contain exactly the characters that appear in
the input (except for <newline> joining), unmodified,
including any embedded or enclosing quotes or substitution
operators, between the start and the end of the quoted text.
The token shall not be delimited by the end of the quoted
field."
In XCU section 2.6, add:
* a <single-quote>
to the end of the list created by the changes identified in Mantis
BugID 49. (BugID 49 modifies text on P2305, L72669-72671 in XCU7.)
In XCU 2.6.3 (Command Substitution) change:
"A single-quoted or double-quoted string that begins, but
does not end, within the "`...`" sequence produces undefined
results."
on P2309, L72827-72829 to:
"A quoted string that begins, but does not end, within the
"`...`" sequence produces undefined results."
In XCU section 2.6.7 (Quote Removal), change:
"The quote characters (<backslash>, single-quote, and
double-quote) that were present in the original word shall
be removed unless they have themselves been quoted."
on P2311, L72904-72905 to:
"The quote character sequence <dollar-sign> single-quote
and the characters (<backslash>, single-quote, and
double-quote) that were present in the original word shall
be removed unless they have themselves been quoted."
In the Note in XCU section 2.10.2 (Shell Grammar Rules) rule 1, change:
"Because at this point <quotation-mark> characters are
retained in the token, quoted strings cannot be recognized
as reserved words."
on P2325, L73436-73437 to:
"Because at this point quoting characters (<backslash>,
single-quote, <quotation-mark>, and the <dollar-sign>
single-quote sequence) are retained in the token, quoted
strings cannot be recognized as reserved words."
Add a new section in XRAT following P3649, L124068:
<italic>C.2.2.4 Dollar-Single-Quotes</italic>
The $'...' quoting construct has been implemented
in several recent shells. It is similar to
character string literals ("...") in the C Standard
with the following exceptions:
1. The \x escape sequence in C can be followed by an
arbitrary number of hexadecimal digits. The
ksh93 implementation of $'...' also consumes
an arbitrary number of hexadecimal digits; bash
consumes at most two hexadecimal digits in this
case. This standard leaves the result unspecified
if more than two hexadecimal digits follow \x.
(Note that a hexadecimal escape followed by a
literal hexadecimal character can always be
represented as $'\xXX'X)
2. The \c escape sequence is not included in the C
standard. There was also some disagreement in
shells that historically supported \c escape
sequences in $'...'. These include:
A. whether \cA through \cZ produced the byte values
1 through 26, respectively, or supported the
codeset-independent control character as
specified by the stty utility. This standard
requires codeset independence.
B. whether \c[, \c\\, \c], \c^, \c_, and \c? could
be used to yield the <ESC>, <FS>, <GS>,
<RS>, <US>, and <DEL> control characters,
respectively. This standard requires support
for all of the control characters except
NULL (matching what is done in the stty
utility).
C. whether \c\\ or \c\ was used to represent <FS>.
This standard requires \c\\ to make backslash
escape processing consistent.
The implementors of the most common shells that
implement $'\cX' agreed to convert to the
behavior specified in this standard.
Some shells also allow \c<arbitrary_control_character>
to act as an inverse function to \cX (i.e., \cm
and \cM yield <CR> and \c<CR> yields m or M).
This standard leaves this behavior
implementation-defined.
3. The \e escape sequence is not included in the C
Standard, but was provided by all historical
shells that supported $'...'. Some also supported
\E as a synonym. One member of the group
objected to adding \e because the <ESC> control
character is not required to be in the portable
character set. The \e sequence is included
because many historical users of $'...' expect
it to be there. The \E sequence is not included
in this standard because <backslash> escape
sequences that start with <backslash> followed
by an uppercase letter (except \U) are reserved
by the C Standard for implementation use.
4. The \XXX octal escape sequence and the \xXX
hexadecimal escape sequence can be used to
insert a null byte into a C Standard character
string literal and into a $'...' quoted word
in this standard. In C, any characters specified
after that null byte (including escape sequences)
continue to be processed and added to the
expansion of the character string literal. In
$'...' in the shell, however, this standard
requires the null byte and all remaining
characters up to the terminating unescaped
single-quote be evaluated and discarded. (This
was historic practice in bash, but not in ksh93.)
This allows an escape sequence producing a null
byte to terminate the dollar-single-quoted
expansion, but not terminate the token in which
it appears if there are characters remaining
in the token. For example:
printf a$'b\0c\''d
is required by this standard to produce:
abd
while historic versions of ksh93 produced:
ab
5. The double-quote (") character can be used
literally, while the single-quote (') character must
be represented as an escape sequence. In C,
single-quote can be used literally, while
double-quote requires an escape sequence.
In C, if a character specified by \uXXXX or \UXXXXXXXX
in "..." does not exist in the execution character
set, the results are implementation-defined.
Similarly, this standard makes the results
implementation-defined if \e, \cX, \uXXXX, or
\UXXXXXXXX specifies a character that is not present
in the current locale.
Note that the escape sequences recognized by $'...',
file format notation (see Table 5-1 on page 121),
XSI-conforming implementations of the echo utility
(see the echo utility's operands section on page
2615), and the printf utility's format operand (see
the printf utility's extended description section
on page 3050) are not the same. Some escape sequences
are not recognized by all of the above; the \c
escape sequence in echo is not at all like the \c
escape sequence in $'...'; and octal escape sequences
in some of the above accept one to four octal digits
and require a leading zero, while others accept one
to three octal digits and do not require a leading
zero.
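A few of the escape sequences from the proposed 2.2.4 text, shown as a minimal sketch. It assumes a shell that already implements $'...' (bash is assumed here) and an ASCII-based codeset; the variable names are illustrative only. The \u and \U forms are left out because their result depends on the current locale.

```shell
#!/usr/bin/env bash
# Assumptions: bash (or another shell implementing $'...'), ASCII-based codeset.
tab=$'\t'            # <tab>
quote=$'It\'s'       # \' yields a single-quote and does not end the string
octal=$'\101'        # one to three octal digits
hex=$'\x41'          # one or two hexadecimal digits
ctrl_a=$'\cA'        # control character, as ^A in the stty table

printf '%s\n' "$quote" "$octal" "$hex"
# Dump the two control characters as hex bytes so they are visible.
printf '%s' "$tab$ctrl_a" | od -An -tx1
```

Both the octal and the hexadecimal form yield the letter A here, and \cA yields the byte with value 1.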
|
|
(0000599) nick (manager) 2010-11-04 16:06
The issue of adding \cX and \e was discussed at the WG14 meeting. A paper to add these character constants was requested and submitted (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm). However, since this paper was in response to a late liaison statement, an objection to considering it was raised. The committee agreed that POSIX was free to use these characters, and encouraged resubmission of N1534 as a liaison ballot comment, probably reworded to be an update to the future directions section. WG14 has indicated that POSIX use of the "Future Standardization" is satisfactory.
|
(0000601) Don Cragun (manager) 2010-11-05 03:00 edited on: 2010-11-05 03:04
Note: 0000590 was updated today to account for another difference in the way bash and ksh93 processed $'...'. If an escape sequence expanded to a null byte, bash terminated the $'...' expansion, but still concatenated any following characters that appeared after the $'...' expansion that were in the same lexical token. On the other hand, ksh93 terminated the entire token in this case. The ksh93 developers agreed to change to match bash in this case if the standard requires this behavior.

Also note that if the C1x standard adopts the changes suggested in the document referenced in Note: 0000599, the "\e, " in the last paragraph of the newly created 2.2.4 section can be removed (since the <ESC> character would be moved into the basic character set in C1x). If this happens, we should also consider adding \e to XBD Table 5-1 and checking the places in the standard that refer to that table to determine whether \e processing should also be required in those places.
|
(0000609) nick (manager) 2010-11-05 14:52
This issue was presented to WG14 (C). The initial reaction was to request a paper to add \cX and \e as character constants (see http://wiki.dinkumware.com/twiki/pub/WG14/Documents/n1534.htm, attached). This paper was subsequently rejected on procedural grounds, as it was late and made fundamental changes that might also impact C++. Instead, the paper was revised to propose adding these characters to the Future Directions section, and a liaison comment on the CD ballot to ask for this would be appropriate. The WG14 committee agreed that there was no problem in POSIX adding these characters to the shell, and did not object to the use of the namespace.
Issue History
| Date Modified | Username | Field | Change |
| 2010-04-30 21:42 | dwheeler | New Issue | |
| 2010-04-30 21:42 | dwheeler | Status | New => Under Review |
| 2010-04-30 21:42 | dwheeler | Assigned To | => ajosey |
| 2010-04-30 21:42 | dwheeler | Name | => David A. Wheeler |
| 2010-04-30 21:42 | dwheeler | Section | => 2.2 Quoting |
| 2010-04-30 21:42 | dwheeler | Page Number | => 2298-2299 |
| 2010-04-30 21:42 | dwheeler | Line Number | => 72348-72401 |
| 2010-09-16 16:17 | nick | Note Added: 0000548 | |
| 2010-09-18 18:12 | Don Cragun | Relationship added | related to 0000322 |
| 2010-10-01 12:48 | geoffclare | Note Added: 0000560 | |
| 2010-10-06 01:26 | nick | Tag Attached: c99 | |
| 2010-10-08 17:28 | mirabilos | Note Added: 0000565 | |
| 2010-10-25 06:17 | Don Cragun | Note Added: 0000590 | |
| 2010-10-25 14:51 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-25 15:55 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-26 06:44 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-26 20:39 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-26 20:40 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-26 20:40 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-26 20:45 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-26 21:04 | Don Cragun | Note Edited: 0000590 | |
| 2010-10-27 03:29 | Don Cragun | Note Edited: 0000590 | |
| 2010-11-04 16:07 | nick | Note Added: 0000599 | |
| 2010-11-05 02:34 | Don Cragun | Note Edited: 0000590 | |
| 2010-11-05 03:00 | Don Cragun | Note Added: 0000601 | |
| 2010-11-05 03:04 | Don Cragun | Note Edited: 0000601 | |
| 2010-11-05 14:52 | nick | Note Added: 0000609 | |
| 2010-11-05 14:54 | nick | File Added: n1534.htm | |
| 2010-11-05 14:56 | nick | File Added: n1534_original.htm | |
| 2010-11-11 16:40 | Don Cragun | Note Edited: 0000590 | |
| 2010-11-11 16:42 | Don Cragun | Note Edited: 0000590 | |
| 2010-11-11 16:44 | Don Cragun | Interp Status | => --- |
| 2010-11-11 16:44 | Don Cragun | Final Accepted Text | => See Note: 0000590 |
| 2010-11-11 16:44 | Don Cragun | Status | Under Review => Resolved |
| 2010-11-11 16:44 | Don Cragun | Resolution | Open => Accepted As Marked |
| 2010-11-11 16:44 | Don Cragun | Tag Attached: issue8 | |
| 2010-12-09 16:12 | Don Cragun | Note Edited: 0000590 | |