View Issue Details

ID: 0000745
Project: 1003.1(2013)/Issue7+TC1
Category: Base Definitions and Headers
View Status: public
Last Update: 2019-06-10 08:55
Reporter: steffen
Assigned To:
Priority: normal
Severity: Editorial
Type: Clarification Requested
Status: Closed
Resolution: Accepted As Marked
Name: Steffen
Organization:
User Reference:
Section: XBD, 6.2
Page Number: 128
Line Number: 3646-7
Interp Status: Approved
Final Accepted Text: 0000745:0001851
Summary: 0000745: Require that <newline> may not occur as part of multibyte characters
Description: Many POSIX tools perform by-byte-value comparison against the <newline> character. In newer editions of POSIX, the character set constraint that "<period> and <slash> shall not occur as part of any other character in any locale" has been added to allow portable, multibyte-safe operations and interchange of filenames and paths. I think it would be an enhancement to extend this to the <newline> character, too.

I've asked the character-set experts on the unicode@unicode.org mailing list (thread starting at [1]) whether it would be safe to implement this change in POSIX, and the answers assert that this would be a possible change (at least for character sets that can plausibly be used for POSIX locales).

 [1] http://www.unicode.org/mail-arch/unicode-ml/y2013-m08/0093.html

There was also a thread on the POSIX mailing list regarding this. I cannot give the mailing-list message number since permalinks have been moved to GMANE, but the original mail is available there as [2] (thread: [3]).

 [2] http://permalink.gmane.org/gmane.comp.standards.posix.austin.general/8037
 [3] http://comments.gmane.org/gmane.comp.standards.posix.austin.general/8037
Desired Action: Change the given paragraph from

  Likewise, the byte values used to encode <period> and <slash>
  shall not occur as part of any other character in any locale.

to

  Likewise, the byte values used to encode <period>, <slash>,
  <newline> and <carriage-return> shall not occur as part of
  any other character in any locale.
Tags: tc2-2008

Activities

geoffclare

2013-09-02 14:40

manager   bugnote:0001781

Three points:

1. When this was originally raised on the mailing list, it was pointed
out that the normal way applications check whether fgets() has returned
an incomplete line is to test the last byte in the returned buffer for
equality with '\n'. This aspect of the fgets() interface is what makes
this a matter that we should bring to the attention of the C committee.

2. The case for requiring that <newline> cannot be part of any other
character is quite strong, but I don't see the case for <carriage-return>
being as strong. Conversely, if there is a case for <carriage-return>,
then what about other control characters such as <tab> or <form-feed>?

3. I don't think the suggested wording change is the right way to add
the desired requirement.

It is clear what "in any locale" means for <period> and <slash>, since
they are required to have the same encoding across all locales. For
characters whose encoding can vary between locales, it's not clear what
is meant: is the byte value for <newline> in one locale not allowed to
be part of a character in another locale, even if <newline> has a
different byte value in that locale? (I don't think that was intended.)

I would suggest the following change instead (possibly with an
adjustment to the list of control characters):

After:

  Likewise, the byte values used to encode <period> and <slash>
  shall not occur as part of any other character in any locale.

add:

  The byte values used to encode <newline> and <carriage-return> shall
  not occur as part of any other character in the same locale.
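
Point 1 above, the fgets() check, can be sketched roughly as follows. This is a minimal illustration only; the names ends_with_newline(), read_chunk() and enum chunk_kind are made up for this example and do not come from the standard or from the mailing list thread.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of what one fgets() call returned. */
enum chunk_kind { CHUNK_EOF, CHUNK_COMPLETE_LINE, CHUNK_PARTIAL_LINE };

/* Nonzero if the string is non-empty and its last byte equals '\n'. */
int ends_with_newline(const char *s)
{
    size_t len = strlen(s);
    return len > 0 && s[len - 1] == '\n';
}

/* Reads one chunk with fgets() and classifies it by testing the last
 * byte of the buffer. A partial chunk means either that the line did
 * not fit into the buffer or that the final line of the stream has no
 * terminating '\n'. */
enum chunk_kind read_chunk(FILE *fp, char *buf, int size)
{
    if (fgets(buf, size, fp) == NULL)
        return CHUNK_EOF;
    return ends_with_newline(buf) ? CHUNK_COMPLETE_LINE : CHUNK_PARTIAL_LINE;
}
```

The test on the last byte is only reliable if the byte value of '\n' cannot also be the final byte of some other multibyte character, which is the guarantee this issue asks for.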

shware_systems

2013-09-02 15:26

reporter   bugnote:0001782

At line 6732, C138.pdf, the standard already reads:
The NL and CR characters cannot be changed.

and the processing for them requires that they be honored regardless of their position in a line of input, which includes a secondary position within a multi-byte character, unless IEXTEN is set. So this request is already part of the standard for many implementations, but a clarifying cross-reference may be desirable. The language about IEXTEN may also need changing to state that NL and CR are immune from changes when it is set.

steffen

2013-09-03 12:19

reporter   bugnote:0001783

Replying to 0000745:0001781:

   The byte values used to encode <newline> and <carriage-return> shall
   not occur as part of any other character in the same locale.

The problem with this wording seems to be that existing programs which hardcode the <newline> character but also use the setlocale(3) facility would cease to function. It should be taken into account that hardcoding <newline>, using setlocale(3), and combining the two have all become common programming practice.

I think there are two possible solutions to this problem; either adjust the above sentence to

   The byte values used to encode <newline> and <carriage-return> shall
   not occur as part of any other character in any locale supported by
   the implementation.

Or add, on page 1884, after line 60595:

   [CX]An application that uses hardcoded character constants, like <newline>, shall ensure that the setlocale() function is only used to select in between locales with compatible character sets.[/CX]

Likewise add, on page 2182, after line 69469:

  [CX]An application that uses hardcoded character constants, like <newline>, shall ensure that the uselocale() function is only used to select in between locales with compatible character sets.[/CX]
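
The programming practice described in this note, a hardcoded <newline> constant combined with locale-dependent input, might look like the following sketch. The function name count_newline_bytes() is hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Counts the bytes equal to the hardcoded constant '\n' in buf[0..len).
 * Every such byte is treated as a line terminator. That is only correct
 * if this byte value never occurs inside another character of whatever
 * encoding the data is in. */
size_t count_newline_bytes(const char *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            n++;
    }
    return n;
}
```

A program would typically call setlocale(LC_ALL, "") at startup and then run a loop like this over its input. As an assumption for illustration only (a made-up encoding, not a real character set), suppose the two bytes 0x81 0x0A formed a single character on an ASCII-based system: the call count_newline_bytes("\x81\x0A", 2) would then report one line terminator although the input contains none.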

steffen

2013-09-04 10:23

reporter   bugnote:0001784

Reply to a different comment of 0000745:0001781:

  The case for requiring that <newline> cannot be part of any other
  character is quite strong, but I don't see the case for <carriage-return>
  being as strong. Conversely, if there is a case for <carriage-return>,
  then what about other control characters such as <tab> or <form-feed>?

This issue is only about the handling of all forms of newline sequences, because this task is often done by using hardcoded constants, notably the ISO C character escape sequence constants which are defined in ISO C99 (section 6.4.4.4) for this purpose, i.e., new-line (the POSIX equivalent is defined in "Base Definitions, 3.239 Newline Character (<newline>)") and carriage return (also known as carriage-return in ISO C99; the POSIX equivalent is defined in "Base Definitions, 3.86 Carriage-Return Character (<carriage-return>)").

The referenced bugnote pointed out that programs which use the fgets(3) function are even required to do so.

However, I had included <carriage-return> in this issue in particular because POSIX gives it extended meaning, e.g., in "Base Definitions, 11. General Terminal Interface", notably section "11.1.9 Special Characters".

I agree that it would be desirable to extend the requirement to other control characters, for example for portable interaction according to ISO 6429 / ECMA-48. Both the unicode.org and the Austin Group mailing list threads that I referenced in the first message of this issue (0000745) in fact go beyond what this issue asks for, and they could well be used to open a second, extended issue that would impose further constraints on character sets that comply with upcoming POSIX standards.

geoffclare

2013-09-26 16:32

manager   bugnote:0001851

Last edited: 2013-09-26 16:35

Interpretation response
------------------------

The standard does not speak to this issue, and as such no conformance
distinction can be made between alternative implementations based on
this. This is being referred to the sponsor.

Rationale:
-------------
None.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
Make the change in the desired action and also at page 128 line 3615 change:

    The encoded values associated with <slash> and <period> shall be
    invariant across all locales supported by the implementation.

    
to:

    The encoded values associated with <period>, <slash>, <newline>
    and <carriage-return> shall be invariant across all locales
    supported by the implementation.

At page 204 line 6729, change "carriage-return" to "<carriage-return>"

Add a sentence at the end of page 204 line 6731: "It cannot be changed."

Cross-volume change to XRAT...
At page 3483 line 117516 add a new paragraph to A.6.2:

    The encodings for <newline> and <carriage-return> are required to
    be the same across all locales since they are special to the
    general terminal interface and cannot be changed (see [xref to
    XBD 11.1.9]).
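
As a rough, partial illustration of the invariance requirement, the following sketch checks whether the bytes of the compiled-in constants '\n' and '\r' still decode as complete single-byte characters in each of a list of locales. The function name and the idea of passing in a list of locale names are assumptions made for this example. Passing the check is consistent with the requirement but does not prove that the byte values never occur inside other characters.

```c
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/* Returns how many of the named locales could be selected and treat
 * both '\n' and '\r' as complete single-byte characters (btowc() does
 * not return WEOF). Locales that setlocale() cannot select are skipped.
 * Side effect: the program is left in the "C" locale afterwards. */
size_t count_locales_with_single_byte_nl_cr(const char *const names[], size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        if (setlocale(LC_ALL, names[i]) == NULL)
            continue;                 /* locale not available here */
        if (btowc('\n') != WEOF && btowc('\r') != WEOF)
            ok++;
    }
    setlocale(LC_ALL, "C");           /* restore a known state */
    return ok;
}
```

Which locale names are available depends on the system, so the caller has to supply names that exist on the implementation being examined.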

ajosey

2014-02-21 15:41

manager   bugnote:0002158

Interpretation Proposed 21 Feb 2014

ajosey

2014-03-25 13:43

manager   bugnote:0002205

Interpretation Approved: 25 March 2014

Issue History

Date Modified | Username | Field | Change
2013-09-02 13:20 | steffen | New Issue |
2013-09-02 13:20 | steffen | Name | => Steffen
2013-09-02 13:20 | steffen | Section | => XBD, 6.2
2013-09-02 13:20 | steffen | Page Number | => 128
2013-09-02 13:20 | steffen | Line Number | => 3646-7
2013-09-02 14:40 | geoffclare | Note Added: 0001781 |
2013-09-02 15:26 | shware_systems | Note Added: 0001782 |
2013-09-03 12:19 | steffen | Note Added: 0001783 |
2013-09-04 10:23 | steffen | Note Added: 0001784 |
2013-09-26 16:32 | geoffclare | Note Added: 0001851 |
2013-09-26 16:35 | geoffclare | Note Edited: 0001851 |
2013-09-26 16:36 | geoffclare | Interp Status | => Pending
2013-09-26 16:36 | geoffclare | Final Accepted Text | => 0000745:0001851
2013-09-26 16:36 | geoffclare | Status | New => Interpretation Required
2013-09-26 16:36 | geoffclare | Resolution | Open => Accepted As Marked
2013-09-26 16:36 | geoffclare | Tag Attached: tc2-2008 |
2014-02-21 15:41 | ajosey | Interp Status | Pending => Proposed
2014-02-21 15:41 | ajosey | Note Added: 0002158 |
2014-03-25 13:43 | ajosey | Interp Status | Proposed => Approved
2014-03-25 13:43 | ajosey | Note Added: 0002205 |
2019-06-10 08:55 | agadmin | Status | Interpretation Required => Closed