Austin Group Defect Tracker

Aardvark Mark IV


ID 0000745
Category [1003.1(2013)/Issue7+TC1] Base Definitions and Headers
Severity Editorial
Type Clarification Requested
Date Submitted 2013-09-02 13:20
Last Update 2019-06-10 08:55
Reporter steffen View Status public  
Assigned To
Priority normal Resolution Accepted As Marked  
Status Closed  
Name Steffen
Organization
User Reference
Section XBD, 6.2
Page Number 128
Line Number 3646-7
Interp Status Approved
Final Accepted Text Note: 0001851
Summary 0000745: Require that <newline> may not occur as part of multibyte characters
Description Many POSIX tools perform by-byte-value comparison against the <newline> character. In newer editions of POSIX, the character set constraint that "<period> and <slash> shall not occur as part of any other character in any locale" has been added to allow portable, multibyte safe operations and interchange of filenames and paths. I think it would be an enhancement to extend this to the <newline> character, too.

I've asked the character-set experts on the unicode@unicode.org mailing list (thread starting at [1]) whether it would be safe to implement this change in POSIX, and the answers indicate that the change is possible (at least for character sets that can be used for POSIX locales).

 [1] http://www.unicode.org/mail-arch/unicode-ml/y2013-m08/0093.html

There was also a thread on the POSIX mailing list regarding this. I cannot give the ML message number since the permalinks have moved to GMANE, but the original mail is available there at [2] (thread: [3]).

 [2] http://permalink.gmane.org/gmane.comp.standards.posix.austin.general/8037
 [3] http://comments.gmane.org/gmane.comp.standards.posix.austin.general/8037
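
As an illustration of the by-byte-value comparison the description refers to, here is a minimal sketch in C. The function name count_newline_bytes is hypothetical and not taken from any POSIX utility. The loop gives the right line count for multibyte text only if the byte value of <newline> never occurs inside another character, which is the constraint this issue requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count <newline> bytes in buf[0..len) by plain
 * byte comparison, without decoding multibyte characters.
 * The result equals the number of <newline> characters only if the
 * byte value of '\n' cannot be part of any other character. */
size_t count_newline_bytes(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            count++;
    }
    return count;
}
```

Many line-oriented tools appear to rely on a scan of this shape, because it avoids decoding every character in the input.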
Desired Action Change the given paragraph from

  Likewise, the byte values used to encode <period> and <slash>
  shall not occur as part of any other character in any locale.

to

  Likewise, the byte values used to encode <period>, <slash>,
  <newline> and <carriage-return> shall not occur as part of
  any other character in any locale.
Tags tc2-2008

Notes
(0001781)
geoffclare (manager)
2013-09-02 14:40

Three points:

1. When this was originally raised on the mailing list, it was pointed
out that the normal way applications check whether fgets() has returned
an incomplete line is to test the last byte in the returned buffer for
equality with '\n'. This aspect of the fgets() interface is what makes
this a matter that we should bring to the attention of the C committee.

2. The case for requiring that <newline> cannot be part of any other
character is quite strong, but I don't see the case for <carriage-return>
being as strong. Conversely, if there is a case for <carriage-return>,
then what about other control characters such as <tab> or <form-feed>?

3. I don't think the suggested wording change is the right way to add
the desired requirement.

It is clear what "in any locale" means for <period> and <slash>, since
they are required to have the same encoding across all locales. For
characters whose encoding can vary between locales, it's not clear what
is meant: is the byte value for <newline> in one locale not allowed to
be part of a character in another locale, even if <newline> has a
different byte value in that locale? (I don't think that was intended.)

I would suggest the following change instead (possibly with an
adjustment to the list of control characters):

After:

  Likewise, the byte values used to encode <period> and <slash>
  shall not occur as part of any other character in any locale.

add:

  The byte values used to encode <newline> and <carriage-return> shall
  not occur as part of any other character in the same locale.
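
The fgets() idiom mentioned in point 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper name line_is_complete is hypothetical, and the buffer is assumed to have been filled by a successful fgets() call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if buf, as filled by fgets(), ends
 * with a <newline> byte, and 0 otherwise. The test inspects only the
 * last byte, so it is reliable for multibyte text only if that byte
 * value cannot be the final byte of some other character. */
int line_is_complete(const char *buf)
{
    assert(buf != NULL);

    size_t len = strlen(buf);
    return len > 0 && buf[len - 1] == '\n';
}
```

A caller would typically invoke line_is_complete(buf) after fgets(buf, sizeof buf, fp) has returned non-null. A result of 0 then means either that the line did not fit in the buffer or that the input ended without a trailing <newline>.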
(0001782)
shware_systems (reporter)
2013-09-02 15:26

At line 6732, C138.pdf, the standard already reads:
The NL and CR characters cannot be changed.

and the processing for them requires that they be honored regardless of input position in a line, which includes a secondary position within a multi-byte character, unless IEXTEN is set. So this request is already part of the standard for many implementations, but a clarifying cross-reference may be desirable. The language about IEXTEN may also need changing to state that NL and CR are immune from changes when it is set.
(0001783)
steffen (reporter)
2013-09-03 12:19

Replying to Note: 0001781:

   The byte values used to encode <newline> and <carriage-return> shall
   not occur as part of any other character in the same locale.

The problem with this wording seems to be that existing programs which hardcode the <newline> character but also use the setlocale(3) facility would cease to function. Note that hardcoding <newline>, using setlocale(3), and combining the two have all become common programming practice.

I think there are two possible solutions to this problem; either adjust the above sentence to

   The byte values used to encode <newline> and <carriage-return> shall
   not occur as part of any other character in any locale supported by
   the implementation.

Or add, on page 1884, after line 60595:

   [CX]An application that uses hardcoded character constants, like <newline>, shall ensure that the setlocale() function is only used to select in between locales with compatible character sets.[/CX]

Likewise add, on page 2182, after line 69469:

  [CX]An application that uses hardcoded character constants, like <newline>, shall ensure that the uselocale() function is only used to select in between locales with compatible character sets.[/CX]
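
To make the concern concrete, an application could probe a piece of text in whatever locale is currently selected. The following is a rough sketch, not part of any proposal in this issue. The function name newline_bytes_inside_mb_chars is hypothetical, and a non-stateful encoding is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: walk s[0..len) character by character in the
 * current locale and count the multibyte characters (longer than one
 * byte) that contain the byte value of '\n'.
 * Returns -1 on an invalid or truncated sequence. */
long newline_bytes_inside_mb_chars(const char *s, size_t len)
{
    assert(s != NULL || len == 0);

    mbstate_t st;
    memset(&st, 0, sizeof st);   /* start in the initial shift state */

    long hits = 0;
    size_t i = 0;
    while (i < len) {
        size_t n = mbrlen(s + i, len - i, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;           /* invalid or incomplete character */
        if (n == 0)
            n = 1;               /* assumption: the null character is one byte */
        if (n > 1 && memchr(s + i, '\n', n) != NULL)
            hits++;
        i += n;
    }
    return hits;
}
```

A result of 0 for all text an application handles is what code with a hardcoded '\n' silently assumes. A nonzero result would indicate a locale in which a plain byte comparison can misidentify line ends.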
(0001784)
steffen (reporter)
2013-09-04 10:23

Reply to a different comment of Note: 0001781:

  The case for requiring that <newline> cannot be part of any other
  character is quite strong, but I don't see the case for <carriage-return>
  being as strong. Conversely, if there is a case for <carriage-return>,
  then what about other control characters such as <tab> or <form-feed>?

This issue is only about the handling of all forms of newline sequences, because this task is often done with hardcoded constants, notably the ISO C character escape sequences defined in ISO C99 (section 6.4.4.4) for this purpose: new-line (the POSIX equivalent is defined in "Base Definitions, 3.239 Newline Character (<newline>)") and carriage return (the POSIX equivalent is defined in "Base Definitions, 3.86 Carriage-Return Character (<carriage-return>)").

The referenced bugnote pointed out that programs which use the fgets(3) function are even required to do so.

However, I had included <carriage-return> in this issue in particular because it is loaded with extended meaning by POSIX, e.g., in "Base Definitions, 11. General Terminal Interface", notably section "11.1.9 Special Characters".

I agree that it would be desirable to extend the requirement to other control characters, for example for portable interaction according to ISO 6429 / ECMA-48. Both the unicode.org and the Austin Group mailing list threads referenced in the description of this issue (0000745) in fact go beyond what this issue asks for, and could well be used to open a second, extended issue that would impose further constraints on character sets that comply with upcoming POSIX standards.
(0001851)
geoffclare (manager)
2013-09-26 16:32
edited on: 2013-09-26 16:35

Interpretation response
------------------------

The standard does not speak to this issue, and as such no conformance
distinction can be made between alternative implementations based on
this. This is being referred to the sponsor.

Rationale:
-------------
None.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
Make the change in the desired action and also at page 128 line 3615 change:

    The encoded values associated with <slash> and <period> shall be
    invariant across all locales supported by the implementation.

    
to:

    The encoded values associated with <period>, <slash>, <newline>
    and <carriage-return> shall be invariant across all locales
    supported by the implementation.

At page 204 line 6729, change "carriage-return" to "<carriage-return>"

Add a sentence at the end of page 204 line 6731: "It cannot be changed."

Cross-volume change to XRAT...
At page 3483 line 117516 add a new paragraph to A.6.2:

    The encodings for <newline> and <carriage-return> are required to
    be the same across all locales since they are special to the
    general terminal interface and cannot be changed (see [xref to
    XBD 11.1.9]).

(0002158)
ajosey (manager)
2014-02-21 15:41

Interpretation Proposed 21 Feb 2014
(0002205)
ajosey (manager)
2014-03-25 13:43

Interpretation Approved: 25 March 2014

Issue History
Date Modified Username Field Change
2013-09-02 13:20 steffen New Issue
2013-09-02 13:20 steffen Name => Steffen
2013-09-02 13:20 steffen Section => XBD, 6.2
2013-09-02 13:20 steffen Page Number => 128
2013-09-02 13:20 steffen Line Number => 3646-7
2013-09-02 14:40 geoffclare Note Added: 0001781
2013-09-02 15:26 shware_systems Note Added: 0001782
2013-09-03 12:19 steffen Note Added: 0001783
2013-09-04 10:23 steffen Note Added: 0001784
2013-09-26 16:32 geoffclare Note Added: 0001851
2013-09-26 16:35 geoffclare Note Edited: 0001851
2013-09-26 16:36 geoffclare Interp Status => Pending
2013-09-26 16:36 geoffclare Final Accepted Text => Note: 0001851
2013-09-26 16:36 geoffclare Status New => Interpretation Required
2013-09-26 16:36 geoffclare Resolution Open => Accepted As Marked
2013-09-26 16:36 geoffclare Tag Attached: tc2-2008
2014-02-21 15:41 ajosey Interp Status Pending => Proposed
2014-02-21 15:41 ajosey Note Added: 0002158
2014-03-25 13:43 ajosey Interp Status Proposed => Approved
2014-03-25 13:43 ajosey Note Added: 0002205
2019-06-10 08:55 agadmin Status Interpretation Required => Closed

