Austin Group Defect Tracker



ID 0001564
Category [Issue 8 drafts] Shell and Utilities
Severity Editorial
Type Clarification Requested
Date Submitted 2022-02-23 01:54
Last Update 2022-04-15 02:17
Reporter calestyo
View Status public
Assigned To
Priority normal
Resolution Open
Status New
Product Version Draft 2.1
Name Christoph Anton Mitterer
Organization
User Reference
Section 2.13 Pattern Matching Notation
Page Number 2351
Line Number 76099
Final Accepted Text
Summary 0001564: clarify on what (character/byte) strings pattern matching notation should work
Description On the mailing list, the question arose (from my side) what the current wording in the standard implies as to whether pattern matching works on byte or character strings.


- In some earlier discussion it was pointed out that shell variables
  should be strings (of bytes, other than NUL)
  => which could lead one to think that pattern
     matching must work on any such strings

- 2.6.2 Parameter Expansion
  doesn't seem to say what the #, ##, % and %% special forms of
  expansion work on (bytes or characters); it just
  refers to the pattern matching chapter


- 2.13. Pattern Matching Notation says:
  "The pattern matching notation described in this section is used to
  specify patterns for matching strings in the shell."
  => strings... would mean bytes (as per 3.375 String)

- 2.13.1 Patterns Matching a Single Character however says:
  "The following patterns matching a single character shall match a
  single character: ordinary characters,..."
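
To make the byte-versus-character question concrete, here is a minimal sketch (the helper name count_units is hypothetical, not something from the standard). It peels off one `?` match at a time with the `#` expansion form and counts how many matches it takes to consume the string. In the C locale that count is the number of bytes; a multibyte-aware shell in a UTF-8 locale would be expected to count characters instead; and for strings containing bytes that do not form characters, the result is exactly what this issue is asking about.

```shell
# count_units STRING
# Hypothetical helper: print how many times the pattern '?' has to be
# stripped from the front of STRING before it is empty.
count_units() {
    cu_rest=$1
    cu_n=0
    while [ -n "$cu_rest" ]; do
        cu_next=${cu_rest#?}
        # Guard against a shell in which '?' refuses to match the next
        # unit (e.g. an invalid byte): give up instead of looping forever.
        if [ "$cu_next" = "$cu_rest" ]; then
            return 1
        fi
        cu_rest=$cu_next
        cu_n=$((cu_n + 1))
    done
    printf '%s\n' "$cu_n"
}
```

For example, with LC_ALL=C in effect, count_units "$(printf 'a\303\244')" counts the three bytes of that string.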


I questioned whether one could deduce from that that pattern matching is required to cope with any non-characters in the string it operates upon.

This was, however, rejected on the list, and Geoff Clare pointed out that, since no behaviour is specified (i.e. how the implementation would need to handle such invalidly encoded characters), the use of pattern matching on arbitrary byte strings is undefined behaviour.
Desired Action Either:
1) - In line 76099, replace "strings" with "character strings" and perhaps mention that the results are undefined when this is done on strings that contain any byte sequence that is not a character in the current locale.

Perhaps also clarify this in fnmatch() (page 879). It doesn't seem to mention locales at all, but if the above assumption is true and pattern matching operates on characters only, wouldn't it then need to be subject to the current LC_CTYPE?


2) Alternatively, some expert could check whether there are any shell/fnmatch() implementations which do not simply carry on any bytes that do not form characters. Probably there are (yash?). But if there weren't, POSIX might even choose to standardise that behaviour, which would probably be better than leaving it unspecified?!
Tags No tags attached.
Attached Files

- Relationships
related to 0001561 New clarify what kind of data shell variables need to be able to hold

-  Notes
(0005716)
calestyo (reporter)
2022-02-25 04:57

3) A third aspect that should perhaps be considered by some more knowledgeable person than me:

One main use of pattern matching is matching filenames (2.13.3 Patterns Used for Filename Expansion).
Filenames however are explicitly byte strings


So if pattern matching is indeed intended to only have a specified meaning on character strings (in the current locale), then 2.13.3 should somehow explain this.

Especially how such patterns should deal with filenames that aren't character strings in the current locale (e.g. error, unspecified, ignored?).

For example, would a "ls *" in principle be expected to only match filenames that are character strings in the current locale?
(0005719)
mirabilos (reporter)
2022-02-25 20:54

「would a "ls *" in principle be expected to only match filenames that are character strings in the current locale?」

That’d be the logical consequence of treating it as characters in the current encoding.

When I added locale “things” to my projects, I extended the definition of character. Instead of just accepting the characters that are valid in the current encoding (which is either 8-bit-transparent 7-bit ASCII or (back then BMP-only, but I’m moving to full 21-bit) UTF-8), so-called “raw octets” are also mapped into the wide character range.

Every time a conversion fails, the first octet of it is handled as raw octet, then the conversion restarts on the next one. (This can obviously be optimised for illegal UTF-8 sequences if one is careful about the beginning of the next possibly valid sequence.)

In 16-bit wchar_t times (basically “until 2022Q1”), this is mapped into a PUA range reserved by the CSUR for this. (Not quite optimal.) This is U+EF80‥U+EFFF. (What happens when you encounter \xEE\xBE\x80 can only be described as fun.)

In the new scheme, I’m mapping them to U-10000080‥U-100000FF which is outside of the range of things, so not a problem (except now I’m wondering what to set WCHAR_MAX to, but I think 0x10FFFFU still, because only these are, strictly speaking, valid?)

There’s a complication that has to do with the idiotic Standard C API for mbrtowc(3) in that “return value == 0” is the sole test for “*pwc == L'\0'” and so cannot be used to signal that 0 octets have been eaten, which means I might need to use even higher numbers for 2‑, 3‑ and 4-byte raw octet sequences. (The latter of which has wcwidth() == 4…)

But that’s detail. The thing relevant here is that this is (could be, but anything else is either discarding the notion of character here (which would be hard to make congruent with the existence of character classes) or an active and certainly harmful disservice to users (the “only match filenames that are valid” I quoted above)) a middle ground between characters and bytes: bytes that are characters if possible and have character semantics applied, but may not.

This is currently unspecified. I’d like to (continue) treat(ing) things in a way that means that, for example, ? is either a character or a single byte from an invalid multibyte sequence (of length 1 or more), with a subsequent ? catching a possible second byte, and so on. Raw octets are displayed as � with a wcwidth() of 1 each (or some application-local suitable encoding, where that is possible).
(0005729)
calestyo (reporter)
2022-03-03 03:37

Well, I guess this is also why your earlier point had been that '.' as a sentinel would be enough, and that any implementation that wouldn't carry on invalid encodings (i.e. bytes that do not form characters) would be buggy in that respect already.


I guess, on the one hand, Geoff is clearly right when he says that an implementation is not expected to carry out any such behaviour (especially the complicated mappings that you explained above), at least not according to the current standard... and as such '.' would in fact not be enough as the sentinel for command substitution with trailing newlines; LC_ALL=C would be required as well.
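
For readers who have not followed the list discussion, the sentinel idiom in question could be sketched roughly as follows. This is an illustration only: the helper name capture_exact and its ce_* variables are hypothetical. The command's output gets a '.' appended so that command substitution cannot strip trailing newlines, and the '.' is then removed with the % expansion form while LC_ALL=C is in effect, so that the removal is done on bytes rather than on whatever characters the trailing bytes might (or might not) form in the user's locale.

```shell
# capture_exact VAR COMMAND [ARG...]
# Hypothetical helper: store the exact output of COMMAND (including any
# trailing newlines) in the variable named VAR, and return COMMAND's
# exit status.
capture_exact() {
    ce_var=$1
    shift
    # Append a '.' sentinel inside the substitution; only newlines after
    # the last byte of output are stripped, and now there are none.
    if ce_out=$("$@"; ce_rc=$?; printf .; exit "$ce_rc"); then
        ce_rc=0
    else
        ce_rc=$?
    fi
    # Remove the sentinel under the C locale, then restore LC_ALL.
    ce_old_set=${LC_ALL+set} ce_old=${LC_ALL-}
    LC_ALL=C
    ce_out=${ce_out%.}
    if [ "$ce_old_set" = set ]; then
        LC_ALL=$ce_old
    else
        unset LC_ALL
    fi
    # VAR is assumed to be a valid variable name.
    eval "$ce_var=\$ce_out"
    return "$ce_rc"
}
```

Usage would be something like capture_exact out printf 'a\n\n', after which "$out" still ends in two newlines.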


OTOH... the '*' example above was intended to question whether there are really any implementations which would filter out filenames which contain bytes that do not form characters. I'd guess not.


So one more reason, why I think that this should be clearly specified (which I've requested with this issue)... i.e. if there's consensus that it (pattern matching) operates on character strings only - clearly name this, declare operation on non-character strings unspecified (with respect to their results) and remove all current references that indicate that it would operate on bytes.
(0005796)
geoffclare (manager)
2022-04-11 13:55

Suggested changes...

On page 2351 line 76098 section 2.13 Pattern Matching Notation, change:
The pattern matching notation described in this section is used to specify patterns for matching strings in the shell.
to:
The pattern matching notation described in this section is used to specify patterns for matching character strings in the shell.

After page 2351 line 76102 section 2.13 Pattern Matching Notation, add a new paragraph:
If an attempt is made to use pattern matching notation to match a string that contains one or more bytes that do not form part of a valid character, the behavior is unspecified. Since pathnames can contain such bytes, portable applications need to ensure that the current locale is the C or POSIX locale when performing pattern matching (or expansion) on arbitrary pathnames.
(0005797)
kre (reporter)
2022-04-11 22:58

How can a conforming application possibly (sanely) ensure the C locale is
in use when performing pathname expansion using user input that has been
presented in the user's locale (and if that is not to be allowed, how
can the user ever sanely use pathnames containing characters that are not
ASCII, and if that is not to be allowed, what good are locales ?)

I truly wish we could simply stop attempting to make the standard consistent
with regard to the current locale mess, it is all way too broken to be
useful.
(0005798)
geoffclare (manager)
2022-04-12 08:51

> How can a conforming application possibly (sanely) ensure the C locale is in use when performing pathname expansion using user input that has been presented in the user's locale

1. The vast majority of apps will never need to do that because they know (or can assume) that the pathnames they handle either always use the portable filename character set or use the user's locale. I.e. the pathnames are not arbitrary (a word I was careful to include in the proposed changes).

2. In apps that truly do need to do matching or expansion on arbitrary pathnames, a C program can call uselocale() before and after calls to fnmatch(), glob(), and wordexp(). A shell script can set LC_ALL=C before handling pathnames (and unset it or restore it afterwards).
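
The shell-script half of point 2 might look roughly like the sketch below. The function name match_bytes and its mb_* variables are hypothetical; the only thing taken from the note is the idea of setting LC_ALL=C around the match and then unsetting or restoring it afterwards.

```shell
# match_bytes STRING PATTERN
# Hypothetical helper: succeed if STRING matches PATTERN while the C
# locale is in effect, then put LC_ALL back the way the caller had it.
match_bytes() {
    # Remember whether LC_ALL was set at all, and its value if so.
    if [ "${LC_ALL+set}" = set ]; then
        mb_saved=$LC_ALL
        mb_was_set=1
    else
        mb_was_set=0
    fi
    LC_ALL=C
    # $2 is left unquoted in the case label so it is used as a pattern.
    case $1 in
        $2) mb_status=0 ;;
        *)  mb_status=1 ;;
    esac
    # Restore or unset, as suggested in the note above.
    if [ "$mb_was_set" -eq 1 ]; then
        LC_ALL=$mb_saved
    else
        unset LC_ALL
    fi
    return "$mb_status"
}
```

A script handling arbitrary pathnames could then write, say, match_bytes "$name" '*.txt' instead of a bare case statement, at the cost of losing locale-aware matching for that one comparison.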
(0005804)
calestyo (reporter)
2022-04-15 02:12

Re: https://www.austingroupbugs.net/view.php?id=1561#c5797

Well this proposal is not really changing anything, is it? Why do you think it's worse to name that something results in undefined behaviour than not saying anything at all and leave it ambiguous (especially when the wording is already contradictory)?

Also... it doesn't seem as if locales would ever go away. And even if the POSIX/C community (and all other affected groups) would decide tomorrow to abolish locales or at least the choice of character encodings and make all UTF-8... there would be still millions of lines of code which assume different character encodings to exist and which thus somehow need to be defined in a proper manner.
(0005805)
calestyo (reporter)
2022-04-15 02:17

Re: https://www.austingroupbugs.net/view.php?id=1561#c5796

In principle that clarifies my original point.

(Although I'd probably have preferred it if all (relevant) implementations just behaved the same already... and this could be made non-undefined.)


Do you think something should be done about fnmatch(), page 879?

While it refers to sections 2.13.1 and 2.13.2 (which also fall under your proposed changes in 2.13) it still uses "string" all over and doesn't mention any dependency on the locale's character encoding?

- Issue History
Date Modified Username Field Change
2022-02-23 01:54 calestyo New Issue
2022-02-23 01:54 calestyo Name => Christoph Anton Mitterer
2022-02-23 01:54 calestyo Section => 2.13 Pattern Matching Notation
2022-02-23 01:54 calestyo Page Number => 2351
2022-02-23 01:54 calestyo Line Number => 76099
2022-02-25 04:57 calestyo Note Added: 0005716
2022-02-25 20:54 mirabilos Note Added: 0005719
2022-03-03 03:37 calestyo Note Added: 0005729
2022-04-07 16:30 geoffclare Relationship added related to 0001561
2022-04-11 13:55 geoffclare Note Added: 0005796
2022-04-11 22:58 kre Note Added: 0005797
2022-04-12 08:51 geoffclare Note Added: 0005798
2022-04-15 02:12 calestyo Note Added: 0005804
2022-04-15 02:17 calestyo Note Added: 0005805

