View Issue Details

IDProjectCategoryView StatusLast Update
00018571003.1(2024)/Issue8Base Definitions and Headerspublic2025-03-21 08:30
Reporterdannyniu Assigned To 
PrioritynormalSeverityObjectionTypeError
Status Interpretation RequiredResolutionAccepted As Marked 
NameDannyNiu/NJF
OrganizationIndividual
User Reference
Section9.1 Regular Expression Definitions # and others.
Page Number179-180 and others
Line Number6366-6368 and others.
Interp StatusProposed
Final Accepted Text0001857:0006919
Summary0001857: Several problems with the new "lazy" regex quantifier.
Description1. Newly added text describe "shortest" match with the word "longest":
====

lines 6366-6368 on page 180:

> However, matching the ERE "(.*?).*" against "abcdef", the subpattern "(.*?)" matches the empty string, since that is the **longest** possible match for the ERE ".*?".

The qualifier "?" modifies the quantifier to make them "lazy", and the "longest" make the intention of the standard writer confusing. Maybe it should be "shortest"?

2. the supposed length of subpatterns.
====

Lines 6362-6363 on page 180:

> Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string

This is okay for longest matches and without the "lazy" qualifier. In a conceptual implementation of ERE, a back-tracking recursive-decending matcher greedily match, from left to right, each subpattern - so that they're longest before the final match.

For each new match that're longer than the previous, the right-most subpatterns are contracted first, before left-side ones. Thus after all iterations, the result conform to the requirement laid out for subpatterns naturally.

However, when "lazy" qualifier's applied to a component in a subpattern (one that's parenthesized), without other "greedy" quantifiers applied to other component(s) of the subpattern, the subpattern can only be "shortest". Therefore, the rules regarding the matched lengths of subpatterns with "lazy" qualifier(s) needs to be updated.

Next, when `REG_MINIMAL` is applied to the whole regex, the quantifiers become "lazy" by default, therefore absant any qualifier, subpatterns matched can only be "shortest". Thus re-iterating the need to update the rules for matching subpatterns when "lazy" qualfiers/specifiers are used.
Desired ActionVarious. The rules needs to be worked out carefully, and I have no definitive desired action at this moment. Also, this issue need to be related #793 and #1329, unless it's against procedure to relate new bugs to closed ones.
Tagstc1-2024

Relationships

related to 0001877 Resolved ISO editors Issue 8 comment 068 

Activities

dannyniu

2024-09-20 08:05

reporter   bugnote:0006879

Last edited: 2024-09-20 08:13

My first attempt at a proposed wording.

Section 9.1, term definition **matched**

3rd para:

Change:

> ... If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. ...

To

> ... If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the length-fittest such sequence is matched. ... (Examples adapts accordingly)

(I'll define length-fittest soon).

4th para:

Change:

> Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string.

To:

> Within the entire match, where there's quantifier in the entire pattern and any and all of its subpatterns, for 2 adjacent quantifiers (in the sense that there's no other quantifier exists on any segment between the 2 quantifiers) that would result in multiple valid matches, the length-fittest rule applies to the former quantifiers in preference - in other word, the latter quantifier yields to the former when applying the length-fittest rule. Such precedence of preference/yielding is transitive across all quantifiers of the entire expression.

Add a term definition **length-fittest**:

> A concept defined for quantifiers for the purpose of determining the region and sub-regions of match. For quantifiers without the `?` lazy quantifier, the most number of possible repetition is the fittest in terms of length; likewise, for quantifiers with the `?` lazy quantifier, the least number of possible repetition is the fittest in terms of length.

geoffclare

2024-09-23 08:56

manager   bugnote:0006880

> For quantifiers without the `?` lazy quantifier, the most number of possible repetition is the fittest in terms of length; likewise, for quantifiers with the `?` lazy quantifier, the least number of possible repetition is the fittest in terms of length.

This would change the established-for-decades "longest" requirement to "most repetitions", which is not the same thing. And it turns out that on macOS the '?' modifier does not change to matching the least repetitions, it is shortest match; the re_format(7) man page is wrong. Tested using the program at the end of https://posix.rhansen.org/p/2020-11-09 with REG_MINIMAL removed:
$ ./a.out '([ab]{6}|a)*?b' aaaabbbb
regexec() returned 0
rm_so 0, rm_eo 5

(Least repetitions would give rm_eo 7.)

Same test with grep, using -o to see what matched:
$ echo aaaabbbb | grep -E -o '([ab]{6}|a)*?b'
aaaab
b
b
b

This behaviour makes sense as the whole point of REG_MINIMAL and the '?' modifier is to change to the opposite greediness, and the opposite of longest is shortest. Having the default as longest and REG_MINIMAL/'?' as least repetitions would produce the same output in the above tests with and without the '?', making them pointless in such cases.

geoffclare

2024-09-24 10:46

manager   bugnote:0006881

Last edited: 2024-09-26 13:28

Suggested changes ...

On page 179 line 6348 section 9.1, add a sentence:
The matching process is described in [xref to 9.2].
and move the remaining paragraphs of the "matching" definition to after page 181 line 6413 section 9.2.

On page 179 line 6357 section 9.1, change:
If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. For example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights".

Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string.
to:
If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the matched sequence shall be the longest such sequence for which any minimal repetitions (see [xref to 9.4.6]) used in the match have the shortest possible match. For example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights". However, the ERE "(aaa??)*" matches only the first four characters of the string "aaaaa", not all five, because in order to match all five, "a??" would match with length one instead of zero; the ERE "(aaa??)*|(aaa?)*" matches all five because the longest match is one which does not use any minimal repetitions.

Consistent with the match for the entire regular expression being the leftmost and longest for which any minimal repetitions used in the match have the shortest possible match, each BRE or ERE in a concatenated set, from left to right, shall match the longest possible string for which any minimal repetitions used in the match for that BRE or ERE have the shortest possible match.

On page 180 line 6367 section 9.1, change:
the subpattern "(.*?)" matches the empty string, since that is the longest possible match for the ERE ".*?"
to:
the subexpression "(.*?)" matches the empty string, since that is the longest possible match for which the minimal repetition ".*?" has the shortest possible match (zero length).

On page 179 line 6370 section 9.1, change:
the longest sequence shall be measured
to:
the sequence length shall be measured

After page 191 line 6814 section 9.5, add a note:
<small>Note:The grammar defines syntax only and places no requirements on implementations as to how the parsed BRE or ERE is used for matching. The matching process is described in [xref to 9.2].</small>

After XRAT page 3716 line 127617 section A.9.4.6, add a paragraph:
Note that the repetition modifier '?' (<question-mark>) is specified as changing the matching behavior for the modified repetition from the leftmost longest possible match to the leftmost shortest possible match. This does not necessarily give the same result as matching with the least repetitions. For example, the ERE "([ab]{6}|a)*?b" matches the first five characters of the string "aaaabbbb" as this is the shortest for the minimal repetition "*?". Matching with the least repetitions would match the first seven characters by using one repetition of "[ab]{6}" instead of four repetitions of "a". This distinction is only possible because the alternatives in an ERE alternation are chosen according to which gives the longest (or shortest) match. Other types of regular expression exist (notably in perl, php, and python) where the alternatives are tried in order; for those there is no difference between longest and most repetitions or between shortest and least repetitions.


dannyniu

2024-09-24 11:54

reporter   bugnote:0006882

Last edited: 2024-09-24 12:12

Now that we've got a wording:

- how do we incorporate the `REG_MINIMAL` flag?

- if both greedy **AND** lazy quantifiers're nested (possibly recursively!), how long should the nested and the overall match of the parenthesized submatches be?

For the sake of public record, I was not fully aware of macOS ERE behavior when I proposed my initial wording. I took inspiration from perlre's "best/worst" match concept. I proposed the wording because I worried a single implementation whose document explicitly acknowledged some POSIX requirements were violated would be a little unclear/ambiguous/inconsistent to be accepted into the standard.

geoffclare

2024-09-24 14:04

manager   bugnote:0006883

> how do we incorporate the `REG_MINIMAL` flag?

I think my suggested wording works with REG_MINIMAL because of the way it is specified in 9.4.6 list item 6 as changing the default (and reversing what the '?' modifier does).

> if both greedy **AND** lazy quantifiers're nested ...

That was the reason for wording it as "longest possible ... for which any minimal repetitions used ... have the shortest possible match". A minimal repetition nested inside a greedy one has precedence (if used); otherwise, each just follows its normal rule.

dannyniu

2024-09-25 08:28

reporter   bugnote:0006884

Last edited: 2024-09-25 11:36

Geoff, I tested the `re` module from Python v3.10.15's standard library, and the PCRE module from PHP v8.3.11, they all match the initial 7 characters of the string "aaaabbbb" with '([ab]{6}|a)*?b'. It seems that it's the macOS implementation is an outcast.

Given this, I question if we should standardize macOS behavior, given that its documentation isn't entirely clear & unambiguous & consistent, and explicitly acknowledge violation of POSIX requirements, as I mentioned before.

If it's breaking an existing implementation that you worry, then hear me out:

1. POSIX-2024 is a new standard, and macOS is generally slow on adopting conformance to newer standard (they were still certified to 03 back when v7 was published for quite some time), they still have time to fix their implementation before claiming conformance to 2024.

2. Portable applications could not have used the new lazy quantifier, given that it's not yet clearly & unambiguously & consistently specified. It is almost certain that they haven't yet used the new feature, but for those that do use the lazy quantifier, they're probably using some other languages, runtime, and/or libraries (such as Python and PHP I've used as example).

3. The whole point of standardizing lazy quantifier is so that applications from point #2 I mentioned can use standardized facility, if they found that the semantics' different, then this won't be possible anymore.

Source code:

```python3
#!/usr/bin/env python3

import re

print(re.search(r'([ab]{6}|a)*?b', "aaaabbbb").group(0))
```

```php
#!/usr/bin/env php
<?php
 $m = [];
 preg_match('/([ab]{6}|a)*b/', "aaaabbbb", $m);
 var_dump($m);
```

geoffclare

2024-09-25 13:17

manager   bugnote:0006885

Last edited: 2024-09-27 07:09

Re 0001857:0006884.

I am strongly of the opinion that this is a misfeature in python and php. I suspect that most python and php programmers would be surprised to discover they behave that way. If there are two possible matches with different lengths, having the non-greedy RE give the longer of the two, the same as the greedy RE does, violates the principle of least surprise.
[Later retracted - see 0001857:0006892]

As regards "macOS is generally slow on adopting conformance to newer standard", that is irrelevant. This new addition in POSIX is intended to be a standardisation of long-standing existing practice. We would need a very good reason to specify behaviour that is different than that existing practice, thus breaking not just the existing implementations[*] but also applications that rely on the existing behaviour. It doesn't matter that they were not "portable applications" when they were written if a forced change to existing implementations is what causes them to break.

[*] Yes, "implementations" plural. Known to be at least macOS and DragonFly BSD.

dannyniu

2024-09-25 15:08

reporter   bugnote:0006886

Last edited: 2024-09-26 06:50

Re Note: 6885.

I actaully asked about this on StackOverflow before resorting to issue tracker as no one seem to understand the situation in enough detail to make informed statements.

One of the comments said straightforwardly that the lazy qualifier applies to the immediate preceding repetition (i.e. rather than recursively to subpattern matches as implied as necessary to make the match shortest by Geoff's proposed wording) by echoing my words in the question.

Relevant URL: https://stackoverflow.com/questions/78984645/regarding-posix-eres-new-repetition-modifier#comment139266744_78984645

If a wording must be specified, it must be clear & unambiguous & consistent. As a torture test for any potential wording, I propose testing the wording with the following:

- regex:
[0-9](((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+[0-9]

- string to be matched:
12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78

The reason I propose this regex, is that you have to have some rules to resolve the ambiguity of recursively, alternatingly, nested lazy and greedy quantifier, and it's definitely not as easy as saying "minimal repetitions have a high precedence". And I'd like to set it as a challenge to propose a wording whose description covers the behavior of the torture test regex.

Finally, I suggest doing a survey on GitHub, on the use of lazy quantifier in Python-language projects if anyone still see value in keeping interoperability with Python users. For starters: https://github.com/search?q=lang%3Apython+%22%29*%3F%22+OR+%22%29%2B%3F%22&type=code

steffen

2024-09-25 22:10

reporter   bugnote:0006887

re:

Geoff, I tested the `re` module from Python v3.10.15's standard library,
and the PCRE module from PHP v8.3.11, they all match "aaaabbb" with
'([ab]{6}|a)*?b'. It seems that it's the macOS implementation is an
outcast.


#?2|kent:tmp$ ./p-pcre2 '([ab]{6}|a)*?b' aaaabbb
0: 0/7
        <aaaabbb>
1: 0/6
        <aaaabb>
#?0|kent:tmp$ ./p-tre '([ab]{6}|a)*?b' aaaabbb
HA at <6>
HA at <|>
MINIINININI  0 mini=0
MINIINININI 1 mini=1 rest:*
HAHAHAH
0: 0/5
        <aaaab>
1: 3/4
        <a>
#?0|kent:tmp$ ./p-c '([ab]{6}|a)*?b' aaaabbb
0: 0/7
        <aaaabbb>
1: 0/6
        <aaaabb>


<small>now i start using hypertext in Mantis, that worn down i am ..</small>

steffen

2024-09-25 22:33

reporter   bugnote:0006888

ok and also re: https://www.austingroupbugs.net/view.php?id=1857#c6886

- regex:
[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]

- string to be matched:
12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78


And i say there is no challenge, and the word "lazy" was never heard before regarding ungreedy regular expression matching, the reasons why i have used "minimal" has already been told.

#?0|kent:tmp$ ./p-c '[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]' 12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78
0: 0/44
        <12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78>
1: 1/43
        <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7>
2: 1/43
        <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7>
3: 1/43
        <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7>
4: 1/43
        <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7>
5: 1/43
        <2abc34def56ghi78jkl90mnop12qrst34uvw56xyz7>
6: 38/43
        <6xyz7>
#?0|kent:tmp$ ./p-pcre2 '[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]' 12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78
0: 0/2
        <12>
1: 1/1
        <>
2: -1/-1
3: -1/-1
4: -1/-1
5: -1/-1
6: -1/-1
#?0|kent:tmp$ ./p-tre '[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]' 12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78
MINIINININI  0 mini=0
MINIINININI  0 mini=0
MINIINININI 1 mini=1 rest:*
MINIINININI  0 mini=0
MINIINININI  0 mini=0
MINIINININI 1 mini=1 rest:*
MINIINININI  0 mini=0
MINIINININI  0 mini=0
MINIINININI 1 mini=1 rest:*
MINIINININI  0 mini=0
HAHAHAH
0: 0/2
        <12>
1: 1/1
        <>
2: 1/1
        <>
3: 1/1
        <>
4: 1/1
        <>
5: 1/1
        <>
6: -1/-1

steffen

2024-09-25 22:36

reporter   bugnote:0006889

$ echo 12abc34def56ghi78jkl90mnop12qr | perl -e '$i=<STDIN>;if($i =~ "[0-9](((((([0-9][a-z]*[0-9])*?)*)*?)*)*?)*[0-9]"){print "i<$i>; 1<$1> 2<$2> 3<$3> &<$&>\n"}else{print "no match\n"}'
i<12abc34def56ghi78jkl90mnop12qr
>; 1<> 2<> 3<> &<12>

geoffclare

2024-09-26 08:41

manager   bugnote:0006890

> One of the comments said straightforwardly that the lazy qualifier applies to the immediate preceding repetition (i.e. rather than recursively to subpattern matches as implied as necessary to make the match shortest by Geoff's proposed wording) by echoing my words in the question.

There is certainly no intention to require the '?' modifier to act recursively, and I can't see any way to interpret my suggested wording as implying it.

dannyniu

2024-09-26 11:43

reporter   bugnote:0006891

Last edited: 2024-09-26 11:50

Geoff, you said you have a strong opinion that the Python and PHP's PCRE behavior is mis-feature, violates "principle of least surprise", and question its real-world use if any. While I don't want to and can't say if it's a mis-feature, I believe the rules are well-defined and doesn't violate the principle of least surprise when developer carefully read the document and understand all implications of using the lazy qualifier.

I'd like to propose that if (a big if) we ultimately cannot agree on a wording to use, at least we make it clear that implementations are split on this by saying the behavior is undefined (or implementation-defined) when the lazy quantifiers're used on groupings.

geoffclare

2024-09-26 12:16

manager   bugnote:0006892

Last edited: 2024-09-26 13:30

> Geoff, you said you have a strong opinion that the Python and PHP's PCRE behavior is mis-feature

I retract that statement. It was based on the incorrect assumption that python and php choose between alternatives the same way POSIX does. As per my recent emails to the mailing list, that assumption is false; they try the alternatives in order.

This makes POSIX EREs and perl/python/php REs fundamentally different when alternatives are used, and nobody should expect to see the same behaviour for patterns such as "([ab]{6}|a)*?b".

[Update: I have edited 0001857:0006881 to include something about this in the suggested XRAT addition.]

dannyniu

2024-09-27 11:34

reporter   bugnote:0006896

Last edited: 2024-09-27 11:37

Geoff, have you considered my torture testing case? How do you see your wording cover that case? I recall "minimal quantifier takes precedence", do you intend that to mean for my torture test regex, the inner greedy quantifiers yields? I'm all ears.

steffen

2024-09-27 15:51

reporter   bugnote:0006897

I think the XRAT addition is a very helpful text / example that addresses real-life knowledge deficiencies.

geoffclare

2024-09-30 09:26

manager   bugnote:0006898

> Geoff, have you considered my torture testing case? How do you see your wording cover that case?

- regex:
[0-9](((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+[0-9]

- string to be matched:
12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78

The key part of my suggested wording is "the longest such sequence for which any minimal repetitions used in the match have the shortest possible match".

first [0-9] matches the first 1

([0-9][a-z]+[0-9])+? matches 2abc3 with 1 repetition (and 3 of [a-z] obviously)

(([0-9][a-z]+[0-9])+?)+ matches 2abc3 with 1 repetition as that's the longest match for which the minimal repetition has the shortest match

((([0-9][a-z]+[0-9])+?)+)+? matches 2abc3 with 1 repetition

(((([0-9][a-z]+[0-9])+?)+)+?)+ matches 2abc3 with 1 repetition as that's the longest match for which the minimal repetitions have the shortest match

((((([0-9][a-z]+[0-9])+?)+)+?)+)+? matches 2abc3 with 1 repetition

(((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+ matches 2abc3 with 1 repetition as that's the longest match for which the minimal repetitions have the shortest match

final [0-9] matches the first 4

matched string is 12abc34

dannyniu

2024-10-02 02:07

reporter   bugnote:0006899

One more question do we measure length by bytes, characters, or collating element?

geoffclare

2024-10-03 09:12

manager   bugnote:0006900

Last edited: 2024-10-03 09:15

> do we measure length by bytes, characters, or collating element?

Characters. The collating element issue is explicitly addressed in 9.1:
When a multi-character collating element in a bracket expression (see Section 9.3.5, on page 182) is involved, the longest sequence shall be measured in characters consumed from the string to be matched; that is, the collating element counts not as one element, but as the number of characters it matches.

Note that my suggested changes include a change from "longest sequence" to "sequence length" here.

geoffclare

2024-10-17 15:24

manager   bugnote:0006919

Last edited: 2025-03-20 15:58

Interpretation response
------------------------
The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this. This is being referred to the sponsor.

Rationale:
-------------
The standard does not state the precedence between greedy and non-greedy repetitions when nested.

Notes to the Editor (not part of this interpretation):
-------------------------------------------------------
Make the changes in 0001857:0007130.

agadmin

2024-10-17 16:22

administrator   bugnote:0006920

Interpretation proposed: 17 October 2024

agadmin

2024-11-19 11:53

administrator   bugnote:0006963

Interpretation approved: 19th November 2024

dannyniu

2024-12-01 12:43

reporter   bugnote:0006979

I'm still confused. I played around with egrep on my mac, and I still don't understand how the current wording describe the existing behavior. My terminal window output:

```
$ echo 12345678 | grep -E -o '([0-9]+)??'

$ echo 12345678 | grep -E -o '([0-9]+)+?'
1
2
3
4
5
6
7
8

$ echo 12345678 | grep -E -o '([0-9]+?)+'
12
34
56
78

$
```

geoffclare

2024-12-03 15:11

manager   bugnote:0006982

Re: 0001857:0006979
The last case looks like a bug. As far as I can see there is no reasonable rule for the way '?' affects how the matching is done that would result in a two-character match.

dannyniu

2024-12-25 14:40

reporter   bugnote:0007033

Is anyone aware of this article: https://link.springer.com/chapter/10.1007/978-3-030-02508-3_6

I haven't bought the chapter, but the abstract seem to imply that certain instances of complex expressions have behavior that varies across different implementations.

If that's true, could we add some informative texts warning developers the potential of difference in interpretation and to use strict unambiguous subset of the features offered in ERE?

dannyniu

2025-02-27 05:18

reporter   bugnote:0007087

As I mentioned on the list, I think we should explicitly define what ''subpattern'' and ''subexpression'' mean in regular expressions. Specifically, whether they refer to parenthezised substrings, or any arbitrary valid substring of the regular expression.

dannyniu

2025-03-04 14:56

reporter   bugnote:0007090

For the sake of public record, I'm duplicating mailing list message to note here that Geoff's step-by-step analysis of my torture testing case (at https://www.austingroupbugs.net/view.php?id=1857#c6898 ) is inconsistent with macOS `grep`. Here's my terminal output:

```
// Portable Home on External Drive /
$ echo 12abc34def56 | grep -E -o '(([0-9][a-z]+[0-9])+?)+'
2abc34def5

// Portable Home on External Drive /
$
```

Whether this is indeed a bug in software with no change to the standard text needed, or that the standard text itself is in error is arguable.

dannyniu

2025-03-05 11:43

reporter   bugnote:0007091

Geoff, my experiment with TRE seem to conflict with your interpretation of my torture test case in https://www.austingroupbugs.net/view.php?id=1857#c6898

Code:
```
//#define USE_LOCAL_TRE_H 1
#include <tre/tre.h>

int main()
{
    int ret = 100;
    static regex_t preg;
    static regmatch_t matches[16];

    ret = tre_regcomp(&preg, "[0-9](((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+[0-9]", REG_EXTENDED);
    printf("tre_regcomp returned: %d\n", ret);
    ret = tre_regexec(&preg, "12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78", 16, matches, 0);
    printf("regexec returned: %d\n", ret);
    for(int i=0; i<16; i++) printf("%ld\t%ld\n", (long)matches[i].rm_so, (long)matches[i].rm_eo);
    tre_regfree(&preg);

    return 0;
}
```

Output:
```
tre_regcomp returned: 0
regexec returned: 0
0 44
38 43
38 43
38 43
38 43
38 43
38 43
-1 -1
-1 -1
-1 -1
-1 -1
-1 -1
-1 -1
-1 -1
-1 -1
-1 -1
```

geoffclare

2025-03-06 14:26

manager   bugnote:0007094

Re 0001857:0007091

The macOS behaviour matches my analysis. Looks like a case where Apple fixed a TRE bug.

Code:
#include <regex.h>
#include <stdio.h>

int main()
{
    int ret = 100;
    static regex_t preg;
    static regmatch_t matches[16];

    ret = regcomp(&preg, "[0-9](((((([0-9][a-z]+[0-9])+?)+)+?)+)+?)+[0-9]", REG_EXTENDED|REG_ENHANCED);
    printf("regcomp returned: %d\n", ret);
    ret = regexec(&preg, "12abc34def56ghi78jkl90mnop12qrst34uvw56xyz78", 16, matches, 0);
    printf("regexec returned: %d\n", ret);
    for(int i=0; i<16; i++) printf("%ld\t%ld\n", (long)matches[i].rm_so, (long)matches[i].rm_eo);
    regfree(&preg);

    return 0;
}

Output:
regcomp returned: 0
regexec returned: 0
0       7
1       6
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1
-1      -1

geoffclare

2025-03-20 15:57

manager   bugnote:0007130

Last edited: 2025-03-20 16:11

On page 179 line 6348 section 9.1, add a sentence:
The matching process is described in [xref to 9.2].

and move the remaining paragraphs of the "matching" definition to after page 181 line 6413 section 9.2.

On page 179 line 6357 section 9.1, change:
If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. For example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights".

Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string.

to:
If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the match shall be made according to the following rules:

1. For a BRE, or an ERE that does not use the repetition modifier '?', the match shall be the leftmost longest.

2. If an ERE contains repetitions with and without the repetition modifier '?', the precedence between the repetitions shall be:

a. Each leftmost shortest match shall match the leftmost shortest sequence in the string, in descending priority from left to right.

b. Consistent with rule 2a, the length matched by the entire regular expression shall be the leftmost longest.

c. Consistent with rules 2a and 2b, each leftmost longest match shall match the leftmost longest sequence in the string, in descending priority from left to right.

d. If an attempt is made to match the same sequence of the string using repetitions both with and without the repetition modifier '?', the behavior is unspecified. For example, the ERE ([0-9]+)+? has unspecified behavior.

According to these rules, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights". However, the ERE "(aaa??)*" matches only the first four characters of the string "aaaaa", not all five, because in order to match all five, "a??" would match with length one instead of zero; the ERE "(aaa??)*|(aaa?)*" matches all five because the longest match is one which does not use any minimal repetitions.

Consistent with the match for the entire regular expression being made according to the above rules, each BRE or ERE in a concatenated set, from left to right, shall match according to the above rules, applied to that BRE or ERE.


On page 180 line 6367 section 9.1, change:
the subpattern "(.*?)" matches the empty string, since that is the longest possible match for the ERE ".*?"

to:
the subexpression "(.*?)" matches the empty string, since the minimal repetition ".*?" has priority and the empty string is the shortest possible match (zero length) for that repetition.


On page 179 line 6370 section 9.1, change:
the longest sequence shall be measured

to:
the sequence length shall be measured


After page 191 line 6814 section 9.5, add a note:
<small>Note:The grammar defines syntax only and places no requirements on implementations as to how the parsed BRE or ERE is used for matching. The matching process is described in [xref to 9.2].</small>


After XRAT page 3716 line 127617 section A.9.4.6, add a paragraph:
Note that the repetition modifier '?' (<question-mark>) is specified as changing the matching behavior for the modified repetition from the leftmost longest possible match to the leftmost shortest possible match. This does not necessarily give the same result as matching with the least repetitions. For example, the ERE "([ab]{6}|a)*?b" matches the first five characters of the string "aaaabbbb" as this is the shortest for the minimal repetition "*?". Matching with the least repetitions would match the first seven characters by using one repetition of "[ab]{6}" instead of four repetitions of "a". This distinction is only possible because the alternatives in an ERE alternation are chosen according to which gives the longest (or shortest) match. Other types of regular expression exist (notably in perl, php, and python) where the alternatives are tried in order; for those there is no difference between longest and most repetitions or between shortest and least repetitions.

agadmin

2025-03-21 08:30

administrator   bugnote:0007132

The interpretation notes to editors has been updated and this is restarting a 30 day review. The interpretation itself remains unchanged.
Interpretation proposed: 21 March 2025

Issue History

Date Modified Username Field Change
2024-09-14 12:54 dannyniu New Issue
2024-09-14 12:54 dannyniu Name => DannyNiu/NJF
2024-09-14 12:54 dannyniu Organization => Individual
2024-09-14 12:54 dannyniu Section => 9.1 Regular Expression Definitions # and others.
2024-09-14 12:54 dannyniu Page Number => 179-180 and others
2024-09-14 12:54 dannyniu Line Number => 6366-6368 and others.
2024-09-20 08:05 dannyniu Note Added: 0006879
2024-09-20 08:07 dannyniu Note Edited: 0006879
2024-09-20 08:13 dannyniu Note Edited: 0006879
2024-09-23 08:56 geoffclare Note Added: 0006880
2024-09-24 10:46 geoffclare Note Added: 0006881
2024-09-24 10:46 geoffclare Note Edited: 0006881
2024-09-24 11:54 dannyniu Note Added: 0006882
2024-09-24 12:08 dannyniu Note Edited: 0006882
2024-09-24 12:09 dannyniu Note Edited: 0006882
2024-09-24 12:11 dannyniu Note Edited: 0006882
2024-09-24 12:12 dannyniu Note Edited: 0006882
2024-09-24 14:04 geoffclare Note Added: 0006883
2024-09-25 08:28 dannyniu Note Added: 0006884
2024-09-25 08:30 dannyniu Note Edited: 0006884
2024-09-25 08:33 dannyniu Note Edited: 0006884
2024-09-25 08:42 dannyniu Note Edited: 0006884
2024-09-25 08:43 dannyniu Note Edited: 0006884
2024-09-25 11:36 dannyniu Note Edited: 0006884
2024-09-25 13:17 geoffclare Note Added: 0006885
2024-09-25 15:08 dannyniu Note Added: 0006886
2024-09-25 15:17 dannyniu Note Edited: 0006886
2024-09-25 15:23 dannyniu Note Edited: 0006886
2024-09-25 15:27 dannyniu Note Edited: 0006886
2024-09-25 22:10 steffen Note Added: 0006887
2024-09-25 22:33 steffen Note Added: 0006888
2024-09-25 22:36 steffen Note Added: 0006889
2024-09-26 04:02 dannyniu Note Edited: 0006886
2024-09-26 06:50 dannyniu Note Edited: 0006886
2024-09-26 08:41 geoffclare Note Added: 0006890
2024-09-26 11:43 dannyniu Note Added: 0006891
2024-09-26 11:50 dannyniu Note Edited: 0006891
2024-09-26 12:16 geoffclare Note Added: 0006892
2024-09-26 12:17 geoffclare Note Edited: 0006892
2024-09-26 13:27 geoffclare Note Edited: 0006881
2024-09-26 13:28 geoffclare Note Edited: 0006881
2024-09-26 13:30 geoffclare Note Edited: 0006892
2024-09-27 07:09 geoffclare Note Edited: 0006885
2024-09-27 11:34 dannyniu Note Added: 0006896
2024-09-27 11:37 dannyniu Note Edited: 0006896
2024-09-27 15:51 steffen Note Added: 0006897
2024-09-30 09:26 geoffclare Note Added: 0006898
2024-10-02 02:07 dannyniu Note Added: 0006899
2024-10-03 09:12 geoffclare Note Added: 0006900
2024-10-03 09:14 geoffclare Note Edited: 0006900
2024-10-03 09:15 geoffclare Note Edited: 0006900
2024-10-17 15:24 geoffclare Note Added: 0006919
2024-10-17 15:25 geoffclare Interp Status => Pending
2024-10-17 15:25 geoffclare Final Accepted Text => 0001857:0006919
2024-10-17 15:25 geoffclare Status New => Interpretation Required
2024-10-17 15:25 geoffclare Resolution Open => Accepted As Marked
2024-10-17 15:26 geoffclare Tag Attached: tc1-2024
2024-10-17 16:22 agadmin Interp Status Pending => Proposed
2024-10-17 16:22 agadmin Note Added: 0006920
2024-11-19 11:53 agadmin Interp Status Proposed => Approved
2024-11-19 11:53 agadmin Note Added: 0006963
2024-11-19 12:11 geoffclare Relationship added related to 0001877
2024-12-01 12:43 dannyniu Note Added: 0006979
2024-12-03 15:11 geoffclare Note Added: 0006982
2024-12-25 14:40 dannyniu Note Added: 0007033
2025-02-27 05:18 dannyniu Note Added: 0007087
2025-03-04 14:56 dannyniu Note Added: 0007090
2025-03-05 11:43 dannyniu Note Added: 0007091
2025-03-06 14:26 geoffclare Note Added: 0007094
2025-03-20 15:57 geoffclare Note Added: 0007130
2025-03-20 15:58 geoffclare Note Edited: 0006919
2025-03-20 16:11 geoffclare Note Edited: 0007130
2025-03-21 08:30 agadmin Interp Status Approved => Proposed
2025-03-21 08:30 agadmin Note Added: 0007132