Austin Group Defect Tracker

Aardvark Mark IV


Viewing Issue Simple Details Jump to Notes ] Issue History ] Print ]
ID Category Severity Type Date Submitted Last Update
0001329 [1003.1(2013)/Issue7+TC1] Base Definitions and Headers Objection Clarification Requested 2020-03-29 20:54 2020-12-09 10:01
Reporter rhialto View Status public  
Assigned To
Priority normal Resolution Accepted As Marked  
Status Applied  
Name Olaf 'Rhialto' Seibert
Organization
User Reference
Section Vol 1. 9.4.6, Vol. 1. 13. regex.h, Vol 2. regcomp(), Vol. 4. A.9.2
Page Number 190
Line Number 6195
Interp Status ---
Final Accepted Text Note: 0005111
Summary 0001329: Problem in resolution of 0000793: "Regular Expressions: add REG_MINIMAL and a minimum repitition modifier"
Description Issue 793 (which has status Applied) contains this text to apply:

     - Vol. 1: Base Definitions, Chapter 9, «Regular Expressions».

  9.4.6 EREs Matching Multiple Characters, p. 190, line 6195:
  insert after

        6. Each of the duplication symbols ('+', '*', '?', and intervals) may
           be suffixed by the minimal repitition modifier '?' <question-mark>,
           in which case matching behaviour is changed from the «leftmost
           longest possible match» to the «leftmost shortest possible match»,
           including the null match
           (see [reference to A.9, p. 3500 ff.]). For example, the ERE ".*c"
           matches the last character ('c') in the string "abc abc", whereas
           the ERE ".*?c" matches the first character 'c', the third character
           in the string.

There seems to be a mistake in this text:
                                                                                
     For example, the ERE ".*c" matches the last character ('c') in the
     string "abc abc",
                                                                                
which should likely be
                                                          
     For example, the ERE ".*c" matches all characters up to the last
     character ('c') in the string "abc abc",
                               
As an existing implementation, NetBSD's egrep matches the whole string:

    $ echo abc-abc | egrep --color ".*c"
    abc-abc
                                              
with the whole of abc-abc coloured red.

           
                                                                     
The text that follows seems incorrect for a similar but potentially more serious reason:
                                                                                
    whereas the ERE ".*?c" matches the first character 'c', the third
    character in the string. [[ "abc abc" ]]
                                                                                
Possibly it should match "abc", because that is leftmost; just matching
"c" would be shortest but I don't see it as being leftmost. Which of these two is meant?



Furthermore, "repitition" (used several times) seems to contain a typo.
Desired Action Correct the examples, and/or indicate whether "leftmost" or "shortest" is more important in a match with a minimal repetition modifier.
Tags issue8
Attached Files

- Relationships
child of 0000793Appliedajosey Regular Expressions: add REG_MINIMAL and a minimum repitition modifier 

-  Notes
(0004804)
kre (reporter)
2020-03-30 02:57

The major issue here is whether in "leftmost shortest possible match" which
takes priority, "leftmost" or "shortest possible" when the results would be
different (as in the example given) - the text of the example suggests that
the shortest match should be preferred over leftmost, but I suspect that's
probably just a similar example error than the obvious earlier one.

In case of "leftmost longest possible match" this issue cannot arise, as
moving the match further to the left necessarily makes it longer,

It is likely that the text from the definition of "matched" in 9.1:

    The search for a matching sequence starts at the beginning of a string
    and stops when the first sequence matching the expression is found,
    where ``first'' is defined to mean ``begins earliest in the string''.

will cover this case, but that's only if this is what the implementations
that currently implement minimal matches implement it that way, and if the
example is corrected to make it clear that is what happens. Additional
words to make it clear that when there is an apparent conflict, the leftmost
possible match is chosen rather than a shorter one beginning further to the
right might be useful.
(0004805)
kre (reporter)
2020-03-30 03:14
edited on: 2020-03-30 05:59

0000073 should probably also have amended the text in 9.1 that immediately
follows the quote given in Note: 0004804 ...

     If the pattern permits a variable number of matching characters and
     thus there is more than one such sequence starting at that point, the
     longest such sequence is matched.

Since that will no longer be universally true.

Further, when dealing with sub-matches, we need to know how the "minimal
match" operator applies to submatches, both when those sub-matches do, and
do not, also contain the minimal match operator.

That is, in each of

     (([abc]*)([abc]*)([abc]*))+?x
     (([abc]*?)([abc]*)([abc]*))+?x
     (([abc]*)([abc]*?)([abc]*))+?x
     (([abc]*?)([abc]*?)([abc]*))+?x

when applied to the string

     abcaabbccaaabbbcccx

what is \1 \2 \3 and \4 in each case ?

That will partly turn upon whether the added text

        If the REG_MINIMAL flag, as defined in the <regex.h>[REF] header,
           is used when compiling an ERE via regcomp(3)[REF], the «leftmost
           shortest possible match» is the default, and the minimal repitition
           modifier ’?’ can be used to select the «leftmost longest possible
           match».

(added in the new s.6 on p 190 of 7-TC1) is intended to apply to the
? operator as well as the REG_MINIMAL flag. That is, when evaluating
an ERE for which the "minimal match" operator applies, do sub-expressions
inherit the mimimal match property, or not? And if they do, does the
minimal match operator do what it does when REG_MINIMAL is used, and
reverse its effect, to become a longest match operator?

I don't know the answers to any of this, but I do know it all needs to be
made clear. There might be more, RE's in general tend to be things I simply
use, I generally prefer not to try and bend my mind around their definitions.

(0004806)
geoffclare (manager)
2020-03-30 08:36

When I applied bug 0000793 I spotted that "up to" was missing and inserted it. I should have added a note to the bug that I had done this - sorry.

The source currently has:
For example, the ERE
.sG ".*c"
matches up to the last character (\c
.cH c )
in the string
.sG "abc abc" ,
whereas the ERE
.sG ".*?c"
matches up to the first character
.cH c ,
the third character in the string.

I also fixed the spelling of "repetition" (and would not have added a note for that - minor editorial corrections like that are often needed when applying bugs).

Re: Note: 0004805 I agree that a change to the definition of "matched" in 9.1 is needed, and we should address that problem here.
(0004808)
kre (reporter)
2020-03-30 10:27

I'd also like to see a better definition of what all of this really
means, particularly REG_MINIMAL and the (new, as opposed to old) '?'
operator when that is in use, and how all of this applies in some of
the more difficult cases. I know I couldn't implement this in a
way I would expect to be compatible with other implementations based
only on what has been added in 0000793 (even assuming that we fix
the definition of "matched").
(0005111)
geoffclare (manager)
2020-11-09 17:06

Draft 1.1 page 163 after line 5700 add:
leftmost
The characters closest to the beginning of the string.


Draft 1.1 page 172 lines 6090-6100 change:
Each of the duplication symbols ('+', '*', '?', and intervals) can be suffixed by the minimal repetition modifier '?' (<question-mark>), in which case matching behavior shall be changed from the leftmost longest possible match to the leftmost shortest possible match, including the null match (see Section A.9, on page 3410). For example, the ERE <tt>".*c"</tt> matches up to the last character (<tt>'c'</tt>) in the string <tt>"abc abc"</tt>, whereas the ERE <tt>".*?c"</tt> matches up to the first character <tt>'c'</tt>, the third character in the string.

If the REG_MINIMAL flag, defined in the <regex.h> header, is used when compiling an ERE via regcomp( ), the leftmost shortest possible match shall be the default, and the minimal repetition modifier <tt>'?'</tt> can be used to select the leftmost longest possible match.


The behavior of multiple adjacent duplication symbols ('+', '*', '?', and intervals, possibly suffixed by the minimal repetition modifier '?') produces undefined results.
to:
Each of the duplication symbols ('+', '*', '?', and intervals) can be suffixed by the repetition modifier '?' (<question-mark>), in which case matching behavior for that repetition shall be changed from the leftmost longest possible match to the leftmost shortest possible match, including the null match (see Section A.9, on page 3410). For example, the ERE ".*c" matches up to and including the last character ('c') in the string "abc abc", whereas the ERE ".*?c" matches up to and including the first character 'c', the third character in the string.
    
If the REG_MINIMAL flag, defined in the <regex.h> header, is used when compiling an ERE via regcomp( ), the leftmost shortest possible match shall be the default for all duplication symbols, and the repetition modifier '?' can be used to select the leftmost longest possible match for the repetition it modifies.


The behavior of multiple adjacent duplication symbols ('+', '*', '?', and intervals, possibly suffixed by the repetition modifier '?') produces undefined results.

Draft 1.1 page 163 line 5718-5722 after:
Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, matching the BRE "\(.*\).*" against "abcdef", the subexpression "(\1)" is "abcdef", and matching the BRE "\(a*\)*" against "bc", the subexpression "(\1)" is the null string.
add:
However, matching the ERE "(.*?).*" against "abcdef", the subpattern "(.*?)" matches the empty string, since that is the longest possible match for the ERE ".*?".


Draft 1.1 page 1719 line 56518, change:
REG_MINIMAL
Change default matching behavior to leftmost shortest possible match.
to:
REG_MINIMAL
Change the matching behavior for duplication symbols to the leftmost shortest possible match, and invert the behavior of the repetition modifier '?' (<question-mark>) to match the longest possible match instead of the shortest.


Draft 1.1 page 3411 line 116717-116719 change:
EREs can optionally use a leftmost-shortest rule (enabled via the REG_MINIMAL flag or the '?' minimal repetition modifier), in which case the shortest possible matching prefix is instead identified as the matching sequence.
to:
EREs can optionally use a leftmost-shortest rule for repetitions (enabled via the REG_MINIMAL flag or the '?' repetition modifier), in which case the shortest possible matching prefix is instead identified as the matching sequence for the affected repetition(s).

- Issue History
Date Modified Username Field Change
2020-03-29 20:54 rhialto New Issue
2020-03-29 20:54 rhialto Name => Olaf 'Rhialto' Seibert
2020-03-29 20:54 rhialto Section => Vol 1. 9.4.6, Vol. 1. 13. regex.h, Vol 2. regcomp(), Vol. 4. A.9.2
2020-03-29 20:54 rhialto Page Number => 190
2020-03-29 20:54 rhialto Line Number => 6195
2020-03-29 23:03 Don Cragun Relationship added child of 0000793
2020-03-30 02:57 kre Note Added: 0004804
2020-03-30 03:14 kre Note Added: 0004805
2020-03-30 03:23 kre Note Edited: 0004805
2020-03-30 05:59 kre Note Edited: 0004805
2020-03-30 08:36 geoffclare Note Added: 0004806
2020-03-30 10:27 kre Note Added: 0004808
2020-11-09 17:06 geoffclare Note Added: 0005111
2020-11-09 17:08 geoffclare Interp Status => ---
2020-11-09 17:08 geoffclare Final Accepted Text => Note: 0005111
2020-11-09 17:08 geoffclare Status New => Resolved
2020-11-09 17:08 geoffclare Resolution Open => Accepted As Marked
2020-11-09 17:08 geoffclare Tag Attached: issue8
2020-12-09 10:01 geoffclare Status Resolved => Applied


Mantis 1.1.6[^]
Copyright © 2000 - 2008 Mantis Group
Powered by Mantis Bugtracker