Re: [Bug-apl] Suggestion for Quad-RE

On 12 October 2017 at 20:55, Juergen Sauermann <address@hidden> wrote:

Hi Elias,

see below.

/// Jürgen

On 10/12/2017 09:13 AM, Elias Mårtenson wrote:

On 11 October 2017 at 21:15, Juergen Sauermann <address@hidden> wrote:

If I understand libpcre2 correctly (and I probably don't) then a general regular _expression_ RE is a tree whose
structure is determined by the nesting of the parentheses in RE, and the result of a match follows the tree structure.

Actually, this is not the case. When you have subexpressions, what you have is simply a list of them, and each subexpression has a value. Whether or not these subexpressions are nested does not matter. Its position is purely dictated by the index of the opening parentheses.

Not exactly. It is true that libpcre returns a list of matches in terms of the position of each
match in the subject string B. However any two matches are either disjoint or one match is
contained in the other. This containment relation defines a partial order between the
matches which is most conveniently described by a tree. In that tree one RE, say RE1 is a
child of another RE RE2 if the substring of B corresponding to RE2 is contained in the
substring of B that corresponds to RE2.

Wow, this mail became longer than I had expected. Just to be clear: I am mostly talking about the case where strings are returned. The cases where indexes or ⊂-data is returned are different, and what I'm saying below does not necessarily apply entirely.

I will henceforth use the word "group" instead of "subexpression". The formal name is “capturing group”.

The fact that parenthesised groups can nest is orthogonal to the behaviour of the groups themselves.

The groups are referenced by index or by name (there is a special pattern syntax to give a group a name), and the API then provides functions to access the values of these groups. You can extract the values of a group either by using the name or by the index.

Notice that you can even refer to a capturing group using backreferences within the pattern itself. This should help prove how simple they are:

In a pattern the sequence \1 can be used to refer back to the result of the first group, \2 to the second, etc. Thus (and you can test this with PCRE yourself):

x(.(.))-\1 matches: xab-ab, xwe-we, etc...

x(.(.))-\2 matches: xab-b, xwe-e, etc...

Every single Regexp API I have used (including languages such as C, Perl, Java, _javascript_, Elixir, etc) returns the groups in the exact same way as I have explained. You use groups to choose what parts of the text you're interested in, and refer to them by index.

Speaking of Elixir, this is the set of group-related flags that their API supports (something tells me they also use PCRE behind the scenes):

:all - all captured subpatterns including the complete matching string (this is the default)
:first - only the first captured subpattern, which is always the complete matching part of the string; all explicitly captured subpatterns are discarded
:all_but_first- all but the first matching subpattern, i.e. all explicitly captured subpatterns, but not the complete matching part of the string
:none - does not return matching subpatterns at all
:all_names - captures all names in the Regex
list(binary) - a list of named captures to capture

I think this is an excellent approach, and definitely something to emulate.

The question is then: shall ⎕RE simply return the array of matches (which was what your
implementation did) or shall ⎕RE return the matches as a tree? This is the same question
as shall the tree be represented as a simple vector of nodes (corresponding to an APL
vector of some kind) or shall it be represented as a recursive node-properties + children structure (corresponding to a nested APL value)?

My argument here is one of pragmatism:

Regexp implementations in all known languages returns an indexed list of groups
A simple list is simply the most useful. I struggle to think of any case where returning a nested structure is in any way useful.

The vector of nodes and the nested APL value are both equivalent in describing the
tree. However, converting the nested tree structure to a vector of nodes is much simpler
(in APL) than the other way around because converting a node vector to the tree involves
a lot of comparisons which are quite lightweight but extremely ugly in APL. That was why
decided to return the tree and not the vector of nodes.

You are indeed correct that converting it into a tree is more difficult than the other way around. But then again, why should we go out of our ways to make an incredibly rare use-case somewhat easier when the common use-case becomes annoyingly complicated?

Can you think of a case where returning a nested structure is useful?

I am not entirely against a flag that goes into that direction, but I believe that flag should
determine if either the tree is returned (default) or the node vector of the of the tree if
the flag is given. Unfortunately that flag, even though it is far more consistent with the
structure of the ⎕RE result than 1↓, does not solve your 1↓ because it would still contain
the top-level match (= the root of the tree).

I'm still curious as to why you are interested in even having the option of returning a tree. I just don't see any need for it, and no other language that I have ever used have done anything like it.

Do you have a use-case in mind?

When you use subexpressions, it means that I am interested in specific parts of the matched string. If I am interested in a specific part of a string, it is very unlikely that I want to know the content of the entire match. But, if I do, I can always retrieve that using another set of parens that surrounds the entire regexp.

Not necessarily. It could also be a boundary condition of your match that you
only want to be satisfied no matter how. REs like [A-Z][a-z][0-9] are often used that way.

You can easily change that to ([A-Z][a-z][0-9]) get the entire match in this case.

When you don't have any subexpressions, it's most likely that I am not interested in the matched string at all, but rather just a boolean result telling me if I have a match at all.

The boolean case is simple, so the only aspect of this that warrants any discussion is how that should be achieved. My opinion is that it should be the default, but a flag can also be used.

For subexpressions, I think a few examples will help explain how they are used:

Let's assume the following regexp:

A(.)|B(.)

This regexp has two subexpressions, and the result with therefore have two values. Due to the fact that they are separated by the alternation symbol (|), one of the subexpressions will always be empty. So, here are the different possible results when matching different strings:

"AXY" Subexpr 1: "X", Subexpr 2: ""

"BZA" Subexpr 1: "", Subexpr 2: "Z"

"CXY" No match

Not sure if that should be so but i am not too familiar with libpre2 either. I would naively
expect that an RE of the form A|B would either return a match for A or a match for B but
not both. man pcre2pattern says:

The pattern always has a fixed number of groups that is defined by the pattern itself. In fact, PCRE has an API call that allows you to extract the number of groups, and this number is always constant for any given pattern (the call is pcre2_get_ovector_count(), and it should be noted that this is the number of substrings that should be returned. The return value from pcre2_match() may be lower than this number, at which it should be assumed that the remaining captured groups are empty. My old code didn't handle this case correctly).

       Vertical bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty alternative is permitted (matching the empty
       string). The matching process tries each alternative in turn, from left
       to right, and the first one that succeeds is used.

My understanding of this is that, for example, B is ignored if A matches. That implies that
the matching of B is not even performed so "" (for no match) would be incorrect because
B could also match as well.

This is correct. But it has no relation to capturing groups. If we generalise your example a bit and add some groups, we get the following:

(gil....)|(sul......)

In this case, if you try to match against "gilbert", the second group still exists, but is empty. The same goes for the opposite case. Let's take three cases and look at what PCRE returns in this case:

foo - As there is no match, pcre2_match() will return -1.

gilbert - pcre2_match() will return 2, since it never looked at the second alternative

sullivan - pcre2_match() will return 3, since it had to look at both sides

In all of the cases where a match happened, pcre2_get_ovector_count() will return 3, since there is one full match + 2 capture groups.

Note that pcre returns unused capture groups with both start and end indexes being -1. This can be used to determine which side of an alternate actually matched. How to actually treat this in the case where strings (rather than indexes) are being returned in an open question. Returning ⍬ is an option, as is the empty string.

Regards,

Elias

From:	Elias Mårtenson
Subject:	Re: [Bug-apl] Suggestion for Quad-RE
Date:	Fri, 13 Oct 2017 16:46:36 +0800