From: Elias Mårtenson
Subject: Re: [Bug-apl] Suggestion for Quad-RE
Date: Fri, 13 Oct 2017 16:46:36 +0800
Hi Elias,
see below.
/// Jürgen
On 10/12/2017 09:13 AM, Elias Mårtenson wrote:
On 11 October 2017 at 21:15, Juergen Sauermann <address@hidden> wrote:
If I understand libpcre2 correctly (and I probably don't) then a general regular expression RE is a tree whose
structure is determined by the nesting of the parentheses in RE, and the result of a match follows the tree structure.

Actually, this is not the case. When you have subexpressions, what you have is simply a list of them, and each subexpression has a value. Whether or not these subexpressions are nested does not matter. The position of each one is dictated purely by the index of its opening parenthesis.

Not exactly. It is true that libpcre returns a list of matches in terms of the position of each
match in the subject string B. However, any two matches are either disjoint or one match is
contained in the other. This containment relation defines a partial order between the
matches which is most conveniently described by a tree. In that tree one RE, say RE1, is a
child of another RE, RE2, if the substring of B corresponding to RE1 is contained in the
substring of B that corresponds to RE2.
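The containment relation described above can be sketched in a few lines. This is only an illustration, assuming Python's `re` module as a stand-in for libpcre2 (the two are different engines, though both number groups by opening parenthesis); the function name `span_tree` and the dictionary layout are hypothetical and say nothing about how ⎕RE is implemented.

```python
import re

def span_tree(pattern, subject):
    """Arrange the group spans of the first match into a containment tree."""
    m = re.search(pattern, subject)
    if m is None:
        return None
    # Group 0 is the whole match; groups that did not participate report (-1, -1).
    nodes = [{"group": g, "span": m.span(g), "children": []}
             for g in range(m.re.groups + 1) if m.span(g) != (-1, -1)]
    root = nodes[0]
    stack = [root]
    for node in nodes[1:]:
        start, end = node["span"]
        # Groups are numbered by opening parenthesis, so an enclosing group
        # is always visited before the groups nested inside it.
        # The root is never popped: anything not contained elsewhere
        # (for example a group inside a lookahead) is attached to it.
        while len(stack) > 1 and not (stack[-1]["span"][0] <= start
                                      and end <= stack[-1]["span"][1]):
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

tree = span_tree(r"((a)(b))(c)", "xabc")
top_level = [child["group"] for child in tree["children"]]
nested = [child["group"] for child in tree["children"][0]["children"]]
```

Here groups 1 and 4 end up as children of the whole match, and groups 2 and 3 as children of group 1, which mirrors the nesting of the parentheses.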
:all
- all captured subpatterns including the complete matching string (this is the default)
:first
- only the first captured subpattern, which is always the complete matching part of the string; all explicitly captured subpatterns are discarded
:all_but_first
- all but the first matching subpattern, i.e. all explicitly captured subpatterns, but not the complete matching part of the string
:none
- does not return matching subpatterns at all
:all_names
- captures all names in the Regex
list(binary)
- a list of named captures to capture
The question is then: shall ⎕RE simply return the array of matches (which was what your
implementation did) or shall ⎕RE return the matches as a tree? This is the same question
as shall the tree be represented as a simple vector of nodes (corresponding to an APL
vector of some kind) or shall it be represented as a recursive node-properties + children structure (corresponding to a nested APL value)?
The vector of nodes and the nested APL value are both equivalent in describing the
tree. However, converting the nested tree structure to a vector of nodes is much simpler
(in APL) than the other way around because converting a node vector to the tree involves
a lot of comparisons which are quite lightweight but extremely ugly in APL. That was why
I decided to return the tree and not the vector of nodes.
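To illustrate why the tree-to-vector direction is the easy one, here is a minimal sketch in Python rather than APL. The `(text, children)` tuple layout and the name `flatten` are assumptions made for this example only, not the actual shape of the ⎕RE result.

```python
def flatten(node):
    """Pre-order walk: nested (text, children) tree -> flat list of texts."""
    text, children = node
    out = [text]
    for child in children:
        out.extend(flatten(child))
    return out

# Hypothetical nested result for a pattern like ((a)(b))-(c) matched against "ab-c".
tree = ("ab-c", [("ab", [("a", []), ("b", [])]), ("c", [])])
nodes = flatten(tree)
```

Going the other way, from `nodes` back to `tree`, needs the position of every match and a containment comparison between them, which is the part that gets ugly.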
I am not entirely against a flag that goes into that direction, but I believe that flag should
determine whether the tree is returned (default) or the node vector of the tree if
the flag is given. Unfortunately that flag, even though it is far more consistent with the
structure of the ⎕RE result than 1↓, does not solve your 1↓ because it would still contain
the top-level match (= the root of the tree).
When I use subexpressions, it means that I am interested in specific parts of the matched string. If I am interested in a specific part of a string, it is very unlikely that I want to know the content of the entire match. But, if I do, I can always retrieve that using another set of parens that surrounds the entire regexp.

Not necessarily. It could also be a boundary condition of your match that you
only want to be satisfied no matter how. REs like [A-Z][a-z][0-9] are often used that way.
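A small sketch of that "only needs to be satisfied" use, written with Python's `re` module as a stand-in for libpcre2; the helper name `satisfies` is hypothetical.

```python
import re

def satisfies(subject):
    """True if subject contains an uppercase letter, a lowercase letter, then a digit."""
    return re.search(r"[A-Z][a-z][0-9]", subject) is not None
```

Only the yes/no answer is used here; neither the matched text nor any subexpression is looked at.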
When I don't have any subexpressions, it's most likely that I am not interested in the matched string at all, but rather just a boolean result telling me if I have a match at all.
The boolean case is simple, so the only aspect of this that warrants any discussion is how that should be achieved. My opinion is that it should be the default, but a flag can also be used.
For subexpressions, I think a few examples will help explain how they are used:
Let's assume the following regexp:
A(.)|B(.)
This regexp has two subexpressions, and the result will therefore have two values. Because they are separated by the alternation symbol (|), one of the subexpressions will always be empty. So, here are the possible results when matching different strings:
"AXY"  Subexpr 1: "X", Subexpr 2: ""
"BZA"  Subexpr 1: "", Subexpr 2: "Z"
"CXY"  No match

Not sure if that should be so, but I am not too familiar with libpcre2 either. I would naively
expect that an RE of the form A|B would either return a match for A or a match for B but
not both. man pcre2pattern says:
Vertical bar characters are used to separate alternative patterns. For
example, the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used.
My understanding of this is that, for example, B is ignored if A matches. That implies that
the matching of B is not even performed so "" (for no match) would be incorrect because
B could match as well.
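For comparison, the A(.)|B(.) example can be tried with Python's `re` module. This is a sketch only: Python's engine is not libpcre2, and it reports a subexpression that took no part in the match as `None`, which is replaced by an empty string below via `groups(default="")`. The helper name `subexprs` is hypothetical, and the output says nothing definitive about what ⎕RE should return.

```python
import re

pattern = re.compile(r"A(.)|B(.)")

def subexprs(subject):
    """Return both subexpression values, or None if there is no match."""
    m = pattern.match(subject)
    if m is None:
        return None
    # A group in the alternative that was not used did not participate;
    # default="" turns its None into an empty string.
    return m.groups(default="")

results = {s: subexprs(s) for s in ("AXY", "BZA", "CXY")}
```

With this engine the unused alternative is simply reported as "not set", so the caller cannot tell from the result alone whether B would also have matched.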