• Experiences with match() subexpressions?

    From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 09:06:34 2025
    From Newsgroup: comp.lang.awk

    I'm looking for subexpressions of regexp-matches using GNU Awk's
    third parameter of match(). For example

    data = "R=r1,R=r2,R=r3,E=e"
    match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

    The result stored in 'arr' seems to be determined by the static
    parenthesis structure, so with the pattern repetition {2,5} only
    the last matched data in the subexpression (r3) seems to persist
    in arr. - I suppose there's no cute way to achieve what I wanted?

    Janis
    --- Synchronet 3.20c-Linux NewsLink 1.2
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 09:09:55 2025
    From Newsgroup: comp.lang.awk

    On 10.04.2025 09:06, Janis Papanagnou wrote:
    > I'm looking for subexpressions of regexp-matches using GNU Awk's
    > third parameter of match(). For example
    >
    > data = "R=r1,R=r2,R=r3,E=e"
    > match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
    >
    > The result stored in 'arr' seems to be determined by the static
    > parenthesis structure, so with the pattern repetition {2,5} only
    > the last matched data in the subexpression (r3) seems to persist
    > in arr. - I suppose there's no cute way to achieve what I wanted?

    To clarify; what I wanted is access of the values "r1", "r2", "r3",
    and "e" through 'arr'.

    Janis


  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Thu Apr 10 11:08:55 2025
    From Newsgroup: comp.lang.awk

    In article <vt7qs4$2gior$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > On 10.04.2025 09:06, Janis Papanagnou wrote:
    >> I'm looking for subexpressions of regexp-matches using GNU Awk's
    >> third parameter of match(). For example
    >>
    >> data = "R=r1,R=r2,R=r3,E=e"
    >> match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
    >>
    >> The result stored in 'arr' seems to be determined by the static
    >> parenthesis structure, so with the pattern repetition {2,5} only
    >> the last matched data in the subexpression (r3) seems to persist
    >> in arr. - I suppose there's no cute way to achieve what I wanted?
    >
    > To clarify; what I wanted is access of the values "r1", "r2", "r3",
    > and "e" through 'arr'.

    I have to admit that I (still) don't really understand how this match third
    arg stuff works. I.e., I can never predict what will happen, so I always
    just dump out the array and try to reverse-engineer it each time I need to
    use it.

    I adapted your code into the following test script:

    --- Cut Here ---
    #!/bin/sh
    gawk 'BEGIN {
    data = "R=r1,R=r2,R=r3,E=e"
    match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
    for (i in arr) print i,arr[i]
    }'

    # To clarify; what I wanted is access of the values "r1", "r2", "r3",
    # and "e" through 'arr'.
    --- Cut Here ---

    The output I get is:

    --- Cut Here ---
    0start 1
    0length 18
    3start 18
    1start 11
    2start 13
    3length 1
    2length 2
    1length 5
    0 R=r1,R=r2,R=r3,E=e
    1 R=r3,
    2 r3
    3 e
    --- Cut Here ---

    After playing around a bit, I could not come up with any sensible way of
    getting what you want to get.

    As an alternative, it sounds like you could just split the
    string on the comma; that would get you:

    R=r1
    R=r2
    R=r3
    E=e

    Or, for finer control, you could use patsplit().
    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/Reaganomics
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 13:55:07 2025
    From Newsgroup: comp.lang.awk

    On 10.04.2025 13:08, Kenny McCormack wrote:
    > In article <vt7qs4$2gior$1@dont-email.me>,
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >> On 10.04.2025 09:06, Janis Papanagnou wrote:
    >>> I'm looking for subexpressions of regexp-matches using GNU Awk's
    >>> third parameter of match(). For example
    >>>
    >>> data = "R=r1,R=r2,R=r3,E=e"
    >>> match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
    >>>
    >>> The result stored in 'arr' seems to be determined by the static
    >>> parenthesis structure, so with the pattern repetition {2,5} only
    >>> the last matched data in the subexpression (r3) seems to persist
    >>> in arr. - I suppose there's no cute way to achieve what I wanted?
    >>
    >> To clarify; what I wanted is access of the values "r1", "r2", "r3",
    >> and "e" through 'arr'.
    >
    > I have to admit that I (still) don't really understand how this match third
    > arg stuff works.

    I've never used that before, but it seems to be quite simple: for every
    parenthesized group in the regexp it provides (statically, as
    the parentheses are written, from left to right) an array element with
    the matched subexpression.

    > I.e., I can never predict what will happen, so I always
    > just dump out the array and try to reverse-engineer it each time I need to
    > use it.
    >
    > I adapted your code into the following test script:
    >
    > --- Cut Here ---
    > #!/bin/sh
    > gawk 'BEGIN {
    > data = "R=r1,R=r2,R=r3,E=e"
    > match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
    > for (i in arr) print i,arr[i]
    > }'
    >
    > # To clarify; what I wanted is access of the values "r1", "r2", "r3",
    > # and "e" through 'arr'.
    > --- Cut Here ---
    >
    > The output I get is:
    >
    > --- Cut Here ---
    > 0start 1
    > 0length 18
    > 3start 18
    > 1start 11
    > 2start 13
    > 3length 1
    > 2length 2
    > 1length 5

    Above output stuff appears because in 'arr' there's additional elements
    about the pattern positions stored.

    I don't need that, so I'm just interested in the data patterns below and
    iterate with an index-counted loop...

    > 0 R=r1,R=r2,R=r3,E=e

    the whole expression

    > 1 R=r3,

    the expression in the first parenthesis

    > 2 r3

    the expression in the second, embedded parenthesis

    > 3 e

    the expression in the final parenthesis

    > --- Cut Here ---
    >
    > After playing around a bit, I could not come up with any sensible way of
    > getting what you want to get.

    Yeah, Arnold just told me the same; that it's impossible because the
    underlying GNU regexp library doesn't support what I'm looking for.

    What I considered a possible workaround (in this case) is to sequence
    the (...){2,5} expression by using sequences of (...)? expressions.
    (But in the general case, for larger ranges than 2-5, that's neither
    feasible nor sensible any more.)


    > As an alternative, it sounds like you could just split the
    > string on the comma; that would get you:

    Yes, that was also how I did such things in the past. Only when I saw
    that "third argument" to match() I hoped the two-level parsing could
    be simplified in one step. The reason was that I thought to have seen
    other languages (Perl, maybe?) that supported such a feature.


    > R=r1
    > R=r2
    > R=r3
    > E=e
    >
    > Or, for finer control, you could use patsplit().

    I think I'll do the parsing the straightforward two-step way as I did
    before the GNU Awk specific functions were available; it's probably
    also the clearest way to program that functionality.
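
    For readers following along, the two-step approach mentioned above might look roughly like the sketch below. It uses only POSIX awk (no gawk extensions), the shell function name split_pairs is made up for illustration, and it assumes that values never contain a comma.

```sh
# Hypothetical helper (name is illustrative): two-step parsing in POSIX awk.
# Step 1 splits the record on commas, step 2 splits each item at its first "=".
split_pairs() {
    printf '%s\n' "$1" | awk '{
        n = split($0, items, ",")
        for (i = 1; i <= n; i++) {
            eq  = index(items[i], "=")        # position of the first "="
            key = substr(items[i], 1, eq - 1)
            val = substr(items[i], eq + 1)    # everything after the "="
            print key, val
        }
    }'
}

split_pairs 'R=r1,R=r2,R=r3,E=e'
```

    With the sample data this prints one "key value" pair per line, so the values r1, r2, r3 and e are all available without any capture groups.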

    Janis

  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Thu Apr 10 14:04:46 2025
    From Newsgroup: comp.lang.awk

    In article <vt8bit$2uiq5$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    >> I have to admit that I (still) don't really understand how this match third
    >> arg stuff works.
    ...
    >> I.e., I can never predict what will happen, so I always
    >> just dump out the array and try to reverse-engineer it each time I need to
    >> use it.
    ...
    > Above output stuff appears because in 'arr' there's additional elements
    > about the pattern positions stored.

    Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
    I understand the man page description of match's 3rd arg as well as anyone;
    I just find that it doesn't do as much in practice as (I think) it
    should - and that it is unpredictable (by me, anyway) what it will do (you
    have to dump out the array and trial-and-error it to get it to do what you
    want). It promises more than it delivers. I have much the same comments
    to make about the similar functionality in Tcl (Expect).

    None of which is criticism of the feature; as you say below, it basically
    does as much as the underlying regexp library allows it to do.

    ...
    > I think I'll do the parsing the straightforward two-step way as I did
    > before the GNU Awk specific functions were available; it's probably
    > also the clearest way to program that functionality.

    Probably so. BTW, it is not really "GNU Awk specific"; lots of languages
    have this general capability.

    Incidentally, here is a function of mine that uses match's 3rd arg. I find
    it useful. This addresses a common AWK issue, where you have a line with fields (in the usual AWK whitespace-delimited sense), but you need to know
    the actual character positions of the fields (since they can move around
    from line to line of input). Note also that I'm not really sure where the
    name "splitMatch" came from; it was just what popped into my head when I
    was writing this...

    --- Cut Here ---
    # Find the character positions of each of the fields in string s.
    # Note that s will usually be $0, and n will usually be NF.
    function splitMatch(s,n,A,    i,t) {
        for (i=1; i<=n; i++) t = t "([^ \t]+)[ \t]*"
        return match(s,t,A)
    }
    --- Cut Here ---
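
    With gawk, after calling splitMatch($0, NF, A) the position and length of field i should be available as A[i, "start"] and A[i, "length"] (these are the "1start" / "1length" style indices seen in the array dump earlier in the thread). For comparison, here is a rough sketch of the same idea without the third argument, using only POSIX match() with RSTART and RLENGTH. This is not the function above, just a portable alternative; the name field_positions is made up.

```sh
# Hypothetical portable alternative: print "fieldno start length" for each
# whitespace-delimited field, using only POSIX awk features.
field_positions() {
    printf '%s\n' "$1" | awk '{
        rest = $0; offset = 0; n = 0
        while (match(rest, /[^ \t]+/)) {
            n++
            print n, offset + RSTART, RLENGTH
            offset += RSTART + RLENGTH - 1        # characters consumed so far
            rest = substr(rest, RSTART + RLENGTH) # continue after this field
        }
    }'
}

field_positions '  foo bar   baz'
```

    Positions are 1-based character offsets into the original line, matching the convention of RSTART.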
    --
    In the corner of the room on the ceiling is a large vampire bat who
    is obviously deranged and holding his nose.
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 23:39:57 2025
    From Newsgroup: comp.lang.awk

    On 10.04.2025 16:04, Kenny McCormack wrote:
    > In article <vt8bit$2uiq5$1@dont-email.me>,
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > ...
    >>> I have to admit that I (still) don't really understand how this match third
    >>> arg stuff works.
    > ...
    >>> I.e., I can never predict what will happen, so I always
    >>> just dump out the array and try to reverse-engineer it each time I need to
    >>> use it.
    > ...
    >> Above output stuff appears because in 'arr' there's additional elements
    >> about the pattern positions stored.
    >
    > Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
    > I understand the man page description of match's 3rd arg as well as anyone;

    (I didn't mean to offend you. Sorry, if it appeared so. - I just
    read you writing "I don't really understand how this [...] works",
    and that "it is unpredictable", so I thought some descriptive words
    may be useful.)

    > I just find it that it doesn't do as much in practice as (I think) it
    > should - and that it is unpredictable (by me, anyway) what it will do (you
    > have to dump out the array and trial-and-error it to get it to do what you
    > want).

    It is pretty understandable to me, and not the least unpredictable.
    (That's why I thought it would be okay to write what I had written
    to explain it.) I don't understand what you find to be unpredictable.
    But never mind.

    > It promises more than it delivers.

    Yes, probably. Although, according to what's literally documented,
    it doesn't promise too much, IMO. The feature can be very useful,
    but not for the case I was looking for. - Actually, it could have
    provided the functionality I was seeking, but since GNU Awk relies
    on the GNU regexp functions as they are implemented I cannot expect
    that any provided features gets extended by Awk. - If GNU Awk would
    have an own RE implementation then we could think about using, e.g.,
    another array dimension to store the (now only temporary existing,
    and generally unavailable) subexpressions.

    [...]

    > None of which is criticism of the feature; as you say below, it basically
    > does as much as the underlying regexp library allows it to do.
    >
    > ...
    >> I think I'll do the parsing the straightforward two-step way as I did
    >> before the GNU Awk specific functions were available; it's probably
    >> also the clearest way to program that functionality.
    >
    > Probably so. BTW, it is not really "GNU Awk specific"; lots of languages
    > have this general capability.

    Oh, I was just trying to say that for my programming the standard Awk
    functions (as opposed to GNU Awk _specific_ functions) are fine here.
    (That should not disdain all the useful GNU Awk extensions existing.)

    Janis

    [...]


  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Thu Apr 10 20:07:35 2025
    From Newsgroup: comp.lang.awk

    On 4/10/2025 2:09 AM, Janis Papanagnou wrote:
    > On 10.04.2025 09:06, Janis Papanagnou wrote:
    >> I'm looking for subexpressions of regexp-matches using GNU Awk's
    >> third parameter of match(). For example
    >>
    >> data = "R=r1,R=r2,R=r3,E=e"
    >> match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
    >>
    >> The result stored in 'arr' seems to be determined by the static
    >> parenthesis structure, so with the pattern repetition {2,5} only
    >> the last matched data in the subexpression (r3) seems to persist
    >> in arr. - I suppose there's no cute way to achieve what I wanted?
    >
    > To clarify; what I wanted is access of the values "r1", "r2", "r3",
    > and "e" through 'arr'.

    Correct, you can't do what you want using just `match()`; it's simply
    matching a regexp with capture groups against a string, just like sed does.

    There are, of course, several other ways to get `arr[]` populated the
    way you want, e.g. split(), patsplit(), while(match()), or dynamically
    generating the regexp. The best one to choose will depend on the real
    values that r1, etc. can have; for example, it'd be hard to use split()
    if `r1` can be a quoted string that might itself contain similar
    substrings such as `data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"`.
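
    As an illustration of the while(match()) option listed above, here is a minimal sketch in POSIX awk. It assumes the simple case only: single-letter upper-case keys and values that contain no commas or quoting, so it would not cope with the quoted-string example. The shell function name extract_values is made up.

```sh
# Hypothetical helper: print every value of an "X=value,X=value,..." list,
# one per line, by repeatedly matching at the front of the remaining text.
extract_values() {
    printf '%s\n' "$1" | awk '{
        rest = $0
        while (match(rest, /[A-Z]=[^,]+/)) {
            item = substr(rest, RSTART, RLENGTH)   # e.g. "R=r1"
            print substr(item, 3)                  # drop the "R=" or "E=" prefix
            rest = substr(rest, RSTART + RLENGTH)  # continue after this item
        }
    }'
}

extract_values 'R=r1,R=r2,R=r3,E=e'
```

    Each loop iteration consumes one item, so the number of repetitions does not need to be known in advance, unlike with a fixed set of capture groups.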

    Ed.
  • From arnold@arnold@freefriends.org (Aharon Robbins) to comp.lang.awk on Fri Apr 11 06:33:19 2025
    From Newsgroup: comp.lang.awk

    In article <vt9dre$3t3po$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > The feature can be very useful,
    > but not for the case I was looking for. - Actually, it could have
    > provided the functionality I was seeking, but since GNU Awk relies
    > on the GNU regexp functions as they are implemented I cannot expect
    > that any provided features gets extended by Awk. - If GNU Awk would
    > have an own RE implementation then we could think about using, e.g.,
    > another array dimension to store the (now only temporary existing,
    > and generally unavailable) subexpressions.

    Actually, this is not so trivial. The data structures at the C level
    as mandated by POSIX are one dimensional; the submatches in parentheses
    are counted from left to right. There's no way to represent the
    subexpressions that are under control of interval expressions, which
    would essentially require a two-dimensional data structure.

    Mike Haertel is writing a new regexp matcher for gawk; it was announced
    here some time ago: https://github.com/mikehaertel/minrx. The code is
    in the feature/minrx branch of the gawk Git repository.

    I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
    about this question. We shall see what develops.

    Arnold
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Apr 11 09:10:55 2025
    From Newsgroup: comp.lang.awk

    On 11.04.2025 08:33, Aharon Robbins wrote:
    > In article <vt9dre$3t3po$1@dont-email.me>,
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >> The feature can be very useful,
    >> but not for the case I was looking for. - Actually, it could have
    >> provided the functionality I was seeking, but since GNU Awk relies
    >> on the GNU regexp functions as they are implemented I cannot expect
    >> that any provided features gets extended by Awk. - If GNU Awk would
    >> have an own RE implementation then we could think about using, e.g.,
    >> another array dimension to store the (now only temporary existing,
    >> and generally unavailable) subexpressions.
    >
    > Actually, this is not so trivial. The data structures at the C level
    > as mandated by POSIX are one dimensional; the submatches in parentheses
    > are counted from left to right. There's no way to represent the
    > subexpressions that are under control of interval expressions, which
    > would essentially require a two-dimensional data structure.

    Yes, that's why I had thought about a 2-dimensional array [on GNU
    Awk level] so that arr[n][i] for i=1..z would contain the patterns.
    This is what I actually tried with GNU Awk (before I had asked you)
    to see whether there's some undocumented feature.

    (I'm aware that things may get quite complicated if there's some
    restrictions imposed (on "C"-level or else) which are in the way.)


    > Mike Haertel is writing a new regexp matcher for gawk; it was announced
    > here some time ago: https://github.com/mikehaertel/minrx. The code is
    > in the feature/minrx branch of the gawk Git repository.
    >
    > I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
    > about this question. We shall see what develops.

    Oh, thanks for that. - My expectation had just been to check whether
    such a feature is already available in GNU Awk (or could be made
    available with little effort). So I'm indeed interested
    to hear whether that is a feasible and sensible feature. - Myself,
    I have to admit, I haven't yet thought such a feature through
    thoroughly. I've just seen it from the limited view of my application
    context and thought it could be a worthwhile extension/generalization.

    Janis


    > Arnold


  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.awk on Fri Apr 11 07:40:01 2025
    From Newsgroup: comp.lang.awk

    On 2025-04-11, Aharon Robbins <arnold@freefriends.org> wrote:
    > In article <vt9dre$3t3po$1@dont-email.me>,
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >> The feature can be very useful,
    >> but not for the case I was looking for. - Actually, it could have
    >> provided the functionality I was seeking, but since GNU Awk relies
    >> on the GNU regexp functions as they are implemented I cannot expect
    >> that any provided features gets extended by Awk. - If GNU Awk would
    >> have an own RE implementation then we could think about using, e.g.,
    >> another array dimension to store the (now only temporary existing,
    >> and generally unavailable) subexpressions.
    >
    > Actually, this is not so trivial. The data structures at the C level
    > as mandated by POSIX are one dimensional; the submatches in parentheses
    > are counted from left to right. There's no way to represent the
    > subexpressions that are under control of interval expressions, which
    > would essentially require a two-dimensional data structure.
    >
    > Mike Haertel is writing a new regexp matcher for gawk; it was announced
    > here some time ago: https://github.com/mikehaertel/minrx. The code is
    > in the feature/minrx branch of the gawk Git repository.
    >
    > I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
    > about this question. We shall see what develops.

    Unix and POSIX regular expressions have perpetrated a kind of
    misfeature. They took the purely algebraic parentheses described in
    classic literature on regular expressions, whose only role is to
    override the precedence and associativity of operators, and turned them
    into active operators that perform a double duty: they still override precedence, but also denote submatches associated with capture
    registers.

    Parentheses are enumerated and made to correspond with numbered capture
    registers, I think, as follows:

    ( ( ) ( ( ) ) )
    1 2   3 4

    Scanning left to right, we identify the open left parentheses
    which have matching closing parentheses, and number these in order
    starting from 1.

    There is a convention that capture register 0 is reserved for
    the full match for the expression. This is how it is with
    the array reported by POSIX's regexec. Thus the numbering is
    one based.

    The POSIX standard clearly says what happens when a parenthesized
    subexpression matches something more than once.

    This is spelled out in the documentation page on the regcomp,
    regexec and regfree functions. Look for this text:

    "If subexpression i in a regular expression is not contained within
    another subexpression, and it participated in the match several times,
    then the byte offsets in pmatch[i] shall delimit the last such match."

    This is exactly the last match behavior observed by Janis in Awk's
    match function.

    Basically, subexpressions are a dumb hack. As the regex automaton
    traverses through its states in response to the input, it triggers
    some anchor points associated with the original subexpression,
    which copy some data, or keep track of some pointers to the start and
    end of the match. When the submatch is complete, there is a data
    transfer which clobbers any previous such data transfer.

    There are some tricky rules for nested expressions.
    Suppose that we have:

    ( ... ( ... ) ...)
    1     2

    2 is nested inside 1. Suppose that 1 matches multiple times.
    Clearly, the corresponding register is left with the most
    recent match when the matching is done.

    But suppose that subexpression 2 sometimes matches when 1
    matches, but sometimes does not match when 1 matches.

    I think the obscurely worded POSIX rules are trying to prevent an
    inconsistency.

    In a nutshell, if a string is reported in register 2 from
    matching subexpression 2, it has to be a substring of a match that is
    concurrently happening for subexpression 1.

    Now suppose that an iteration of 1 matches something,
    but in that iteration, subexpression 2 does not match.
    Then 2 has to be reset to indicate that it didn't match anything.

    Probably, it's a good idea to implement the behavior as follows: whenever a
    new capture iteration begins for 1, the register for 2 must also be
    cleared, so that it doesn't retain stale data in the event that a match
    for 2 is not encountered in the new iteration of 1.

    This stuff is not really that usable for repetition; captures
    were clearly envisioned mainly for non-repeating matching without
    any Kleene stars or {m,n} repetitions.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.awk on Fri Apr 11 08:22:44 2025
    From Newsgroup: comp.lang.awk

    On 2025-04-11, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > On 11.04.2025 08:33, Aharon Robbins wrote:
    >> In article <vt9dre$3t3po$1@dont-email.me>,
    >> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >>> The feature can be very useful,
    >>> but not for the case I was looking for. - Actually, it could have
    >>> provided the functionality I was seeking, but since GNU Awk relies
    >>> on the GNU regexp functions as they are implemented I cannot expect
    >>> that any provided features gets extended by Awk. - If GNU Awk would
    >>> have an own RE implementation then we could think about using, e.g.,
    >>> another array dimension to store the (now only temporary existing,
    >>> and generally unavailable) subexpressions.
    >>
    >> Actually, this is not so trivial. The data structures at the C level
    >> as mandated by POSIX are one dimensional; the submatches in parentheses
    >> are counted from left to right. There's no way to represent the
    >> subexpressions that are under control of interval expressions, which
    >> would essentially require a two-dimensional data structure.
    >
    > Yes, that's why I had thought about a 2-dimensional array [on GNU
    > Awk level] so that arr[n][i] for i=1..z would contain the patterns.
    > This is what I actually tried with GNU Awk (before I had asked you)
    > to see whether there's some undocumented feature.

    I solved this problem 15 years ago in the TXR Pattern Language:

    $ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e'
    r[0]="r1"
    r[1]="r2"
    r[2]="r3"
    e="e"

    We can eval the output into Bash and have a ${r[@]} array.

    We can see the captured variables in a Lisp format:

    $ echo 'R=r1,R=r2,R=r3,E=e' | txr -l -c '@(coll)R=@r,@(until)E@(end)E=@e'
    (r "r1" "r2" "r3")
    (e . "e")

    The matches occurring in repetition constructs like @(coll) or its
    vertical, line-oriented counterpart @(collect), are automatically
    tabulated into lists.

    We can see that the "e" variable wasn't; it is string valued,
    rather than list valued.

    One possibility is to use the @(merge dest {sources}*) directive which
    examines different nesting depths of its operands and
    intelligently combines them.

    $ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e
    @(merge x r e)'
    r[0]="r1"
    r[1]="r2"
    r[2]="r3"
    e="e"
    x[0]="r1"
    x[1]="r2"
    x[2]="r3"
    x[3]="e"

    $ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e
    @(merge x r e)
    @(forget r e)'
    x[0]="r1"
    x[1]="r2"
    x[2]="r3"
    x[3]="e"

    A plethora of techniques are possible.

    In Lisp, split the data along commas, then again on =:

    (flow "R=r1,R=r2,R=r3,E=e"
      (spl ","))
    ("R=r1" "R=r2" "R=r3" "E=e")
    (flow "R=r1,R=r2,R=r3,E=e"
      (spl ",")
      (map (op spl "=")))
    (("R" "r1") ("R" "r2") ("R" "r3") ("E" "e"))

    Or pattern match the comma splits:

    (flow "R=r1,R=r2,R=r3,E=e"
      (spl ",")
      (map (do match `@key=@val` @1 (list key val))))
    (("R" "r1") ("R" "r2") ("R" "r3") ("E" "e"))

    Just the R's, please:

    (flow "R=r1,R=r2,R=r3,E=e"
      (spl ",")
      (map (do if-match `R=@val` @1 val)))
    ("r1" "r2" "r3" nil)

    Splice out the nils:

    (flow "R=r1,R=r2,R=r3,E=e"
      (spl ",")
      (mappend (do if-match `R=@val` @1 (list val))))
    ("r1" "r2" "r3")

    Or remove them:

    (flow "R=r1,R=r2,R=r3,E=e"
      (spl ",")
      (map (do if-match `R=@val` @1 val))
      (remq nil))

    Heck, use a Lispified Awk. The variable f holds
    the fields. When we assign f to itself, that
    forces the recalculation of variable rec with
    the ofs:

    (awk (:inputs '("R=r1,R=r2,R=r3,E=e"))
      (:set fs "," ofs ":")
      (t (set f f) (prn)))
    R=r1:R=r2:R=r3:E=e
    nil

    Use two Awks, nested inside each other: inner Awk
    processes the fields f produced by the outer Awk:

    (awk (:inputs '("R=r1,R=r2,R=r3,E=e"))
      (:set fs "," ofs ":")
      (t (awk (:inputs f)
           (:set fs "=")
           (t (prn [f 1])))))
    r1
    r2
    r3
    e
    nil
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Fri Apr 11 08:57:22 2025
    From Newsgroup: comp.lang.awk

    In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
    Aharon Robbins <arnold@freefriends.org> wrote:
    ...
    > Mike Haertel is writing a new regexp matcher for gawk; it was announced
    > here some time ago: https://github.com/mikehaertel/minrx. The code is
    > in the feature/minrx branch of the gawk Git repository.

    Just out of curiosity, does the new matcher address the issue raised by
    Janis?

    It sounds like you are implying that it does, but do not say so explicitly.

    Again, just curiosity. I remember when you announced the new matcher, and
    it sounded interesting, but the presentation left me wondering the usual question(s): Why should I care? (Why should I get excited about this?)

    Incidentally, I remember that the primary issue with the new matcher was
    that it was written in C++. It needs to be C-ified in order to be included
    in a GAWK release version.
    --
    If the automobile had followed the same development cycle as the
    computer, a Rolls-Royce today would cost $100, get a million miles to
    the gallon, and explode once every few weeks, killing everyone inside.
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Apr 11 15:50:11 2025
    From Newsgroup: comp.lang.awk

    On 11.04.2025 10:57, Kenny McCormack wrote:
    > In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
    > Aharon Robbins <arnold@freefriends.org> wrote:
    > ...
    >> Mike Haertel is writing a new regexp matcher for gawk; it was announced
    >> here some time ago: https://github.com/mikehaertel/minrx. The code is
    >> in the feature/minrx branch of the gawk Git repository.
    >
    > Just out of curiosity, does the new matcher address the issue raised by
    > Janis?

    I read his post as if he put it under discussion ("I just opened an
    issue, [...] about this question. We shall see what develops.") and
    the provided link shows this as well.[*]

    (I don't see the answers, though, since my browser obviously doesn't
    support the web-page's (dynamic?) format. - So I cannot tell what the
    state of that discussion is.)

    > It sounds like you are implying that it does, but do not say so explicitly.

    [...]

    Janis

    [*] From https://github.com/mikehaertel/minrx/issues/43:

    So there are two questions.

    Is it theoretically possible to capture all the instances of
    subexpressions matched by the interval expression?

    Can this be brought out into the code? I understand it would take an extended API with a richer data structure in order to do this. gawk's
    extended version of the match() function could then be (somehow)
    extended to take advantage of this feature.

  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.awk on Fri Apr 11 17:54:07 2025
    From Newsgroup: comp.lang.awk

    On 2025-04-11, Aharon Robbins <arnold@freefriends.org> wrote:
    > In article <vt9dre$3t3po$1@dont-email.me>,
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >> The feature can be very useful,
    >> but not for the case I was looking for. - Actually, it could have
    >> provided the functionality I was seeking, but since GNU Awk relies
    >> on the GNU regexp functions as they are implemented I cannot expect
    >> that any provided features gets extended by Awk. - If GNU Awk would
    >> have an own RE implementation then we could think about using, e.g.,
    >> another array dimension to store the (now only temporary existing,
    >> and generally unavailable) subexpressions.
    >
    > Actually, this is not so trivial. The data structures at the C level
    > as mandated by POSIX are one dimensional; the submatches in parentheses
    > are counted from left to right. There's no way to represent the
    > subexpressions that are under control of interval expressions, which
    > would essentially require a two-dimensional data structure.

    Here is what I believe is the right requirement, if you want repeatedly
    visited subexpressions to capture all their iterations.

    The dimensionality has to be such that the entire array of matches is
    versioned as a whole.

    In other words, abstractly, we have

    matches[history][register]

    where history counts from 0, that being the latest matches.
    register also goes from zero; [0] is the match for the entire
    expression, [1] for subexpression 1 and so on.

    Any time there is a repetition in any subexpression, matches[0]
    is duplicated and pushed into the history.

    We can imagine the matches[h][0..(n-1)] giving a trace of the
    matches through the tree of subexpressions, from root to leaf.
    Each time something is matched, the entire trace is recorded
    in the history, so everything is consistent.

    Say we want to parse the syntax

    key=v1,v2,v3 foo=a,b

    using something like:

        ([^ =]+=([^ ,]*,?)* *)*
        1       2

    Then we have the subgroups 1 and 2. We would like to end up with
    a two dimensional match array like this:

    match[hist][reg] =

              reg
    hist   0                      1              2
    0      key=v1,v2,v3 foo=a,b   foo=a,b        b
    1      key=v1,v2,v3 foo=a,b   foo=a,b        a,
    2      key=v1,v2,v3 foo=a,b   key=v1,v2,v3   v3
    3      key=v1,v2,v3 foo=a,b   key=v1,v2,v3   v2,
    4      key=v1,v2,v3 foo=a,b   key=v1,v2,v3   v1,
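
    Since awk's match() does not expose such a history, the rows of the
    table above can only be approximated by hand. The following is a rough
    sketch in POSIX awk (run from the shell) that uses an outer match() loop
    for each iteration of group 1 and an inner one for each iteration of
    group 2. The names rest, grp, items, reg1 and reg2 are invented for this
    example, and the loops emulate the proposed bookkeeping rather than
    implementing it inside a regex engine. Register 0 is always the whole
    input, so it is not printed.

```shell
out=$(awk 'BEGIN {
    s = "key=v1,v2,v3 foo=a,b"
    rest = s
    # Outer loop: one pass per iteration of group 1 (a key=... chunk).
    while (match(rest, /^[^ =]+=[^ ]* */)) {
        grp = substr(rest, 1, RLENGTH)
        rest = substr(rest, RLENGTH + 1)
        sub(/ +$/, "", grp)                     # trim the trailing blanks
        items = substr(grp, index(grp, "=") + 1)
        # Inner loop: one pass per iteration of group 2 (an item plus comma).
        while (items != "" && match(items, /^[^,]*,?/)) {
            n++
            reg1[n] = grp                       # snapshot of register 1
            reg2[n] = substr(items, 1, RLENGTH) # snapshot of register 2
            items = substr(items, RLENGTH + 1)
        }
    }
    # Newest snapshot first, so history index 0 is the latest match.
    for (h = 0; h < n; h++)
        print h, reg1[n - h], reg2[n - h]
}')
printf '%s\n' "$out"
```

    Each printed line is one history row: the history index, then the
    snapshots of registers 1 and 2, with the stray commas still attached
    to the items.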

    This gives us the raw trace snapshot data from which a tree could be
    built using a simple algorithm (say, still in the order of leftmost
    being more recent match):

                  "key=v1,v2,v3 foo=a,b"
                    /                \
             "foo=a,b"            "key=v1,v2,v3"
              /    \              /     |     \
            "b"    "a,"        "v3"   "v2,"   "v1,"

    This structure provides more logical access.

    Anyway, I feel this problem is better solved using approaches
    that avoid regexes, or that use regexes for just some low-level
    tokenizing.
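
    As one possible illustration of the regex-free route, here is a minimal
    sketch in POSIX awk (run from the shell) that parses the same sample
    line with split() and index() only. The variable names (group, val,
    key) are made up for the example, and it assumes that keys and values
    contain no blanks, "=" or "," of their own.

```shell
out=$(awk 'BEGIN {
    line = "key=v1,v2,v3 foo=a,b"
    ngroups = split(line, group, " ")       # one element per key=... group
    for (g = 1; g <= ngroups; g++) {
        eq = index(group[g], "=")           # position of the first "="
        key = substr(group[g], 1, eq - 1)
        nvals = split(substr(group[g], eq + 1), val, ",")
        for (v = 1; v <= nvals; v++)
            print key "[" v "] = " val[v]
    }
}')
printf '%s\n' "$out"
```

    The nesting of the two loops mirrors the tree directly, and no stray
    commas end up in the values.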

    With my above regex, there are stray commas in the items,
    because they had to be included in the repetition, and there
    is no nice way to exclude them without adding another level
    of parentheses.

    Each time we play with the parentheses, we radically change
    the structure and size of the output.

    It just ends up a wrongheaded academic exercise.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Sun Apr 13 12:52:27 2025
    From Newsgroup: comp.lang.awk

    On 4/10/2025 8:07 PM, Ed Morton wrote:
    On 4/10/2025 2:09 AM, Janis Papanagnou wrote:
    On 10.04.2025 09:06, Janis Papanagnou wrote:
    I'm looking for subexpressions of regexp-matches using GNU Awk's
    third parameter of match(). For example

       data = "R=r1,R=r2,R=r3,E=e"
       match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

    The result stored in 'arr' seems to be determined by the static
    parenthesis structure, so with the pattern repetition {2,5} only
    the last matched data in the subexpression (r3) seems to persist
    in arr. - I suppose there's no cute way to achieve what I wanted?

    To clarify; what I wanted is access of the values "r1", "r2", "r3",
    and "e" through 'arr'.

    Correct, you can't do what you want using just `match()`; it simply
    matches a regexp with capture groups against a string, just like sed
    does.

    There are, of course, several other ways to get `arr[]` populated the
    way you want, e.g. split(), patsplit(), while(match()), or dynamically
    generating the regexp. The best one to choose will depend on the real
    values that r1, etc. can have; for example, it'd be hard to use split()
    if `r1` can be a quoted string that might itself contain similar
    substrings such as `data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"`.

    FWIW, probably more for the benefit of any awk newcomers reading this,
    if your data really could have quoted fields (otherwise a simple
    `split(data, arr, ",")` is all you need) then, assuming they follow the
    same quoting rules as for CSVs, I'd use either of these or similar.
    With GNU awk (for `patsplit()`):

    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
    delete arr
    for ( i in arr ) {
    sub(/[^=]+=/, "", arr[i])
    }

    or any awk:

    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = 0
    delete arr
    # two-argument match() only; the array argument is a gawk extension
    while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
        arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
        data = substr(data, RSTART+RLENGTH)
    }

    either of which would populate `arr[]` with:

    "R=r1,R=r2"
    r2
    r3
    e

    and set `nf` to the number of entries in `arr[]`.
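
    For the simpler case without quoted fields, a minimal sketch in POSIX
    awk (run from the shell) could look like the following. It assumes that
    no value contains a comma and that every field has the form TAG=value;
    the final loop only prints the result for inspection.

```shell
out=$(awk 'BEGIN {
    data = "R=r1,R=r2,R=r3,E=e"
    nf = split(data, arr, ",")          # arr[1] = "R=r1", ..., arr[4] = "E=e"
    for (i = 1; i <= nf; i++)
        sub(/^[^=]*=/, "", arr[i])      # drop the leading "R=" or "E=" tag
    for (i = 1; i <= nf; i++)
        printf "%s%s", arr[i], (i < nf ? " " : "\n")
}')
printf '%s\n' "$out"
```

    This leaves "r1", "r2", "r3" and "e" in arr[1] through arr[4].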

    Regards,

    Ed.
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Mon Apr 14 18:20:31 2025
    From Newsgroup: comp.lang.awk

    In article <vtgtkr$3br8e$1@dont-email.me>,
    Ed Morton <mortonspam@gmail.com> wrote:
    ...
    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
    delete arr
    for ( i in arr ) {
    sub(/[^=]+=/, "", arr[i])
    }

    This can't be right, since the sequence:
    delete arr
    for (i in arr) ...
    can't possibly do anything. I.e., the for statement will be a no-op, since
    the array is empty at that point.

    or any awk:

    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = 0
    delete arr
    while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/, a) ) {
    arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
    data = substr(data, RSTART+RLENGTH)
    }

    I believe "delete arr" (without an index, hence removing the entire array)
    is an "extension". I can't quite quote chapter and verse, but I note that
    "man mawk" explicitly mentions that mawk supports this syntax, thereby
    implying that it isn't "standard". Of course, gawk supports it as well.

    So, if by "any awk", you mean "strictly standard", then, well, you can
    see where I am going with this.
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Apr 14 20:53:01 2025
    From Newsgroup: comp.lang.awk

    On 14.04.2025 20:20, Kenny McCormack wrote:
    In article <vtgtkr$3br8e$1@dont-email.me>,
    Ed Morton <mortonspam@gmail.com> wrote:
    ...
    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
    delete arr
    for ( i in arr ) {
    sub(/[^=]+=/, "", arr[i])
    }

    This can't be right, since if the sequence:
    delete arr
    for (i in arr) ...
    can't possibly do anything. I.e., the for statement will be a no-op, since
    the array is empty at that point.

    or any awk:

    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = 0
    delete arr
    while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/, a) ) {
    arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
    data = substr(data, RSTART+RLENGTH)
    }

    I believe "delete arr" (without an index, hence removing the entire array)
    is an "extension". I can't quite quote chapter and verse, but I note that
    "man mawk" explicitly mentions that mawk supports this syntax, thereby
    implying that it isn't "standard". Of course, gawk supports it as well.

    So, if by "any awk", you mean "strictly standard", then, well, you can
    see where I am going with this.

    I seem to recall that a standard way to clear an array could be using
    split("", arr)
    for example. To my taste it looks a bit clumsy, not as nice as using
    'delete', but well, whatever one prefers.

    Janis

  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Mon Apr 14 18:55:22 2025
    From Newsgroup: comp.lang.awk

    On 4/14/2025 1:53 PM, Janis Papanagnou wrote:
    On 14.04.2025 20:20, Kenny McCormack wrote:
    In article <vtgtkr$3br8e$1@dont-email.me>,
    Ed Morton <mortonspam@gmail.com> wrote:
    ...
    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
    delete arr
    for ( i in arr ) {
    sub(/[^=]+=/, "", arr[i])
    }

    This can't be right, since if the sequence:
    delete arr
    for (i in arr) ...
    can't possibly do anything. I.e., the for statement will be a no-op, since
    the array is empty at that point.

    Yeah, remove that `delete arr`, it's not necessary since `patsplit()`
    will delete `arr` before populating it and `delete arr` in that location
    would break the code.


    or any awk:

    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = 0
    delete arr
    while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/, a) ) {
    arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
    data = substr(data, RSTART+RLENGTH)
    }

    I believe "delete arr" (without an index, hence removing the entire array)
    is an "extension". I can't quite quote chapter and verse, but I note that
    "man mawk" explicitly mentions that mawk supports this syntax, thereby
    implying that it isn't "standard". Of course, gawk supports it as well.

    `delete arr` is defined by the current POSIX standard
    (https://pubs.opengroup.org/onlinepubs/9799919799/utilities/awk.html) as
    equivalent to `for (index in array) delete array[index]`, but for years
    prior to that [almost?] every maintained awk supported `delete arr` anyway.


    So, if by "any awk", you mean "strictly standard", then, well, you can
    see where I am going with this.

    I seem to recall that a standard way to clear an array could be using
    split("", arr)

    `split("", arr)` was the de facto "standard" way to delete an array's
    content without looping before `delete arr` was adopted by POSIX. In all
    seriousness, if anyone is using an awk that doesn't support `delete arr`
    then they need to get a new awk, as who knows what other features it
    might be lacking.
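
    To make the comparison concrete, here is a small sketch in awk (run from
    the shell) that clears an array both ways and counts the remaining
    elements each time. The helper function count() is made up for the
    example, and the `delete arr` line assumes an awk that supports
    whole-array delete.

```shell
out=$(awk 'function count(a,    k, n) { for (k in a) n++; return n + 0 }
BEGIN {
    arr["x"] = 1; arr["y"] = 2
    before = count(arr)
    split("", arr)              # traditional idiom: splitting "" empties arr
    after_split = count(arr)
    arr["z"] = 3
    delete arr                  # whole-array delete
    after_delete = count(arr)
    print before, after_split, after_delete
}')
printf '%s\n' "$out"
```

    Both forms leave the array empty, so the choice is about portability
    and readability only.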

    Ed.

    for example. To my taste it looks a bit clumsy, not as nice as using
    'delete', but well, whatever one prefers.

    Janis


  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Tue Apr 15 05:35:14 2025
    From Newsgroup: comp.lang.awk

    On 15.04.2025 01:55, Ed Morton wrote:

    In all seriousness if anyone is using an awk that doesn't support
    `delete arr` then they need to get a new awk as who knows what other
    features it might be lacking.

    One nice property of Awk was that for decades its powerful features
    and its kernel functionality persisted and fancy features were not
    necessary to make good use of that tool.

    Janis


    Ed.


  • From Manuel Collado@mcollado2011@gmail.com to comp.lang.awk on Fri Apr 18 12:03:15 2025
    From Newsgroup: comp.lang.awk

    El 11/4/25 a las 9:10, Janis Papanagnou escribió:
    On 11.04.2025 08:33, Aharon Robbins wrote:
    In article <vt9dre$3t3po$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    The feature can be very useful,
    but not for the case I was looking for. - Actually, it could have
    provided the functionality I was seeking, but since GNU Awk relies
    on the GNU regexp functions as they are implemented I cannot expect
    that any provided features gets extended by Awk. - If GNU Awk would
    have an own RE implementation then we could think about using, e.g.,
    another array dimension to store the (now only temporary existing,
    and generally unavailable) subexpressions.

    Actually, this is not so trivial. The data structures at the C level
    as mandated by POSIX are one dimensional; the submatches in parentheses
    are counted from left to right. There's no way to represent the
    subexpressions that are under control of interval expressions, which
    would essentially require a two-dimensional data structure.

    Yes, that's why I had thought about a 2-dimensional array [on GNU
    Awk level] so that arr[n][i] for i=1..z would contain the patterns.
    This is what I actually tried with GNU Awk (before I had asked you)
    to see whether there's some undocumented feature.

    A 2-dimensional array is not strictly necessary. It could be possible to
    keep the one-dimensional array interface and use the same trick as for
    multidimensional array indices in POSIX AWK, i.e., return a list of
    matched values delimited by SUBSEP.
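
    A sketch of how calling code might consume such a result, in POSIX awk
    run from the shell. This is purely hypothetical: match() does not
    return SUBSEP-joined lists, so the element arr[2] is built by hand here
    to stand in for what an extended match() would store, and the names
    iter and n are invented for the example.

```shell
out=$(awk 'BEGIN {
    # Hypothetical result: every iteration of group 2 in one element,
    # joined by SUBSEP, as an extended match() might store it.
    arr[2] = "r1" SUBSEP "r2" SUBSEP "r3"
    n = split(arr[2], iter, SUBSEP)     # unpack the iterations
    for (i = 1; i <= n; i++)
        printf "%s%s", iter[i], (i < n ? " " : "\n")
}')
printf '%s\n' "$out"
```

    The caller pays one extra split() per group, and the scheme relies on
    SUBSEP never occurring in the matched text.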

    Just my 2c.
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Fri Apr 18 12:01:18 2025
    From Newsgroup: comp.lang.awk

    In article <vtt813$2ovai$1@dont-email.me>,
    Manuel Collado <mcollado2011@gmail.com> wrote:
    ...
    A 2-dimensional array is not strictly necessary. It could be possible to
    keep the one dimensional array interface and use the same trick for
    multidimensional arrays indices in Posix AWK. I.e., return a list of
    matched values delimited by SUBSEP.

    But why would you want to?

    GAWK has multidimensional arrays; they should be used.