Forum: War Ensemble BBS

Experiences with match() subexpressions?

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 09:06:34 2025

From Newsgroup: comp.lang.awk

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

Janis
--- Synchronet 3.20c-Linux NewsLink 1.2

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 09:09:55 2025

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

Janis


From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Thu Apr 10 11:08:55 2025

In article <vt7qs4$2gior$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

I have to admit that I (still) don't really understand how this match third
arg stuff works. I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to
use it.

I adapted your code into the following test script:

--- Cut Here ---
#!/bin/sh
gawk 'BEGIN {
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
for (i in arr) print i,arr[i]
}'

# To clarify; what I wanted is access of the values "r1", "r2", "r3",
# and "e" through 'arr'.
--- Cut Here ---

The output I get is:

--- Cut Here ---
0start 1
0length 18
3start 18
1start 11
2start 13
3length 1
2length 2
1length 5
0 R=r1,R=r2,R=r3,E=e
1 R=r3,
2 r3
3 e
--- Cut Here ---

After playing around a bit, I could not come up with any sensible way of getting what you want to get.

As an alternative, it sounds like you could just split the
string on the comma; that would get you:

R=r1
R=r2
R=r3
E=e

Or, for finer control, you could use patsplit().
--
The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Reaganomics

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 13:55:07 2025

On 10.04.2025 13:08, Kenny McCormack wrote:

In article <vt7qs4$2gior$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

I have to admit that I (still) don't really understand how this match third arg stuff works.

I've never used that before but it seems to be quite simple; for every parenthesis group expression in the regexp it provides (statically, as
the parentheses are written, from left to right) an array element with
the expanded matched subexpression.

I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to use it.

I adapted your code into the following test script:

--- Cut Here ---
#!/bin/sh
gawk 'BEGIN {
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
for (i in arr) print i,arr[i]
}'

# To clarify; what I wanted is access of the values "r1", "r2", "r3",
# and "e" through 'arr'.
--- Cut Here ---

The output I get is:

--- Cut Here ---
0start 1
0length 18
3start 18
1start 11
2start 13
3length 1
2length 2
1length 5

Above output stuff appears because in 'arr' there's additional elements
about the pattern positions stored.

I don't need that so I'm just interested in the data patterns below and
iterate with a index-counted loop...

0 R=r1,R=r2,R=r3,E=e

the whole expression

1 R=r3,

the expression in the first parenthesis

2 r3

the expression in the second, embedded parenthesis

3 e

the expression in the final parenthesis

--- Cut Here ---

After playing around a bit, I could not come up with any sensible way of getting what you want to get.

Yeah, Arnold just told me the same; that it's impossible because the
underlying GNU regexp library doesn't support what I'm looking for.

What I considered a possible workaround (in this case) is to sequence
the (...){2,5} expression by using sequences of (...)? expressions.
(But in the general case, for larger ranges than 2-5, that's neither
feasible nor sensible any more.)

As an alternative, it sounds like you could just split the
string on the comma; that would get you:

Yes, that was also how I did such things in the past. Only when I saw
that "third argument" to match() I hoped the two-level parsing could
be simplified in one step. The reason was that I thought to have seen
other languages (Perl, maybe?) that supported such a feature.

R=r1
R=r2
R=r3
E=e

Or, for finer control, you could use patsplit().

I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.
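
For reference, the two-step parsing described here might be sketched as
follows. This is only an illustration, assuming the values never contain a
comma; the function name parse_pairs is made up for this example, and
nothing beyond POSIX awk is used.

```sh
#!/bin/sh
# Hypothetical helper (name chosen for illustration only):
# step 1 splits on ",", step 2 separates each item at its first "=".
parse_pairs() {
    awk -v data="$1" 'BEGIN {
        n = split(data, item, ",")            # R=r1 / R=r2 / R=r3 / E=e
        for (i = 1; i <= n; i++) {
            eq  = index(item[i], "=")         # position of the first "="
            key = substr(item[i], 1, eq - 1)
            val = substr(item[i], eq + 1)
            print key, val
        }
    }'
}

parse_pairs "R=r1,R=r2,R=r3,E=e"
```

Each output line carries the key and the value, so the caller can still
tell the R entries from the final E entry.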

Janis


From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Thu Apr 10 14:04:46 2025

In article <vt8bit$2uiq5$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...

I have to admit that I (still) don't really understand how this match third
arg stuff works.

...

I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to
use it.

...

Above output stuff appears because in 'arr' there's additional elements
about the pattern positions stored.

Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
I understand the man page description of match's 3rd arg as well as anyone;
I just find that it doesn't do as much in practice as (I think) it
should - and that it is unpredictable (by me, anyway) what it will do (you
have to dump out the array and trial-and-error it to get it to do what you
want). It promises more than it delivers. I have much the same comments
to make about the similar functionality in Tcl (Expect).

None of which is criticism of the feature; as you say below, it basically
does as much as the underlying regexp library allows it to do.

...

I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.

Probably so. BTW, it is not really "GNU Awk specific"; lots of languages
have this general capability.

Incidentally, here is a function of mine that uses match's 3rd arg. I find
it useful. This addresses a common AWK issue, where you have a line with fields (in the usual AWK whitespace-delimited sense), but you need to know
the actual character positions of the fields (since they can move around
from line to line of input). Note also that I'm not really sure where the
name "splitMatch" came from; it was just what popped into my head when I
was writing this...

--- Cut Here ---
# Find the character positions of each of the fields in string s.
# Note that s will usually be $0, and n will usually be NF.
function splitMatch(s,n,A, i,t) {
for (i=1; i<=n; i++) t = t "([^ \t]+)[ \t]*"
return match(s,t,A)
}
--- Cut Here ---
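
For readers without gawk, similar position information can be obtained
with a plain match() loop. Note that this is a different technique from
splitMatch above, not a rewrite of it: a rough sketch, assuming fields are
separated by spaces and tabs, with a made-up function name
(field_positions).

```sh
#!/bin/sh
# Hypothetical helper: print "start length text" for every
# whitespace-delimited field of the string given as $1.
field_positions() {
    awk -v s="$1" 'BEGIN {
        pos = 0                               # characters already consumed
        while (match(s, /[^ \t]+/)) {
            print pos + RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
            pos += RSTART + RLENGTH - 1       # advance past this field
            s = substr(s, RSTART + RLENGTH)   # keep only the unscanned rest
        }
    }'
}

field_positions "  foo bar   baz"
```

The start values are 1-based offsets into the original string, the same
convention RSTART uses.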
--
In the corner of the room on the ceiling is a large vampire bat who
is obviously deranged and holding his nose.

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Apr 10 23:39:57 2025

On 10.04.2025 16:04, Kenny McCormack wrote:

In article <vt8bit$2uiq5$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...

I have to admit that I (still) don't really understand how this match third
arg stuff works.

...

I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to
use it.

...

Above output stuff appears because in 'arr' there's additional elements
about the pattern positions stored.

Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
I understand the man page description of match's 3rd arg as well as anyone;

(I didn't mean to offend you. Sorry, if it appeared so. - I just
read you writing "I don't really understand how this [...] works",
and that "it is unpredictable", so I thought some descriptive words
may be useful.)

I just find that it doesn't do as much in practice as (I think) it
should - and that it is unpredictable (by me, anyway) what it will do (you
have to dump out the array and trial-and-error it to get it to do what you
want).

It is pretty understandable to me, and not the least unpredictable.
(That's why I thought it would be okay to write what I had written
to explain it.) I don't understand what you find to be unpredictable.
But never mind.

It promises more than it delivers.

Yes, probably. Although, according to what's literally documented,
it doesn't promise too much, IMO. The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided features gets extended by Awk. - If GNU Awk would
have an own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporary existing,
and generally unavailable) subexpressions.

[...]

None of which is criticism of the feature; as you say below, it basically does as much as the underlying regexp library allows it to do.

...

I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.

Probably so. BTW, it is not really "GNU Awk specific"; lots of languages have this general capability.

Oh, I was just trying to say that for my programming the standard Awk
functions (as opposed to GNU Awk _specific_ functions) are fine here.
(That should not disdain all the useful GNU Awk extensions existing.)

Janis

[...]


From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Thu Apr 10 20:07:35 2025

On 4/10/2025 2:09 AM, Janis Papanagnou wrote:

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

Correct, you can't do what you want using just `match()`, it's simply
matching a regexp with capture groups against a string, just like sed does.

There are, of course, several other ways to get `arr[]` populated the
way you want, e.g. split(), patsplit(), while(match()), or dynamically
generating the regexp. The best one to choose will depend on the real
values that r1, etc. can have, for example it'd be hard to use split()
if `r1` can be a quoted string that might itself contain similar
substrings such as `data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"`.

Ed.

From arnold@arnold@freefriends.org (Aharon Robbins) to comp.lang.awk on Fri Apr 11 06:33:19 2025

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided features gets extended by Awk. - If GNU Awk would
have an own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporary existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.

Arnold

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Apr 11 09:10:55 2025

On 11.04.2025 08:33, Aharon Robbins wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided features gets extended by Awk. - If GNU Awk would
have an own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporary existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.

(I'm aware that things may get quite complicated if there's some
restrictions imposed (on "C"-level or else) which are in the way.)

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.

Oh, thanks for that. - My expectation had just been to check whether
such a feature is already available in GNU Awk (or could in a simple
way be made available with little effort). So I'm indeed interested
to hear whether that is a feasible and sensible feature. - Myself,
I have to admit, haven't yet thoroughly thought through about such a
feature. I've just seen it from the limited view of my application
context and thought it could be a worthwhile extension/generalization.

Janis

Arnold


From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.awk on Fri Apr 11 07:40:01 2025

On 2025-04-11, Aharon Robbins <arnold@freefriends.org> wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided features gets extended by Awk. - If GNU Awk would
have an own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporary existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.

Unix and POSIX regular expressions have perpetrated a kind of
misfeature. They took the purely algebraic parentheses described in
classic literature on regular expressions, whose only role is to
override the precedence and associativity of operators, and turned them
into active operators that perform a double duty: they still override precedence, but also denote submatches associated with capture
registers.

Parentheses are enumerated and made to correspond with numbered capture registers, I think, as follows:

( ( ) ( ( ) ) )
1 2   3 4

Scanning left to right, we identify the open left parentheses
which have matching closing parentheses, and number these in order
starting from 1.
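
The numbering rule can be sketched as a small scanner. This is only an
illustration written in POSIX awk: the function name number_groups is made
up, and the scanner assumes ERE syntax, skipping backslash-escaped
characters and simple bracket expressions (it does not treat character
classes such as [[:alpha:]] specially).

```sh
#!/bin/sh
# Hypothetical helper: print "group-number position" for every capturing
# "(" in the regexp passed as $1, numbered left to right starting at 1.
number_groups() {
    awk -v re="$1" 'BEGIN {
        n = 0; in_br = 0
        for (i = 1; i <= length(re); i++) {
            c = substr(re, i, 1)
            if (in_br) {
                # "]" ends the bracket expression unless it is its first member
                if (c == "]" && i > br_first) in_br = 0
                continue
            }
            if (c == "\\") { i++; continue }   # skip the escaped character
            if (c == "[") {
                in_br = 1
                br_first = i + 1
                if (substr(re, br_first, 1) == "^") br_first++
                continue
            }
            if (c == "(") print ++n, i         # group number, position of "("
        }
    }'
}

number_groups '^(R=([^,]+),){2,5}E=(.+)$'
```

For the regexp from the start of this thread, this reports three groups,
which correspond to the indices 1, 2 and 3 seen in the dumped array.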

There is a convention that capture register 0 is reserved for
the full match for the expression. This is how it is with
the array reported by POSIX's regexec. Thus the numbering is
one based.

The POSIX standard clearly says what happens when a parenthesized
subexpression matches something more than once.

This is spelled out in the documentation page on the regcomp,
regexec and regfree functions. Look for this text:

"If subexpression i in a regular expression is not contained within
another subexpression, and it participated in the match several times,
then the byte offsets in pmatch[i] shall delimit the last such match."

This is exactly the last match behavior observed by Janis in Awk's
match function.

Basically, subexpressions are a dumb hack. As the regex automaton
traverses through its states in response to the input, it triggers
some anchor points associated with the original subexpression,
which copy some data, or keep track of some pointers to the start and
end of the match. When the submatch is complete, there is a data
transfer which clobbers any previous such data transfer.

There are some tricky rules for nested expressions.
Suppose that we have:

( ... ( ... ) ...)
1     2

2 is nested inside 1. Suppose that 1 matches multiple times.
Clearly, the corresponding register is left with the most
recent match when the matching is done.

But suppose that subexpression 2 sometimes matches when 1
matches, but sometimes does not match when 1 matches.

I think the obscurely worded POSIX rules are trying to prevent an inconsistency.

In a nutshell, if a string is reported in register 2 from
matching subexpression 2, it has to be a substring of a match that is concurrently happening for subexpression 1.

Now suppose that an iteration of 1 matches something,
but in that iteration, subexpression 2 does not match.
Then 2 has to be reset to indicate that it didn't match anything.

Probably, it's a good idea to implement the behavior as follows: whenever a
new capture iteration begins for 1, the register for 2 must also be
cleared, so that it doesn't retain stale data in the event that a match
for 2 is not encountered in the new iteration of 1.

This stuff is not really that usable for repetition; captures
were clearly envisioned mainly for non-repeating matching without
any Kleene stars or {m, n} repetitions.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.awk on Fri Apr 11 08:22:44 2025

On 2025-04-11, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 11.04.2025 08:33, Aharon Robbins wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided features gets extended by Awk. - If GNU Awk would
have an own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporary existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.

I solved this problem 15 years ago in the TXR Pattern Language

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e'
r[0]="r1"
r[1]="r2"
r[2]="r3"
e="e"

We can eval the output into Bash and have a ${r[@]} array.

We can see the captured variables in a Lisp format:

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -l -c '@(coll)R=@r,@(until)E@(end)E=@e'
(r "r1" "r2" "r3")
(e . "e")

The matches occurring in repetition constructs like @(coll) or its
vertical, line-oriented counterpart @(collect), are automatically
tabulated into lists.

We can see that the "e" variable wasn't; it is string valued,
rather than list valued.

One possibility is to use the @(merge dest {sources}*) directive which
examines different nesting depths of its operands and
intelligently combines them.

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e
@(merge x r e)'
r[0]="r1"
r[1]="r2"
r[2]="r3"
e="e"
x[0]="r1"
x[1]="r2"
x[2]="r3"
x[3]="e"

$ echo 'R=r1,R=r2,R=r3,E=e' | txr -B -c '@(coll)R=@r,@(until)E@(end)E=@e
@(merge x r e)
@(forget r e)'
x[0]="r1"
x[1]="r2"
x[2]="r3"
x[3]="e"

A plethora of techniques are possible.

In Lisp, split the data along commas, then again on =

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ","))
("R=r1" "R=r2" "R=r3" "E=e")

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (op spl "=")))
(("R" "r1") ("R" "r2") ("R" "r3") ("E" "e"))

Or pattern match the comma splits:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (do match `@key=@val` @1 (list key val))))
(("R" "r1") ("R" "r2") ("R" "r3") ("E" "e"))

Just the R's please

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (do if-match `R=@val` @1 val)))
("r1" "r2" "r3" nil)

Splice out the nils:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (mappend (do if-match `R=@val` @1 (list val))))
("r1" "r2" "r3")

Or remove them:

(flow "R=r1,R=r2,R=r3,E=e"
  (spl ",")
  (map (do if-match `R=@val` @1 val))
  (remq nil))

Heck, use a Lispified Awk. The variable f holds
the fields. When we assign f to itself, that
forces the recalculation of variable rec with
the ofs:

(awk (:inputs '("R=r1,R=r2,R=r3,E=e"))
  (:set fs "," ofs ":")
  (t (set f f) (prn)))
R=r1:R=r2:R=r3:E=e
nil

Use two Awks, nested inside each other: inner Awk
processes the fields f produced by the outer Awk:

(awk (:inputs '("R=r1,R=r2,R=r3,E=e"))
  (:set fs "," ofs ":")
  (t (awk (:inputs f)
       (:set fs "=")
       (t (prn [f 1])))))
r1
r2
r3
e
nil
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Fri Apr 11 08:57:22 2025

In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
Aharon Robbins <arnold@freefriends.org> wrote:
...

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

Just out of curiosity, does the new matcher address the issue raised by
Janis?

It sounds like you are implying that it does, but do not say so explicitly.

Again, just curiosity. I remember when you announced the new matcher, and
it sounded interesting, but the presentation left me wondering the usual question(s): Why should I care? (Why should I get excited about this?)

Incidentally, I remember that the primary issue with the new matcher was
that it was written in C++. It needs to be C-ified in order to be included
in a GAWK release version.
--
If the automobile had followed the same development cycle as the
computer, a Rolls-Royce today would cost $100, get a million miles to
the gallon, and explode once every few weeks, killing everyone inside.

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Apr 11 15:50:11 2025

On 11.04.2025 10:57, Kenny McCormack wrote:

In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
Aharon Robbins <arnold@freefriends.org> wrote:
...

Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.

Just out of curiosity, does the new matcher address the issue raised by Janis?

I read his post as if he put it under discussion ("I just opened an
issue, [...] about this question. We shall see what develops.") and
the provided link shows this as well.[*]

(I don't see the answers, though, since my browser obviously doesn't
support the web-page's (dynamic?) format. - So I cannot tell what the
state of that discussion is.)

It sounds like you are implying that it does, but do not say so explicitly.

[...]

Janis

[*] From https://github.com/mikehaertel/minrx/issues/43:

So there are two questions.

Is it theoretically possible to capture all the instances of
subexpressions matched by the interval expression?

Can this be brought out into the code? I understand it would take an extended API with a richer data structure in order to do this. gawk's
extended version of the match() function could then be (somehow)
extended to take advantage of this feature.


From Kaz Kylheku@643-408-1753@kylheku.com to comp.lang.awk on Fri Apr 11 17:54:07 2025

On 2025-04-11, Aharon Robbins <arnold@freefriends.org> wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided features gets extended by Awk. - If GNU Awk would
have an own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporary existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Here is what I believe is the right requirement, if you want repeatedly
visited subexpressions to capture all their iterations.

The dimensionality has to be such that the entire array of matches is
versioned as a whole.

In other words, abstractly, we have

matches[history][register]

where history counts from 0, that being the latest matches.
register also goes from zero; [0] is the match for the entire
expression, [1] for subexpression 1 and so on.

Any time there is a repetition in any subexpression, matches[0]
is duplicated and pushed into the history.

We can imagine the matches[h][0..(n-1)] giving a trace of the
matches through the tree of subexpressions, from root to leaf.
Each time something is matched, the entire trace is recorded
in the history, so everything is consistent.

Say we want to parse the syntax

key=v1,v2,v3 foo=a,b

using something like:

([^ =]+=([^ ,]*,?)* *)*
1       2

Then we have the subgroups 1 and 2. We would like to end up with
a two dimensional match array like this:

match[hist][reg] =

                      reg
hist  0                     1             2
0     key=v1,v2,v3 foo=a,b  foo=a,b       b
1     key=v1,v2,v3 foo=a,b  foo=a,b       a,
2     key=v1,v2,v3 foo=a,b  key=v1,v2,v3  v3
3     key=v1,v2,v3 foo=a,b  key=v1,v2,v3  v2,
4     key=v1,v2,v3 foo=a,b  key=v1,v2,v3  v1,

This gives us the raw trace snapshot data from which a tree could be
built using a simple algorithm (say, still in the order of leftmost
being the more recent match):

         "key=v1,v2,v3 foo=a,b"
            /              \
     "foo=a,b"          "key=v1,v2,v3"
      /    \             /    |    \
    "b"   "a,"       "v3"  "v2,"  "v1,"

This structure provides more logical access.

Anyway, I feel this problem is better solved using approaches
that avoid regexes, or that use regexes for just some low-level
tokenizing.
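
As a rough illustration of the regex-free route for the key=v1,v2,v3
syntax above, here is a sketch in POSIX awk. It assumes groups are
separated by blanks and values by commas; the function name parse_groups
and the output format are made up for this example.

```sh
#!/bin/sh
# Hypothetical helper: two nested split() calls instead of one regexp
# with nested, repeated capture groups.
parse_groups() {
    awk -v data="$1" 'BEGIN {
        ng = split(data, group, " ")          # level 1: the key=... groups
        for (g = 1; g <= ng; g++) {
            eq  = index(group[g], "=")
            key = substr(group[g], 1, eq - 1)
            nv  = split(substr(group[g], eq + 1), val, ",")   # level 2
            for (v = 1; v <= nv; v++)
                print key "[" v "] = " val[v]
        }
    }'
}

parse_groups "key=v1,v2,v3 foo=a,b"
```

Since the separators are consumed by split(), the values come out without
stray commas and in input order.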

With my above regex, there are stray commas in the items,
because they had to be included in the repetition, and there
is no nice way to exclude them without adding another level
of parentheses.

Each time we play with the parentheses, we radically change
the structure and size of the output.

It just ends up a wrongheaded academic exercise.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Sun Apr 13 12:52:27 2025

From Newsgroup: comp.lang.awk

On 4/10/2025 8:07 PM, Ed Morton wrote:

On 4/10/2025 2:09 AM, Janis Papanagnou wrote:

On 10.04.2025 09:06, Janis Papanagnou wrote:

I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example

data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)

The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?

To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.

Correct, you can't do what you want using just `match()`; it simply matches a regexp with capture groups against a string, just like sed does.

There are, of course, several other ways to get `arr[]` populated the
way you want, e.g. split(), patsplit(), while(match()), or dynamically
generating the regexp. The best one to choose will depend on the real
values that r1, etc. can have; for example, it'd be hard to use split()
if `r1` can be a quoted string that might itself contain similar
substrings such as `data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"`.

FWIW, probably more for the benefit of any awk newcomers reading this,
if your data really could have quoted fields (otherwise a simple
`split(data, arr, ",")` is all you need) then, assuming they follow the same
quoting rules as for CSVs, I'd use either of these or similar with GNU
awk (for `patsplit()`):

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
delete arr
for ( i in arr ) {
sub(/[^=]+=/, "", arr[i])
}

or any awk:

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = 0
delete arr
while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
data = substr(data, RSTART+RLENGTH)
}

either of which would populate `arr[]` with:

"R=r1,R=r2"
r2
r3
e

and set `nf` to the number of entries in `arr[]`.

Regards,

Ed.
--- Synchronet 3.20c-Linux NewsLink 1.2

From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Mon Apr 14 18:20:31 2025

From Newsgroup: comp.lang.awk

In article <vtgtkr$3br8e$1@dont-email.me>,
Ed Morton <mortonspam@gmail.com> wrote:
...

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
delete arr
for ( i in arr ) {
sub(/[^=]+=/, "", arr[i])
}

This can't be right, since the sequence:
delete arr
for (i in arr) ...
can't possibly do anything. I.e., the for statement will be a no-op, since
the array is empty at that point.

or any awk:

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = 0
delete arr
while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
data = substr(data, RSTART+RLENGTH)
}

I believe "delete arr" (without an index, hence removing the entire array)
is an "extension". I can't quite quote chapter and verse, but I note that
"man mawk" explicitly mentions that mawk supports this syntax, thereby
implying that it isn't "standard". Of course, gawk supports it as well.

So, if by "any awk", you mean "strictly standard", then, well, you can see where I am going with this.
--
"Only a genius could lose a billion dollars running a casino."
"You know what they say: the house always loses."
"When life gives you lemons, don't pay taxes."
"Grab 'em by the p***y!"
--- Synchronet 3.20c-Linux NewsLink 1.2

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Apr 14 20:53:01 2025

From Newsgroup: comp.lang.awk

On 14.04.2025 20:20, Kenny McCormack wrote:

In article <vtgtkr$3br8e$1@dont-email.me>,
Ed Morton <mortonspam@gmail.com> wrote:
...

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
delete arr
for ( i in arr ) {
sub(/[^=]+=/, "", arr[i])
}

This can't be right, since the sequence:
delete arr
for (i in arr) ...
can't possibly do anything. I.e., the for statement will be a no-op, since the array is empty at that point.

or any awk:

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = 0
delete arr
while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
data = substr(data, RSTART+RLENGTH)
}

I believe "delete arr" (without an index, hence removing the entire array)
is an "extension". I can't quite quote chapter and verse, but I note that "man mawk" explicitly mentions that mawk supports this syntax, thereby implying that it isn't "standard". Of course, gawk supports it as well.

So, if by "any awk", you mean "strictly standard", then, well, you can see where I am going with this.

I seem to recall that a standard way to clear an array could be using
split("", arr)
for example. To my taste it looks a bit clumsy, not as nice as using
'delete', but well, whatever one prefers.

Janis

--- Synchronet 3.20c-Linux NewsLink 1.2

From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Mon Apr 14 18:55:22 2025

From Newsgroup: comp.lang.awk

On 4/14/2025 1:53 PM, Janis Papanagnou wrote:

On 14.04.2025 20:20, Kenny McCormack wrote:

In article <vtgtkr$3br8e$1@dont-email.me>,
Ed Morton <mortonspam@gmail.com> wrote:
...

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
delete arr
for ( i in arr ) {
sub(/[^=]+=/, "", arr[i])
}

This can't be right, since the sequence:
delete arr
for (i in arr) ...
can't possibly do anything. I.e., the for statement will be a no-op, since the array is empty at that point.

Yeah, remove that `delete arr`, it's not necessary since `patsplit()`
will delete `arr` before populating it and `delete arr` in that location
would break the code.
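
With that line removed, the patsplit() variant could look like the sketch below. It assumes GNU awk is installed as `gawk`; the wrapper name `patsplit_values` is only for illustration, and the loop runs from 1 to `nf` so the values print in order.

```sh
patsplit_values() {
  gawk '{
    # patsplit() clears arr itself, so no "delete arr" is needed here.
    nf = patsplit($0, arr, /[RE]=([^,]*|"([^"]|"")*")/)
    for (i = 1; i <= nf; i++) {
      sub(/[^=]+=/, "", arr[i])   # drop the leading R= or E=
      print arr[i]
    }
  }'
}

# Requires gawk; the data is the quoted-field example from above.
command -v gawk >/dev/null 2>&1 &&
  printf '%s\n' 'R="R=r1,R=r2",R=r2,R=r3,E=e' | patsplit_values
```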

or any awk:

data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = 0
delete arr
while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
data = substr(data, RSTART+RLENGTH)
}

I believe "delete arr" (without an index, hence removing the entire array)
is an "extension". I can't quite quote chapter and verse, but I note that
"man mawk" explicitly mentions that mawk supports this syntax, thereby
implying that it isn't "standard". Of course, gawk supports it as well.

`delete arr` is defined by the current POSIX standard (https://pubs.opengroup.org/onlinepubs/9799919799/utilities/awk.html) as equivalent to `for (index in array) delete array[index]` but for years
prior to that [almost?] every maintained awk supported `delete arr` anyway.

So, if by "any awk", you mean "strictly standard", then, well, you can see
where I am going with this.

I seem to recall that a standard way to clear an array could be using
split("", arr)

`split("", arr)` was the de facto "standard" way to delete an array's
content without looping before `delete arr` was adopted by POSIX. In all
seriousness, if anyone is using an awk that doesn't support `delete arr`
then they need to get a new awk, as who knows what other features it
might be lacking.
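
For anyone stuck on such an awk anyway, here is a minimal sketch of the match() loop that uses `split("", arr)` to empty the array and only the two-argument form of match(), which sets RSTART and RLENGTH. The wrapper name `extract_values` is made up for this example, and it relies on the awk doing leftmost-longest matching as POSIX specifies.

```sh
extract_values() {
  awk '{
    data = $0
    nf = 0
    split("", arr)        # empties arr without "delete arr"
    while (match(data, /[RE]=([^,]*|"([^"]|"")*")/)) {
      # Skip the two-character R= or E= prefix of the matched text.
      arr[++nf] = substr(data, RSTART + 2, RLENGTH - 2)
      data = substr(data, RSTART + RLENGTH)
    }
    for (i = 1; i <= nf; i++) print arr[i]
  }'
}

printf '%s\n' 'R="R=r1,R=r2",R=r2,R=r3,E=e' | extract_values
```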

Ed.

for example. To my taste it looks a bit clumsy, not as nice as using 'delete', but well, whatever one prefers.

Janis

--- Synchronet 3.20c-Linux NewsLink 1.2

From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Tue Apr 15 05:35:14 2025

From Newsgroup: comp.lang.awk

On 15.04.2025 01:55, Ed Morton wrote:

In all seriousness if anyone is using an awk that doesn't support
`delete arr` then they need to get a new awk as who knows what other
features it might be lacking.

One nice property of Awk was that for decades its powerful features
and its kernel functionality persisted and fancy features were not
necessary to make good use of that tool.

Janis

Ed.

--- Synchronet 3.20c-Linux NewsLink 1.2

From Manuel Collado@mcollado2011@gmail.com to comp.lang.awk on Fri Apr 18 12:03:15 2025

From Newsgroup: comp.lang.awk

On 11/4/25 at 9:10, Janis Papanagnou wrote:

On 11.04.2025 08:33, Aharon Robbins wrote:

In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.

Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.

Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.

A 2-dimensional array is not strictly necessary. It would be possible to
keep the one-dimensional array interface and use the same trick as for
multidimensional array indices in POSIX awk, i.e., return a list of
matched values delimited by SUBSEP.
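
A minimal sketch of how a caller might consume such a result. The packed element is built by hand here, since no awk's match() currently returns values this way; the name `subsep_demo` and the choice of index 2 are purely illustrative.

```sh
subsep_demo() {
  awk 'BEGIN {
    # Hypothetical result: every value matched by the repeated group,
    # packed into a single element with SUBSEP between the values.
    arr[2] = "r1" SUBSEP "r2" SUBSEP "r3"
    # The caller unpacks the element with an ordinary split().
    n = split(arr[2], part, SUBSEP)
    for (i = 1; i <= n; i++) printf "%d:%s\n", i, part[i]
  }'
}

subsep_demo
```

The obvious limitation is the same as for SUBSEP-joined indices: a matched value that itself contains the SUBSEP character cannot be unpacked reliably.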

Just my 2c.
--- Synchronet 3.20c-Linux NewsLink 1.2

From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Fri Apr 18 12:01:18 2025

From Newsgroup: comp.lang.awk

In article <vtt813$2ovai$1@dont-email.me>,
Manuel Collado <mcollado2011@gmail.com> wrote:
...

A 2-dimensional array is not strictly necessary. It could be possible to
keep the one dimensional array interface and use the same trick for
multidimensional arrays indices in Posix AWK. I.e., return a list of
matched values delimited by SUBSEP.

But why would you want to?

GAWK has multidimensional arrays; they should be used.
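
To make that concrete, here is a sketch of filling a gawk array of arrays by hand today, peeling off one repetition per iteration. It assumes GNU awk installed as `gawk`; the wrapper name `collect_repeats` and the layout `arr[group][occurrence]` are assumptions for illustration, not an existing gawk feature of match().

```sh
collect_repeats() {
  gawk '{
    data = $0
    k = 0
    delete arr
    # Each pass consumes one leading "R=value," so every value the
    # repeated group would match is kept, not only the last one.
    while (match(data, /^R=([^,]+),/, m)) {
      arr[2][++k] = m[1]
      data = substr(data, RLENGTH + 1)
    }
    if (match(data, /^E=(.+)$/, m))
      arr[3][1] = m[1]
    for (i = 1; i <= k; i++) print "arr[2][" i "]=" arr[2][i]
    if (3 in arr) print "arr[3][1]=" arr[3][1]
  }'
}

# Requires gawk; the data is the string from the original question.
command -v gawk >/dev/null 2>&1 &&
  printf '%s\n' 'R=r1,R=r2,R=r3,E=e' | collect_repeats
```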
--
(Cruz certainly has an odd face) ... it looks like someone sewed pieces of a waterlogged Reagan mask together at gunpoint ...

http://www.rollingstone.com/politics/news/how-america-made-donald-trump-unstoppable-20160224
--- Synchronet 3.20c-Linux NewsLink 1.2
