I'm looking for subexpressions of regexp-matches using GNU Awk's
third parameter of match(). For example
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
The result stored in 'arr' seems to be determined by the static
parenthesis structure, so with the pattern repetition {2,5} only
the last matched data in the subexpression (r3) seems to persist
in arr. - I suppose there's no cute way to achieve what I wanted?
Janis
On 10.04.2025 09:06, Janis Papanagnou wrote:
[...]
To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.
In article <vt7qs4$2gior$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 10.04.2025 09:06, Janis Papanagnou wrote:
[...]
To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.
I have to admit that I (still) don't really understand how this match third arg stuff works.
I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to use it.
I adapted your code into the following test script:
--- Cut Here ---
#!/bin/sh
gawk 'BEGIN {
data = "R=r1,R=r2,R=r3,E=e"
match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
for (i in arr) print i,arr[i]
}'
# To clarify; what I wanted is access of the values "r1", "r2", "r3",
# and "e" through 'arr'.
--- Cut Here ---
The output I get is:
--- Cut Here ---
0start 1
0length 18
3start 18
1start 11
2start 13
3length 1
2length 2
1length 5
0 R=r1,R=r2,R=r3,E=e
1 R=r3,
2 r3
3 e
--- Cut Here ---
After playing around a bit, I could not come up with any sensible way of getting what you want to get.
As an alternative, it sounds like you could just split the
string on the comma; that would get you:
R=r1
R=r2
R=r3
E=e
Or, for finer control, you could use patsplit().
...I have to admit that I (still) don't really understand how this match third arg stuff works.
...I.e., I can never predict what will happen, so I always
just dump out the array and try to reverse-engineer it each time I need to use it.
The extra entries in the output above appear because 'arr' also stores
elements describing the position and length of each match.
I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.
In article <vt8bit$2uiq5$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...
Above output stuff appears because in 'arr' there's additional elements
about the pattern positions stored.
Just to clarify, I wasn't looking for a tutorial (man page regurgitation).
I understand the man page description of match's 3rd arg as well as anyone;
I just find that it doesn't do as much in practice as (I think) it
should, and that it is unpredictable (by me, anyway) what it will do: you have to dump out the array and use trial and error to get it to do what you want.
It promises more than it delivers.
[...]
None of which is criticism of the feature; as you say below, it basically does as much as the underlying regexp library allows it to do.
...
I think I'll do the parsing the straightforward two-step way as I did
before the GNU Awk specific functions were available; it's probably
also the clearest way to program that functionality.
Probably so. BTW, it is not really "GNU Awk specific"; lots of languages have this general capability.
[...]
The feature can be very useful,
but not for the case I was looking for. - Actually, it could have
provided the functionality I was seeking, but since GNU Awk relies
on the GNU regexp functions as they are implemented I cannot expect
that any provided feature gets extended by Awk. - If GNU Awk had
its own RE implementation then we could think about using, e.g.,
another array dimension to store the (now only temporarily existing,
and generally unavailable) subexpressions.
In article <vt9dre$3t3po$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
[...]
Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.
Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.
I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
about this question. We shall see what develops.
Arnold
On 11.04.2025 08:33, Aharon Robbins wrote:
Actually, this is not so trivial. The data structures at the C level
as mandated by POSIX are one dimensional; the submatches in parentheses
are counted from left to right. There's no way to represent the
subexpressions that are under control of interval expressions, which
would essentially require a two-dimensional data structure.
Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.
(flow "R=r1,R=r2,R=r3,E=e"(spl ","))
(awk (:inputs '("R=r1,R=r2,R=r3,E=e"))(:set fs "," ofs ":")
In article <67f8b7af$0$705$14726298@news.sunsite.dk>,
Aharon Robbins <arnold@freefriends.org> wrote:
...
Mike Haertel is writing a new regexp matcher for gawk; it was announced
here some time ago: https://github.com/mikehaertel/minrx. The code is
in the feature/minrx branch of the gawk Git repository.
Just out of curiosity, does the new matcher address the issue raised by Janis?
It sounds like you are implying that it does, but do not say so explicitly.
[...]
On 4/10/2025 2:09 AM, Janis Papanagnou wrote:
[...]
To clarify; what I wanted is access of the values "r1", "r2", "r3",
and "e" through 'arr'.
Correct, you can't do what you want using just `match()`; it simply matches a regexp with capture groups against a string, just like sed does.
There are, of course, several other ways to get `arr[]` populated the
way you want, e.g. split(), patsplit(), while(match()), or dynamically
generating the regexp. The best one to choose will depend on the real
values that r1, etc. can have; for example, it'd be hard to use split()
if `r1` can be a quoted string that might itself contain similar
substrings such as `data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"`.
data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
delete arr
for ( i in arr ) {
sub(/[^=]+=/, "", arr[i])
}
or any awk:
data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = 0
delete arr
while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
data = substr(data, RSTART+RLENGTH)
}
In article <vtgtkr$3br8e$1@dont-email.me>,
Ed Morton <mortonspam@gmail.com> wrote:
...
data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
delete arr
for ( i in arr ) {
sub(/[^=]+=/, "", arr[i])
}
This can't be right, since the sequence:
delete arr
for (i in arr) ...
can't possibly do anything. I.e., the for statement will be a no-op, since the array is empty at that point.
or any awk:
data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
nf = 0
delete arr
while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
data = substr(data, RSTART+RLENGTH)
}
I believe "delete arr" (without an index, hence removing the entire array)
is an "extension". I can't quite quote chapter and verse, but I note that "man mawk" explicitly mentions that mawk supports this syntax, thereby implying that it isn't "standard". Of course, gawk supports it as well.
So, if by "any awk", you mean "strictly standard", then, well, you can see where I am going with this.
On 14.04.2025 20:20, Kenny McCormack wrote:
[...]
I believe "delete arr" (without an index, hence removing the entire array)
is an "extension". I can't quite quote chapter and verse, but I note that
"man mawk" explicitly mentions that mawk supports this syntax, thereby
implying that it isn't "standard". Of course, gawk supports it as well.
So, if by "any awk", you mean "strictly standard", then, well, you can see
where I am going with this.
I seem to recall that a standard way to clear an array could be using
split("", arr)
for example. To my taste it looks a bit clumsy, not as nice as using 'delete', but well, whatever one prefers.
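A quick check of that idiom (a sketch; the element values and the counter `n` are made up for illustration):

```shell
count=$(awk 'BEGIN {
    arr[1] = "x"; arr[2] = "y"
    split("", arr)              # splitting an empty string leaves arr with no elements
    n = 0
    for (i in arr) n++          # count whatever is left
    print n
}')
printf '%s\n' "$count"
```

The loop body never runs after the split(), so the count printed is 0.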
Janis
In all seriousness if anyone is using an awk that doesn't support
`delete arr` then they need to get a new awk as who knows what other
features it might be lacking.
Ed.
Yes, that's why I had thought about a 2-dimensional array [on GNU
Awk level] so that arr[n][i] for i=1..z would contain the patterns.
This is what I actually tried with GNU Awk (before I had asked you)
to see whether there's some undocumented feature.
A 2-dimensional array is not strictly necessary. It could be possible to
keep the one-dimensional array interface and use the same trick as for
multidimensional array indices in POSIX AWK, i.e., return a list of
matched values delimited by SUBSEP.