• Simplify an AWK pipeline?

    From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Wed Aug 16 16:48:57 2023
    From Newsgroup: comp.lang.awk

    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple set and the pear-hat-apple set) to be sorted alphabetically and separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of code for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single AWK command, if possible.
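    A runnable sketch of the two-stage pipeline above, assuming a POSIX shell and awk, with the demo data written to a temporary file:

```shell
#!/bin/sh
# Recreate the demo file from the post (header line included).
cat > demo.$$ <<'EOF'
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
EOF

# Stage 1: two passes over the file.  The first pass masks the unique-ID
# fields ($1 and $3) and counts each masked record; the second pass prints
# the original records whose masked form occurred more than once.
# Stage 2: sort on field 2 onward, then emit a blank line whenever the
# masked key changes.
result=$(awk 'FNR==NR {$1=$3=1; a[$0]++; next}
              {x=$0; $1=$3=1}
              a[$0]>1 {print x}' demo.$$ demo.$$ |
         sort -t" " -k2 |
         awk 'NR==1 {print; $1=$3=1; x=$0}
              NR>1  {y=$0; $1=$3=1; print ($0==x ? y : "\n" y); x=$0}')
echo "$result"
rm -f demo.$$
```

    Note the parentheses around the ternary in the final print; some awks parse the unparenthesized form differently.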
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.lang.awk on Thu Aug 17 00:59:20 2023
    From Newsgroup: comp.lang.awk

    On 2023-08-16, Robert Mesibov <robert.mesibov@gmail.com> wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    Like this?

    $ txr group.tl < data
    002 pear bb hat apple
    007 pear gg hat apple

    006 pear ff law tiger

    001 rose aa hat apple
    003 rose cc hat apple

    008 shoe hh cup heron

    004 shoe dd try tiger

    009 worm ii cup heron

    005 worm ee law tiger

    $ cat group.tl
    (flow (get-lines)
          (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
          (each ((group @1))
            (put-lines group)
            (put-line)))

    Here's a dime kid, ...
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Aug 17 05:38:54 2023
    From Newsgroup: comp.lang.awk

    On 17.08.2023 01:48, Robert Mesibov wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each
    record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in
    those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of code for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single
    AWK command, if possible.

    Alternatively, you can do it all in one awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it all runs in one awk instance
    (as you want), and it is single-pass. (I think the code is also a
    bit clearer than the one you posted above, but YMMV.)
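    Run against the demo file, this program groups and separates the duplicates in one pass (group order depends on the awk implementation's `for (k in a)` iteration order, which is unspecified). A sketch:

```shell
#!/bin/sh
cat > demo.$$ <<'EOF'
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
EOF

# Single pass: the composite key k joins the non-ID fields; each original
# record is appended to its group (prefixed with RS, i.e. a newline), and
# c[k] counts the group size.  Only groups with more than one member are
# printed; the leading newline in each group doubles as the separator.
result=$(awk '{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
              END { for (k in a) if (c[k] > 1) print a[k] }' demo.$$)
echo "$result"
rm -f demo.$$
```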

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Thu Aug 17 13:56:47 2023
    From Newsgroup: comp.lang.awk

    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:
    Alternatively, you can do it all in one awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it all runs in one awk instance
    (as you want), and it is single-pass. (I think the code is also a
    bit clearer than the one you posted above, but YMMV.)

    Janis
    Many thanks, Janis, that's very nice, but it depends on specifying the non-unique fields 2, 4 and 5. In the real-world cases I work with, there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID fields (2, 4, 5...300+). That's why I replace the unique-ID fields with the arbitrary value "1" when testing for duplication.
    Bob
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Thu Aug 17 21:27:41 2023
    From Newsgroup: comp.lang.awk

    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying the
    non-unique fields 2, 4 and 5. In the real-world cases I work with,
    there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
    fields (2, 4, 5...300+). That's why I replace the unique-ID fields
    with the arbitrary value "1" when testing for duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    2) You probably don't need to mess with SUBSEP. Your data seems to be
    OK with assuming no embedded spaces (i.e., using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed never to occur in user data.
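    The mechanism Kenny describes can be seen directly: a comma inside an awk subscript is just string concatenation with SUBSEP (default "\034"), so the two forms below address the same element of one flat associative array. A small sketch:

```shell
#!/bin/sh
# a["x", "y"] is shorthand for a["x" SUBSEP "y"]; both index the same
# slot, and the (i, j) in array form tests membership the same way.
result=$(awk 'BEGIN {
    a["x", "y"] = "stored via comma subscript"
    print a["x" SUBSEP "y"]
    if (("x", "y") in a) print "found via (i, j) in a"
}')
echo "$result"
```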

    3) I don't see how Janis's solution implements your need for sorting.
    Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...
    --
    "Every time Mitt opens his mouth, a swing state gets its wings."

    (Should be on a bumper sticker)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Aug 17 23:42:50 2023
    From Newsgroup: comp.lang.awk

    On 17.08.2023 22:56, Robert Mesibov wrote:
    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:

    Alternatively, you can do it all in one awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it all runs in one awk instance
    (as you want), and it is single-pass. (I think the code is also a
    bit clearer than the one you posted above, but YMMV.)

    Janis

    Many thanks, Janis, that's very nice, but it depends on specifying
    the non-unique fields 2, 4 and 5. In the real-world cases I work
    with, there are 1-2 unique ID code fields and sometimes 300+
    non-unique-ID fields (2, 4, 5...300+). That's why I replace the
    unique-ID fields with the arbitrary value "1" when testing for
    duplication.

    That was not apparent from your description. But defining the key
    by construction is not mandatory; you can also define it by
    elimination (as in your code). The point was what follows in the
    code after the k=... statement.

    Janis


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Aug 18 00:03:50 2023
    From Newsgroup: comp.lang.awk

    On 17.08.2023 23:27, Kenny McCormack wrote:
    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying the
    non-unique fields 2, 4 and 5. In the real-world cases I work with,
    there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
    fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
    the arbitrary value "1" when testing for duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    Yes, indeed. (See my other post.)


    2) You probably don't need to mess with SUBSEP. Your data seems to be
    OK with assuming no embedded spaces (i.e., using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed never to occur in user data.

    Yes, SUBSEP is the default separation character for array subscripts.
    Of course you can use other characters (that require less text). Why
    you think that "nobody uses that functionality anymore" is beyond
    me; I doubt you have any evidence for that, so I interpret it just
    as "I [Kenny] don't use it anymore.", which is fine by me.


    3) I don't see how Janis's solution implements your need for sorting.

    Sorting can matter at three different levels here.

    I interpreted the OP as doing the 'sort' just to be able to compare
    each data set with the previous one, to have them together; this is
    unnecessary, though, with the approach I used with keys in an
    associative array. Since the original data is also already sorted by
    a unique numeric key, and I concatenate the data sequentially, it is
    also not necessary to sort the data in that respect. So what's left
    is the third thing that could be sorted, the order of the classes;
    that all, say, "pear" elements come before all "rose" elements. This
    sort, in case it is desired, is not reflected in my approach.

    Janis

    Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Thu Aug 17 15:36:38 2023
    From Newsgroup: comp.lang.awk

    Apologies for not explaining that there are numerous non-unique-ID fields, and yes, what I am aiming for is a sort beginning with the first non-unique-ID field.

    My code is complicated because I need to preserve the original records for the output, while also modifying the original records by "de-uniquifying" the unique-ID fields in order to hunt for partial duplicates.

    I'll continue to tinker with this and report back if I can simplify the code, but I would be grateful for any other AWK solutions.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.lang.awk on Thu Aug 17 23:00:45 2023
    From Newsgroup: comp.lang.awk

    On 2023-08-17, Kaz Kylheku <864-117-4973@kylheku.com> wrote:
    (flow (get-lines)
    (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
    ^^^^^^^^
    [...]

    This selects the second, fourth and fifth fields and each field after
    the fifth, as the non-unique fields on which to group.

    I inferred the requirement that the complement of the unique fields
    should be used: all fields which are not the unique ones.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Aug 18 01:47:10 2023
    From Newsgroup: comp.lang.awk

    On 18.08.2023 00:36, Robert Mesibov wrote:

    I'll continue to tinker with this and report back if I can simplify
    the code, but I would be grateful for any other AWK solutions.

    For any additional sorting Kenny gave hints (see his point 3) that
    can simply be added if you're using GNU awk.

    Janis
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Fri Aug 18 01:23:00 2023
    From Newsgroup: comp.lang.awk

    Many thanks again, Janis. I doubt that I can improve on

    awk '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++}; END {for (i in a) if (b[i]>1) print a[i]}' demo

    and the sorting isn't critical.
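    For anyone landing here later, that one-liner behaves like this on the demo file (group order is implementation-dependent, and each group is preceded by a blank line thanks to the leading RS). A sketch:

```shell
#!/bin/sh
cat > demo.$$ <<'EOF'
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
EOF

# x keeps the original record; y is the record with the unique-ID fields
# masked.  Groups are collected under the masked key, so this works for
# any number of non-ID fields without naming them.
result=$(awk '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++}
              END {for (i in a) if (b[i]>1) print a[i]}' demo.$$)
echo "$result"
rm -f demo.$$
```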

    Bob
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Mon Aug 21 14:54:53 2023
    From Newsgroup: comp.lang.awk

    In article <ubm5g7$3u7rt$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    2) You probably don't need to mess with SUBSEP. Your data seems
    to be OK with assuming no embedded spaces (i.e., so using space
    as the delimiter is OK). Note that SUBSEP is intended to be
    used as the delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data.

    Yes, SUBSEP is the default separation character for array subscripts. Of
    course you can use other characters (that require less text). Why
    you think that "nobody uses that functionality anymore" is beyond
    me; I doubt you have any evidence for that, so I interpret it just
    as "I [Kenny] don't use it anymore.", which is fine by me.

    It may be a language barrier - I understand that English is not your first language - but in colloquial English, the phrase "nobody does X anymore"
    often means something close to "nobody should do X anymore" or "Only uncool people still do X". Obviously, *some* people still do. BTW, see also the famous Yogi Berra quip: (Of a certain restaurant) "Nobody goes there
    anymore; it's too crowded."

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional arrays. They never really worked well, and now that we have true MDAs,
    nobody should be using the old stuff.
    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/GodDelusion
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Aug 21 18:10:10 2023
    From Newsgroup: comp.lang.awk

    On 21.08.2023 16:54, Kenny McCormack wrote:
    In article <ubm5g7$3u7rt$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    2) You probably don't need to mess with SUBSEP. Your data seems
    to be OK with assuming no embedded spaces (i.e., so using space
    as the delimiter is OK). Note that SUBSEP is intended to be
    used as the delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data.

    Yes, SUBSEP is the default separation character for array subscripts. Of
    course you can use other characters (that require less text). Why
    you think that "nobody uses that functionality anymore" is beyond
    me; I doubt you have any evidence for that, so I interpret it just
    as "I [Kenny] don't use it anymore.", which is fine by me.

    It may be a language barrier - I understand that English is not your first language - but in colloquial English, the phrase "nobody does X anymore" often means something close to "nobody should do X anymore" or "Only uncool people still do X". Obviously, *some* people still do. BTW, see also the famous Yogi Berra quip: (Of a certain restaurant) "Nobody goes there
    anymore; it's too crowded."

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional arrays. They never really worked well, and now that we have true MDAs, nobody should be using the old stuff.

    Okay, thanks for explaining. So I interpreted it correctly
    (despite any language barrier that may exist). - And I still
    disagree with you, in the given thread context and also
    generally.

    "True" multi-dimensional arrays are unnecessary here, and using
    separate keys where you need only one composite key is not only
    unnecessary, it seems to complicate matters. (But you may provide
    code to prove me wrong if you like; how would multi-dimensional
    arrays help here?)

    In the past I used GNU Awk's multi-dimensional arrays in
    contexts where they were necessary, and there they simplified
    *those* things. But usually when using awk I have observed that
    "simple [associative] arrays" are what I need in 98% of my
    awk applications[*] - of course the situations where _you_
    (personally) use Awk arrays may be different (which would
    actually mean "I [Kenny] don't use it anymore.", as I
    interpreted upthread).[**]

    Since a[k] is the common use, the question is: in which
    contexts is a[k1][k2] necessary, and in which is a[k1,k2]
    sufficient? - My observation is that a[k1][k2] is advantageous
    only where you need true multi-dimensional access; but this
    appears not to be the common case. (BTW, [***].)

    I think it boils down to observing that the concrete solution
    given uses just one composed index, and that there is no need
    for non-standard "true multi-dimensional arrays" because there
    are no multi-dimensional arrays here at all.[****]
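    The distinction can be made concrete in POSIX awk alone: a composite a[k1,k2] subscript is one flat string key, and split() on SUBSEP recovers the individual components when iterating. (gawk's true a[k1][k2] arrays, not used here, would instead let each dimension be iterated on its own.) A sketch with made-up data:

```shell
#!/bin/sh
# Composite-key (standard awk) form: one flat associative array whose
# keys are "k1 SUBSEP k2"; split() takes the key back apart.
result=$(awk 'BEGIN {
    cnt["pear", "hat"] = 2
    cnt["rose", "hat"] = 2
    for (k in cnt) {
        split(k, part, SUBSEP)      # part[1] = fruit, part[2] = item
        print part[1], part[2], cnt[k]
    }
}' | sort)
echo "$result"
```

    The trailing sort only makes the demonstration's output order deterministic; `for (k in cnt)` order is unspecified.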

    Thanks for reading.

    Janis

    [*] Reminds me of the reason why Pascal supported only loops
    based on integral indices (and not floating point): there was
    evidence that this was what was used most of the time. (It
    doesn't mean that there aren't sensible applications beyond
    that.)

    [**] Of course you may also provide evidence and reasons for
    the given hypotheses "nobody should do X anymore" - why? - and
    "only uncool people still do X" - "uncool"? - for X = "use
    simple awk arrays". - I think such statements just make no
    sense if they are fuzzy (undetermined) or merely personal
    without evidence.

    [***] I deliberately ignored that the GNU Awk extension
    is also non-standard, since it's not necessary for our
    dispute.

    [****] You can see that there 'k' is composed and only a[k]
    and c[k] are used; simple and without disadvantage.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Aug 21 18:56:33 2023
    From Newsgroup: comp.lang.awk

    On 21.08.2023 16:54, Kenny McCormack wrote:

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional arrays. They never really worked well, and now that we have true MDAs, nobody should be using the old stuff.

    I think the misunderstandings in this subthread were...
    - we have no disagreement about where "MDAs" are _necessary_ and used,
    - this thread's solutions contained no application of "MDAs"
      (just a composed key), and "MDAs" also weren't necessary,
    - (thesis) basic associative arrays are what is predominantly used
      (mileage may vary depending on where awk is used),
    - "MDAs" support associative functionality and are thus hardly
      avoidable (is a[k] "old stuff", or is it an MDA with one dimension?)

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Mon Aug 21 19:04:17 2023
    From Newsgroup: comp.lang.awk

    In article <uc0501$1vsmp$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 21.08.2023 16:54, Kenny McCormack wrote:

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional
    arrays. They never really worked well, and now that we have true MDAs,
    nobody should be using the old stuff.

    I think the misunderstandings in this subthread were...
    - we have no disagreement about where "MDAs" are _necessary_ and used,
    - this thread's solutions contained no application of "MDAs"
      (just a composed key), and "MDAs" also weren't necessary,
    - (thesis) basic associative arrays are what is predominantly used
      (mileage may vary depending on where awk is used),
    - "MDAs" support associative functionality and are thus hardly
      avoidable (is a[k] "old stuff", or is it an MDA with one dimension?)

    I never said anything about any of that - That is, anything about whether
    or not MDAs were needed in the context of this thread (Clearly, they are
    not).

    My content was, as it usually is, entirely "meta". Thus, the following two comments:

    1) It sounded like you had misunderstood my comment about "nobody does
    that anymore", so I clarified what the colloquial meaning of that
    expression is. Note that I have hit a similar thing a while back in the
    shell group - where I stated that nobody uses backticks anymore,
    because we now have $(), which, as we all know, is better in just about
    every way (the only exception that I can think of is that if you are
    programming in csh or tcsh, then you have to use backticks - although
    this may sound facetious, I still do some tcsh stuff, so I have to keep
    this in mind).

    I got a lot of blowback from indignant people who wanted me to know
    that they still use backticks and they were personally insulted that I
    claimed that no one did that anymore. Clearly, those people did not
    understand the idiomatic meaning of the expression either.
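    The main practical point in $()'s favor is nesting: backticks need a backslash before each inner backtick (and it gets worse with every level), while $() nests cleanly. A small sketch:

```shell
#!/bin/sh
# $() nests with no escaping at all; the backtick equivalent needs
# backslash-escaped inner backticks to delimit the nested substitution.
modern=$(echo "outer-$(echo inner)")
legacy=`echo "outer-\`echo inner\`"`
echo "$modern"
echo "$legacy"
```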

    2) You had used SUBSEP in your script (reply to OP), but were
    (obviously) not using (any form of) MDAs, so I made some comments (not
    for your benefit, but for OP's) about your usage of SUBSEP (i.e., how
    it is usually only used when using pseudo-MDAs, but that some people
    have co-opted it for other uses).
    --
    The plural of "anecdote" is _not_ "data".
    --- Synchronet 3.20a-Linux NewsLink 1.114