I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:
awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo
which returns
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
007 pear gg hat apple
I would like those 2 sets of partial duplicates (the rose-hat-apple
set and the pear-hat-apple set) to be sorted alphabetically and
separated, like this:
002 pear bb hat apple
007 pear gg hat apple

001 rose aa hat apple
003 rose cc hat apple
I can do that by piping the first AWK command's output to
sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'
but this seems like a lot of coding for a result. I'd be grateful for suggestions on how to get the sorted, separated result in a single
AWK command, if possible.
You can alternatively do it (e.g.) in one instance also like this...
{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
END { for(k in a) if (c[k]>1) print a[k] }
which is not (not much) shorter character wise but doesn't need the
external sort command, it is all in one awk instance (as you want),
and single pass. (I think the code is also a bit clearer than the
one you posted above, but YMMV.)
Janis
On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:
> You can alternatively do it (e.g.) in one instance also like this...
> { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
> END { for(k in a) if (c[k]>1) print a[k] }
> which is not (not much) shorter character wise but doesn't need the
> external sort command, it is all in one awk instance (as you want),
> and single pass. (I think the code is also a bit clearer than the
> one you posted above, but YMMV.)
> Janis
Many thanks, Janis, that's very nice, but it depends on specifying
the non-unique fields 2, 4 and 5. In the real-world cases I work
with, there are 1-2 unique ID code fields and sometimes 300+
non-unique-ID fields (2, 4, 5...300+). That's why I replace the
unique-ID fields with the arbitrary value "1" when testing for
duplication.
Bob
In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
Robert Mesibov <robert.mesibov@gmail.com> wrote:
...
> Many thanks, Janis, that's very nice, but it depends on specifying the
> non-unique fields 2, 4 and 5. In the real-world cases I work with,
> there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
> fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
> the arbitrary value "1" when testing for duplication.
1) Well, it seems like it shouldn't be too hard for you to retrofit your
hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
"" instead of 1.
2) You probably don't need to mess with SUBSEP. Your data seems to be OK
with assuming no embedded spaces (i.e., using space as the delimiter is
OK). Note that SUBSEP is intended to be used as the delimiter for the
implementation of old-fashioned pseudo-multi-dimensional arrays in AWK,
but nobody uses that functionality anymore. Therefore, some AWK
programmers have co-opted SUBSEP as a symbol provided by the language to
represent a character that is more-or-less guaranteed to never occur in
user data.
3) I don't see how Janis's solution implements your need for sorting.
Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...
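Points 1) and 3) above can be combined in a single pass. What follows is only a sketch, assuming POSIX awk and that fields 1 and 3 are the unique codes; the shell function name dup_groups and the hand-written insertion sort are illustrative and do not come from any poster in the thread.

```shell
# Hypothetical helper (name not from the thread): list partial duplicates,
# grouped and sorted, in one awk pass. Assumes fields 1 and 3 are unique codes.
dup_groups() {
  awk '
    {
      orig = $0                  # keep the record exactly as read
      $1 = $3 = ""               # blank the unique-ID fields; $0 is rebuilt
      k = $0                     # what remains is the grouping key
      grp[k] = grp[k] orig "\n"  # collect the original records per key
      cnt[k]++
    }
    END {
      n = 0
      for (k in cnt) if (cnt[k] > 1) keys[++n] = k
      # insertion sort on the keys, since POSIX awk has no array sort
      for (i = 2; i <= n; i++) {
        v = keys[i]
        for (j = i - 1; j >= 1 && keys[j] > v; j--) keys[j + 1] = keys[j]
        keys[j + 1] = v
      }
      # print each duplicate set, with a blank line between sets
      for (i = 1; i <= n; i++) printf "%s%s", (i > 1 ? "\n" : ""), grp[keys[i]]
    }
  ' "$@"
}
```

Called as dup_groups demo, this should print the pear-hat-apple set, a blank line, then the rose-hat-apple set. Because the key is whatever is left after blanking the ID fields, it does not matter whether there are 3 or 300+ non-unique fields. In gawk, setting PROCINFO["sorted_in"] = "@ind_str_asc" before a for (k in cnt) loop would make the insertion sort unnecessary.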
(flow (get-lines)^^^^^^^^
(sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
[...]
I'll continue to tinker with this and report back if I can simplify
the code, but I would be grateful for any other AWK solutions.
> 2) You probably don't need to mess with SUBSEP. Your data seems
> to be OK with assuming no embedded spaces (i.e., so using space
> as the delimiter is OK) Note that SUBSEP is intended to be
> used as the delimiter for the implementation of old-fashioned
> pseudo-multi-dimensional arrays in AWK, but nobody uses that
> functionality anymore. Therefore, some AWK programmers have co-opted
> SUBSEP as a symbol provided by the language to represent a character
> that is more-or-less guaranteed to never occur in user data.
Yes, SUBSEP is the default separation character for arrays. Of
course you can use other characters (that require less text). Why
you think that "nobody uses that functionality anymore" is beyond
me; I doubt you have any evidence for that, so I interpret it just
as "I [Kenny] don't use it anymore.", which is fine by me.
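For readers unsure what SUBSEP does in the exchange above, here is a small sketch, assuming any POSIX awk. The function name subsep_demo and the array name seen are made up for illustration; the point is only that a comma-separated subscript and a key joined by hand with SUBSEP address the same array element.

```shell
# Hypothetical demo (names not from the thread) of how awk builds composed keys.
subsep_demo() {
  awk 'BEGIN {
    seen["pear", "hat"] = 2            # stored under "pear" SUBSEP "hat"
    for (k in seen) {
      split(k, part, SUBSEP)           # recover the components of the key
      print part[1], part[2], seen[k]
    }
    if (("pear", "hat") in seen) print "found"             # subscript-list test
    if (("pear" SUBSEP "hat") in seen) print "same key"    # hand-built key
  }'
}
```

Running subsep_demo prints the two key components with the stored count, then "found" and "same key". Using a space or another character as the joiner works equally well as long as that character cannot occur in the data.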
In article <ubm5g7$3u7rt$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
...
>> 2) You probably don't need to mess with SUBSEP. Your data seems
>> to be OK with assuming no embedded spaces (i.e., so using space
>> as the delimiter is OK) Note that SUBSEP is intended to be
>> used as the delimiter for the implementation of old-fashioned
>> pseudo-multi-dimensional arrays in AWK, but nobody uses that
>> functionality anymore. Therefore, some AWK programmers have co-opted
>> SUBSEP as a symbol provided by the language to represent a character
>> that is more-or-less guaranteed to never occur in user data.
> Yes, SUBSEP is the default separation character for arrays. Of
> course you can use other characters (that require less text). Why
> you think that "nobody uses that functionality anymore" is beyond
> me; I doubt you have any evidence for that, so I interpret it just
> as "I [Kenny] don't use it anymore.", which is fine by me.
It may be a language barrier - I understand that English is not your
first language - but in colloquial English, the phrase "nobody does X
anymore" often means something close to "nobody should do X anymore" or
"Only uncool people still do X". Obviously, *some* people still do.
BTW, see also the famous Yogi Berra quip: (Of a certain restaurant)
"Nobody goes there anymore; it's too crowded."
Anyway, this is definitely true of old-fashioned AWK
pseudo-multi-dimensional arrays. They never really worked well, and now
that we have true MDAs, nobody should be using the old stuff.
On 21.08.2023 16:54, Kenny McCormack wrote:
> Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional
> arrays. They never really worked well, and now that we have true MDAs,
> nobody should be using the old stuff.
I think the misunderstandings in this subthread were...
- we have no disagreement where "MDAs" are _necessary_ and used,
- in this thread's solutions we had no application of "MDAs"
(just a composed key), and "MDAs" also weren't necessary,
- (thesis) basic associative arrays are predominantly used
(mileages may probably vary depending on where awk is used),
- "MDAs" support associative functionality thus hardly avoidable
(is a[k] "old stuff" or is it an MDA with one dimension?)