• Simplify an AWK pipeline?

    From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Wed Aug 16 16:48:57 2023
    From Newsgroup: comp.lang.awk

    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple set and the pear-hat-apple set) to be sorted alphabetically and separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of code for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single AWK command, if possible.
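    A runnable sketch of the two-stage pipeline above, assuming a POSIX shell and awk, with the demo data written to a temporary file:

```shell
#!/bin/sh
# Recreate the demo file from the post (header line included).
cat > demo.$$ <<'EOF'
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
EOF

# Stage 1: two passes over the file.  The first pass masks the unique-ID
# fields ($1 and $3) and counts each masked record; the second pass prints
# the original records whose masked form occurred more than once.
# Stage 2: sort on field 2 onward, then emit a blank line whenever the
# masked key changes.
result=$(awk 'FNR==NR {$1=$3=1; a[$0]++; next}
              {x=$0; $1=$3=1}
              a[$0]>1 {print x}' demo.$$ demo.$$ |
         sort -t" " -k2 |
         awk 'NR==1 {print; $1=$3=1; x=$0}
              NR>1  {y=$0; $1=$3=1; print ($0==x ? y : "\n" y); x=$0}')
echo "$result"
rm -f demo.$$
```

    Note the parentheses around the ternary in the final print; some awks parse the unparenthesized form differently.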
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.lang.awk on Thu Aug 17 00:59:20 2023
    From Newsgroup: comp.lang.awk

    On 2023-08-16, Robert Mesibov <robert.mesibov@gmail.com> wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    Like this?

    $ txr group.tl < data
    002 pear bb hat apple
    007 pear gg hat apple

    006 pear ff law tiger

    001 rose aa hat apple
    003 rose cc hat apple

    008 shoe hh cup heron

    004 shoe dd try tiger

    009 worm ii cup heron

    005 worm ee law tiger

    $ cat group.tl
    (flow (get-lines)
          (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
          (each ((group @1))
            (put-lines group)
            (put-line)))

    Here's a dime kid, ...
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Aug 17 05:38:54 2023
    From Newsgroup: comp.lang.awk

    On 17.08.2023 01:48, Robert Mesibov wrote:
    I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each
    record.

    fld1 fld2 fld3 fld4 fld5
    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    004 shoe dd try tiger
    005 worm ee law tiger
    006 pear ff law tiger
    007 pear gg hat apple
    008 shoe hh cup heron
    009 worm ii cup heron

    To find the partial duplicate records which are identical except in
    those unique codes, I can parse "demo" twice like this:

    awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

    which returns

    001 rose aa hat apple
    002 pear bb hat apple
    003 rose cc hat apple
    007 pear gg hat apple

    I would like those 2 sets of partial duplicates (the rose-hat-apple
    set and the pear-hat-apple set) to be sorted alphabetically and
    separated, like this:

    002 pear bb hat apple
    007 pear gg hat apple

    001 rose aa hat apple
    003 rose cc hat apple

    I can do that by piping the first AWK command's output to

    sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

    but this seems like a lot of code for the result. I'd be grateful for suggestions on how to get the sorted, separated result in a single
    AWK command, if possible.

    Alternatively, you can do it all in one awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it all runs in one awk instance
    (as you want), and it is single-pass. (I think the code is also a
    bit clearer than the one you posted above, but YMMV.)
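    Run against the demo file, this program groups and separates the duplicates in one pass (group order depends on the awk implementation's `for (k in a)` iteration order, which is unspecified). A sketch:

```shell
#!/bin/sh
cat > demo.$$ <<'EOF'
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
EOF

# Single pass: the composite key k joins the non-ID fields; each original
# record is appended to its group (prefixed with RS, i.e. a newline), and
# c[k] counts the group size.  Only groups with more than one member are
# printed; the leading newline in each group doubles as the separator.
result=$(awk '{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
              END { for (k in a) if (c[k] > 1) print a[k] }' demo.$$)
echo "$result"
rm -f demo.$$
```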

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Thu Aug 17 13:56:47 2023
    From Newsgroup: comp.lang.awk

    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:
    Alternatively, you can do it all in one awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it all runs in one awk instance
    (as you want), and it is single-pass. (I think the code is also a
    bit clearer than the one you posted above, but YMMV.)

    Janis
    Many thanks, Janis, that's very nice, but it depends on specifying the non-unique fields 2, 4 and 5. In the real-world cases I work with, there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID fields (2, 4, 5...300+). That's why I replace the unique-ID fields with the arbitrary value "1" when testing for duplication.
    Bob
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Thu Aug 17 21:27:41 2023
    From Newsgroup: comp.lang.awk

    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying the
    non-unique fields 2, 4 and 5. In the real-world cases I work with,
    there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
    fields (2, 4, 5...300+). That's why I replace the unique-ID fields
    with the arbitrary value "1" when testing for duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    2) You probably don't need to mess with SUBSEP. Your data seems to be
    OK with assuming no embedded spaces (i.e., using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed never to occur in user data.
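    The mechanism Kenny describes can be seen directly: a comma inside an awk subscript is just string concatenation with SUBSEP (default "\034"), so the two forms below address the same element of one flat associative array. A small sketch:

```shell
#!/bin/sh
# a["x", "y"] is shorthand for a["x" SUBSEP "y"]; both index the same
# slot, and the (i, j) in array form tests membership the same way.
result=$(awk 'BEGIN {
    a["x", "y"] = "stored via comma subscript"
    print a["x" SUBSEP "y"]
    if (("x", "y") in a) print "found via (i, j) in a"
}')
echo "$result"
```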

    3) I don't see how Janis's solution implements your need for sorting.
    Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...
    --
    "Every time Mitt opens his mouth, a swing state gets its wings."

    (Should be on a bumper sticker)
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Thu Aug 17 23:42:50 2023
    From Newsgroup: comp.lang.awk

    On 17.08.2023 22:56, Robert Mesibov wrote:
    On Thursday, August 17, 2023 at 1:38:58 PM UTC+10, Janis Papanagnou wrote:

    Alternatively, you can do it all in one awk instance, e.g. like this...

    { k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
    END { for(k in a) if (c[k]>1) print a[k] }

    which is not (or not much) shorter character-wise, but it doesn't
    need the external sort command, it all runs in one awk instance
    (as you want), and it is single-pass. (I think the code is also a
    bit clearer than the one you posted above, but YMMV.)

    Janis

    Many thanks, Janis, that's very nice, but it depends on specifying
    the non-unique fields 2, 4 and 5. In the real-world cases I work
    with, there are 1-2 unique ID code fields and sometimes 300+
    non-unique-ID fields (2, 4, 5...300+). That's why I replace the
    unique-ID fields with the arbitrary value "1" when testing for
    duplication.

    That was not apparent from your description. But defining the key
    by construction is not mandatory; you can also define it by
    elimination (as in your code). The point was what follows in the
    code after the k=... statement.

    Janis


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Aug 18 00:03:50 2023
    From Newsgroup: comp.lang.awk

    On 17.08.2023 23:27, Kenny McCormack wrote:
    In article <b00f43d1-f50f-44ca-bb1f-517065cc3e28n@googlegroups.com>,
    Robert Mesibov <robert.mesibov@gmail.com> wrote:
    ...
    Many thanks, Janis, that's very nice, but it depends on specifying the
    non-unique fields 2, 4 and 5. In the real-world cases I work with,
    there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
    fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
    the arbitrary value "1" when testing for duplication.

    1) Well, it seems like it shouldn't be too hard for you to retrofit your
    hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
    "" instead of 1.

    Yes, indeed. (See my other post.)


    2) You probably don't need to mess with SUBSEP. Your data seems to be
    OK with assuming no embedded spaces (i.e., using space as the
    delimiter is OK). Note that SUBSEP is intended to be used as the
    delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed never to occur in user data.

    Yes, SUBSEP is the default separation character for array subscripts.
    Of course you can use other characters (that require less text). Why
    you think that "nobody uses that functionality anymore" is beyond
    me; I doubt you have any evidence for that, so I interpret it just
    as "I [Kenny] don't use it anymore.", which is fine by me.


    3) I don't see how Janis's solution implements your need for sorting.

    Sorting can matter at three different levels here.

    I interpreted the OP as doing the 'sort' just to be able to compare
    each data set with the previous one, to have them together; this is
    unnecessary, though, with the approach I used with keys in an
    associative array. Since the original data is also already sorted by
    a unique numeric key, and I concatenate the data sequentially, it is
    also not necessary to sort the data in that respect. So what's left
    is the third thing that could be sorted, the order of the classes;
    that all, say, "pear" elements come before all "rose" elements. This
    sort, in case it is desired, is not reflected in my approach.

    Janis

    Unless he is using the WHINY_USERS option. Or asort or asorti or PROCINFO["sorted_in"] or ...


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Thu Aug 17 15:36:38 2023
    From Newsgroup: comp.lang.awk

    Apologies for not explaining that there are numerous non-unique-ID fields, and yes, what I am aiming for is a sort beginning with the first non-unique-ID field.

    My code is complicated because I need to preserve the original records for the output, while also modifying the original records by "de-uniquifying" the unique-ID fields in order to hunt for partial duplicates.

    I'll continue to tinker with this and report back if I can simplify the code, but I would be grateful for any other AWK solutions.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Kaz Kylheku@864-117-4973@kylheku.com to comp.lang.awk on Thu Aug 17 23:00:45 2023
    From Newsgroup: comp.lang.awk

    On 2023-08-17, Kaz Kylheku <864-117-4973@kylheku.com> wrote:
    (flow (get-lines)
    (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
    ^^^^^^^^
    [...]

    This selects the second, fourth and fifth fields and each field after
    the fifth, as the non-unique fields on which to group.

    I inferred the requirement that the complement of the unique fields
    should be used: all fields which are not the unique ones.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Fri Aug 18 01:47:10 2023
    From Newsgroup: comp.lang.awk

    On 18.08.2023 00:36, Robert Mesibov wrote:

    I'll continue to tinker with this and report back if I can simplify
    the code, but I would be grateful for any other AWK solutions.

    For any additional sorting Kenny gave hints (see his point 3) that
    can simply be added if you're using GNU awk.

    Janis
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Robert Mesibov@robert.mesibov@gmail.com to comp.lang.awk on Fri Aug 18 01:23:00 2023
    From Newsgroup: comp.lang.awk

    Many thanks again, Janis. I doubt that I can improve on

    awk '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++}; END {for (i in a) if (b[i]>1) print a[i]}' demo

    and the sorting isn't critical.
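    For anyone landing here later, that one-liner behaves like this on the demo file (group order is implementation-dependent, and each group is preceded by a blank line thanks to the leading RS). A sketch:

```shell
#!/bin/sh
cat > demo.$$ <<'EOF'
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
EOF

# x keeps the original record; y is the record with the unique-ID fields
# masked.  Groups are collected under the masked key, so this works for
# any number of non-ID fields without naming them.
result=$(awk '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++}
              END {for (i in a) if (b[i]>1) print a[i]}' demo.$$)
echo "$result"
rm -f demo.$$
```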

    Bob
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Mon Aug 21 14:54:53 2023
    From Newsgroup: comp.lang.awk

    In article <ubm5g7$3u7rt$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    2) You probably don't need to mess with SUBSEP. Your data seems
    to be OK with assuming no embedded spaces (i.e., so using space
    as the delimiter is OK). Note that SUBSEP is intended to be
    used as the delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data.

    Yes, SUBSEP is the default separation character for array subscripts. Of
    course you can use other characters (that require less text). Why
    you think that "nobody uses that functionality anymore" is beyond
    me; I doubt you have any evidence for that, so I interpret it just
    as "I [Kenny] don't use it anymore.", which is fine by me.

    It may be a language barrier - I understand that English is not your first language - but in colloquial English, the phrase "nobody does X anymore"
    often means something close to "nobody should do X anymore" or "Only uncool people still do X". Obviously, *some* people still do. BTW, see also the famous Yogi Berra quip: (Of a certain restaurant) "Nobody goes there
    anymore; it's too crowded."

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional arrays. They never really worked well, and now that we have true MDAs,
    nobody should be using the old stuff.
    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/GodDelusion
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Aug 21 18:10:10 2023
    From Newsgroup: comp.lang.awk

    On 21.08.2023 16:54, Kenny McCormack wrote:
    In article <ubm5g7$3u7rt$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    ...
    2) You probably don't need to mess with SUBSEP. Your data seems
    to be OK with assuming no embedded spaces (i.e., so using space
    as the delimiter is OK). Note that SUBSEP is intended to be
    used as the delimiter for the implementation of old-fashioned
    pseudo-multi-dimensional arrays in AWK, but nobody uses that
    functionality anymore. Therefore, some AWK programmers have co-opted
    SUBSEP as a symbol provided by the language to represent a character
    that is more-or-less guaranteed to never occur in user data.

    Yes, SUBSEP is the default separation character for array subscripts. Of
    course you can use other characters (that require less text). Why
    you think that "nobody uses that functionality anymore" is beyond
    me; I doubt you have any evidence for that, so I interpret it just
    as "I [Kenny] don't use it anymore.", which is fine by me.

    It may be a language barrier - I understand that English is not your first language - but in colloquial English, the phrase "nobody does X anymore" often means something close to "nobody should do X anymore" or "Only uncool people still do X". Obviously, *some* people still do. BTW, see also the famous Yogi Berra quip: (Of a certain restaurant) "Nobody goes there
    anymore; it's too crowded."

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional arrays. They never really worked well, and now that we have true MDAs, nobody should be using the old stuff.

    Okay, thanks for explaining. So I interpreted it correctly
    (despite any language barrier that may exist). - And I still
    disagree with you, in the given thread context and also
    generally.

    "True" multi-dimensional arrays are unnecessary here, and using
    separate keys where you need only one composite key is not only
    unnecessary, it seems to complicate matters. (But you may provide
    code to prove me wrong if you like; how would multi-dimensional
    arrays help here?)

    In the past I used GNU Awk's multi-dimensional arrays in
    contexts where they were necessary, and there they simplified
    *those* things. But usually when using awk I have observed that
    "simple [associative] arrays" are what I need in 98% of my
    awk applications[*] - of course the situations where _you_
    (personally) use Awk arrays may be different (which would
    actually mean "I [Kenny] don't use it anymore.", as I
    interpreted upthread).[**]

    Since a[k] is the common use, the question is: in which
    contexts is a[k1][k2] necessary, and in which is a[k1,k2]
    sufficient? - My observation is that a[k1][k2] is advantageous
    only where you need true multi-dimensional access; but this
    appears not to be the common case. (BTW, [***].)

    I think it boils down to observing that the concrete solution
    given uses just one composed index, and that there is no need
    for non-standard "true multi-dimensional arrays" because there
    are no multi-dimensional arrays here at all.[****]
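    The distinction can be made concrete in POSIX awk alone: a composite a[k1,k2] subscript is one flat string key, and split() on SUBSEP recovers the individual components when iterating. (gawk's true a[k1][k2] arrays, not used here, would instead let each dimension be iterated on its own.) A sketch with made-up data:

```shell
#!/bin/sh
# Composite-key (standard awk) form: one flat associative array whose
# keys are "k1 SUBSEP k2"; split() takes the key back apart.
result=$(awk 'BEGIN {
    cnt["pear", "hat"] = 2
    cnt["rose", "hat"] = 2
    for (k in cnt) {
        split(k, part, SUBSEP)      # part[1] = fruit, part[2] = item
        print part[1], part[2], cnt[k]
    }
}' | sort)
echo "$result"
```

    The trailing sort only makes the demonstration's output order deterministic; `for (k in cnt)` order is unspecified.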

    Thanks for reading.

    Janis

    [*] Reminds me of the reason why Pascal supported only loops
    based on integral indices (and not floating point): there was
    evidence that this was what was used most of the time. (It
    doesn't mean that there aren't sensible applications beyond
    that.)

    [**] Of course you may also provide evidence and reasons for
    the given hypotheses "nobody should do X anymore" - why? - and
    "only uncool people still do X" - "uncool"? - for X = "use
    simple awk arrays". - I think such statements just make no
    sense if they are fuzzy (undetermined) or merely personal
    without evidence.

    [***] I deliberately ignored that the GNU Awk extension
    is also non-standard, since it's not necessary for our
    dispute.

    [****] You can see that there 'k' is composed and only a[k]
    and c[k] are used; simple and without disadvantage.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Aug 21 18:56:33 2023
    From Newsgroup: comp.lang.awk

    On 21.08.2023 16:54, Kenny McCormack wrote:

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional arrays. They never really worked well, and now that we have true MDAs, nobody should be using the old stuff.

    I think the misunderstandings in this subthread were...
    - we have no disagreement about where "MDAs" are _necessary_ and used,
    - this thread's solutions contained no application of "MDAs"
      (just a composed key), and "MDAs" also weren't necessary,
    - (thesis) basic associative arrays are what is predominantly used
      (mileage may vary depending on where awk is used),
    - "MDAs" support associative functionality and are thus hardly
      avoidable (is a[k] "old stuff", or is it an MDA with one dimension?)

    Janis

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@gazelle@shell.xmission.com (Kenny McCormack) to comp.lang.awk on Mon Aug 21 19:04:17 2023
    From Newsgroup: comp.lang.awk

    In article <uc0501$1vsmp$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 21.08.2023 16:54, Kenny McCormack wrote:

    Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional
    arrays. They never really worked well, and now that we have true MDAs,
    nobody should be using the old stuff.

    I think the misunderstandings in this subthread were...
    - we have no disagreement about where "MDAs" are _necessary_ and used,
    - this thread's solutions contained no application of "MDAs"
      (just a composed key), and "MDAs" also weren't necessary,
    - (thesis) basic associative arrays are what is predominantly used
      (mileage may vary depending on where awk is used),
    - "MDAs" support associative functionality and are thus hardly
      avoidable (is a[k] "old stuff", or is it an MDA with one dimension?)

    I never said anything about any of that - That is, anything about whether
    or not MDAs were needed in the context of this thread (Clearly, they are
    not).

    My content was, as it usually is, entirely "meta". Thus, the following two comments:

    1) It sounded like you had misunderstood my comment about "nobody does
    that anymore", so I clarified what the colloquial meaning of that
    expression is. Note that I have hit a similar thing a while back in the
    shell group - where I stated that nobody uses backticks anymore,
    because we now have $(), which, as we all know, is better in just about
    every way (the only exception that I can think of is that if you are
    programming in csh or tcsh, then you have to use backticks - although
    this may sound facetious, I still do some tcsh stuff, so I have to keep
    this in mind).

    I got a lot of blowback from indignant people who wanted me to know
    that they still use backticks and they were personally insulted that I
    claimed that no one did that anymore. Clearly, those people did not
    understand the idiomatic meaning of the expression either.
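    The main practical point in $()'s favor is nesting: backticks need a backslash before each inner backtick (and it gets worse with every level), while $() nests cleanly. A small sketch:

```shell
#!/bin/sh
# $() nests with no escaping at all; the backtick equivalent needs
# backslash-escaped inner backticks to delimit the nested substitution.
modern=$(echo "outer-$(echo inner)")
legacy=`echo "outer-\`echo inner\`"`
echo "$modern"
echo "$legacy"
```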

    2) You had used SUBSEP in your script (reply to OP), but were
    (obviously) not using (any form of) MDAs, so I made some comments (not
    for your benefit, but for OP's) about your usage of SUBSEP (i.e., how
    it is usually only used when using pseudo-MDAs, but that some people
    have co-opted it for other uses).
    --
    The plural of "anecdote" is _not_ "data".
    --- Synchronet 3.20a-Linux NewsLink 1.114