• Unique Characters Only

    From porkchop@porkchop@invalid.foo (Mike Sanders) to comp.lang.awk on Sun Oct 1 09:38:01 2023
    From Newsgroup: comp.lang.awk

    run as...

    awk -f uniqueChars.awk

    output...

    Input string: Mary had a little lamb who's fleece was white as snow...
    Unique chars: Mary hdlitembwo'sfcn.

    script...

    BEGIN {

    a = "Mary had a little lamb who's fleece was white as snow..."
    b = uniqueChars(a)

    print "Input string: " a
    print "Unique chars: " b

    }

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    delete tmp # clear array for each new string

    while(++x <= y) {
    c = substr(str, x, 1)
    if (!(c in tmp)) {
    uniqueStr = uniqueStr c
    tmp[c]
    }
    }

    return uniqueStr

    }
    --
    :wq
    Mike Sanders

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Sun Oct 1 12:05:39 2023
    From Newsgroup: comp.lang.awk

    If you want to avoid the substr() function calls in the loop and don't
    mind using non-standard (GNU Awk) features you can also use split():

    function uniqueChars (t, s, n, i, c, o, seen)
    {
    delete seen
    n = split (t, s, "")
    for (i=1; i<=n; i++)
    if (!seen[c = s[i]]++)
    o = o c

    return o
    }

    {
    printf "In:\t%s\nOut:\t%s\n", $0, uniqueChars($0)
    }


    (Just for a variant.)

    Janis


    On 01.10.2023 11:38, Mike Sanders wrote:
    run as...

    awk -f uniqueChars.awk

    output...

    Input string: Mary had a little lamb who's fleece was white as snow...
    Unique chars: Mary hdlitembwo'sfcn.

    script...

    BEGIN {

    a = "Mary had a little lamb who's fleece was white as snow..."
    b = uniqueChars(a)

    print "Input string: " a
    print "Unique chars: " b

    }

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    delete tmp # clear array for each new string

    while(++x <= y) {
    c = substr(str, x, 1)
    if (!(c in tmp)) {
    uniqueStr = uniqueStr c
    tmp[c]
    }
    }

    return uniqueStr

    }


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From porkchop@porkchop@invalid.foo (Mike Sanders) to comp.lang.awk on Sun Oct 1 10:26:19 2023
    From Newsgroup: comp.lang.awk

    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

    If you want to avoid the substr() function calls in the loop and don't
    mind using non-standard (GNU Awk) features you can also use split():

    function uniqueChars (t, s, n, i, c, o, seen)
    {
    delete seen
    n = split (t, s, "")
    for (i=1; i<=n; i++)
    if (!seen[c = s[i]]++)
    o = o c

    return o
    }

    {
    printf "In:\t%s\nOut:\t%s\n", $0, uniqueChars($0)
    }


    (Just for a variant.)

    Janis

    Janis you are good! =)
    --
    :wq
    Mike Sanders

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From yeti@yeti@tilde.institute to comp.lang.awk on Sun Oct 1 10:28:05 2023
    From Newsgroup: comp.lang.awk

    You can use `ìndex` to avoid adding the same char multiple times to the
    result string. That may or may not be faster than using an additional
    array, I've not benchmarked it (yet?). You may call me Vroomfondel
    today.
    --
    1. Hitchhiker 25: (59) Scarcely pausing for breath, Vroomfondel shouted,
    "We don't demand solid facts! What we demand is a total absence of solid facts. I demand that I may or may not be Vroomfondel!"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vroomfondel@yeti@tilde.institute to comp.lang.awk on Sun Oct 1 10:35:37 2023
    From Newsgroup: comp.lang.awk

    Reading beyond this line equals signing an implicit "Non Laughing
    Agrement". ;-)


    Only minimally tested:

    function uniqueChars(str,_c,_i,_seen) {
    _seen=""
    for(_i=1;_i<=length(str);_i++) {
    if(0<index(str,_c=substr(str,_i,1)))
    if(index(_seen,_c)<1)
    _seen=_seen _c
    }
    return _seen
    }

    Maybe the assumption that array ops are more expensive than string ops
    is wrong and this idea is worse than the original.


    Next stop: Recursion!
    (Just joking!)
    --
    1. Hitchhiker 25: (59) Scarcely pausing for breath, Vroomfondel shouted,
    "We don't demand solid facts! What we demand is a total absence of solid facts. I demand that I may or may not be Vroomfondel!"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From porkchop@porkchop@invalid.foo (Mike Sanders) to comp.lang.awk on Sun Oct 1 11:12:57 2023
    From Newsgroup: comp.lang.awk

    yeti <yeti@tilde.institute> wrote:

    You can use `??ndex` to avoid adding the same char multiple times to the result string. That may or may not be faster than using an additional
    array, I've not benchmarked it (yet?). You may call me Vroomfondel
    today.

    Hey, sounds interesting, please post your complete code.
    --
    :wq
    Mike Sanders

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From porkchop@porkchop@invalid.foo (Mike Sanders) to comp.lang.awk on Sun Oct 1 11:15:25 2023
    From Newsgroup: comp.lang.awk

    Vroomfondel <yeti@tilde.institute> wrote:

    Reading beyond this line equals signing an implicit "Non Laughing
    Agrement". ;-)


    Only minimally tested:

    function uniqueChars(str,_c,_i,_seen) {
    _seen=""
    for(_i=1;_i<=length(str);_i++) {
    if(0<index(str,_c=substr(str,_i,1)))
    if(index(_seen,_c)<1)
    _seen=_seen _c
    }
    return _seen
    }

    Maybe the assumption that array ops are more expensive than string ops
    is wrong and this idea is worse than the original.

    But then again, maybe not. I must study this more (thank you!).

    Next stop: Recursion!
    (Just joking!)

    Chuckle, GNU acronyms come to mind...
    --
    :wq
    Mike Sanders

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From porkchop@porkchop@invalid.foo (Mike Sanders) to comp.lang.awk on Mon Oct 2 05:14:16 2023
    From Newsgroup: comp.lang.awk

    Mike Sanders <porkchop@invalid.foo> wrote:

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    delete tmp # clear array for each new string

    while(++x <= y) {
    c = substr(str, x, 1)
    if (!(c in tmp)) {
    uniqueStr = uniqueStr c
    tmp[c]
    }
    }

    return uniqueStr

    }


    okay, got rid of tmp array...

    function uniqueChars(str, x, y, c, uniqueStr) {

    y = length(str)
    uniqueStr = ""

    while(++x <= y) {
    c = substr(str, x, 1)
    if (index(uniqueStr, c) == 0) uniqueStr = uniqueStr c
    }

    return uniqueStr

    }
    --
    :wq
    Mike Sanders

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From J Naman@jnaman2@gmail.com to comp.lang.awk on Mon Oct 2 10:21:50 2023
    From Newsgroup: comp.lang.awk

    On Sunday, 1 October 2023 at 06:05:43 UTC-4, Janis Papanagnou wrote:
    If you want to avoid the substr() function calls in the loop and don't
    mind using non-standard (GNU Awk) features you can also use split():

    function uniqueChars (t, s, n, i, c, o, seen)
    {
    delete seen
    n = split (t, s, "")
    for (i=1; i<=n; i++)
    if (!seen[c = s[i]]++)
    o = o c

    return o
    }

    {
    printf "In:\t%s\nOut:\t%s\n", $0, uniqueChars($0)
    }


    (Just for a variant.)

    Janis
    On 01.10.2023 11:38, Mike Sanders wrote:
    run as...

    awk -f uniqueChars.awk

    output...

    Input string: Mary had a little lamb who's fleece was white as snow... Unique chars: Mary hdlitembwo'sfcn.

    script...

    BEGIN {

    a = "Mary had a little lamb who's fleece was white as snow..."
    b = uniqueChars(a)

    print "Input string: " a
    print "Unique chars: " b

    }

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    delete tmp # clear array for each new string

    while(++x <= y) {
    c = substr(str, x, 1)
    if (!(c in tmp)) {
    uniqueStr = uniqueStr c
    tmp[c]
    }
    }

    return uniqueStr

    }

    The Unique Chars Only discussion of string Op variants -- substr() split() index() -- reminded me of an interesting note in the most excellent Awk book, 2nd Edition. They benchmark split() vs substr() for single character operations and substr() is 40% faster than split(). I always assumed that split() was faster.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@janis_papanagnou+ng@hotmail.com to comp.lang.awk on Mon Oct 2 20:46:16 2023
    From Newsgroup: comp.lang.awk

    On 02.10.2023 19:21, J Naman wrote:

    The Unique Chars Only discussion of string Op variants -- substr()
    split() index() -- reminded me of an interesting note in the most
    excellent Awk book, 2nd Edition. They benchmark split() vs substr()
    for single character operations and substr() is 40% faster than
    split(). I always assumed that split() was faster.

    In many cases it's not obvious until it is measured in practice!
    Specifically you (usually) have to consider also the data sizes.

    (Also note that the Book may not consider GNU Awk's special case
    of split'ing characters that may perform better or worse.)

    And you may also consider other factors (if the speed difference
    is not that significant, as with the small data set we have here),
    e.g. what construct is better maintainable (with various aspects).
    I'd expect that substr(s,p,1) is extremely fast because that should
    basically be only an indexed access on a linear array[*], and such
    operations are often performed in one CPU cycle. Considering other implementations for this substring from other languages, say, s[p]
    will not give the impression of a performance problem - there's no
    explicit function call but there shouldn't be much difference given
    the primitivity of this operation[**].

    The reason for my split()-based variant was to have no function call
    overheads and algorithmically removing an "invariant" from the loop
    (preparing the data). Note that the access of the split-array might
    also be a [fast] indexed access, or it may be a [fast] hash-array
    access (as I think it is in Awk).

    But that all said you also have to consider that arrays are memory
    consumptive! If you have huge data sets it will be considerable
    worse to use split with arrays than simply index-access a string.
    (It doesn't matter here, but be aware.)

    In short; I don't think that efficiency is here an issue. That all
    considered I'd *not* use the split method - which is, as mentioned,
    not even standard with FS="" - to be on the safe side.

    Janis

    [*] A caveat e.g. in case of variable-width characters in arrays
    used for string representation; it might be more expensive than it
    appears at first glance.

    [**] Performance issues usually arise on higher algorithmic levels.
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Ed Morton@mortonspam@gmail.com to comp.lang.awk on Sun Nov 5 09:11:13 2023
    From Newsgroup: comp.lang.awk

    On 10/1/2023 4:38 AM, Mike Sanders wrote:
    run as...

    awk -f uniqueChars.awk

    output...

    Input string: Mary had a little lamb who's fleece was white as snow... Unique chars: Mary hdlitembwo'sfcn.

    script...

    BEGIN {

    a = "Mary had a little lamb who's fleece was white as snow..."
    b = uniqueChars(a)

    print "Input string: " a
    print "Unique chars: " b

    }

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    delete tmp # clear array for each new string
    You don't need to do that `delete` - just having "tmp" listed in the
    args list will re-init it every time the function is called. Removing
    that statement will also make your script portable to awks than don't
    support `delete array` (but most, possibly all, modern awks do support
    that even though it's technically still undefined behavior).


    while(++x <= y) {
    Using a `while` instead of `for` loop for that makes your code a bit
    less clear, a bit more fragile (what if `x` gets set above?), and a bit
    harder to maintain (what if in future you need to increment x by 2 every iteration?). It's not worth saving the few characters over the
    traditional `for ( x=1; x<=y; x++ )`

    c = substr(str, x, 1)
    if (!(c in tmp)) {
    Idiomatically that'd be implemented as

    if ( !tmp[c]++ ) {

    and then you'd remove the `tmp[c]` below but the array in that case is
    almost always named `seen[]` rather than `tmp[]`.

    uniqueStr = uniqueStr c
    tmp[c]
    }
    }

    return uniqueStr

    }

    Alternatively, if the order of the characters returned doesn't matter,
    you could do:

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    for ( x=1; x<=y; x++ ) {
    tmp[substr(str,x,1)]
    }
    for ( c in tmp ) {
    uniqueStr = uniqueStr c
    }

    return uniqueStr

    }

    I don't expect that to be any faster or anything, it's just different,
    but if you have GNU awk then it can be tweaked to:

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    for ( x=1; x<=y; x++ ) {
    tmp[substr(str,x,1)]
    }
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( c in tmp ) {
    uniqueStr = uniqueStr c
    }

    return uniqueStr

    }

    and then it'll return the unique characters sorted in alphabetic order
    which may be useful.

    Regards,

    Ed.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From porkchop@porkchop@invalid.foo (Mike Sanders) to comp.lang.awk on Mon Nov 6 03:18:19 2023
    From Newsgroup: comp.lang.awk

    Ed Morton <mortonspam@gmail.com> wrote:

    You don't need to do that `delete` - just having "tmp" listed in the
    args list will re-init it every time the function is called. Removing
    that statement will also make your script portable to awks than don't support `delete array` (but most, possibly all, modern awks do support
    that even though it's technically still undefined behavior).

    You know I wondered about that, thought I'd play it safe, but yeah,
    noted: array always created anew, good to know.

    while(++x <= y) {
    Using a `while` instead of `for` loop for that makes your code a bit
    less clear, a bit more fragile (what if `x` gets set above?), and a bit harder to maintain (what if in future you need to increment x by 2 every iteration?).

    Aye.

    It's not worth saving the few characters over the
    traditional `for ( x=1; x<=y; x++ )`

    c = substr(str, x, 1)
    if (!(c in tmp)) {
    Idiomatically that'd be implemented as

    if ( !tmp[c]++ ) {

    and then you'd remove the `tmp[c]` below but the array in that case is almost always named `seen[]` rather than `tmp[]`.

    uniqueStr = uniqueStr c
    tmp[c]
    }
    }

    return uniqueStr

    }

    Alternatively, if the order of the characters returned doesn't matter,
    you could do:

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    for ( x=1; x<=y; x++ ) {
    tmp[substr(str,x,1)]
    }
    for ( c in tmp ) {
    uniqueStr = uniqueStr c
    }

    return uniqueStr

    }

    I don't expect that to be any faster or anything, it's just different,
    but if you have GNU awk then it can be tweaked to:

    function uniqueChars(str, x, y, c, tmp, uniqueStr) {

    y = length(str)
    uniqueStr = ""
    for ( x=1; x<=y; x++ ) {
    tmp[substr(str,x,1)]
    }
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for ( c in tmp ) {
    uniqueStr = uniqueStr c
    }

    return uniqueStr

    }

    and then it'll return the unique characters sorted in alphabetic order
    which may be useful.

    Must add these examples to my notes.
    --
    :wq
    Mike Sanders

    --- Synchronet 3.20a-Linux NewsLink 1.114