run as...
awk -f uniqueChars.awk
output...
Input string: Mary had a little lamb who's fleece was white as snow...
Unique chars: Mary hdlitembwo'sfcn.
script...
BEGIN {
a = "Mary had a little lamb who's fleece was white as snow..."
b = uniqueChars(a)
print "Input string: " a
print "Unique chars: " b
}
function uniqueChars(str, x, y, c, tmp, uniqueStr) {
y = length(str)
uniqueStr = ""
delete tmp # clear array for each new string
while(++x <= y) {
c = substr(str, x, 1)
if (!(c in tmp)) {
uniqueStr = uniqueStr c
tmp[c]
}
}
return uniqueStr
}
If you want to avoid the substr() function calls in the loop and don't
mind using non-standard (GNU Awk) features you can also use split():
function uniqueChars (t, s, n, i, c, o, seen)
{
delete seen
n = split (t, s, "")
for (i=1; i<=n; i++)
if (!seen[c = s[i]]++)
o = o c
return o
}
{
printf "In:\t%s\nOut:\t%s\n", $0, uniqueChars($0)
}
(Just for a variant.)
Janis
You can use `index()` to avoid adding the same char multiple times to the
result string. That may or may not be faster than using an additional
array; I've not benchmarked it (yet?). You may call me Vroomfondel
today.
Reading beyond this line equals signing an implicit "Non Laughing
Agreement". ;-)
Only minimally tested:
function uniqueChars(str,_c,_i,_seen) {
_seen=""
for(_i=1;_i<=length(str);_i++) {
if(0<index(str,_c=substr(str,_i,1)))
if(index(_seen,_c)<1)
_seen=_seen _c
}
return _seen
}
Maybe the assumption that array ops are more expensive than string ops
is wrong and this idea is worse than the original.
Next stop: Recursion!
(Just joking!)
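For what it's worth, the outer `index(str, _c)` test in the function above looks like it can never fail, since `_c` was just taken out of `str`. A trimmed sketch of the same index() idea, which also measures the string once instead of on every iteration, might look like this. The shell wrapper name `uniq_index` and the local variable names are made up for illustration, not from the thread.

```sh
#!/bin/sh
# Sketch only: index() variant with the always-true outer test removed.
# uniq_index is a hypothetical wrapper name, not something from the thread.
uniq_index() {
  awk '
    function uniqueChars(str,    n, i, c, seen) {
        n = length(str)                 # measure once, not per iteration
        seen = ""
        for (i = 1; i <= n; i++) {
            c = substr(str, i, 1)
            if (index(seen, c) == 0)    # c has not been collected yet
                seen = seen c
        }
        return seen
    }
    { print uniqueChars($0) }
  '
}

printf '%s\n' "Mary had a little lamb" | uniq_index
```

Whether scanning the result string with index() beats an array lookup is still untested here; for short inputs it probably doesn't matter either way.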
On 01.10.2023 11:38, Mike Sanders wrote:
run as...
awk -f uniqueChars.awk
output...
Input string: Mary had a little lamb who's fleece was white as snow...
Unique chars: Mary hdlitembwo'sfcn.
script...
BEGIN {
a = "Mary had a little lamb who's fleece was white as snow..."
b = uniqueChars(a)
print "Input string: " a
print "Unique chars: " b
}
function uniqueChars(str, x, y, c, tmp, uniqueStr) {
y = length(str)
uniqueStr = ""
delete tmp # clear array for each new string
while(++x <= y) {
c = substr(str, x, 1)
if (!(c in tmp)) {
uniqueStr = uniqueStr c
tmp[c]
}
}
return uniqueStr
}
The Unique Chars Only discussion of string Op variants -- substr()
split() index() -- reminded me of an interesting note in the most
excellent Awk book, 2nd Edition. They benchmark split() vs substr()
for single character operations and substr() is 40% faster than
split(). I always assumed that split() was faster.
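If you want to check that on your own awk rather than take anyone's word for it, a rough sketch could be two otherwise identical filters, one per string op. The wrapper names `uniq_substr` and `uniq_split` are hypothetical, and the 40% figure is the book's as reported above; other awk implementations may behave differently.

```sh
#!/bin/sh
# Sketch only: the same job done with substr() and with split(), so each
# can be timed separately. Both wrapper names are hypothetical.
uniq_substr() {
  awk '{
    out = ""; split("", seen)           # portable way to empty an array
    n = length($0)
    for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (!seen[c]++) out = out c
    }
    print out
  }'
}

uniq_split() {
  awk '{
    out = ""; split("", seen)
    n = split($0, chars, "")            # empty separator is a non-standard extension
    for (i = 1; i <= n; i++)
        if (!seen[chars[i]]++) out = out chars[i]
    print out
  }'
}

# Both calls should print the same line.
echo "fleece was white as snow" | uniq_substr
echo "fleece was white as snow" | uniq_split
```

To compare speed, feed each function a large file under the shell's `time`, e.g. `time uniq_substr < big.txt > /dev/null`, where `big.txt` stands for whatever test data you have.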
> delete tmp # clear array for each new string
You don't need to do that `delete` - just having "tmp" listed in the
args list will re-init it every time the function is called. Removing
that statement will also make your script portable to awks that don't
support `delete array` (but most, possibly all, modern awks do support
that even though it's technically still undefined behavior).
> while(++x <= y) {
Using a `while` instead of a `for` loop for that makes your code a bit
less clear, a bit more fragile (what if `x` gets set above?), and a bit
harder to maintain (what if in future you need to increment x by 2 every
iteration?). It's not worth saving the few characters over the
traditional `for ( x=1; x<=y; x++ )`.
> c = substr(str, x, 1)
> if (!(c in tmp)) {
Idiomatically that'd be implemented as
if ( !tmp[c]++ ) {
and then you'd remove the `tmp[c]` below, but the array in that case is
almost always named `seen[]` rather than `tmp[]`.
> uniqueStr = uniqueStr c
> tmp[c]
> }
> }
> return uniqueStr
> }
Alternatively, if the order of the characters returned doesn't matter,
you could do:
function uniqueChars(str, x, y, c, tmp, uniqueStr) {
y = length(str)
uniqueStr = ""
for ( x=1; x<=y; x++ ) {
tmp[substr(str,x,1)]
}
for ( c in tmp ) {
uniqueStr = uniqueStr c
}
return uniqueStr
}
I don't expect that to be any faster or anything, it's just different,
but if you have GNU awk then it can be tweaked to:
function uniqueChars(str, x, y, c, tmp, uniqueStr) {
y = length(str)
uniqueStr = ""
for ( x=1; x<=y; x++ ) {
tmp[substr(str,x,1)]
}
PROCINFO["sorted_in"] = "@ind_str_asc"
for ( c in tmp ) {
uniqueStr = uniqueStr c
}
return uniqueStr
}
and then it'll return the unique characters sorted in alphabetic order
which may be useful.
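Putting the earlier suggestions together (no `delete`, a plain `for` loop, and the `!seen[c]++` idiom), an order-preserving version might be sketched as below. The wrapper name `uniq_seen` is made up for the example. Feeding it two lines is a quick way to see that `seen` really does start empty on each call: if it didn't, the second line would come back empty.

```sh
#!/bin/sh
# Sketch only: order-preserving variant using the suggestions above.
# uniq_seen is a hypothetical wrapper name.
uniq_seen() {
  awk '
    # seen is listed as an extra parameter, so it starts out empty on
    # every call and no delete is needed.
    function uniqueChars(str,    x, y, c, seen, uniqueStr) {
        y = length(str)
        for (x = 1; x <= y; x++) {
            c = substr(str, x, 1)
            if (!seen[c]++)             # true only the first time c shows up
                uniqueStr = uniqueStr c
        }
        return uniqueStr
    }
    { print uniqueChars($0) }
  '
}

printf '%s\n' "aabbcc" "abcabc" | uniq_seen
```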