Greetings folks, just sharing the knowledge (& test posting using google groups).
Beware wordwrap...
function djb2(str, hash, i, char) {
# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)
# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
}
return hash
}
function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}
BEGIN {
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
print djb2("my key is...")
}
On 22.09.2023 21:50, Michael Sanders wrote:
[...]
# modulo simulates 32-bit unsigned integer
The previous comment suggests that the subsequent code should use
the constant 2147483648 if you want a 32-bit unsigned integer. [...]
Greetings folks, just sharing the knowledge (& test posting using google groups).--
Beware wordwrap...
function djb2(str, hash, i, char) {
# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)
# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
}
return hash
}
function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}
BEGIN {
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
print djb2("my key is...")
}
notes:
https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm
https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function
https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk
https://en.wikipedia.org/wiki/Daniel_J._Bernstein
Variable 'n' not declared as local (just for good measure, and inAye, I spotted this *after* I posted (of course...). Updated
case you want to use the code in other code contexts).
Is it guaranteed that the expression evaluates correctly in allHuh? From my initial post in this thread:
awks? The intermediate results can become as large as 70866960411 (0x108000001B, which is larger than 32 bit), which needs to be
representable in awk (not sure what POSIX awk defines/requires
here).
Erm..., _unsigned_ 32 bit integer would of course mean 4294967296That's the max, but certainly not the only possibility...
[...]And yet, this rational assumes an extension to the language,
(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
What will the code produce with, say, "extended" ASCII, or theNo issues with TAB (see my function ord()), & non 7bit character
common ISO 8859-x family of character sets? Is there any reason
to restrict it here?
What are the criteria for the chosen character subset? How will
or how should a TAB character in the data change the result?
As I said, just a couple of more or less obvious questions.Thanks Janis, as always, appreciate your insights. Hoping to get
There is something wrong:--- Synchronet 3.20a-Linux NewsLink 1.114
$ LC_ALL=C gawk 'function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}
BEGIN {print ord("x")}'
gawk: cmd. line:3: error: function `ord' called with space between name
and `(',
or used as a variable or an array
Please note ord[char] vs. ord(char)
HTH
(I would also try to avoid name clashes like ord() and ord[],One more quick question... does gawk have an ord() builtin
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote: [...]
Is it guaranteed that the expression evaluates correctly in all
awks? The intermediate results can become as large as 70866960411
(0x108000001B, which is larger than 32 bit), which needs to be
representable in awk (not sure what POSIX awk defines/requires
here).
Huh? From my initial post in this thread:
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
Modulo wraps at 2147483647, it'll never outgrow
it limits. Maybe I'm not grokking what you meant.
[...]
(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
And yet, this rational assumes an extension to the language,
which implies *only* gawk though right? [...]
What will the code produce with, say, "extended" ASCII, or the
common ISO 8859-x family of character sets? Is there any reason
to restrict it here?
What are the criteria for the chosen character subset? How will
or how should a TAB character in the data change the result?
No issues with TAB (see my function ord()), & non 7bit character
sets? Dunno... only need lower ASCII here, hence the hard-coded
limit:
min: 32, max: 126
If so compelled, test other sets & post your findings here.
Would be interested to read what you come up with.
As I said, just a couple of more or less obvious questions.
Thanks Janis, as always, appreciate your insights. Hoping to get
back to using tin as my news reader soon, the google groups
interface its perhaps not my cup of tea. :/
On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:
(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
One more quick question... does gawk have an ord() builtin
function or not?
Greetings folks, just sharing the knowledge (& test posting using
google groups).
Beware wordwrap...
function djb2(str, hash, i, char) {
# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)
# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
}
return hash
}
function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}--
BEGIN {
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
print djb2("my key is...")
}
This modulus does nor do what you claim you want to do (as has beenNo sir, it does, in fact.
pointed out).
What version of AWK, run with what flags, accepts this? I can't get ithttps://gnuwin32.sourceforge.net/packages/mawk.htm https://frippery.org/busybox/
past gawk, nawk, mawk nor original-awk.
Greetings folks, just sharing the knowledge (& test posting using google groups).Note to self: built a version for gawk & tested online here: https://ideone.com/ZZVQl2
On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:
This modulus does nor do what you claim you want to do (as has been
pointed out).
No sir, it does, in fact.
# modulo simulates 32-bit unsigned integer
On 23.09.2023 14:25, Michael Sanders wrote:
On Saturday, September 23, 2023 at 3:45:10 AM UTC-5, Janis Papanagnou wrote:
One more quick question... does gawk have an ord() builtin
function or not?
Not that I know of. - I seem to recall that when I once needed that
I implemented an array (as you did). - There's a chance that GNU Awk
has something in some extension libraries (but I'm too lazy to check).
Your sample (the fact I pointed out in an earlier post and Ben repeatedThat's not correct & there is is no error using 2147483647, it works just
here) uses x % 2147483647 which doesn't match a "32-bit simulation".
(Mind the differing last digit of modulo and bitwise-and expressions!)See the dbj2gawk() version I posted elsewhere in this thread. But of
Since you don't have such large registers...Exactly, hence the signed int masquerading as an unsigned int.
in case that the prime number _2147483647_ was *intended*...Yes, intended.
Yes, ord() and chr() are included in the gawk distribution as a dynamic extension.--- Synchronet 3.20a-Linux NewsLink 1.114
gawk -l ordchr 'BEGIN{print ord("x")}'
120
HTH
On Saturday, September 23, 2023 at 11:39:04 PM UTC-5, Janis Papanagnou wrote:
Your sample (the fact I pointed out in an earlier post and Ben repeated
here) uses x % 2147483647 which doesn't match a "32-bit simulation".
That's not correct & there is is no error using 2147483647, it works just
as I intended.
But elsewhere... I do think my choice of naming ord()
& ord[] were poor, as these may be intrinsic to some versions of
a given awk's namespace.
(Mind the differing last digit of modulo and bitwise-and expressions!)
See the dbj2gawk() version I posted elsewhere in this thread. But of
course the older awks lack bitwise...
To add something new with this post I want to point out that a modulus
with primes (like 2147483647, a Mersenne prime) is regularly done e.g.
in random number generators or in cryptography. In these areas you may
have large numbers x and a comparable small prime modulus. Since you
don't have such large registers for the x you can iterate over parts
of the number sequentially (using a loop) to perform the operation. [...]
function mod(n, s, l, z, m) {
while (length(s) > 4) {
z = substr(s, 1, 4)
m = z % n
s = m substr(s, 5)
}
return s % n
}
function mod97(s) {
return mod(97, s)
}
checksum = 98 - mod97(iban)72,73c84
ok = mod97(iban) == 1
On Saturday, September 23, 2023 at 4:29:40 PM UTC-5, Ben Bacarisse wrote:
This modulus does nor do what you claim you want to do (as has been
pointed out).
No sir, it does, in fact.
# modulo simulates 32-bit unsigned integer
https://en.wikipedia.org/wiki/2,147,483,647
https://www.bing.com/search?FORM=U523DF&PC=U523&q=is+this+2147483647+a+32-bit+number
https://www.google.com/search?q=is+this+2147483647+a+32-bit+number
What version of AWK, run with what flags, accepts this? I can't get it
past gawk, nawk, mawk nor original-awk.
https://gnuwin32.sourceforge.net/packages/mawk.htm
https://frippery.org/busybox/
On Sunday, September 24, 2023 at 3:27:47 AM UTC-5, Manuel Collado wrote:
Yes it does help. Earnest thanks Manuel!
[...]
On 24.09.2023 23:17, Michael Sanders wrote:
[...]
You are right, we all wasted our time.
In article <ueqe6u$1i163$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 24.09.2023 23:17, Michael Sanders wrote:
[...]
You are right, we all wasted our time.
And you've got no one to blame but yourself.
You & Ben.
It was pretty clear OP wasn't interested in the usual "human compiler" treatment, but you guys went ahead with it anyway.
On Sunday, September 24, 2023 at 2:39:49 PM UTC-5, Ben Bacarisse wrote:
[...]
Ben, listen the number wraps (or folds as stated elsewhere)
to zero at its max using modulo. Thats obvious. I, you, Janis
(& likely others) see that.
Now the thing is, the comment I placed in the script was the
word 'simulates', if you want to fritter away your time debating
that, have at it. Me? Too busy to deal with the snarky attitude.
The scripts I posted, both dbj2() & dbj2gawk() work fine & correctly
on my end & it seems rich to me that you cant get them to run but
otherwise know that I'm wrong.
Hmmm... None of this is the point,
we simply want a large number
(32bit - 1 as is the case) to lesson
the chance of collisions in the hash & that's it...
I cant believe
you & Janis cant see that for what it is. Good grief, I should've
known...
For example, the DJB hash of the string "abcdef" is 4048079738, but yourReader's from the future, re-read the 1st paragraph...
code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
the original one uses the full range of 32-bit unsigned int.
[...]
This is why halving the range by using only 31 bits is not a good idea.
It's not a /terrible/ idea because 31 bits is still a huge range, but
since 32-bit wrapping is free on most hardware, there is absolutely no
upside to using modulo 2147483647 arithmetic for a hash.
And your saying "32bit - 1 as is the case" means you still don't know
what the code you wrote is really doing.
Greetings folks, just sharing the knowledge (& test posting using google groups).For the sake of completeness, a key generator (you supply a 32bit key):
On Sunday, September 24, 2023 at 8:27:50 PM UTC-5, Ben Bacarisse wrote:
For example, the DJB hash of the string "abcdef" is 4048079738, but your
code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
the original one uses the full range of 32-bit unsigned int.
[...]
This is why halving the range by using only 31 bits is not a good idea.
It's not a /terrible/ idea because 31 bits is still a huge range, but
since 32-bit wrapping is free on most hardware, there is absolutely no
upside to using modulo 2147483647 arithmetic for a hash.
And your saying "32bit - 1 as is the case" means you still don't know
what the code you wrote is really doing.
Reader's from the future, re-read the 1st paragraph...
Here's where good ol' Ben has (unwittingly, chuckle) proven, & in his own words mind you, the importance of using of using a unique key number.
What's a key number?It doesn't matter Ben. I do appreciate your
On Monday, September 25, 2023 at 5:50:45 PM UTC-5, Ben Bacarisse wrote:
What's a key number?
It doesn't matter Ben.
OK. I thought you might want to be understood.
function djb2(str, hash, x, y, ascii) {[...]
hash = 5381
y = length(str)
for(x = 1; x <= y; x++) {
ascii = substr(str, x, 1)
hash = (hash * 33 + alphanum(ascii)) % ((2^32) - 1) # 32-bit key or
# hash = (hash * 33 + alphanum(ascii)) % 2147483647 # generated key
}
return hash
}
function alphanum(tmp) {
return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", tmp)
}
Why do you call this "djb2"?
Here's a description of the djb2 hash function from <https://www.eecs.yorku.ca/~oz/hash.html>:
this algorithm (k=33) was first reported by dan bernstein many years
ago in comp.lang.c. another version of this algorithm (now favored
by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the
magic of number 33 (why it works better than many other constants,
prime or not) has never been adequately explained.
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the same thing. Would you agree that calling it "djb2" is misleading?
I find it odd that your function treats all non-alphanumeric characters
as the same. Is there a reason for that?
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Why do you call this "djb2"?
Well, because that is what its called.
Here's a description of the djb2 hash function from
<https://www.eecs.yorku.ca/~oz/hash.html>:
Yes, one of the links in the notes section has a reference to that url.
Did you not read the notes?
[...]I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the
same thing. Would you agree that calling it "djb2" is misleading?
Nope, wont change it either. =)
You mean that's what you decided to call it. I'm asking why.
What you've posted is not djb2. You shouldn't claim that it is. You've
made it clear that you intend to continue to do so.
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
You mean that's what you decided to call it. I'm asking why.
Well, I don't think I have an anwser that will satisfy you.
What you've posted is not djb2. You shouldn't claim that it is. You've
made it clear that you intend to continue to do so.
What do you feel I ought to do to clarify the problem
with this cavet:
It wont be taken down or renamed.
Perhaps add clarification to the document how my take only deals with alphanumeric chars? Let's work it out.
Work it out yourself.
Will do.
porkchop@invalid.foo (Mike Sanders) writes:
[...]
Why do you call this "djb2"?
Here's a description of the djb2 hash function from <https://www.eecs.yorku.ca/~oz/hash.html>:
[...]
I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the same thing. Would you agree that calling it "djb2" is misleading?
I find it odd that your function treats all non-alphanumeric characters
as the same. Is there a reason for that?
Indeed. I as well was misleaded by the impression I've got that he
was trying to implement (just in a wrong way) an existing algorithm.
I noticed that (my) mistake not before I saw two versions of his code
in one post, each using different character set encodings - which of
course would lead to different results even if we compare only the
OP's own implementations' results. This arbitrariness in conjunction
with the inherent flaws the code versions had (and may still have;
I don't bother any more) makes clear that we obviously have a "Not
Invented Here" case (https://en.wikipedia.org/wiki/Not_invented_here),
which most likely makes all in depth discussions void anyway. Glad
it became obvious after all. (Bad that I noticed that too late. And I
see that the OP just recently followed your suggestion to rename his
hash function.)
It's always good to get information about the _intention_ of a piece
of posted code early and with the original post; that way bandworm
threads and misunderstandings could be avoided.
Let's come back to the actual application and application domain ...
Inferred from the OP's posting where he wrote: dbj2("policy_number"))
and "simple a way to obfuscate small chucks of patient records." ...
In statistical medicine context there's demand to anonymize patient
data. For statistical purposes and cause-effect insights it's usually necessary to be able to assign data to an "individual entity" without disclosing his identity. But a hash code does not support that; it
can lead to "ID"-clashes that assigns different results to one same
subject. You need to create (real) keys. - Remember Ben's confusion
about (the OP's misnomer) "key number"! The OP may seem to require
keys but he's creating a hash value.
Of course if a hash value is sufficient - only the OP knows this! -
then I'd have used some existing and proven algorithm instead of
inventing my own (with possible flaws or non-obvious bad properties).
(But this as well has the OP to decide and is not an Awk question
anyway.)
</OT>
Janis
Of course if a hash value is sufficient - only the OP knows this!key = 32bitmax - random_lesser
Sysop: | DaiTengu |
---|---|
Location: | Appleton, WI |
Users: | 920 |
Nodes: | 10 (0 / 10) |
Uptime: | 108:43:51 |
Calls: | 12,190 |
Calls today: | 2 |
Files: | 186,527 |
Messages: | 2,237,607 |