* Re: find words that contains some specific letters
[not found] ` <78rpbkF1mqrc7U1@mid.individual.net>
@ 2009-06-06 16:13 ` John B. Matthews
2009-06-06 16:22 ` Patricia Shanahan
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: John B. Matthews @ 2009-06-06 16:13 UTC (permalink / raw)
In article <78rpbkF1mqrc7U1@mid.individual.net>,
"Giovanni Azua" <bravegag@hotmail.com> wrote:
> "Patricia Shanahan" <pats@acm.org> wrote in message
> > A perfect hash function is one that guarantees distinct hash codes
> > for a specific set of keys. There are algorithms for constructing
> > one, given a set of keys.
>
> For example the Character class.
I think all of the primitive wrappers (except Double) have trivially
perfect integer hashes, as they wrap distinct values. Patricia was
perhaps referring to algorithms of the kind discussed here:
<http://en.wikipedia.org/wiki/Perfect_hash_function>
The jumble algorithm relies on a hashed map for efficiency, but a
perfect hash is not essential. I recently implemented the algorithm
in both Java and Ada using library routines:
<http://en.wikipedia.org/wiki/Jumble>
<http://sites.google.com/site/drjohnbmatthews/jumble>
<http://home.roadrunner.com/~jbmatthews/jumble.html>
I was intrigued to compare the two languages' library implementations
of a general purpose string hash. Using a multiplicative function with
an initial value of zero, both offer excellent performance. Java uses a
multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of
eight on an unsigned (modular) value whose size is implementation
defined (e.g. mod 2**32):
<http://www.docjar.org/html/api/java/lang/String.java.html>
<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>
[Cross-posted to comp.lang.java.programmer and comp.lang.ada;
Followup-To header not set.]
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: find words that contains some specific letters
2009-06-06 16:13 ` find words that contains some specific letters John B. Matthews
@ 2009-06-06 16:22 ` Patricia Shanahan
2009-06-06 20:42 ` Giovanni Azua
2009-06-06 17:11 ` Mark Space
2009-06-06 18:23 ` Lew
2 siblings, 1 reply; 8+ messages in thread
From: Patricia Shanahan @ 2009-06-06 16:22 UTC (permalink / raw)
John B. Matthews wrote:
> In article <78rpbkF1mqrc7U1@mid.individual.net>,
> "Giovanni Azua" <bravegag@hotmail.com> wrote:
>
>> "Patricia Shanahan" <pats@acm.org> wrote in message
>>> A perfect hash function is one that guarantees distinct hash codes
>>> for a specific set of keys. There are algorithms for constructing
>>> one, given a set of keys.
>> For example the Character class.
>
> I think all of the primitive wrappers (except Double) have trivially
> perfect integer hashes, as they wrap distinct values. Patricia was
> perhaps referring to algorithms of the kind discussed here:
>
> <http://en.wikipedia.org/wiki/Perfect_hash_function>
Yes, that's the sort of thing I meant.
The Long hash code is also not perfect, and cannot be, because there are
more longs than ints.
Patricia
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: find words that contains some specific letters
2009-06-06 16:13 ` find words that contains some specific letters John B. Matthews
2009-06-06 16:22 ` Patricia Shanahan
@ 2009-06-06 17:11 ` Mark Space
2009-06-06 18:23 ` Lew
2 siblings, 0 replies; 8+ messages in thread
From: Mark Space @ 2009-06-06 17:11 UTC (permalink / raw)
John B. Matthews wrote:
>> For example the Character class.
>
> I think all of the primitive wrappers (except Double) have trivially
> perfect integer hashes, as they wrap distinct values. Patricia was
> perhaps referring to algorithms of the kind discussed here:
>
> <http://en.wikipedia.org/wiki/Perfect_hash_function>
I remember in school we implemented a minimal perfect hash function on a
given set of keys: the keywords in the Pascal language. We were given
the parameters of the algorithm: use the length of the keyword, the
first character and the last character only.
The concept was obviously to show how a fast look-up could be done when
implementing a compiler, but the hash function itself was interesting at
the time.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: find words that contains some specific letters
2009-06-06 16:13 ` find words that contains some specific letters John B. Matthews
2009-06-06 16:22 ` Patricia Shanahan
2009-06-06 17:11 ` Mark Space
@ 2009-06-06 18:23 ` Lew
2009-06-06 23:58 ` Tom Anderson
2 siblings, 1 reply; 8+ messages in thread
From: Lew @ 2009-06-06 18:23 UTC (permalink / raw)
John B. Matthews wrote:
> The jumble algorithm relies on a hashed map for efficiency, but a
Because it has constant-time performance for get() with the String hash code.
> perfect hash is not essential. I recently implemented the algorithm
> in both Java and Ada using library routines:
>
> <http://en.wikipedia.org/wiki/Jumble>
> <http://sites.google.com/site/drjohnbmatthews/jumble>
> <http://home.roadrunner.com/~jbmatthews/jumble.html>
>
> I was intrigued to compare the two languages' library implementations
> of a general purpose string hash. Using a multiplicative function with
> an initial value of zero, both offer excellent performance. Java uses a
> multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of
> eight on an unsigned (modular) value whose size is implementation
> defined (e.g. mod 2**32):
>
> <http://www.docjar.org/html/api/java/lang/String.java.html>
> <http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>
The Java String hash is just about verbatim from Knuth. It has the virtue of
being a well-studied algorithm that has been shown to have good distribution
characteristics.
I'm not so sure about the Ada one. I haven't studied this in a while, but I
seem to recall that power-of-two multipliers mod 2**32 don't have as good a
distribution. Perhaps someone could correct me on that.
--
Lew
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: find words that contains some specific letters
2009-06-06 16:22 ` Patricia Shanahan
@ 2009-06-06 20:42 ` Giovanni Azua
2009-06-06 23:43 ` Giovanni Azua
0 siblings, 1 reply; 8+ messages in thread
From: Giovanni Azua @ 2009-06-06 20:42 UTC (permalink / raw)
"Patricia Shanahan" <pats@acm.org> wrote:
>>>> A perfect hash function is one that guarantees distinct hash codes for
>>>> a specific set of keys. There are algorithms for constructing one,
>>>> given a set of keys.
"Giovanni Azua" <bravegag@hotmail.com> wrote:
>>> For example the Character class.
"John B. Matthews" <nospam@nospam.invalid> wrote:
>> I think all of the primitive wrappers (except Double) have trivially
>> perfect integer hashes, as they wrap distinct values. Patricia was
>> perhaps referring to algorithms of the kind discussed here:
>>
>> <http://en.wikipedia.org/wiki/Perfect_hash_function>
"Patricia Shanahan" <pats@acm.org> wrote:
> Yes, that's the sort of thing I meant.
>
> The Long hash code is also not perfect, and cannot be, because there are
> more longs than ints.
>
To achieve the perfect, optimal and ideal no collisions Hash table "perfect
Hash table" these are the two factors that play a role:
1-. the "perfect" property of the hash function as defined in the wiki page
cited above
2-. the size of the elements set
If #2 goes arbitrarily large meaning allocating a hash table of the size of
that elements set becomes prohibitive memory-wise then it does not really
matter how perfect the hash function is, there will be collisions anyway and
an additional search algorithm will be needed to disambiguate the colliding
elements that fall under the same buckets. This is what I think John
Matthews hasn't realized yet about his implementation of the Jumble
algorithm. For a realistic dataset he will be returning elements that do not
match the of input (String built on top of the sorted character input) but
matches that happen to collide under the same bucket.
e.g. trying to allocate a Hash table of the size ( Integer.MAX_VALUE -
Integer.MIN_VALUE) would be prohibitive unless you place unrealistic
requirements so you have the Integer with a perfect hash function but you
will not get a "perfect Hash table". It is of course worse for the String
class where neither #1 nor #2 are fulfilled.
Character class offers the perfect hash function but since the elements set
is discrete i.e. 26 elements http://en.wikipedia.org/wiki/English_alphabet
you can build a perfect Hash structure like the one I described and edited
by Tom McGlynn.
Best regards,
Giovanni
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: find words that contains some specific letters
2009-06-06 20:42 ` Giovanni Azua
@ 2009-06-06 23:43 ` Giovanni Azua
0 siblings, 0 replies; 8+ messages in thread
From: Giovanni Azua @ 2009-06-06 23:43 UTC (permalink / raw)
"Giovanni Azua" <bravegag@hotmail.com> wrote in message
> elements that fall under the same buckets. This is what I think John
> Matthews hasn't realized yet about his implementation of the Jumble
> algorithm. For a realistic dataset he will be returning elements that do
> not match the of input (String built on top of the sorted character input)
> but matches that happen to collide under the same bucket.
>
I am sorry, this is wrong. Matthews's Jumble implementation is correct and
does not have the problem I described here.
Best regards,
Giovanni
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: find words that contains some specific letters
2009-06-06 18:23 ` Lew
@ 2009-06-06 23:58 ` Tom Anderson
2009-06-07 19:30 ` John B. Matthews
0 siblings, 1 reply; 8+ messages in thread
From: Tom Anderson @ 2009-06-06 23:58 UTC (permalink / raw)
On Sat, 6 Jun 2009, Lew wrote:
> John B. Matthews wrote:
>> The jumble algorithm relies on a hashed map for efficiency, but a
>> perfect hash is not essential. I recently implemented the algorithm
>> in both Java and Ada using library routines:
>>
>> <http://en.wikipedia.org/wiki/Jumble>
>> <http://sites.google.com/site/drjohnbmatthews/jumble>
>> <http://home.roadrunner.com/~jbmatthews/jumble.html>
>>
>> I was intrigued to compare the two languages' library implementations of a
>> general purpose string hash. Using a multiplicative function with an
>> initial value of zero, both offer excellent performance. Java uses a
>> multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of eight
>> on an unsigned (modular) value whose size is implementation defined (e.g.
>> mod 2**32):
>>
>> <http://www.docjar.org/html/api/java/lang/String.java.html>
>> <http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>
John, i think you've missed a trick there - doesn't Rotate_Left do a
barrel shift, where the bits that fall off the left-hand end get shoved
back in at the right, rather than a plain shift? If so, the basic
algebraic structure is actually a bit different to java's hash.
> The Java String hash is just about verbatim from Knuth. It has the virtue of
> being a well-studied algorithm that has been shown to have good distribution
> characteristics.
>
> I'm not so sure about the Ada one. I haven't studied this in a while,
> but I seem to recall that power-of-two multipliers mod 2**32 don't have
> as good a distribution. Perhaps someone could correct me on that.
If Rotate_Left is a plain shift, then the Ada hash is based only on the
last 10 characters of the string. Multiplying by eight is like shifting
left three places; modding by 2*32 is like anding with 0xffffffff (if
you'll pardon my verbs). When you add in a character, the rightmost bit
you can affect is bit 0, the LSB; after ten shifts of three bits, that is
now at bit thirty, and after one more, it's at bit thirty-three, and has
vanished.
If Rotate_Left is a barrel shift, then our bit of interest doesn't vanish,
because it comes back to the right-hand end of the hash after the eleventh
shift in position 1. Because 3 is coprime with 32, it will then proceed to
make a complete circuit of all positions in the hash over the following
twenty-one shifts, winding up back in position 0 after the thirty-second.
I have a vague worry that this means that strings built from the same
32-character blocks arranged in any order will have the same hash, but i
suspect that this is not the case, because of carries. My maths isn't up
to proving it, though.
I still think this is a shoddy hash. It has no diffusion: if you hash two
strings which differ at one character position, the difference between the
hashes will always be confined to the sixteen bits which correspond to
where that character ended up in the rotating hash. Java's
multiply-and-mod approach quickly spreads the bits of each character out
over the whole hash.
tom
--
Science of a sufficiently advanced form is indistinguishable from magic
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: find words that contains some specific letters
2009-06-06 23:58 ` Tom Anderson
@ 2009-06-07 19:30 ` John B. Matthews
0 siblings, 0 replies; 8+ messages in thread
From: John B. Matthews @ 2009-06-07 19:30 UTC (permalink / raw)
In article <alpine.DEB.1.10.0906070035110.5405@urchin.earth.li>,
Tom Anderson <twic@urchin.earth.li> wrote:
> > John B. Matthews wrote:
> >> The jumble algorithm relies on a hashed map for efficiency, but a
> >> perfect hash is not essential. I recently implemented the
> >> algorithm in both Java and Ada using library routines:
> >>
> >> <http://en.wikipedia.org/wiki/Jumble>
> >> <http://sites.google.com/site/drjohnbmatthews/jumble>
> >> <http://home.roadrunner.com/~jbmatthews/jumble.html>
> >>
> >> I was intrigued to compare the two languages' library
> >> implementations of a general purpose string hash. Using a
> >> multiplicative function with an initial value of zero, both offer
> >> excellent performance. Java uses a multiplier of 31 on a signed
> >> 32-bit integer; Ada uses a multiplier of eight on an unsigned
> >> (modular) value whose size is implementation defined (e.g. mod
> >> 2**32):
> >>
> >> <http://www.docjar.org/html/api/java/lang/String.java.html>
> >> <http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>
>
> John, i think you've missed a trick there - doesn't Rotate_Left do a
> barrel shift, where the bits that fall off the left-hand end get
> shoved back in at the right, rather than a plain shift? If so, the
> basic algebraic structure is actually a bit different to java's hash.
D'oh, you're absolutely right. I overlooked this when I was tinkering
with my own hash function, too. Thank you for looking at this.
[...]
> I still think this is a shoddy hash.
Well, it's a legacy from GNAT.HTable.Hash, but it does very well in the
context of Ada.Containers.Hashed_Maps. Java uses a bucket list whose
size is an integral power of two, while (this implementation of) Ada
uses the next largest prime. Modeling an Ada collision counter on your
example <http://urchin.earth.li/~twic/tmp/Collisions.java>, I got these
results for collision count, number of buckets having that count, and
the percentage of the total:
$ ./collisions
0: 218586 (55.6%)
1: 126250 (32.1%)
2: 38432 (9.8%)
3: 8362 (2.1%)
4: 1354 (0.3%)
5: 229 (0.1%)
6: 24 (0.0%)
7: 4 (0.0%)
8: 1 (0.0%)
The code is here: <http://home.roadrunner.com/~jbmatthews/jumble.html>
[Followup-To set to comp.lang.ada.]
--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2009-06-07 19:30 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <edaeda6b-3793-48f2-8f8f-6c9701759de5@r33g2000yqn.googlegroups.com>
[not found] ` <f350aaae-d045-450d-b255-308eb7fb8f8a@e24g2000vbe.googlegroups.com>
[not found] ` <d121e6f6-6afa-43e7-bcd1-c9a770e12704@z9g2000yqi.googlegroups.com>
[not found] ` <nospam-83CE7B.07382201062009@news.aioe.org>
[not found] ` <78i1lbF1m69gkU1@mid.individual.net>
[not found] ` <78i2ajF1m4nglU1@mid.individual.net>
[not found] ` <h00l40$r5h$1@news.albasani.net>
[not found] ` <78i4i2F1magkfU1@mid.individual.net>
[not found] ` <3f1d007f-bcae-42b4-afb0-215b18f51b9c@n21g2000vba.googlegroups.com>
[not found] ` <78ib19F1mfh7qU1@mid.individual.net>
[not found] ` <3b519936-3db9-4dad-85ba-371fa4b29c8f@z5g2000vba.googlegroups.com>
[not found] ` <78quiaF1n95fsU1@mid.individual.net>
[not found] ` <h09q6f$41t$1@news.albasani.net>
[not found] ` <MJWdnUzVI6hADrXXnZ2dnUVZ_q2dnZ2d@earthlink.com>
[not found] ` <78rpbkF1mqrc7U1@mid.individual.net>
2009-06-06 16:13 ` find words that contains some specific letters John B. Matthews
2009-06-06 16:22 ` Patricia Shanahan
2009-06-06 20:42 ` Giovanni Azua
2009-06-06 23:43 ` Giovanni Azua
2009-06-06 17:11 ` Mark Space
2009-06-06 18:23 ` Lew
2009-06-06 23:58 ` Tom Anderson
2009-06-07 19:30 ` John B. Matthews
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox