Re: find words that contains some specific letters

comp.lang.ada
 help / color / mirror / Atom feed

* Re: find words that contains some specific letters
       [not found]                           ` <78rpbkF1mqrc7U1@mid.individual.net>
@ 2009-06-06 16:13                             ` John B. Matthews
  2009-06-06 16:22                               ` Patricia Shanahan
                                                 ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: John B. Matthews @ 2009-06-06 16:13 UTC (permalink / raw)

In article <78rpbkF1mqrc7U1@mid.individual.net>,
 "Giovanni Azua" <bravegag@hotmail.com> wrote:

> "Patricia Shanahan" <pats@acm.org> wrote in message
> > A perfect hash function is one that guarantees distinct hash codes 
> > for a specific set of keys. There are algorithms for constructing 
> > one, given a set of keys.
>
> For example the Character class.

I think all of the primitive wrappers (except Double) have trivially 
perfect integer hashes, as they wrap distinct values. Patricia was 
perhaps referring to algorithms of the kind discussed here:

<http://en.wikipedia.org/wiki/Perfect_hash_function>

The jumble algorithm relies on a hashed map for efficiency, but a
perfect hash is not essential. I recently implemented the algorithm
in both Java and Ada using library routines:

<http://en.wikipedia.org/wiki/Jumble>
<http://sites.google.com/site/drjohnbmatthews/jumble>
<http://home.roadrunner.com/~jbmatthews/jumble.html>

I was intrigued to compare the two languages' library implementations 
of a general purpose string hash. Using a multiplicative function with 
an initial value of zero, both offer excellent performance. Java uses a 
multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of 
eight on an unsigned (modular) value whose size is implementation 
defined (e.g. mod 2**32):

<http://www.docjar.org/html/api/java/lang/String.java.html>
<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>

[Cross-posted to comp.lang.java.programmer and comp.lang.ada;
Followup-To header not set.]

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: find words that contains some specific letters
  2009-06-06 16:13                             ` find words that contains some specific letters John B. Matthews
@ 2009-06-06 16:22                               ` Patricia Shanahan
  2009-06-06 20:42                                 ` Giovanni Azua
  2009-06-06 17:11                               ` Mark Space
  2009-06-06 18:23                               ` Lew
  2 siblings, 1 reply; 8+ messages in thread
From: Patricia Shanahan @ 2009-06-06 16:22 UTC (permalink / raw)


John B. Matthews wrote:
> In article <78rpbkF1mqrc7U1@mid.individual.net>,
>  "Giovanni Azua" <bravegag@hotmail.com> wrote:
> 
>> "Patricia Shanahan" <pats@acm.org> wrote in message
>>> A perfect hash function is one that guarantees distinct hash codes 
>>> for a specific set of keys. There are algorithms for constructing 
>>> one, given a set of keys.
>> For example the Character class.
> 
> I think all of the primitive wrappers (except Double) have trivially 
> perfect integer hashes, as they wrap distinct values. Patricia was 
> perhaps referring to algorithms of the kind discussed here:
> 
> <http://en.wikipedia.org/wiki/Perfect_hash_function>

Yes, that's the sort of thing I meant.

The Long hash code is also not perfect, and cannot be, because there are
more longs than ints.

Patricia



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: find words that contains some specific letters
  2009-06-06 16:13                             ` find words that contains some specific letters John B. Matthews
  2009-06-06 16:22                               ` Patricia Shanahan
@ 2009-06-06 17:11                               ` Mark Space
  2009-06-06 18:23                               ` Lew
  2 siblings, 0 replies; 8+ messages in thread
From: Mark Space @ 2009-06-06 17:11 UTC (permalink / raw)

John B. Matthews wrote:

>> For example the Character class.
> 
> I think all of the primitive wrappers (except Double) have trivially 
> perfect integer hashes, as they wrap distinct values. Patricia was 
> perhaps referring to algorithms of the kind discussed here:
> 
> <http://en.wikipedia.org/wiki/Perfect_hash_function>

I remember in school we implemented a minimal perfect hash function on a 
given set of keys:  the keywords in the Pascal language.  We were given 
the parameters of the algorithm: use the length of the keyword, the 
first character and the last character only.

The concept was obviously to show how a fast look-up could be done when 
implementing a compiler, but the hash function itself was interesting at 
the time.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: find words that contains some specific letters
  2009-06-06 16:13                             ` find words that contains some specific letters John B. Matthews
  2009-06-06 16:22                               ` Patricia Shanahan
  2009-06-06 17:11                               ` Mark Space
@ 2009-06-06 18:23                               ` Lew
  2009-06-06 23:58                                 ` Tom Anderson
  2 siblings, 1 reply; 8+ messages in thread
From: Lew @ 2009-06-06 18:23 UTC (permalink / raw)


John B. Matthews wrote:
> The jumble algorithm relies on a hashed map for efficiency, but a

Because it has constant-time performance for get() with the String hash code.

> perfect hash is not essential. I recently implemented the algorithm
> in both Java and Ada using library routines:
> 
> <http://en.wikipedia.org/wiki/Jumble>
> <http://sites.google.com/site/drjohnbmatthews/jumble>
> <http://home.roadrunner.com/~jbmatthews/jumble.html>
> 
> I was intrigued to compare the two languages' library implementations 
> of a general purpose string hash. Using a multiplicative function with 
> an initial value of zero, both offer excellent performance. Java uses a 
> multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of 
> eight on an unsigned (modular) value whose size is implementation 
> defined (e.g. mod 2**32):
> 
> <http://www.docjar.org/html/api/java/lang/String.java.html>
> <http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>

The Java String hash is just about verbatim from Knuth.  It has the virtue of 
being a well-studied algorithm that has been shown to have good distribution 
characteristics.

I'm not so sure about the Ada one.  I haven't studied this in a while, but I 
seem to recall that power-of-two multipliers mod 2**32 don't have as good a 
distribution.  Perhaps someone could correct me on that.

-- 
Lew



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: find words that contains some specific letters
  2009-06-06 16:22                               ` Patricia Shanahan
@ 2009-06-06 20:42                                 ` Giovanni Azua
  2009-06-06 23:43                                   ` Giovanni Azua
  0 siblings, 1 reply; 8+ messages in thread
From: Giovanni Azua @ 2009-06-06 20:42 UTC (permalink / raw)

"Patricia Shanahan" <pats@acm.org> wrote:
>>>> A perfect hash function is one that guarantees distinct hash codes for 
>>>> a specific set of keys. There are algorithms for constructing one, 
>>>> given a set of keys.

"Giovanni Azua" <bravegag@hotmail.com> wrote:
>>> For example the Character class.

"John B. Matthews" <nospam@nospam.invalid> wrote:
>> I think all of the primitive wrappers (except Double) have trivially 
>> perfect integer hashes, as they wrap distinct values. Patricia was 
>> perhaps referring to algorithms of the kind discussed here:
>>
>> <http://en.wikipedia.org/wiki/Perfect_hash_function>

"Patricia Shanahan" <pats@acm.org> wrote:
> Yes, that's the sort of thing I meant.
>
> The Long hash code is also not perfect, and cannot be, because there are
> more longs than ints.
>

To achieve the perfect, optimal and ideal no collisions Hash table "perfect 
Hash table" these are the two factors that play a role:

1-. the "perfect" property of the hash function as defined in the wiki page 
cited above
2-. the size of the elements set

If #2 goes arbitrarily large meaning allocating a hash table of the size of 
that elements set becomes prohibitive memory-wise then it does not really 
matter how perfect the hash function is, there will be collisions anyway and 
an additional search algorithm will be needed to disambiguate the colliding 
elements that fall under the same buckets. This is what I think John 
Matthews hasn't realized yet about his implementation of the Jumble 
algorithm. For a realistic dataset he will be returning elements that do not 
match the of input (String built on top of the sorted character input) but 
matches that happen to collide under the same bucket.

e.g. trying to allocate a Hash table of the size ( Integer.MAX_VALUE - 
Integer.MIN_VALUE) would be prohibitive unless you place unrealistic 
requirements so you have the Integer with a perfect hash function but you 
will not get a "perfect Hash table". It is of course worse for the String 
class where neither #1 nor #2 are fulfilled.

Character class offers the perfect hash function but since the elements set 
is discrete i.e. 26 elements http://en.wikipedia.org/wiki/English_alphabet 
you can build a perfect Hash structure like the one I described and edited 
by Tom McGlynn.

Best regards,
Giovanni 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: find words that contains some specific letters
  2009-06-06 20:42                                 ` Giovanni Azua
@ 2009-06-06 23:43                                   ` Giovanni Azua
  0 siblings, 0 replies; 8+ messages in thread
From: Giovanni Azua @ 2009-06-06 23:43 UTC (permalink / raw)



"Giovanni Azua" <bravegag@hotmail.com> wrote in message
> elements that fall under the same buckets. This is what I think John 
> Matthews hasn't realized yet about his implementation of the Jumble 
> algorithm. For a realistic dataset he will be returning elements that do 
> not match the of input (String built on top of the sorted character input) 
> but matches that happen to collide under the same bucket.
>
I am sorry, this is wrong. Matthews's Jumble implementation is correct and 
does not have the problem I described here.

Best regards,
Giovanni 





^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: find words that contains some specific letters
  2009-06-06 18:23                               ` Lew
@ 2009-06-06 23:58                                 ` Tom Anderson
  2009-06-07 19:30                                   ` John B. Matthews
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Anderson @ 2009-06-06 23:58 UTC (permalink / raw)

On Sat, 6 Jun 2009, Lew wrote:

> John B. Matthews wrote:
>> The jumble algorithm relies on a hashed map for efficiency, but a
>> perfect hash is not essential. I recently implemented the algorithm
>> in both Java and Ada using library routines:
>> 
>> <http://en.wikipedia.org/wiki/Jumble>
>> <http://sites.google.com/site/drjohnbmatthews/jumble>
>> <http://home.roadrunner.com/~jbmatthews/jumble.html>
>> 
>> I was intrigued to compare the two languages' library implementations of a 
>> general purpose string hash. Using a multiplicative function with an 
>> initial value of zero, both offer excellent performance. Java uses a 
>> multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of eight 
>> on an unsigned (modular) value whose size is implementation defined (e.g. 
>> mod 2**32):
>> 
>> <http://www.docjar.org/html/api/java/lang/String.java.html>
>> <http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>

John, i think you've missed a trick there - doesn't Rotate_Left do a 
barrel shift, where the bits that fall off the left-hand end get shoved 
back in at the right, rather than a plain shift? If so, the basic 
algebraic structure is actually a bit different to java's hash.

> The Java String hash is just about verbatim from Knuth.  It has the virtue of 
> being a well-studied algorithm that has been shown to have good distribution 
> characteristics.
>
> I'm not so sure about the Ada one.  I haven't studied this in a while, 
> but I seem to recall that power-of-two multipliers mod 2**32 don't have 
> as good a distribution.  Perhaps someone could correct me on that.

If Rotate_Left is a plain shift, then the Ada hash is based only on the 
last 10 characters of the string. Multiplying by eight is like shifting 
left three places; modding by 2*32 is like anding with 0xffffffff (if 
you'll pardon my verbs). When you add in a character, the rightmost bit 
you can affect is bit 0, the LSB; after ten shifts of three bits, that is 
now at bit thirty, and after one more, it's at bit thirty-three, and has 
vanished.

If Rotate_Left is a barrel shift, then our bit of interest doesn't vanish, 
because it comes back to the right-hand end of the hash after the eleventh 
shift in position 1. Because 3 is coprime with 32, it will then proceed to 
make a complete circuit of all positions in the hash over the following 
twenty-one shifts, winding up back in position 0 after the thirty-second. 
I have a vague worry that this means that strings built from the same 
32-character blocks arranged in any order will have the same hash, but i 
suspect that this is not the case, because of carries. My maths isn't up 
to proving it, though.

I still think this is a shoddy hash. It has no diffusion: if you hash two 
strings which differ at one character position, the difference between the 
hashes will always be confined to the sixteen bits which correspond to 
where that character ended up in the rotating hash. Java's 
multiply-and-mod approach quickly spreads the bits of each character out 
over the whole hash.

tom

-- 
Science of a sufficiently advanced form is indistinguishable from magic

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: find words that contains some specific letters
  2009-06-06 23:58                                 ` Tom Anderson
@ 2009-06-07 19:30                                   ` John B. Matthews
  0 siblings, 0 replies; 8+ messages in thread
From: John B. Matthews @ 2009-06-07 19:30 UTC (permalink / raw)


In article <alpine.DEB.1.10.0906070035110.5405@urchin.earth.li>,
 Tom Anderson <twic@urchin.earth.li> wrote:

> > John B. Matthews wrote:
> >> The jumble algorithm relies on a hashed map for efficiency, but a 
> >> perfect hash is not essential. I recently implemented the 
> >> algorithm in both Java and Ada using library routines:
> >> 
> >> <http://en.wikipedia.org/wiki/Jumble>
> >> <http://sites.google.com/site/drjohnbmatthews/jumble>
> >> <http://home.roadrunner.com/~jbmatthews/jumble.html>
> >> 
> >> I was intrigued to compare the two languages' library 
> >> implementations of a general purpose string hash. Using a 
> >> multiplicative function with an initial value of zero, both offer 
> >> excellent performance. Java uses a multiplier of 31 on a signed 
> >> 32-bit integer; Ada uses a multiplier of eight on an unsigned 
> >> (modular) value whose size is implementation defined (e.g. mod 
> >> 2**32):
> >> 
> >> <http://www.docjar.org/html/api/java/lang/String.java.html>
> >> <http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>
> 
> John, i think you've missed a trick there - doesn't Rotate_Left do a 
> barrel shift, where the bits that fall off the left-hand end get 
> shoved back in at the right, rather than a plain shift? If so, the 
> basic algebraic structure is actually a bit different to java's hash.

D'oh, you're absolutely right. I overlooked this when I was tinkering 
with my own hash function, too. Thank you for looking at this.

[...]

> I still think this is a shoddy hash.

Well, it's a legacy from GNAT.HTable.Hash, but it does very well in the 
context of Ada.Containers.Hashed_Maps. Java uses a bucket list whose 
size is an integral power of two, while (this implementation of) Ada 
uses the next largest prime. Modeling an Ada collision counter on your 
example <http://urchin.earth.li/~twic/tmp/Collisions.java>, I got these 
results for collision count, number of buckets having that count, and 
the percentage of the total:

$ ./collisions 
 0: 218586 (55.6%)
 1: 126250 (32.1%)
 2: 38432 (9.8%)
 3: 8362 (2.1%)
 4: 1354 (0.3%)
 5: 229 (0.1%)
 6: 24 (0.0%)
 7: 4 (0.0%)
 8: 1 (0.0%)

The code is here: <http://home.roadrunner.com/~jbmatthews/jumble.html>

[Followup-To set to comp.lang.ada.]
-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-06-07 19:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <edaeda6b-3793-48f2-8f8f-6c9701759de5@r33g2000yqn.googlegroups.com>
     [not found] ` <f350aaae-d045-450d-b255-308eb7fb8f8a@e24g2000vbe.googlegroups.com>
     [not found]   ` <d121e6f6-6afa-43e7-bcd1-c9a770e12704@z9g2000yqi.googlegroups.com>
     [not found]     ` <nospam-83CE7B.07382201062009@news.aioe.org>
     [not found]       ` <78i1lbF1m69gkU1@mid.individual.net>
     [not found]         ` <78i2ajF1m4nglU1@mid.individual.net>
     [not found]           ` <h00l40$r5h$1@news.albasani.net>
     [not found]             ` <78i4i2F1magkfU1@mid.individual.net>
     [not found]               ` <3f1d007f-bcae-42b4-afb0-215b18f51b9c@n21g2000vba.googlegroups.com>
     [not found]                 ` <78ib19F1mfh7qU1@mid.individual.net>
     [not found]                   ` <3b519936-3db9-4dad-85ba-371fa4b29c8f@z5g2000vba.googlegroups.com>
     [not found]                     ` <78quiaF1n95fsU1@mid.individual.net>
     [not found]                       ` <h09q6f$41t$1@news.albasani.net>
     [not found]                         ` <MJWdnUzVI6hADrXXnZ2dnUVZ_q2dnZ2d@earthlink.com>
     [not found]                           ` <78rpbkF1mqrc7U1@mid.individual.net>
2009-06-06 16:13                             ` find words that contains some specific letters John B. Matthews
2009-06-06 16:22                               ` Patricia Shanahan
2009-06-06 20:42                                 ` Giovanni Azua
2009-06-06 23:43                                   ` Giovanni Azua
2009-06-06 17:11                               ` Mark Space
2009-06-06 18:23                               ` Lew
2009-06-06 23:58                                 ` Tom Anderson
2009-06-07 19:30                                   ` John B. Matthews

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox