From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=unavailable autolearn_force=no version=3.4.4 X-Received: by 10.180.80.227 with SMTP id u3mr3242571wix.5.1378409422359; Thu, 05 Sep 2013 12:30:22 -0700 (PDT) Path: border1.nntp.ams3.giganews.com!border2.nntp.ams3.giganews.com!border2.nntp.ams2.giganews.com!border4.nntp.ams.giganews.com!nntp.giganews.com!hk9no9995990wib.1!news-out.google.com!ed8ni120020wic.0!nntp.google.com!goblin1!goblin3!goblin.stu.neva.ru!news.mb-net.net!open-news-network.org!aioe.org!.POSTED!not-for-mail From: "John B. Matthews" Newsgroups: comp.lang.ada Subject: Re: Hash Type Size Date: Thu, 05 Sep 2013 15:30:14 -0400 Organization: The Wasteland Message-ID: References: <1679ec49-424b-43bd-8f35-a5f69e658112@googlegroups.com> <7aa26916-cde1-46f8-9f49-d9ebcc2dee93@googlegroups.com> <782ef090-7299-4164-b4e5-14a06d1c1a44@googlegroups.com> <8268e85c-e372-4883-8449-ef5253e2c77e@googlegroups.com> <7xppspkx6d.fsf@ruckus.brouhaha.com> <7x7gexqj9d.fsf@ruckus.brouhaha.com> NNTP-Posting-Host: LQJtZWzu+iKlBROuDg+IUg.user.speranza.aioe.org Mime-Version: 1.0 X-Complaints-To: abuse@aioe.org User-Agent: MT-NewsWatcher/3.5.3b3 (Intel Mac OS X) X-Notice: Filtered by postfilter v. 0.8.2 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Original-Bytes: 3841 Xref: number.nntp.dca.giganews.com comp.lang.ada:183318 Date: 2013-09-05T15:30:14-04:00 List-Id: In article <7x7gexqj9d.fsf@ruckus.brouhaha.com>, Paul Rubin wrote: > Paul Rubin writes: > > It doesn't say how many words are in the dictionary, > Wait, it gives the numbers > > 0: 305643 (77.72%) > 1: 76947 (19.57%) > 2: 9790 (2.49%) > 3: 805 (0.20%) > 4: 56 (0.01%) > 5: 1 (0.00%) > > i.e. there are around 492K words in a 250K slot hash table. Of > course there will be a lot of collisions. If you want fewer > collisions, you need more slots than words. Thank you for commenting on this. I see now that the raw percentages are misleading. The number of slots reserved in the Table array is equal to the sum of the elements in the Counts Map. The number of dictionary words hashed is the sum of the product of each key/element pair in the Counts Map. In the example above, the 393241 slots reserved is much larger than required for the 99171 words in the Ubuntu dictionary; the load factor is just 25%. Focusing on the Counts Map, node 0 is the remaining empty slots in the Table array. Node 1 implies that 1 * 76947 / 99171 = 78% of the words can be retrieved directly, while node 5 implies that just 5 * 1 / 99171 << 1% of the words may require up to four additional operations to retrieve. I've updated the program to adapt the Table size and show a few details. Next to each count is the percentage of words in that occupancy class. Note that node zero is unused table entries; node one is table entries occupied by a single word; node two is table entries occupied by two words; etc. Hashing the same dictionary as above shows the following results. ./collisions /usr/share/dict/words Word count: 99171 Table size: 196613 Load factor: 50.44% 0: 118762 (0.00%) 1: 59770 (60.27%) 2: 15218 (30.69%) 3: 2534 (7.67%) 4: 288 (1.16%) 5: 36 (0.18%) 6: 4 (0.02%) 7: 1 (0.01%) These results were obtained using the larger dictionary on Mac OS X: ./collisions /usr/share/dict/words Word count: 234936 Table size: 393241 Load factor: 59.74% 0: 218461 (0.00%) 1: 126393 (53.80%) 2: 38524 (32.80%) 3: 8228 (10.51%) 4: 1400 (2.38%) 5: 203 (0.43%) 6: 29 (0.07%) 7: 2 (0.01%) 8: 1 (0.00%) -- John B. Matthews trashgod at gmail dot com