Newsgroups: comp.lang.ada
Subject: Re: Hash Type Size
From: Peter Brooks
Date: Mon, 2 Sep 2013 19:31:56 -0700 (PDT)

On Tuesday, 3 September 2013 03:47:52 UTC+2, Randy Brukardt wrote:
>
> So my answer to "what's a good way to hash a private type/access
> value/system.address?" is that no such way exists. The best you can do
> is convert the data to a stream-element array and hash that somehow,
> but that most likely would be a dubious hash (there is a lot of unused
> bits in a typical record, and it would be really easy for those bits to
> end up being most of the result). The whole point is that a hash
> function needs to be tailored to the data that it is hashing.
>
This looks like a good topic for somebody's Ph.D!

I'd imagine that certain algorithms could have fairly broad applicability. In the case of triplestores, the ones used by Wikipedia contain mainly English words and phrases, with an increasing number of entries in other languages. It'd be useful to know whether a hash function that works well for Wikipedia data in English is more or less effective when used on German or Korean.

The large triplestores of genetic data are clearly a different matter, and I can see that tailoring the algorithm there would be important - what, though, do you actually need to know about the data in order to design a hash algorithm?

With RDF, which all triplestores are made of, the question that's occupying me is how much benefit there is in encoding the different parts of the URI. The triples are of the form:

subject - predicate - object

Clearly the '' don't need to be part of the hash. I'll need a mapping both ways, from subject -> predicate -> object and from object -> predicate -> subject; that's what'll be in the binary trees.
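Something like the following is what I have in mind for that two-way mapping - only a sketch, using nested instances of the standard ordered containers (tree-based in the usual implementations), with plain URI strings as keys; the package and variable names are all invented:

   with Ada.Containers.Indefinite_Ordered_Maps;
   with Ada.Containers.Indefinite_Ordered_Sets;

   package Triple_Index is
      --  Objects (or subjects) reachable through one predicate
      package Uri_Sets is new Ada.Containers.Indefinite_Ordered_Sets
        (Element_Type => String);

      --  predicate -> set of objects (or subjects)
      package Predicate_Maps is new Ada.Containers.Indefinite_Ordered_Maps
        (Key_Type     => String,
         Element_Type => Uri_Sets.Set,
         "="          => Uri_Sets."=");

      --  One tree per direction of the mapping
      package Node_Maps is new Ada.Containers.Indefinite_Ordered_Maps
        (Key_Type     => String,
         Element_Type => Predicate_Maps.Map,
         "="          => Predicate_Maps."=");

      Forward  : Node_Maps.Map;   --  subject -> predicate -> object
      Backward : Node_Maps.Map;   --  object  -> predicate -> subject
   end Triple_Index;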
The predicates will be a much smaller set, so an associative array will probably be best - probably a three-way associative array: one keyed on the source of the predicate, 'www.w3.org/1999/02/22-rdf-syntax-ns', another on the particular predicate, 'type', and the third on the whole URI - but with the same hash key for each, as they're all the same relation (see the sketch at the end of this post).

This is a bit of a simplification, as the triples are actually quads (subject, predicate, object, graph), but that doesn't affect the hashing question.
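For that predicate table, roughly this is what I mean, sketched with the standard hashed containers and Ada.Strings.Hash (the names are invented, and I've used a shared id where I said 'same hash key' - each map still hashes its own key string):

   with Ada.Containers.Indefinite_Hashed_Maps;
   with Ada.Strings.Hash;

   procedure Predicate_Table_Sketch is
      type Predicate_Id is new Natural;   --  one id per relation

      package Predicate_Maps is new Ada.Containers.Indefinite_Hashed_Maps
        (Key_Type        => String,
         Element_Type    => Predicate_Id,
         Hash            => Ada.Strings.Hash,
         Equivalent_Keys => "=");

      By_Namespace : Predicate_Maps.Map;   --  source of the predicate
      By_Name      : Predicate_Maps.Map;   --  the particular predicate
      By_Uri       : Predicate_Maps.Map;   --  the whole URI

      Rdf_Type : constant Predicate_Id := 1;
   begin
      --  Three ways in, one id out: all three entries name the same relation.
      By_Namespace.Insert ("www.w3.org/1999/02/22-rdf-syntax-ns", Rdf_Type);
      By_Name.Insert      ("type", Rdf_Type);
      By_Uri.Insert       ("www.w3.org/1999/02/22-rdf-syntax-ns#type", Rdf_Type);
   end Predicate_Table_Sketch;

Whether hashing the three key forms separately, or deriving a single hash shared by all three, pays off is exactly the tuning question, I suppose.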