From: "Warren W. Gay VE3WWG"
Newsgroups: comp.lang.ada
Subject: Re: A Customer's Request For Open Source Software
Date: Fri, 05 Sep 2003 13:06:02 -0400
Organization: Bell Sympatico

I had written a detailed reply to this, but my crappy Netscape 7 had an
error posting via NNTP and threw the whole reply away (major grumbling
here about that!)
So this will be a very brief repeat of the essential items ;-)

olehjalmar kristensen - Sun Microsystems - Trondheim Norway wrote:
>>>>>> "WWGV" == Warren W Gay VE3WWG writes:
>
> WWGV> You might wonder why buffered "block" devices are not good
> WWGV> enough for the purpose. I can't answer to the specifics, but
> WWGV> only that database engines are in a position to better manage
> WWGV> the cache based upon what they "know" needs to be done. Another
> WWGV> important reason to control caching details is that when a
> WWGV> transaction is committed, you need to guarantee that the
> WWGV> data is written to the disk media (or can be recovered if
> WWGV> the database is to be restarted at that point).
>
> The cache management will typically be the same with both OS files and
> raw devices. The main difference is, as you say, that you have better
> control over the raw device with respect to the layout of your tables,
> so you may be able to optimize disk writes better.

Control is where it is at. With block devices you can only:

1) leave it up to the O/S to decide when to flush out writes
2) call sync(2) and have all dirty blocks written out
3) call fsync(3) and have all of your own blocks related to the file
   descriptor written out.

With a block device, a return from write(2) does not guarantee that your
data has been recorded on the magnetic media. With a raw device, it
should be a guarantee, with the following exceptions:

1) smart-caching IDE-like disk drives
2) EMC-like subsystems with electronic cache

> Most operating systems will allow you to wait until the data are written,
> even if you are using the file system, so there is no difference with
> respect to the durability of data.

With UNIX, a return from write(2) is only a promise that the data will
someday be written; no specific timeline is guaranteed (see above). I
don't expect it is much different with Win32 systems.
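The write-then-fsync discipline above is what a database commit path boils down to. Here is a minimal sketch in Python (the function name and file layout are my own, purely illustrative): write(2) alone only places the data in the OS cache, and fsync makes the kernel push it to the device before returning. As noted above, drive-level write caches can still lie about completion.

```python
import os
import tempfile

def durable_commit(path, record):
    """Append a commit record and force it to stable storage.

    Hypothetical commit routine, sketching the durability point
    discussed above: os.write() alone only puts the bytes in the
    OS page cache; os.fsync() blocks until the kernel reports the
    data on the device.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)   # data is in the OS cache only at this point
        os.fsync(fd)           # now the kernel has pushed it to the device
    finally:
        os.close(fd)

# Usage: append two commit records to a scratch log file.
log = os.path.join(tempfile.mkdtemp(), "commit.log")
durable_commit(log, b"txn-1 committed\n")
durable_commit(log, b"txn-2 committed\n")
```

Without the fsync call, a crash between the write and the kernel's eventual flush would lose the "committed" transaction, which is exactly the guarantee a commit must not break.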
> All portable DBMSs need to do their own cache management, so if you are
> running on files, both the OS and the DBMS cache the same blocks,
> thereby wasting RAM. Also, the replacement strategies may be
> conflicting, resulting in suboptimal performance.

Yes, if the DBMS is expecting a raw device and doing its own caching,
then using a block device (or a file system file) is asking for
double-buffering. Very bad indeed for performance.

> WWGV> The database
> WWGV> must be recoverable at any given point anyhow, and this
> WWGV> usually requires fine-grained control over physical writes
> WWGV> to the media. This aspect and performance means that the
> WWGV> engine must balance performance with reliability (persistence),
> WWGV> which are conflicting goals when using disk.
>
> WWGV> It is this last area where oodles of persistent fast memory
> WWGV> (instead of disk) can make a world of difference. In this
> WWGV> case persistence = memory performance, which of course is
> WWGV> where the win is. If disks became obsolete (one can hope),
> WWGV> then I could see that new database engine design (internals)
> WWGV> will become much different than what it is today. Certainly,
> WWGV> many of the present compromises would be eliminated.
>
> WWGV> --
> WWGV> Warren W. Gay VE3WWG
> WWGV> http://home.cogeco.ca/~ve3wwg
>
> Possibly, but keep in mind that most current DBMSs are already CPU
> bound when it comes to throughput. Fast disks and techniques like
> group commit ensure that the log is rarely a bottleneck, and all you
> need to recover is in the log.

This is primarily where I disagree. I don't dispute that there are fast
disks and subsystems today, but keep in mind two things:

1) There are still oodles of traditional slow-disk-based databases around.
2) The database engines were designed to handle the difficult case of #1.

Designing a new RDBMS would be much simpler if you could ignore the
performance cost of putting data to the magnetic media.
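To make the double-buffering point above concrete from the DBMS side: the engine keeps an internal page cache of roughly this shape (a toy LRU sketch, with arbitrary names and sizes), and when the engine runs on ordinary file system files, the OS page cache holds second copies of these very same blocks.

```python
from collections import OrderedDict

class PageCache:
    """Toy LRU page cache of the kind a DBMS keeps internally.

    Illustrative only. When the engine runs on file system files
    rather than a raw device, the OS caches the same blocks again,
    which is the double-buffering described above.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_no -> page bytes

    def get(self, page_no, read_from_disk):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # mark most-recently-used
            return self.pages[page_no]
        data = read_from_disk(page_no)       # miss: go to the device
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict least-recently-used
        return data
```

The conflicting-replacement-strategy point follows directly: the DBMS evicts by its own LRU order, while the OS cache underneath evicts by its own policy, so neither layer's decisions reflect what the other actually holds.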
But I would venture that all of the popular database vendor offerings
available today were designed to get performance out of slow disk I/O
technologies. Otherwise, they could just insist on this new technology
and make life easier for themselves from a design and enhancement point
of view.

> What you can get with large non-volatile memory is much lower latency
> per transaction, that is, the response time for a single transaction
> can be dramatically lower, even if the throughput in terms of TPS
> stays the same.

It also buys an opportunity to re-design the database engine, because
now you can eliminate:

1) the cost of slow disk I/O
2) the worries about data persistence (no disk write operations need to
   be issued)

This opens the doors to a whole new set of database engine design
dynamics.

> And in case you were wondering if you could do away with the log, the
> answer is yes, if you create some kind of multi-version system. But
> you always need some way of keeping track of the history, so you can
> roll back your changes in case a transaction decides to abort for some
> reason. Actually, one may say that the log IS the database; all the
> rest is there only to give faster access to the latest version.

Going to fast, persistent memory opens a lot of doors indeed.
--
Warren W. Gay VE3WWG
http://home.cogeco.ca/~ve3wwg