From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-0.3 required=5.0 tests=BAYES_00,
	REPLYTO_WITHOUT_TO_CC autolearn=no autolearn_force=no version=3.4.4
X-Google-Thread: 103376,8de7eedad50552f1
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news1.google.com!news1.google.com!proxad.net!proxad.net!newsfeed.arcor.de!news.arcor.de!not-for-mail
From: "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Subject: Re: Ada bench : count words
Newsgroups: comp.lang.ada
User-Agent: 40tude_Dialog/2.0.14.1
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Reply-To: mailbox@dmitry-kazakov.de
Organization: cbb software GmbH
References: <ur7ib38sg.fsf@obry.net>
 <pan.2005.03.19.16.57.03.525542@linuxchip.demon.co.uk.uk.uk>
 <87vf7n5njs.fsf@code-hal.de> <umzsy3c8j.fsf@obry.net>
 <423f5813$0$9224$9b4e6d93@newsread4.arcor-online.net>
 <mailman.47.1111454204.23655.comp.lang.ada@ada-france.org>
Date: Tue, 22 Mar 2005 11:59:43 +0100
Message-ID: <18arnvu705ly4$.1wz6ybz1jt70y$.dlg@40tude.net>
NNTP-Posting-Date: 22 Mar 2005 11:59:41 MET
NNTP-Posting-Host: f809983c.newsread4.arcor-online.net
X-Trace: 
 DXC=@:N?elDKL1KRG:gi]CKmZG:ejgIfPPldDjW\KbG]kaMHdbobRQ:W=WBK=fR<:5ghh@WRXZ37ga[7JncfD5BXcIX@hE?HIfUYM1C
X-Complaints-To: abuse@arcor.de
Xref: g2news1.google.com comp.lang.ada:9719
Date: 2005-03-22T11:59:41+01:00
List-Id: <comp.lang.ada>

On Tue, 22 Mar 2005 01:16:09 +0000, Marius Amado Alves wrote:

> I took a shot at the count-words benchmark, a program to count lines, 
> words and characters. The Ada program currently published there is 
> broken. My program is correct and portable but:
> 
> - the speed is circa 1/3 of the GCC C version
> 
> - it fails to comply with the requirement that the input be taken from 
> standard input. To implement buffering, I have resorted to 
> Ada.Direct_IO, which I think cannot apply to standard input.

Is Text_IO that bad?

> Can you help with any of these points? Thanks. The complete program 
> follows.
> 
> -- Count words in Ada for the language shootout
> -- by Marius Amado Alves
> 
> with Ada.Characters.Handling;
> with Ada.Characters.Latin_1;
> with Ada.Command_Line;
> with Ada.Direct_IO;
> with Ada.Strings.Fixed;
> with Ada.Text_IO;
> 
> procedure Count_Words is
> 
>     use Ada.Characters.Handling;
>     use Ada.Characters.Latin_1;
>     use Ada.Command_Line;
> 
>     Filename : String := Argument (1);
>     Buffer_Size : constant := 4096;
>     EOF : Character := FS;
>     EOL : String := (1 => LF);
>     Lines : Natural := 0;
>     Words : Natural := 0;
>     Total : Natural := 0;
>     In_Word : Boolean := False;
> 
>     function Is_Separator (C : Character) return Boolean is
>     begin
>        return Is_Control (C) or C = ' ';
>     end;

Why don't you use character map here?
 
>     procedure Start_Word is
>     begin
>        In_Word := True;
>     end;
> 
>     procedure Finish_Word is
>     begin
>        Words := Words + 1;
>        In_Word := False;
>     end;
> 
>     procedure Process (S : in String) is
>     begin
>        Lines := Lines + Ada.Strings.Fixed.Count (S, EOL);

Isn't it an extra pass? I think you should do parsing using FSM. Character
classes are: EOL, delimiter, letter. It is either two character map tests
or one case statement. I don't know what is faster. Probably you should
test both.

>        for I in S'Range loop
>           if Is_Separator (S (I)) then
>              if In_Word then Finish_Word; end if;
>           else
>              if not In_Word then Start_Word; end if;
>           end if;
>        end loop;
>     end;
>
> begin
>     declare
>        package Character_IO is new Ada.Direct_IO (Character);
>        use Character_IO;
>        File : File_Type;
>     begin
>        Open (File, In_File, Filename);
>        Total := Natural (Size (File));
>        Close (File);
>     end;
> 
>     declare
>        subtype Buffer_Type is String (1 .. Buffer_Size);
>        package Buffer_IO is new Ada.Direct_IO (Buffer_Type);
>        use Buffer_IO;
>        File : File_Type;
>        S : Buffer_Type;
>     begin
>        Open (File, In_File, Filename);
>        for I in 1 .. Total / Buffer_Size loop
>           Read (File, S);
>           Process (S);
>        end loop;
>        Close (File);
>     end;
> 
>     declare
>        subtype Rest_Type is String (1 .. Total rem Buffer_Size);
>        package Character_IO is new Ada.Direct_IO (Character);
>        use Character_IO;
>        File : File_Type;
>        S : Rest_Type;
>     begin
>        Open (File, In_File, Filename);
>        Set_Index (File, Count (Total - S'Length));
>        for I in 1 .. S'Length loop
>           Read (File, S (I));
>        end loop;
>        Close (File);
>        Process (S);
>     end;
> 
>     if In_Word then Finish_Word; end if;
> 
>     Ada.Text_IO.Put_Line
>       (Natural'Image (Lines) &
>        Natural'Image (Words) &
>        Natural'Image (Total));
> end;

P.S. For word frequencies: gcc version uses hash + sort. I wonder if binary
trees could be better here. Or even sorted arrays for simplicity, there is
no item deletion, search is more often than insert... Of course, for tree
node allocation one could use a stack pool instead of heap.

(That's one of the weaknesses of this contest. Actually the method should
have been specified)

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de