Tentative design of markdown parser

comp.lang.ada
 help / color / mirror / Atom feed

From: Natasha Kerensikova <lithiumcat@gmail.com>
Subject: Tentative design of markdown parser
Date: Sat, 28 May 2011 13:22:17 +0000 (UTC)
Date: 2011-05-28T13:22:17+00:00	[thread overview]
Message-ID: <slrniu1to9.i18.lithiumcat@sigil.instinctive.eu> (raw)

Hello,

first I would like to thank everybody who has contributed to my earlier
"Parser interface design" thread, I learned a lot through both what was
written there and the research I did triggered by what was written
there. I had no idea about double dispatch, visitor pattern, extensions
(I used to code as flexibly as possible, but with only "extension
through composition" in mind, like UNIX text utils), etc.

Based on all this and a lot of thought, I came up with a design draft,
and would love got know what you think about it.

Lets consider the following types:

type Token_Root is abstract tagged null record;

type Renderer_Callback is access procedure (
  Token : Token_Root'Class);

type Lexer_Callback is access procedure (
  Source : String;
  Position : in out Positive;
  Renderer: Render_Callback);

type Element_Priority is delta 2.0**(-15) range 0.0 .. 1.0;

type Span_Element is record
  Active_Char : Character;
  Priority : Element_Priority;
  Lexer : Lexer_Callback;
  Renderer : Renderer_Callback);

type Span_Element_List is array (<>) of Span_Element;

Note: I'm not sure whether I'm using correctly terms like "lexer" or
"parser", and Markdown (and probably other lightweight markup languages
as well) doesn't require such fine concepts, which can therefore be more
burden than help. It's not a full-fledged programming language, only a
text-based replacement method that was first implemented with perl
substitutions...

The general idea is to have a parser procedure, which gets as input a
String containing the markdown input text, and a Span_Element_List that
contains all the callbacks (in a real version there would also be a
Block_Element_List, very similar to span-level elements except there is
no Active_Char, and probably other stuff as well -- this is still a
draft).

The parser procedure scans the input for active characters. A
consecutive sequence of inactive characters is considered to be normal
text. Upon encountering an active character, it will go through
Span_Element items with the relevant Active_Char, in decreasing priority
order. Span_Element.Lexer is called, with the complete markdown string
as first parameter (in order to be unlimited for context checking),
along with the position of the relevant active character and the
Renderer taken from the Span_Element.

If the context is good and there is indeed the element recognized by the
particular Lexer, it will create a Token (extended from Token_Root),
call Renderer with it, and update Position before returning. If there is
actually no such token, it returns leaving Position unchanged, which
makes the parser try the next Span_Element.

Here are the design choices and rationales:

It's a parser-controlled design, chosen because the recursive nature of
Markdown means there will be a context stack at some point, and it seems
easier and more reliable to use procedure-local variables and the call
stack rather than implementing my own solution.

I haven't found any satisfactory solution to the double dispatch
problem, so I gave up completely and worked around it by providing the
client only bricks and letting it assemble the dispatch table itself.
This is further motivated by the fact that even though double dispatch
can be grafted on top of Ada OO, there is still the issue of the single
package hierarchy. For example, considering an extension to render into
PDF files, and another unrelated extension to support definition lists
in the input language, where would the code to write definition lists
into PDF would live? I decided to leave it entirely to the client. I
don't remember where, but I remember having seen several other examples
where Ada clients are provided with brick to assemble rather than
ready-to-go solution.

The Active_Char idea comes from the fact that most input characters are
normal text, and calling a callback for each of them would be very
inefficient. On the other hand, there is no equivalent for block-level
syntax, but blocks are much less numerous than characters.

The Priority is there to allow several callbacks for a single active
character. For example in the official Markdown syntax, *foo* is turned
into <em>foo</em> while **foo** is turned into <strong>foo</strong>.
That would translate to both EM and STRONG having '*' as active
character, but STRONG has a higher priority, to avoid <em>*foo*</em>.

Priority is chosen as a fixed point number between 0 and 1 to make it
easier to insert a new Span_Element between two existing ones, while
keeping the advantages of integers. However I'm not sure the particular
fixed-point definition I chose is that good.

Lexer_Callback has an in out Position parameter to inform the parser of
the amount of input text consumed by this token. Moreover, leaving
Position untouched allows the lexer to have stricter definition for what
represents a token than just anything starting with the active
character. For example in Markdown, an opening '*' for emphasis must be
followed by a non-blank character, to prevent e.g. multiplication from
being considered as an open-emphasis marker.

The Token_Root'Class object created by Lexer_Callback contains all the
information extracted from the input. For example:

   Token_Emphasis is new Token_Root with record
     Text : String;
   end record;
   Token_Header is new Token_Root with record
     Level : Positive;
     Text : String;
   end record;
   Token_Ruler is new Token_Root with null record;

Renderer_Callback would then cast the Token it got into the Token_Root
extension it is able to render. This allow to ensure (through a runtime
check) that the Renderer_Callback actually matches the Lexer_Callback it
has been paired with (by the client).

This is actually the only point I don't like about the design: there
should be a way to check at compile time that in a Span_Element the
Renderer_Callback can actually handle what is produced by the
Lexer_Callback. However I can't find any.

Moreover, the output method is still completely absent from the design.
Going for function that return Strings? Or using an in out
Unbounded_String parameter? It might be premature optimization, but this
sounds quite inefficient in term of memory allocation, and misses out
the information that the output text is usually in the same order of
magnitude as the input. Maybe I should go for yet another kind of
elements that can be plugged in, to allow output as String, or into a
preallocated dynamic buffer, or into a Stream, or any other output
possibility that hasn't come to my mind yet. I don't manage to estimate
the consequences in terms for code complexity, maintainability,
readability, flexibility and maybe performance of such a new (probably
tagged) object.

Thanks in advance for your help, comments and remarks,
Natasha

next             reply	other threads:[~2011-05-28 13:22 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-05-28 13:22 Natasha Kerensikova [this message]
2011-05-28 14:40 ` Tentative design of markdown parser Yannick Duchêne (Hibou57)
2011-05-30  8:13   ` Natasha Kerensikova
2011-05-30 13:52 ` Georg Bauhaus
2011-05-30 18:07   ` Natasha Kerensikova

replies disabled

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox