Tentative design of markdown parser

comp.lang.ada
 help / color / mirror / Atom feed

* Tentative design of markdown parser
@ 2011-05-28 13:22 Natasha Kerensikova
  2011-05-28 14:40 ` Yannick Duchêne (Hibou57)
  2011-05-30 13:52 ` Georg Bauhaus
  0 siblings, 2 replies; 5+ messages in thread
From: Natasha Kerensikova @ 2011-05-28 13:22 UTC (permalink / raw)


Hello,

first I would like to thank everybody who has contributed to my earlier
"Parser interface design" thread, I learned a lot through both what was
written there and the research I did triggered by what was written
there. I had no idea about double dispatch, visitor pattern, extensions
(I used to code as flexibly as possible, but with only "extension
through composition" in mind, like UNIX text utils), etc.

Based on all this and a lot of thought, I came up with a design draft,
and would love got know what you think about it.

Lets consider the following types:



type Token_Root is abstract tagged null record;

type Renderer_Callback is access procedure (
  Token : Token_Root'Class);

type Lexer_Callback is access procedure (
  Source : String;
  Position : in out Positive;
  Renderer: Render_Callback);

type Element_Priority is delta 2.0**(-15) range 0.0 .. 1.0;

type Span_Element is record
  Active_Char : Character;
  Priority : Element_Priority;
  Lexer : Lexer_Callback;
  Renderer : Renderer_Callback);

type Span_Element_List is array (<>) of Span_Element;



Note: I'm not sure whether I'm using correctly terms like "lexer" or
"parser", and Markdown (and probably other lightweight markup languages
as well) doesn't require such fine concepts, which can therefore be more
burden than help. It's not a full-fledged programming language, only a
text-based replacement method that was first implemented with perl
substitutions...

The general idea is to have a parser procedure, which gets as input a
String containing the markdown input text, and a Span_Element_List that
contains all the callbacks (in a real version there would also be a
Block_Element_List, very similar to span-level elements except there is
no Active_Char, and probably other stuff as well -- this is still a
draft).

The parser procedure scans the input for active characters. A
consecutive sequence of inactive characters is considered to be normal
text. Upon encountering an active character, it will go through
Span_Element items with the relevant Active_Char, in decreasing priority
order. Span_Element.Lexer is called, with the complete markdown string
as first parameter (in order to be unlimited for context checking),
along with the position of the relevant active character and the
Renderer taken from the Span_Element.

If the context is good and there is indeed the element recognized by the
particular Lexer, it will create a Token (extended from Token_Root),
call Renderer with it, and update Position before returning. If there is
actually no such token, it returns leaving Position unchanged, which
makes the parser try the next Span_Element.



Here are the design choices and rationales:

It's a parser-controlled design, chosen because the recursive nature of
Markdown means there will be a context stack at some point, and it seems
easier and more reliable to use procedure-local variables and the call
stack rather than implementing my own solution.

I haven't found any satisfactory solution to the double dispatch
problem, so I gave up completely and worked around it by providing the
client only bricks and letting it assemble the dispatch table itself.
This is further motivated by the fact that even though double dispatch
can be grafted on top of Ada OO, there is still the issue of the single
package hierarchy. For example, considering an extension to render into
PDF files, and another unrelated extension to support definition lists
in the input language, where would the code to write definition lists
into PDF would live? I decided to leave it entirely to the client. I
don't remember where, but I remember having seen several other examples
where Ada clients are provided with brick to assemble rather than
ready-to-go solution.

The Active_Char idea comes from the fact that most input characters are
normal text, and calling a callback for each of them would be very
inefficient. On the other hand, there is no equivalent for block-level
syntax, but blocks are much less numerous than characters.

The Priority is there to allow several callbacks for a single active
character. For example in the official Markdown syntax, *foo* is turned
into <em>foo</em> while **foo** is turned into <strong>foo</strong>.
That would translate to both EM and STRONG having '*' as active
character, but STRONG has a higher priority, to avoid <em>*foo*</em>.

Priority is chosen as a fixed point number between 0 and 1 to make it
easier to insert a new Span_Element between two existing ones, while
keeping the advantages of integers. However I'm not sure the particular
fixed-point definition I chose is that good.

Lexer_Callback has an in out Position parameter to inform the parser of
the amount of input text consumed by this token. Moreover, leaving
Position untouched allows the lexer to have stricter definition for what
represents a token than just anything starting with the active
character. For example in Markdown, an opening '*' for emphasis must be
followed by a non-blank character, to prevent e.g. multiplication from
being considered as an open-emphasis marker.

The Token_Root'Class object created by Lexer_Callback contains all the
information extracted from the input. For example:

   Token_Emphasis is new Token_Root with record
     Text : String;
   end record;
   Token_Header is new Token_Root with record
     Level : Positive;
     Text : String;
   end record;
   Token_Ruler is new Token_Root with null record;

Renderer_Callback would then cast the Token it got into the Token_Root
extension it is able to render. This allow to ensure (through a runtime
check) that the Renderer_Callback actually matches the Lexer_Callback it
has been paired with (by the client).

This is actually the only point I don't like about the design: there
should be a way to check at compile time that in a Span_Element the
Renderer_Callback can actually handle what is produced by the
Lexer_Callback. However I can't find any.

Moreover, the output method is still completely absent from the design.
Going for function that return Strings? Or using an in out
Unbounded_String parameter? It might be premature optimization, but this
sounds quite inefficient in term of memory allocation, and misses out
the information that the output text is usually in the same order of
magnitude as the input. Maybe I should go for yet another kind of
elements that can be plugged in, to allow output as String, or into a
preallocated dynamic buffer, or into a Stream, or any other output
possibility that hasn't come to my mind yet. I don't manage to estimate
the consequences in terms for code complexity, maintainability,
readability, flexibility and maybe performance of such a new (probably
tagged) object.


Thanks in advance for your help, comments and remarks,
Natasha



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tentative design of markdown parser
  2011-05-28 13:22 Tentative design of markdown parser Natasha Kerensikova
@ 2011-05-28 14:40 ` Yannick Duchêne (Hibou57)
  2011-05-30  8:13   ` Natasha Kerensikova
  2011-05-30 13:52 ` Georg Bauhaus
  1 sibling, 1 reply; 5+ messages in thread
From: Yannick Duchêne (Hibou57) @ 2011-05-28 14:40 UTC (permalink / raw)


Le Sat, 28 May 2011 15:22:17 +0200, Natasha Kerensikova  
<lithiumcat@gmail.com> a écrit:
> type Element_Priority is delta 2.0**(-15) range 0.0 .. 1.0;
>
> type Span_Element is record
>   Active_Char : Character;
>   Priority : Element_Priority;
>   Lexer : Lexer_Callback;
>   Renderer : Renderer_Callback);
I've the Rationals part which I may read later, and did not already  
searched the web for what Markdown is, so I may missed some point.

What is Element_Priority ? A kind of Order-of-precedence ?

Span_Element is a structure, with no constructor ? What returns  
instance(s) of Span_Element ?

> type Lexer_Callback is access procedure (
>   Source : String;
>   Position : in out Positive;
>   Renderer: Render_Callback);
Lexer_Callback get a parameter of type String, but nowhere I can see a  
structure holding a type String. Seems the interface does not show all the  
relations between all the stuffs.

Position is in/out, so something should hold at least a corresponding  
element of the same type Positive, which would be mutable. I could not see  
such a thing. Same as the above comment, I feel the interface does not  
show some relations which are supposed to be “guessed”.

Sorry if my questions seems stupid, as I suppose everything looks very  
clear in your own mind.

-- 
“Syntactic sugar causes cancer of the semi-colons.”  [Epigrams on  
Programming — Alan J. — P. Yale University]
“Structured Programming supports the law of the excluded muddle.” [Idem]
“c++; /* this makes c bigger but returns the old value */” [Anonymous]



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tentative design of markdown parser
  2011-05-28 14:40 ` Yannick Duchêne (Hibou57)
@ 2011-05-30  8:13   ` Natasha Kerensikova
  0 siblings, 0 replies; 5+ messages in thread
From: Natasha Kerensikova @ 2011-05-30  8:13 UTC (permalink / raw)

Hello,

On 2011-05-28, Yannick Duchêne <yannick_duchene@yahoo.fr> wrote:
> Le Sat, 28 May 2011 15:22:17 +0200, Natasha Kerensikova  
><lithiumcat@gmail.com> a écrit:
>> type Element_Priority is delta 2.0**(-15) range 0.0 .. 1.0;
>>
>> type Span_Element is record
>>   Active_Char : Character;
>>   Priority : Element_Priority;
>>   Lexer : Lexer_Callback;
>>   Renderer : Renderer_Callback);
> I've the Rationals part which I may read later, and did not already  
> searched the web for what Markdown is, so I may missed some point.

I didn't consider deep understanding of Markdown as a prerequisite, but
it might be useful (and should be enough) to know that it's a markup of
raw text supposed to be as close as possible as e-mail conventions, like
using enclosing *stars* or _underscores_ as emphasis, lines starting
with '>' to denote blockquotes, etc.

> What is Element_Priority ? A kind of Order-of-precedence ?

I guess kind of order-of-precedence. More precisely, some Lexer_Callback
are less specific than others (and in the implementation of Markdown I
have in mind, there is even a catch-all lexer that produces tokens out
of anything that hasn't been recognized by something else), so I felt
the need of a way to order them from the most specific to the least
specific, and without any information about Lexer_Callback, it has to be
done manually.

> Span_Element is a structure, with no constructor ? What returns  
> instance(s) of Span_Element ?

Nothing. I thought of Span_Element as just a bunch of variables,
initialized by the client and provided (read-only) to the library
subprograms.

>> type Lexer_Callback is access procedure (
>>   Source : String;
>>   Position : in out Positive;
>>   Renderer: Render_Callback);
> Lexer_Callback get a parameter of type String, but nowhere I can see a  
> structure holding a type String. Seems the interface does not show all the  
> relations between all the stuffs.

What you're probably missing is the main subprogram of the library (I
only written types). I don't have a clear idea of what it would look
like, but probably something similar to

function Parse (
  Source : String;
  Parser_Description : Span_Element_List)
return String;

The data flow is the client filling the array of Span_Element, and
sending it along with the input text (as String) to the Parser function.

The parser function scans the Source, and that internal scan position is
what would become the Position parameter passed to Lexer_Callback.

Then the Lexer_Callback crafts a Token object, sent to the
Render_Callback, which produces a text output of some yet-undefined form
(maybe a String or something more complex and better suited). The text
output goes up the stack frames from Render_Callback to Lexer_Callback
to Parse where it's accumulated, and eventually the whole processed text
is returned (as String in the above declaration, but it might as well be
using a better suited text-holding type).

> Position is in/out, so something should hold at least a corresponding  
> element of the same type Positive, which would be mutable. I could not see  
> such a thing. Same as the above comment, I feel the interface does not  
> show some relations which are supposed to be “guessed”.

Indeed, I've never meant that type list to provide enough information.
The "guessed" part is supposed to come from the description of how it
works (which is immediately before the actual rationale).

> Sorry if my questions seems stupid, as I suppose everything looks very  
> clear in your own mind.

Not really stupid, you just used a part of my post that I didn't mean to
be standalone. I don't feel the design to be mature enough in my head to
actually write a specification file (I have a clear enough sketch in my
head, but I'm missing some Ada type details to reach that point, and I
was sort-of hoping to get help from the group in figuring out these
details). But when that's the case, that won't be a bunch of types like
my post, but a well-commented standalone Ada text.

Thanks for your comment,
Natasha

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tentative design of markdown parser
  2011-05-28 13:22 Tentative design of markdown parser Natasha Kerensikova
  2011-05-28 14:40 ` Yannick Duchêne (Hibou57)
@ 2011-05-30 13:52 ` Georg Bauhaus
  2011-05-30 18:07   ` Natasha Kerensikova
  1 sibling, 1 reply; 5+ messages in thread
From: Georg Bauhaus @ 2011-05-30 13:52 UTC (permalink / raw)

On 28.05.11 15:22, Natasha Kerensikova wrote:

> This is actually the only point I don't like about the design: there
> should be a way to check at compile time that in a Span_Element the
> Renderer_Callback can actually handle what is produced by the
> Lexer_Callback. However I can't find any.

When I know all the types of objects that Lexer_Callback
can produce, I like to list them in an enumeration type,
type Token_Kind is (A, B, C, ...).
 Then, in every object of

type Renderer_Callback is access procedure (
   Token : Token_Root'Class);

I'd use case coverage to make sure that it will handle
Token: the enumeration value is a component of Token,
or else I call a function on Token that returns its
Token_Kind.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Tentative design of markdown parser
  2011-05-30 13:52 ` Georg Bauhaus
@ 2011-05-30 18:07   ` Natasha Kerensikova
  0 siblings, 0 replies; 5+ messages in thread
From: Natasha Kerensikova @ 2011-05-30 18:07 UTC (permalink / raw)

Hello,

On 2011-05-30, Georg Bauhaus <rm.dash-bauhaus@futureapps.de> wrote:
> On 28.05.11 15:22, Natasha Kerensikova wrote:
>
>> This is actually the only point I don't like about the design: there
>> should be a way to check at compile time that in a Span_Element the
>> Renderer_Callback can actually handle what is produced by the
>> Lexer_Callback. However I can't find any.
>
> When I know all the types of objects that Lexer_Callback
> can produce, I like to list them in an enumeration type,
> type Token_Kind is (A, B, C, ...).

Yes, I do like that too, and then using a variant record to store the
token-specific parameters rather than a tagged type.

However, that means knowing statically all kinds of tokens, which
prevents any extension that adds new tokens. That is how my C library
was designed, but then I found out that not all Markdown extensions fit
in the Markdown-strict set of tokens. One common example is tables. So I
ended up adding tokens for tables, which means some of the features are
inside the parser while some others are completely outside, and I really
don't like that.

So with that experience and the emphasis on extensibility from various
people in my previous thread, I was trying to design a parser that would
be extensible both by adding new tokens and new ways of rendering a set
of tokens. It does seem like a much harder problem to solve. If it turns
out impossible, or unreasonably complex, to solve, then I will fall back
on the fixed set of tokens.

Here I used a tagged type for Token only to have what is effectively a
variant record whose discriminant and variations would be extensible
(using the tag instead of a discriminant). It feels a bit like an abuse
of the tagged type system, but I haven't been able to come up with
anything better.

Thanks for your comment,
Natasha

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-05-30 18:07 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-28 13:22 Tentative design of markdown parser Natasha Kerensikova
2011-05-28 14:40 ` Yannick Duchêne (Hibou57)
2011-05-30  8:13   ` Natasha Kerensikova
2011-05-30 13:52 ` Georg Bauhaus
2011-05-30 18:07   ` Natasha Kerensikova

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox