From: Natasha Kerensikova
Newsgroups: comp.lang.ada
Subject: Tentative design of markdown parser
Date: Sat, 28 May 2011 13:22:17 +0000 (UTC)

Hello,

first I would like to thank everybody who contributed to my earlier
"Parser interface design" thread. I learned a lot, both from what was
written there and from the research it prompted. I had no idea about
double dispatch, the visitor pattern, type extension (I used to code
as flexibly as possible, but with only "extension through composition"
in mind, like the UNIX text utilities), etc. Based on all this and a
lot of thought, I came up with a design draft, and would love to know
what you think about it.
Let's consider the following types:

   type Token_Root is abstract tagged null record;

   type Renderer_Callback is access procedure
     (Token : Token_Root'Class);

   type Lexer_Callback is access procedure
     (Source   : String;
      Position : in out Positive;
      Renderer : Renderer_Callback);

   type Element_Priority is delta 2.0**(-15) range 0.0 .. 1.0;

   type Span_Element is record
      Active_Char : Character;
      Priority    : Element_Priority;
      Lexer       : Lexer_Callback;
      Renderer    : Renderer_Callback;
   end record;

   type Span_Element_List is array (Positive range <>) of Span_Element;

Note: I'm not sure whether I'm using terms like "lexer" or "parser"
correctly, and Markdown (and probably other lightweight markup
languages as well) doesn't require such fine distinctions, which can
therefore be more burden than help. It's not a full-fledged
programming language, only a text-based replacement scheme that was
first implemented with Perl substitutions...

The general idea is to have a parser procedure, which takes as input a
String containing the Markdown text, and a Span_Element_List that
contains all the callbacks (in a real version there would also be a
Block_Element_List, very similar to span-level elements except that
there is no Active_Char, and probably other stuff as well -- this is
still a draft).

The parser procedure scans the input for active characters. A
consecutive sequence of inactive characters is considered to be normal
text. Upon encountering an active character, it goes through the
Span_Element items with the relevant Active_Char, in decreasing
priority order. Span_Element.Lexer is called with the complete
Markdown string as first parameter (so that context checking is
unrestricted), along with the position of the relevant active
character and the Renderer taken from the Span_Element. If the context
is right and the element recognized by that particular Lexer is indeed
present, it creates a Token (extended from Token_Root), calls Renderer
with it, and updates Position before returning.
If there is actually no such token, the lexer returns leaving Position
unchanged, which makes the parser try the next Span_Element.

Here are the design choices and rationales:

It's a parser-controlled design, chosen because the recursive nature
of Markdown means there will be a context stack at some point, and it
seems easier and more reliable to use procedure-local variables and
the call stack rather than implementing my own stack.

I haven't found any satisfactory solution to the double dispatch
problem, so I gave up on it completely and worked around it by
providing the client only with bricks and letting it assemble the
dispatch table itself. This is further motivated by the fact that even
though double dispatch can be grafted onto Ada OO, there is still the
issue of the single package hierarchy. For example, considering an
extension to render into PDF files, and another unrelated extension to
support definition lists in the input language, where would the code
that writes definition lists into PDF live? I decided to leave it
entirely to the client. I don't remember where, but I remember having
seen several other examples where Ada clients are provided with bricks
to assemble rather than a ready-to-go solution.

The Active_Char idea comes from the fact that most input characters
are normal text, and calling a callback for each of them would be very
inefficient. On the other hand, there is no equivalent for block-level
syntax, but blocks are much less numerous than characters.

The Priority is there to allow several callbacks for a single active
character. For example, in the official Markdown syntax, *foo* is
turned into <em>foo</em> while **foo** is turned into
<strong>foo</strong>. That translates to both EM and STRONG having '*'
as their active character, but with STRONG at a higher priority, so
that **foo** is not parsed as a mere emphasis starting with a '*'.
Priority is chosen as a fixed-point number between 0 and 1 to make it
easy to insert a new Span_Element between two existing ones, while
keeping the advantages of integers.
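To make the control flow above concrete, here is a minimal sketch of
the scanning loop, using the draft's types. The name Parse, the
Emit_Text procedure (standing for whatever handles normal text), and
the assumption that Elements is already sorted by decreasing Priority
are my own; a real version would batch consecutive inactive characters
instead of emitting them one at a time.

```ada
--  Sketch only: assumes the draft's Span_Element declarations and a
--  hypothetical Emit_Text (Text : String) are visible, and that
--  Elements is sorted by decreasing Priority.
procedure Parse (Source : String; Elements : Span_Element_List) is
   Position : Positive := Source'First;
begin
   while Position <= Source'Last loop
      declare
         Before : constant Positive := Position;
      begin
         for I in Elements'Range loop
            if Elements (I).Active_Char = Source (Position) then
               --  A lexer signals a match by advancing Position;
               --  leaving it unchanged means "not my token", so the
               --  next element with the same active character is tried.
               Elements (I).Lexer
                 (Source, Position, Elements (I).Renderer);
               exit when Position /= Before;
            end if;
         end loop;
         if Position = Before then
            --  Inactive character, or no lexer recognized a token.
            Emit_Text (Source (Position .. Position));
            Position := Position + 1;
         end if;
      end;
   end loop;
end Parse;
```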
However, I'm not sure the particular fixed-point definition I chose is
that good.

Lexer_Callback has an in out Position parameter to inform the parser
of the amount of input text consumed by the token. Moreover, leaving
Position untouched allows the lexer to have a stricter definition of
what constitutes a token than just anything starting with the active
character. For example, in Markdown an opening '*' for emphasis must
be followed by a non-blank character, to prevent e.g. a multiplication
from being considered an open-emphasis marker.

The Token_Root'Class object created by Lexer_Callback contains all the
information extracted from the input. For example (using
Unbounded_String, since an unconstrained String component is not legal
in a record):

   type Token_Emphasis is new Token_Root with record
      Text : Unbounded_String;
   end record;

   type Token_Header is new Token_Root with record
      Level : Positive;
      Text  : Unbounded_String;
   end record;

   type Token_Ruler is new Token_Root with null record;

Renderer_Callback would then convert the Token it gets into the
Token_Root extension it is able to render. This allows ensuring
(through a runtime check) that the Renderer_Callback actually matches
the Lexer_Callback it has been paired with (by the client). This is
actually the only point I don't like about the design: there should be
a way to check at compile time that, in a Span_Element, the
Renderer_Callback can actually handle what is produced by the
Lexer_Callback. However, I can't find any.

Moreover, the output method is still completely absent from the
design. Going for functions that return Strings? Or using an in out
Unbounded_String parameter? It might be premature optimization, but
this sounds quite inefficient in terms of memory allocation, and it
misses out on the information that the output text is usually of the
same order of magnitude as the input. Maybe I should go for yet
another kind of element that can be plugged in, to allow output as a
String, or into a preallocated dynamic buffer, or into a Stream, or
any other output possibility that hasn't come to my mind yet.
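As an illustration of the pairing and of the runtime check, here is a
hypothetical lexer/renderer pair for emphasis, following the profiles
above. Lex_Emphasis, Render_Emphasis and the <em> output are my own
inventions, and I assume Token_Emphasis stores its Text as an
Unbounded_String (an unconstrained String component not being legal in
a record); the declarations would of course live in a package, not
stand alone as shown.

```ada
with Ada.Text_IO;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

--  Hypothetical lexer: recognizes *text*, rejecting an opening '*'
--  followed by a blank, as described above.
procedure Lex_Emphasis
  (Source   : String;
   Position : in out Positive;
   Renderer : Renderer_Callback) is
begin
   if Position >= Source'Last or else Source (Position + 1) = ' ' then
      return;  --  Position unchanged: parser tries the next element
   end if;
   for Closing in Position + 2 .. Source'Last loop
      if Source (Closing) = '*' then
         Renderer
           (Token_Emphasis'(Token_Root with
              Text => To_Unbounded_String
                        (Source (Position + 1 .. Closing - 1))));
         Position := Closing + 1;  --  consume through the closing '*'
         return;
      end if;
   end loop;
end Lex_Emphasis;

--  Hypothetical matching renderer.
procedure Render_Emphasis (Token : Token_Root'Class) is
   --  The conversion raises Constraint_Error at run time if the
   --  client paired this renderer with a lexer that produces some
   --  other Token_Root extension.
   T : constant Token_Emphasis := Token_Emphasis (Token);
begin
   Ada.Text_IO.Put ("<em>" & To_String (T.Text) & "</em>");
end Render_Emphasis;
```

The client would then register the pair in its table as something like
(Active_Char => '*', Priority => 0.5, Lexer => Lex_Emphasis'Access,
Renderer => Render_Emphasis'Access), with a STRONG entry for the same
active character at a higher priority.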
I can't manage to estimate the consequences, in terms of code
complexity, maintainability, readability, flexibility and maybe
performance, of such a new (probably tagged) type.

Thanks in advance for your help, comments and remarks,
Natasha