comp.lang.ada
* AdaCore xmlada throws XML_Fatal_Error on <script> tag
@ 2017-09-07  1:10 Stephen Leake
  2017-09-07  6:21 ` briot.emmanuel
  2017-09-07 15:12 ` gautier_niouzes
  0 siblings, 2 replies; 17+ messages in thread
From: Stephen Leake @ 2017-09-07  1:10 UTC (permalink / raw)


I'm trying to write code to use AdaCore xmlada to parse a web page (an album listing on Discogs; https://www.discogs.com/Beth-Nielsen-Chapman-You-Hold-The-Key/release/9938848 for example) to extract information.

For now, I'm manually copying the file to my local disk; I can change that to use http access later.

The file has <script> </script> elements that the xmlada SAX reader can't handle; it tries to parse the content of the element as XML, which of course it is _not_.
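
The failure described above is not specific to xmlada; any conforming XML parser rejects script text that contains a bare "<" or "&". A minimal sketch, using Python's standard library rather than Ada so that it runs without xmlada installed (the sample markup and the helper name parses_as_xml are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: the script body uses "<" and "&&",
# which are markup characters to an XML parser.
RAW = "<html><script>if (a < b && c) { run(); }</script></html>"


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(RAW))  # the script body is not well-formed XML
print(parses_as_xml("<html><script>var x = 1;</script></html>"))
```

A script body with no markup characters happens to parse, which is why the problem only shows up on some pages.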

In my brief study of the manual and the code, I don't see any way to add a hook function that would let me specify how to parse that tag; is there such a feature?

Failing that, I guess I'll have to edit the xmlada code to add support for <script>.

Unless there's another/better xml library for Ada out there?

-- Stephe



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07  1:10 AdaCore xmlada throws XML_Fatal_Error on <script> tag Stephen Leake
@ 2017-09-07  6:21 ` briot.emmanuel
  2017-09-07 19:56   ` Stephen Leake
  2017-09-07 15:12 ` gautier_niouzes
  1 sibling, 1 reply; 17+ messages in thread
From: briot.emmanuel @ 2017-09-07  6:21 UTC (permalink / raw)


> Unless there's another/better xml library for Ada out there?

You are not passing XML as input, so why would you be looking for an XML parser?
HTML is not the same as XML at all; there are quite a number of differences, such as self-closing tags and so on.

If your document is XHTML, then XML/Ada should have no problem accepting it. A link I found on Google, https://developer.mozilla.org/en-US/docs/Archive/Web/Writing_JavaScript_for_HTML, describes exactly the problem you are seeing and shows how the contents of the <script> tag should be formatted to be valid XHTML (and thus parsable by an XML parser like XML/Ada).
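
One common way to make a script body acceptable to an XML parser is to wrap it in a CDATA section, with the delimiters hidden behind JavaScript comments. The sketch below assumes that technique and uses Python's standard XML parser in place of XML/Ada; wrap_script and the sample body are hypothetical names for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical script body containing characters that are illegal in XML text.
SCRIPT_BODY = "if (i <= n && ok) { go(); }"


def wrap_script(body):
    """Wrap a script body in CDATA so an XML parser treats it as plain text."""
    # "]]>" would end the CDATA section early, so this sketch rejects it.
    if "]]>" in body:
        raise ValueError("script body contains the CDATA terminator")
    # The "//" prefixes keep the delimiters inside JavaScript comments.
    return "<script>//<![CDATA[\n" + body + "\n//]]></script>"


root = ET.fromstring("<html>" + wrap_script(SCRIPT_BODY) + "</html>")
print(root.find("script").text)
```

This only helps when you control how the page is written; it does nothing for pages served as ordinary HTML.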



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07  1:10 AdaCore xmlada throws XML_Fatal_Error on <script> tag Stephen Leake
  2017-09-07  6:21 ` briot.emmanuel
@ 2017-09-07 15:12 ` gautier_niouzes
  2017-09-07 20:03   ` Stephen Leake
  2017-09-07 20:05   ` gautier_niouzes
  1 sibling, 2 replies; 17+ messages in thread
From: gautier_niouzes @ 2017-09-07 15:12 UTC (permalink / raw)


Stephen Leake:

> Unless there's another/better xml library for Ada out there?

As Emmanuel pointed out, you need an HTML parser, not an XML parser.
There is an HTML parser (work in progress) in the following project:
https://sourceforge.net/projects/wasabee/
In the source tree, there is a simple command-line "browser":
code/HEAD/tree/zrt_dev/target/text/
With some effort it should be able to parse the Web page you mention.

G.



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07  6:21 ` briot.emmanuel
@ 2017-09-07 19:56   ` Stephen Leake
  2017-09-08 23:31     ` Georg Bauhaus
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen Leake @ 2017-09-07 19:56 UTC (permalink / raw)


On Thursday, September 7, 2017 at 1:21:27 AM UTC-5, briot.e...@gmail.com wrote:
> > Unless there's another/better xml library for Ada out there?
> 
> You are not passing XML as input, so why would you be looking for an XML parser?
> HTML is not the same as XML at all; there are quite a number of differences, such as self-closing tags and so on.

Ah. I was under the impression that HTML was defined by an XML schema. 



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 15:12 ` gautier_niouzes
@ 2017-09-07 20:03   ` Stephen Leake
  2017-09-07 20:05   ` gautier_niouzes
  1 sibling, 0 replies; 17+ messages in thread
From: Stephen Leake @ 2017-09-07 20:03 UTC (permalink / raw)


On Thursday, September 7, 2017 at 10:12:03 AM UTC-5, gautier...@hotmail.com wrote:
> Stephen Leake:
> 
> > Unless there's another/better xml library for Ada out there?
> 
> There is an HTML parser (work in progress) in the following project:
> https://sourceforge.net/projects/wasabee/
> In the source tree, there is a simple command-line "browser"
> code/HEAD/tree/zrt_dev/target/text/
> With some effort it could be able to parse the Web page you mention.

Thanks, I'll give it a try


* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 15:12 ` gautier_niouzes
  2017-09-07 20:03   ` Stephen Leake
@ 2017-09-07 20:05   ` gautier_niouzes
  2017-09-07 20:47     ` Stephen Leake
  1 sibling, 1 reply; 17+ messages in thread
From: gautier_niouzes @ 2017-09-07 20:05 UTC (permalink / raw)


> As Emmanuel pointed out, you need an HTML parser, not an XML parser.

To expand on this topic: a typical random Web page is a mix of the following:
1) HTML (typically, <br> tags)
2) Ill-formed HTML (closing tags that don't close anything, opening tags that are never closed, singleton tags written as closing tags: </br>)
3) XHTML (HTML following XML syntax; e.g. <br />)
At some point, there was a push to impose a clean, well-defined standard (XHTML), but it did not succeed because of a typical phenomenon: browsers need to be compatible with 1) and 2) to be in use; a browser accepting only well-formed XHTML would be ignored by users.
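
The three cases can be compared side by side. The sketch below is in Python, not Ada, and has nothing to do with the Wasabee parser; it feeds one sample of each kind to a strict XML parser and to a lenient HTML tokenizer from the standard library. The sample strings and the TagLogger class are made up for illustration:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# One sample per case: 1) plain HTML, 2) ill-formed HTML, 3) XHTML.
SAMPLES = ["<p>a<br>b</p>", "<p>a</br>b</p>", "<p>a<br />b</p>"]


class TagLogger(HTMLParser):
    """Record start and end tag events instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(text):
    """Tokenize text leniently and return the list of tag events."""
    logger = TagLogger()
    logger.feed(text)
    logger.close()
    return logger.events


def is_well_formed_xml(text):
    """Return True only if a strict XML parser accepts text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


for sample in SAMPLES:
    print(sample, is_well_formed_xml(sample), html_events(sample))
```

Only the third sample is well-formed XML, while the tokenizer reports tag events for all three and leaves it to the caller to decide what a stray </br> means.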



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 20:05   ` gautier_niouzes
@ 2017-09-07 20:47     ` Stephen Leake
  2017-09-07 20:52       ` Stephen Leake
  2017-09-07 21:46       ` gautier_niouzes
  0 siblings, 2 replies; 17+ messages in thread
From: Stephen Leake @ 2017-09-07 20:47 UTC (permalink / raw)


On Thursday, September 7, 2017 at 3:05:08 PM UTC-5, gautier...@hotmail.com wrote:
> > As Emmanuel pointed out, you need an HTML parser, not an XML parser.
> 
> To expand on this topic: a typical random Web page is a mix of the following:
> 1) HTML (typically, <br> tags)
> 2) Ill-formed HTML (closing tags that don't close anything, opening tags that are never closed, singleton tags written as closing tags: </br>)
> 3) XHTML (HTML following XML syntax; e.g. <br />)
> At some point, there was a push to impose a clean, well-defined standard (XHTML), but it did not succeed because of a typical phenomenon: browsers need to be compatible with 1) and 2) to be in use; a browser accepting only well-formed XHTML would be ignored by users.

So I got wasabee_text.adb to compile, and forgetting momentarily that it wants to fetch a URL, not read a local file, I typed "wasabee_text.exe discogs.html", and it said:

...

This document is not an XHTML document

So is this supposed to be an HTML 4/5 parser, or an XHTML parser?

Also, I suggest you add "-gnatwe" to your compiler flags; treat warnings as errors. Most of the warnings I saw were style-based, but they give a bad impression for a project claiming to have "a focus on user safety".



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 20:47     ` Stephen Leake
@ 2017-09-07 20:52       ` Stephen Leake
  2017-09-07 20:54         ` Stephen Leake
  2017-09-07 21:46       ` gautier_niouzes
  1 sibling, 1 reply; 17+ messages in thread
From: Stephen Leake @ 2017-09-07 20:52 UTC (permalink / raw)


On Thursday, September 7, 2017 at 3:47:56 PM UTC-5, Stephen Leake wrote:
> On Thursday, September 7, 2017 at 3:05:08 PM UTC-5, gautier...@hotmail.com wrote:
> > > As Emmanuel pointed out, you need an HTML parser, not an XML parser.
> > 
> > Just developing on this topic: a typical random Web page is a mix of the following
> > 1) HTML (typically, <br> tags)
> > 2) Ill-formed HTML (closing tags that don't close anything, opening tags that are never closed, singleton tags written as closing tags: </br>)
> > 3) XHTML (HTML following XML syntax; e.g. <br />)
> > At some point, there was a will to impose a clean, well-defined standard (XHTML), but it did not succeed because of a typical phenomenon: browsers need to be compatible with 1) & 2) to be in use; a browser accepting only well-formed XHTML would be ignored by users...
> 
> So I got wasabee_text.adb to compile, and forgetting momentarily that it wants to fetch a URL, not read a local file, I typed "wasabee_text.exe discogs.html", and it said:
> 
> ...
> 
> This document is not an XHTML document
> 
> So is this supposed to be an HTML 4/5 parser, or an XHTML parser?

Digging a little deeper, one top level operation is:

   procedure Load_frame(ho: in out HT_object; from: DOM.Core.Node_List);

This creates an HTML object from an XML DOM tree. And the XML DOM tree is created by AdaCore xmlada. So this is not going to do me any good.

Unless you've modified the XML parser in exactly the way I need - I'll assume that and keep digging.


* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 20:52       ` Stephen Leake
@ 2017-09-07 20:54         ` Stephen Leake
  0 siblings, 0 replies; 17+ messages in thread
From: Stephen Leake @ 2017-09-07 20:54 UTC (permalink / raw)


On Thursday, September 7, 2017 at 3:52:39 PM UTC-5, Stephen Leake wrote:
> On Thursday, September 7, 2017 at 3:47:56 PM UTC-5, Stephen Leake wrote:
> > On Thursday, September 7, 2017 at 3:05:08 PM UTC-5, gautier...@hotmail.com wrote:
> > > > As Emmanuel pointed out, you need an HTML parser, not an XML parser.
> > > 
> > > Just developing on this topic: a typical random Web page is a mix of the following
> > > 1) HTML (typically, <br> tags)
> > > 2) Ill-formed HTML (closing tags that don't close anything, opening tags that are never closed, singleton tags written as closing tags: </br>)
> > > 3) XHTML (HTML following XML syntax; e.g. <br />)
> > > At some point, there was a will to impose a clean, well-defined standard (XHTML), but it did not succeed because of a typical phenomenon: browsers need to be compatible with 1) & 2) to be in use; a browser accepting only well-formed XHTML would be ignored by users...
> > 
> > So I got wasabee_text.adb to compile, and forgetting momentarily that it wants to fetch a URL, not read a local file, I typed "wasabee_text.exe discogs.html", and it said:
> > 
> > ...
> > 
> > This document is not an XHTML document
> > 
> > So is this supposed to be an HTML 4/5 parser, or an XHTML parser?
> 
> Digging a little deeper, one top level operation is:
> 
>    procedure Load_frame(ho: in out HT_object; from: DOM.Core.Node_List);
> 
> This creates an HTML object from an XML DOM tree. And the XML DOM tree is created by AdaCore xmlada. So this is not going to do me any good.
> 
> Unless you've modified the XML parser in exactly the way I need - I'll assume that and keep digging.

Nope: 

stephe@Takver4$ /Projects/wasabee-code/target/text/wasabee_text.exe file:///Projects/org.stephe_leake.misc/build/discogs.html
protocole : file
host      : -- localhost - unused --
Port      :  0
Ressource : /Projects/org.stephe_leake.misc/build/discogs.html

Execution terminated by unhandled exception
raised SAX.READERS.XML_FATAL_ERROR : /Projects/org.stephe_leake.misc/build/discogs.html:48:5: Name differ for closing tag (expecting link, opened line 47)
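
The message is what one would expect when an HTML <link> element, which is normally written without a closing tag, is handed to an XML parser. A small sketch of catching that kind of error and reporting its location instead of terminating, in Python with its standard XML parser standing in for xmlada (the sample markup and check_well_formed are hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <link> is never closed, as is usual in HTML.
HEAD = '<head>\n<link rel="stylesheet" href="a.css">\n</head>'


def check_well_formed(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


print(check_well_formed(HEAD))
print(check_well_formed('<head><link rel="stylesheet" href="a.css"/></head>'))
```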


* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 20:47     ` Stephen Leake
  2017-09-07 20:52       ` Stephen Leake
@ 2017-09-07 21:46       ` gautier_niouzes
  2017-09-08  2:12         ` Stephen Leake
  1 sibling, 1 reply; 17+ messages in thread
From: gautier_niouzes @ 2017-09-07 21:46 UTC (permalink / raw)


> So is this supposed to be an HTML 4/5 parser, or an XHTML parser?

There is a trunk and two branches; ignore

  wasabee\trunk
and
  wasabee\fby_dev

and use "my" branch, in which I've replaced the XML parser with my own HTML parser:

  wasabee\zrt_dev



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 21:46       ` gautier_niouzes
@ 2017-09-08  2:12         ` Stephen Leake
  2017-09-08  5:38           ` Stephen Leake
  2017-09-08  5:40           ` gautier_niouzes
  0 siblings, 2 replies; 17+ messages in thread
From: Stephen Leake @ 2017-09-08  2:12 UTC (permalink / raw)


On Thursday, September 7, 2017 at 4:46:25 PM UTC-5, gautier...@hotmail.com wrote:
> > So is this supposed to be an HTML 4/5 parser, or an XHTML parser?
> 
> There is a trunk and two branches; ignore
> 
>   wasabee\trunk
> and
>   wasabee\fby_dev
> 
> and use "my" branch in which I've replaced the XML parser by an own HTML parser:
> 
>   wasabee\zrt_dev

Sigh; sourceforge error. I did click on zrt_dev in the svn browser, but the svn command still pulled down trunk. I'll try again.


* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-08  2:12         ` Stephen Leake
@ 2017-09-08  5:38           ` Stephen Leake
  2017-09-08 16:55             ` Shark8
  2017-09-09 18:07             ` Stephen Leake
  2017-09-08  5:40           ` gautier_niouzes
  1 sibling, 2 replies; 17+ messages in thread
From: Stephen Leake @ 2017-09-08  5:38 UTC (permalink / raw)


On Thursday, September 7, 2017 at 9:12:59 PM UTC-5, Stephen Leake wrote:
> On Thursday, September 7, 2017 at 4:46:25 PM UTC-5, gautier...@hotmail.com wrote:
> > > So is this supposed to be an HTML 4/5 parser, or an XHTML parser?
> > 
> > There is a trunk and two branches; ignore
> > 
> >   wasabee\trunk
> > and
> >   wasabee\fby_dev
> > 
> > and use "my" branch in which I've replaced the XML parser by an own HTML parser:
> > 
> >   wasabee\zrt_dev
> 
> Sigh; sourceforge error. I did click on zrt_dev in the svn browser, but the svn command still pulled down trunk. I'll try again.

Much better; it puts out lots of CSS stuff, then some recognizable text, then a Constraint_error in wasabee_text.adb Text_at. And all the compile-time warnings are gone.

Thanks for this; I was steeling myself to use jsoup, and not looking forward to it.

On the other hand, there are no queries on the HT_object; I'll have to add some. I guess there's no chance of forcing the HTML into an XML DOM? I'll try to model the queries after DOM queries.
-- Stephe



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-08  2:12         ` Stephen Leake
  2017-09-08  5:38           ` Stephen Leake
@ 2017-09-08  5:40           ` gautier_niouzes
  1 sibling, 0 replies; 17+ messages in thread
From: gautier_niouzes @ 2017-09-08  5:40 UTC (permalink / raw)


> Sigh; sourceforge error. I did click on zrt_dev in the svn browser, but the svn command still pulled down trunk. I'll try again.

Ouch. OK, to make things easier, I've put a Zip file with sources in the "Files" part, named "Wasa_branch_Zrt_Dev_ver_269_date_2017-09-08.zip".


* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-08  5:38           ` Stephen Leake
@ 2017-09-08 16:55             ` Shark8
  2017-09-09 18:07             ` Stephen Leake
  1 sibling, 0 replies; 17+ messages in thread
From: Shark8 @ 2017-09-08 16:55 UTC (permalink / raw)


On Thursday, September 7, 2017 at 11:38:02 PM UTC-6, Stephen Leake wrote:
> 
> On the other hand, there are no queries on the HT_object; I'll have to add some. I guess there's no chance of forcing the HTML into an XML DOM? I'll try to model the queries after DOM queries.

XML and HTML are kind of cousins; perhaps what would be more appropriate is a root "SGML DOM". If there were an SGML implementation, then XML, XHTML, and HTML (<5) could all be instantiations of the SGML parser.

Having that "root DOM" would probably come in useful for providing a consistent interface/object for scripts, VMs, and plugins to operate on. (Meaning that we could have, say, an AdaScript extension which operates on all of HTML, XHTML, XML, OED, etc.)

Thoughts?



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-07 19:56   ` Stephen Leake
@ 2017-09-08 23:31     ` Georg Bauhaus
  0 siblings, 0 replies; 17+ messages in thread
From: Georg Bauhaus @ 2017-09-08 23:31 UTC (permalink / raw)


Stephen Leake <stephen_leake@stephe-leake.org> wrote:
> On Thursday, September 7, 2017 at 1:21:27 AM UTC-5, briot.e...@gmail.com wrote:
>>> Unless there's another/better xml library for Ada out there?
>> 
>> You are not passing XML as input, so why would you be looking for an XML parser?
>> HTML is not the same as XML at all; there are quite a number of
>> differences, such as self-closing tags and so on.
> 
> Ah. I was under the impression that HTML was defined by an XML schema. 
> 

HTML Tidy (the program so named) used to be
an excellent preprocessor. It should produce
XML output of any HTML page, minus any
dynamically created tree fragments. Linking
a framework like Firefox's JavaScript VM
can assist there if necessary.
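
A sketch of the preprocessing step, assuming the tidy executable is installed and on the PATH. The file names and the helper functions are made up; the options used are Tidy's -asxhtml (emit well-formed XHTML), -numeric (numeric rather than named entities, which a plain XML parser does not know), -quiet and -output:

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build the Tidy command line that converts src (HTML) to dst (XHTML)."""
    return ["tidy", "-quiet", "-asxhtml", "-numeric", "-output", dst, src]


def run_tidy(src, dst):
    """Run Tidy; return True unless it reported errors (exit status 2)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("HTML Tidy is not installed or not on PATH")
    result = subprocess.run(tidy_command(src, dst))
    # Tidy exits with 0 on success, 1 for warnings only, 2 for errors.
    return result.returncode < 2


print(" ".join(tidy_command("discogs.html", "discogs.xhtml")))
```

The resulting file could then be given to an XML parser; whether the output for a particular page is good enough has to be checked case by case.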

sgrep is another tool when exploring.



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-08  5:38           ` Stephen Leake
  2017-09-08 16:55             ` Shark8
@ 2017-09-09 18:07             ` Stephen Leake
  2017-09-09 19:35               ` Simon Wright
  1 sibling, 1 reply; 17+ messages in thread
From: Stephen Leake @ 2017-09-09 18:07 UTC (permalink / raw)


On Friday, September 8, 2017 at 12:38:02 AM UTC-5, Stephen Leake wrote:
> On the other hand, there are no queries on the HT_object; I'll have to add some. I guess there's no chance of forcing the HTML into an XML DOM? I'll try to model the queries after DOM queries.
> -- Stephe

I got my data grabber working. The only "query" I implemented is a cursor to walk the tree; first_child, next_sibling. And I made HTML_kind visible.
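
A cursor of that kind needs only two links per node. The following is a generic sketch in Python, not the Wasabee code: the Node class, its field names and the sample labels are assumptions chosen to mirror first_child and next_sibling.

```python
class Node:
    """Tree node that stores only first-child and next-sibling links."""

    def __init__(self, label):
        self.label = label
        self.first_child = None
        self.next_sibling = None
        self._last_child = None  # bookkeeping so append() stays O(1)

    def append(self, child):
        """Add child as the last child of this node and return it."""
        if self.first_child is None:
            self.first_child = child
        else:
            self._last_child.next_sibling = child
        self._last_child = child
        return child


def walk(node, depth=0):
    """Yield (depth, label) for node and its descendants in document order."""
    yield depth, node.label
    child = node.first_child
    while child is not None:
        yield from walk(child, depth + 1)
        child = child.next_sibling


# Hypothetical fragment of an album page.
root = Node("html")
body = root.append(Node("body"))
body.append(Node("artist"))
body.append(Node("title"))

for depth, label in walk(root):
    print("  " * depth + label)
```

Filtering the walk by node kind is then enough for simple data extraction, without a full DOM query interface.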

I had to improve the parsing of <script> tags; it got confused by "<=".
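
The usual way around this is to treat the body of a script element as raw text: once the opening tag has been seen, nothing is markup until a literal "</script" appears. A simplified sketch of that idea in Python (not the actual Wasabee fix; the regular expression, split_scripts and the sample page are illustrative, and the real HTML5 rules cover more corner cases):

```python
import re

# Match an opening script tag, then the shortest body up to "</script>".
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def split_scripts(html):
    """Return (markup with script bodies removed, list of script bodies)."""
    bodies = []

    def keep_tags_only(match):
        bodies.append(match.group(1))
        # Attributes of the script tag are dropped in this sketch.
        return "<script></script>"

    return SCRIPT_RE.sub(keep_tags_only, html), bodies


# Hypothetical page: the script contains "<=" and a quoted tag.
PAGE = ('<p>x</p><script type="text/javascript">'
        'for (i = 0; i <= n; i++) { a[i] = "<b>"; }'
        '</script><p>y</p>')

markup, scripts = split_scripts(PAGE)
print(markup)
print(scripts)
```

Characters such as "<=" or a quoted "<b>" inside the script never reach the tag-matching code.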

I'll send you a diff, after I clean up the accidental whitespace changes.

-- Stephe



* Re: AdaCore xmlada throws XML_Fatal_Error on <script> tag
  2017-09-09 18:07             ` Stephen Leake
@ 2017-09-09 19:35               ` Simon Wright
  0 siblings, 0 replies; 17+ messages in thread
From: Simon Wright @ 2017-09-09 19:35 UTC (permalink / raw)


Stephen Leake <stephen_leake@stephe-leake.org> writes:

> I'll send you a diff, after I clean up the accidental whitespace
> changes.

I always have to remember to diff -b (too lazy to say --ignore-space-change!)


