The page http://web.uvic.ca/calendar2012/CDs/ENGL/CTs.html declares itself as HTML 4.01 Transitional and is not well-formed XML. I opened it as a plain XML file in Oxygen and found:
- a bunch of elements that should be self-closing (br, img and the like) but have neither a close tag nor a "/>";
- a bunch of upper-case element and attribute names;
- a couple of nested lists missing the li elements that should contain them.
So we'd have to do maybe a dozen regexp string manipulations to take the raw file and turn it into well-formed XML source; but those same regexps could probably be used to create the output XML directly instead.
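The clean-up passes might look something like this rough Python sketch. Everything here is illustrative: the exact patterns, and which void elements actually occur in the calendar pages, are guesses rather than taken from the real source.

```python
import re

def to_well_formed(html):
    """Rough sketch of the regexp clean-up described above.
    Patterns are illustrative and would need tuning against the real pages."""
    # Lower-case element names: <UL> -> <ul>, </LI> -> </li>
    html = re.sub(r'<(/?)(\w+)',
                  lambda m: '<' + m.group(1) + m.group(2).lower(), html)
    # Lower-case attribute names: CLASS="x" -> class="x"
    html = re.sub(r'\s([A-Za-z-]+)=',
                  lambda m: ' ' + m.group(1).lower() + '=', html)
    # Self-close the void elements: <br> -> <br/>, <img src="x"> -> <img src="x"/>
    html = re.sub(r'<(br|hr|img|meta|link)(\s[^>]*)?>',
                  lambda m: '<%s%s/>' % (m.group(1), m.group(2) or ''), html)
    # Wrap a nested <ul> in the <li> element it's missing
    html = re.sub(r'(</li>)\s*(<ul>)', r'\1<li>\2', html)
    html = re.sub(r'(</ul>)\s*(</ul>)', r'\1</li>\2', html)
    return html
```

Half a dozen passes along these lines would cover the issues listed; the same substitutions could just as easily emit the output vocabulary instead of repaired XHTML.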
The page http://web.uvic.ca/calendar2012/CDs/ENGL/101.html also declares itself as HTML 4.01 Transitional and is even further from well-formed XML. We need only a pretty small and simple snippet from the page, so again we'd probably have to take it in as text. Here the regexps needed to turn it into a usable XML source would be more numerous, and more complex, than the regexps needed to create the desired output XML.
The page https://www.uvic.ca/BAN2P/bwckctlg.p_disp_listcrse?term_in=201301&subj_in=ENGL&crse_in=101&schd_in= also declares itself as HTML 4.01 Transitional. It's a long way from XML too, so again I figure we'll likely be treating it as text rather than as an XML structure.
The pattern is pretty clear. Right now, it looks like we should take in the files as text rather than XML, and run regexps to generate the output XML we want. I don't think it's viable to take in the calendar files as XML, and I don't see sufficient value in taking the raw HTML as text, using regexps to render an XML (or XHTML) version of the input structure, and then doing XSLT on that XML to generate the output XML.
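As a sketch of that text-in, XML-out approach: the heading pattern, the output element names, and the attribute layout below are all invented for illustration, since the real pages would dictate their own.

```python
import re
from xml.sax.saxutils import escape

def course_to_xml(raw):
    """Hypothetical one-step pass: find a course heading in the raw HTML
    text and emit the output XML directly, with no intermediate
    well-formed version of the input. Pattern and output names invented."""
    m = re.search(r'<H2>\s*ENGL\s*(\d+)\s*[-:]\s*([^<]+?)\s*</H2>', raw, re.I)
    if m is None:
        return None
    # Escape the title so characters like & stay legal in the output XML
    num, title = m.group(1), escape(m.group(2))
    return '<course code="ENGL %s"><title>%s</title></course>' % (num, title)
```

A handful of extraction patterns like this one, each writing its fragment of the output vocabulary, is the whole pipeline; there's no XSLT stage to maintain.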