etcl : write code to scrape DevMS wikibook and generate circleMagic output
Posted by sarneil on 26 Apr 2012 in Activity log
Wrote a PHP scraper page which, when opened in a browser:
- queries the number of records in the Devonshire Manuscript wikibooks project to be processed and displays that number
- scrapes each of the records in the Devonshire Manuscript project on wikibooks
- generates, for each record, an XML file constructed to work with the circleMagic player
- generates, for each record, an htm file that calls the player with the appropriate XML data file
The XML is idiosyncratic and based on examples provided with the circleMagic code.
CircleMagic can't handle an XML data file with more than 7 "source" elements (which in this implementation are used to identify the contributors to that page). I added PHP code which comments out every source element after the seventh in any XML file, and displays a warning both to the user running the scraper and on the generated HTML page that displays the circleMagic player.
CircleMagic's processing from the XML structure to the circular GUI is also idiosyncratic, but I've posted on that previously.
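The commenting-out step could look roughly like the sketch below. It is written in Python rather than the PHP actually used, and it is only an illustration: the function name `render_sources` and the `name` attribute are made up for the example, since the real circleMagic XML layout isn't reproduced in this post. Only the "source" element name and the limit of 7 come from the notes above.

```python
import re
from xml.sax.saxutils import quoteattr

MAX_SOURCES = 7  # limit observed with circleMagic, per the notes above


def render_sources(contributors, limit=MAX_SOURCES):
    """Return (xml_fragment, warning).

    The first `limit` contributors become live <source> elements; the rest
    are emitted inside XML comments so they stay visible in the file but
    are ignored by the player. `warning` is None when nothing was cut.
    The `name` attribute is a hypothetical stand-in for the real layout.
    """
    lines = []
    for index, name in enumerate(contributors):
        element = "<source name=%s/>" % quoteattr(name)
        if index < limit:
            lines.append(element)
        else:
            # "--" is not allowed inside an XML comment, so break up
            # any run of hyphens before wrapping the element.
            safe = re.sub(r"-(?=-)", "- ", element)
            lines.append("<!-- %s -->" % safe)

    warning = None
    if len(contributors) > limit:
        warning = "%d of %d source elements commented out (circleMagic limit is %d)" % (
            len(contributors) - limit, len(contributors), limit)
    return "\n".join(lines), warning
```

The returned warning string can be shown both on the scraper page and on the generated player page, so the cut is never silent.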
Other constraints that the wikibooks API may eventually impose:
- returns a maximum of 500 hits to the query asking for all the pages in the DevMS collection
- returns a maximum of 500 hits to the query asking for the number of revisions to a page. The most revisions on any page so far is about 200, so it will be a while before that limit is reached.
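If either limit is reached, the usual way around it is to follow the API's continuation values and keep asking until none are returned. Below is a minimal Python sketch of that loop, not the PHP in the scraper. It assumes the `continue` block that current MediaWiki versions return (older versions used `query-continue` instead), and the helper names `http_get_json`, `query_all` and `list_pages`, the User-Agent string, and the page prefix in the usage note are all assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikibooks.org/w/api.php"


def http_get_json(params):
    """Issue one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "devms-scraper-sketch/0.1"})  # hypothetical UA
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def query_all(params, fetch=http_get_json):
    """Yield the "query" part of every batch, following continuation.

    Each reply that was cut off at the limit carries a "continue" dict;
    merging it into the next request asks for the following batch.
    """
    base = dict(params, action="query", format="json")
    cont = {}
    while True:
        data = fetch({**base, **cont})
        yield data.get("query", {})
        if "continue" not in data:
            break
        cont = data["continue"]


def list_pages(prefix, fetch=http_get_json):
    """Collect the titles of all pages whose title starts with `prefix`."""
    titles = []
    request = {"list": "allpages", "apprefix": prefix, "aplimit": "max"}
    for batch in query_all(request, fetch):
        titles.extend(page["title"] for page in batch.get("allpages", []))
    return titles
```

Calling `list_pages("The Devonshire Manuscript")` (the prefix is a guess at how the collection's pages are titled) would then return every matching title, however many batches of 500 that takes. The same `query_all` loop should work for revisions with `prop=revisions` and `rvlimit`, though the revision entries sit under `pages` in each batch rather than in a flat list.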