ETCL: meeting on Wikimedia scraping research
Met with CC on four topics:
1) update the ETCL website and possibly the CMS behind it
2) scrape user activity on the Devonshire MS on Wikimedia and present it visually to researchers
3) XSLT consultation on transforming TEI XML to wiki markup
4) backup of computers and management of shared administrative files
On the ETCL website:
The site is currently implemented in a WordPress environment (I don't know where that's running). You said you wanted the new site to also be implemented in some kind of Content Management System. The CMS to use would be the one provided by the university (Cascade) unless there's some exceptional reason to use something else. For example, if the site depends on a bunch of special plugins (especially any written specifically for the site), that might justify staying with WordPress, but then you'd have to test those plugins against the new version of the WordPress environment.
The Cascade site provides standard templates, but I think it is possible to create your own look and feel (though the central branding people may have something to say about that).
On the Installation and Setup of Wikimedia scraping tools and Social Tracking software:
As you pointed out, this has not yet been clearly specified.
What technology to use, what kinds of information we want returned (e.g. for further processing), and how that is ultimately reported/visualized/represented to the end user all depend on the research question(s) or hypothesis(es) you want to investigate. Those also serve to guide and focus the research, and provide indicators of successful completion of the work (i.e. the question has been answered one way or another). You said that the intended audience for the output was academic researchers rather than the users generating the data, and that the purpose was to track user actions (edits) on specified articles and participation in the discussions attached to those articles. But we didn't come up with a clear answer to the question: what features, patterns, etc. does the audience for the output wish to see represented? If this is purely exploratory, then the constraining variable for HCMC staff will be our time; once we've used up the number of hours allocated, that's it for us. Otherwise it becomes an open-ended commitment, and that can't be reconciled with the other demands on our time.
dumpBackup.php is a maintenance script that can only be run on your own instance of MediaWiki (i.e. one on which you have system-administrator privileges). If your material is hosted on somebody else's instance, then you are not able to run it, so I don't think it's an option for the Devonshire Manuscript on Wikimedia.
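For context, it's a command-line script: on a wiki we hosted ourselves it would be run on the server itself, something like the sketch below (the install path is hypothetical), which is exactly the kind of access we don't have here.

    # Sketch: running MediaWiki's dumpBackup.php on a self-hosted wiki,
    # driven from Python. The install path is hypothetical, and this
    # requires shell access on the server running the wiki.
    import subprocess

    MEDIAWIKI_DIR = "/var/www/mediawiki"   # hypothetical install location

    # dumpBackup.php writes an XML dump of the wiki to stdout;
    # --full asks for every revision of every page.
    with open("wiki-dump.xml", "wb") as out:
        subprocess.run(
            ["php", MEDIAWIKI_DIR + "/maintenance/dumpBackup.php", "--full"],
            stdout=out,
            check=True,
        )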
I took a quick look at wikistream and it appears to be a fairly simple log-like view of all edits to Wikipedia. As you said, it provides a who/what/when record. Presumably an instance of it could be configured to watch only certain articles, but I haven't checked. I'm not sure whether it can do more than temporarily display the edit events, but presumably (again, I haven't checked) something could be written that captures the same information using whatever API wikistream uses and writes it out to a permanent text file. Alternatively, something could be run periodically that goes back, looks for all the modifications made between two specified times, and generates a wikistream-style record for each. If we stick to the wikistream model, we're limited to whatever that API provides.
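As a rough illustration of that second option, here is a minimal, untested sketch that pulls edits between two times through the standard MediaWiki web API and appends a who/what/when line per edit to a text file. The endpoint, the watched page title, and the timestamps are placeholders, not things we've confirmed for this project.

    # Minimal sketch: pull a who/what/when record of edits between two times
    # via the MediaWiki web API (list=recentchanges) and append to a log file.
    # Note: the recentchanges feed only goes back a limited window (typically
    # about 30 days); older activity would have to come from per-page
    # revision history (see the next sketch).
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikibooks.org/w/api.php"      # assumed endpoint
    WATCHED = {"The Devonshire Manuscript"}         # hypothetical page title(s)

    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|timestamp|comment|ids",
        "rcstart": "2012-06-30T00:00:00Z",          # later bound (list runs newest-first)
        "rcend": "2012-06-01T00:00:00Z",            # earlier bound
        "rclimit": "500",
        "format": "json",
    }
    url = API + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "hcmc-scrape-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    # Filter to the watched pages client-side and log who/what/when per edit.
    with open("edit-log.txt", "a", encoding="utf-8") as log:
        for rc in data["query"]["recentchanges"]:
            if rc["title"] in WATCHED:
                line = "\t".join([rc["timestamp"], rc["user"], rc["title"], rc.get("comment", "")])
                log.write(line + "\n")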
Each MediaWiki page has a history tab with extensive details on modifications to the page. I haven't looked yet, but I'm guessing there is some kind of API behind that which would allow you to query specified pages and get back whacks of data on them (the biggest problem might be structuring that data usefully); obviously they use something internally.
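In fact the same web API exposes per-page history directly (prop=revisions), which is presumably what backs the history tab. A minimal sketch, again with an assumed endpoint and a hypothetical page title, that prints one line per revision and follows the API's continuation tokens so the whole history comes back:

    # Minimal sketch: fetch the full revision history of one named page via
    # the MediaWiki web API (prop=revisions), one batch at a time.
    # Endpoint and page title are assumptions.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikibooks.org/w/api.php"      # assumed endpoint
    TITLE = "The Devonshire Manuscript"             # hypothetical page title

    def fetch(params):
        url = API + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers={"User-Agent": "hcmc-history-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    params = {
        "action": "query",
        "prop": "revisions",
        "titles": TITLE,
        "rvprop": "ids|timestamp|user|comment|size",
        "rvlimit": "500",
        "continue": "",          # opt in to the API's continuation mechanism
        "format": "json",
    }

    while True:
        data = fetch(params)
        for page in data["query"]["pages"].values():
            for rev in page.get("revisions", []):
                print(rev["timestamp"], rev.get("user", ""), rev.get("size", ""), rev.get("comment", ""))
        if "continue" not in data:
            break
        params.update(data["continue"])   # follow the continuation token to the next batch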
"Magic Circle" is currently a black box - you know what comes out, but nobody seems to know what goes in. I haven't looked into this webapp yet, but we'll certainly need some kind of API or technical docs on what it accepts as input and what kinds of processes it can run on that input.
Whatever code is written to query Wikimedia and process the results has to live somewhere, and do so in a way that doesn't put the server at risk or unduly deny service to other processes running on the same server. That is, there may be security and resource-consumption issues that need to be tested to the sysadmin's satisfaction.
On XSLT Consultation:
I don't see any problems with consulting on XSLT to transform TEI XML into wiki markup. That's an obvious thing to come to us with, especially given Martin's expertise.
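For what it's worth, once such a stylesheet exists, applying it is straightforward; here is a minimal sketch using Python's lxml, where the stylesheet and file names are hypothetical placeholders.

    # Minimal sketch: apply a TEI-to-wiki-markup XSLT stylesheet with lxml.
    # The stylesheet and input/output file names are hypothetical.
    from lxml import etree

    transform = etree.XSLT(etree.parse("tei_to_wikimarkup.xsl"))
    tei_doc = etree.parse("devonshire_sample_tei.xml")
    result = transform(tei_doc)

    # Assuming the stylesheet uses <xsl:output method="text"/>, the result
    # is plain wiki markup ready to paste or push into a wiki page.
    with open("devonshire_sample.wiki", "w", encoding="utf-8") as out:
        out.write(str(result))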
On Backup:
The computers in the ETCL space each used to have an account with the university's backup service (http://www.uvic.ca/systems/services/storagebackup/computerbackup/index.php). I think the setup was documented by Michael Joyce.
If you've already committed to Time Machine, you probably know more about how to set it up and run it than we do, but ask us any questions you may have.
On Management of Shared Files:
You said you wanted a single copy of a document to which certain specified people have read access and certain specified people have write access, but with only one person at a time able to write (something SVN can enforce with its file-locking feature). I think an SVN repository would be the solution for you; we've got an SVN service running which would probably be adequate for your needs. Do you imagine most people wanting a local copy of the shared file? As far as I know, the SVN system does not push the modified file out to the various users; it's up to each user to go and get the current version of the file. Greg would be the one to ask.