Working on text-analysis
Posted by mholmes on 01 May 2008 in Activity log
To run some of the text-analysis operations we've been planning on the database, we need to be able to grab text-only output from the documents and to apply various types of filter to that output. I've been working on that today. These are the main points so far:
- I've written an XQuery file called text.xq which does the extraction, based on input parameters (see below), and returns plain text. The file has this custom eXist option declared in it:
declare option exist:serialize "method=text media-type=text/text encoding=utf-8";
That means I can avoid any transformations at all, which is handy.
- I've set up a page here which presents a simple form from which you can choose the options you want.
- These are the parameters:
- startYear
- endYear
- textProse
- textVerse
- imageProse
- imageVerse
As usual, XQuery hacking took a long time, but it appears to be working well now, and it's faster than I expected. The user will now be able to get the text in the browser, and feed it as-is into an application, or supply the URL of the filtered text query to an online service such as the TAPoR Tools.
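For anyone wanting to build the filtered-text URL by hand rather than through the form, here is a rough sketch in Python. The parameter names are the ones listed above; everything else is an assumption made for illustration: the base URL is a placeholder (the real location of text.xq isn't given here), and I'm guessing that the four prose/verse parameters are on/off switches passed as "true" or "false".

```python
from urllib.parse import urlencode

# Placeholder address: the real location of text.xq is not stated in this post.
BASE_URL = "http://example.org/db/text.xq"


def build_text_query_url(start_year, end_year,
                         text_prose=True, text_verse=True,
                         image_prose=False, image_verse=False,
                         base_url=BASE_URL):
    """Return a URL for text.xq with the six filter parameters attached.

    Assumption: the four prose/verse filters are sent as the strings
    "true" or "false". The query itself may expect different values.
    """
    def flag(value):
        return "true" if value else "false"

    params = {
        "startYear": start_year,
        "endYear": end_year,
        "textProse": flag(text_prose),
        "textVerse": flag(text_verse),
        "imageProse": flag(image_prose),
        "imageVerse": flag(image_verse),
    }
    # urlencode keeps the dict's insertion order and escapes the values.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Example: prose text only, for a hypothetical ten-year range.
    print(build_text_query_url(1860, 1870, text_verse=False))
```

The resulting URL could then be pasted into the browser or handed to an online tool that accepts a text URL, which is the use case described above.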
This entry was posted by Martin and filed under Activity log.