Working on text-analysis
Posted by mholmes on 01 May 2008 in Activity log
To run some of the text-analysis operations we've been planning on the database, we need to be able to grab text-only output from the documents and apply various types of filter to that output. I've been working on that today. These are the main points so far:
- I've written an XQuery file called text.xq which does the extraction, based on input parameters (see below), and returns plain text. The file has this custom eXist option declared in it:
declare option exist:serialize "method=text media-type=text/text encoding=utf-8";
That means I can avoid any transformations at all, which is handy.
- I've set up a page here which presents a simple form from which you can choose the options you want.
- These are the parameters:
  - startYear
  - endYear
  - textProse
  - textVerse
  - imageProse
  - imageVerse
As usual, XQuery hacking took a long time, but it appears to be working well now, and it's faster than I expected. The user will now be able to get the text in the browser, and feed it as-is into an application, or supply the URL of the filtered text query to an online service such as the TAPoR Tools.
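As a rough illustration of the "supply the URL" route, here is a minimal Python sketch that assembles a query URL from the six parameters listed above. Several things in it are assumptions rather than anything stated in this post: the base URL is a placeholder (the real location of text.xq isn't given here), the four prose/verse filters are guessed to be on/off switches sent as "true"/"false" strings, and the example years are arbitrary. The function name is hypothetical too.

```python
from urllib.parse import urlencode

# Placeholder only: the real address of text.xq is not given in this post.
BASE_URL = "http://example.org/text.xq"


def build_text_query_url(start_year, end_year,
                         text_prose=True, text_verse=True,
                         image_prose=False, image_verse=False,
                         base_url=BASE_URL):
    """Build a GET URL for text.xq from the six filter parameters.

    Assumption: the four prose/verse filters are simple on/off switches
    passed as the strings "true" or "false". The actual form may encode
    them differently.
    """
    params = {
        "startYear": start_year,
        "endYear": end_year,
        "textProse": str(text_prose).lower(),
        "textVerse": str(text_verse).lower(),
        "imageProse": str(image_prose).lower(),
        "imageVerse": str(image_verse).lower(),
    }
    # urlencode keeps the dict's insertion order and escapes the values.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Arbitrary example years, chosen only to show the URL shape.
    print(build_text_query_url(1860, 1870))
```

The resulting string could then be pasted into a browser or handed to an online tool that accepts a URL as its text source, since the query itself returns plain text.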