Looking for ways to implement automated topic discovery, I've concluded that part-of-speech tagging will be a necessary step in preparing input for any system that purports to do topic discovery, so today I've been downloading and playing with various POS taggers. The best seems to be the Stanford Log-linear POS Tagger. With the help of Matt Jockers's excellent tutorial, I got started on POS tagging some of the ColDesp files.
One note on the tutorial: it's aimed at *nix, and it says that you can process the content of multiple XML elements by separating their names with a backslash-escaped space on the command line (e.g. p\ ab to process <p> and <ab> tags). On Windows, you instead need to quote the list ("p ab").
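Concretely, the two invocations differ only in how the element list is escaped; the paths and filenames here are illustrative, following the tutorial's layout:

```shell
# *nix: escape the space between element names with a backslash
java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger \
  -model models/bidirectional-wsj-0-18.tagger -xmlInput p\ ab -textFile input.xml > tagged.xml

# Windows: quote the element list instead
java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/bidirectional-wsj-0-18.tagger -xmlInput "p ab" -textFile input.xml > tagged.xml
```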
So far I've been working on the assumption that the meat of the content in any of my files will be contained in <p> and <ab> tags, so those are what I'm targeting. The tagger passes the rest of the file through unchanged and tags only the text content of the elements you specify. At the command line you specify one input file and one output file, but we have thousands to process, so I needed to write a batch file. I have one working now; it takes between one and ten minutes per file, depending on the file's size and complexity. The approach I've taken is to put the source files inside the tagger's directory structure, with the batch file in the folder containing the source files; the output folder sits alongside the source folder, and the jar file is a couple of directories up the tree. This is what the batch file looks like:
for %%i in (*.xml) do java -mx300m -classpath ..\..\stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ..\..\models\bidirectional-wsj-0-18.tagger -xmlInput "p ab" -textFile "%%i" > "..\postagged\%%i"
meaning "for each XML file in this directory, run the tagger with the specified memory and model settings, and write the result under the same filename into the postagged folder". (I've made the path separators consistently Windows-style and quoted the filename variable, in case any filenames contain spaces.)
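For anyone following the tutorial's *nix setup rather than mine, a rough bash equivalent of the same loop would look like this (same illustrative directory layout: jar two levels up, output folder alongside the sources; untested on my side):

```shell
# Rough bash equivalent of the Windows batch loop above.
for f in *.xml; do
  java -mx300m -classpath ../../stanford-postagger.jar \
    edu.stanford.nlp.tagger.maxent.MaxentTagger \
    -model ../../models/bidirectional-wsj-0-18.tagger \
    -xmlInput p\ ab -textFile "$f" > "../postagged/$f"
done
```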
At an average of 5 minutes per file, it would take us nearly 600 hours to get the whole job done; that's 25 days straight. Matt's tutorial explains how to use XGrid to farm out this kind of job to multiple Macs, but it might be simplest just to set up a single dedicated Linux box and have it churn away for the duration; it would probably go faster if it weren't doing anything else (unlike my machine, which is also running lots of foreground apps).
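The arithmetic behind that estimate is easy to check; the file count of roughly 7,200 is my back-derivation from the 5-minute average and the 600-hour total, not a figure from the corpus itself:

```python
# Back-of-the-envelope timing for the tagging job. The file count of
# ~7,200 is an assumption inferred from the 5 min/file average and the
# 600-hour total quoted above.
files = 7200
minutes_per_file = 5

total_minutes = files * minutes_per_file
total_hours = total_minutes / 60   # 600.0
total_days = total_hours / 24      # 25.0

print(total_hours, total_days)     # 600.0 25.0
```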
My next problem is how to do noun-phrase discovery and analysis based on the POS-tagged text.
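As a starting point, here's a minimal sketch of the kind of thing I have in mind: given the tagger's word_TAG output (the separator is configurable, so the underscore here is an assumption), collect maximal runs of determiner/adjective/noun tags that contain at least one noun. A real grammar would be richer; this is just to illustrate the shape of the problem.

```python
# Minimal noun-phrase chunker sketch over Penn-Treebank-style POS tags.
# The DT/JJ*/NN* grammar is a deliberate simplification for illustration.

def np_chunks(tagged):
    """Return noun phrases from a list of (word, tag) pairs: maximal
    DT/JJ*/NN* runs that contain at least one noun."""
    phrases, run = [], []
    for word, tag in tagged + [("", ".")]:  # sentinel flushes the last run
        if tag == "DT" or tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word, tag))
        else:
            if any(t.startswith("NN") for _, t in run):
                phrases.append(" ".join(w for w, _ in run))
            run = []
    return phrases

# Parse a word_TAG line such as the tagger might emit (separator assumed).
line = "the_DT quick_JJ brown_JJ fox_NN jumps_VBZ over_IN the_DT lazy_JJ dog_NN"
tagged = [tuple(tok.rsplit("_", 1)) for tok in line.split()]
print(np_chunks(tagged))  # ['the quick brown fox', 'the lazy dog']
```

From there, counting and comparing the extracted phrases across the corpus would be the obvious next step.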