Doing Text Analysis on the ACH Abstracts
The Conference Abstracts are all encoded as TEI P4 XML, and we have made an effort to provide URL access both to individual abstracts and to the whole collection, in XML and in plain-text form. We encourage researchers to run text-analysis operations on the collection: it represents a snapshot of the state of humanities computing in 2005, and much could be learned from treating it as a textbase for analysis. You can access the documents as described below:
Accessing XML
- The entire corpus of abstracts is available as a teiCorpus.2 document here:
corpus.xq
- A combined bibliography, including all the citations from all the abstracts, is available as a TEI.2 document here:
biblio_xml.xq
- Individual abstracts can be accessed in XML format directly from the Program and Titles pages.
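For example, the following XQuery sketch loads the teiCorpus.2 feed, counts the abstracts it contains and lists their titles. The base URL is a placeholder for the real address of corpus.xq, and the sketch assumes an XQuery processor (such as Saxon) that can dereference HTTP URLs with fn:doc(), as well as the usual TEI P4 header structure (teiHeader/fileDesc/titleStmt/title).

    (: Count the abstracts in the teiCorpus.2 feed and list their titles.
       The URL is a placeholder for the real location of corpus.xq. :)
    let $corpus := doc("http://example.org/ach2005/corpus.xq")
    let $abstracts := $corpus//TEI.2
    return
      <report>
        <abstractCount>{ count($abstracts) }</abstractCount>
        {
          for $a in $abstracts
          return <title>{ normalize-space($a/teiHeader/fileDesc/titleStmt/title[1]) }</title>
        }
      </report>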
Accessing plain text
- The entire corpus of abstracts is available as a single UTF-8 plain-text document here:
corpus_text.xq
- If you are interested only in the prose body of the texts, excluding the header, metadata, and bibliographical information, you can access a stripped-down version here:
corpus_text_body.xq
- A combined bibliography, including all the citations from all the abstracts, is available in text format here:
biblio_text.xq
- Individual abstracts can be accessed in text format directly from the Program and Titles pages.
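As a quick illustration of what the plain-text feeds can be used for, the sketch below fetches the stripped body-text version and counts its word tokens. The URL is again a placeholder, and fn:unparsed-text() requires an XQuery 3.0 processor (for example a recent Saxon), so this is a present-day sketch rather than something the 2005-era feeds themselves provided.

    (: Fetch the body-text feed and count its word tokens.
       fn:unparsed-text() is an XQuery 3.0 function; the URL is a placeholder. :)
    let $text  := unparsed-text("http://example.org/ach2005/corpus_text_body.xq")
    let $words := tokenize(lower-case($text), "[^\p{L}\p{N}]+")[. ne ""]
    return count($words)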
Demonstration uses
Although they are not strictly text-analysis operations, the following are a couple of examples of transformations and renderings created from the XML feeds:
- Here you can see the combined bibliography rendered into XHTML:
biblio_xhtml.xq
- This page shows "who cited whom". It is built by collecting every name cited as an author or editor in any of the bibliographies, then linking each name to the abstracts that cite it (a sketch of this kind of query appears below, after this list). Be patient: the operation can take a minute or two to complete:
biblio_authors_xhtml.xq
- You can use the McMaster University TAPoRware Tools to run many different text-analysis operations on URL-based resources. For instance, a word-frequency operation ("List Words"), run against the text version of the corpus (body text only: see above), produces the following top ten words, in descending order of frequency: text, information, texts, digital, humanities, project, research, use, work, and data. (This uses the Glasgow stop-word list, and I have also ignored instances of the letter S on its own and the article la used in non-English texts.)
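A frequency listing of this kind can also be sketched directly in XQuery against the body-text feed. The stop-word list below is a tiny illustrative sample rather than the Glasgow list, the URL is a placeholder, and fn:unparsed-text() again requires an XQuery 3.0 processor; the ranking pass is deliberately simple and unoptimised, so it will not reproduce the TAPoRware figures exactly.

    (: Top-ten word-frequency list over the body-text feed, with a small
       illustrative stop-word list (not the Glasgow list). :)
    let $text  := unparsed-text("http://example.org/ach2005/corpus_text_body.xq")
    let $stop  := ("the", "and", "of", "a", "in", "to", "is", "that", "for", "s", "la")
    let $words := tokenize(lower-case($text), "[^\p{L}\p{N}]+")[. ne "" and not(. = $stop)]
    let $ranked :=
      for $w in distinct-values($words)
      order by count($words[. = $w]) descending
      return $w
    return subsequence($ranked, 1, 10)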
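The "who cited whom" page above can likewise be approximated with a query over the XML corpus feed. This sketch assumes the citations are encoded as TEI <bibl> elements with <author> and <editor> children, and the corpus URL is a placeholder; it is an illustration of the idea, not the query the site actually runs.

    (: For every name cited as an author or editor, list the titles of the
       abstracts whose bibliographies cite it. Assumes TEI <bibl> entries with
       <author>/<editor> children; the corpus URL is a placeholder. :)
    let $corpus := doc("http://example.org/ach2005/corpus.xq")
    let $names  := distinct-values(
      for $n in ($corpus//bibl/author, $corpus//bibl/editor)
      return normalize-space($n)
    )
    for $name in $names
    order by $name
    return
      <cited name="{ $name }">
        {
          for $a in $corpus//TEI.2
              [.//bibl/author[normalize-space(.) = $name]
               or .//bibl/editor[normalize-space(.) = $name]]
          return <citedIn>{ normalize-space($a/teiHeader/fileDesc/titleStmt/title[1]) }</citedIn>
        }
      </cited>

The nested scan over the whole corpus is one reason an operation like this can take a minute or two to complete.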
How the system works
The program page, author page, title list and keyword list, along with the text-analysis feeds described above, are all based on the same underlying set of XML documents. This is how we built the system:
- Each abstract is encoded as a TEI P4 XML document.
- All the abstracts are stored in an eXist XML database.
- The database runs under the Apache Cocoon web framework.
- Cocoon itself runs under the Apache Jakarta Tomcat servlet engine.
- All access to the XML documents is through XQuery queries run against the eXist database. Different queries serve different purposes: sometimes a complete abstract document is retrieved, sometimes a fragment, and sometimes a list of fragments drawn from all the abstracts (for example, a list of all the authors' names). A sketch of one such query, combined with an XSLT step, appears after this list.
- The queries return XML documents or fragments. If XML is what is required, the result can be piped directly back to the browser; if another format is required, it is created as follows:
  - XHTML is produced through an XSLT transformation, and the XHTML document is linked to CSS and JavaScript files to make an interactive Web page.
  - PDF is produced in a two-stage process: first, XSLT is used to transform the XML to XSL:FO; then the XSL:FO is transformed using the RenderX XEP engine to create a PDF document.
  - Plain text is produced through an XSLT transformation which strips out all the tags.
- The printed Abstract Book has also been produced using the same system, by building a teiCorpus.2 document incorporating all the abstracts. A complex XSLT transformation generates the table of contents, author index and keyword index and produces XSL:FO, which is then processed with XEP to create the PDF document for the printer.
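To make the pipeline more concrete, here is a sketch of the kind of query that runs inside eXist: it gathers a fragment list (every author name in the stored abstracts) and hands the result to an XSLT stylesheet for XHTML output. The collection path and stylesheet path are placeholders, and the sketch assumes eXist's transform extension module is available; the live system applies its XSLT transformations through its own Cocoon pipeline, so this only illustrates the shape of the query-plus-transform step.

    (: Sketch: query the stored abstracts for a fragment list (author names)
       and apply an XSLT stylesheet to the result. Collection and stylesheet
       paths are placeholders; assumes eXist's transform module is enabled. :)
    declare namespace transform = "http://exist-db.org/xquery/transform";

    let $authors :=
      <authorList>
        {
          for $name in collection("/db/ach2005")//teiHeader//titleStmt/author
          order by normalize-space($name)
          return <author>{ normalize-space($name) }</author>
        }
      </authorList>
    return transform:transform($authors, doc("/db/ach2005/xslt/authors-to-xhtml.xsl"), ())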