Classifying the Chimera

Federico Meschini
fmeschini@tin.it
Tuscia University

While the Digital Library (DL) concept is an extremely vague one, its paradigm, the actual implementation, is, if possible, even more elusive. Since the late 1980s and throughout the 1990s, the primary concern and task in the field of textual digital resources was a basic (but not simple) one: how can we put texts into computers? How can we encode, manage and store them? What stuff are digital texts made of? Other important problems were perceived (visualization, for example), but they were temporarily put aside for practical reasons. The relationship between text and data is really a Castor and Polydeuces one: is text a particular kind of data, or data a particular kind of text?

The main choice in the text encoding issue, and the winning one in the long run, was the use of the powerful Standard Generalized Markup Language (SGML) and, more specifically, of the rules (or better, guidelines) established by the Text Encoding Initiative (TEI), which can be considered the de facto standard for encoding humanistic texts in digital format. The transition from SGML to the eXtensible Markup Language (XML) was, more than an evolution, a sort of Copernican revolution, for two main reasons: the introduction of a new stylesheet technology for displaying an XML file, in particular the powerful Extensible Stylesheet Language Transformations (XSLT), and the great diffusion and development of open-source software able to manage XML documents. This transition has of course also taken place in the TEI: starting with the P4 version of the Guidelines, XML is now the technology used.

As a logical result, various open-source software tools for implementing a digital library of texts encoded in TEI/XML are now available. These tools fall into two main categories. In the first group are programs created in mainstream Information Technology contexts (database management systems, native XML databases, publishing frameworks, etc.), which need to be adapted to the specific aims of a digital library [Note 1: E.g., Apache Cocoon, Apache AxKit, eXist]. In the second group, and this is a new trend, are tools developed in academic contexts [Note 2: Anastasia, teiPublisher, XPhilologic], either created from scratch or built on programs from the first group as a basic core; in both cases the final purpose is use with digital cultural resources and, above all, with TEI-encoded texts. Compared to the first group, the added value of these tools is that they often provide specific features for the textual field [Note 3: See for example the TAPoRware set for textual analysis, or the Versioning Machine for the comparison of different versions and editions of the same text.].

The choice is now much wider than it was just a few years ago, but this can also cause some confusion in selecting the right tool. As experts in knowledge management well know, too much information, if not well structured and organized, is equivalent to no information at all, and thus completely useless. Each piece of software has its own peculiarities, which should be evaluated and weighed against the characteristics of the texts being encoded and against the general needs and aims of the project of which the digital library is part. What is the main aim of a project? Is it visualization, perhaps with a multiple-output feature? Is it research, with some form of advanced textual analysis? Is it possible to combine all these aspects?
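To make the shared starting point concrete, a minimal sketch of a TEI-encoded text follows (simplified and hypothetical: a real document would carry a fuller header and validate against the P4 DTD):

  <TEI.2>
    <teiHeader>
      <fileDesc>
        <titleStmt><title>A Sample Text</title></titleStmt>
        <publicationStmt><p>Unpublished sample.</p></publicationStmt>
        <sourceDesc><p>Born-digital example.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <text>
      <body>
        <div type="chapter" n="1">
          <head>Chapter One</head>
          <p>The <emph>encoded</emph> text goes here.</p>
        </div>
      </body>
    </text>
  </TEI.2>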
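An XSLT stylesheet then maps this markup onto a presentation format. The following sketch, again simplified and hypothetical, renders the fragment above as HTML; swapping in a different stylesheet would produce, say, PDF or plain text from the same source, which is the multiple-output scenario just evoked:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Wrap the TEI document in an HTML page -->
    <xsl:template match="/TEI.2">
      <html>
        <head>
          <title><xsl:value-of select="teiHeader/fileDesc/titleStmt/title"/></title>
        </head>
        <body><xsl:apply-templates select="text/body"/></body>
      </html>
    </xsl:template>
    <!-- Map individual TEI elements onto HTML equivalents -->
    <xsl:template match="head"><h1><xsl:apply-templates/></h1></xsl:template>
    <xsl:template match="p"><p><xsl:apply-templates/></p></xsl:template>
    <xsl:template match="emph"><em><xsl:apply-templates/></em></xsl:template>
  </xsl:stylesheet>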
And a further question: if somebody has already found a solution to these problems, how can it be found on the net? It was precisely in order to share the same problems, and so arrive at common solutions, that the first meeting of the TEI Presentation Tools Special Interest Group was held during the TEI Members Meeting 2003 [Note 4: ]. The Presentation Tools SIG has two main initiatives: the creation and maintenance of a tool list and of a sample collection of texts for testing.

The first version of the tool list was presented at the TEI Meeting in Baltimore last October. The list is itself a digital document encoded in TEI/XML, and it is currently published in HTML by means of an XSLT stylesheet [Note 5: Available online at .]. With a simple structure (a sketch of a possible entry is given below), it presents the various software packages in alphabetical order, with a short description and links to the various implementations. From the descriptions and the links it is possible to get an idea of the distinctive features of each tool: for example, that XPhilologic is very good at full-text search and document retrieval [Note 6: See for example the demo on the Brown Women Writers Collection at .], that Apache Cocoon can be set up as an XML framework delivering the same TEI document in several output formats [Note 7: A good implementation of Cocoon with TEI can be found at .], that Anastasia makes it possible to produce an electronic text/image edition of a medieval manuscript [Note 8: See Caxton's Canterbury Tales at .], and that eXist is a powerful native XML database that can be used for queries and research [Note 9: See the Digital Quaker Collection at .].

But this is not enough. A list can provide some information, but what is really needed (and has been planned since the beginning) is a higher level of classification, and this becomes all the more necessary as the number of such tools increases [Note 10: See the presentations on this subject at ALLC/ACH 2004, among others: Kumar, Amit, et al., "teiPublisher: a repository management system for TEI documents"; Zimmerman, Matthew, "Using AMP technology (Apache, MySQL, PHP) for XML publication"; Ramsay, Stephen, Rockwell, Geoffrey, and Sinclair, Stéfan, "TAPoRware: Simple Portal Tools for Text Analysis".]. Is a given piece of software written in Java or in Perl? What are the requisites for running it on a particular machine? Is it XML-aware? Does it allow XSLT transformations? Are the texts stored in the file system or in a database? What are its peculiar features? Can it be integrated with other software in order to extend its possibilities? Can it be customized? Clearly, a simple list cannot answer all these questions.

Much discussion has taken place about the kind of classification to apply to the tool list. In my opinion it should be based on practical rather than theoretical principles, on a sort of empirical and pragmatic observation, and it should include links to as many concrete implementations of these tools as possible, so as to highlight the best practices and the particular features of each digital library. A good way of realizing this classification could be the use of the ISO 13250 standard [Note 11: ], Topic Maps, and the related XML Topic Maps (XTM) syntax [Note 12: ].
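Before turning to topic maps, here is what a single entry of the current tool list might look like, in TEI Lite-style markup (the wording and the elided link are purely illustrative):

  <list type="gloss">
    <label>eXist</label>
    <item>
      <p>Open-source native XML database with full-text indexing;
         stores TEI documents and queries them via XPath.</p>
      <p>Implementations: <xref url="...">Digital Quaker Collection</xref></p>
    </item>
  </list>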
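The flavour of the XTM syntax itself can be conveyed by a minimal, hypothetical fragment, in which one tool is declared as a topic, its home page as an occurrence, and its relation to a category as an association (the topic identifiers and the classification vocabulary are invented for the example, and the typing topics referred to, such as #homepage or #tool, are omitted for brevity):

  <topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
            xmlns:xlink="http://www.w3.org/1999/xlink">
    <!-- The tool itself, with its name and a pointer to a concrete resource -->
    <topic id="exist">
      <baseName><baseNameString>eXist</baseNameString></baseName>
      <occurrence>
        <instanceOf><topicRef xlink:href="#homepage"/></instanceOf>
        <resourceRef xlink:href="http://exist.sourceforge.net/"/>
      </occurrence>
    </topic>
    <!-- A category the tool can be classified under -->
    <topic id="native-xml-database">
      <baseName><baseNameString>Native XML database</baseNameString></baseName>
    </topic>
    <!-- The association linking tool and category -->
    <association>
      <instanceOf><topicRef xlink:href="#belongs-to-category"/></instanceOf>
      <member>
        <roleSpec><topicRef xlink:href="#tool"/></roleSpec>
        <topicRef xlink:href="#exist"/>
      </member>
      <member>
        <roleSpec><topicRef xlink:href="#category"/></roleSpec>
        <topicRef xlink:href="#native-xml-database"/>
      </member>
    </association>
  </topicMap>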
As the fragment above suggests, a topic map is based on the definition of general topics, the particular and real occurrences of each topic, and the associations between different topics. In my opinion this is therefore the best way to obtain a complete classification, one able to cover the various aspects involved, from the most technical, concerning the programming languages used or the technical requirements, to the functionalities of visualization, text search and analysis [Note 13: For an introduction to topic maps see Pepper, Steve, "The TAO of Topic Maps: finding the way in the age of infoglut".]. What is now a TEI document should thus be reworked into an XTM document, by detecting, separating, organizing, linking and classifying all the information that is currently presented in a linear structure.

Once created, the XTM file representing the topic map can be used and navigated in several ways. Being an XML file, it can be processed with the same technologies used for the TEI texts, but there is also dedicated software able to exploit the great potential of this standard, such as the Omnigator from Ontopia [Note 14: ] or TM4J [Note 15: ], an open-source Java package expressly developed for creating, manipulating and publishing topic maps. Topic map technology was presented in relation to the TEI for the first time during the 2003 meeting, and interest in it is growing in this community because of the possibility of adding a semantic metadata layer to digital collections [Note 16: Bradley, John, "A Model for Text Analysis Tools".]. Moreover, thanks to the possibility of merging different XTM documents, each representing a different map, the Presentation Tools topic map could be integrated with maps about other subjects, for example the textual content [Note 17: See Walsh, John A., "Topic Maps and TEI-Encoded Literary Texts".] or the documentation of the local views of the DTDs [Note 18: Brown, Stuart, "A Topic Map for the TEI".], thus creating the basis for the definition of what could become a TEI Ontology.

Bibliography

Anastasia
teiPublisher
XPhilologic
TAPoRware
Versioning Machine
Kumar, Amit, et al. "teiPublisher: a repository management system for TEI documents." Paper delivered at the ALLC/ACH 2004 Conference, Göteborg, 2004.
Zimmerman, Matthew. "Using AMP technology (Apache, MySQL, PHP) for XML publication." Paper delivered at the ALLC/ACH 2004 Conference, Göteborg, 2004.
Ramsay, Stephen, Rockwell, Geoffrey, and Sinclair, Stéfan. "TAPoRware: Simple Portal Tools for Text Analysis." Paper delivered at the ALLC/ACH 2004 Conference, Göteborg, 2004.
Pepper, Steve. "The TAO of Topic Maps: finding the way in the age of infoglut." Paper delivered at the XML Europe 2000 Conference, Paris, 2000.
Bradley, John. "A Model for Text Analysis Tools." Literary and Linguistic Computing 18.2 (2003): 185-207.
Walsh, John A. "Topic Maps and TEI-Encoded Literary Texts." Paper delivered at the Digital Resources for the Humanities Conference, Newcastle upon Tyne, 2004.
Brown, Stuart. "A Topic Map for the TEI." TEI Consortium, 2003.