Marked up to be included in the ACH/ALLC 2005 Conference Abstracts book.
Electronic archival finding aids encoded in Encoded Archival Description (EAD) are transported across networks and rendered into HTML for display in the browser. Considering the time, effort and money involved in marking up the finding aids, has the markup been used for retrieval purposes? Has the multilevel hierarchical nature of finding aids been used for searching? A few online EAD tag-based retrieval systems process queries by looking for occurrences of the search term in the corresponding EAD tag, but they do not seem to address subject- or topic-based queries. This study explores the possibility of using the content of specific EAD tags for subject retrieval purposes. We studied the consistencies, commonalities and discrepancies in the usage of various critical tags across repositories participating in the
This study was conducted on 1226 EAD-encoded finding aids from nine archiving institutions which are part of
In the second part, we examined whether these finding aids had been encoded according to standard archival descriptive practice (i.e. whether the text within these EAD tags was appropriate). We did this through text processing: we extracted the text from the specific tags and processed it to arrive at a vocabulary. We conducted this study on the part of the finding aids corresponding to the
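The extraction step can be illustrated with a minimal sketch. It assumes a well-formed EAD file without XML namespaces; the sample fragment, the tiny stop list, and the helper names (`tag_text`, `tokenize`, `tag_vocabulary`) are hypothetical and are not the processing pipeline used in the study.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily simplified EAD fragment (not taken from the studied finding aids).
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did><abstract>Letters and diaries of a railroad engineer.</abstract></did>
    <bioghist><p>The engineer worked on railroad surveys.</p></bioghist>
    <scopecontent><p>Letters, diaries, and survey maps.</p></scopecontent>
  </archdesc>
</ead>
"""

# Illustrative stop list only; a real study would use a much fuller one.
STOPWORDS = {"a", "and", "of", "on", "the"}


def tag_text(root, tag):
    """Join all text inside every <tag> element, including nested children."""
    return " ".join("".join(el.itertext()) for el in root.iter(tag))


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tag_vocabulary(xml_string, tag):
    """Return term counts for the text found under one EAD tag."""
    root = ET.fromstring(xml_string.strip())
    return Counter(tokenize(tag_text(root, tag)))


if __name__ == "__main__":
    for tag in ("abstract", "bioghist", "scopecontent"):
        print(tag, dict(tag_vocabulary(SAMPLE_EAD, tag)))
```

Keeping one vocabulary per tag, rather than one per finding aid, is what allows the tags to be compared separately later. Finding aids that declare an XML namespace would need qualified tag names in `root.iter`.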
In the third step, using the vocabularies obtained, we represented these finding aids as vectors in the vocabulary space. In this vector representation, we compared finding aids using cosine similarity in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) weighting. The TF-IDF scheme weights rarely used words higher than commonly used words, and also accounts for the size of the document. We then clustered the finding aids with an online clustering tool (wCLUTO) using the agglomerative clustering algorithm. Agglomerative clustering groups finding aids based on the similarity of their content, resulting in a tree of documents. The lowest levels of the tree correspond to individual finding aids and the highest level corresponds to the entire set of finding aids. Our study focused on low-level clusters, which are of particular interest to archivists, as these clusters address the descriptive material embedded in the various EAD tags. To determine the similarities between finding aids, we extracted vocabularies for individual tags like <abstract>, <scopecontent> and <bioghist> and clustered the finding aids based on the similarity of textual content with respect to these individual tags. Further, we combined the similarity relations between finding aids, based on these individual tags, to build a space that encompasses the content similarity for a combination of tags. Our clustering results on individual tags and on combinations of tags are in agreement with the classification provided by the curators of the
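As a rough illustration of the comparison step, the sketch below weights terms with one common TF-IDF variant (length-normalized term frequency times log inverse document frequency) and compares documents by cosine similarity. The exact weighting used in the study is not stated here, so this variant is an assumption, and the three token lists are invented. The clustering itself, done with wCLUTO in the study, is not reproduced.

```python
import math
from collections import Counter

# Hypothetical token lists, e.g. extracted from one EAD tag of three finding aids.
DOCS = {
    "aid_a": ["railroad", "survey", "letters"],
    "aid_b": ["railroad", "survey", "maps"],
    "aid_c": ["poetry", "letters", "drafts"],
}


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: weight} TF-IDF vector."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # tf is normalized by document length; idf is log(N / df).
        vectors[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


VECTORS = tfidf_vectors(DOCS)

if __name__ == "__main__":
    names = sorted(VECTORS)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(a, b, round(cosine(VECTORS[a], VECTORS[b]), 3))
```

In this toy data, `aid_a` shares two terms with `aid_b` and only one with `aid_c`, so it scores as more similar to `aid_b`. A pairwise similarity table of this kind is the sort of input an agglomerative clustering tool works from.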
We conclude from our study that finding aids marked up according to standard archival descriptive practice yield meaningful content-based clusters of similar finding aids. Further, we were able to demonstrate (i) the ability to form 'neighborhoods' of similar finding aids using either individual tags or a combination of tags, and (ii) that the 'neighborhoods' were different for different tags or combinations of tags. From this idea of a 'neighborhood' of finding aids, we propose a searchable interface for a repository of finding aids by means of the EAD tags. This search facility, we think, enhances the prospect of creating a web of similar and thus interconnected finding aids, which in turn would facilitate research in the field of archives and help researchers form cliques around common research interests and goals.
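The idea of tag-specific 'neighborhoods' can be sketched as follows. The similarity values are invented, the function names are hypothetical, and averaging the per-tag similarities is only one assumed way of combining tags; the combination rule used in the study is not specified here.

```python
def pair(a, b):
    """Unordered pair of finding-aid identifiers, usable as a dict key."""
    return frozenset((a, b))


# Hypothetical pairwise similarities between three finding aids, one table per EAD tag.
SIM_BY_TAG = {
    "abstract": {pair("A", "B"): 0.8, pair("A", "C"): 0.2, pair("B", "C"): 0.1},
    "bioghist": {pair("A", "B"): 0.1, pair("A", "C"): 0.9, pair("B", "C"): 0.3},
}


def combine(sim_by_tag, tags):
    """Average the similarities over the chosen tags (an assumed combination rule)."""
    pairs = sim_by_tag[tags[0]].keys()
    return {p: sum(sim_by_tag[t][p] for t in tags) / len(tags) for p in pairs}


def neighborhood(sim, doc, k=1):
    """Return the k finding aids most similar to doc under one similarity table."""
    scored = []
    for p, score in sim.items():
        if doc in p:
            (other,) = p - {doc}  # the other member of the pair
            scored.append((score, other))
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored[:k]]


if __name__ == "__main__":
    print(neighborhood(SIM_BY_TAG["abstract"], "A"))
    print(neighborhood(SIM_BY_TAG["bioghist"], "A"))
    print(neighborhood(combine(SIM_BY_TAG, ["abstract", "bioghist"]), "A"))
```

With these invented numbers, the nearest neighbor of finding aid A changes with the tag consulted, which mirrors the observation that neighborhoods differ across tags and combinations of tags. A search interface could expose the tag choice as a query option.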
Our study demonstrates that text processing techniques from the field of information retrieval can be applied to the field of archives, with the goal of enabling EAD-encoded finding aids to transition to the digital world, become visible in the realm of online documents, and be accessible to researchers.