Formalizing our text categorization
We're currently using a rather messy textual classification method based on the use of <textClass> and <classCode> pointing at a non-existent scheme; what's more, our classification codes seem to overlap a bit, and fall into two distinct classes. I think it's time to revisit this aspect of our encoding and put it on a sound formal basis. To that end, I have:
- Created a new file in /mariage/ called global_metadata.xml, in which we can centralize a variety of metadata and link to it (this should eventually include things such as availability/licensing).
- Modified the ODD file and generated a new schema to allow for the creation of taxonomies. In the process, I also fixed the oddity whereby <revisionDesc>/@status could only be set to "proofing"; we now have a set of document status values which I think will be more useful.
- Created an initial taxonomy of textual types which matches what we currently have.
- Summarized the issue for CC and asked for guidance on how to continue.
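As a rough illustration, a pair of taxonomies declared in global_metadata.xml might look something like the sketch below. This is not our final scheme: the xml:id values and category labels are hypothetical placeholders.

```xml
<!-- Hypothetical sketch of taxonomy declarations for global_metadata.xml;
     all xml:id values and category labels are illustrative only. -->
<classDecl>
  <taxonomy xml:id="textTypes">
    <category xml:id="tt_prose">
      <catDesc>Prose</catDesc>
    </category>
    <category xml:id="tt_verse">
      <catDesc>Verse</catDesc>
    </category>
  </taxonomy>
  <taxonomy xml:id="contentTypes">
    <category xml:id="ct_religion">
      <catDesc>Religion</catDesc>
    </category>
  </taxonomy>
</classDecl>
```

Keeping the taxonomies in one central file means each document header only needs a pointer, and a category label can be corrected in a single place.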
I think we need two separate taxonomies: one for text types (e.g. prose) and one for content types (e.g. religion). Then we can add any number of <textClass> elements to any given document, each pointing at a specific scheme and code, and use these to filter documents in specialist TOCs and in the search interface.
We should also presumably look for any existing applicable taxonomies that we could adopt.
This arises out of my preparation of the documents for submission to the TAPAS project, which required some standardization of data in the headers. I also removed the pointless "An Electronic Edition" subtitle from all our documents, and tweaked a couple of other things.