Some hard thinking about the file format
Posted by mholmes on 23 Mar 2009 in Activity log
I've been looking again at how the jclCompression libraries work, and it's sent me back to thinking about how I should be saving this data in the first place. It needs careful planning. There is a range of conditions we need to avoid or allow for, which makes simply bundling the files into an archive inadequate. These are my decisions:
- We'll use a plain zip format. Compression/file size are not major issues, and zip is very portable.
- Since the two input files ("left" and "right" texts) might have the same filename (being taken from different locations on a disk, but named identically), we'll need to store some info about them which specifies exactly where they came from. This means we can't zip them with their original names, so we might as well zip them with "left.xml" and "right.xml".
- Other files include similarities.xml (the current state of the similarities list), a putative combined file (described in a previous post, based on teiCorpus), and a metadata file.
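As a rough sketch of the archive layout described above: the following is Python used purely for illustration, not the jclCompression code. The function name `build_archive`, the member name `metadata.xml`, and the `<metadata>`/`<source>` layout are all assumptions of mine, since the metadata format is still undecided; the combined corpus file is left out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_archive(left_path, left_bytes, right_path, right_bytes, similarities_bytes):
    """Return the bytes of a zip archive holding both texts under fixed names."""
    # Hypothetical metadata layout: one <source> element per text, recording
    # the original location that is lost once the file is renamed.
    meta = ET.Element("metadata")
    for member, original in (("left.xml", left_path), ("right.xml", right_path)):
        ET.SubElement(meta, "source", {"member": member, "originalPath": original})

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Fixed member names avoid a clash when both inputs share a filename.
        archive.writestr("left.xml", left_bytes)
        archive.writestr("right.xml", right_bytes)
        archive.writestr("similarities.xml", similarities_bytes)
        archive.writestr("metadata.xml", ET.tostring(meta, encoding="utf-8"))
    return buffer.getvalue()
```

Two inputs that are both called, say, text.xml end up as distinct members, while their original paths survive in the metadata file.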
There are two or three initial things I need to do before actually implementing a file save:
- Rework the id-creation code so that it enforces uniqueness across the two files, not just within one file. This is necessary since we'll be creating a single corpus file incorporating the two documents. Perhaps we should also think about users creating multiple-document archives, and try to create ids which are not only unique across two documents, but are also likely to stay unique when more files are added. However, users can already enforce this themselves by setting the id prefixes in the settings XML tab, so maybe we can just leave that to them.
- We need to figure out an appropriate metadata file format, which can be used to describe the contents of the archive. Dublin Core in RDF is one possibility.
- Figure out how to structure the corpus file, and create a routine for building it.
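To make the first and last of these tasks a bit more concrete, here is a minimal Python sketch (again illustration only, not the application's actual code). It assumes every element without an `xml:id` should get one, that ids are a prefix plus a counter, and that the default prefixes are `L` and `R`; all of those, along with the function names, are hypothetical. It does not repair ids that already collide between the two inputs, and a real teiCorpus would also need its own teiHeader, which is omitted here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def existing_ids(root):
    """Collect every xml:id already present in a document."""
    return {e.get(XML_ID) for e in root.iter() if e.get(XML_ID) is not None}


def assign_ids(root, prefix, used):
    """Give each element lacking an xml:id a new one that is not in `used`."""
    counter = 0
    for element in root.iter():
        if element.get(XML_ID) is not None:
            continue
        counter += 1
        # Skip candidates taken by either document, not just this one.
        while f"{prefix}{counter}" in used:
            counter += 1
        element.set(XML_ID, f"{prefix}{counter}")
        used.add(f"{prefix}{counter}")


def build_corpus(left_root, right_root, left_prefix="L", right_prefix="R"):
    """Wrap both documents in a single teiCorpus element (no teiHeader yet)."""
    used = existing_ids(left_root) | existing_ids(right_root)
    assign_ids(left_root, left_prefix, used)
    assign_ids(right_root, right_prefix, used)
    corpus = ET.Element(f"{{{TEI_NS}}}teiCorpus")
    corpus.append(left_root)
    corpus.append(right_root)
    return corpus
```

Sharing one `used` set between both documents is what enforces uniqueness across files; distinct prefixes only make clashes less likely.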
This entry was posted by Martin and filed under Activity log.