Browse sequence and summary research
I've been looking again at ways of providing alternative paths into the collection, beyond the existing search and date-order options. Today I did some more background reading, and came up with a couple of ideas that are worth investigating.
It looks like an approach called scatter-gather would be appropriate. This was new to me. It involves a sequence of:
- Scatter: present the collection as a series of themes or topic areas, from which the reader chooses one or more;
- Gather: collect all the documents from the chosen areas into an ad-hoc subcollection;
- Scatter again: divide the subcollection into another set of themes or topics, from which the reader again selects;
and so on, until you get down to a manageable subset of documents. Typically, the initial scatter on the whole collection is done offline, but subsequent scatters are done on-the-fly, as the reader narrows down their topics.
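As a rough illustration of that loop, here is a toy sketch in Python. It is not a usable implementation, and it is not the clustering method from the papers: it assumes plain bag-of-words vectors, cosine similarity, and a naive farthest-first choice of seed documents, all swapped in only to keep the example short. The function names (`scatter`, `gather`, `label`) and the six sample texts are invented for illustration.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term counts for one document (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def scatter(docs, k):
    """Split docs (id -> text) into at most k clusters of document ids.

    Naive stand-in for a real clustering step: pick seeds farthest-first,
    then attach every document to its most similar seed.
    """
    ids = sorted(docs)
    vecs = {i: vectorize(docs[i]) for i in ids}
    seeds = [ids[0]]
    while len(seeds) < min(k, len(ids)):
        candidates = [i for i in ids if i not in seeds]
        # Next seed: the document least similar to every seed chosen so far.
        seeds.append(min(
            candidates,
            key=lambda i: max(cosine(vecs[i], vecs[s]) for s in seeds)))
    clusters = {s: [] for s in seeds}
    for i in ids:
        best = max(seeds, key=lambda s: cosine(vecs[i], vecs[s]))
        clusters[best].append(i)
    return list(clusters.values())

def label(cluster, docs, n=3):
    """Most frequent terms in a cluster, as a crude theme label."""
    counts = Counter()
    for i in cluster:
        counts.update(vectorize(docs[i]))
    return [term for term, _ in counts.most_common(n)]

def gather(clusters, chosen, docs):
    """Merge the chosen clusters (by index) into an ad-hoc subcollection."""
    return {i: docs[i] for c in chosen for i in clusters[c]}

# Invented sample texts, standing in for real documents.
docs = {
    "a": "ships harbour ships navy",
    "b": "navy harbour survey",
    "c": "gold miners river",
    "d": "gold river claims",
    "e": "timber export duty",
    "f": "timber duty licence",
}

clusters = scatter(docs, 3)            # scatter: three themes
for c in clusters:
    print(label(c, docs), c)
sub = gather(clusters, [0, 2], docs)   # gather: reader picks two themes
print(scatter(sub, 2))                 # scatter again on the subcollection
```

On a real collection the first scatter would be computed offline and the later ones would need something much faster and better than this, which is exactly the gap in available tools.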
It seems to me that for our particular collection -- huge, baggy, and with no fundamental organizing principle, where even date order is problematic -- this might be a very effective research or reading strategy. The best description of scatter-gather I've come across so far is in a 1992 paper, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections". A promising approach to the Scatter phase is described in "Fuzzy Clustering for Topic Analysis and Summarization of Document Collections".
So now I'm looking for tools to help me do this. I've found plenty of papers detailing algorithms for fuzzy topic extraction and grouping, but no working implementation that I can actually use, and I'm not eager to start implementing anything like this from scratch on my own.
I've written to the TEI list to see if anyone can recommend any implementations or tools. Meanwhile, I'm still reading around the general topic, which seems to be called "Information Foraging Theory". I think one of the principal values of the ColDesp collection, independent of its historical significance, lies in its suitability for just this kind of research into searching, browsing and reading strategies.