Lots of rethinking going on as I try to cast the text of the presentation into graphical form; I've ended up regenerating a number of the graphs to show two types of data on the same graph, for clarity and efficiency.
Also, while formatting the presentation, we learned that to make centring of an XHTML block element work reliably through CSS, you not only have to use
margin-left: auto; margin-right: auto;
you often also have to include
left: 0; right: 0;
(or some fixed values). This is necessary when the element is absolutely positioned: with position: absolute, the auto margins only centre the element when left and right are set as well.
Printed out various maps for finding the accommodation and the venues for the DH 2010 conference next week.
With GN's help, got the presentation code set up with our HCMC theme, using pure SVG for the logo, and coded the first three sections of the presentation (which are fairly straightforward). A Firefox 4 beta came out today, so we downloaded and tested it, and discovered that its JS engine is now on a par with Chrome's, so we'll be able to use Firefox even for large presentations using this codebase in the future; version 3.6 was really unable to cope with the Coldesp presentation, so this is a good thing. It's especially welcome because it turns out that Chrome's SVG support leaves a lot to be desired: it seems to fail to handle the fill-opacity setting on text and tspan elements.
I've finished the first draft of the text of my talk, and in the process created all of the graphics I need to use. I might turn this into a word-processor document at some stage (it's plain text right now), and make a PDF of it available, but first I'll build the presentation from it. It's probably too long, but there will be parts I can skip if time is short.
I'm writing the actual prose text of my talk for this presentation, because I want to see how long it turns out to be (doubtless too long), and also to let the argument drive the presentation rather than the presentation materials driving the argument. I've done about two thirds of it today, and also begun converting some of the graphs into graphics I can use in the presentation. I may end up dropping a lot of content relating to the use of containment metrics in SC, because the Jaccard Distance calculation is more appropriate and relevant. That'll simplify things a bit.
I have a complete set of the tests based on the Lear data, and I've rewritten the XSLT to render it more usefully into a set of tables (over 100 pages of them at this point). I haven't been able to look too closely at it yet, but it seems that normalizing the data before running the USM comparison yields slightly better results (meaning a larger number of intuitively useful-looking matches before the weird ones start), but USM still doesn't achieve results as good as ShingleCloud's. That said, USM is still faster, and still gets reasonably good results without normalization.
I've now expanded my Lear test script so that it includes all three types of ShingleCloud metric (containment averaged, Jaccard with word tokenization, and Jaccard with character tokenization), NCD, and two types of output from my USM library (text normalized and not normalized). That's running now. So far, it looks as though the normalization in USM adds very little to the execution time, so it still looks faster than SC; it remains to be seen whether it finds more candidate matches or not.
I've added case normalization and punctuation stripping to my USM library, on the basis that ShingleCloud does both of these things, so the comparison between them isn't really level otherwise. The most likely effect will be to even up the runtimes, since USM will now run more slowly, but it should also improve the "success" of USM relative to SC, which is currently doing better (based on my own human judgement of successful matches). I'll now have to re-run all the tests with this new measure, but I think what I'll do is merge all of the different variants for the Lear texts into a single operation; that will take a long time to run and process, but should give more useful output when it's done.
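For reference, the normalization amounts to something like this (a bash sketch of the idea only; the real work happens inside the USM library, and the file name is a placeholder):

  normalize() {
    tr '[:upper:]' '[:lower:]' |  # case normalization
    tr -d '[:punct:]'             # punctuation stripping
  }
  normalize < quarto.txt > quarto-normalized.txt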
I've also done a fair amount of examination of the existing Lear results, and it does seem that, without normalization, even the SC needle/haystack calculations produce a longer run of intuitively "good" matches before spurious ones begin to appear: USM/NCD manage about 60 matches before the oddities start, while SC gets into the 70s.
Found a good article on the limitations of NCD which result from the block window size of the various compressors one might use to do the NCD calculation; this points out that if the length of the concatenation of the input strings exceeds the block window size, then NCD is basically useless. The limit for bzip2 (in default configuration) is 900KB, and for GZip, which I'm using, it's 32KB. This underlines my conclusions that USM deployed on text is best suited for short strings, which is the precise context in which other metrics become less powerful because of their tokenization strategies.
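For concreteness, the calculation in question is NCD(x,y) = (C(xy) - min(C(x),C(y))) / max(C(x),C(y)), where C() is the compressed size. A minimal bash sketch using gzip (an illustration of the formula, not the Complearn implementation; file names are placeholders):

  # Compressed size of a file, in bytes
  csize() { gzip -c "$1" | wc -c; }
  cx=$(csize quarto.txt)
  cy=$(csize folio.txt)
  cxy=$(cat quarto.txt folio.txt | gzip -c | wc -c)  # C of the concatenation
  # bash can't do floating-point math, so hand the division to awk
  awk -v cx="$cx" -v cy="$cy" -v cxy="$cxy" 'BEGIN {
    min = (cx < cy) ? cx : cy; max = (cx < cy) ? cy : cx;
    printf "NCD: %.4f\n", (cxy - min) / max }'

Once the concatenation outgrows that 32KB window, gzip can no longer see the repetition between the two halves, so C(xy) approaches C(x) + C(y) and the score drifts toward 1 whatever the actual similarity.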
Figured out the problem with the large-scale Lear test script using Jaccard, and I'm now running that test (it takes a while). Following this, I'll create new versions of all of the scripts using character tokenization and Jaccard, and see how those go. That should give me a good all-round view of NCD vs SC. NCD (Complearn) uses bzlib by default, and I'm using GZip, so this may account for the differences between them.
Started early, worked late to finish XSLT transformation for tomorrow.
Started at 8.30 this morning.
The oddity I saw yesterday was a bug, and AM kindly fixed it overnight, so I've run most of the tests again using the new version, and got some interesting results. So far, I've run the three shorter tests successfully with word tokenization and ngram=1; SC now seems to be tracking USM more closely, but is still less granular on shorter strings. I need to do this again with character tokenization. The last, longer test with 5200 comparisons didn't work properly -- SC gave me 1 for every result -- so I think the script needs fixing, then running again.
AM created a new version of the SC library which can produce Jaccard Coefficients, so I've been writing new versions of my scripts to see how those numbers stack up against mine. Right now I'm still not sure how to convert them to a scale comparable with NCD; I should be able to subtract them from 1 to get Jaccard Distance, which is analogous to NCD, but some of the results are greater than 1, so the subtraction yields negative numbers.
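For reference, the textbook coefficient is J(A,B) = |A ∩ B| / |A ∪ B|, which lies in [0,1] by construction, so 1 - J does too; coefficients above 1 suggest SC is computing something slightly different from this. Here's a bash sketch of the word-token case (mine, not the SC code; file names are placeholders):

  # Lowercased word tokens, one per line, de-duplicated and sorted
  tokens() { tr '[:upper:]' '[:lower:]' < "$1" | tr -cs '[:alnum:]' '\n' | grep -v '^$' | sort -u; }
  isect=$(comm -12 <(tokens quarto.txt) <(tokens folio.txt) | wc -l)  # |A ∩ B|
  union=$(sort -u <(tokens quarto.txt) <(tokens folio.txt) | wc -l)   # |A ∪ B|
  awk -v i="$isect" -v u="$union" 'BEGIN { printf "Jaccard distance: %.4f\n", 1 - i/u }'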
There most of the day. Long trek across campus and back, twice, takes time too, but it's probably good for me in some way.
I'm helping with an unusually large DHSI course on text-encoding all this week.
There's still lots of work to do to get the presentation itself written, along with my speaking notes, but I think I'm pretty clear on what I want to say now.
MB sent me some candidate pieces of text for comparison, and for my final array of data, I decided to use a piece from the last act of Lear. I've prepared the lines (65 in the quarto, and 80 in the folio), and written what turned out to be quite a complicated bash script to do the following:
The complexity largely arose from having to learn aspects of bash scripting that I didn't know, including the oddities of arrays and variable scoping. My intention is to sort the results in descending order of USM similarity and of SC similarity, and compare the two orderings, to see whether they're both doing the job in a comparable way, and where the distinctions lie. I also plan to build timing into my bash script so I can see which is quicker, and by how much.
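One simple way to capture those timings (a sketch, not my script's actual code; it assumes GNU date, whose %N gives nanoseconds, and the comparison command is a placeholder):

  start=$(date +%s.%N)
  ./run-comparison.sh           # placeholder for the actual comparison step
  end=$(date +%s.%N)
  # bash arithmetic is integer-only, so do the subtraction in awk
  awk -v s="$start" -v e="$end" 'BEGIN { printf "elapsed: %.3f s\n", e - s }'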
This is what I've managed to get through today:
MB has just sent me some more interesting sample data from Shakespeare, which I'll also prepare, and AM has sent the chapter of his thesis dealing with ShingleCloud, which I'll need to read carefully tomorrow. He may also be able to add the Jaccard measure to ShingleCloud, which would probably give a better test of similarity than the crude averaging of needle and haystack containment that I'm currently using.
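For the record, the averaging in question works like this, as I understand it: containment is asymmetric (it divides the shared shingles by one side's total only), so I take the mean of the two directions. With made-up counts just to show the arithmetic:

  # containment(N, H) = |N ∩ H| / |N|; the averaged score is
  # ( containment(needle, haystack) + containment(haystack, needle) ) / 2
  awk -v overlap=42 -v needle=60 -v haystack=75 'BEGIN {
    printf "averaged containment: %.4f\n", (overlap/needle + overlap/haystack) / 2 }'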
I'm also going to prepare some data sets using XML markup. I'm expecting USM to perform a bit more effectively on this data than on the purely textual data, but that remains to be seen.
Today I've written the bash script I'm going to be using to create arrays of data points to compare ShingleCloud with USM. This is what it does right now:
Learned a few things I didn't know from this process. For instance, bash can't do floating-point math (call awk instead); a FODS file is much easier to create than a full-scale Open Document file (it's just a single flat XML file); and this is how you call an external application from a bash script and return immediately, without getting your shell session hijacked by errors from the external application:
soffice -calc -nologo "$FODSFILE" 2>/dev/null &