Archives for: June 2010

30/06/10

Permalink 04:17:55 pm, by mholmes, 92 words, 85 views   English (CA)
Categories: Activity log; Mins. worked: 240

Almost finished the presentation...

Lots of rethinking being done as I try to cast the text of the presentation into graphical format; I've ended up regenerating a bunch of the graphs to include two types of data on the same graph, for clarity and efficiency.

Also, while formatting the presentation, we learned that to make centring of an XHTML block element work reliably through CSS, margin-left: auto; margin-right: auto; is not always enough. When the element is position: absolute, you also have to include left: 0; right: 0; (or some fixed values).

Permalink 11:13:44 am, by mholmes, 17 words, 71 views   English (CA)
Categories: Activity log; Mins. worked: 15

Travel prep for London

Printed out various maps for finding the accommodation and the venues for the DH 2010 conference next week.

29/06/10

Permalink 02:35:59 pm, by mholmes, 121 words, 156 views   English (CA)
Categories: Activity log; Mins. worked: 240

Presentation code and theme updates

With GN's help, got the presentation code set up with our HCMC theme, using pure SVG for the logo, and coded the first three sections of the presentation (which are fairly straightforward). A Firefox 4 beta came out today, so we downloaded and tested it, and discovered that its JS engine is now on a par with Chrome's. That means we'll be able to use Firefox even for large presentations using this codebase in the future; version 3.6 was really unable to cope with the Coldesp presentation, so this is a good thing. It's especially welcome because it turns out that Chrome's SVG support leaves a lot to be desired: it seems to fail to handle the fill-opacity setting in text and tspan elements.

28/06/10

Permalink 11:40:54 am, by mholmes, 71 words, 81 views   English (CA)
Categories: Activity log; Mins. worked: 240

Finished writing text of talk

I've finished the first draft of the text of my talk, and in the process created all of the graphics I need to use. I might turn this into a word-processor document at some stage (it's plain text right now), and make a PDF of it available, but first I'll build the presentation from it. It's probably too long, but there will be parts I can skip if time is short.

17/06/10

Permalink 03:28:19 pm, by mholmes, 101 words, 87 views   English (CA)
Categories: Activity log; Mins. worked: 240

Writing the text of my talk

I'm writing the actual prose text of my talk for this presentation, because I want to see how long it turns out to be (doubtless too long), and also to let the argument drive the presentation rather than the presentation materials driving the argument. I've done about two thirds of it today, and also begun converting some of the graphs into graphics I can use in the presentation. I may end up dropping a lot of content relating to the use of containment metrics in SC, because the Jaccard Distance calculation is more appropriate and relevant. That'll simplify things a bit.

15/06/10

Permalink 03:38:09 pm, by mholmes, 105 words, 82 views   English (CA)
Categories: Activity log; Mins. worked: 120

Lear data complete

I have a complete set of the tests based on the Lear data, and I've rewritten the XSLT to render it more usefully into a set of tables (over 100 pages of them at this point). I haven't been able to look too closely at it yet, but it seems to me that normalizing the data before running the USM comparison gives slightly better results (meaning a larger number of what look like intuitively useful matches before the weird ones start), but USM still doesn't achieve results as good as ShingleCloud. That said, USM is still faster, and still gets reasonably good results without normalization.

Permalink 11:22:49 am, by mholmes, 83 words, 66 views   English (CA)
Categories: Activity log; Mins. worked: 60

Expanded Lear test script to include more measures

I've now expanded my Lear test script so that it includes all three types of ShingleCloud metric (containment averaged, Jaccard with word tokenization, and Jaccard with character tokenization), NCD, and two types of output from my USM library (text normalized and not normalized). That's running now. So far, it looks as though the normalization in USM adds very little to the execution time, so it still looks faster than SC; it remains to be seen whether it finds more candidate matches or not.

14/06/10

Permalink 03:42:32 pm, by mholmes, 191 words, 152 views   English (CA)
Categories: Activity log; Mins. worked: 120

Adding case normalization and punctuation stripping to my USM library

I've added case normalization and punctuation stripping to my USM library, on the basis that ShingleCloud basically does both of these things, so the comparison between them is not really level unless this is done. The result will most likely be to even up the runtimes, because USM will run more slowly, but presumably it will improve the "success" of USM when compared to SC, which is currently doing better (based on my own human judgement of successful matches). I'll now have to re-run all the tests with this new measure, but I think what I'll do is merge all of the different variants for the Lear texts into a single operation, which will take a long time to run and process, but will perhaps give more useful output when it's done.
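As a rough illustration of the kind of normalization described above, here is a minimal Python sketch. It is not the code from my USM library: the `normalize` helper is hypothetical, and it assumes that "punctuation" means ASCII punctuation only and that runs of whitespace can safely be collapsed.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(text: str) -> str:
    """Lower-case the text, strip ASCII punctuation, collapse whitespace.

    Hypothetical helper for illustration; the real library's rules may differ
    (for example, in how it treats apostrophes or non-ASCII punctuation).
    """
    stripped = text.lower().translate(_STRIP_PUNCT)
    return " ".join(stripped.split())

if __name__ == "__main__":
    print(normalize("Hello,   World! It's fine."))
```

Running both inputs through the same function before comparison is what puts the two metrics on a level footing; the cost is one extra pass over each string.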

I've also done a fair amount of examination of the existing Lear results, and it does seem that without normalization, even SC needle/haystack calculations result in a longer run of intuitively "good" matches before spurious ones begin to appear. USM/NCD manage about 60 matches before we start to see oddities, but SC gets into the 70s.

Permalink 11:16:57 am, by mholmes, 175 words, 77 views   English (CA)
Categories: Activity log; Mins. worked: 30

SC versus NCD/USM, more reading + testing again

Found a good article on the limitations of NCD which result from the block window size of the various compressors one might use to do the NCD calculation; this points out that if the length of the concatenation of the input strings exceeds the block window size, then NCD is basically useless. The limit for bzip2 (in default configuration) is 900KB, and for GZip, which I'm using, it's 32KB. This underlines my conclusions that USM deployed on text is best suited for short strings, which is the precise context in which other metrics become less powerful because of their tokenization strategies.
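To make the window limit concrete, here is a minimal Python sketch of an NCD calculation using the standard formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). It uses Python's zlib module, which implements the same DEFLATE algorithm as GZip but with different framing bytes, so the scores will not exactly match those from my own library or from CompLearn. The function names and the decision to raise an error when the concatenation exceeds 32KB are assumptions made for this illustration.

```python
import zlib

# DEFLATE can only look back 32KB, so matches further apart are invisible to it.
WINDOW_BYTES = 32 * 1024

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))

def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance between two strings.

    Raises ValueError when the concatenated input is larger than the
    compressor window (an illustrative guard, not part of NCD itself).
    """
    x, y = a.encode("utf-8"), b.encode("utf-8")
    if len(x) + len(y) > WINDOW_BYTES:
        raise ValueError("concatenated input exceeds the 32KB window")
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    s = "the quick brown fox jumps over the lazy dog"
    t = "pack my box with five dozen liquor jugs"
    print(ncd(s, s), ncd(s, t))
```

Identical strings score close to 0 (not exactly 0, because the compressor still spends a few bytes on the repeat), and unrelated strings score close to 1.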

Figured out the problem with the large-scale Lear test script using Jaccard, and I'm now running that test (it takes a while). Following this, I'll create new versions of all of the scripts using character tokenization and Jaccard, and see how those go. That should give me a good all-round view of NCD vs SC. NCD (CompLearn) uses bzlib by default, and I'm using GZip, so this may account for the differences between them.

10/06/10

Permalink 06:11:02 pm, by mholmes, 10 words, 50 views   English (CA)
Categories: Activity log; Mins. worked: 360

DHSI course

Started early, worked late to finish XSLT transformation for tomorrow.

09/06/10

Permalink 04:46:59 pm, by mholmes, 4 words, 48 views   English (CA)
Categories: Activity log; Mins. worked: 390

DHSI course

Started at 8.30 this morning.

Permalink 08:11:50 am, by mholmes, 98 words, 72 views   English (CA)
Categories: Activity log; Mins. worked: 45

New version of ShingleCloud -- processing tests again

The oddity I saw yesterday was a bug, and AM kindly fixed it overnight, so I've run most of the tests again using the new version, and got some interesting results. So far, I've run the three shorter tests successfully with word tokenization and ngram=1; SC now seems to be tracking USM more closely, but is still less granular on shorter strings. I need to do this again with character tokenization. The last, longer test with 5200 comparisons didn't work properly -- SC gave me 1 for every result -- so I think the script needs fixing, then running again.

08/06/10

Permalink 05:43:36 pm, by mholmes, 79 words, 50 views   English (CA)
Categories: Activity log; Mins. worked: 60

DH: looking at Jaccard Coefficient from SC

AM created a new version of the SC library which has the capability of doing Jaccard Coefficients, so I've been writing new versions of my scripts to see how those numbers stack up against mine. Right now I'm still not sure how to convert them to a scale comparable with NCD; I should be subtracting them from 1 to get Jaccard Distance, which is analogous to NCD, but some of the results are greater than 1, so they result in negative numbers.
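For reference, the textbook relationship between the two measures can be sketched in a few lines of Python. This is not ShingleCloud's implementation: the function names are hypothetical, and the tokenization (whitespace-separated words, or single characters) is an assumption for the example. With a set-based definition like this one, the coefficient always falls between 0 and 1, so the distance can never be negative.

```python
def shingles(tokens, n=1):
    """Set of n-token shingles drawn from a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_coefficient(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a: set, b: set) -> float:
    """1 minus the coefficient, so 0 means identical and 1 means disjoint."""
    return 1.0 - jaccard_coefficient(a, b)

if __name__ == "__main__":
    first, second = "the cat sat", "the cat ran"
    # Word tokenization with n=1: two shared words out of four distinct ones.
    print(jaccard_distance(shingles(first.split()), shingles(second.split())))
    # Character tokenization of the same pair.
    print(jaccard_distance(shingles(list(first)), shingles(list(second))))
```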

Permalink 05:40:25 pm, by mholmes, 24 words, 42 views   English (CA)
Categories: Activity log; Mins. worked: 330

DHSI course

There most of the day. Long trek across campus and back, twice, takes time too, but it's probably good for me in some way.

07/06/10

Permalink 06:03:28 pm, by mholmes, 13 words, 35 views   English (CA)
Categories: Activity log; Mins. worked: 330

Helping to teach DHSI course

I'm helping with an unusually large DHSI course on text-encoding all this week.

04/06/10

Permalink 02:22:28 pm, by mholmes, 93 words, 61 views   English (CA)
Categories: Activity log; Mins. worked: 240

Final day working on data for DH

  • Tweaked all my scripts and re-ran all of them.
  • Added scores from the command-line compiled NCD utility from CompLearn in addition to my USM.
  • Wrote XSLT to convert Lear result set into useful tables.
  • Rebuilt all the graphs in my ODS files.
  • Found what seems like a bug in SC, and wrote to SM with details.
  • Wrote the outline of my presentation.

There's still lots of work to do to get the presentation itself written, along with my speaking notes, but I think I'm pretty clear on what I want to say now.

03/06/10

Permalink 02:48:53 pm, by mholmes, 212 words, 50 views   English (CA)
Categories: Activity log; Mins. worked: 240

More work on similarity

MB sent me some candidate pieces of text for comparison, and for my final array of data, I decided to use a piece from the last act of Lear. I've prepared the lines (65 in the quarto, and 80 in the folio), and written what turned out to be quite a complicated bash script to do the following:

  • Starts writing the output as an XML file.
  • Reads the lines of each source into a pair of arrays.
  • Writes out the lines with suitable numbering as XML, so that they're available in the output file (which I intend to process using XSLT).
  • Compares each line in each array to every line in the other array (5200 comparisons), using both USM and ShingleCloud.
  • Writes out the results to the XML file.

The complexity largely arose from having to learn aspects of bash scripting that I didn't know, including the oddities of using arrays, and variable scoping. My intention is to sort the results according to descending order of USM similarity, and SC similarity, and compare the order, to see if they're both doing the job in a comparable way, and to see where the distinctions lie. I also plan to build timing into my bash script so I can see which is quicker, and by how much.
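The shape of that all-pairs comparison is easier to see outside bash. The Python sketch below is an illustration only, not a port of my script: `compare_all` and `word_overlap` are hypothetical names, and `word_overlap` is a stand-in similarity function, since the real script calls out to USM and ShingleCloud.

```python
import time
from itertools import product

def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity: shared distinct words over all distinct words."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def compare_all(quarto, folio, similarity):
    """Score every quarto line against every folio line.

    Returns (results, elapsed_seconds); results holds tuples of
    (quarto line number, folio line number, score), best match first.
    """
    start = time.perf_counter()
    results = [
        (q_no, f_no, similarity(q, f))
        for (q_no, q), (f_no, f) in product(
            enumerate(quarto, 1), enumerate(folio, 1)
        )
    ]
    elapsed = time.perf_counter() - start
    # sort() is stable, so equal scores keep their original line order.
    results.sort(key=lambda row: row[2], reverse=True)
    return results, elapsed

if __name__ == "__main__":
    ranked, seconds = compare_all(["a b", "c d"], ["a b", "x y", "c z"], word_overlap)
    for row in ranked:
        print(row)
    print("elapsed seconds:", seconds)
```

With 65 lines on one side and 80 on the other, the list holds 65 × 80 = 5200 rows. Running the same function once per metric gives two ranked lists and two timings to compare.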

02/06/10

Permalink 03:06:50 pm, by mholmes, 220 words, 36 views   English (CA)
Categories: Activity log; Mins. worked: 300

More progress again in preparing for DH

This is what I've managed to get through today:

  • Prepared an input data set consisting of "long" passages (speeches from Hamlet, Quarto 1 vs Folio 1, running from a few lines to a few dozen), and created a batch script that processes these, which are read from separate files as opposed to a single input file.
  • Created a data set from lines in two versions of one speech (Quarto 2 vs Folio 1), and a script to process that data set.
  • Created a third data set which consists of a variety of specific mutations and combinations of mutations of one input sentence, to demonstrate the situations in which the two similarity metrics throw up similar and different results.

MB has just sent me some more interesting sample data from Shakespeare, which I'll also prepare, and AM has sent the chapter of his thesis dealing with ShingleCloud, which I'll need to read carefully tomorrow. He may also be able to add the Jaccard measure to ShingleCloud, which would probably give a better test of similarity than the crude averaging of needle and haystack containment that I'm currently using.

I'm also going to prepare some data sets using XML markup. I'm expecting to see that the USM performs a bit more effectively on this data than the pure textual data, but that remains to be seen.

01/06/10

Permalink 02:59:36 pm, by mholmes, 311 words, 48 views   English (CA)
Categories: Activity log; Mins. worked: 300

More progress in preparing for DH

Today I've written the bash script I'm going to be using to create arrays of data points to compare ShingleCloud with USM. This is what it does right now:

  • Reads some template files from disk, to assist in building two kinds of output: an XHTML page with tabular data, and a spreadsheet in FODS (Flat Open Document Spreadsheet) format, for use in OOo Calc.
  • Reads the test data in from a text file. Right now, the file contains all the test data, with one pair of text items on each line, separated by a slash. This may be quite sufficient, since I don't anticipate running tests with items larger than paragraphs, but it would also be feasible to read in a list of filenames from which to retrieve larger bits of data.
  • Calls USM on each bit of the test data and retrieves the scores.
  • Calls ShingleCloud twice, once with --containmentneedle and once with --containmenthaystack, and then averages their scores. With ShingleCloud, these two scores can be radically different, so averaging them is one crude approach to creating a single similarity measure. I've written to AM to ask if he uses anything more sophisticated in TEIComparator.
  • Creates an XHTML document showing the input data and the results in a table.
  • Creates a FODS document with the data in a table.
  • Opens the XHTML file in Firefox, and the FODS file in OOo Calc.
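To show why the two containment scores can diverge so much, and what averaging them does, here is a minimal Python sketch. It assumes that containment means the proportion of one text's distinct words that also occur in the other; ShingleCloud's own calculation may well differ, and the function names here are hypothetical. The input line follows the slash-separated format of the test data file.

```python
def word_set(text: str) -> set:
    """Distinct lower-cased, whitespace-separated words in the text."""
    return set(text.lower().split())

def containment(part: str, whole: str) -> float:
    """Fraction of the distinct words in part that also occur in whole."""
    part_words, whole_words = word_set(part), word_set(whole)
    if not part_words:
        return 0.0
    return len(part_words & whole_words) / len(part_words)

def averaged_containment(needle: str, haystack: str) -> float:
    """Crude single score: the mean of the two directional containments."""
    needle_score = containment(needle, haystack)
    haystack_score = containment(haystack, needle)
    return (needle_score + haystack_score) / 2

if __name__ == "__main__":
    # One pair of text items per line, separated by a slash.
    line = "the cat/the cat sat on the mat"
    needle, haystack = line.split("/", 1)
    # The needle is fully contained (1.0); the haystack only partly (0.4).
    print(containment(needle, haystack), containment(haystack, needle))
    print(averaged_containment(needle, haystack))
```

A short needle found whole inside a long haystack scores 1.0 in one direction and a low value in the other, so the average lands somewhere in between, which is why it is only a crude summary.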

Learned a few things I didn't know out of this process. For instance, bash can't do floating-point math (call awk instead); a FODS file is much easier to create than a full-scale Open Document file (it's just a single flat XML file); and this is how you call an external application from a bash script and return immediately, without getting your shell session hijacked by errors from the external application: `soffice -calc -nologo "$FODSFILE" 2>/dev/null &`

Academic

This blog is for academic tasks undertaken by HCMC staff, such as writing articles for publication, reviewing abstracts for conferences, preparing and teaching classes, or proofing other people's documents.
