I'm involved in a SSHRC grant application whose PI is at another institution, and had neglected to complete a Research Application Summary Form and accompanying Conflict of Interest declaration (because I didn't know I was supposed to). Research accounting need these, and asked for them when they learned about the grant application. As I'm staff rather than faculty, some of the questions on the forms proved a little problematic, so I've had some back-and-forth with Res. Acc. to clarify how to complete them; they're now done, and with RS for signature, following which they'll go to the Dean. I have digital and paper copies for reference. This issue will most probably arise again, so if you're in a similar position, ask to see my copies.
Spent most of the day on my second DH 2010 submission, on the Universal Similarity Metric. In the process of writing it, I went back to work with the Delphi prototype. I do like Delphi, but it's really obvious that Java is a better platform for this particular application: the GUI doesn't need to be rich, the XPath limitations in Delphi are a bit too constraining, and in any case it would be useful to have the actual algorithm available in a jar file you could call on the command line, so it could be integrated into other apps. It also occurred to me that I should rerun some of the cluster analysis stuff I did with Douglas, Lytton and Newcastle's writing, using the Universal Similarity Metric instead of the word-based analysis, and see what kind of results appear. Might be quite revealing -- or not, but in either case, the results would bear reporting alongside the rest of the material in the paper.
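For reference, a minimal sketch of how a compression-based similarity score can be computed. It assumes the usual practical stand-in for the Universal Similarity Metric, the normalized compression distance, with zlib as the compressor. It is written in Python rather than Java or Delphi, it is not the prototype's code, and the function names are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Assumption: zlib stands in for the ideal compressor of the theory.
    Lower values mean the texts share more compressible structure.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the two texts compressed together
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A pairwise table of such distances over text blocks could then be fed to the same cluster analysis used for the word-frequency data.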
That's one off the table. Next week the Mariage/similarity metric submission...
Really struggling to meet the word limit. BH was kind enough to review the draft and discuss it with me. Needs a thorough going-over; back to work on it tomorrow morning.
Organizing data, doing background reading, filling in gaps, and redrafting some of the first draft. We're getting there slowly. The difficulty is going to be fitting it all into the word-limit, but I'll get to that on Thursday.
The analyses of Lytton vs Douglas and Newcastle vs Douglas threw up very similar patterns of distinctive words, and it occurred to me that we might actually be dealing with only two "authors", one Douglas and the other a sort of bureaucratic entity composed of many individuals having input into drafts of outgoing despatches. Today I set out to discover if Newcastle's and Lytton's writing can be distinguished based on word-frequency. Cluster and PCA analyses seem to show that, while they are much more similar than Douglas's despatches are to either of them, they're still largely distinctive, with only two outlying blocks (one Lytton, one Douglas). I've also done some more digging for useful background reading, and found three papers. Should be on target for an abstract by the end of the week.
I'm documenting this process in detail, because it took me a while to figure out; it's a long time since I first did it under DH's guidance in the workshop, and his spreadsheets have changed a little in the meantime:
- Created a combined text set for Douglas and Newcastle in the Intelligent Archive, and generated a word list of the 4,000 most frequent words, using 2,000-word blocks.
- With another text set consisting only of the Douglas texts, generated a word-frequency table with the same block size, using the same 4,000 words generated in the first step. Copied that into a spreadsheet.
- With a third text set consisting only of the Newcastle texts, did the same as in the step above, and copied that into another spreadsheet. This is the core data we need for the operation.
- Moved to Windows (Arugula), because that's where we happen to have a copy of Office 2007. Started Excel and turned on macros (they're off by default).
- Opened the CraigZeta spreadsheet and saved it with a new name.
- Deleted the Author 1, Author 2, Author 1 Ind, and Author 2 Ind data (we don't need the Ind sheets for this calculation, since we have no unattributed texts).
- Created a couple of new sheets, Graph 1 and Graph 2, and moved the visible chart from the first sheet to Graph 1. Graph 2 will hold the generated chart at the end of the process, because we'll also want to move that away from the front sheet.
- Copy/pasted the Douglas data from step 2 into the Author 1 sheet.
- Copy/pasted the Newcastle data from step 3 into the Author 2 sheet.
- Ran the CraigZeta macro (View / Macros / View macros, then select it).
- Moved the generated chart from the front sheet onto the Graph 2 sheet.
- The results are now all there, but I wanted to get the top 200 Douglas words with their scores and the top 200 Newcastle words with their scores, so I added another sheet for that, and set it up. The Douglas (Author 1) words are the first 200 in the list on the left; the Newcastle ones are those from around 3000 to 3200 (the macro re-orders the last 1,000 words so you can get them in order). Copy/paste the two sets of words into the new sheet.
- To copy/paste the scores, select and copy them, then click on the Paste down arrow to select "Paste Values", otherwise you'll just be pasting the formulas.
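For anyone without Excel 2007, the calculation the steps above feed into can be sketched in a few lines of Python. This is a sketch only: the CraigZeta macro's exact formula isn't shown here, so the code assumes the commonly described form of Zeta (the proportion of Author 1 blocks that contain a word, plus the proportion of Author 2 blocks that lack it). All function names are hypothetical.

```python
def split_into_blocks(words, block_size):
    """Cut a token list into consecutive full blocks, dropping any remainder."""
    return [words[i:i + block_size]
            for i in range(0, len(words) - block_size + 1, block_size)]


def zeta_scores(blocks_a, blocks_b, vocabulary):
    """Score each word from 0 to 2; high scores mark Author 1, low scores Author 2.

    Assumed formula: share of A blocks containing the word
    plus share of B blocks NOT containing it.
    """
    presence_a = [set(block) for block in blocks_a]
    presence_b = [set(block) for block in blocks_b]
    scores = {}
    for word in vocabulary:
        in_a = sum(word in s for s in presence_a) / len(presence_a)
        not_in_b = sum(word not in s for s in presence_b) / len(presence_b)
        scores[word] = in_a + not_in_b
    return scores


def marker_words(scores, n):
    """Return (top n Author 1 markers, top n Author 2 markers)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n], ranked[-n:][::-1]
```

With the real data, `blocks_a` and `blocks_b` would be the 2,000-word Douglas and Newcastle blocks, `vocabulary` the 4,000-word list, and `marker_words(scores, 200)` would give the two lists of 200 words copied into the extra sheet.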
This I saved as an Excel 2007 macro-enabled sheet; I also wanted to make it portable, but that seems almost impossible. If you try to save as an older Excel format, it warns you that the sheets have too much data, so you'll lose some. You can export to ODS, but if you then try to open that in OpenOffice, much of the first sheet is borked (full of data elements in the msoxl namespace!). The graphs don't survive either. So I'll need to go back and print off the graphs, or save them as images, to work with them outside MS Office 2007.
As far as the results are concerned, they look as intriguing as the original Douglas/Lytton results, and remarkably similar. I might now try some comparative stuff to see whether Lytton and Newcastle can be distinguished, and I'll see how much correspondence we have between Lytton or Newcastle and other people, to see if we might find ways of contrasting the way they write to Douglas with the way they write to others.
Working again with the Douglas/Newcastle data, I generated PCA score plots using 100, 500 and 990 words; they're virtually identical, and have the Newcastle texts grouped towards the top right, and the Douglas texts at the bottom left. Generated PNG images from the results; I found that these images, and all the PNGs I'd saved before, were very low-res and ugly, so I experimented to try to find the best output settings. It turns out that only the ugly low-res defaults have the right aspect ratio; as soon as you specify a higher DPI in the Options dialog (from the Save dialog), you get a square image, so it looks a bit odd. In the end, I think I'll have to screenshot the graphs in Minitab itself, and save the screenshots. How silly.
Made some significant progress in running the same analyses done in the summer on the Lytton/Douglas correspondence with the correspondence between Douglas and Newcastle, Lytton's successor. This is what I've managed to puzzle out today, with help from DH's handbook from the summer:
- The Intelligent Archive is working well as a means of producing word frequency lists. I loaded the two separate files, of Douglas's writing and Newcastle's writing, and then created a text set including them. Then I used a 5000-word block size (recommended by DH), Variant spellings, Moby, OED, Filter by Corpus, and Words having highest frequency, and set the word count to 1000. (Initially I worked with smaller sets of 100 and 500, to make it easier to figure out how to do the manipulation of the results, but you can do everything with one output table of 1000 words.)
- Generate the word frequency table, then copy it to the clipboard and paste it into an empty spreadsheet in OOo Calc.
- Copy all the cells to the clipboard, switch to another spreadsheet, and click Edit / Paste Special, then check Transpose, so that the rows and columns are switched around (this is what Minitab needs).
- Move the first column (the text block titles) to the end, and change its heading to Text.
- Now you need to transfer this into Minitab. Initially, you want to select only the data cells (not including the top row) and copy/paste them into a fresh worksheet, leaving the top label row empty.
- Now deal with the label row. Select the row in OOo, then do a find-and-replace to change all apostrophes to underscores (Minitab requires this). Then select the row, and copy/paste it into the grey top row of Minitab.
- Save your Minitab project. Then rename the worksheet (using the Minitab project manager) appropriately, and save it again.
- Now, to do a cluster analysis, in Minitab go to Stat / Multivariate / Cluster observations.
- In the first dialog, choose the word set you want to use (C1-C100 will give you the 100 most frequent words). You can do up to about 990; 1000 will cause Minitab to choke.
- Choose Linkage: Ward, then Distance Measure: Squared Euclidean, check Standardize variables, set a cluster size of 2 (there are 2 authors), check Show dendrogram.
- Click on Customize, then type a title for your graph in the top box. For Case labels, click on the text box, then select the column headed Text in the list box on the left (this will be the column with your text block headings in it).
- Press OK, OK; you should see the dendrogram.
- Right-click on the x-axis labels and choose Edit X-scale, then type a very small font size for the text blocks to make them legible (I chose 5).
- File / Save graph, as whatever suits you.
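The table-reshaping steps in the list above (transpose, move the block titles to the last column and head it Text, replace apostrophes in the word labels) could also be scripted. A minimal Python sketch, assuming the exported table has the block titles in its first row and the words in its first column; the function name is hypothetical.

```python
def to_minitab_layout(table):
    """Reshape a word-by-block frequency table into the layout described above.

    Assumed input: first row = block titles, first column = words.
    Output: one row per text block, one column per word, block title last.
    """
    # Switch rows and columns, as Paste Special / Transpose does.
    transposed = [list(column) for column in zip(*table)]
    # Move the first column (block titles) to the end of every row.
    reordered = [row[1:] + row[:1] for row in transposed]
    # Clean the label row: apostrophes become underscores for Minitab.
    header = [str(label).replace("'", "_") for label in reordered[0]]
    header[-1] = "Text"
    return [header] + reordered[1:]
```

The first row of the result is the label row for Minitab's grey top row; the remaining rows are the data cells.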
At 100 and 500 words, Douglas and Newcastle were completely distinct; at 990 words, there's some overlap (one block of D texts is grouped with N, although they're a distinct set within N). This does demonstrate that the writers are clearly distinguishable, and makes it worth doing the Craig Zeta I plan for tomorrow.
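The cluster settings used for these results (standardized variables, Ward linkage, two final clusters) can be sketched in plain Python. This is a textbook version of Ward's method under stated assumptions, not Minitab's implementation: it standardizes with the population standard deviation, merges the pair of clusters whose union adds least to the within-cluster sum of squares, and returns cluster membership without drawing a dendrogram. Function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column; constant columns become all zeros."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, sds)]
            for row in rows]


def ward_clusters(rows, k):
    """Merge clusters by Ward's criterion until k remain.

    Returns a list of clusters, each a sorted list of row indices.
    """
    # Each cluster is (member indices, centroid).
    clusters = [([i], list(row)) for i, row in enumerate(rows)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[i], clusters[j]
                na, nb = len(ma), len(mb)
                dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if i and j merge.
                cost = na * nb / (na + nb) * dist2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]
```

Here each row would be one text block and each column one word's frequency, so `ward_clusters(standardize(rows), 2)` asks whether the blocks fall into two groups that match the two authors.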