Text-analysis on the Douglas/Newcastle correspondence
Posted by mholmes on 20 Oct 2009 in Activity log
Made significant progress in repeating the analyses done in the summer on the Lytton/Douglas correspondence, this time on the correspondence between Douglas and Newcastle, Lytton's successor. This is what I've managed to puzzle out today, with help from DH's handbook from the summer:
- The Intelligent Archive is working well as a means of producing word frequency lists. I loaded the two separate files, of Douglas's writing and Newcastle's writing, and then created a text set including them. Then I used a 5000-word block size (recommended by DH), with the options Variant spellings, Moby, OED, Filter by Corpus, and Words having highest frequency, and set the word count to 1000. (Initially I worked with smaller sets of 100 and 500 words, to make it easier to work out how to manipulate the results, but you can do everything with one output table of 1000 words.)
- Generate the word frequency table, then copy it to the clipboard and paste it into an empty spreadsheet in OOo Calc.
- Copy all the cells to the clipboard, switch to another spreadsheet, and click Edit / Paste Special, then check Transpose, so that the rows and columns are switched around (this is what Minitab needs).
- Move the first column (the text block titles) to the end, and change its heading to Text.
- Now you need to transfer this into Minitab. First, select only the data cells (not including the top row) and copy/paste them into a fresh Minitab worksheet, leaving the top label row empty.
- Now deal with the label row. Select the row in OOo, then do a find-and-replace to change all apostrophes to underscores (Minitab requires this). Then select the row, and copy/paste it into the grey top row of Minitab.
- Save your Minitab project. Then rename the worksheet (using the Minitab project manager) appropriately, and save it again.
- Now, to do a cluster analysis, in Minitab go to Stat / Multivariate / Cluster Observations.
- In the first dialog, choose the word set you want to use (C1-C100 will give you the 100 most frequent words). You can do up to about 990; 1000 will cause Minitab to choke.
- Choose Linkage: Ward, then Distance Measure: Squared Euclidean, check Standardize variables, set the number of clusters to 2 (there are 2 authors), and check Show dendrogram.
- Click on Customize, then type a title for your graph in the top box. For Case labels, click on the text box, then select the column headed Text in the list box on the left (this will be the column with your text block headings in it).
- Press OK, OK; you should see the dendrogram.
- Right-click on the x-axis labels and choose Edit X-scale, then type a very small font size for the text blocks to make them legible (I chose 5).
- File / Save graph, as whatever suits you.
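The spreadsheet steps above (transpose, move the block titles to the end under the heading Text, swap apostrophes for underscores in the labels) could also be scripted. This is only a rough sketch, not part of the workflow I actually used: it assumes the copied frequency table is tab-separated with words down the rows and text blocks across the columns, and the function name and sample values are made up for illustration.

```python
import csv
import io

def prepare_for_minitab(tsv_text):
    """Transpose a word-by-block frequency table and tidy the labels.

    Assumes (hypothetically) a tab-separated table whose first row holds
    the text block titles and whose first column holds the words.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    # Transpose: words become columns, text blocks become rows.
    transposed = [list(col) for col in zip(*rows)]
    header, data = transposed[0], transposed[1:]
    # Move the first column (block titles) to the end and head it "Text".
    header = header[1:] + ["Text"]
    data = [row[1:] + [row[0]] for row in data]
    # Replace apostrophes in the word labels with underscores.
    header = [h.replace("'", "_") for h in header]
    return header, data

# Made-up sample: two words, two text blocks.
sample = "word\tDouglas_1\tNewcastle_1\nthe\t310\t295\ndon't\t4\t0\n"
header, data = prepare_for_minitab(sample)
print(header)
for row in data:
    print(row)
```

The label row and the data rows come back separately, which matches the two-stage paste into Minitab (data cells first, then the label row).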
At 100 and 500 words, Douglas and Newcastle were completely distinct; at 990 words, there's some overlap (one block of Douglas texts is grouped with Newcastle's, although they're a distinct set within the Newcastle cluster). Even so, this demonstrates that the writers are clearly distinguishable, and makes it worth doing the Craig Zeta I plan for tomorrow.
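For anyone wanting to sanity-check the two-cluster split outside Minitab, here is a minimal sketch of the same general idea: standardize each word column, then merge clusters greedily by Ward's criterion (the increase in within-cluster sum of squares, computed from squared Euclidean distances between centroids) until two remain. It is not Minitab's implementation and may differ from it in details; the sample standard deviation, the function names, and the toy frequencies are all my assumptions.

```python
from statistics import mean, stdev

def standardize(rows):
    """Scale each column to mean 0 and (sample) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def ward_clusters(rows, k=2):
    """Merge clusters greedily by Ward's criterion until k remain."""
    # Each cluster is (member row indices, centroid).
    clusters = [([i], list(r)) for i, r in enumerate(rows)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        n = len(ma) + len(mb)
        centroid = [(len(ma) * x + len(mb) * y) / n for x, y in zip(ca, cb)]
        rest = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters = rest + [(ma + mb, centroid)]
    return [sorted(members) for members, _ in clusters]

# Toy, made-up frequencies: one row per text block, one column per word.
labels = ["D1", "D2", "N1", "N2"]
freqs = [[30, 5, 12], [28, 6, 11], [10, 20, 3], [12, 18, 4]]
groups = ward_clusters(standardize(freqs), k=2)
for members in groups:
    print([labels[i] for i in members])
```

With real data, each row would be one 5000-word block and each column one of the most frequent words; a clean author split shows up as each group containing blocks from only one writer.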