Project X: Adjusted NCDs are working
Today I added a GUI for configuring the pre-processing steps applied to lines (normalization, whitespace collapsing, lower-casing, etc.). Then I wrote the basic back-end process for parsing and processing the lines to create the massive list of raw and adjusted similarity metrics. It takes about 30 seconds to run through 100 vs. 100 lines, which is 10,000 comparisons. Scaling up, that suggests we can do our 5,000,000-comparison Sonnet texts in about five hours, but that doesn't include encoding and saving the results as XML, so I think we'll have to at least double that, even assuming we don't run into memory issues. The app takes 15MB on startup and runs up to 22MB once it has created its list of 10,000 records; at roughly 7MB per 10,000 records, we can assume it will take something like 3.5GB to do its work for the full run. Obviously this isn't going to happen in memory, so some disk paging will take place, but it's not too bad at all. It looks practical to achieve this in one day's dedicated running of a single Windows machine.
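For anyone curious what the pipeline looks like, here is a minimal sketch in Python of the two pieces described above: the configurable pre-processing steps, and the pairwise comparison pass that builds the list of raw similarity records. The compressor-based NCD formula shown is the standard one; the project's "adjusted" metric isn't spelled out here, so this sketch computes only the raw distance, and the sample lines are placeholders.

```python
import re
import unicodedata
import zlib
from itertools import product

def preprocess(line, normalize=True, collapse_ws=True, lowercase=True):
    """The configurable pre-processing steps: Unicode normalization,
    whitespace collapsing, and lower-casing."""
    if normalize:
        line = unicodedata.normalize("NFC", line)
    if collapse_ws:
        line = re.sub(r"\s+", " ", line).strip()
    if lowercase:
        line = line.lower()
    return line

def c(data: bytes) -> int:
    """Compressed length of data: the C(x) term in the NCD formula."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Raw normalized compression distance between two pre-processed lines:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = c(bx), c(by)
    return (c(bx + by) - min(cx, cy)) / max(cx, cy)

# Placeholder texts: every line of A compared against every line of B,
# so 100 vs. 100 lines yields the 10,000 comparisons mentioned above.
lines_a = [preprocess(l) for l in ["To be, or not to be", "Shall I compare thee"]]
lines_b = [preprocess(l) for l in ["shall I compare thee to a summer's day?"]]
records = [(i, j, ncd(a, b))
           for (i, a), (j, b) in product(enumerate(lines_a), enumerate(lines_b))]
```

Because zlib's LZ77 window spots the shared substring, the near-duplicate pair comes out with a markedly lower distance than the unrelated pair, which is exactly the property the comparison pass relies on.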
Next I need to tidy up the GUI, add Recent File handling, and write the saving-to-disk code (both for the source files with their added ids and for the huge comparison file, whose format I haven't figured out yet). But we might have something usable a couple of weeks from now!
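Since the comparison file's format is still undecided, here is only a hypothetical sketch of one option: streaming records to XML as they are produced, so the multi-gigabyte result set never has to sit in memory. The element and attribute names (`comparisons`, `c`, `a`, `b`, `raw`, `adj`) are invented placeholders, not the project's actual schema.

```python
import xml.sax.saxutils as su

def write_comparisons(path, records):
    """Stream (source_id, target_id, raw, adjusted) tuples to disk as XML,
    one element per record, without building a DOM in memory."""
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n<comparisons>\n')
        for src_id, tgt_id, raw, adjusted in records:
            # quoteattr returns the value already wrapped in quotes.
            f.write(f'  <c a={su.quoteattr(src_id)} b={su.quoteattr(tgt_id)} '
                    f'raw="{raw:.4f}" adj="{adjusted:.4f}"/>\n')
        f.write('</comparisons>\n')

# Hypothetical usage with a single made-up record.
write_comparisons("comparisons.xml", [("s1", "s2", 0.8123, 0.7942)])
```

Writing incrementally like this keeps the footprint at one record at a time, which sidesteps the paging concern raised earlier.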