Project X: great strides
After some work on implementing the shifting of huge numbers in a HugeInt type (during which I learned some useful things), I went back to the original equation and realized that we'd misinterpreted it: it's not the compressed stream that's treated as a number, but the length of the compressed stream. That makes the whole calculation easier, and the numbers perfectly manageable in normal integer and extended float types.
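For reference, the standard NCD formula (the one CompLearn implements) works entirely with compressed lengths:

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

where C(s) is the length in bytes of the compressed form of s, and xy is the concatenation of the two strings.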
Armed with this, I successfully implemented the NCD calculation in Delphi (yay), using ZLib. I tested it against the output of the CompLearn NCD, and the results match perfectly. The only oddity was the identity case, where CompLearn consistently gives back exactly zero, whereas my code was giving back very low, but still non-zero, numbers. After puzzling over this (I'm having a slow brain day), it eventually struck me that they're simply trapping the identity case and returning zero. I can do that too, so now I do.
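For illustration, here's a minimal sketch of that calculation, assuming a recent Delphi and the TBytes overload of ZCompress in the System.ZLib unit; the names and the exact shape of the identity shortcut are mine, not the project code:

    uses
      System.SysUtils, System.Math, System.ZLib;

    // Compressed length in bytes, using ZLib at its default level.
    // Assumes the TBytes overload of ZCompress in System.ZLib.
    function CompressedSize(const Data: TBytes): Integer;
    var
      OutBuf: TBytes;
    begin
      ZCompress(Data, OutBuf);
      Result := Length(OutBuf);
    end;

    // NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    // Identical inputs are trapped and return exactly 0, as CompLearn does.
    function NCD(const X, Y: TBytes): Double;
    var
      XY: TBytes;
      CX, CY, CXY: Integer;
    begin
      if (Length(X) = Length(Y)) and
         ((Length(X) = 0) or CompareMem(@X[0], @Y[0], Length(X))) then
        Exit(0.0);

      // Concatenate x and y for the joint compression.
      SetLength(XY, Length(X) + Length(Y));
      if Length(X) > 0 then
        Move(X[0], XY[0], Length(X));
      if Length(Y) > 0 then
        Move(Y[0], XY[Length(X)], Length(Y));

      CX  := CompressedSize(X);
      CY  := CompressedSize(Y);
      CXY := CompressedSize(XY);
      Result := (CXY - Min(CX, CY)) / Max(CX, CY);
    end;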
The actual calculation is only part of the battle. What we have, potentially, for (say) the Sonnet, is about five million calculations, and from these, we need to create adjusted values based on our context-based algorithm. This means we need small, fast structures to store the data. I elected to do this with a Delphi TList (fast and simple) containing records (fast, small, and easy to save and load from disk in binary format). I built the basic structures, using 20-char shortstrings for the id values -- this is not ideal (and not Unicode), but in practical terms it's perfectly sufficient and it's also fast and small.
Each record now has the id of its string from text 1 and the id from text 2. It also has the ids of the preceding and succeeding strings (= lines, mostly) in text 1, and of the preceding and succeeding lines in text 2. The reason for needing these will be clear shortly.
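A sketch of what such a record and its list might look like; the field names are mine, and the two value fields are an assumption about what gets stored alongside the ids:

    uses
      System.Classes, System.SysUtils;

    type
      TLineId = string[20];       // 20-char ShortString id, as described above

      PPairRec = ^TPairRec;
      TPairRec = packed record
        Id1, Id2: TLineId;        // ids of the paired lines in text 1 and text 2
        Prev1, Next1: TLineId;    // preceding / succeeding line in text 1
        Prev2, Next2: TLineId;    // preceding / succeeding line in text 2
        NCDValue: Double;         // raw NCD for this pair (assumed field)
        AdjValue: Double;         // context-adjusted value, filled later (assumed field)
      end;

    // Records are allocated with New(P), filled in, and stored via Pairs.Add(P);
    // each one needs a matching Dispose(P) when the list is cleared.
    // Because the record is fixed-size and packed, the whole list can be
    // streamed straight to disk in binary; loading is the mirror image
    // (read SizeOf(TPairRec) bytes at a time into freshly allocated records).
    procedure SavePairs(Pairs: TList; const FileName: string);
    var
      FS: TFileStream;
      I: Integer;
    begin
      FS := TFileStream.Create(FileName, fmCreate);
      try
        for I := 0 to Pairs.Count - 1 do
          FS.WriteBuffer(PPairRec(Pairs[I])^, SizeOf(TPairRec));
      finally
        FS.Free;
      end;
    end;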
When we come to calculate the adjusted value, we need to reference the NCD calculations for the preceding two lines and the succeeding two lines. Since the original NCD can pair ANY line from text 1 against ANY line from text 2, it will be rare that the preceding and succeeding records we need are even close to the one we're dealing with, let alone next to it. To handle this, I've written lookup functions. The idea is that when the first NCD value is calculated for any two lines, the calling code will already know the ids of the preceding and succeeding lines in each text, because it's working through the lines in order, so it can store those ids in the record when it creates it. Later, those ids can be used to find the related pairing records preceding and following this one, so that the adjusted calculation can be done. That's all now implemented, although nothing has been tested as yet -- testing is for tomorrow.
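The lookup itself could be as simple as the sketch below -- a linear scan, shown for clarity; the actual functions may well be smarter (a hash keyed on the two ids would matter at five million records). It also assumes the "preceding" pairing for a record is the one matching (Prev1, Prev2), and the "succeeding" one matches (Next1, Next2):

    // Hypothetical lookup: find the pairing record for a given pair of ids.
    // Given a record R, its predecessor pairing would then be
    // FindPair(Pairs, R.Prev1, R.Prev2) and its successor pairing
    // FindPair(Pairs, R.Next1, R.Next2).
    function FindPair(Pairs: TList; const AId1, AId2: TLineId): PPairRec;
    var
      I: Integer;
      P: PPairRec;
    begin
      Result := nil;
      for I := 0 to Pairs.Count - 1 do
      begin
        P := PPairRec(Pairs[I]);
        if (P^.Id1 = AId1) and (P^.Id2 = AId2) then
          Exit(P);
      end;
    end;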
All this is predicated on a single list object holding all the records. This seemed impractical at first for a large set, but when I did the math it turned out that we'd be using about 750MB of memory for 5,000,000 comparisons; this will actually work OK, I think, as long as the machine isn't doing anything else (and it won't be, when we come to run the whole calculation for something like the Sonnet). If it does prove to be a problem, I might be able to find some way of paging the data out to disk, but I hope I don't have to; in-memory lookups will be far quicker.
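A rough sanity check on that figure, assuming a record along the lines of the sketch above: six string[20] fields at 21 bytes each come to 126 bytes, and two Doubles add 16, so 142 bytes per packed record; add the TList's pointer per entry plus a little heap overhead per allocation, and you land right around the 150-odd bytes per comparison that 750MB over 5,000,000 records implies.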