Working on the USM
Posted by mholmes on 18 Nov 2009 in Activity log
I now have a Java class, working and tested, which does the NCD (Normalized Compression Distance) calculations. I may wrap it in a larger class that also handles all the normalizations, for clarity. The testing has revealed some interesting aspects of the USM (Universal Similarity Metric) which bear on the question "What is similarity?" or, better, what on earth could constitute an objective measure of the similarity of two pieces of text. If you compare one piece of text with another consisting of dozens of copies of the first, the similarity is very high, because the compression largely factors out the repetition. In one sense this is true (the two texts are very similar in content), but in another it's not (they're of radically different lengths).
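For anyone who wants to try the repetition effect, here is a minimal sketch of an NCD calculation. It is not the class described above: the class name NcdSketch, the method names, and the choice of java.util.zip.Deflater as the compressor are all assumptions made for illustration. It uses the usual formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. Note that NCD is a distance, so a value near 0 corresponds to high similarity.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative sketch only; names and compressor choice are assumptions.
class NcdSketch {

    // Compressed size in bytes, using DEFLATE as a stand-in compressor.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        // Only the byte count matters here, so the output is discarded.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    static double ncd(String x, String y) {
        byte[] bx = x.getBytes(StandardCharsets.UTF_8);
        byte[] by = y.getBytes(StandardCharsets.UTF_8);
        byte[] bxy = new byte[bx.length + by.length];
        System.arraycopy(bx, 0, bxy, 0, bx.length);
        System.arraycopy(by, 0, bxy, bx.length, by.length);
        int cx = compressedSize(bx);
        int cy = compressedSize(by);
        int cxy = compressedSize(bxy);
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        String text = "The testing has revealed some interesting aspects of the "
                + "metric which bear on the question of what similarity is.";
        // Build a second text consisting of dozens of copies of the first.
        StringBuilder repeated = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            repeated.append(text);
        }
        String unrelated = "Quarterly budget figures for the harbour dredging "
                + "project were filed under a separate ledger by the clerk.";
        System.out.println("text vs. 40 copies: " + ncd(text, repeated.toString()));
        System.out.println("text vs. unrelated: " + ncd(text, unrelated));
    }
}
```

With a compressor like this, the distance between the text and its 40-fold copy should come out much lower than the distance between the text and the unrelated sentence, even though the copy is 40 times as long. The exact values depend on the compressor and on the length of the inputs.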
This entry was posted by Martin and filed under Activity log.