More work on the similarity metric
Posted by mholmes on 01 Mar 2010 in Activity log
Having implemented the similarity metric in C++ under Qt, I'm now experimenting with the results and comparing them with those from my Java implementation on the same data. There are some interesting issues:
- I get noticeably different results between the Java (GZip) implementation and the C++ (zlib) version. Differences are on the order of 0.12 in some cases (12% of the range). This is both intriguing and worrying, although it may not matter if the values are only ever used for relative comparisons.
- The order in which the two strings are concatenated during the calculation affects the score by about 0.02 (2% of the range). This is interesting. Right now, my object calculates the score for both orderings and averages the two, but it may be more "correct" (whatever that means) to take the larger or the smaller of the two values in each case. I'll have to do some thinking about this.
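To make the two issues above concrete, here is a minimal Python sketch. It is not either of the implementations discussed here. It assumes the metric is something like a normalized compression distance, which is a guess: the formula is not stated in this entry. The names `ncd` and `symmetric_scores` are hypothetical, and the sample strings are made up. The sketch scores both concatenation orders, shows the three candidate ways of combining them (mean, smaller, larger), and lets the compressor be swapped between zlib and gzip. One possible contributor to differences between the two is container overhead: a gzip stream carries at least 18 bytes of header and trailer, against 6 for zlib, which weighs more heavily on short strings.

```python
import gzip
import zlib


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Compression distance for the concatenation x + y (in that order).

    Assumed formula (not taken from the post):
        (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    where C(s) is the compressed length of s in bytes.
    """
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def symmetric_scores(x: bytes, y: bytes, compress=zlib.compress) -> dict:
    """Score both concatenation orders and combine them three ways."""
    xy = ncd(x, y, compress)
    yx = ncd(y, x, compress)
    return {
        "xy": xy,
        "yx": yx,
        "mean": (xy + yx) / 2,  # the averaging approach described above
        "min": min(xy, yx),     # alternative: take the smaller value
        "max": max(xy, yx),     # alternative: take the larger value
    }


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = bytes(range(256)) * 4

    for name, compress in (("zlib", zlib.compress), ("gzip", gzip.compress)):
        print(name, symmetric_scores(a, b, compress))
```

Running this prints one dictionary per compressor, so the order effect (`xy` versus `yx`) and the compressor effect can be compared side by side on the same input.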
This entry was posted by Martin and filed under Activity log.