Wrote an eXist module for similarity metric comparisons
Posted by mholmes on 09 May 2013 in Activity log
I've now figured out how to create an extension module for eXist, following the instructions here. These are some things I've learned:
- The only practical way to do this is to work with your module code in the context of the eXist tree, in $EXIST_HOME/extensions/modules/src/org/exist/xquery/modules.
- You can use a non-eXist namespace -- I'm using http://hcmc.uvic.ca/ns/usm -- but it seems safest to use the eXist package structure, so my package is in org.exist.xquery.modules.unisimmetric.
- All the extension modules are built together into a single jar called exist-modules.jar. You can build this jar alone, using
build.sh extension-modules
and then drop that jar into an existing eXist instance (although if the new jar was built with a substantially different version from the rest of the code, there could well be problems).
- To turn on your module, you add a line to the conf.xml file like this:
<module uri="http://hcmc.uvic.ca/ns/usm" class="org.exist.xquery.modules.unisimmetric.UniSimMetricModule" />
along with the other modules.
I'm not yet happy with my module, and I'm still working on it. In particular, I'm not happy with the scores it's generating. I think this might have something to do with other bits that get included in the GZIP stream, such as a header; if I can figure out how big those are, I can remove them from the calculation. The highest difference score I seem to get is around 0.53, with completely dissimilar strings, so it seems as though the results are being compressed into a range much smaller than 0-1.
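One way to check the header theory is to compress the same bytes twice, once through GZIPOutputStream and once as a raw DEFLATE stream, and compare the sizes. The sketch below does that with java.util.zip only. The class name GzipOverhead and the distance function are hypothetical and are not the module's actual code: the function assumes a normalized compression distance, which may not be the formula the module uses. A GZIP stream with no optional fields carries a 10-byte header and an 8-byte trailer, so the fixed cost should come out at 18 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class GzipOverhead {

    // Compressed size in bytes when written through GZIPOutputStream
    // (this includes the GZIP header and trailer).
    public static int gzipSize(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // Compressed size in bytes of the raw DEFLATE stream only
    // (nowrap = true leaves out the wrapper header and checksum).
    public static int rawDeflateSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    private static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // ASSUMPTION, for illustration only: a normalized compression distance,
    // (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    // 'overhead' is added to every compressed size to mimic a fixed wrapper.
    public static double distance(String x, String y, int overhead) {
        int cx = rawDeflateSize(utf8(x)) + overhead;
        int cy = rawDeflateSize(utf8(y)) + overhead;
        int cxy = rawDeflateSize(utf8(x + y)) + overhead;
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        String a = "The quick brown fox jumps over the lazy dog";
        String b = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";
        int gzip = gzipSize(utf8(a));
        int raw = rawDeflateSize(utf8(a));
        System.out.println("gzip=" + gzip + " raw=" + raw
                + " overhead=" + (gzip - raw));
        System.out.println("distance without wrapper: " + distance(a, b, 0));
        System.out.println("distance with wrapper:    " + distance(a, b, gzip - raw));
    }
}
```

In a formula of this shape the fixed bytes cancel in the numerator but stay in the denominator, so short strings score lower than they should. If the module's calculation behaves similarly, subtracting the measured overhead, or compressing with a nowrap Deflater in the first place, might widen the range of scores.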