hist : detect duplicate / similarity issues
Posted by sarneil on 30 Jun 2010 in Activity log
ML sent a simple flowchart of the work process for the Canadian Legal Bibliography. Two of the boxes were labelled "detect duplicate articles (?fuzzy logic)" and "detect duplicate annotations". He wants us to modify the Zotero JS code to detect duplicates. Did some research and discussed with Martin.
Approach 1 to detect duplicates : use similarity score
If we could find a JavaScript library that takes strings and generates compressed versions, we could compare those using the similarity algorithm and set a threshold (between 0.0 and 1.0) for what counts as similar enough to be flagged as a duplicate. The problem is that we can't find a JavaScript library that takes a string as input, compresses it, and produces a string as output. We wanted a JS implementation because we figure that will be easiest to integrate with Zotero.
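For reference, a minimal sketch of what a compression-based similarity score could look like. This is an illustration under assumptions, not the algorithm we settled on: it uses normalized compression distance (NCD) as a stand-in for "the similarity algorithm", it assumes a Node.js runtime with its built-in zlib module (which is not the same thing as running inside Zotero), and the names compressedSize, similarity, isLikelyDuplicate and the threshold value are all hypothetical.

```javascript
const zlib = require('zlib');

// Size in bytes of the raw-DEFLATE-compressed form of a string.
function compressedSize(text) {
  return zlib.deflateRawSync(Buffer.from(text, 'utf8')).length;
}

// Similarity score clamped to [0.0, 1.0], computed as 1 minus the
// normalized compression distance (NCD). NCD is an assumed choice here.
function similarity(a, b) {
  const ca = compressedSize(a);
  const cb = compressedSize(b);
  const cab = compressedSize(a + b);
  const ncd = (cab - Math.min(ca, cb)) / Math.max(ca, cb);
  return Math.min(1, Math.max(0, 1 - ncd));
}

// Hypothetical threshold; a real value would have to be tuned on actual records.
const THRESHOLD = 0.6;

function isLikelyDuplicate(a, b) {
  return similarity(a, b) >= THRESHOLD;
}
```

Scores on very short strings are noisy because the compressor's fixed overhead dominates, so any threshold would need testing against real bibliography entries.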
Approach 2 to detect duplicates : check for identicals
About the simplest definition of a duplicate would be two articles with identical values in at least one of the following:
1) title (case-insensitive, ignoring whitespace),
2) Digital Object Identifier, or
3) publication, volume, issue, and page
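The three rules above could be sketched roughly as follows. This is only a sketch under assumptions: the field names (title, doi, publication, volume, issue, page) are hypothetical and not taken from Zotero's data model, "ignoring whitespace" is read here as stripping all whitespace before comparing, and empty or missing fields are never treated as a match.

```javascript
// Lowercase the title and strip all whitespace (one reading of rule 1).
function normalizeTitle(title) {
  return (title || '').toLowerCase().replace(/\s+/g, '');
}

// Trim and lowercase the DOI before comparing.
function normalizeDoi(doi) {
  return (doi || '').trim().toLowerCase();
}

// Build a comparison key from publication, volume, issue and page.
// Returns null if any of the four parts is missing or empty.
function citationKey(article) {
  const parts = [article.publication, article.volume, article.issue, article.page];
  const incomplete = parts.some(
    (p) => p === undefined || p === null || String(p).trim() === ''
  );
  if (incomplete) return null;
  return parts.map((p) => String(p).trim().toLowerCase()).join('|');
}

// True if the two articles match on at least one of the three rules.
function isDuplicate(a, b) {
  const titleA = normalizeTitle(a.title);
  if (titleA !== '' && titleA === normalizeTitle(b.title)) return true;

  const doiA = normalizeDoi(a.doi);
  if (doiA !== '' && doiA === normalizeDoi(b.doi)) return true;

  const keyA = citationKey(a);
  return keyA !== null && keyA === citationKey(b);
}
```

Because every rule is an exact comparison after normalization, this kind of check is cheap enough to run at submission time as well as in a batch job.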
Wrote to ML asking if approach 2 would be acceptable.
Also wrote asking whether he preferred that duplicate-checking happen when a contributor tries to submit an article, or as a batch job run by a site administrator from time to time. If we use approach 1 and have to use a PHP-based compression library, we'll likely be restricted to the batch option.
Also asked him how automated the deletion of duplicate articles should be, and likewise the merging of articles which differ only in their annotations, and asked him to confirm that the rest of the tasks (input and formatting of data, verification of data) are manual labour.