HCMC Journal

Monument project 2022-12-05 to 2022-12-09

to : Martin Holmes
Minutes: 435

NA is coming in on Thursday to work on the OCRed images in the hope of tracking down the half-a-dozen fragmentary files, so I’ve copied all the images and HOCR over to Poirot for her to use then.

On Tuesday, I began the complicated process of merging family records and made a lot of progress, finding and fixing several errors in the original data along the way. Many more will be found and fixed during this process, but the general sense is that we’re moving in the right direction and will be successful.

On Thursday, I ended up with a list of 1,679 potential matches between wife and husband records that could be merged. For each of these, I’ve calculated a confidence level that reflects how well the two records match. In every case, the husband’s name must appear as husband on the wife’s record, and vice versa, so we start from a fairly strong position; following that, I compare the listings of children in each record and calculate a confidence level from how closely they agree, such that if (say) two out of three children are listed identically in the two records, the confidence level is counted as 0.66.
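As a rough illustration of the children comparison described above, here is a minimal Python sketch. It is not the project's actual code: the function names, the name normalisation, the choice of the longer listing as the denominator, and the treatment of two empty listings as identical are all assumptions made for the example.

```python
def normalise(name):
    """Hypothetical normalisation: trim, collapse whitespace, lowercase."""
    return " ".join(name.split()).lower()


def children_confidence(wife_children, husband_children):
    """Fraction of children listed identically in both records.

    Assumption: the denominator is the size of the longer listing, so
    two shared names out of three gives 2/3 (the 0.66 in the text).
    """
    a = {normalise(n) for n in wife_children}
    b = {normalise(n) for n in husband_children}
    if not a and not b:
        # Assumption: two empty listings count as identical.
        return 1.0
    return len(a & b) / max(len(a), len(b))


# Invented example data, not taken from the project records.
wife = ["Anna", "Ben", "Carl"]
husband = ["Anna", "Ben", "Karl"]
confidence = children_confidence(wife, husband)
```

With these invented listings, two of the three names agree, so the result is two thirds; identical listings give 1.0.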

Of these 1,679 potential matches: 905 have confidence level 1.0, meaning their children-listings are identical. I believe we could confidently merge these records without further investigation. 554 have confidence values between 0.5 and 0.99. 220 have confidence levels below 0.5. The question is where we go from here.
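The three groups above could be tallied with something like the following sketch. The tier labels and function names are hypothetical; only the thresholds (1.0, 0.5) come from the figures in this post.

```python
from collections import Counter


def tier(confidence):
    """Assign a confidence level to one of the three groups in the post."""
    if confidence >= 1.0:
        return "identical"  # children listings match exactly
    if confidence >= 0.5:
        return "partial"    # 0.5 up to, but not including, 1.0
    return "weak"           # below 0.5


def tally(confidences):
    """Count how many potential matches fall into each tier."""
    return Counter(tier(c) for c in confidences)


# Invented confidence values for illustration only.
counts = tally([1.0, 0.66, 0.5, 0.2])
```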

My proposal to the project is:

Following that, we probably need to review the remaining records to determine how many potential matches were missed due to inaccuracies in the husband/wife names across records; there will presumably be a fair few of these, and we’ll have to figure out how to identify them (by fuzzy matching and so on). After we’ve merged all families that can be merged, we can eliminate duplicate children as well as the names of anyone deceased or any parent mentioned who doesn’t have their own record, and really get the numbers down to a realistic set.
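One possible starting point for the fuzzy matching mentioned above is Python's standard-library difflib, sketched below. No approach has been decided, so this is only an illustration: the function names, the normalisation, and the 0.85 threshold are assumptions, and a real threshold would need tuning against the data.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity ratio between two names, from 0.0 to 1.0."""
    def norm(s):
        # Hypothetical normalisation: trim, collapse whitespace, lowercase.
        return " ".join(s.split()).lower()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def likely_same_person(a, b, threshold=0.85):
    """Flag a pair of names for review; the threshold is an assumed value."""
    return name_similarity(a, b) >= threshold


# Invented names for illustration only.
close = likely_same_person("Jonh Smith", "John Smith")
distant = likely_same_person("Mary Jones", "John Smith")
```

Pairs flagged this way would still need human review, since a high similarity score does not by itself establish that two records describe the same person.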