HCMC Journal

Monument 2023-02-06 to 2023-02-10

to: Martin Holmes
Minutes: 480

Started working out how to extract locations from the bio notes, and discovered 125 cases where the standard formulas are absent; those have been put into a spreadsheet for NA to work on.

Then on Tuesday started building and testing a function for extracting preferred location information from the standard formulas. It’s not trivial, but it can probably be done fairly reliably.
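As a rough illustration of the kind of function involved, here is a minimal Python sketch. It is not the project's actual code: the formula wording ("Preferred location: ...") and the function name are hypothetical, since the real formulas are not reproduced in this entry, and the real implementation may well live in another language.

```python
import re
from typing import Optional

# HYPOTHETICAL formula: "Preferred location: <place>" ended by "." or ";".
# The real wording of the standard formulas is not given in this entry.
FORMULA = re.compile(
    r"Preferred location:\s*(?P<place>[^.;]+?)\s*[.;]",
    re.IGNORECASE,
)

def extract_preferred_location(note: str) -> Optional[str]:
    """Return the place named by the formula, or None if the formula is absent."""
    match = FORMULA.search(note)
    if match is None:
        # No standard formula: this note needs manual attention.
        return None
    # Collapse runs of whitespace left over from badly formatted notes.
    return " ".join(match.group("place").split())
```

Returning `None` rather than guessing makes it easy to collect the notes that lack the formula into a separate list for manual work. This simple pattern would also cut off a place name containing a full stop (for example an abbreviation), which is one reason the task is not trivial.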

On Wednesday, worked with MA to narrow down ways to process the text we’re expecting to see, and wrote some XSpec to confirm it’s working as expected. I then added a debug module to generate output from all the people just to see what we get. The results are not encouraging, but they did reveal some significant systematic typos, which I should be able to fix with a search-and-replace once NA is no longer working on the personography files.

On Thursday, moved forward significantly with the extraction of place names, and got the vast majority of them working; this process involved many corrections to badly formatted notes, as well as some additional cunning in the regular expressions used to find the names. I now believe that automated extraction will handle more than 90% of cases, and we should be left with somewhere between a few dozen and a couple of hundred that need additional research.

On Friday, added a new output which lists the distinct values of placenames recovered, along with counts of people for each placename; this revealed hundreds of errors, most of which I fixed in a long debug session, leaving around a hundred still to deal with.
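The idea behind this kind of output can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the `(person_id, placename)` input shape, and the sample data are all hypothetical, and the project's real output is presumably generated by its own build code.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

def count_placenames(
    people: Iterable[Tuple[str, Optional[str]]]
) -> List[Tuple[str, int]]:
    """Tally people per distinct placename.

    `people` is a hypothetical sequence of (person_id, placename) pairs;
    a placename of None means nothing was extracted for that person.
    """
    counts = Counter(place for _, place in people if place)
    # Most frequent first, then alphabetical, so rare spellings sink to the bottom.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Invented sample data, including one deliberate typo ("Victora").
sample = [
    ("p1", "Victoria"),
    ("p2", "Victoria"),
    ("p3", "Vancouver"),
    ("p4", None),
    ("p5", "Victora"),
]
for place, n in count_placenames(sample):
    print(f"{place}\t{n}")
```

A listing like this surfaces errors quickly because a misspelled or mis-extracted placename tends to appear as a distinct value with a count of one, next to the correct form with a much higher count.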