CCAP: Cumberland Archives db import

Posted by Martin on 11 Aug 2015 in Activity log

I've just done an import of the Archives database, which the museum folks believed had not been successfully imported in the original process. This added some records to the database, although they won't actually show up until the sysadmin runs the re-index.

The import was a bit of a nightmare, for the following reasons:

The data was full of typos. Instead of "Textual record", there was "Textual Record" (the import is case-sensitive and refuses to proceed when this sort of thing is wrong). There were also simple spelling errors such as "Philetalic Record". I caught a lot of these ahead of time, but inevitably (since this is CSV, not XML with a nice tight schema) you can't catch everything, which means you have to keep retrying problem records.
There were 68 duplicates of items already in the database. For instance, CUMB_982.020.019 was in the spreadsheet, but it's already in the database.
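
Both kinds of problem could in principle be flagged before an import is attempted. What follows is only a rough sketch of such a pre-check, not the import tool's own validation: the column names ("id", "type"), the allowed-type list (only "Textual record" is confirmed above; "Philatelic record" is my guess at the intended spelling), and the CUMB_X.* identifiers are all made up for illustration. It assumes you can get a list of the identifiers already in the database.

```python
import csv
import io

# Assumed controlled vocabulary. Only "Textual record" is quoted in this post;
# "Philatelic record" is a guess, and the real list would be longer.
ALLOWED_TYPES = {"Textual record", "Philatelic record"}


def precheck(csv_text, allowed_types, existing_ids, id_col="id", type_col="type"):
    """Return (row_number, record_id, problem) tuples for rows likely to fail.

    The comparison against allowed_types is deliberately case-sensitive,
    mirroring the behaviour of the import described above.
    """
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 of the file is the header, so data rows start at 2.
    for row_num, row in enumerate(reader, start=2):
        rec_id = (row.get(id_col) or "").strip()
        rec_type = (row.get(type_col) or "").strip()
        if rec_type not in allowed_types:
            problems.append((row_num, rec_id, f"unknown type: {rec_type!r}"))
        if rec_id in existing_ids:
            problems.append((row_num, rec_id, "already in database"))
        elif rec_id in seen:
            problems.append((row_num, rec_id, "duplicated within file"))
        seen.add(rec_id)
    return problems


# Hypothetical sample data: only CUMB_982.020.019 is a real identifier.
sample = """id,type
CUMB_982.020.019,Textual record
CUMB_X.001,Textual Record
CUMB_X.002,Philetalic Record
"""
existing = {"CUMB_982.020.019"}

for row_num, rec_id, problem in precheck(sample, ALLOWED_TYPES, existing):
    print(f"row {row_num}: {rec_id}: {problem}")
```

A report like this only narrows things down; it would not catch errors in fields it doesn't know about, so some retrying of problem records would still be needed.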

So I think one of two things may have happened: among the four original databases we imported, perhaps some records are in two of them; OR, more likely, when I ran the original import, I did include the Archives database, but many of the items were not imported because they had typos or other errors in them, which resulted in the museum folks thinking it hadn't been imported.

I don't think we should continue to attempt these CSV imports for Cumberland data; the original source data is too messy -- everything has to be perfect for the record to be imported successfully -- and at this stage, I have no idea which records are in there and which aren't, since we've had multiple attempts with many errors and re-tries. I'm going to move on and focus on the images for the moment.

This entry was posted by Martin and filed under Activity log.