GRS media db
Noticed stuff like this:
“Poikileâ€
littering the data. Thought I got that in the first round of tidying.
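Strings like that usually come from UTF-8 text that was decoded as Windows-1252 somewhere along the way. Here is a minimal Python sketch, assuming that is what happened here, that tries to reverse the round trip and flags anything it can't repair. The function name and the status labels are made up for illustration.

```python
# Marker left behind when a UTF-8 curly quote is read as Windows-1252:
# the characters U+00E2 U+20AC (shown on screen as a-circumflex + euro sign).
MARKER = "\u00e2\u20ac"


def repair_mojibake(text):
    """Return (text, status), where status is 'repaired', 'needs_review' or 'clean'."""
    try:
        # Undo the wrong decode: back to the raw bytes, then read them as UTF-8.
        fixed = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip is not possible. Either the text was fine to begin
        # with, or the damage is partial (e.g. a character was dropped).
        status = "needs_review" if MARKER in text else "clean"
        return text, status
    if fixed != text:
        return fixed, "repaired"
    return text, "clean"
```

Note that the sample above appears to have lost a character from its closing quote, so a sketch like this would flag it for review rather than repair it.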
Notes regarding new schema:
Date ranges - found values like "1.5 mil BP" on Paleolithic objects.
General uncertainty is registered (using ?) in approximately 540 records.
There are 1837 locations in the data. Some of them look like this <1400 BC.>, this <-->, or this <ms. illus.>. More to the point, the location field frequently includes strings that aren't city names (country names and site names, for example). 21 locations account for 2700 objects, and only 169 locations have 10 or more objects.
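Numbers like these can be pulled from a plain list of location values. The following is a rough Python sketch, not the method actually used for the figures above. The function names are hypothetical, and the "suspicious" test is a crude assumption (digits, or no letters at all) that would catch <1400 BC.> and <--> but not <ms. illus.>.

```python
from collections import Counter


def location_summary(locations, min_objects=10, top_n=21):
    """Summarise how objects are spread across distinct location strings."""
    # Ignore empty values and surrounding whitespace.
    counts = Counter(loc.strip() for loc in locations if loc and loc.strip())
    return {
        "distinct": len(counts),
        "at_least_min": sum(1 for n in counts.values() if n >= min_objects),
        "top_n_objects": sum(n for _, n in counts.most_common(top_n)),
    }


def looks_suspicious(location):
    """Crude flag: the value contains a digit, or contains no letters at all."""
    has_digit = any(ch.isdigit() for ch in location)
    has_letter = any(ch.isalpha() for ch in location)
    return has_digit or not has_letter
```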
Sites - only about 3500 objects have the site noted. There are 284 sites in the db.
Keywords - wanted to use TAPoR tools to do a word count etc. but it gacked 4 tries out of 4, so I did it on the CLI. Out of the 12000+ individual words in the keywords field, only 780-odd occur 10 times or more, 110ish occur 50 times or more, and about 30 occur 100 times or more. I left most of the noise in (single-letter words, abbreviations like BC, and numbers), so the real counts are likely lower. The real problem is that most of these aren't actually keywords (Rome, Greece etc.), are effectively duplicates (erotic and eroticism), or are repeated in other fields in the database. I attached 2 files to this post: all keywords, and the top keywords. The numbers indicate frequency.
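For anyone who wants to reproduce this kind of count without the CLI, here is a small Python sketch. It is not the pipeline used for the attached files: the function names are hypothetical, and the tokenizing rule (lowercase, split on anything that isn't a letter or digit) is an assumption.

```python
import re
from collections import Counter


def keyword_counts(keyword_fields, drop_noise=True):
    """Count how often each word occurs across all keywords field values."""
    counts = Counter()
    for field in keyword_fields:
        # Runs of letters/digits; punctuation and underscores act as separators.
        for word in re.findall(r"[^\W_]+", field.lower()):
            # Optionally skip single-character words and pure numbers.
            if drop_noise and (len(word) == 1 or word.isdigit()):
                continue
            counts[word] += 1
    return counts


def words_at_or_above(counts, thresholds=(10, 50, 100)):
    """For each threshold, how many distinct words occur at least that often."""
    return {t: sum(1 for n in counts.values() if n >= t) for t in thresholds}
```

This does nothing about near-duplicates such as erotic and eroticism; those are still counted as separate words.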
Notes - I haven't done as much work there as I have on keywords, but I expect it to look similarly bloated. As a matter of interest, the keywords and notes fields are character-for-character duplicates in 2878 records. I found that with this handy little query:
SELECT keywords, notes
FROM `aggregated_view`
WHERE keywords = notes
AND keywords != ""
AND notes != ""
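The same check can return just the count instead of the rows. The sketch below uses Python's sqlite3 module with an in-memory stand-in for `aggregated_view`, since the real database is MySQL. The sample rows and function names are invented for illustration.

```python
import sqlite3


def count_duplicate_keywords_notes(conn):
    """Count rows where keywords and notes are identical and non-empty."""
    # Rows where either field is NULL drop out, because NULL = NULL is not true.
    row = conn.execute(
        "SELECT COUNT(*) FROM aggregated_view "
        "WHERE keywords = notes AND keywords != ''"
    ).fetchone()
    return row[0]


def build_demo_db():
    """Create a tiny in-memory table with made-up rows (not real GRS data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE aggregated_view (keywords TEXT, notes TEXT)")
    conn.executemany(
        "INSERT INTO aggregated_view VALUES (?, ?)",
        [
            ("Rome, temple", "Rome, temple"),  # exact duplicate
            ("vase", "red-figure vase"),       # different
            ("", ""),                          # both empty, excluded
            (None, None),                      # both NULL, excluded
            ("Athens", "Athens"),              # exact duplicate
        ],
    )
    return conn
```

Since keywords = notes already forces both fields to match, testing one of them for emptiness is enough.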
View - the data look similar to what is in other fields, most often keywords. This field's purpose, and how it differs from the other fields, needs to be clarified.