In the database, deleted all the records in the topics table that had no instances pointing to them (either never had any or the topic had been collapsed into another topic, e.g. ChHx into Chinese History).
Checked all the ids for the about to be deleted topics to ensure no instances pointing to them in the articles table. Found only one instance of a pointer to id 148 (originally coded as "X" by author), which I deleted from the field in the articles record.
Updated topics.txt, and sent JL and email saying if he wanted to collapse any more topics, I'm pretty confident we could do it without problem, though we are no longer able to use the text files to rebuild the db content if necessary, as they contain pointers to non-existant topic id's, though a bunch of global search and replaces would solve that.
Discovered I had not incorporated the 1896 data into the database, so did that.
Extracted text from .rtf file provided by JL.
Ran it through the process described in HowToProcessColonistFiles.txt on my Mac, with particular attention to normalizing the contractions Leona uses a lot in the transcripts, and to assigning the correct topic numbers in place of the topic codes she uses.
To import into the database, noticed that the transcript and cemetary fields were not in the order the db expected and also that the current version of mySQL requires a carriage return at the end of the last line (the previous one did not), so had to redo the upload after sorting those issues out. Settings for upload are found in sql_for_load_data.txt file on my Mac.
Tidied up the file topics.txt which contains all the topic codes and numerical ids
Created the contractions.txt file which contains all Leona's contractions and standardarized plain english substitutions.
On David's suggestion, I used a syntax like this:
SELECT * FROM articles LEFT JOIN topics ON topics.topicId = articles.topic
instead of the previous syntax:
SELECT * FROM articles,topics with a complicated AND clauses in the WHERE clause
as the COUNT query encompassing both tables exploded with the old syntax.
Currently the topics machinery presents the topics in alphabetical order and allows the user to select as many as they want (using an OR operator between each). It successfully handles one argument, but can not handle more than one argument.
What I think I'll do is use an "IS IN" operator to handle multiple arguments coming from the query interface.
Put plain-english values in the name field for each record in the topics table.
Added the following code:
Gold 151
These codes are redundant, so any future instances should be remapped as follows:
Artist 3 --> 2
Ch 31 --> 30
Grk 55 --> 53
LgHx 64 --> 63
Naturalized 81 --> 80
Postal 96 --> 93
RCNX 102 --> 101
RRHx 108 --> 107
School 111 --> 43
SkHx 113 --> 112
SVet 117 --> 115
Teacher 120 --> 43
Vet 132 --> 128
VetLetter 133 --> 135
VXL 139 --> 136
XV 150 --> 138
These codes had no instances, so I'm going to remove them from the table of topics:
BD 19 remove
C 27 remove
Death 37 remove
DM 40 remove
HuHx 57 remove
JT 61 remove
LUSH 69 remove
Misc 74 remove
RW 109 remove
X 148 remove
For now, I've preceded them with "XXX-" so they sort to the bottom of the list. I'll actually delete them when I'm sure no problems will result from doing so.
Put GUI elements in for selecting Topics (still haven't updated topic list).
Rudimentary code to take one selected value and write where clause based on that to hidden form element (similar to other filters, but for now passing in the string rather than the index int value).
Modified search.php to include that information in the SQL statement, but get a "can't find topics table" error. When I copy and paste the identical query into the phpMyAdmin sql query window, the query works.
SELECT collections.collectionName, newspapers.newspaperName, authors.authorName, topics.*, articles.* FROM collections,newspapers,authors,topics,articles WHERE articles.transcript LIKE '%teach%' AND (topics.topicName = 'Education') AND collections.collectionId = articles.collectionId AND newspapers.newspaperId = articles.newspaperId AND authors.authorId = articles.authorId AND topics.topicId = articles.topic ORDER BY pubDate ASC LIMIT 20
David suggests using a FROM clause on the articles table with an explicit LEFT JOIN argument for each of the additional tables with an ON clause equivalent to the current AND statement instead of the current FROM with all the tables and numerous AND statements.
Will try that tomorrow.
Dealt with two queries on Newspaper db:
1) a number of the transcripts (e.g. search for modern California bungalow) contain data like this: "modern California bungalow; Apr 25, 10 - 29 buildings ordered destroyed by city... full list of bldg/owners" the question is what the "10" means (year or page number). In any case, almost certainly the fix will be manual editing of the data, as the records contain a publication date in standard format at the start of the record which is successfully parsed and entered in the pubDate field for that record.
2) someone asking about ability to search for variant spellings (for names). I confirmed current capabilities and asked what further capabilities and interface person has in mind.
:: Next Page >>
The goal of this project is to take a collection of transcripts of new stories from early editions of the Times Colonist newspaper which are currently in text files containing special codes for various bits of information, normalize the records, put them into an SQL database and then write a querying front-end.
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
| << < | > >> | |||||
| 1 | 2 | 3 | 4 | |||
| 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| 26 | 27 | 28 | 29 | 30 | 31 | |