In the database, deleted all the records in the topics table that had no instances pointing to them (either never had any or the topic had been collapsed into another topic, e.g. ChHx into Chinese History).
Checked all the ids for the about to be deleted topics to ensure no instances pointing to them in the articles table. Found only one instance of a pointer to id 148 (originally coded as "X" by author), which I deleted from the field in the articles record.
Updated topics.txt, and sent JL and email saying if he wanted to collapse any more topics, I'm pretty confident we could do it without problem, though we are no longer able to use the text files to rebuild the db content if necessary, as they contain pointers to non-existant topic id's, though a bunch of global search and replaces would solve that.
Discovered I had not incorporated the 1896 data into the database, so did that.
Extracted text from .rtf file provided by JL.
Ran it through the process described in HowToProcessColonistFiles.txt on my Mac, with particular attention to normalizing the contractions Leona uses a lot in the transcripts, and to assigning the correct topic numbers in place of the topic codes she uses.
To import into the database, noticed that the transcript and cemetary fields were not in the order the db expected and also that the current version of mySQL requires a carriage return at the end of the last line (the previous one did not), so had to redo the upload after sorting those issues out. Settings for upload are found in sql_for_load_data.txt file on my Mac.
Tidied up the file topics.txt which contains all the topic codes and numerical ids
Created the contractions.txt file which contains all Leona's contractions and standardarized plain english substitutions.
filter by topic code, in particular second and subsequent code of records which contain more than one topic code
process data for 1896 and 1897 and upload into database - done Dec. 2007 / Jan 2007
On David's suggestion, I used a syntax like this:
SELECT * FROM articles LEFT JOIN topics ON topics.topicId = articles.topic
instead of the previous syntax:
SELECT * FROM articles,topics with a complicated AND clauses in the WHERE clause
as the COUNT query encompassing both tables exploded with the old syntax.
Currently the topics machinery presents the topics in alphabetical order and allows the user to select as many as they want (using an OR operator between each). It successfully handles one argument, but can not handle more than one argument.
What I think I'll do is use an "IS IN" operator to handle multiple arguments coming from the query interface.
Put plain-english values in the name field for each record in the topics table.
Added the following code:
Gold 151
These codes are redundant, so any future instances should be remapped as follows:
Artist 3 --> 2
Ch 31 --> 30
Grk 55 --> 53
LgHx 64 --> 63
Naturalized 81 --> 80
Postal 96 --> 93
RCNX 102 --> 101
RRHx 108 --> 107
School 111 --> 43
SkHx 113 --> 112
SVet 117 --> 115
Teacher 120 --> 43
Vet 132 --> 128
VetLetter 133 --> 135
VXL 139 --> 136
XV 150 --> 138
These codes had no instances, so I'm going to remove them from the table of topics:
BD 19 remove
C 27 remove
Death 37 remove
DM 40 remove
HuHx 57 remove
JT 61 remove
LUSH 69 remove
Misc 74 remove
RW 109 remove
X 148 remove
For now, I've preceded them with "XXX-" so they sort to the bottom of the list. I'll actually delete them when I'm sure no problems will result from doing so.
Put GUI elements in for selecting Topics (still haven't updated topic list).
Rudimentary code to take one selected value and write where clause based on that to hidden form element (similar to other filters, but for now passing in the string rather than the index int value).
Modified search.php to include that information in the SQL statement, but get a "can't find topics table" error. When I copy and paste the identical query into the phpMyAdmin sql query window, the query works.
SELECT collections.collectionName, newspapers.newspaperName, authors.authorName, topics.*, articles.* FROM collections,newspapers,authors,topics,articles WHERE articles.transcript LIKE '%teach%' AND (topics.topicName = 'Education') AND collections.collectionId = articles.collectionId AND newspapers.newspaperId = articles.newspaperId AND authors.authorId = articles.authorId AND topics.topicId = articles.topic ORDER BY pubDate ASC LIMIT 20
David suggests using a FROM clause on the articles table with an explicit LEFT JOIN argument for each of the additional tables with an ON clause equivalent to the current AND statement instead of the current FROM with all the tables and numerous AND statements.
Will try that tomorrow.
I need to:
- renumber the remaining topics
- in the database do a number of search and replaces so that the values in the "topic" field for each record are correct
- put the text for each topic into the appropriate table.
Then ready to go back to code for query and reporting on this field.
:: Next Page >>
The goal of this project is to take a collection of transcripts of new stories from early editions of the Times Colonist newspaper which are currently in text files containing special codes for various bits of information, normalize the records, put them into an SQL database and then write a querying front-end.
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
| << < | > >> | |||||
| 1 | 2 | 3 | 4 | |||
| 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| 26 | 27 | 28 | 29 | 30 | 31 | |