Category: "Activity log"
Before making the recent changes to the VIHistory site, I realized my dev, svn repo backup, and production instances were out of sync, so I spent a couple of hours sorting all that out.
Also, due to recent security problems on some Apache servers, the web-based pgadmin is no longer accessible, so I installed the pgAdmin3 app and configured it to connect over SSH to the postgres server so I could do basic admin through that interface.
In the searchcensus page, PD wanted the list of subdistricts that appeared in the popup list to be sensitive to which census year check boxes were selected. I modified the showDataListYear function in the script/showdatalist.js file to check the status of those checkboxes and then added an argument to the call to datalist.php to send that information to the popup page.
In datalist.php, I modified the code so that if there is a value in the yearsChecked argument, a newly written block of code is executed which ensures that only those subdistrict records which belong to the selected years appear in the popup window.
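For reference, here's a rough sketch of what that filtering in datalist.php amounts to (this is not the actual code; the table and column names, and the format of the yearsChecked argument, are assumptions):
<?php
// Sketch only: restrict the subdistrict popup list to the selected census years.
$yearsChecked = isset($_GET['yearsChecked']) ? $_GET['yearsChecked'] : '';
$sql = 'SELECT DISTINCT location_03, census_year FROM location';
if ($yearsChecked !== '') {
    // expect a comma-separated list like "1891,1901"; keep only 4-digit years
    $years = array();
    foreach (explode(',', $yearsChecked) as $y) {
        if (preg_match('/^\d{4}$/', $y)) { $years[] = $y; }
    }
    if (count($years) > 0) {
        $sql .= " WHERE census_year IN ('" . implode("','", $years) . "')";
    }
}
$sql .= ' ORDER BY location_03, census_year';
$result = pg_query($connection, $sql); // $connection assumed to come from the shared config include
?>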
As it stands, it is possible for the user to select e.g. the 1901 checkbox, pick a subdistrict from 1901, then deselect the 1901 checkbox and check e.g. the 1891 checkbox. That configuration may result in a query that returns no values (namely, all records from 1891 that have a subdistrict name which only exists in 1901). The only exception is if there happens to be an 1891 subdistrict name which is identical to the 1901 subdistrict name. In that case, you'd get the records from 1891 that have the 1891 subdistrict name which happens to be the same as the selected 1901 subdistrict name, but that's probably expected behaviour.
Many of the districts/subdistricts appear in only certain censuses (e.g. no Alberni in 1881, then one in 1891, 1901 and 1911). Turns out each location is a record in the locations table that knows nothing about other location records (i.e. Alberni1891 has no relationship with Alberni1901).
In the search gui, there is a popup dialog that used to list all the distinct values of the subdistrict text (e.g. "Alberni"). The problem with that is that the user doesn't know whether a given district ("Alberni") exists in a given year (1881 vs 1891). I tried coming up with a smart way of manipulating the year checkboxes based on what was happening with the subdistrict values, but that was adding more UI problems than it was solving. So, I added the year to the subdistrict text for each value in the drop down (e.g. Alberni 1891, Alberni 1901) so at least the user can see which years that district exists in.
Next problem is that I can't use e.g. "Alberni 1891" as the value to search for in the SQL query, because of course that string doesn't appear in the subdistrict field of the locations table. So, I had to strip the year off of the value in the text box the user sees. The user must still check the checkbox for the year(s) they want to include in the search. Not ideal, but the best compromise I could come up with.
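For what it's worth, the stripping step is basically this (a minimal sketch, assuming it happens server-side just before the query is built; the function name is mine, not the site's):
<?php
// Sketch only: "Alberni 1891" -> "Alberni"; values without a trailing year pass through unchanged.
function strip_year_suffix($subdistrict) {
    return preg_replace('/\s+(18|19)\d{2}$/', '', trim($subdistrict));
}
// e.g. strip_year_suffix('Alberni 1891') === 'Alberni'
?>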
I modified the default settings so that all years are checked by default. So if the user selected "Alberni 1891" and hit submit, they'd actually get values for Alberni for 1891, 1901 and 1911. The hits show the date, so it would be pretty obvious they're getting results they don't want, and they'd then have to go back to the search form and uncheck the years they don't want. That seems better than the previous behaviour, in which only the first year checkbox was checked by default, so in our example the user would get back 0 hits (as there was no Alberni district in the first census year).
PD and ES of History wanted to get a dump of the 1901 Census. They tried using the Export feature on the site, but all they got was a white, empty window when they executed it. Looks like the number of records they were trying to export (about 46,000) exceeded some RAM or processing-time limit in php, so the job never returned, resulting in the white window. I did a query to export all the males, then all the females. Those two exports worked. I then combined the two into one big spreadsheet file.
I then went into phppgadmin and discovered there were 5 records unaccounted for. Those had a value of 9 in the sex field. I then went back to the website GUI and searched for each of those five records and then incorporated them into the spreadsheet.
So, I did get the client what they asked for by a workaround, but didn't precisely identify the cause of the problem, let alone a solution.
At PD's request, I rejigged the properties chunk of the site: renamed the folder from ta to properties, took the one existing page (taxassessment.php) and split it into properties.php (landing page with links to tax assessment and building documents), taxassessment.php (intro and links to 3 tax assessment tables), and buildings.php (blurb on background and link to search page).
Also made minor tweaks to index.php page in root and the streetname_changes.php file.
Seems like renaming a folder using svn rename is a bad idea. Probably better to create a new folder, manually copy the files over, then manually add all the new stuff to the svn repository, commit all that and, assuming it is working, delete the unwanted folder from the repo. I.e. manually do the copy and delete that rename supposedly does for you.
I wanted to rename a folder from "taxes" to "properties". The folder contained one file. I did this:
1)
svn rename taxes properties
no problem. Then I committed the repository and got this:
svn: Commit failed (details follow):
svn: Item '/site_root/content/taxes' is out of date
2) So I tried updating the site and got back a taxes folder.
Then I cd'd to the containing folder and tried this:
svn delete taxes
worked. Tried to commit and got:
svn: Commit failed (details follow):
svn: Aborting commit: '/Users/sarneil/Documents/Projects/history/ViHistory/instances_of_site/svnrepo/trunk/site_root/content/ta' remains in conflict
then tried:
svn delete --force taxes
worked. Tried to commit and got:
svn: Commit failed (details follow):
svn: Aborting commit: '/Users/sarneil/Documents/Projects/history/ViHistory/instances_of_site/svnrepo/trunk/site_root/content/ta' remains in conflict
3) did this:
svn status
and got
D C taxes
> local delete, incoming edit upon update
4) Did a search on stackoverflow, and based on that, then tried
touch taxes
svn revert taxes
worked. I decided to try getting rid of the file inside first:
svn rm ta/taxassessment.php
worked. Then committed the change and now have things in sync with an empty taxes folder.
5) Then tried to delete the pesky folder
svn delete taxes
worked. Tried to commit that and got
svn: Commit failed (details follow):
svn: Item '/vihistory/trunk/site_root/content/ta' is out of date
6) Last resort on stackoverflow was "In a nutshell, the trick is to go to the .svn directory (in the directory that contains the offending file), and delete the "all-wcprops" file."
Did that, followed by
svn update
svn commit
and the taxes folder is now gone in the repository and in my copy.
No idea which of those steps (other than 6) are required.
I added the building_documents table to the production db, and added the necessary files in the production front end for the building construction data set.
I updated the tax assessment and documents to include the material on changed street names.
Next will be removing the building_permit stuff from the dev db and dev front end, then from the prod db and prod front end.
I've got a working instance of the building construction documents data set in the dev site. Searchable by name (see previous post), street and year. As some of the dates are yyyy and others are yyyy-mm-dd, I turned the field into a character field and added some code so that in the where clause only the year part of the date gets included.
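The shape of that change is roughly the following (a sketch only; the field and variable names are assumed from the surrounding search code rather than copied from it):
<?php
// Sketch only: bcd_date is a character field holding either "yyyy" or "yyyy-mm-dd",
// so compare just the first four characters when filtering by year.
if (isset($form['year']) && preg_match('/^\d{4}$/', $form['year'])) {
    $where .= ($where == '' ? ' WHERE ' : ' AND ')
            . "substr(bcd_date, 1, 4) = '" . pg_escape_string($form['year']) . "'";
}
?>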
Assuming PD approves, I'll create an identical instance in the production db and front-end and then clear out the building permits table and front-end stuff from both dev and prod instances, as that has been superseded by the building construction document stuff.
I copied the code for the building permit pages to create the building construction document pages (i.e. inc/bcdformbasic.inc, inc/bcdresults.inc and search/searchbcd.php). When I did a search and put Holmes into the owner field in searchbcd.php, I got zero hits, even though I know there are five records. When I do a similar search in searchbp.php, I get the five hits.
Here's the line of code in search/searchbp.php:
$where .= make_where_pattern($surname, 'bp_owner_surname', $where, $form, $fuzzy_func);
Here's the line of code in search/searchbcd.php that returns 0 hits:
$where .= make_where_pattern($holder_name, 'bcd_holder', $where, $form, $fuzzy_func);
Here's the line of code in search/searchbcd.php that returns 5 hits:
$where .= make_where_pattern(('%'.$holder_name.'%'), 'bcd_holder', $where, $form, $fuzzy_func);
I think I've figured it out, based on testing and reading the docs on the postgres SIMILAR TO operator. In the building permits case, the surname is just the surname; in the building construction documents case, the holder is a surname and given names, and that extra information causes SIMILAR TO to return false (the docs say the regexp match is across the entire string, which confirms my theory).
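A quick illustration of the difference (the name value here is made up; SIMILAR TO has to match the whole string, which is why wrapping the search string in % fixes the bcd case):
<?php
// Illustration only, run against any pg connection:
$r1 = pg_query($connection, "SELECT 'Holmes, John A.' SIMILAR TO 'Holmes'");   // returns f (no match)
$r2 = pg_query($connection, "SELECT 'Holmes, John A.' SIMILAR TO '%Holmes%'"); // returns t (match)
?>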
The name search in the bcd is now more liberal than in any of the other tables: the bcd finds the search string anywhere in the name field, whereas the others have to match much more closely (not sure exactly how closely), but they have separate surname and forename fields and only the surname field is searched.
Did the following so far to normalize the data:
1 delete first column (sequential id number)
2 render rooms as integer
3 replace left double guillemet with 1/2
4 render cost value as integer
5 clear trailing whitespace at end of file
6 provide correct field names in first line
bd_permit_ref,bd_date,bd_notes,bd_holder,bd_street,bd_street_num,bd_sublot,bd_lot,bd_block,bd_section,bd_add_alt_repair,bd_purpose,bd_rooms,bd_cost,bd_arch_builder
In records in which the room number value was something like 2 buildings 4 rooms each, I rendered that as 8 rooms.
Had to stop when I noticed 3000 records have only a year in the date field (8000 have year-month-day). Postgres requires yyyy-mm-dd for a date data type. I recommended to PD that we change that field to a string and accept less sophisticated comparison operations. Now waiting to see if he thinks that's ok, or if he wants to explore a more involved solution (namely, split the field into 3 separate fields and then add a bunch of complicated code as needed to allow for filtering by dates).
Updated my dev instance of the front end (in my account) so it's the same as the production instance (minus the huge cache of images for the maps) in the taprhist account.
Still no luck setting up a time with the sysadmins to clear out the huge table logging search queries (6 million records), and to migrate dbs. Will proceed with the dbs I've got.
- production instance is pointing to the vihistory db ; dev instance is pointing to the vihdev db
- added a building_documents table with appropriate fields to the vihdev instance
- added a building_documents_bd_id_seq sequence to the vihdev instance
Got answers to questions and data spreadsheet from PD.
The production front end is pointing at the vihdev db. I tried to back that up using the psql command provided by Jaime and I got a file of zero bytes. I then tried to use the export feature in phpPgAdmin and that brought the db service down. Here's the error message, which suggests that the device is full, but which we now think is actually a permissions problem:
==================================
SQL error:
PANIC: could not write to file "pg_xlog/xlogtemp.24088": No space left
on device
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
==================================
Sys-admins cleaned things up and restarted the service within a few hours. Thanks.
I tried connecting to the vihistory db running in the same environment, but the connection failed.
So, until I get a successful backup of the data in the server, I'm very hesitant to do any work on it. Here's what I've proposed to the sys-admins:
1) export and clear big log table from db
- write table search_log in db vihdev to file
- clear contents from table search_log in db vihdev
NB When I recently tried it from my Mac, I got a search_log.err file of
size 0 bytes. Since the last time we successfully executed the command
below the db engine has moved to mango.
psql -W -U viAdmin -h pgsql.hcmc.uvic.ca -A vihdev --command "copy search_log to stdout;" > search_log.err
NB table has 6.5 million records, each about 1K = 6 gigs
2) create a new db to use as the production instance
- clone table vihdev, name clone vihprod
- drop db vihistory
- modify acl on vihprod based on settings for vihdev
NB I have to change connection strings on site
3) set up long-term plan for write-then-clear search_log table
NB cron job vs manual
NB frequency
4) ensure I can migrate future changes in vihdev db to vihprod db
PB sent me a large jpg to use as a banner on the conference site. Took a couple of guesses to get something with the correct dimensions. Also modified the bg of the image from white to the taupe colour used on the site - they were OK with that.
Turned out that
- headerTitle div had a number of peculiar styling rules
- #header h1 had odd padding settings which resulted in the banner being noticeably off-center
I overrode those by adding selectors in the cha2013styles.css file. Noticed that a lot of elements have many selectors and conflicting style rules applied to them - sometimes 2 or 3 selectors for the same id in the same css file.
Finally, it seems that the CMS doesn't like comments in css files, so I can't include any.
Here's the contents of the modified css file to date (with comments added here by me):
/*hide certain items in the nav-bar at the top of the page*/
/*
not logged in
li:nth-child(1) = home http://conferences.uvic.ca/index.php/index/index/index/index
li:nth-child(2) = about http://conferences.uvic.ca/index.php/cha/index/about
li:nth-child(3) = log in http://conferences.uvic.ca/index.php/cha/index/login
li:nth-child(4) = account http://conferences.uvic.ca/index.php/cha/index/user/account
li:nth-child(5) = search http://conferences.uvic.ca/index.php/cha/index/search
li:nth-child(6) = announcements http://conferences.uvic.ca/index.php/cha/index/announcement
logged in
li:nth-child(1) = home http://conferences.uvic.ca/index.php/index/index/index/index
li:nth-child(2) = about http://conferences.uvic.ca/index.php/cha/index/about
li:nth-child(3) = user home http://conferences.uvic.ca/index.php/index/cha2013/user
li:nth-child(4) = search http://conferences.uvic.ca/index.php/cha/index/search
li:nth-child(5) = announcements http://conferences.uvic.ca/index.php/cha/index/announcement
*/
div#navbar ul.menu li:nth-child(1),
div#navbar ul.menu li:nth-child(2),
div#navbar ul.menu li:nth-child(4) {
display:none;
}
div#navbar ul.menu li a[href$=account] {
display:none;
}
div#navbar ul.menu li a[href$=search] {
display:none;
}
div#navbar ul.menu li a[href$=announcement] {
display:none;
}
/* make header have thinner border on bottom and mirror it on top */
#header {
border-bottom: 1px solid #898E79;
border-top: 1px solid #898E79;
}
/* get rid of odd padding around h1 in header to ensure contents are centered */
#header h1 {
padding: 0 0 0 0;
}
/* headerTitle has a barrage of selectors and style rules, over-rule them all
most importantly over-rule the fixed height so the banner image can be contained elegantly */
#headerTitle {
height: 100%;
padding-bottom: 0px;
padding-top: 0px;
margin-bottom: 0px;
margin-left:0;
}
At the request of PD, I processed the raw checkcensus csv files he submitted into something suitable for upload to the database. Oddities in the data that had to be resolved: street numbers such as 7-9 or 12 and a half; number of occupants such as 20+, 20plus, 20 to 30 etc.; and quotation marks and commas in the dataset being confused with the quotation marks and commas used as delimiters.
Also set up the table, created the fields including an entry in the sequences table, all based on similar work done for the 1871 census, which was also a kind of one-off data set.
Did some twiddling with the presentation of results (particularly with how to represent the various note fields). For those records that have an id for the 1891 Dominion census, wrote code to generate a link to the full record for that individual in the Dominion Census, but have not modified the 1891 Dominion census in any way (e.g. to include a link back to an entry in the 1891 Check census).
Looked into what's involved in updating 1400+ records in the census_1911 table.
Have spreadsheet from PD. About 37 of the fields in the db can be just read in from the spreadsheet. About 12 of the fields in the db have to be calculated based on values in the spreadsheet (the nature of the calculation varies for each of the 12 instances; the notes field in the PG table describes the needed calculations). Not sure how Jamie did this last year, but it looks like I'll have to write a bunch of code to do that processing and generate the csv needed.
Then I'll need to tread carefully regarding the actual upload. I think I'll need to delete the existing records and then upload (using the copy command, or possibly the import feature in the admin client).
PD asked me to create one page to display a table of all the streets (or portions of streets) that have been renamed over the time range of the VIHistory datasets. I did that by running the tab-delimited file through a few regexes to produce a styled html table and putting that into a new page.
PD also asked for a similar page to display a table of all the addresses that were renumbered in 1907. That took a bit more regex processing, as I wanted to create a heading for each street and then a table for all the addresses on that street.
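For the record, an equivalent scripted version of that conversion would be roughly the following (a sketch only; the file name and column layout are assumptions, and the real work was done with regexes as described above):
<?php
// Sketch only: turn a tab-delimited file of street renames into a styled html table.
$lines = file('street_renames.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
echo "<table class=\"datatable\">\n";
echo "<tr><th>Old name</th><th>New name</th><th>Notes</th></tr>\n";
foreach ($lines as $line) {
    $cells = array_map('htmlspecialchars', explode("\t", $line));
    echo '<tr><td>' . implode('</td><td>', $cells) . "</td></tr>\n";
}
echo "</table>\n";
?>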
Made copies of a number of ta (tax assessment) files, renamed them and edited them to work with the building_permits table in the db. Did this in the dev instance in my account on our server.
Files I've added:
site_root/inc/bpformbasic.inc
site_root/inc/bpresults.inc
site_root/search/searchbp.php
The link to the searchbp.php page is on the tax assessment page (ta/taxassessment.php).
Emailed JL and PD for guidance on which fields to include in search interface(s) and which results fields are primary and which secondary.
More headaches with svn than the actual code (as usual). I had in the repo and on my local drive a folder (bp) containing 1 file. I did an svn delete and the file was deleted in the local instance, but not the folder. When I then did a commit, I got a file out of date error. When I tried various ways of sorting out this problem I ended up consistently getting "'/path/on/local/drive/' remains in conflict" errors.
Googled it and discovered I'm not alone: quite a number of people renaming, moving or deleting files have everything go smoothly except for one file or folder that somehow gets into "conflict".
Solution:
svn resolved path/to/conflicted folder
svn update path/to/conflicted folder
svn commit -m "resolving conflicted folder or whatever"
Created a building permits table in the vihdev db. Noticed that to auto-increment the building_permit_id field, you have to reference a sequence, so created the necessary sequence modelled on others I found in the db (census tables).
Processed the raw data file (spreadsheet) into normalized data (typed a couple of the data fields that I could e.g. int or date and normalized data to comply with the constraints I had established e.g. length of varchar fields). Saved that as a CSV (rather than tab-delimited) as the documentation seemed to favour the CSV approach.
The only substantial fiddling I had to do with the data was for all the records whose date field was only a year (e.g. 1889): I arbitrarily assigned them the 1st of January (e.g. 18890101), as the date field requires 8 digits.
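In script form that padding amounts to something like this (a sketch; the actual fix was done on the data file itself):
<?php
// Sketch only: pad year-only values out to the 8 digits the date field requires.
// "1889" -> "18890101"; values that already have a full date pass through unchanged.
function pad_permit_date($value) {
    return preg_match('/^\d{4}$/', $value) ? $value . '0101' : $value;
}
?>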
Once that instance uploaded successfully and everything was working in the dev instance of the db, I did the exact same thing in the production instance, just so I have a second copy of the thing somewhere.
Replaced the contents of the trunk, Alex branch and the backup branch with the files as updated by Martin and Greg, so they're up to date.
Gave AD all the svn, web account and db connection info he should need to get to the files, check them out, and post them to the web space for testing.
note to self on nuts and bolts
on local file system:
create the folder structure you want (if you're copying an existing local instance of an svn project, you have to delete the .svn folder from each folder in that project)
on command line,
cd to parent folder of the one you want to add (that parent folder has to already be in svn)
svn add FOLDER_YOU_WANT_TO_ADD
svn commit -m "message about adding new folder"
There are three files in the site which contain database connection strings:
inc/config_EDIT_ME.inc
content/maps/include/conf_EDIT_ME.inc
content/maps/include/config_EDIT_ME.xml
In each of these three files, the values for the database connection string have been replaced with placeholders. You have to make a copy of each of those files with the following names:
inc/config.inc
content/maps/include/conf.inc
content/maps/include/config.xml
In the copies, substitute the correct values for your connection string.
If the folder is in svn (which it probably is), you'll need to use svn add to add each of the files to the repo, then do your svn commit.
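For anyone doing this setup, the copied inc/config.inc presumably ends up looking something like the sketch below (the variable names, placeholder values and connection-string format here are assumptions; check the _EDIT_ME file for the real placeholders):
<?php
// Sketch only -- not the actual file. Substitute real values for the placeholders.
$db_host = 'YOUR_DB_HOST';
$db_name = 'YOUR_DB_NAME';
$db_user = 'YOUR_DB_USER';
$db_pass = 'YOUR_DB_PASSWORD';
$connection = pg_connect("host=$db_host dbname=$db_name user=$db_user password=$db_pass");
?>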
I've now fixed the streetcar map problem, by changing all filenames and references so that they're consistently referring to 1939 instead of 1936. The fixes have been committed to trunk, and put up on the website.
The other two issues remain outstanding; PD will get back to me with the correct firemap URL at Malaspina, and we'll wait until the DNS has been changed before addressing the problem with Firefox and captcha, since it seems to be cookie-related.
PD has checked the new VIH site on taprlans/www, and reports only these issues:
- Captcha reports errors for him on FF. I can't reproduce this -- it works fine for me -- so waiting for more details.
- There's a link on the maps.php page pointing to the "1936 streetcar map" which fails, defaulting to the panorama, because it should point to 1939. The confusion arises thus:
The map itself has the date "1939" on it. The .map file (the basic definition file) is called "vicstreetcar1939.map". However, it contains pointers to images called victoria-streetcar-1936.png and victoria-streetcar-1936-key.png, and this page: http://hcmc.uvic.ca/~taprhist/content/maps/maps.php has 1936 in its caption. However, if you go to the map viewer, click on the Maps menu and drill down to it, you see the caption "1939 - Streetcar routes". I think that: the actual date is 1939; the images are wrongly named, as are the pointers to them in the .map file; and the caption on the maps.php page is wrong, whereas the Maps menu in the map viewer is right.
Waiting for PD to confirm my analysis before changing the image file names, the .map file, and the caption/link on maps.php. - PD reports this:
The link on the page describing the 1885 Fire Insurance Plans of Victoria needs to be changed. The page in question is located at: http://hcmc.uvic.ca/~taprhist/content/maps/firemap.php The hyperlink should be redirected to: http://www.mediastudies.viu.ca/steeple/index.htm [The link currently points to an obsolete server - http://cdi.mala.bc.ca/firemaps/]
However, the new URL is about 1891 panorama images; it has nothing about the 1885 Fire Insurance Plans. Waiting for the correct URL from PD.
There's one special map that uses SVG for its interface, and which was originally working only on FF; DB had put a complete block on other browsers, but now most of them support SVG so I've removed that block. Browsers that don't support SVG should get with the program. Committed that change to SVN, but I was unable to commit the bulk of my additions (code which was not originally in SVN, but should have been) because we ran out of disk space on revision.tapor.uvic.ca. We'll be switching to the new SVN on Monday, so hopefully this problem will be solved.
Got the site basically working by doing this:
- Moved everything from taprhist/vihdev/www to taprhist/www (it really doesn't like living in that odd location, and there are hard-coded paths in several text files).
- Removed "vihdev" from paths in config files.
- Cleaned out the /home1t/taprhist/www/content/maps/cache folder so it had to start rebuilding.
- Now we found that most maps were working, but a handful were failing. The failing maps had one line in their .map file:
METADATA "queryable" "true" "tile_source" "cache" <-- This line has to be removed. END
Our surmise is that this problem line causes the server to construct a broken path to a cache folder that doesn't exist or isn't writable, and therefore it cannot read or construct tiles.
We now propose to have the DNS repointed so that vihistory.uvic.ca points at taprhist/www, and keep the live site there.
Greg also noticed that vihistory.ca is broken; outside of the ring it's pointing at mala.bc.ca DNS servers, so he's emailed PD to get him to fix that on the domain host.
I've been working for the better part of the last couple of days trying to figure out why the imap maps in ViHistory are not appearing.
Problem appears with the production front end and a test front end connected to either the old db server or the new db server.
Problem appears with a test front end which is the production front end minus the captcha code (which is the only code that has changed since Jamie left us a working site).
We weren't getting errors when we trolled the server logs on lettuce.
We did get the following errors from the sysadmin:
Apache error log:
[Tue Mar 06 09:16:21 2012] [error] [client 96.54.151.99] [Tue Mar 6
09:16:21 2012].616424 loadSymbolSet(): Unable to access file.
(/home1t/taprhist/www/content/maps/user/symbol/generic.sym), referer:
http://vihistory.ca/content/maps/htdocs/index.php?map=vicbird1889
AND syslog:
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice
php-cgi: PHP Warning: [MapServer Error]: loadSymbolSet(): (/home1t/taprhist/www/content/maps/user/symbol/generic.sym)
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice in: /home1t/taprhist/vihdev/www/content/maps/htdocs/init.php on line 125
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice
php-cgi: PHP Warning: Failed to open map file /home1t/taprhist/vihdev/www/content/maps/user/map/vi1798.map in /home1t/taprhist/vihdev/www/content/maps/htdocs/init.php on line 125
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca user.notice
php-cgi: PHP Fatal error: Call to a member function getMetaData() on a non-object in /home1t/taprhist/vihdev/www/content/maps/htdocs/init.php on line 131
2012-03-06T09:16:21-08:00 local@mustard.hcmc.uvic.ca local0.debug
suphp_wrapper: 0 PHP5
/home1t/taprhist/vihdev/www/content/maps/htdocs/init.php
That init file is as provided by the imap people, so I really, really doubt it is causing the problem, though it is where the problem occurs. It looks like that file is trying to create objects based on what it reads from some kind of config file, and somehow that process is breaking down, so the object doesn't get created, so invoking a method on the (non-existent or empty) object throws the error. More precisely, it looks like the config.php file is supposed to create an array in the variable aszMapFiles and then those values are used in the init file, but for some reason something is failing with the way that array and associated variables are being populated.
Drew migrated the viHistory and vihdev postgres db's to the new db server (pgsql.hcmc.uvic.ca)
I created an instance of the viHistory site at http://lettuce.tapor.uvic.ca/~sarneil/vihtest/site_root and changed the connection string to point at the postgres service on the new db server (pgsql.hcmc.uvic.ca) rather than the old db server (db.tapor.uvic.ca). I didn't change anything else. After a couple of iterations through errors coming back from the db, Drew made the necessary adjustments and the connection is now working to the new db.
What I've got to do now is take the vihdev db on the db.tapor.uvic.ca server and put it into the viHistory db on the pgsql.hcmc.uvic.ca server. Then when we change the connection string for the production site, we change the name of the db it's pointing to as well as the server it's pointing to.
The trunk I started with has the captcha modifications in it.
I created a tag called beforeAlex which is a copy of the trunk corresponding to the publicly accessible site.
I then svn renamed the config.inc file to config_Edit_Me.inc, abstracted the connection string values and committed that. I searched all the files for any other instance of the connection string values and did not find any.
I then created a tag called configAbstracted which is a copy of the trunk with the config file modified (that's the only difference between the configAbstracted tag and the beforeAlex tag).
I then created a branch called alex which is identical to the configAbstracted tag (at least until Alex starts modifying it).
Still to do:
- create an svn user for Alex with rw privileges on https://revision.tapor.uvic.ca/svn/hcmc/vihistory/branches/alex
- create a user in the database for Alex with read privileges, which he can use to modify the config_Edit_Me.php file.
Had a puzzling time with this: a file originally called censusformbasicCaptcha.inc was renamed to censusformbasic.inc, and a file originally called censusformbasic.inc was renamed to censusformbasicNoCaptcha.inc, but I forgot to do svn rename for those. Subsequently had a miserable time trying to commit to the repository as I kept getting "File censusformbasicCaptcha.inc is out of date, path not found", but only for that file. I eventually svn delete'd the entire svnrepo folder on my local computer and checked out the entire repository, and the errors went away.
To make captcha work, you need to make modifications to the included form file and to the file that invokes the form.
// I'm using the $_SESSION['captchaPassed'] variable, so that even if the user moves from form to form or query to query, they only have to verify one captcha per session
// if $_SESSION['captchaPassed'] is not set, the user has not yet tried to validate a captcha field in this session
// if $_SESSION['captchaPassed'] is set and has value false, the user has tried to validate a captcha field in this session and failed
// if $_SESSION['captchaPassed'] is set and has value true, the user has successfully validated a captcha field in this session
I'll need to rewrite the code in the annotations section to call on the instance of captcha in the captcha folder rather than the one in the annotate folder.
Because the js file currently assumes the call is coming from a file at a certain level in the site hierarchy, and the search and annotate files are at different levels in that hierarchy, I'll have to either write a separate function for the annotate call, or modify the calls to the loadcaptcha function so they pass in the path from themselves to the root as a second argument.
==================================================
Changes to the actual search form (e.g. censusformbasic) as seen in censusformbasicCaptcha.php
==================================================
1) Make sure this line points to correct file (for production, should be searchcensus.php) :
$search_arg = "searchcensusCaptcha.php?start=0&form=$form&show=n";
2) Add this block at end of form:
<?php
if ((isset($_SESSION['captchaPassed'])) && ($_SESSION['captchaPassed'])) {
//do nothing
} else {
?>
<fieldset class="">
<legend>
Verify
</legend>
<label>
To verify that you are a real, live human and not a spambot,<br />
enter the text you see in the image below.<br />
If you can't read the text, click
<a href="javascript:loadcaptcha('captcha');"><b>here</b></a>,
or see the <a href="<?php print $doc_root; ?>/help/annotation.php#verify">help page</a>.
<?php
if ((isset($_SESSION['captchaPassed'])) && (!($_SESSION['captchaPassed']))) {
print '<br /><br /><span style="color: #ff0000; font-weight: bold;">The text you entered does not match the image. Try again. </span>';
}
?>
</label>
<br />
<label for="verifier">Image text (5 letters)</label>
<input type="text" name="verifier" id="verifier" value="" />
<img id="captcha" src="<?php print $doc_root; ?>/captcha/captcha.php" width="200" height="60" alt="Verify" style="border: 1px solid #000000;;"/>
</fieldset>
<?php
} // end of if isset($_SESSION['captchaPassed']) ...
?>
==================================================
Changes to the page calling the form (e.g. searchcensus.php) as seen in searchcensusCaptcha.php.
==================================================
1) Add these blocks to head:
/**
* Load captcha library.
*/
require_once("$doc_root/captcha/php-captcha.inc");
<script type="text/javascript" src="<? print $doc_root ?>/script/loadcaptcha.js"></script>
2) Make sure this line points to right file (for production, should be searchcensus.php):
$search_page = "searchcensusCaptcha.php";
3) Add this block immediately after the first "if (isset($_GET['start'])) {"
//SA we need to defy the spambots
// I'm using the $_SESSION['captchaPassed'] variable
// if $_SESSION['captchaPassed'] is not set, the user has not yet tried to validate a captcha field in this session
// if $_SESSION['captchaPassed'] is set and has value false, the user has tried to validate a captcha field in this session and failed
// if $_SESSION['captchaPassed'] is set and has value true, the user has successfully validated a captcha field in this session
// need to do the CAPTCHA test before any query gets sent to the DB
// Only do the test if the user has not tried to validate yet, or if the user has tried to validate and failed
// my actual test is: if (NOT ((session var is set) AND (session var is true)))
if (!((isset($_SESSION['captchaPassed'])) && ($_SESSION['captchaPassed'] == true))) {
if (PhpCaptcha::Validate($_POST['verifier'])) {
$_SESSION['captchaPassed'] = true;
} else {
$_SESSION['captchaPassed'] = false;
}
}
if ((isset($_SESSION['captchaPassed'])) && ($_SESSION['captchaPassed'] == true)) {
// We know user has successfully validated the captcha, so do the query and display the results
4) add this block immediately before "} // end of if isset($_GET['start'])"
} // end of if isset ($_SESSION['captchaPassed']) ...
5) make sure this line points to correct file (that is, censusformbasic):
require("$doc_root/inc/censusformbasicCaptcha.inc"); //SA was censusformbasic.inc
6) add this block immediately before the second "if (isset($_GET['start'])) {"
if ((isset($_SESSION['captchaPassed'])) && ($_SESSION['captchaPassed'] == true)) {
7) add this block after "pg_free_result($result); }}"
} // end of if isset($_SESSION['captchaPassed']) ...
Checked out the vihistory svn repository at https://revision.tapor.uvic.ca/svn/hcmc/vihistory into a directory on my mac
Copied the files down from the production site (taprhist/vihdev). Next I need to compare the two to make sure that the repository and production site are in sync.
Once that's done, then I can start work on implementing the captcha code from the annotations page into other forms, namely the searchcensus one.
Also, the production site has .svn folders in it, which might be the result of using an svn client on that set of folders, or may just be an artefact of ftping files up to the server from another location that was in svn.
CC has a new student, ES, working on FrancoToile. First task is copy editing.
Got her to log in, then I created an SVN collection on Pomme.
She edited one file, then committed the collection, which worked.
She then tried updating, but of course there were no changes to any of the files, so the client reported that the directory had been skipped - not sure if that is the correct behaviour or not. I'll have to change one of the files (as me) before she comes in tomorrow and see what happens when she then logs in and updates.
Still have not looked at the problem with the rng file not allowing attributes and elements that it should, even though the schema is being invoked.
PD reported a number of bogus annotations had been submitted on three specific dates. I went in through phpPgAdmin to see the annotations table and made sure there were no dependencies that would be broken if I deleted records from that table (there weren't, as each record in that table has a field for the data table and data record id linking the annotation to its data record). I then exported that table, extracted the records from the specified days and sent that to PD as a spreadsheet. He reviewed the records and told me which ones to delete, which I did (about 30 in all).
I also confirmed that the active database is vihdev. The vihistory database is now out of date (though I found some similar bogus entries in the annotations table in that db and deleted them). At some point we may want to drop all the tables in the vihistory db, copy all the tables from vihdev to vihistory and then change the connection configuration in the front-end files. Not a high priority.
Finally, noticed 3.8 million records in the search_log table since the 3rd of July (or possibly the 7th of March). That's consistent with Jaime's earlier blog posting finding 20 million records over 5 years in that table and his suggestion for how to export them to a file. I'll wait before exporting the search log again, as I'll need some help from the sys admins (or may end up asking them to do it as a timed job once every six months or so).
Updated more content on the development server, supplied by PD:
http://tapor.uvic.ca/~vihdev/content/census/1911/census1911.php?page=main
http://tapor.uvic.ca/~vihdev/content/census/1911/census1911.php?page=enumcats
Added some introductory text for the 1911 census, supplied by PD. He also gave me a new version of the homepage text, which I updated.
http://tapor.uvic.ca/~vihdev/index.php
http://tapor.uvic.ca/~vihdev/content/census/1911/census1911.php?page=main
Addressed all of the issues from my July 26 meeting with PD. Specifically:
- I updated the meta tags with new information (1911 and 1901 census years), and fixed the "Courtenay" typo
- I removed the searchable "dot" for 1901 for the "Birthplace of Father" field in the advanced census search form
- I fixed the short list of incorrect records that you gave me (stemming from transcription errors)
We had also discussed adding Racial/Ethnic Origin as a searchable field for 1881, but upon further investigation I've discovered that, due to a bit of confusion on my part (there are a few similar fields), the 1881 census doesn't actually have this data. Racial/ethnic origin is actually "race_id", which doesn't appear until 1901. The closest the 1881 data comes is "origin_id", which actually looks to be the birthplace of the father (that's how it's set up in the search). So, we are going to stand pat rather than make this change.
Due to a primary key sequence inconsistency on the occupation table between my local machine, where I first imported and tested all of the data, and the development server, many of the new occupation codes inserted for the 1911 data had a different occupation_id on each machine. As a result, the occupations in the 1911 view for census records made reference to incorrect occupation IDs and, thus, incorrect occupation codes.
Since occupation_id isn't actually needed - occupation_code is also a unique identifier and doesn't suffer from the same sequence constraints - I decided to add an occupation_code column to all of the census tables and update the views to use this column instead of occupation_id.
The steps I took:
#1: Added occupation_code character varying (5) column to 1881, 1891, 1901, 1911 on both local machine and dev server
#2: For all except 1911, I executed this SQL on the test server:
UPDATE census_1881 SET occupation_code = (SELECT occupation_code FROM occupation WHERE occupation.occupation_id = census_1881.occupation_id)
UPDATE census_1891 SET occupation_code = (SELECT occupation_code FROM occupation WHERE occupation.occupation_id = census_1891.occupation_id)
UPDATE census_1901 SET occupation_code = (SELECT occupation_code FROM occupation WHERE occupation.occupation_id = census_1901.occupation_id)
On my local host, I also executed this on 1911:
UPDATE census_1911 SET occupation_code = (SELECT occupation_code FROM occupation WHERE occupation.occupation_id = census_1911.occupation_id)
#3: Altered the 1881, 1891, 1901, 1911 view definitions to change this (1881 used as example):
LEFT JOIN occupation ON occupation.occupation_id = census_1881.occupation_id
To this:
LEFT JOIN occupation ON occupation.occupation_code = census_1881.occupation_code
Also changed the occupation_id column definition in the views from census_<year>.occupation_id to occupation.occupation_id. This effectively rendered the occupation_id column in the census tables obsolete, since occupation_id is no longer used as the foreign key lookup in the views.
#4: Modified vOccupations<year> views to join the census table on occupation_code rather than occupation_id
#5: Dropped now unneeded occupation_id from census_1911, but not the earlier data. It's not strictly needed in the other tables but I don't want to rock the boat too much.
#6: re-imported all of the 1911 data from my localhost to the development server.
All is good - incorrect occupations have been corrected.
Met with PD yesterday (July 26) to go over some loose ends with the new dataset. Tasks for me that arose from the meeting:
- PD pointed out some inconsistencies with occupation codes; namely, that some 1911 census records have the wrong code. I need to do a full scan to find out how many there are, and then fix them.
- PD also pointed out some incorrect data in 6 of the records. The mistakes were in the original CSV files. I need to go into the database and correct them.
- The 1881 data should have Racial/Ethnic origin as a searchable field - it's in the data but not searchable on the website.
- The 1901 data should not have a searchable 'dot' for birthplace of father.
- The meta tags should be updated with new information to match the latest data imported.
Received SD8 (Hotels and rooming houses) from PD over the weekend, along with some new occupation, relation to head, religion, birthplace, and nationality codes. Imported successfully.
INSERT INTO "location" VALUES (103, 'BC', 'British Columbia', '13', 'Victoria City', '8', 'Hotels and Rooming Houses', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
After sorting out the ID issue, I've imported the SD16 data.
INSERT INTO "location" VALUES (102, 'BC', 'British Columbia', '13', 'Victoria City', '16', 'Beacon Hill', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
PD sent the SD16 (Beacon Hill) data set yesterday, which is missing unique IDs. As PD says, the ID numbers likely got lost during one of the many data shuffles from Access to Excel to various worksheets.
The highest ID in the census data is 31420. To ensure that any IDs used for SD16 aren't assigned to any other data sets (all of which do have unique IDs in the Excel spreadsheets), I wrote a small formula in LibreOffice to generate IDs starting with 31421. In the ID column, I simply entered "=ROW()+31419" for each row.
INSERT INTO "location" VALUES (100, 'BC', 'British Columbia', '13', 'Victoria City', '15', 'Burdette-Church Hill', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (101, 'BC', 'British Columbia', '13', 'Victoria City', '11', 'Jubilee', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (99, 'BC', 'British Columbia', '13', 'Victoria City', '23', 'Humboldt', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (98, 'BC', 'British Columbia', '13', 'Victoria City', '17', 'James Bay North', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
Received the data for Rockland-Foul Bay (SD 12) from PD yesterday and imported today, along with a handful of new occupations.
INSERT INTO "location" VALUES (95, 'BC', 'British Columbia', '13', 'Victoria City', '12', 'Rockland-Foul Bay', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
Received Fernwood-Fort Street from PD and imported. Large dataset of around 3400 records. Added in some new occupation codes.
The loader app is showing its limitations when handling large files such as this one. It could use a rewrite, since it was originally designed for Access files with only cursory support for CSVs. But with so much data already in, it's probably not worth the effort right now.
INSERT INTO "location" VALUES (94, 'BC', 'British Columbia', '13', 'Victoria City', '9', 'Fernwood-Fort Street', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (91, 'BC', 'British Columbia', '13', 'Victoria City', '2', 'Douglas-Gorge', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (92, 'BC', 'British Columbia', '13', 'Victoria City', '6', 'Hillside-Douglas', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (93, 'BC', 'British Columbia', '13', 'Victoria City', '21', 'Quadra-Harris Green', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
Received Hillside-Quadra data from PD and imported successfully to development database/website.
INSERT INTO "location" VALUES (90, 'BC', 'British Columbia', '13', 'Victoria City', '7', 'Hillside-Quadra', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (87, 'BC', 'British Columbia', '13', 'Victoria City', '25', 'James Bay', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (88, 'BC', 'British Columbia', '13', 'Victoria City', '24', 'James Bay Southeast', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (89, 'BC', 'British Columbia', '13', 'Victoria City', '18', 'James Bay Southwest', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (86, 'BC', 'British Columbia', '13', 'Victoria City', '14', 'Fairfield-Rockland', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO "location" VALUES (85, 'BC', 'British Columbia', '13', 'Victoria City', '20', 'Oriental District', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
INSERT INTO occupation (occupation_description, occupation_code) VALUES
('Pool Romm Attendant', '59990'),
('Setter', '81275'),
('Lamp Fitter', '87330'),
('General Finisher', '99999');
And a new location:
INSERT INTO "location" VALUES (84, 'BC', 'British Columbia', '13', 'Victoria City', '5', 'Central Park', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
Received and added some new and updated page copy from PD. Added to both the development and live sites.
- Added new "Notes on Enumeration Categories" page for 1901 census: http://vihistory.ca/content/census/1901/census1901.php?page=enumcats
- Updated 1901 introduction page: http://vihistory.ca/content/census/1901/census1901.php?page=introduction
- Update 1901 main index page (with link to enumeration categories page): http://vihistory.ca/content/census/1901/census1901.php?page=main
Received the latest set of 1911 census data from PD - subdistrict, #1, Vic West. Imported into the development database smoothly and without incident.
Also inserted six new occupations:
INSERT INTO occupation (occupation_description, occupation_code) VALUES ('Pianist', 17140);
INSERT INTO occupation (occupation_description, occupation_code) VALUES ('Show Woman', 17225);
INSERT INTO occupation (occupation_description, occupation_code) VALUES ('Butter Maker', 77530);
INSERT INTO occupation (occupation_description, occupation_code) VALUES ('Gas Fitter', 87120);
INSERT INTO occupation (occupation_description, occupation_code) VALUES ('Cement Worker', 95200);
INSERT INTO occupation (occupation_description, occupation_code) VALUES ('Road Maker', 97415);
And a new location for the sub district:
INSERT INTO "location" VALUES (83, 'BC', 'British Columbia', '13', 'Victoria City', '1', 'Victoria West', NULL, NULL, NULL, NULL, NULL, '1911', NULL, NULL);
Added more views to the PostgreSQL database to generate lists of values that are present in the census data, for the advanced search form.
vOriginsUsed - used for the "Birthplace", "Birthplace of Father", and "Birthplace of Mother" fields:
CREATE VIEW "vOriginsUsed" AS
SELECT origin_id, origin
FROM "origins"
WHERE origin IS NOT NULL
AND origins.origin_id IN (
SELECT "vCensus1881".birthplace_id FROM "vCensus1881"
UNION
SELECT "vCensus1881".birthplace_father_id FROM "vCensus1881"
UNION
SELECT "vCensus1891".birthplace_id FROM "vCensus1891"
UNION
SELECT "vCensus1891".birthplace_father_id FROM "vCensus1891"
UNION
SELECT "vCensus1891".birthplace_mother_id FROM "vCensus1891"
UNION
SELECT "vCensus1901".birthplace_id FROM "vCensus1901"
UNION
SELECT "vCensus1911".birthplace_id FROM "vCensus1911"
)
ORDER BY upper(origin)
The origins view is pretty slow because of the large number of SELECT subqueries.
vNationalitiesUsed - used for the "Nationality" search field:
CREATE VIEW "vNationalitiesUsed" AS
SELECT nationality_id, nationality
FROM "nationalities"
WHERE nationality IS NOT NULL
AND nationalities.nationality_id IN (
SELECT "vCensus1901".nationality_id FROM "vCensus1901"
UNION
SELECT "vCensus1911".nationality_id FROM "vCensus1911"
)
ORDER BY upper(nationality)
vRacesUsed - used for the "Race/Ethnic Origin" search field:
CREATE VIEW "vRacesUsed" AS
SELECT race_id, race_name
FROM "races"
WHERE race_name IS NOT NULL
AND races.race_id IN (
SELECT "vCensus1901".race_id FROM "vCensus1901"
UNION
SELECT "vCensus1911".race_id FROM "vCensus1911"
)
ORDER BY upper(race_name)
vBandsUsed - used for the "First Nations Band" search field:
CREATE VIEW "vBandsUsed" AS
SELECT band_id, band
FROM "band"
WHERE band IS NOT NULL
AND band.band_id IN (
SELECT "vCensus1881".band_id FROM "vCensus1881"
UNION
SELECT "vCensus1891".band_id FROM "vCensus1891"
)
ORDER BY upper(band)
vMotherTonguesUsed - used for the "Mother Tongue", "First Language Spoken", and "Second Language Spoken" search fields:
CREATE VIEW "vMotherTonguesUsed" AS
SELECT mother_tongue_id, mother_tongue
FROM "mother_tongues"
WHERE mother_tongue IS NOT NULL
AND mother_tongues.mother_tongue_id IN (
SELECT "vCensus1901".mother_tongue_id FROM "vCensus1901"
UNION
SELECT "vCensus1911".first_language_id FROM "vCensus1911"
UNION
SELECT "vCensus1911".second_language_id FROM "vCensus1911"
)
ORDER BY upper(mother_tongue)
CREATE VIEW "vOccupationsUsed" AS
SELECT DISTINCT ON (upper(occupation_description)) occupation_description AS occupation_description, occupation_id, occupation_code
FROM "occupation"
WHERE
occupation_description IS NOT NULL
AND
(
(occupation.occupation_code IN (SELECT "vOccupations1881".code FROM "vOccupations1881" WHERE "vOccupations1881".count1881 > 0))
OR
(occupation.occupation_code IN (SELECT "vOccupations1891".code FROM "vOccupations1891" WHERE "vOccupations1891".count1891 > 0))
OR
(occupation.occupation_code IN (SELECT "vOccupations1901".code FROM "vOccupations1901" WHERE "vOccupations1901".count1901 > 0))
OR
(occupation.occupation_code IN (SELECT "vOccupations1911".code FROM "vOccupations1911" WHERE "vOccupations1911".count1911 > 0))
)
ORDER BY upper(occupation_description)
Since location_id isn't part of the 1911 records insert, but sub_district_id is, use this SQL to update the location_id after you've put a row into the location table corresponding to the new sub-district:
UPDATE census_1911 SET location_id = (SELECT location.location_id FROM location WHERE location_02_id = '13' AND location_03_id = CAST(census_1911.sub_district_id AS character varying(50)) AND census_year = '1911')
The loader now calls pg_insert() or pg_update(), depending on whether any conditions exist. Aside from allowing for condition support, this change also keeps the data more organized and takes out the guesswork involved in building raw SQL queries via a for loop.
I also had to edit the _fix_field() function in inc/lib.inc to stop the automatic escaping of data values into single quotes (e.g. 'data'). pg_insert() and pg_update() escape automatically, so the code in _fix_field() was resulting in double escaping.
PD recently sent me an updated version of the 1901 data with expanded address information, namely street number and street name separated into two fields. This will allow us to have street searching for both 1901 and 1911. However, the loader application didn't support updating existing rows, only importing new ones. So, I had to build update capabilities.
The loader now supports "conditionKey" lines, which can be used to specify conditions that need to be met in order to perform an update. The format of the line is:
conditionKey=SourceDataColumnName:DatabaseTableColumnName
For example, this is the condition key line that I specified when importing the new address data for 1901:
conditionKey=census_record_id:census_record_id
The field on the left of the colon tells the loader to look at the value of the census_record_id
column in the source data CSV, while the field on the right side of the colon tells the loader to check that value against the census_record_id
column in the database table.
Internally, adding one or more conditions causes the loader to use the pg_update()
function instead of the pg_insert
function for the row in question.
Multiple conditionKey lines may be specified.
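Put together, the row-handling logic inside the loader presumably looks something like this sketch (the function and variable names are mine, and I'm assuming the row array is already keyed by database column names; pg_insert() and pg_update() are the real PHP functions mentioned above):
<?php
// Sketch only: write one row, inserting when there are no conditions and
// updating when one or more conditionKey lines were specified.
function write_row($conn, $table, array $row, array $conditionColumns) {
    if (count($conditionColumns) === 0) {
        return pg_insert($conn, $table, $row); // no conditions: plain insert, as before
    }
    $condition = array();
    foreach ($conditionColumns as $col) {
        $condition[$col] = $row[$col]; // e.g. census_record_id
    }
    return pg_update($conn, $table, $row, $condition);
}
?>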
Added 11 new fields to the views for each census year. The only view that actually contains data in these fields is 1911; the other years just have NULL values. All of the census year views (vCensus1881 - 1911) need to have the same list of columns since UNION queries are performed during census searches.
The new fields in the vCensus1911 view:
census_1911.employment_state,
census_1911.other_employment,
census_1911.weeks_working,
census_1911.hours_working,
census_1911.hourly_wage,
census_1911.weeks_working_other_employment,
census_1911.hours_working_other_employment,
census_1911.life_insurance,
census_1911.health_insurance,
census_1911.insurance_cost,
census_1911.education_cost
The new fields in the other views:
NULL::character varying(255) AS employment_state,
NULL::character varying(100) AS other_employment,
NULL::integer AS weeks_working,
NULL::integer AS hours_working,
NULL::numeric(6,2) AS hourly_wage,
NULL::integer AS weeks_working_other_employment,
NULL::integer AS hours_working_other_employment,
NULL::numeric(8,2) AS life_insurance,
NULL::numeric(8,2) AS health_insurance,
NULL::character varying(100) AS insurance_cost,
NULL::numeric(12,2) AS education_cost
Finished the first pass at the vCensus1911 view, which is the main census query view for the census_1911 table. Since all of the census year views need to have exactly the same columns, I started with the vCensus1901 view and modified the column definitions to suit the new data. Another consequence is that, because view definitions need to be the same for every year, any new data columns in the 1911 view need to be retroactively put into the earlier views, usually as NULL returns. So far, I've put in just two: secondary_language_id and second_language. The 1911 census is the first dataset to have a second language option, so the earlier views need to have NULL fields for the second language columns. The NULL fields look like this:
NULL::integer AS secondary_language_id,
NULL::varchar(100) AS second_language
The vCensus1911 view, in all its glory, looks like this:
CREATE VIEW "vCensus1911" AS
SELECT
census_1911.census_record_id,
'1911'::character varying(4) AS census,
census_1911.location_id,
census_1911.family_id,
NULL::character varying(10) AS case_number,
initcap(census_1911.family_name) AS family_name,
initcap(census_1911.given_name) AS given_name,
NULL::character varying(100) AS middle_name,
census_1911.sex,
census_1911.age,
census_1911.birthyear::double precision,
census_1911.birthmonth::double precision,
NULL::double precision as birthday,
census_1911.marital_status_id,
census_1911.relationship_to_head_id,
census_1911.birthplace_id,
NULL::integer AS birthplace_father_id,
NULL::integer AS birthplace_mother_id,
NULL::boolean AS french_canadian,
census_1911.occupation_id,
census_1911.religion_id,
NULL::integer AS building_id,
NULL::integer AS construction_id,
NULL::smallint AS floors,
NULL::smallint AS rooms,
census_1911.employment,
NULL::integer AS employees,
census_1911.read,
census_1911.write,
NULL::boolean AS school,
census_1911.months_school,
NULL::integer AS infirmities_id,
census_1911.comments,
NULL::integer AS band_id,
census_1911.earnings,
census_1911.earnings_extra,
NULL::integer AS earnings_period_id,
(census_1911.earnings + census_1911.earnings_extra)::numeric(6,2) AS total_earnings,
census_1911.english,
census_1911.french,
census_1911.mother_tongue_id,
census_1911.immigration_year,
census_1911.naturalization_year::smallint AS naturalization_year,
NULL::integer AS birthplace_type_id,
NULL::integer AS colour_id,
census_1911.race_id,
census_1911.nationality_id,
initcap(census_1911.address) AS address,
census_1911.age_months,
location.location_01_id,
location.location_01,
location.location_02_id,
location.location_02,
location.location_03_id,
location.location_03,
location.location_04_id,
location.location_04,
location.comments AS loc_comments,
marital_status.marital_status,
birth.origin AS birth_place,
NULL::character varying(255) AS birth_place_father,
NULL::character varying(255) AS birth_place_mother,
occupation.occupation_code,
occupation.occupation_description,
religion.religion,
NULL::character varying(100) AS building,
NULL::character varying(100) AS construction,
array_to_string(array(SELECT infirmities.infirmities FROM infirmities INNER JOIN census_record_infirmities ON census_record_infirmities.census_year = '1911' AND census_record_infirmities.infirmities_id = infirmities.infirmities_id INNER JOIN census_1911 AS temp_census ON temp_census.census_record_id = census_record_infirmities.census_record_id WHERE census_record_infirmities.census_record_id = census_1911.census_record_id), ', ')::character varying(100) AS infirmities,
relationship_to_head.relationship_to_head,
NULL::character varying(15) AS earning_period_name,
mother_tongues.mother_tongue,
NULL::character varying(15) AS birthplace_type,
NULL::character varying(20) AS colour,
races.race_name,
nationalities.nationality,
NULL::character varying(255) AS band,
census_1911.secondary_language_id,
secondary.mother_tongue AS second_language,
( SELECT count(annotations.annotation_id) AS count
FROM annotations
WHERE annotations.table_id = 7 AND annotations.record_id = census_1911.census_record_id) AS has_annote
FROM census_1911
LEFT JOIN location ON location.location_id = census_1911.location_id
LEFT JOIN marital_status ON marital_status.marital_status_id = census_1911.marital_status_id
LEFT JOIN relationship_to_head ON relationship_to_head.relationship_to_head_id = census_1911.relationship_to_head_id
LEFT JOIN origins birth ON birth.origin_id = census_1911.birthplace_id
LEFT JOIN occupation ON occupation.occupation_id = census_1911.occupation_id
LEFT JOIN religion ON religion.religion_id = census_1911.religion_id
LEFT JOIN mother_tongues ON mother_tongues.mother_tongue_id = census_1911.mother_tongue_id
LEFT JOIN mother_tongues secondary ON secondary.mother_tongue_id = census_1911.secondary_language_id
LEFT JOIN races ON races.race_id = census_1911.race_id
LEFT JOIN nationalities ON nationalities.nationality_id = census_1911.nationality_id;
The SELECT GROUP_CONCAT subquery for Martin's Adaptive DB (see http://hcmc.uvic.ca/blogs/index.php?blog=44&p=7646&more=1&c=1&tb=1&pb=1), which is what I had planned to use, wouldn't fly because PostgreSQL doesn't support GROUP_CONCAT. Luckily, there's another (and arguably better) solution using the array_to_string() function together with an ARRAY subquery:
array_to_string(array(SELECT infirmities.infirmities FROM infirmities INNER JOIN census_record_infirmities ON census_record_infirmities.census_year = '1911' AND census_record_infirmities.infirmities_id = infirmities.infirmities_id INNER JOIN census_1911 AS temp_census ON temp_census.census_record_id = census_record_infirmities.census_record_id WHERE census_record_infirmities.census_record_id = census_1911.census_record_id), ', ') AS infirmities
So, the SELECT query, which returns 0 or more rows with one column each, is cast as an array. That array is then converted to a string, using the separator defined at the end of the function to join the values together.
Created some views for the 1911 data, to match previous years (haven't yet created the big search view for 1911):
Religion count:
CREATE VIEW "vReligions1911" AS
SELECT religion.religion_id AS code, religion.religion AS name, count(census_1911.census_record_id) AS count1911 FROM (religion LEFT JOIN census_1911 ON ((census_1911.religion_id = religion.religion_id))) GROUP BY religion.religion_id, religion.religion;
Occupation count:
CREATE VIEW "vOccupations1911" AS
SELECT occupation.occupation_code AS code, occupation.occupation_description AS name, count(census_1911.census_record_id) AS count1911 FROM (occupation LEFT JOIN census_1911 ON ((census_1911.occupation_id = occupation.occupation_id))) GROUP BY occupation.occupation_code, occupation.occupation_description;
Location count (2 views):
CREATE VIEW "vCensus1911LocCount" AS
SELECT census_1911.location_id, count(census_1911.census_record_id) AS loc_count FROM census_1911 GROUP BY census_1911.location_id;
CREATE VIEW "vLocation1911Count" AS
SELECT location.location_id, location.location_01_id, location.location_01, location.location_02_id, location.location_02, location.location_03_id, location.location_03, location.location_04_id, location.location_04, location.comments, location.notes, location.map_name, location.census_year, "vCensus1911LocCount".loc_count FROM (location LEFT JOIN "vCensus1911LocCount" ON (("vCensus1911LocCount".location_id = location.location_id))) WHERE ((location.census_year)::text = '1911'::text);
I further expanded the loader application and Loader class to do an optional check for existing records before importing data. This is useful when importing new values to an existing database table using a dataset that has existing as well as new data. I wrote this feature because PD sent me occupation and relationship-to-head code CSVs which contain both codes that are already in the database tables (occupation and relationship_to_head respectively) from previous census years and new codes needed for the 1911 census. I couldn't just delete the existing data in the tables and do a fresh re-import because the tables contain some extra data not in PD's spreadsheets. So, I devised a feature for the loader that allows for one or more "existing key" lines to be specified in the map file for the dataset. The value of each specified key in the row of data is queried against the database. Multiple keys may be specified to check multiple values.
The format of the line is:
checkForExisting=SourceDataColumnName:DatabaseTableColumnName
For example, this is my map file for the new occupation data CSV file:
occupation_id='DEFAULT'
occupation_code=*pad(occupation_code,5,'0','left')
occupation_description=occupation_description
checkForExisting=occupation_code:occupation_code
Two fields are imported, occupation_code and occupation_description (they have the same names in both the database table and the CSV). occupation_id is generated as a 'DEFAULT' (i.e. auto-increment) value. The checkForExisting line specifies that, before importing a row, the value of that row's occupation_code (the left side of the colon) column will be checked against the database table's occupation_code (the right side of the colon). If there's a match then the row is skipped.
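Roughly, the check boils down to a small existence query per row. The sketch below is an approximation of that idea, not the Loader's actual code; names like $checks and row_already_exists() are my own.
<?php
// Approximate sketch of the checkForExisting logic. $checks maps CSV column
// names (left of the colon) to database column names (right of the colon);
// this sketch treats a match on any key as "already existing".
function row_already_exists($db, $table, $csvRow, $checks)
{
    foreach ($checks as $csvColumn => $dbColumn) {
        // Table and column names come from the map file, so they are trusted;
        // the value itself is passed as a query parameter.
        $sql    = "SELECT 1 FROM {$table} WHERE {$dbColumn} = $1 LIMIT 1";
        $result = pg_query_params($db, $sql, array($csvRow[$csvColumn]));
        if ($result && pg_num_rows($result) > 0) {
            return true;  // match found, so the loader skips this row
        }
    }
    return false;  // no match on any key: safe to import
}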
I've put the documentation for this feature in the Loader manual (basically word-for-word from this blog post).
I expanded the loader application's error/warning log system to use the Loader class I previously wrote. Warnings and errors are now gathered into arrays and then printed at the end, rather than during the iteration over rows. Warnings and errors are also now organized into 'line', 'message', 'type', and 'SQL' values in the Loader class. I also added an option to show either all errors/warnings complete with line numbers and SQL, or just a list of distinct errors, which will be useful for quickly assessing the variety of errors/warnings in an import.
I also modified the logging process so that, instead of having to specify a target directory for the log files each time the loader application is used, the logs are just saved to a "logs" subdirectory within the loader app. Much simpler and more intuitive.
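A rough sketch of that collect-then-report structure follows; the class, property, and method names here are my own approximations, not necessarily what the Loader class actually uses.
<?php
// Rough sketch of the error/warning collection described above.
class LoaderLog
{
    private $entries = array();

    // Record one problem instead of printing it during the row iteration.
    public function add($type, $line, $message, $sql = '')
    {
        $this->entries[] = array(
            'line'    => $line,     // source data line number
            'message' => $message,
            'type'    => $type,     // 'error' or 'warning'
            'SQL'     => $sql,
        );
    }

    // Print everything at the end: either full detail or distinct messages only.
    public function report($distinctOnly = false)
    {
        if ($distinctOnly) {
            $seen = array();
            foreach ($this->entries as $entry) {
                $seen[$entry['message']] = true;
            }
            foreach (array_keys($seen) as $message) {
                echo $message . "\n";
            }
            return;
        }
        foreach ($this->entries as $entry) {
            echo "[{$entry['type']}] line {$entry['line']}: {$entry['message']}"
               . ($entry['SQL'] !== '' ? " ({$entry['SQL']})" : '') . "\n";
        }
    }
}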
I've completed (I think) the necessary modifications to the handy Loader application originally written by DB, which was designed to import census data from an MS Access database or CSV file into a PostgreSQL table. A complete description of the application can be found in the Loader Manual document which is distributed with the program. Because both the format of the data and the development environment have changed substantially - we now receive the raw data from PD in CSV format, and we're developing on Linux instead of Windows (which is what DB was using) - quite a few changes were needed. I also made a few general improvements along the way.
General Improvements
- Completed the CSV importing portion of the code, which was never finished (because DB was focused on Access), and made that the default option rather than MS Access
- Created a Loader class (inc/loader.php) to abstract some of the functionality previously "spaghettied" in the main loader file, mainly loading the datasource and gathering all of the rows. The interface for both of these tasks, among others, is now the same whether a CSV file or Access DB is used
Mapping Function Improvements
I wrote some new mapping functions and modified some existing ones to help me import the 1911 data (complete documentation is in the Loader manual; all of these functions are in inc/lib.inc):
- Wrote a new _existsMultiple() function that works like the existing _exists() function but allows for testing multiple values (e.g. test whether a value exists in any specified row)
- Made the existing _is_true() function, which was just being used by other loader map functions, available for direct use in map files. It will test fields for human-readable true/false values ("Y", "yes", "N", "no", etc.) and return a boolean value
- Wrote a new _fractionalMonth() function, which takes a value like "3/12" and returns "3". This was written because the 1911 data from PD uses that format (3/12 and similar) to denote the age in months for infants
- Modified the existing _lookup() function to accept a string literal as a search value, rather than a reference to a source data column name
- Wrote a new _occupation() function to deal with both 4 and 5 digit occupation codes. The function formats the value to 5 digits and then passes everything on to _lookup()
One To Many Relationships
The Loader application did not support one-to-many (OTM) relationships. So, for example, if a census record could have multiple infirmities (i.e. census record -> has many infirmities), there was no way to properly record these multiple infirmities. Instead, the data was shoehorned into the census year data tables in an awkward way. So, for a more accurate representation of the 1911 (and beyond) data, in consultation with MH I decided to write OTM relationship support into the Loader application and mapping files.
The new system for mapping OTM relationships is somewhat convoluted and complex, but because I don't have the time or desire to rewrite the Loader application, I had to work within its confines. It uses the value of a specified CSV column and performs a lookup on a secondary table to grab the ID associated with that value. Then, it inserts that ID, along with the census record ID and any other data you want to pass, into the one to many lookup table.
A line looks like this (all one line):
oneToMany:RawDataColumn=one_to_many__table;target_column_in_otm_table;lookup_table,lookup_table_key_fields,lookup_table_return_field;extra_key:val_data
Or, a better concrete example (again, all one line):
oneToMany:Blind=census_record_infirmities;infirmities_id;infirmities,infirmities:"Blind",infirmities_id;census_year:1911
How it works:
First, the line is split at the '=', resulting in two arrays which we'll call LeftSide and RightSide. LeftSide[1] gives us the CSV column from which we'll be taking our data. RightSide contains all of the other info.
RightSide is itself exploded with a semicolon separator, and breaks down like this:
- 0: the table holding the OTM relationship records. In this case, census_record_infirmities is used for the census record -> has many infirmities relationship.
- 1: the name of the column in the OTM table, the value for which will be generated from LeftSide[1] (the data in the CSV column)
- 2: the information for the ID lookup. The arguments here, comma-separated, follow the same conventions as the _lookup() function (which is used to get the value we want - documented in the Loader manual).
- 3: An optional list of comma-separated key:value pairs which will be inserted into the OTM table without modification. In the above example, census_year:1911 means that "1911" will be inserted into the census_record_infirmities.census_year column for each record.
Here are all four infirmity oneToMany lines from my 1911 map file (a rough parsing sketch follows them):
oneToMany:Blind=census_record_infirmities;infirmities_id;infirmities,infirmities:"Blind",infirmities_id;census_year:1911
oneToMany:DeafAndDumb=census_record_infirmities;infirmities_id;infirmities,infirmities:"Deaf and Dumb",infirmities_id;census_year:1911
oneToMany:CrazyOrLunatic=census_record_infirmities;infirmities_id;infirmities,infirmities:"Crazy",infirmities_id;census_year:1911
oneToMany:IdioticOrSilly=census_record_infirmities;infirmities_id;infirmities,infirmities:"Helpless",infirmities_id;census_year:1911
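And here is that rough illustration of how such a line breaks apart. The variable names are my own assumptions, not the Loader's actual parsing code; $line is the part after the "oneToMany:" prefix.
<?php
// Rough illustration of how a oneToMany map line breaks apart.
$line = 'Blind=census_record_infirmities;infirmities_id;'
      . 'infirmities,infirmities:"Blind",infirmities_id;census_year:1911';

list($csvColumn, $rightSide) = explode('=', $line, 2);
$parts = explode(';', $rightSide);

$otmTable     = $parts[0];                // census_record_infirmities
$targetColumn = $parts[1];                // infirmities_id
$lookupSpec   = explode(',', $parts[2]);  // args for _lookup(): table, key:value, return field
$extraPairs   = isset($parts[3]) ? explode(',', $parts[3]) : array();  // e.g. census_year:1911

// If the CSV row's $csvColumn value tests true, the ID returned by the lookup
// (plus the extra key:value pairs) is inserted as a new row in $otmTable,
// keyed to the current census record.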
Thanks to PD, I've found the resource for the majority of standardized codes used in the database. The Canadian Families Project has a PDF user guide for its 1901 census database: http://web.uvic.ca/hrd/cfp/data/1901%20Census%20User%20Guide.pdf
The VIHistory database uses the following codes from the guide (found under the header 'Standard Variable Codes'):
- Birth place
- Infirmity
- Mother Tongue
- Nationality
- Race
- Religion
IMPORTANT NOTE: the codes in the guide for occupation and relationship to head are NOT used in the database. Instead:
- Relationship to head: this seems to be a custom system (unless I hear otherwise from PD or JL)
- Occupation: the occupation codes follow the ISCO 1968 classification system
Work done for March 15th includes:
census_1911 table schema:
- Changed family_id from integer to character varying(15) to allow for alphanumeric values such as "151a". Since the ID doesn't actually refer to anything else, it's fine to have a varchar value.
- Dropped the secondary_occupation_id column and added an other_employment character varying(100) column in its place. The OtherEmployment CSV column doesn't follow the same conventions as the ChiefOccupation column, so the occupation codes don't apply. The new other_employment column is simply a direct import of the OtherEmployment text.
- Added weeks_working_other_employment and hours_working_other_employment integer columns, which I forgot to make during the initial creation.
Occupation data:
- Inserted new rows into the occupation table (name (code)): Shipwright (87450), Master Mariner (98819), Delivery Man (98636), Secretary, School Board (39200), Teacher (13340)
Made a couple of schema changes to the census_1911 table today:
- Changed naturalization_year from smallint to character varying(50) to account for non-integer values in the data such as "*", "N/A", etc.
- Added a temporary sub_district_id field to store the subdistrict data. Once we receive complete census tract info for the locations table, this column will be used to populate the location_id column and then deleted.
The method for importing raw spreadsheet data into the PostgreSQL database involves making a text file with a number of "mapping" instructions for each column in the spreadsheet. These mapping files are used by the custom data importing application, "Loader", originally written by DB. For example, to map the "family_id" SQL column to the "FamilyName" CSV column, one would enter:
family_id:FamilyName
That tells the data importer to insert the value of the "FamilyName" raw data column into the "family_id" SQL table column. There's support for a number of functions as well, which transform the CSV data. The complete format and documentation for the mapping files is available in the ViHistory technical documentation.
The mapping of the 1911 data, which I'm still working on, has been fairly straightforward, with a few wrinkles:
- I wrote a new function for Loader called _existsMultiple(), which is an extended version of the existing _exists() function that can accept multiple semicolon-separated field names and/or values
- I made the existing function _is_true(), which was only used internally by other Loader functions, available for use in the map files to convert Y/N and other similar columns to boolean values
- Wrote a new _fractionalMonth() Loader function to find a value like "3/12" and return "3". This is because infants' ages have been represented as a fraction, and the month needs to be extracted to populate the age_months column (a rough sketch of the idea appears after this list)
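Here's a rough sketch of what that fractional-month handling amounts to. This is my own approximation; the real _fractionalMonth() lives in inc/lib.inc and may handle more edge cases.
<?php
// Approximation of the _fractionalMonth() behaviour described above.
function fractional_month($value)
{
    if (strpos($value, '/') !== false) {
        // "3/12" -> "3": keep only the numerator (the age in months).
        $pieces = explode('/', $value, 2);
        return trim($pieces[0]);
    }
    return $value;  // non-fractional values pass through untouched
}

echo fractional_month('3/12');  // prints 3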
I'm still waiting for PD to answer some questions that are preventing me from completing the mapping.
Just sent PD and JL two more questions about the 1911 data:
Multiple races and nationalities: for the race (RaceTribeOrOrigin in your spreadsheets) and nationality (Nationality in your spreadsheets) data, the Rock Bay dataset has some rows with multiple entries, such as "Canadian;American" for nationality and "Canadian;Welsh" for race. The existing nationalities and races data in the database doesn't currently have any dual entries like this - they're all single races or nationalities. Should we just create new races and nationalities to match the data (e.g. "Canadian-Welsh"), or do we need a more complex system that allows, for example, one census record to be associated with both the Canadian AND the Welsh races (see my explanation of "one to many" below)?
Infirmities: the infirmities data is tricky. In the 1911 data we have Blind, DeafAndDumb, CrazyOrLunatic, and IdioticOrSilly fields. Theoretically, one person may be all of these things (or at least a couple). The issue is that, in the existing census years, there's only one "infirmities_id" field, which corresponds to just a single infirmity. If a census record has more than one infirmity (e.g. Blind and CrazyOrLunatic), then it's just noted as a "Two or more of the above" infirmity, which means that the actual infirmity data itself is lost. This is a consequence of the current table structure, which only allows for one infirmity to be associated with each record.
So, we have a few options:
#1: keep the status quo; don't change how infirmities are stored (i.e. record either one specific infirmity or the "Two or more of the above" option).
#2: have separate boolean table columns for each infirmity type. The main consequence here is that, if the infirmities change from year to year, table structure will fluctuate wildly.
#3: use a third table to keep track of which infirmities belong to each record. This is called a "one to many" relationship and allows one record to be associated with many infirmities, without the need to have individual columns for each infirmity. This option is the most flexible, since it allows for an indefinite number of infirmities.
After some consultation with Martin, we believe that #3 is the best option for this project. It will result in more work (for me) than #1 or #2, but is a huge step towards normalized data, and, as I mentioned, is by far the most flexible of the options.
After a thorough examination of the sample 1911 census data given to me by PD, I've created a census_1911 table in the PostgreSQL database for the new data set. I used the census_1901 table as a starting point and added/subtracted columns as needed to match the new data. The 1911 table has 45 columns vs. 33 in the 1901 table, but that may change as we get more of the data; PD is sending us a spreadsheet for each subdistrict so he may be refining the format along the way.
I documented the census_1911 table with SQL comments. Here's the CREATE TABLE query, with comments:
CREATE TABLE census_1911
(
census_record_id serial NOT NULL, -- Unique ID
location_id integer, -- Foreign key reference to location table; generated from "SubDistrictNo" CSV column
family_id integer, -- Direct import of "FamilyNumber" CSV column
family_name character varying(100), -- Direct import of "lastName" CSV column
given_name character varying(100), -- Direct import of "FirstName" CSV column
sex character(1), -- Direct import of "Sex" CSV column, either "M" or "F"
age smallint, -- Direct import of "AgeAtLastBirthday" CSV column
marital_status_id integer, -- Conversion of "SingleMarriedWidowed" CSV column from text to marital_status foreign key reference
relationship_to_head_id integer, -- Foreign key reference to relationship_to_head table; "relHeadCode" CSV column corresponds to relationship_to_head.relationship_to_head_id
birthplace_id integer, -- Foreign key reference to origins table; generated from "PlaceOfBirth" CSV text column
occupation_id integer, -- Foreign key reference to occupation table; "viOccupationCode" CSV column corresponds to occupation.occupation_code
religion_id integer, -- Foreign key reference to religion table; "religionCode" CSV column corresponds to religion.religion_id
"read" boolean, -- Conversion of "CanRead" CSV column from Y/N to boolean field
"write" boolean, -- Conversion of "CanWrite" CSV column from Y/N to boolean field
infirmities_id integer, -- Foreign key reference to infirmities table; amalgamation of "Blind", "DeafAndDumb", "CrazyOrLunatic", and "IdioticOrSilly" Y/N CSV columns into a single field
comments text, -- Direct import of "Notes" CSV column
months_school smallint, -- Direct import of "MonthsInSchool" CSV column
earnings numeric(12,2), -- Direct import of "EarningsAtChiefOcc" CSV column
english boolean, -- Generated from presence of "ENGLISH" in "LanguageSpoken1" or "LanguageSpoken2" CSV columns
french boolean, -- Generated from presence of "FRENCH" in "LanguageSpoken1" or "LanguageSpoken2" CSV columns
immigration_year smallint, -- Direct import of "YearOfImmigration" CSV column
naturalization_year smallint, -- Direct import of "YearOfNaturalization" CSV column
earnings_extra numeric(6,2), -- Direct import of "EarningsAtOtherOcc" CSV column
race_id integer, -- Foreign key reference to races table; generated from "RaceTribeOrOrigin" CSV column
nationality_id integer, -- Foreign key reference to nationalities table; generated from "Nationality" CSV column
mother_tongue_id integer, -- Foreign key reference to mother_tongues table; generated from "LanguageSpoken1" CSV column
address character varying(255), -- Direct import of "Habitation" CSV column
age_months smallint, -- Import of "AgeAtLastBirthday" CSV column, but only if the value is fractional (i.e. below 1)
employment character varying(60), -- Amalgamation of "Employer", "Employee", and "WorkingOnOwnAccount" Y/N CSV columns to single, comma-separated text string
education_cost numeric(12,2), -- Direct import of "CostOfEducation" CSV column
life_insurance numeric(8,2), -- Direct import of "LifeInsurance" CSV column
health_insurance numeric(8,2), -- Direct import of "HealthInsurance" CSV column
insurance_cost character varying(100), -- Direct import of "CostOfInsurance" CSV column
weeks_working integer, -- Direct import of "WeeksWorkingChiefOcc" CSV column
hours_working integer, -- Direct import of "HoursWorkingChiefOcc" CSV column
hourly_wage numeric(6,2), -- Direct import of "RateOfEarningPerHour" CSV column
secondary_occupation_id integer, -- Foreign key reference to occupation table; generated from "OtherEmployment" CSV column
secondary_language_id integer, -- Foreign key reference to mother_tongues table; generated from "LanguageSpoken2" CSV column
employment_state character varying(255), -- Direct import of "StateWherePersonIsEmployed" CSV column
street_number character varying(50), -- Direct import of "StreetNumber" CSV column
street_name character varying(100), -- Direct import of "StreetName" CSV column
religion_text character varying(100), -- Direct import of "Religion" CSV column
occupation_text character varying(100), -- Direct import of "ChiefOccupation" CSV column
birthmonth integer, -- Conversion of "MonthofBirth" text CSV column to PostgreSQL month integer
birthyear integer, -- Direct import of "YearofBirth" CSV column
CONSTRAINT census_1911_pkey PRIMARY KEY (census_record_id),
CONSTRAINT census_1911_birthplace_id_fkey FOREIGN KEY (birthplace_id)
REFERENCES origins (origin_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT census_1911_infirmities_id_fkey FOREIGN KEY (infirmities_id)
REFERENCES infirmities (infirmities_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT census_1911_location_id_fkey FOREIGN KEY (location_id)
REFERENCES "location" (location_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT census_1911_marital_status_id_fkey FOREIGN KEY (marital_status_id)
REFERENCES marital_status (marital_status_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT census_1911_mother_tongue_id FOREIGN KEY (mother_tongue_id)
REFERENCES mother_tongues (mother_tongue_id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT census_1911_nationality_id FOREIGN KEY (nationality_id)
REFERENCES nationalities (nationality_id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT census_1911_occupations_id_fkey FOREIGN KEY (occupation_id)
REFERENCES occupation (occupation_id) MATCH SIMPLE
ON UPDATE RESTRICT ON DELETE RESTRICT,
CONSTRAINT census_1911_race_id FOREIGN KEY (race_id)
REFERENCES races (race_id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT census_1911_relationship_to_head_id_fkey FOREIGN KEY (relationship_to_head_id)
REFERENCES relationship_to_head (relationship_to_head_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT census_1911_religion_id_fkey FOREIGN KEY (religion_id)
REFERENCES religion (religion_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
OIDS=TRUE
);
COMMENT ON COLUMN census_1911.census_record_id IS 'Unique ID';
COMMENT ON COLUMN census_1911.location_id IS 'Foreign key reference to location table; generated from "SubDistrictNo" CSV column';
COMMENT ON COLUMN census_1911.family_id IS 'Direct import of "FamilyNumber" CSV column';
COMMENT ON COLUMN census_1911.family_name IS 'Direct import of "lastName" CSV column';
COMMENT ON COLUMN census_1911.given_name IS 'Direct import of "FirstName" CSV column';
COMMENT ON COLUMN census_1911.sex IS 'Direct import of "Sex" CSV column, either "M" or "F"';
COMMENT ON COLUMN census_1911.age IS 'Direct import of "AgeAtLastBirthday" CSV column';
COMMENT ON COLUMN census_1911.marital_status_id IS 'Conversion of "SingleMarriedWidowed" CSV column from text to marital_status foreign key reference';
COMMENT ON COLUMN census_1911.relationship_to_head_id IS 'Foreign key reference to relationship_to_head table; "relHeadCode" CSV column corresponds to relationship_to_head.relationship_to_head_id';
COMMENT ON COLUMN census_1911.birthplace_id IS 'Foreign key reference to origins table; generated from "PlaceOfBirth" CSV text column';
COMMENT ON COLUMN census_1911.occupation_id IS 'Foreign key reference to occupation table; "viOccupationCode" CSV column corresponds to occupation.occupation_code';
COMMENT ON COLUMN census_1911.religion_id IS 'Foreign key reference to religion table; "religionCode" CSV column corresponds to religion.religion_id';
COMMENT ON COLUMN census_1911."read" IS 'Conversion of "CanRead" CSV column from Y/N to boolean field';
COMMENT ON COLUMN census_1911."write" IS 'Conversion of "CanWrite" CSV column from Y/N to boolean field';
COMMENT ON COLUMN census_1911.infirmities_id IS 'Foreign key reference to infirmities table; amalgamation of "Blind", "DeafAndDumb", "CrazyOrLunatic", and "IdioticOrSilly" Y/N CSV columns into a single field';
COMMENT ON COLUMN census_1911.comments IS 'Direct import of "Notes" CSV column';
COMMENT ON COLUMN census_1911.months_school IS 'Direct import of "MonthsInSchool" CSV column';
COMMENT ON COLUMN census_1911.earnings IS 'Direct import of "EarningsAtChiefOcc" CSV column';
COMMENT ON COLUMN census_1911.english IS 'Generated from presence of "ENGLISH" in "LanguageSpoken1" or "LanguageSpoken2" CSV columns';
COMMENT ON COLUMN census_1911.french IS 'Generated from presence of "FRENCH" in "LanguageSpoken1" or "LanguageSpoken2" CSV columns';
COMMENT ON COLUMN census_1911.immigration_year IS 'Direct import of "YearOfImmigration" CSV column';
COMMENT ON COLUMN census_1911.naturalization_year IS 'Direct import of "YearOfNaturalization" CSV column';
COMMENT ON COLUMN census_1911.earnings_extra IS 'Direct import of "EarningsAtOtherOcc" CSV column';
COMMENT ON COLUMN census_1911.race_id IS 'Foreign key reference to races table; generated from "RaceTribeOrOrigin" CSV column';
COMMENT ON COLUMN census_1911.nationality_id IS 'Foreign key reference to nationalities table; generated from "Nationality" CSV column';
COMMENT ON COLUMN census_1911.mother_tongue_id IS 'Foreign key reference to mother_tongues table; generated from "LanguageSpoken1" CSV column';
COMMENT ON COLUMN census_1911.address IS 'Direct import of "Habitation" CSV column';
COMMENT ON COLUMN census_1911.age_months IS 'Import of "AgeAtLastBirthday" CSV column, but only if the value is fractional (i.e. below 1)';
COMMENT ON COLUMN census_1911.employment IS 'Amalgamation of "Employer", "Employee", and "WorkingOnOwnAccount" Y/N CSV columns to single, comma-separated text string';
COMMENT ON COLUMN census_1911.education_cost IS 'Direct import of "CostOfEducation" CSV column';
COMMENT ON COLUMN census_1911.life_insurance IS 'Direct import of "LifeInsurance" CSV column';
COMMENT ON COLUMN census_1911.health_insurance IS 'Direct import of "HealthInsurance" CSV column';
COMMENT ON COLUMN census_1911.insurance_cost IS 'Direct import of "CostOfInsurance" CSV column';
COMMENT ON COLUMN census_1911.weeks_working IS 'Direct import of "WeeksWorkingChiefOcc" CSV column';
COMMENT ON COLUMN census_1911.hours_working IS 'Direct import of "HoursWorkingChiefOcc" CSV column';
COMMENT ON COLUMN census_1911.hourly_wage IS 'Direct import of "RateOfEarningPerHour" CSV column';
COMMENT ON COLUMN census_1911.secondary_occupation_id IS 'Foreign key reference to occupation table; generated from "OtherEmployment" CSV column';
COMMENT ON COLUMN census_1911.secondary_language_id IS 'Foreign key reference to mother_tongues table; generated from "LanguageSpoken2" CSV column';
COMMENT ON COLUMN census_1911.employment_state IS 'Direct import of "StateWherePersonIsEmployed" CSV column';
COMMENT ON COLUMN census_1911.street_number IS 'Direct import of "StreetNumber" CSV column';
COMMENT ON COLUMN census_1911.street_name IS 'Direct import of "StreetName" CSV column';
COMMENT ON COLUMN census_1911.religion_text IS 'Direct import of "Religion" CSV column';
COMMENT ON COLUMN census_1911.occupation_text IS 'Direct import of "ChiefOccupation" CSV column';
COMMENT ON COLUMN census_1911.birthmonth IS 'Conversion of "MonthofBirth" text CSV column to PostgreSQL month integer';
COMMENT ON COLUMN census_1911.birthyear IS 'Direct import of "YearofBirth" CSV column';
PD gave me the 1911 census data for the Rock Bay subdistrict yesterday. After a thorough examination of the data and a comparison to previous years, some questions arose. I sent PD and JL these questions via email (text is unedited):
Employment types:
There are three fields: "Employer", "Employee", and "WorkingOnOwnAccount". It looks like these can have three possible values: "Y", "N", and blank. My first instinct was to store these as three separate fields in the 1911 census table, but, looking at the data for previous years, these are actually stored as a single text field, separated by commas, called "employment". For example, a 1901 record might have the text "Employer, Own Account" in the "employment" field instead of separating the data into separate fields. This struck me as odd, since, at least in the 1901 data, the original Access file also has the fields separated. Were you aware that the data was converted from multiple fields to a single field? Though my instinct is to keep the fields separate, you may have had good reason to convert these fields.
Another consequence of converting the separate fields into one big text field is that you lose the distinction between "N" for no and blank for refusal to answer. If we maintain separate fields for the different employment types, then census records with a refusal to answer can be properly recorded.
The location table:
To be blunt, I'm confused by the structure of the location table. I think I just need an explanation about how the table structure works, and how to make new locations without going against the current conventions. I've attached the locations table data as a CSV - would you be able to give me a quick rundown of how the various place names in each record relate to each other? For example, "British Columbia -> Vancouver -> Cowichan" seems a bit odd to me. If you can explain the relationship to me and also show me how a new entry would work - for, say, the Rock Bay subdistrict in the 1911 data - then I'll be able to handle new census tracts as you give them to me.
Meeting with JN, PD and JL to plan the next phase of the project. Outcomes:
- The 1911 data is now in the form of a spreadsheet, which will be provided to us in CSV format. JN will adapt the original import tool so that it can use CSV instead of relying on an ODBC connection to an Access db.
- There are still some questions about the 1911 data itself, which require input from ES -- for instance, there are three occupation codes, but only one question on the census about occupation; what are these codes, and how were they derived from the source data? It seems likely that some fields will be combined, eliminated, or not used in the interface.
- The new data will be organized into ten separate files by census tract, and a field needs to be added to record the census tract.
- JN will send PD the table of "relationships to heads of households" table data.
- There are some corrections to previous years' data which need to be made on a record-by-record basis. These have been done in a spreadsheet, so some mechanism will have to be used to parse that spreadsheet and compare it to the database, then make corrections where necessary.
- There are about 600 records for a previous year which need to be entered. Again, if we get those in CSV format they should be easy to import.
- There are two new databases of building permits, one from TC announcements and the other from city records. These will need to be integrated in the same way that e.g. tax assessment data is currently integrated, with the possibility of searching by name and possibly also by address (now that more of the tables will contain address information).
- Records in 1911 will ultimately be mapped to property ids, which will enable lookup of geo data for them. The mechanism for this is yet to be worked out.
- We'll also look into possible ways of adding address data which has been compiled for earlier census years from street directories to the census tables for the earlier years. This might not be too complicated if the addresses have been tied to the ids of existing records in the database; if not, it might take some ingenuity.
In order to do proper development and testing on VIHistory I needed a proper LAMP stack with PostgreSQL support on my machine. I didn't want to develop on the live site, and I needed more features than releases such as XAMPP and LAPPStack could offer, namely ODBC support for the Microsoft Access -> PostgreSQL conversion process (PD and JL provide us with data in MS Access format). So, following a number of guides(*), I successfully installed my own LAMP stack - Linux, Apache, MySQL, and PHP - with PostgreSQL.
Initial package setup (via Synaptic):
apache2 apache2-mpm-prefork apache2-utils apache2.2-common libapache2-mod-php5 libdbd-mysql-perl libmyodbc libmysqlclient mysql-client-5.0 mysql-common mysql-server-5.0 mysql-server-core-5.0 odbcinst1debian php5 php5-cli php5-common php5-odbc unixodbc postgresql-8.4 postgresql-client-8.4 postgresql-client-common postgresql-common postgresql-contrib php5-pgsql libapache2-mod-auth-mysql libapache2-mod-auth-pgsql php5-mysql phpmyadmin pgadmin3 iodbc mdbtools libmdbodbc
In addition to the basic stuff I installed a few bells and whistles such as phpMyAdmin for MySQL and pgadmin3 for PostgreSQL administration.
Setup is minimal. Your default web root will be /var/www. Assuming all of the packages installed correctly, you should see the default Apache page at http://localhost/ immediately. Note that, by default, anyone who knows your IP address will be able to connect to your new web server. To allow only your machine to connect to your server:
sudo pico /etc/apache2/ports.conf
- Look for this line:
Listen 80
- Change to:
Listen 127.0.0.1:80
- Save and close
- Restart apache:
sudo /etc/init.d/apache2 restart
- You're done.
To enable phpMyAdmin support: add this to the bottom of /etc/apache2/apache2.conf:
# Include phpMyAdmin
Include /etc/phpmyadmin/apache.conf
To initialize the postgres user for PostgreSQL:
1. sudo su postgres -c "psql -d template1"
2. ALTER USER postgres WITH PASSWORD 'YourUniquePasswordHere';
3. \q
4. sudo passwd -d postgres
5. sudo su postgres -c passwd
6. Enter the password you specified in step #2
Steps 1-3 deal with setting up the postgres user in PostgreSQL, while steps 4-6 set the postgres user's system password. From there, you can set up additional users (see references below).
ODBC connectivity - here's where it gets a bit dicey. Connecting to MySQL is simple (see the ODBC reference below), but MS Access is trickier since there's no native ODBC support for Linux. There are a few commercial options for Linux MS Access ODBC drivers, but they're astronomically expensive. After a lot (a lot!) of digging around, I finally came across a blog post from 2003 that explains how to get MS Access-ODBC connectivity on Linux using open source tools. Unfortunately, these open source tools (I think) only allow for read access, while the commercial alternatives can also write. Also unfortunately, the website is down at the time of writing so I had to load a cached version of the page, which takes nearly forever but does eventually work. Since it's from 2003 it's a bit outdated, but as long as you've installed the necessary packages (in the list above: unixodbc, mdbtools, libmdbodbc), it still holds up. So, to add an MS Access ODBC connection:
- Install the necessary packages (already done if you followed the list above) - the key is libmdbodbc
- Add this to /etc/odbcinst.ini, which is probably empty (or doesn't exist yet):
[MDBToolsODBC]
Description = MDB Tools ODBC drivers
Driver = /usr/lib/libmdbodbc.so.0
Setup =
FileUsage = 1
CPTimeout =
CPReuse =
- Define a datasource name (DSN), which basically means adding a unique entry to /etc/odbc.ini (which might be empty or not yet there) which links up the ODBC driver we just created with the database to which we want to connect:
[MyAccessFile]
Description = My Microsoft Access Database
Driver = MDBToolsODBC
Database = /path/to/access/file.mdb
Servername = localhost
UserName =
Password =
port = 5432
- As long as you've put all of the files in the right places, you should have a working ODBC-Access connection (see the test section below).
Now to test your MySQL, PostgreSQL, and Microsoft Access ODBC connections with this simple script, which I put in /var/www/test.php:
<?php
$mysql = mysql_connect('localhost', 'username', 'password');
echo "MySQL: ";
echo $mysql ? "success" : mysql_error();
echo "<br />";
$pg = pg_connect("host=localhost port=5432 dbname=template1 user=postgres password=yourpostgrespassword");
echo "PostgreSQL: ";
echo $pg ? "success" : pg_last_error();
echo "<br />";
$odbc = odbc_connect("MyAccessFile", "", "");
echo "ODBC: ";
echo $odbc ? "success" : odbc_errormsg();
?>
* - References:
- Ultra-basic LAMP installation for Ubuntu: http://www.howtoforge.com/ubuntu_lamp_for_newbies
- Setting up phpMyAdmin: http://www.howtoforge.com/ubuntu_debian_lamp_server
- Package list for PHP with ODBC support on Ubuntu: http://www.php.net/manual/en/function.odbc-connect.php#94257
- Setting up PostgreSQL: http://www.ubuntugeek.com/howto-setup-database-server-with-postgresql-and-pgadmin3.html
- Adding users to a PostgreSQL database: http://www.cyberciti.biz/faq/howto-add-postgresql-user-account/
- Microsoft Access Database using Linux and PHP (Yahoo cache version): http://74.6.117.48/search/srpcache?ei=UTF-8&p=http://bryanmills.net/archives/2003/11/microsoft-access-database-using-linux-and-php/&fr=sfp&u=http://cc.bingj.com/cache.aspx?q=http://bryanmills.net/archives/2003/11/microsoft-access-database-using-linux-and-php/&d=4680032763580841&mkt=en-US&setlang=en-US&w=463c4d2e,a36d7b22&icp=1&.intl=us&sig=A7p1jhQjSG4M7jIlS5N85Q--
Modified the spambot .htaccess redirection code. Rather than sending suspected spambots a 403 error, they're now getting redirected here, just in case a legit user has the same user agent as a spam bot:
We seem to be having a timing issue with domain names for our database server. If we use
$connect_str = "host=db.tapor.uvic.ca . . . ";
connection works. We're migrating to new domain names, but when we use:
$connect_str = "host=pgdev.tapor.uvic.ca . . . ";
the connection fails.
The instance of VIHist on lettuce is running against mapserver 4.10. The instance of VIHist on mustard is running against mapserver 5.x. (Same is true for the two instances of MapOfLondon experimental map.) Both instances of MoL work, but only the instance of VIHist running on mapserver 4.x is working. We're not sure if the VIHist code includes a call that is deprecated in 5.x (most likely), or if the build of 5.x happens to not have a lib that VIHist needs (but that MoL doesn't need). If all else fails, we'll put a build of mapserver 4.x on mustard, so that the instance of VIHist on that machine will work, and allow us to complete the change to the new clustered setup with new domain name for the db connections.
1. Victoria Census 1911
Add table with all this data and integrate this with the search
interfaces for the global name search and for the cross-census search
2. Victoria Tax Assessment Rolls, 1891
Add records to existing table
Modify occupation codes (may entail modifying records in occupation
table too)
Modify street addresses to separate street number from street name
3. Business Directory 1871
Add table, integrate into search + report code for business directories (and possibly global name search)
4. Business Directory 1877
Add table, integrate into search + report code for business directories (and possibly global name search)
5. Building permits 1870 - 1917
Add table(s), search and report specific to this dataset, integrate into
global name search
6. Check census 1891
Add table, search and report specific to this dataset, possibly integrate into cross census search and/or global name search
7. 1891 census
Add records.
If I recall correctly:
we're holding off on #1 the 1911 census until the other tasks are done.
for #2, who is providing the additional 1891 Tax Assessment roll records? I'll send a separate email on the occupation codes.
JL was to provide me with some datasets; is that all of #3 Business Directory 1871, #4 Business Directory 1877, #5 Building Permits 1870 - 1917 and #6 check census 1891? If not, then which are you providing and where are the remaining datasets coming from?
The first priority is #7 1891 census add records. For that I am waiting for data from PD.
I've got a file called Victoria_1911_sample.accdb which I believe is a sample of 100 records from the 1911 census. I'm not sure if I'll be able to look at it while we're in our temporary digs, but this isn't the first priority task in any case.
In the vihistory site, these three files had html injected into them by a hacker exploiting the fact that those pages use an argument in the GET array to specify a file which is then included into the host file.
index.php
content/census/1891/index.php
content/census/1901/census1901.php
I've added a php function which takes the page argument in the GET array and checks that against a list of acceptable values. If there is a match, then an appropriate path-and-file name is returned by the function; if not, then an empty string is returned by the function. The calling pages already test for an empty string and generate an error message, so I think I've implemented the "better approach".
The calling files I modified are:
http://vihistory.uvic.ca/content/census/1901/census1901.php
http://vihistory.uvic.ca/content/census/1891/census1891.php
http://vihistory.uvic.ca/content/census/1881/census1881.php
http://vihistory.uvic.ca/content/bd/1882/bd1882.php
http://vihistory.uvic.ca/content/bd/1892/bd1892.php
http://vihistory.uvic.ca/content/bd/1902/bd1902.php
http://vihistory.uvic.ca/content/documents/article.php
Those pages now invoke a confirm_page_ok function, which is found in a lib file on the site.
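For the record, the whitelist approach amounts to something like the sketch below. The page keys and include paths here are made up for illustration; the real confirm_page_ok() in the site's lib file has its own list of acceptable values and differs in detail.
<?php
// Sketch of the whitelist idea only, not the site's actual implementation.
function confirm_page_ok($page)
{
    // Hypothetical page keys mapped to the files they are allowed to include.
    $allowed = array(
        'intro'   => 'content/census/1901/intro.inc',
        'results' => 'content/census/1901/results.inc',
    );
    return isset($allowed[$page]) ? $allowed[$page] : '';
}

// Calling page: include only what the whitelist returns.
$file = confirm_page_ok(isset($_GET['page']) ? $_GET['page'] : '');
if ($file == '') {
    echo 'The requested page could not be found.';  // existing error handling
} else {
    include $file;
}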
Technical details of cause of problem included in helpdesk RT #162668.
I've spent a good part of the day reviewing David's code, trying to solve the problems you've reported when:
1) user does a query and gets a report
2) user clicks on family and gets a new report (also happens when user clicks on location)
3) user clicks on "next" button
The code generating the page is not including the values from the search form. A similar problem occurs if you click on the re-sort buttons at the top of each column.
searchcensus.php
/*
SA added these conditions so that if user clicks on Family or Location, then the results navbar
on the resulting generated page includes the appropriate values derived from the GET array in the
URLs of the navigation buttons
*/
if ((isset($family_number)) && (!empty($family_number))) {
$args .= "&familynumber=" . $family_number;
}
if ((isset($district_id)) && (!empty($district_id))) {
$args .= "&districtid=" . $district_ids;
}
if ((isset($district)) && (!empty($district))) {
$args .= "&district=" . urlencode($district);
}
if ((isset($subdistrict_id)) && (!empty($subdistrict_id))) {
$args .= "&subdistrictid=" . $subdistrict_id;
}
if ((isset($year_got)) && (!empty($year_got))) {
$args .= "&year=" . trim($year_got);
}
censusresults.inc:
//SA 0912007 trying to get the number of records/page to show up on generated page when user
//clicks on Family or Location
$family_url .= "&pagerows=" . $rows_page ;