John Lutz has obtained four text files, each consisting of transcripts of stories from early editions of the Times Colonist newspaper. He wishes to make these available to a wide audience: researchers, students and the general public.
- To make this material available on the internet in the form of a searchable database.
- To normalize the data in the data files and render it usable in an SQL database.
- To create a front-end to query the SQL database and report back.
- Output formats and constraints to be determined.
- John Lutz, principal; Leona Taylor and Dorothy Mindenall, data creation; Stewart Arneil, software developer; workstudy students TBA, data preparation, reviewing and testing.
- Intellectual property on the data TBA. Code written by Stewart is owned by the University of Victoria.
John to obtain a secondary account on unix.uvic.ca and an SQL database on uvic.ca to host the project.
Project will use a combination of a text editor and Martin's Transformer to normalize the data. The intermediate normalized data format is XML-based, but it will ultimately be transformed into tab-delimited text or whatever other format is needed to import into an SQL table.
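As a rough illustration of the last step, the sketch below flattens XML records into tab-delimited rows. It is only a sketch under assumptions: the element names (`records`, `record`, `date`, `topic`, `text`) and the sample stories are invented for the example, since the real record structure is still to be determined, and Python stands in for whatever tool is finally used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ASSUMPTION: hypothetical field names; the real record structure is not yet decided.
FIELDS = ["date", "topic", "text"]


def xml_to_tsv(xml_text: str) -> str:
    """Flatten <record> elements into tab-delimited rows, one row per record."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)  # header row
    for record in root.iter("record"):
        row = []
        for name in FIELDS:
            # A missing field becomes an empty string (null on import).
            value = record.findtext(name, default="")
            # Collapse whitespace so stray tabs or newlines cannot break a row.
            row.append(" ".join(value.split()))
        writer.writerow(row)
    return out.getvalue()


# Invented sample data, for illustration only.
sample = """<records>
  <record><date>1860-03-01</date><topic>shipping</topic><text>Steamer  arrived
  from San Francisco.</text></record>
  <record><date>1860-03-02</date><text>Council met.</text></record>
</records>"""

tsv = xml_to_tsv(sample)
print(tsv)
```

Collapsing whitespace inside each value keeps every record on one line, which is what most tab-delimited importers expect.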
The database will be MySQL running on a server provided by UVic.
The front-end will be PHP running on an account on unix.uvic.ca.
FUNDING AND RESOURCES:
Data checking and testing by workstudy students, paid for and supervised by John Lutz.
Management and code writing by Stewart as part of job at HCMC.
MILESTONES AND END OF THE PROJECT:
First milestone: determine the structure for each record and how each field in each record is to be populated (extracted from data in the raw text, derived from previously extracted data, or left null), and complete the list of all codes, abbreviations etc. in the raw text which need to be expanded or normalized, by end of January 2007.
Second milestone: XML files with normalized records and fields and normalized values in those fields, by end of March 2007.
Third milestone: simple query interface allowing for search string and/or date of publication, by end of April 2007.
Fourth milestone: more sophisticated query interface allowing a range of dates, limiting to one of the four datasets or all of them, etc., by summer 2007.
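The query options named in the third and fourth milestones (search string, date range, dataset) could be combined roughly as follows. This is a sketch under assumptions, not the planned implementation: the table and column names (`stories`, `pub_date`, `dataset`, `text`) and the sample rows are invented, and Python's built-in `sqlite3` stands in for the MySQL and PHP stack so the example is runnable on its own.

```python
import sqlite3


def build_query(search=None, date_from=None, date_to=None, dataset=None):
    """Return (sql, params) for whichever filters were supplied.

    Every user value is bound as a parameter, never pasted into the SQL text.
    """
    clauses, params = [], []
    if search:
        clauses.append("text LIKE ?")
        params.append(f"%{search}%")
    if date_from:
        clauses.append("pub_date >= ?")
        params.append(date_from)
    if date_to:
        clauses.append("pub_date <= ?")
        params.append(date_to)
    if dataset:
        clauses.append("dataset = ?")
        params.append(dataset)
    sql = "SELECT pub_date, dataset, text FROM stories"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY pub_date", params


# ASSUMPTION: hypothetical schema and invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (pub_date TEXT, dataset TEXT, text TEXT)")
conn.executemany("INSERT INTO stories VALUES (?, ?, ?)", [
    ("1860-03-01", "a", "Steamer arrived from San Francisco."),
    ("1861-07-15", "b", "Council met to discuss the steamer wharf."),
    ("1862-01-09", "a", "Snow closed the road."),
])

# Search string plus a date range, across all datasets.
sql, params = build_query(search="steamer",
                          date_from="1860-01-01", date_to="1860-12-31")
rows = conn.execute(sql, params).fetchall()
print(rows)
```

Storing dates in ISO form (YYYY-MM-DD) lets a plain comparison express a date range; a single date of publication is the case where `date_from` and `date_to` are equal.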
To be housed on uvic servers indefinitely and maintained by HCMC.
CRITERIA FOR CESSATION OF PROJECT:
Lack of timely direction on features to include, conventions for normalizing data etc.
FIT WITH EXISTING WORK:
Project will exploit software used in office (Transformer)
Stewart sharpens his regular expression knowledge to extract dates, topic codes etc. from raw text files.
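A minimal sketch of this kind of extraction is shown below. The raw line layout it parses ("12 Mar 1860 [SHP] Steamer arrived.") and the topic codes are invented for the example; the actual codes and conventions in the text files are still to be catalogued, so the pattern would need to be adapted.

```python
import re
from datetime import date

# ASSUMPTION: an invented raw-line layout, e.g. "12 Mar 1860 [SHP] Steamer arrived."
LINE_RE = re.compile(
    r"^(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]{3})\s+(?P<year>\d{4})\s+"
    r"\[(?P<code>[A-Z]+)\]\s+(?P<text>.*)$"
)

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# ASSUMPTION: hypothetical topic codes and their expansions.
TOPIC_CODES = {"SHP": "shipping", "CNL": "council"}


def parse_line(line):
    """Return a normalized record, or None if the line does not fit the pattern."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    month = MONTHS.get(m["month"].title())
    if month is None:
        return None
    try:
        iso = date(int(m["year"]), month, int(m["day"])).isoformat()
    except ValueError:
        # Impossible dates such as "31 Feb" are flagged for manual review.
        return None
    return {
        "date": iso,
        # Unknown codes are passed through unchanged rather than dropped.
        "topic": TOPIC_CODES.get(m["code"], m["code"]),
        "text": m["text"],
    }


record = parse_line("12 Mar 1860 [SHP] Steamer arrived.")
print(record)
```

Lines that return `None` can be collected into a separate list for the workstudy students to check by hand.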
Will use a standard MySQL back-end and PHP front-end.
Queries are likely to be very similar to those used in VIHistory and other sites; presentation is likely to be similar to that used in other sites.
The goal of this project is to take a collection of transcripts of news stories from early editions of the Times Colonist newspaper, which are currently in text files containing special codes for various bits of information, normalize the records, put them into an SQL database and then write a querying front-end.