Wrote the XSLT to convert appendices, and began work on the reference list (bibliography) code. I've got the list framework working. I'm now looking at the rather odd NLM structures used for reference items. They don't seem to have any way of distinguishing authors from editors, other than by wrapping them in <person-group> tags with a person-group-type attribute; I guess that reflects the reality in scientific fields, where no-one publishes anything alone. The whole thing seems less structured than a TEI equivalent, being more of a loose agglomeration of tags.
My nine test files now convert without errors (except for the missing reference ids that refer to the bibliography items in the reference list, which are not converting yet because they're back matter). So two-thirds of the job is done. There are some oddities in the model structure of NLM -- for instance, every section (<sec>) must have a <title>, which seems ridiculous, and links (<xref>, <uri> and <ext-link>) cannot contain abbreviation tags, which seems pointlessly restrictive when they can contain bold, italics etc. However, that's not really my problem, except that it requires me to throw away some information during the conversion.
I had a table exams which had a condition_1 field and a condition_2 field. Each of those is an integer pointing to a record in the conditions table. Each record in the conditions table contains an id field (integer) and a text field. What I wanted was to return the text value for both conditions. I had to use aliases so that I could refer to the conditions table twice in the one query. Here's the code that works:
SELECT exams.exam_id, exams.condition_1, exams.condition_2, c1.condition_text AS condition_text_1, c2.condition_text AS condition_text_2
FROM exams
LEFT JOIN examiners ON examiners.examiner_id = exams.examiner_1
LEFT JOIN conditions as c1 ON (
c1.condition_id = exams.condition_1
)
LEFT JOIN conditions as c2 ON (
c2.condition_id = exams.condition_2
)
WHERE exams.p_surname LIKE '%en%'
GROUP BY exams.exam_id
ORDER BY exam_date
It's important to note that the "conditions AS c1" alias appears in the LEFT JOIN clause, not in the SELECT or FROM clauses. If you try to declare the alias in the SELECT or FROM clauses, you get MySQL error 1066 (not unique table/alias). The SELECT clause includes c1.condition_text AS condition_text_1 and c2.condition_text AS condition_text_2, so two separate fields are returned in the results, each of which is based on an independent JOIN clause using a distinct alias to the same conditions table. Thanks to Martin for helping with this.
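The aliasing trick can be demonstrated end-to-end with a tiny self-contained example. This is a Python/sqlite3 sketch (the project itself uses MySQL from PHP), with made-up table contents; the principle of joining the same lookup table twice under two aliases is identical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE conditions (condition_id INTEGER, condition_text TEXT);
INSERT INTO conditions VALUES (1, 'rain'), (2, 'fog');
CREATE TABLE exams (exam_id INTEGER, condition_1 INTEGER, condition_2 INTEGER);
INSERT INTO exams VALUES (100, 1, 2);
""")
# Join the same lookup table twice, once per alias, so that each
# condition_* foreign key resolves to its own text column.
row = cur.execute("""
SELECT exams.exam_id,
       c1.condition_text AS condition_text_1,
       c2.condition_text AS condition_text_2
FROM exams
LEFT JOIN conditions AS c1 ON c1.condition_id = exams.condition_1
LEFT JOIN conditions AS c2 ON c2.condition_id = exams.condition_2
""").fetchone()
print(row)  # (100, 'rain', 'fog')
```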
I've now added a buildTables() method to the wizardQuery object, which creates the framework for building all the tables. A major difficulty was that I needed to open a connection to the db and keep it open throughout the process, to ensure that the temporary tables survive. However, some of the subsidiary processing for getting arrays of tracts also opened and closed a connection in the middle of this, and it turns out that if PHP finds an existing connection already open with the same parameters, it will use that connection instead of creating a new one -- and then, of course, close it. I had to rewrite some of the city tract set object code so that it created arrays of tracts, indexed by year, when it was first instantiated, rather than calling on it to do that later.
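The reason the connection has to stay open is that temporary tables are scoped to the connection (session) that created them. A minimal Python/sqlite3 illustration of that scoping (not the project's PHP/MySQL code, but the behaviour is analogous):

```python
import sqlite3, tempfile, os

# Temporary tables belong to the connection that created them.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn1 = sqlite3.connect(path)
conn1.execute("CREATE TEMPORARY TABLE scratch (n INTEGER)")
conn1.execute("INSERT INTO scratch VALUES (42)")
# Visible on the connection that created it...
print(conn1.execute("SELECT n FROM scratch").fetchone())  # (42,)

# ...but invisible to a second connection to the same database file,
# and closing conn1 would destroy the table entirely.
conn2 = sqlite3.connect(path)
try:
    conn2.execute("SELECT n FROM scratch")
except sqlite3.OperationalError as e:
    print(e)  # no such table: scratch
```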
With that out of the way, I now have a multi-dimensional array in place, indexed at the top level by year, and subsequently by city/tractset, holding a set of objects (class tableQueryUnit), each of which knows its year, city, tract set and ethnicity groupings, and has created a temporary table based on them.
The next stage is to have those objects perform the normal calculations on their temporary tables to generate the three value arrays we need. I can't simply throw these tables at the original code written by JD, because that code is explicitly designed to create connections and dispose of them immediately; we need to keep connections alive in order to preserve our temporary tables until we've finished with them. So the next part of the procedure is to create analogues of JD's functions which can have a connection object passed to them as well as a table name. They don't need a $ctWhereClause parameter, because the creation of our temporary tables has already taken care of that, so they should be a bit simpler in that respect.
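The analogues I have in mind take the already-open connection and the temporary table name as parameters instead of opening and closing their own connections. A hypothetical Python-flavoured sketch of that shape (the function name calc_totals and the column are invented for illustration):

```python
import sqlite3

def calc_totals(conn, table_name):
    """Run one calculation against an already-open connection, so any
    temporary tables on that connection stay alive. No WHERE-clause
    parameter is needed: the temporary table was already filtered
    when it was built."""
    # table_name is trusted internal data here, not user input.
    return conn.execute(f"SELECT COUNT(*), SUM(n) FROM {table_name}").fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMPORARY TABLE tmp_unit (n INTEGER)")
conn.executemany("INSERT INTO tmp_unit VALUES (?)", [(1,), (2,), (3,)])
print(calc_totals(conn, "tmp_unit"))  # (3, 6)
```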
All calculations done for a specific temporary table are concerned only with that table -- in other words, there is no interaction between tables. That means that one of my tableQueryUnit objects can initiate all of the calculations which are relevant to it, and store the results in arrays of table cells, without any interaction outside of itself.
Stayed late hacking at teiJournal/NLM XSLT, and manning the office while the others were at the Pierre Berton award event.
I now have teiJournal-to-NLM motoring along quite well; I've got all the way through handling body elements, as far as I can tell. Next I'll try bulk-converting the whole set of articles and bug-fixing on those, before I move on to the difficult issue of the reference list and appendices.
Fruitful project meeting to see where the transcription tasks have reached, and to plan our strategy for next semester. Decisions mainly concerned the two tag-teams, working on the Sonnet and the Amboise texts. The plan is basically that they'll finish up (any time soon), then they'll get together and examine each other's markup to look for inconsistencies, then we'll revise the markup to incorporate the best of both approaches. Then, the Sonnet team will begin tagging line equivalences, based on the 1609 text. The Amboise team will start adding <note> elements containing questions they'd like to see answered, along with possible suggested answers if they know them; these will form the basis of scholarly annotations. We need to categorize these notes by topic, using a type attribute. The other team will do the same, once they've tagged their line number equivalences.
Added some of the basic body block-level element handling. I was able to take advantage of some of the XHTML processing already written to deal with tables, because NLM purports to use the XHTML table model, but there are still apparently some issues; <caption> seems to require another block-level element below it, and the @class attribute is not supported. Still, we're making progress.
I've added the PHP to create temporary tables, and tested that it works when executed during the query processing. So we have a proof of concept. The next phase is to figure out how to organize results generation from the temporary tables in an efficient way.
The basic problem we have is that we're generating three distinct tables in three distinct ways; and we're potentially including in all three tables data which is generated from several different temporary tables, all of which will be destroyed as soon as they go out of scope. We want to avoid re-generating the same temporary tables each time. However, we don't want an almighty spaghetti-code function which does everything in one go; all three tables are far more complicated than the original three tables, because they're multi-year and multi-city as well as being multi-ethnicity.
I think the sanest approach might be this:
- An umbrella method in the wizardQuery class is responsible for creating the connection, then iterating through the data structure to create the temporary tables.
- The cityTractList class can be enhanced with more data fields, enabling it to store the following information:
  - The temporary table name (each temporary table is associated with one cityTractList, being a city, a year, and a tract set). (We ignore the slight inefficiency of potentially having two temporary tables generated from the same city/year, with different tract lists; these are best treated as two separate entities for the sake of simplicity.)
  - A mapping of constructed ethnicity grouping names to their id numbers (negative integers stored in the "o" field of the temporary tables). We use negative numbers to make it clear that the constructed ethnicities are not mappable to any of the origin strings in the base data tables. So each city tract set has a list of at least one id+name (such as "-1", "European").
  - A pointer to the established MySQL connection, so that it can work on the temporary tables.
- Each city tract set is called upon by the parent wizardQuery instance to create its own sections of the three tables. It does these calculations and stores the results in the form of arrays of sequences of table cells -- one sequence for each ethnicity grouping, with the number of cells in a sequence dependent on the data returned (e.g. two cells for the D table) or on the number of eth. groupings (one cell for each eth. grouping, so that each grouping has an interaction with each other one).
- Once all the city tract sets (each item in the ctList member of the wizardQuery object instance) have done their calculations, the parent wizardQuery object can dispose of the connection, causing the temporary tables to be deleted.
- Now the wizardQuery can begin constructing the three tables. It builds the header rows, then for each year, it adds a year cell on the left, then calls on the ctList item to supply its pre-calculated cells, and adds them in.
- It does each of the three tables; each process can be split out into a separate function, for clarity, and we can implement and test each one separately.
- Once the three tables have been created, they're returned as results to the calling AJAX object.
I'm pretty sure that's the best way to do the job, and it should lend itself to incremental implementation; I can start with the D-table function, and do some speed testing, before moving on to the others.
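The shape of the plan above can be sketched in miniature. This is a hypothetical Python/sqlite3 analogue of the PHP design (class and method names are my own renderings of tableQueryUnit and wizardQuery; the real D and interaction calculations are stubbed out with a trivial COUNT/SUM):

```python
import sqlite3

class TableQueryUnit:
    """One city/year/tract-set unit: builds its temporary table on the
    shared connection, then pre-computes its cells of results."""
    def __init__(self, conn, year, city, values):
        self.conn, self.year, self.city = conn, year, city
        self.table = f"tmp_{city}_{year}"  # invented naming scheme
        conn.execute(f"CREATE TEMPORARY TABLE {self.table} (n INTEGER)")
        conn.executemany(f"INSERT INTO {self.table} VALUES (?)",
                         [(v,) for v in values])
        self.cells = None

    def calculate(self):
        # Stand-in for the real per-unit calculations; each unit works
        # only on its own temporary table, with no cross-unit interaction.
        self.cells = self.conn.execute(
            f"SELECT COUNT(*), SUM(n) FROM {self.table}").fetchone()

class WizardQuery:
    """Umbrella object: one connection for the whole run, so the
    temporary tables survive until every unit has calculated."""
    def build_tables(self, spec):
        conn = sqlite3.connect(":memory:")
        units = [TableQueryUnit(conn, y, c, vals) for (y, c, vals) in spec]
        for u in units:
            u.calculate()
        conn.close()  # temporary tables disposed of here
        # Assemble the output table from pre-calculated cells only.
        return [[u.year, u.city, *u.cells] for u in units]

rows = WizardQuery().build_tables([(1951, "ottawa", [1, 2]),
                                   (1961, "ottawa", [3])])
print(rows)  # [[1951, 'ottawa', 2, 3], [1961, 'ottawa', 1, 3]]
```

The point of the sketch is the lifetime discipline: all temporary tables hang off one connection, every unit finishes its calculations before that connection is closed, and table assembly afterwards touches only the stored cells.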
MA is working on an article containing a lot of fairly complicated Unicode stuff, for submission to a journal, and we've all been helping in the struggle with fonts, MSWord, OpenOffice, and Acrobat/PDF. Just logging the time spent. We learned a bit about Devanagari, and the Gandhari font, in the process.