<bibl> and <biblStruct>

Martin Holmes

University of Victoria HCMC

In this presentation, I want to draw some conclusions about the overall practicality of TEI markup as the basis for scholarly publications. I'll do this by looking at one specific area of TEI, the encoding of bibliographical citations, and assessing its suitability for marking up born-digital scholarly documents for print and electronic publication.

(1) Why focus on bibliographical markup?

(2) The dream of Laurent

Encode bibliographical items in TEI in such a way that they:

I've called this the Dream of Laurent because, although I used to share the dream, and we have written together on this topic in the past, I've gradually lost faith in the practicality of it. At the heart of the dream is the idea that the TEI <biblStruct> tag, because of its structured and style-neutral form, is the best way to encode bibliographic references.

(3) Zotero

To examine the practicality of this, I've taken a look at a tool whose only job is to manage bibliographic references and export them in various formats: Zotero. I want to take a quick look at how Zotero does what it does, and how successful it is in terms of the desired features.

(4) Zotero's item type list

  • Artwork
  • Audio Recording
  • Bill
  • Blog Post
  • Book
  • Book Section
  • Case
  • Computer Program
  • Conference Paper
  • Dictionary Entry
  • Document
  • E-mail
  • Encyclopaedia Article
  • Film
  • Forum Post
  • Hearing
  • Instant Message
  • Interview
  • Journal Article
  • Letter
  • Magazine Article
  • Manuscript
  • Map
  • Newspaper Article
  • Patent
  • Podcast
  • Presentation
  • Radio Broadcast
  • Report
  • Statute
  • Thesis
  • TV Broadcast
  • Video Recording

The initial classification of a bibliographic item is crucial. Styleguides have explicit instructions for rendering different types of item. Zotero has just over 30 item types; my own list, amassed over the years from all my biblStructs, has 62.

(5) Entering items in Zotero

Entering a journal article in Zotero

When you choose a specific item type, you get a customized data entry form for that item type, which is different from the others. For instance, if you're inputting a Journal Article, you get fields for volume, issue, pages, ISSN and so on; for a Radio Broadcast, you get Episode Number, Recording Type, Network, etc. Zotero needs to store these fields in discrete structures in order to fulfill its mission of exporting to a range of different formats and rendering for different styleguides.

(6) Entering a podcast in Zotero

Entering a podcast in Zotero

Here's the form for entering a podcast. You can see that a lot of the fields are specific to the item type or to similar types ("Podcaster", "File type", "Running time"). These specific fields are required when rendering for specific styleguides; for instance, to render this item according to the APA Styleguide to Electronic References, we need to know that it's a podcast (the word "podcast" appears in the rendering), we need to know its episode number, and we need the file URL. We also ought to have the date of the podcast itself, which is required by the APA rendering, but there's no field for that.

Zotero saves this data in an SQLite database, but it also has a "native" RDF file format. If we export this single item as "Zotero RDF", we get a document that uses 21 elements from 7 different namespaces. Exporting a collection of only six bibliographic items gets us 43 elements spread across 9 different namespaces.
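As a rough illustration of how counts like these can be obtained, the following Python sketch tallies the distinct element names and namespaces used in an XML document. The function name `count_vocabulary` and the tiny sample document (including the `http://example.org/hypothetical#` namespace) are assumptions made for illustration; the sample is not Zotero's actual output. The sketch looks at element names only, so a namespace used solely on attributes would not be counted, and the figures quoted above may have been arrived at differently.

```python
import xml.etree.ElementTree as ET


def count_vocabulary(xml_text):
    """Return (number of distinct element names, number of distinct namespaces)."""
    elements = set()
    namespaces = set()
    for node in ET.fromstring(xml_text).iter():
        tag = node.tag  # namespaced tags look like '{namespace-uri}localname'
        elements.add(tag)
        if tag.startswith("{"):
            namespaces.add(tag[1:].split("}", 1)[0])
    return len(elements), len(namespaces)


# Hypothetical miniature export, NOT real Zotero RDF.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://example.org/hypothetical#">
  <ex:Recording>
    <dc:title>Sample episode</dc:title>
    <dc:date>2010-02-10</dc:date>
  </ex:Recording>
</rdf:RDF>"""

print(count_vocabulary(SAMPLE))  # (4, 3)
```

Even this four-element toy draws on three vocabularies, which gives a feel for how quickly a real export spreads across namespaces.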

(7) Exporting from Zotero

Zotero export to APA:

  • Novella, S., Novella, B., Novella, J., Bernstein, E., & Watson, R. (n.d.). The Skeptics' Guide to the Universe Podcast 239. The Skeptics' Guide to the Universe. MP3, . Retrieved from http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3.

"Correct" version per APA Styleguide to Electronic References:

  • Novella, S., Novella, B., Novella, J., Bernstein, E., & Watson, R. (Podcasters). (2010, February 10). The Skeptics' Guide to the Universe Podcast [Show 239]. The Skeptics' Guide to the Universe. Podcast retrieved from http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3.

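Zotero offers exports to 16 different citation styles. I picked one at random, and compared what Zotero gives us to what the style guide actually prescribes. The output is pretty accurate, but it's not perfect: the date is missing ("n.d."), the "(Podcasters)" role label and the "[Show 239]" designation are absent, and a stray "MP3, ." appears before the retrieval statement.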

(8) Zotero to MODS

  <mods>
    <titleInfo>
      <title>The Skeptics' Guide to the Universe Podcast 239</title>
    </titleInfo>
    <typeOfResource>text</typeOfResource>
    <genre authority="local">podcast</genre>
    <genre authority="marcgt">theses</genre>
    <genre>MP3</genre>
    <name type="personal">
      <namePart type="family">Novella</namePart>
      <namePart type="given">Steve</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Novella</namePart>
      <namePart type="given">Bob</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Novella</namePart>
      <namePart type="given">Jay</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Bernstein</namePart>
      <namePart type="given">Evan</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Watson</namePart>
      <namePart type="given">Rebecca</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <location>
      <url>http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3</url>
    </location>
    <abstract>Special Guest: Brian Dunning
News Items: Enceladus Update, Synthetic Organisms, Spray On Glass, Gasoline from Carbon, Oral Conception
Who's That Noisy
Name That Logical Fallacy: False Analogy
Science or Fiction</abstract>
    <relatedItem type="series">
      <titleInfo>
        <partTitle>The Skeptics' Guide to the Universe</partTitle>
      </titleInfo>
    </relatedItem>
  </mods>

Zotero also provides seven different data export formats, including MODS and BibTeX. When exporting to MODS, Zotero does a pretty good job -- but again not a perfect one, as we can see. This is not a text resource; neither is it a thesis; and MP3 is not, as far as I know, a genre.

(9) Why have I been going on about Zotero?

  • Its only job is biblio markup.
  • It has specialized data entry forms for many different item types.
  • It explicitly labels each item type.
  • It has special fields for each item type.
  • It still fails to collect enough data for a simple item like a podcast.
  • It still fails to correctly render the podcast.
  • It's still lacking about 30 item types (by my reckoning).
  • It still fails to output correctly to other formats (e.g. MODS).

My point here is not to criticize Zotero; it's to point out that even a dedicated application developed over several years with no other purpose than to encode and render bibliographic citations does not succeed in doing a perfect job. Also, in doing the great job it does, it requires a much larger range of tags and attributes (or data item types) than TEI provides, and needs explicitly to distinguish at least thirty different types of bibliographic item in order to handle data collection, storage and rendering properly.

(10) Over to TEI...

<biblStruct type="podcast">
 <monogr>
  <title level="m">The Skeptics' Guide to the Universe Podcast 239</title>
  <respStmt>
    <resp>hosted by</resp>
    <persName><forename>Steve</forename><surname>Novella</surname></persName>
  </respStmt>
  <respStmt>
    <resp>podcaster</resp>
    <persName><forename>Jay</forename><surname>Novella</surname></persName>
    <persName><forename>Bob</forename><surname>Novella</surname></persName>
    <!-- [etc.] -->
  </respStmt>
  <imprint>
    <date when="2010-02-10">2010</date>
  </imprint>

  <biblScope type="episode">239</biblScope>
 </monogr>
 <series>
  <title level="s">The Skeptics' Guide to the Universe</title>
  <respStmt>
    <resp>produced by</resp>
    <name>The <orgName>New England Skeptical Society</orgName> in association with the <orgName>
James Randi Educational Foundation</orgName> (JREF): <ref target="http://www.theness.com">http://www.theness.com</ref></name>
  </respStmt>
 </series>
 <note>
    Abstract: Special Guest: Brian Dunning<lb/>
    News Items: Enceladus Update, Synthetic Organisms, Spray On Glass, Gasoline from Carbon, Oral Conception<lb/>
    Who's That Noisy<lb/>
    Name That Logical Fallacy: False Analogy<lb/>
    Science or Fiction </note>
 <idno type="url">http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3</idno>
</biblStruct>

Now let's compare this with TEI's <biblStruct>. I've done a sample encoding of the same podcast item. A number of difficulties are immediately apparent:

The TEI <biblStruct> has a relatively small set of generic fields available to it. To encode things such as issue numbers or page references, we already have to fall back on the @type attribute (<biblScope type="pages">), and here the guidelines tend to drift from prescription into suggestion, which leads to divergent practice in the community. For instance, examples on the same page of the current guidelines show both of these:

<biblScope>pp 1013–23</biblScope>
<biblScope type="pages">3-46</biblScope>
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COBITY
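Divergence of this kind is easy to detect mechanically, which is one way of seeing how loosely the element is constrained. Here is a minimal Python sketch, assuming the citations sit in a well-formed XML fragment. The function names and the wrapping <listBibl> sample are hypothetical, and matching is done on local names so that the check works whether or not the TEI namespace is declared.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def untyped_bibl_scopes(xml_text):
    """Return the text of every <biblScope> that carries no @type attribute."""
    root = ET.fromstring(xml_text)
    return [
        (el.text or "").strip()
        for el in root.iter()
        if local_name(el.tag) == "biblScope" and el.get("type") is None
    ]


# The two forms quoted above, wrapped in a hypothetical <listBibl>.
SAMPLE = """<listBibl>
  <bibl><biblScope>pp 1013–23</biblScope></bibl>
  <bibl><biblScope type="pages">3-46</biblScope></bibl>
</listBibl>"""

print(untyped_bibl_scopes(SAMPLE))
```

A processor that relies on @type to recognise page ranges would silently miss the first form, and would have to parse "pp" out of the text instead.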

I think it's clear that <biblStruct> is hopelessly inadequate as a basis for a system of encoding citations. Even if we were to put serious work into extending it with new elements and attributes, and try our best to provide exhaustive examples showing how to encode every possible type of item and piece of data we can imagine, we will inevitably fail, because the bibliographic information space is too rich, too complicated, and too mutable. The Dream of Laurent is a fantasy.

(11) What to do?

In February this year, on the TEI-L list, Wendell Piez described two distinct ways in which TEI markup is used:

XML encoding done for the sake of fitting data to a particular application or family of applications -- which typically require at least some forms of regularity and predictability -- and retrospective markup, which aims to *describe* documents irrespective of concerns with how regular they are, and which may be as interested in irregularities as in regularities. (Markup benefits [was: Subscriber benefits], 02-02-2010.)

In a related approach, the ad-hoc committee on encoding of bibliographic citations, of which I'm a member, began its discussions with the assumption (among others) that there is a difference between:

My contention is that this difference is not as real as we think it is -- "born digital" does not mean what we think it means -- because ALL human writing is performed in a specific style, whether rigorous or not; and that it is not helpful, because it makes us think in terms of highly-structured, database-like encoding methods such as that typified by <biblStruct>. In effect, we are always doing "retrospective markup", and we might as well accept it, and act accordingly, because actually it makes our lives easier.

(12) Documents always have style

Styleguides prescribe the way we actually write our text. Chicago author-date form (Chicago 15th ed. 16.107-120) requires the year, while MLA does not include the year unless there are multiple works by the same author in the bibliography (MLA 6th ed. 6.4.3). APA includes commas between the elements, and the p. prefix for page numbers; while it doesn't specifically mention volume numbers, it has an example with "chap. 2", from which I've extrapolated (APA 5th ed. 3.101). Although we could abstract all of these into some more complex structure to remove the style-specific features, why bother when we'll have to render them out in a style again anyway?

(13) A simpler approach

(14) An example

<bibl>
  <author>
    <persName><surname>Patrik</surname>, <forename>Linda E.</forename></persName>
  </author>
  <date notBefore="2007-03-20" notAfter="2007-06-21" when="2007">2007</date>.
  <title level="a">Encoding for Endangered Tibetan Texts</title>.
  <title level="j">DHQ: Digital Humanities Quarterly</title>
  <biblScope type="vol">1</biblScope>, no. <biblScope type="issue">1</biblScope>
(<date>Spring</date>).
  <ref target="http://digitalhumanities.org/dhq/vol/1/1/000004/000004.html">
http://digitalhumanities.org/dhq/vol/1/1/000004/000004.html
  </ref>.
</bibl>

(15) An overview

I believe we should treat born-digital journal articles in exactly the same way we treat original print or manuscript documents when we encode them. They come with built-in style, and we should accept it (or correct it if it's wrong). We should confine ourselves to tagging as many pieces of data as we can find in the text, rather than re-formatting the text to try to create a set of database records in XML, then tie ourselves in knots trying to reformat that structure back into the style that it originally had.
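To make the point concrete, here is a minimal Python sketch showing that a styled <bibl> still yields its data to a processor, while the original rendering comes for free from the text itself. The function names are hypothetical, and the sample is a shortened, namespace-free version of the <bibl> example shown earlier; a real TEI document would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# Shortened version of the earlier <bibl> example (no TEI namespace declared).
BIBL = """<bibl>
  <author>
    <persName><surname>Patrik</surname>, <forename>Linda E.</forename></persName>
  </author>
  <date when="2007">2007</date>.
  <title level="a">Encoding for Endangered Tibetan Texts</title>.
  <title level="j">DHQ: Digital Humanities Quarterly</title>
  <biblScope type="vol">1</biblScope>, no. <biblScope type="issue">1</biblScope>
</bibl>"""


def styled_text(bibl):
    """Return the citation as written, with whitespace runs collapsed."""
    return " ".join("".join(bibl.itertext()).split())


def tagged_fields(bibl):
    """Pull out the pieces of data that the encoder chose to tag."""
    return {
        "surname": bibl.findtext(".//surname"),
        "year": bibl.find("date").get("when"),
        "article": bibl.findtext("title[@level='a']"),
        "journal": bibl.findtext("title[@level='j']"),
        "volume": bibl.findtext("biblScope[@type='vol']"),
        "issue": bibl.findtext("biblScope[@type='issue']"),
    }


bibl = ET.fromstring(BIBL)
print(styled_text(bibl))
print(tagged_fields(bibl))
```

The punctuation and ordering stay in the document where the author put them, so no stylesheet has to reconstruct them, and the tagged fields remain available for searching, sorting or linking.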

(16) So...
