<bibl> and <biblStruct>

Martin Holmes

University of Victoria HCMC

In this presentation, I want to draw some conclusions about the overall practicality of TEI markup as the basis for scholarly publications. I'll do so by looking at one specific area of TEI, the encoding of bibliographical citations, and assessing its suitability for marking up born-digital scholarly documents for print and electronic publication.

(1) Why focus on bibliographical markup?

I've done a lot of this (ACH/ALLC Abstracts, Scandinavian-Canadian Studies Journal, IALLT Journal, etc.).
2,140 <biblStruct>s
For journal articles, about 50% of the markup time is taken up with bibliographical references.

(2) The dream of Laurent

Encode bibliographical items in TEI in such a way that they:

are highly structured and machine-readable
support interoperability with other biblio markup formats
can be rendered in accordance with multiple style guides (Chicago, MLA, APA, etc.)

I've called this the Dream of Laurent because, although I used to share the dream, and we have written together on this topic in the past, I've gradually lost faith in the practicality of it. At the heart of the dream is the idea that the TEI <biblStruct> tag, because of its structured and style-neutral form, is the best way to encode bibliographic references.

(3) Zotero

Zotero's only purpose is to manage biblio data.
If it can be done, Zotero surely does it.

To examine the practicality of this, I've taken a look at a tool whose only job is to manage bibliographic references and export them in various formats: Zotero. I want to take a quick look at how Zotero does what it does, and how successful it is in terms of the desired features.

(4) Zotero's item type list

Artwork
Audio Recording
Bill
Blog Post
Book
Book Section
Case
Computer Program
Conference Paper
Dictionary Entry
Document
E-mail
Encyclopaedia Article
Film
Forum Post
Hearing
Instant Message
Interview
Journal Article
Letter
Magazine Article
Manuscript
Map
Newspaper Article
Patent
Podcast
Presentation
Radio Broadcast
Report
Statute
Thesis
TV Broadcast
Video Recording

The initial classification of a bibliographic item is crucial. Styleguides have explicit instructions for rendering different types of item. Zotero has just over 30 item types; my own list, amassed over the years from all my biblStructs, has 62.

(5) Entering items in Zotero

When you choose a specific item type, you get a customized data entry form for that item type, which is different from the others. For instance, if you're inputting a Journal Article, you get fields for volume, issue, pages, ISSN and so on; for a Radio Broadcast, you get Episode Number, Recording Type, Network, etc. Zotero needs to store these fields in discrete structures in order to fulfill its mission of exporting to a range of different formats and rendering for different styleguides.

(6) Entering a podcast in Zotero

Here's the form for entering a podcast. You can see that a lot of the fields are specific to the item type or to similar types ("Podcaster", "File type", "Running time"). These specific fields are required when rendering for specific styleguides; for instance, to render this item according to the APA Styleguide to Electronic References, we need to know that it's a podcast (the word "podcast" appears in the rendering), we need to know its episode number, and we need the file URL. We also ought to have the date of the podcast itself, which is required by the APA rendering, but there's no field for that.

Zotero saves this data in an SQLite database, but it also has a "native" RDF file format. If we export this single item into "Zotero RDF", we get a document that uses 21 elements from 7 different namespaces. Exporting a collection of only six bibliographic items gets us 43 elements spread across 9 different namespaces.

(7) Exporting from Zotero

Zotero export to APA:

Novella, S., Novella, B., Novella, J., Bernstein, E., & Watson, R. (n.d.). The Skeptics' Guide to the Universe Podcast 239. The Skeptics' Guide to the Universe. MP3, . Retrieved from http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3.

"Correct" version per APA Styleguide to Electronic References:

Novella, S., Novella, B., Novella, J., Bernstein, E., & Watson, R. (Podcasters). (2010, February 10). The Skeptics' Guide to the Universe Podcast [Show 239]. The Skeptics' Guide to the Universe. Podcast retrieved from http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3.

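Zotero offers exports to 16 different citation styles. I picked one at random and compared what Zotero gives us with what the style guide actually prescribes. The result is pretty accurate, but it's not perfect: the date is missing ("n.d."), the "(Podcasters)" role label is absent, the show number is not bracketed, and a stray "MP3, ." appears before the URL.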

(8) Zotero to MODS

  <mods>
    <titleInfo>
      <title>The Skeptics' Guide to the Universe Podcast 239</title>
    </titleInfo>
    <typeOfResource>text</typeOfResource>
    <genre authority="local">podcast</genre>
    <genre authority="marcgt">theses</genre>
    <genre>MP3</genre>
    <name type="personal">
      <namePart type="family">Novella</namePart>
      <namePart type="given">Steve</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Novella</namePart>
      <namePart type="given">Bob</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Novella</namePart>
      <namePart type="given">Jay</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Bernstein</namePart>
      <namePart type="given">Evan</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Watson</namePart>
      <namePart type="given">Rebecca</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator"/>
      </role>
    </name>
    <location>
      <url>http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3</url>
    </location>
    <abstract>Special Guest: Brian Dunning
News Items: Enceladus Update, Synthetic Organisms, Spray On Glass, Gasoline from Carbon, Oral Conception
Who's That Noisy
Name That Logical Fallacy: False Analogy
Science or Fiction</abstract>
    <relatedItem type="series">
      <titleInfo>
        <partTitle>The Skeptics' Guide to the Universe</partTitle>
      </titleInfo>
    </relatedItem>
  </mods>

Zotero also provides seven different data export formats, including MODS and BibTeX. When exporting to MODS, Zotero does a pretty good job -- but again not perfect, as we can see. This is not a text resource; neither is it a thesis; and MP3 is not, as far as I know, a genre.
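As a rough sketch only, here is how the three problem fields might be expressed differently in MODS. This is not Zotero output and not a definitive encoding. It assumes that "sound recording-nonmusical" is an acceptable <typeOfResource> value for a spoken-word podcast, that the file format belongs in <physicalDescription> as an internet media type, and that the broadcast date (taken from the style guide rendering above) can go in <originInfo>.

```xml
<mods>
  <titleInfo>
    <title>The Skeptics' Guide to the Universe Podcast 239</title>
  </titleInfo>
  <!-- Assumption: a spoken-word audio file is better described as a
       non-musical sound recording than as "text". -->
  <typeOfResource>sound recording-nonmusical</typeOfResource>
  <!-- The local "podcast" genre is kept; the marcgt "theses" genre is dropped. -->
  <genre authority="local">podcast</genre>
  <!-- Assumption: the file format is recorded as a media type, not a genre. -->
  <physicalDescription>
    <internetMediaType>audio/mpeg</internetMediaType>
  </physicalDescription>
  <!-- The podcast date, for which the Zotero form offered no field. -->
  <originInfo>
    <dateIssued encoding="w3cdtf">2010-02-10</dateIssued>
  </originInfo>
  <!-- [<name>, <location>, <abstract> and <relatedItem> as in the export above] -->
</mods>
```

Even this hedged version depends on knowing in advance that the item is a podcast, which is the point: the item type drives every other encoding decision.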

(9) Why have I been going on about Zotero?

Its only job is biblio markup.
It has specialized data entry forms for many different item types.
It explicitly labels each item type.
It has special fields for each item type.
It still fails to collect enough data for a simple item like a podcast.
It still fails to correctly render the podcast.
It's still lacking about 30 item types (by my reckoning).
It still fails to output correctly to other formats (e.g. MODS).

My point here is not to criticize Zotero; it's to point out that even a dedicated application developed over several years with no other purpose than to encode and render bibliographic citations does not succeed in doing a perfect job. Also, in doing the great job it does, it requires a much larger range of tags and attributes (or data item types) than TEI provides, and needs explicitly to distinguish at least thirty different types of bibliographic item in order to handle data collection, storage and rendering properly.

(10) Over to TEI...


<biblStruct type="podcast">
 <monogr>
  <title level="m">The Skeptics' Guide to the Universe Podcast 239</title>
  <respStmt>
    <resp>hosted by</resp>
    <persName><forename>Steve</forename><surname>Novella</surname></persName>
  </respStmt>
  <respStmt>
    <resp>podcaster</resp>
    <persName><forename>Jay</forename><surname>Novella</surname></persName>
    <persName><forename>Bob</forename><surname>Novella</surname></persName>
    <!--  [etc.] -->
  </respStmt>
  <imprint>
    <date when="2010-02-10">2010</date>
  </imprint>
  <biblScope type="episode">239</biblScope>
 </monogr>
 <series>
  <title level="s">The Skeptics' Guide to the Universe</title>
  <respStmt>
    <resp>produced by</resp>
    <name>The <orgName>New England Skeptical Society</orgName> in association with the <orgName>James Randi Educational Foundation</orgName> (JREF): <ref target="http://www.theness.com">http://www.theness.com</ref></name>
  </respStmt>
 </series>
 <note>
    Abstract: Special Guest: Brian Dunning<lb/>
    News Items: Enceladus Update, Synthetic Organisms, Spray On Glass, Gasoline from Carbon, Oral Conception<lb/>
    Who's That Noisy<lb/>
    Name That Logical Fallacy: False Analogy<lb/>
    Science or Fiction </note>
 <idno type="url">http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3</idno>
</biblStruct>

Now let's compare this with TEI's <biblStruct>. I've done a sample encoding of the same podcast item. A number of difficulties are immediately apparent:

Should this be an <analytic> or a <monogr>?
Why do we need <imprint> just to get <date>?
Why is there no natural home for the URL of the MP3 file?
The roles of agents in <respStmt> elements are arbitrary. The guidelines recommend (but do not prescribe) using <resp>'s @key attribute with a range of values from MARC 21's taxonomy of terms, but this list, although long, does not include "podcaster" or "presenter".
@type attributes for the generic <biblScope> are also arbitrary.
For "produced by", we're using an arbitrary value in a <resp>; however, <bibl> has <sponsor> available to it, while <biblStruct> does not.

The TEI <biblStruct> has a relatively small set of generic fields available to it. When we have to encode things such as issue number or page references, we already fall back on the @type attribute (<biblScope type="pages">), and the guidelines tend to drift from prescription into suggestion, leading to divergent practice in the community. For instance, examples in the current guidelines, on the same page, show both of these:

<biblScope>pp 1013–23</biblScope>
<biblScope type="pages">3-46</biblScope>
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COBITY
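
To make the divergence concrete, here is a sketch of the kind of local convention a project might adopt to avoid it. The rule (every <biblScope> carries @type, and its content is the bare value with no "pp" or "vol." prefix) is an assumption for illustration, not something the guidelines prescribe. The two page ranges are the ones from the guidelines examples above; the volume and issue values are hypothetical.

```xml
<!-- Fragment only: these elements would sit inside <monogr> or <imprint>. -->
<!-- Hypothetical project rule: @type is always present, content is the bare value. -->
<biblScope type="vol">12</biblScope>       <!-- hypothetical value -->
<biblScope type="issue">3</biblScope>      <!-- hypothetical value -->
<biblScope type="pages">1013–23</biblScope> <!-- was: pp 1013–23, untyped -->
<biblScope type="pages">3-46</biblScope>
```

A convention like this works only inside the project that enforces it; nothing in the schema stops another project from choosing differently, so the @type values remain arbitrary from an interchange point of view.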

I think it's clear that <biblStruct> is hopelessly inadequate as a basis for a system of encoding citations. Even if we were to put serious work into extending it with new elements and attributes, and tried our best to provide exhaustive examples showing how to encode every possible type of item and piece of data we can imagine, we would inevitably fail, because the bibliographic information space is too rich, too complicated, and too mutable. The Dream of Laurent is a fantasy.

(11) What to do?

Common assumption: there is a difference between:
- Encoding citations as represented in a source document (such as a pre-existing print document or manuscript)
- Encoding citations as part of a publishing process (such as a manuscript of a book to be published)
Is this a real distinction?
"Born digital" does not mean "born without style".
Every document I receive conforms (well or badly) to a styleguide.

In February this year, on the TEI-L list, Wendell Piez described two distinct ways in which TEI markup is used:

XML encoding done for the sake of fitting data to a particular application or family of applications -- which typically require at least some forms of regularity and predictability -- and retrospective markup, which aims to *describe* documents irrespective of concerns with how regular they are, and which may be as interested in irregularities as in regularities. (Markup benefits [was: Subscriber benefits], 02-02-2010.)

In a related approach, the ad-hoc committee on encoding of bibliographic citations, of which I'm a member, began its discussions with this assumption (among others): that there is a difference between:

# 2.1 Encoding citations as represented in a source document (such as a pre-existing print document)
# 2.2 Encoding citations as part of a publishing process (such as a manuscript of a book to be published)

My contention is that this difference is not as real as we think it is. "Born digital" does not mean what we think it means, because ALL human writing is performed in a specific style, whether rigorous or not. Nor is the distinction helpful, because it makes us think in terms of highly-structured, database-like encoding methods such as that typified by <biblStruct>. In effect, we are always doing "retrospective markup", and we might as well accept it and act accordingly, because it actually makes our lives easier.
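
To illustrate what "retrospective markup" of a citation could look like, here is a minimal sketch of the same podcast reference encoded as a <bibl>, keeping the text and punctuation of the APA rendering quoted earlier and simply tagging its parts. The choice of elements and the type="episode" and type="podcast" values are assumptions made for this example, not a recommended or official encoding.

```xml
<!-- Sketch only: the styled citation is kept as written, and markup is layered on top. -->
<bibl type="podcast">
  <author>Novella, S.</author>, <author>Novella, B.</author>,
  <author>Novella, J.</author>, <author>Bernstein, E.</author>,
  &amp; <author>Watson, R.</author> (Podcasters).
  (<date when="2010-02-10">2010, February 10</date>).
  <title level="m">The Skeptics' Guide to the Universe Podcast</title>
  [Show <biblScope type="episode">239</biblScope>].
  <title level="s">The Skeptics' Guide to the Universe</title>.
  Podcast retrieved from
  <ref target="http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3">http://media.libsyn.com/media/skepticsguide/skepticast2010-02-10.mp3</ref>.
</bibl>
```

The trade-off is deliberate: the rendering for one styleguide is fixed in the text, while the tagged date, titles, and URL stay machine-readable enough for indexing or linking.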

Dublin, April 2010