Introduction to Marking up Humanities Texts

Examples of Humanities Computing projects from UVic

<ref target="http://mariage.uvic.ca/">Le Mariage sous L'ancien Régime</ref>: an electronic anthology

http://mariage.uvic.ca/. This is an anthology of texts and images from around the 17th century in France.

Relating spatial and literary references in <ref target="http://mapoflondon.uvic.ca/">Early Modern London</ref>

http://mapoflondon.uvic.ca/. This project links a large historical map of London to a variety of information about streets, churches and other landmarks, with a focus on the literary history of the city.

A collection of <ref target="http://pear.hcmc.uvic.ca:8080/ach/site/abstracts.htm">abstracts for a conference</ref>

http://pear.hcmc.uvic.ca:8080/ach/site/abstracts.htm. The ACH/ALLC 2005 Conference was held at UVic, and we created this system to produce the online schedule with abstracts in XML, XHTML, PDF and Text format, as well as the printed abstracts book.

Linguistic analysis of <ref target="http://web.uvic.ca/hrd/iallt2003/oldenglish/OEparagraph-1.html">Old English</ref>

http://web.uvic.ca/hrd/iallt2003/oldenglish/OEparagraph-1.html. This experimental project focuses on syntactic analysis of Old English sentence structure, giving readers access to very detailed linguistic information about the text they're reading.

Problems with this kind of "markup"

What is metadata and what is not?

What do italics actually mean?

What does a period mean?

Traditional texts are not machine-readable.

Our historical textual conventions are very sophisticated, but they're not machine-readable. XML encoding enables us to identify very precisely the structure of a text, and its features. This can be a very painstaking and difficult process. It can also force the elimination or resolution of inconsistencies and ambiguities which may be important to the overall effect of the text. Nevertheless, as Unsworth says, there can be "an enormous return on the exercise of foolish consistency". Marking up a text will often force us to articulate clearly our view of what the text is and how it works. This often makes humanists uncomfortable; they thrive on such ambiguities and tensions. However, we can think of markup as a way to make explicit a theory of the text. Any given markup expresses a theory or set of theories about the text. These need not be mutually exclusive -- multiple markups can be applied to the same text -- but within one markup version, we are forced into consistency.

Preparing for markup: Document analysis

Pick your text

Which edition or MS will you use? What representations of the text are available to you? How reliable are they?

Identify your audience

What is their interest in the text? What standards and technologies are they used to working with? What can you provide for them that is not available elsewhere?

Choose your focus: what features or aspects of the text are important?

Rhyme/meter? Paleography? Physical structure? Thematic structure? Linguistic features? What will you change about the text, or add to it? In other words, what is the knowledge you're trying to represent?

Analyse your document: List the features of the text that you wish to capture in your encoding.

Look at some sample documents and think about what features of those texts might be important for various audiences.

Example: <ref target="http://www.bl.uk/learning/timeline/large126785.html">Page 1 of manuscript of Jane Eyre</ref>

XML

XML consists mainly of <emph>tags</emph> and <emph>attributes</emph>. (See the <ref target="material/simple_note.htm">example</ref>.)

XML represents a document as a <ref target="material/tree_structure.png">hierarchical tree structure</ref>. The structure can only branch in one direction, and <ref target="material/sonnet_130_structure.htm">elements cannot overlap</ref>.

Open tag: <code><name></code>

Close tag: <code></name></code>

Full tag: <code><name></code>Joe<code></name></code>

Empty tag: <code><pb /></code>

This is short for <pb></pb>.

Attribute: <code><name type="first"></code>Joe<code></name></code>

The order of attributes in a tag is not constrained in any way (in other words, we cannot depend on their being in a particular order, or make their order significant; processors will ignore the order of attributes).
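A quick way to convince yourself of these two points is to ask a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (not part of the workshop toolchain); the second attribute, n, is a hypothetical one added only so that there is an order to swap.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same empty element.
short_form = ET.fromstring('<pb />')
long_form = ET.fromstring('<pb></pb>')

# Neither has text content or child elements, and the tag name is the same.
same_empty = (
    short_form.tag == long_form.tag
    and (short_form.text or '') == (long_form.text or '')
    and len(short_form) == len(long_form) == 0
)

# The same attributes written in a different order.
# "n" is a hypothetical second attribute, used only for illustration.
first = ET.fromstring('<name type="first" n="1">Joe</name>')
second = ET.fromstring('<name n="1" type="first">Joe</name>')

# The parser exposes attributes as a mapping; comparing mappings ignores order.
same_attributes = first.attrib == second.attrib

print(same_empty, same_attributes)  # True True
```

To the parser, both pairs are indistinguishable, which is why we cannot make attribute order carry meaning.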

Rulesets for document structure

<code><abbr expan="United Nations">UN</abbr></code>

This is a standard TEI tag; its purpose is obvious.

This is apparently equivalent; humans can easily tell that it means basically the same thing.

This is the same information again, in a different format.

Machines can't cope with this. They must know what to expect.

We must all agree on what tags and attributes we're going to use, what they mean, and how we're going to use them.

SCHEMAS (or DTDs) define what tags and attributes are used, and where.

Schemas and DTDs formalize our agreement. They are also machine-readable, so (for instance) editor applications can use them to constrain our choices and help us to write conformant documents.
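To see why machines "must know what to expect", here is a small sketch in Python (standard library only). The function is written against the agreed encoding, abbr with an expan attribute. The two variant encodings are invented for this illustration and are not taken from any real project; a human reads all three the same way, but the program only understands the one it was written for.

```python
import xml.etree.ElementTree as ET

def expansion(xml_text):
    """Return the expansion of an abbreviation, assuming <abbr expan="...">."""
    element = ET.fromstring(xml_text)
    if element.tag != 'abbr':
        return None  # not the tag name we agreed on
    return element.get('expan')  # None if the attribute is missing

agreed = '<abbr expan="United Nations">UN</abbr>'
# Hypothetical variants, invented for illustration only.
variant_a = '<abbreviation expansion="United Nations">UN</abbreviation>'
variant_b = '<abbr>UN<expan>United Nations</expan></abbr>'

results = [expansion(x) for x in (agreed, variant_a, variant_b)]
print(results)  # ['United Nations', None, None]
```

A schema removes this problem at the source: documents that do not use the agreed tags and attributes are rejected before any processing code sees them.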

Schemas

Background: <ref target="material/simple_note_full.htm">HTML page</ref> showing a simple XML document and associated ruleset

Real <ref target="material/mariage.rng">rulesets</ref> (WARNING: Large file!), a.k.a. DTDs and schemas, are complex, but comprehensible by computers (and humans if need be).

The Text Encoding Initiative (<ref target="http://www.tei-c.org/">TEI</ref>) provides a set of standard modular rulesets and tools for creating them.

The TEI site (http://www.tei-c.org/) also provides extensive guidelines on how to mark up texts.

Start New Document with oXygen

Start the oXygen application as you would any other program.

Choose New... from the File menu, choose XML Document in the New Document dialog box, and click OK. [<ref target="new_doc.png">Screenshot</ref>]

Check the "Use DTD or schema" checkbox.

Choose "RelaxNG", select "XML syntax".

Type this into the URL box at the top:<lb/> http://hcmc.uvic.ca/tei.rng

Make sure "TEI" is selected in the "Document root" box, then click "OK". [<ref target="create_xml_doc.png">Screenshot</ref>]

Edit Your Sample Document in oXygen

Red wiggly lines indicate errors in the file.

Insert appropriate tags - note how oXygen helps / constrains you, based on the schema.

Build a minimal document structure.

Copy and paste your text from <ref target="http://hcmc.uvic.ca/presentations/xml/material/sonnet_130.htm">http://hcmc.uvic.ca/presentations/xml/material/sonnet_130.htm</ref>.

Confirm document is well-formed (blue check) and valid (red check).

"Well-formed" means that the document is correctly hierarchical (all tags are closed, and the tag structure forms a correctly branching tree). "Valid" means that the document conforms to the schema (it uses the right tag names, attributes, and attribute values in the right places).

Save your file. That's the end of this session.
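The difference between well-formed and valid, as defined above, can be sketched in a few lines of Python (standard library only). Well-formedness is checked by simply trying to parse. Validity is only imitated here: ALLOWED_TAGS is a toy ruleset invented for this sketch, whereas a real RelaxNG schema such as the one oXygen uses also constrains nesting, attributes and attribute values.

```python
import xml.etree.ElementTree as ET

# Toy ruleset, invented for this sketch: the only tag names we allow.
ALLOWED_TAGS = {'text', 'body', 'lg', 'l'}

def is_well_formed(xml_text):
    """True if the text parses as XML (all tags closed, no overlap)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def uses_only_allowed_tags(xml_text):
    """Very rough stand-in for validation: check tag names only."""
    root = ET.fromstring(xml_text)
    return all(element.tag in ALLOWED_TAGS for element in root.iter())

good = "<text><body><lg><l>My mistress' eyes are nothing like the sun;</l></lg></body></text>"
# The hi element opens inside l but closes outside it: overlapping elements.
overlapping = '<lg><l>one <hi>two</l> three</hi></lg>'
# Correctly nested, but "stanza" is not in our toy ruleset.
unknown_tag = '<text><body><stanza><l>line</l></stanza></body></text>'

print(is_well_formed(good), uses_only_allowed_tags(good))                # True True
print(is_well_formed(overlapping))                                       # False
print(is_well_formed(unknown_tag), uses_only_allowed_tags(unknown_tag))  # True False
```

The third document is the interesting case: it passes the blue check but would fail the red one.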

Before the next session, try using the TEI <choice>, <orig> and <reg> tags to encode a modern regular form of the poem alongside the original text.
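To show what this kind of encoding buys you, here is a hedged sketch in Python (standard library only) that reads one encoded line and produces either the original or the regularized reading. The sample line and its spellings are invented for illustration and are not quoted from any particular edition; the reading function is a hypothetical helper, not a TEI tool.

```python
import xml.etree.ElementTree as ET

# Invented sample line; the spelling pair is for illustration only.
line = ET.fromstring(
    '<l>I <choice><orig>loue</orig><reg>love</reg></choice> to hear her speak</l>'
)

def reading(element, keep):
    """Flatten an element to text, keeping only <orig> or <reg> inside <choice>."""
    parts = [element.text or '']
    for child in element:
        if child.tag == 'choice':
            chosen = child.find(keep)  # the alternative we want to show
            if chosen is not None:
                parts.append(reading(chosen, keep))
        else:
            parts.append(reading(child, keep))
        parts.append(child.tail or '')  # text that follows the child element
    return ''.join(parts)

original = reading(line, 'orig')
regularized = reading(line, 'reg')
print(original)     # I loue to hear her speak
print(regularized)  # I love to hear her speak
```

One encoded text yields both readings, so nothing has to be decided once and for all at transcription time.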

Setting up your workspace

Download the XML data file we'll be working on:<lb /> <ref target="http://hcmc.uvic.ca/presentations/xml/material/css_intro.xml">http://hcmc.uvic.ca/presentations/xml/<lb/>material/css_intro.xml</ref>

Download the schema:<lb /> <ref target="http://hcmc.uvic.ca/presentations/xml/material/css_intro.rng">http://hcmc.uvic.ca/presentations/xml/<lb/>material/css_intro.rng</ref>

Start a new CSS file in oXygen:<lb/>File / New / CSS.

Save the CSS file. Call it "css_intro.css".

Link the CSS file to your XML file by adding a processing instruction, as we did before: <code><?xml-stylesheet href="css_intro.css" type="text/css"?></code>

Selectors we will use

<code>div{...}</code> (<term>type</term> selector)

This is the simplest type of selector. It selects all elements with the tag name "div".

<code>div p{...}</code> (<term>descendant</term> selector)

This selects all p elements which are descendants of div.

<code>title[level="m"]{...}</code> (<term>attribute</term> selector)

This selects all title elements which have an attribute level="m".

<code>quote:before{...}</code> and <code>quote:after{...}</code> (<term>pseudo-selectors</term>)

These pseudo-selectors are a little more complicated; they enable you to add content which will appear before or after the element. Using these selectors, for instance, you can add opening and closing quotation marks before and after a quote.

More details: <lb /><ref target="http://www.w3.org/TR/CSS21/selector.html">http://www.w3.org/TR/CSS21/selector.html</ref>

The link above goes to the CSS 2.1 specification; CSS 3 defines many more selectors and pseudo-selectors, but only a few are currently supported by browsers.
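If it helps to see exactly which elements each selector picks out, the same selections can be imitated outside the browser. This is only an analogy, sketched with the path syntax of Python's standard xml.etree.ElementTree module; the sample document is invented, and browsers do not work this way internally.

```python
import xml.etree.ElementTree as ET

# Invented sample document for illustration.
doc = ET.fromstring(
    '<body>'
    '<div><p>First <title level="m">Monograph</title></p>'
    '<quote>Quoted</quote></div>'
    '<p>Outside any div <title level="a">Article</title></p>'
    '</body>'
)

# div{...}: every div element
divs = doc.findall('.//div')

# div p{...}: every p that is a descendant of a div
div_paragraphs = doc.findall('.//div//p')

# title[level="m"]{...}: every title whose level attribute is "m"
monographs = doc.findall(".//title[@level='m']")

# quote:before / quote:after: content added around the element when shown
quote = doc.find('.//quote')
rendered_quote = '\u201c' + (quote.text or '') + '\u201d'

print(len(divs))                         # 1
print([p.text for p in div_paragraphs])  # ['First ']
print([t.text for t in monographs])      # ['Monograph']
print(rendered_quote)
```

Note that the second p and the title with level="a" are left out: the descendant and attribute selectors are doing real filtering work.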

Properties we will use

<code>display: block | inline | none;</code> (hiding and showing elements)

Elements can be displayed as blocks (such as paragraphs, which are multi-line wrapping blocks of text), inline (words or phrases that occur within a block, such as italicized terms) or "none" (which means not displayed at all).

<code>width: 60%;</code> (sizing elements)

Width and height can be specified for any block elements, in percentages (relative to the containing block) or units such as pixels (400px) or inches (6in).

<code>margin-top: 1em;</code> (space around elements)

A margin setting does just what you would think: it puts a space between the element and anything contiguous to it. Margins can be set in percentages or units such as pixels, inches or ems (the height of the element's font). Margins can be set separately on all four sides of the element, or in a single setting (margin: 2%).

<code>text-align: left | right | center | justify;</code>

<code>font-size: 150%;</code>

<code>font-family: georgia, "times new roman", serif;</code>

The font-family property takes a series of comma-separated values, in order of preference; the browser will use the first one which is available. Font names with spaces should be enclosed in quotes, and the series should end with one of the generic font families (serif, sans-serif, cursive, fantasy or monospace).

<code>font-style: italic;</code>

<code>font-weight: bold;</code>

<code>color: black;</code>

<code>background-color: white;</code>

Steps in building a stylesheet

1. Specify which elements to hide.

Some elements are not normally viewed in a document; one good example might be the teiHeader, which is often used only by metadata parsers or coders comfortable with reading XML. Such elements can be hidden by setting display: none.

2. Specify which elements are blocks.

By default, all elements are displayed as inline. The first step in organizing the layout is to determine which elements should be block elements. Typical examples in TEI would be head tags and p tags.

3. Set margins on block elements.

Space out the elements on your page by setting margin values. Other properties to consider at this stage are padding and border.

4. Set text alignment on block elements.

Some elements (such as headings) might be centred; others (such as paragraphs) will look best when justified.

5. Set font size on block elements.

The hierarchy of a text is often signalled through setting different font sizes for different levels of heading. Blockquotes are sometimes shown with a smaller font size than the surrounding paragraph text.

6. Style inline elements.

TEI emph tags will typically be italicized, acronyms or abbreviations might be bold, and monograph titles will typically be in italics.

CSS Task 2 (much harder)

Download the XML data file for task 2:<lb /> <ref target="http://hcmc.uvic.ca/presentations/xml_css/materials/gowers_sample.xml">http://hcmc.uvic.ca/presentations/xml_css/materials/gowers_sample.xml</ref>

Look at page-images of the original text:<lb /> <ref target="http://hcmc.uvic.ca/presentations/xml_css/materials/gowers_pages.htm">http://hcmc.uvic.ca/presentations/xml_css/materials/gowers_pages.htm</ref>

Build a stylesheet that makes the XML appear as close as possible to the page-images, like this.

Limitations of CSS

There's no interactivity.

Nothing is clickable, and nothing pops up when you mouse over things.

We can't display images or other embedded content.

CSS allows us to specify a background image for an element or for the whole file, but it doesn't allow us to show images in the document, based on a filename or url in the document.

CSS is useful for display of simple documents, or for proofing our markup, but not much more.

In order to do something more dynamic with our documents, we need to convert them into something else. In the next part of the session, we'll be converting our XML into HTML using XSLT.

XSLT: eXtensible Stylesheet Language Transformations

XSLT is an XML language

XSLT code is expressed in XML format. That makes it a little different from other computer languages you might know.

The purpose of XSLT is to turn XML into something else.

The input to an XSLT transformation is always XML. The output may be one of several formats.

XSLT can produce XML, HTML, or text output. We will be writing XSLT to produce XHTML output.

XHTML is the most recent form of HTML. It is based on the old HTML many of us are familiar with, but it is expressed in well-formed and valid XML. In other words, it's simultaneously XML and HTML.

Getting started with XSLT

Create a stylesheet: <lb/> File / New / XSL Stylesheet / Version 1.0.

Version 2 also exists, but it's not supported in browsers yet. We're going to view the results of our transformation in a browser, so we need to stick to version 1.0. It should be good enough.

Save your file as "css_intro.xsl".

Link your XML file to this stylesheet. Replace the old xml-stylesheet instruction with this one: <code><?xml-stylesheet href="css_intro.xsl" type="text/xsl"?></code>

Now simplify your XML file by removing the <code>xmlns</code> attribute in the root element. Also remove the schema declaration.

Don't ask why; it's too complicated to explain, and you'll wish you never asked. We just have to do it for this exercise.

Open the XML file in your browser.

You won't see anything right now, because our XSLT file is empty.

Our first template:

<xsl:template match="/"><lb/>
  <html xmlns="http://www.w3.org/1999/xhtml"><lb/>
    <head><lb/>
      <title><xsl:value-of select="TEI/teiHeader/fileDesc/titleStmt/title/text()" /></title><lb/>
    </head><lb/>
    <body><lb/>
      <xsl:apply-templates select="TEI/text/body" /><lb/>
    </body><lb/>
  </html><lb/>
</xsl:template>

XSLT works by templates. It parses the XML tree of the input file, and every time it finds something that matches one of its templates, it carries out the instructions in that template. Here, we're matching the root of the document, which is signified by a slash character. We're telling the processor: "When you find the root of the document, output a basic HTML file". Then we're telling it to extract a specific piece of information from the teiHeader, and use that for the title of the HTML page. Finally, we're telling it to continue processing from the text/body level of the tree.
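For readers who know a scripting language better than XSLT, the three things this template does can be imitated step by step. The sketch below uses Python's standard xml.etree.ElementTree module; it is an analogy, not how an XSLT processor is implemented. The input document is invented, the transform function is a hypothetical helper, and the XHTML namespace declaration is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# Invented minimal input, without a namespace, like our simplified exercise file.
source = ET.fromstring(
    '<TEI><teiHeader><fileDesc><titleStmt>'
    '<title>Sample title</title>'
    '</titleStmt></fileDesc></teiHeader>'
    '<text><body><p>Sample paragraph.</p></body></text></TEI>'
)

def transform(tei):
    """Imitate the first template: HTML shell, title from the header, then the body."""
    # 1. Output a basic HTML file (namespace omitted in this sketch).
    html = ET.Element('html')
    head = ET.SubElement(html, 'head')
    title = ET.SubElement(head, 'title')

    # 2. Like xsl:value-of: pull one piece of information from the teiHeader.
    #    "tei" is already the TEI element, so the path starts one level lower.
    title.text = tei.findtext('teiHeader/fileDesc/titleStmt/title', default='')

    # 3. Stand-in for xsl:apply-templates on text/body. With no other templates
    #    defined, XSLT's built-in rules copy the text through, so we do the same.
    body = ET.SubElement(html, 'body')
    source_body = tei.find('text/body')
    if source_body is not None:
        body.text = ''.join(source_body.itertext())
    return html

result = ET.tostring(transform(source), encoding='unicode')
print(result)
```

The important difference is that XSLT is declarative: instead of one function walking the tree, we write more templates and let the processor decide when each one applies.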

Wrapup: How our projects actually work

The XML documents are stored in an XML Database called eXist: <graphic url="xmldb_structure.png" width="400px" height="549px" rend="display: block; clear: all; margin-left: 20%; margin-top: 5px;" />

eXist is an open-source Java application that runs on the server. It can be integrated with a sophisticated web application server called Cocoon.

The Website is managed by Cocoon: <graphic url="cocoon_app.png" width="579px" height="428px" rend="display: block; clear: all; margin-left: 20%; margin-top: 5px;" />

You can see that a number of different languages are actually involved. The query comes in to the system in the form of a URL. This is then passed to an XQuery processor, which creates a detailed query in the XQuery language. eXist deals with the XQuery, gets the data out of the database in XML format, and passes it back into the pipeline. Then the results are processed through XSLT, to produce XML, HTML, text or XSL-FO. XSL-FO can be passed to our XEP processor to produce PDF. The results are sent back to the browser.