Wendat Project
Schema and guidelines for encoding lexical data
Martin Holmes
2019–2024

7. Document ids and other identifiers

In our XML encoding, identifiers are the xml:id attributes that appear on elements. Our objective in creating identifiers is to provide a project-wide unique id for every item in the collection that we need to find, process, link to or otherwise manipulate. In general, transcribers/encoders should not have to worry about identifiers, because most will be assigned by the programmer or generated through automated processes. However, our policies on identifiers are documented here.

7.1. Initial assignment of ids

7.1.1. Transcribed manuscripts

The following section covers transcribed manuscripts. Other document and element types are covered in a subsequent section.

Each manuscript has a document id (<TEI>/xml:id) beginning with ms: ms59, ms60, msPotier1751 etc. These are intended to be short, unique within the project, and sufficient to identify an MS. Where the xml:id is a little longer than we would wish, the MS will also have a shorter version of its id for use in linking contexts; for cases such as ms59, this will be identical to the main id, but for cases where the id is longer, it will be a truncation, so msSagard becomes msSag, and msPotier1751 becomes msP51. The shorter id will be encoded in the n attribute on the manuscript's root <TEI> element.
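
As an illustration of how the two ids sit on the root element, the following is a minimal Python sketch that reads them back from a manuscript file. It is not part of the project's tooling: the sample document is a hypothetical, heavily trimmed stand-in, the function name manuscript_ids is invented for this example, and the fallback to the full id when n is absent is an assumption rather than documented policy.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, heavily trimmed manuscript file; only the root attributes matter here.
SAMPLE = f'<TEI xmlns="{TEI_NS}" xml:id="msPotier1751" n="msP51"><text/></TEI>'


def manuscript_ids(tei_xml: str) -> tuple[str, str]:
    """Return (full id, short id) read from the root <TEI> element."""
    root = ET.fromstring(tei_xml)
    full_id = root.get(f"{{{XML_NS}}}id")
    if full_id is None:
        raise ValueError("root <TEI> element has no xml:id")
    # Assumption: if n is missing, treat the full id as the short id.
    short_id = root.get("n", full_id)
    return full_id, short_id


print(manuscript_ids(SAMPLE))  # ('msPotier1751', 'msP51')
```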

Each entry in a document (i.e. each <entryFree> element), as well as each <form> element, will have a unique xml:id attribute, but these will be generated automatically after the document has been transcribed and its encoding is judged to be relatively stable. The identifiers for <entryFree> elements will be constructed as follows:

  • a leading ef_ meaning ‘entryFree’
  • the short form of the manuscript id (e.g. msP51) followed by an underscore
  • a four-digit (using leading zeroes) numeric counter for the entry

The ids for <form> elements will be constructed as follows:

  • a leading fef_ meaning ‘form inside entryFree’
  • the short form of the manuscript id (e.g. msP51) followed by an underscore
  • a six-digit (using leading zeroes) numeric counter for the form

Note that we use four-digit counters for entries and six-digit counters for forms based on our experience with the likely numbers of each in a manuscript file. This may change in future if longer or more complex manuscripts are encoded, but changing this is trivial.

Examples:

The fourth entry in msJCB:

ef_msJCB_0004

The 22nd form in msJCB:

fef_msJCB_000022
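
The construction rules above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the project's actual id-generation process; the function names entry_id and form_id are hypothetical.

```python
def entry_id(short_ms_id: str, counter: int) -> str:
    """Build an <entryFree> id: ef_ + short manuscript id + four-digit counter."""
    return f"ef_{short_ms_id}_{counter:04d}"


def form_id(short_ms_id: str, counter: int) -> str:
    """Build a <form> id: fef_ + short manuscript id + six-digit counter."""
    return f"fef_{short_ms_id}_{counter:06d}"


print(entry_id("msJCB", 4))   # ef_msJCB_0004
print(form_id("msJCB", 22))   # fef_msJCB_000022
```

Changing the counter width, should longer manuscripts require it, would only mean changing the 04d or 06d format specifiers.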

The advantages of using this system are:

  • Every id will be unique across the entire project.
  • Any list of entries or of forms sorted by id will appear grouped by manuscript and sorted in document order, without any requirement for a special sort algorithm.
  • Given any id, its containing manuscript and its approximate position in the document can be derived trivially.
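
The second and third advantages can be demonstrated with a short Python sketch. It assumes that short manuscript ids contain only letters and digits after the leading ms, and it allows for a single trailing letter of the kind used for interpolated ids; the names ID_PATTERN, ParsedId and parse_id are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: short manuscript ids are "ms" followed by letters and digits only.
ID_PATTERN = re.compile(r"^(ef|fef)_(ms[A-Za-z0-9]+)_(\d+)([a-z]?)$")


class ParsedId(NamedTuple):
    kind: str        # "entryFree" or "form"
    manuscript: str  # short manuscript id, e.g. "msJCB"
    counter: int     # approximate position in the document
    suffix: str      # interpolation letter, or "" if none


def parse_id(identifier: str) -> ParsedId:
    """Split an id into its element kind, manuscript and counter."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not an entry or form id: {identifier!r}")
    prefix, manuscript, digits, suffix = match.groups()
    kind = "entryFree" if prefix == "ef" else "form"
    return ParsedId(kind, manuscript, int(digits), suffix)


print(parse_id("fef_msJCB_000022"))
# ParsedId(kind='form', manuscript='msJCB', counter=22, suffix='')

# A plain string sort groups by manuscript and keeps document order.
ids = ["ef_msJCB_0010", "ef_msP51_0002", "ef_msJCB_0004a", "ef_msJCB_0004"]
print(sorted(ids))
# ['ef_msJCB_0004', 'ef_msJCB_0004a', 'ef_msJCB_0010', 'ef_msP51_0002']
```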

Note that we do not encode anything related to the structural location of an element in its id; the id does not tell us whether an <entryFree> is nested inside another <entryFree>, or which <entryFree> a <form> element appears in. This is not necessary because this information can be discovered instantly during any processing, and whenever we list out or render items based on their id, we can provide any relevant or required information about their context.
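
To show how structural context can be recovered during processing rather than from the id itself, here is a minimal Python sketch that finds the <entryFree> elements enclosing a given <form>. The sample fragment, its ids and the function name containing_entries are all hypothetical, and the forms are left empty as placeholders; real processing in the project may well use XSLT or XQuery instead.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment with one entry nested inside another.
SAMPLE = f"""
<body xmlns="{TEI_NS}">
  <entryFree xml:id="ef_msJCB_0004">
    <form xml:id="fef_msJCB_000021"/>
    <entryFree xml:id="ef_msJCB_0005">
      <form xml:id="fef_msJCB_000022"/>
    </entryFree>
  </entryFree>
</body>
""".strip()


def containing_entries(root: ET.Element, form_id: str) -> list[str]:
    """Return ids of the <entryFree> ancestors of a form, innermost first."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    target = next(
        (el for el in root.iter(f"{{{TEI_NS}}}form") if el.get(XML_ID) == form_id),
        None,
    )
    if target is None:
        raise KeyError(form_id)
    chain = []
    node = parent_of.get(target)
    while node is not None:
        if node.tag == f"{{{TEI_NS}}}entryFree":
            chain.append(node.get(XML_ID))
        node = parent_of.get(node)
    return chain


root = ET.fromstring(SAMPLE)
print(containing_entries(root, "fef_msJCB_000022"))
# ['ef_msJCB_0005', 'ef_msJCB_0004']
```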

7.1.2. Subsequent modification of manuscript ids

Once ids are assigned, they will (we hope) never be modified. However, it may be necessary to interpolate new ids if it is discovered (for instance) that a specific <form> or <entryFree> element should actually be split into multiple elements, or a section of text was missed during transcription. These are the protocols for interpolating ids.

  • Take the id of the element of the same kind (<entryFree> or <form>) which precedes the new item in document order: in other words, its preceding-sibling or, in the case of nested <entryFree> elements, its first ancestor.
  • Add the letter a; if there are subsequent interpolated elements, add b, c, and so on, as appropriate.
  • If the preceding element is itself already interpolated, use the next letter in the sequence instead (for example, b after an id ending in a).
  • If this process results in a duplicate id—for example, if you are attempting to interpolate new items into a sequence that already consists of interpolated items—then consider requesting a mechanical reprocessing of the document to re-id everything, and update any links to existing ids. The programmer can undertake that process if necessary.
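
The protocol above can be sketched as a small Python function. This is an illustration under stated assumptions, not the project's tooling: the names interpolated_id and ReIdRequired are hypothetical, short manuscript ids are assumed to contain only letters and digits, and treating a preceding id that already ends in z as a case for re-id is our own assumption, since the protocol does not say what happens when the letters run out.

```python
import re

# Assumption: an id is a base (prefix, short manuscript id, counter)
# optionally followed by one lower-case interpolation letter.
ID_PATTERN = re.compile(r"^((?:ef|fef)_ms[A-Za-z0-9]+_\d+)([a-z]?)$")


class ReIdRequired(Exception):
    """Signals that the document should be mechanically re-id'd instead."""


def interpolated_id(preceding_id: str, existing_ids: set[str]) -> str:
    """Derive the id for a new element inserted after preceding_id."""
    match = ID_PATTERN.match(preceding_id)
    if match is None:
        raise ValueError(f"unrecognised id: {preceding_id!r}")
    base, suffix = match.groups()
    if suffix == "z":
        # Assumption: no letters left, so fall back to a full re-id.
        raise ReIdRequired(f"no letter available after {preceding_id}")
    # First interpolation takes "a"; otherwise take the next letter.
    next_letter = "a" if suffix == "" else chr(ord(suffix) + 1)
    candidate = base + next_letter
    if candidate in existing_ids:
        raise ReIdRequired(f"{candidate} is already in use")
    return candidate


existing = {"ef_msJCB_0004", "ef_msJCB_0005"}
first = interpolated_id("ef_msJCB_0004", existing)
print(first)  # ef_msJCB_0004a
existing.add(first)
print(interpolated_id(first, existing))  # ef_msJCB_0004b
```

Raising an exception on a duplicate, rather than searching for a free letter, mirrors the recommendation to request reprocessing rather than improvise an id.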

In cases where elements need to be deleted for some reason, they may just be deleted, leaving an id in the sequence unused.

Martin Holmes. Date: 2024-01-09