TEI 2017 Victoria, British Columbia, Canada November 11 - 15

XML Tues Nov 14, 11:00–12:30

Encoding Cryptic Crossword Clues with TEI (paper)

Martin Holmes*

* Martin Holmes is a programmer in the University of Victoria Humanities Computing and Media Centre. He served on the TEI Technical Council 2010–2015 and was Managing Editor of the Journal of the TEI 2013–2015.

Although word-square puzzles have existed since ancient times (Austin), the first modern crossword did not appear until 1913. Crosswords featured in British newspapers from 1923, and within a few years some included clues which were more than “plain definitions,” including “elusive definitions,” anagrams and “hints” (Macnutt, 19). The wholly cryptic crossword had evolved by the early 1940s, and cryptic crosswords have appeared daily in British newspapers ever since.
Whereas a “simple” crossword clue is merely a definition, a cryptic clue is a more sophisticated puzzle typically consisting of two parts: a definition, and a set of codified instructions for building the solution. These components are woven into a phrase or sentence which has its own internal logic, usually unconnected with the actual answer and intended to mislead the solver. This is a recent example from a Guardian crossword:
Amending pub sign, add in Cook’s vessel (7,5)
The answer is PUDDING BASIN, an anagram of “pub sign add in” (signalled by “amending”), defined by “Cook’s vessel.” The complete clue suggests perhaps the addition of HMS Endeavour to a pub signboard.
Several different types of cryptic clue emerged in the first decades of the tradition, and the “rules” for setting clues were codified by the influential early setters Afrit (1949) and Ximenes (aka Macnutt), who presented a taxonomy of clue types and principles for setters to adhere to in the interests of fairness. In the decades since, crossword setters have largely conformed to these core principles; although some have been more rigorously “Ximenean” than others, it is fair to say that the tradition has been remarkably consistent, and a crossword solver doing a regular puzzle in a daily newspaper over the last 50 years will not have experienced much change in the form and style of clues. Some clue types, such as those based on literary quotation, appear to be less common in recent years, while some new conventions and clue types have developed. At least one species of clue, in which the answer is arrived at by describing the clue itself, seems to be more common recently. This is not covered by Macnutt’s taxonomy (falling presumably into his miscellaneous “various” category); it might be categorized as an “embodiment” clue, and an excellent example, from the master setter Araucaria (John Graham), is this:
Of of of of of of of of of of (10)
The answer is OFTENTIMES.
The best cryptic crossword clues exhibit the allusive compression, elegance and wit that characterize good poetry, and can elicit similar delighted responses from solvers. This alone makes them worth studying as a distinct form of literary text. It would also be illuminating to examine the evolution of clue types and conventions over the last eighty years, and to investigate how the content and themes of cryptic clues reflect the changing world of the setters and solvers. To do such work, a systematic method of encoding clues is required.
Computing methods have been applied to cryptic crosswords to auto-generate grids and clues (Berghel & Yi), and to parse clues (Hart & Davis), while Williams & Woodhead proposed a formal notation for clue components. However, as far as I know, no systematic approach to encoding cryptic clues in XML has been developed. This paper will present a TEI schema and guidelines for encoding the components of clues and solutions using <taxonomy>, <seg>, and @ana, developed for a project aiming to encode a representative sample of puzzles from British newspapers over the last eighty years, enabling algorithmic analysis of trends, features and clue types. Two taxonomies are being developed, one of clue types (starting from the lists in chapters 6, 7 and 8 of Macnutt), and the other of clue components. The objective is to assign each clue to one or more categories, and to break down its structure to clarify the way it works, and how it simultaneously misleads the solver, as in this encoded example from Picaroon (2017) in the Guardian. The clue is “Four card players wrapping party gifts (6)”:

<item ana="crs:ctpContainerContents">
  <seg ana="crs:ccpForm">
    <seg ana="crs:ccpConvention">Four card players</seg>
    <seg ana="crs:ccpSignal">wrapping</seg>
    <anchor xml:id="item_003_1"/>party
  </seg>
  <seg ana="crs:ccpDef" xml:id="item_003_2">gifts<anchor xml:id="item_003_3"/></seg>
  <seg ana="crs:ccpLength">(6)</seg>
  <span ana="crs:ccpMisdirection" from="#item_003_1" to="#item_003_3">
    The phrase <mentioned>party gifts</mentioned>
    crosses the definition/form boundary.</span>
  <span ana="crs:ccpMisdirection" from="#item_003_2" to="#item_003_3">
    The definition <mentioned>gifts</mentioned> is
    a noun in the context of the complete clue, but
    needs to be read as a verb to function as the
    definition.</span>
</item>

The answer, ENDOWS, is defined by “gifts.” The four card players are the points of the compass, ENWS (East, North, West and South), as used in writing on bridge and other four-player card games; inside these is a common British word for a party, “do.” Seasoned XML encoders may notice that one coherent phrase, “party gifts,” spans the boundary between the form (constructor) component and the definition component, creating a sort of overlapping-hierarchy phenomenon which undermines the solver’s ability to parse the clue correctly.
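
The @ana values in the example above use a crs: prefix and point at categories in the two taxonomies. A minimal sketch of how those declarations might look in the TEI header follows. The category identifiers are taken from the example; the taxonomy identifiers (crsClueTypes, crsClueComponents), the flat category structure, the wording of each description, and the assumption that the taxonomies live in the same file as the clues are all hypothetical and are not drawn from the project's actual schema.

```xml
<!-- Hypothetical sketch: category ids match the @ana values used in the
     encoded clue; descriptions and layout are assumptions for illustration. -->
<encodingDesc xmlns="http://www.tei-c.org/ns/1.0">
  <classDecl>
    <!-- Taxonomy of clue types (ctp prefix on ids) -->
    <taxonomy xml:id="crsClueTypes">
      <category xml:id="ctpContainerContents">
        <catDesc>One part of the answer is placed inside another part.</catDesc>
      </category>
    </taxonomy>
    <!-- Taxonomy of clue components (ccp prefix on ids) -->
    <taxonomy xml:id="crsClueComponents">
      <category xml:id="ccpDef">
        <catDesc>The definition of the answer.</catDesc>
      </category>
      <category xml:id="ccpForm">
        <catDesc>The instructions for constructing the answer.</catDesc>
      </category>
      <category xml:id="ccpConvention">
        <catDesc>A conventional equivalence, such as compass points for card players.</catDesc>
      </category>
      <category xml:id="ccpSignal">
        <catDesc>A word indicating the operation to perform, such as containment.</catDesc>
      </category>
      <category xml:id="ccpLength">
        <catDesc>The enumeration giving the length of the answer.</catDesc>
      </category>
      <category xml:id="ccpMisdirection">
        <catDesc>A feature of the surface reading that misleads the solver.</catDesc>
      </category>
    </taxonomy>
  </classDecl>
  <listPrefixDef>
    <!-- Assumes the taxonomies are in the same document as the clues, so
         crs:ccpDef resolves to #ccpDef. -->
    <prefixDef ident="crs" matchPattern="(.+)" replacementPattern="#$1">
      <p>Pointers with the crs: prefix refer to categories in the
         crossword taxonomies declared in this header.</p>
    </prefixDef>
  </listPrefixDef>
</encodingDesc>
```

With declarations along these lines, a processor could resolve every @ana pointer to a category and count clues by type or component, which is the kind of algorithmic analysis the project aims to support.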