TEI 2017 Victoria, British Columbia, Canada November 11 - 15

XML Tues Nov 14, 11:00–12:30

Sex in the TEI: The TEI 2016 gender check (paper)

Vanessa Hannesschläger and Peter Andorfer

Vanessa Hannesschläger is a researcher at the Austrian Centre for Digital Humanities of the Austrian Academy of Sciences (ACDH-OEAW), where she is responsible for legal issues. She is involved in several projects in which she works on data modelling, digital editing, and in the outreach department. In addition, she is completing her PhD with the German department of the University of Vienna. Her research interests include legal frameworks of digital research, biography theory, archive theory, modern Austrian literature, and the contemporary developments of gender issues in society. For more information, please visit http://vanessahannesschlaeger.wordpress.com/.

Peter Andorfer studied history at Innsbruck University, where he finished a PhD in history with a thesis on the works of the Tyrolean peasant Leonhard Millinger (1753–1834). During an extended research period at the Herzog August Bibliothek in Wolfenbüttel (Lower Saxony, Germany), financed by a “Digital-Humanities Scholarship,” he published an online edition of Millinger’s main work The Depiction of the World. He has also worked on the topics “research data” and “scientific collections” in DARIAH-DE and maintains the webpage www.digital-archiv.at for developing and deploying different kinds of DH-projects.

The abstracts of the TEI Conference and Members’ Meeting 2016 were published by the hosts (Austrian Centre for Digital Humanities / Austrian Academy of Sciences) as TEI encoded XML documents on GitHub (Hannesschläger and Schopper 2016). This, and the fact that these documents were published under a CC-BY-SA-4.0 license, made it possible to take these data and “play” with them, for instance by building a web application to publish as well as analyse the data.
Among other things, the editors tagged the forenames of the authors with the corresponding <forename> element. This allowed us to ask about the gender distribution among the contributors to the conference. What started as a playful exercise in data mining, processing, and analysis led to categorical questions about how to assign gender information to persons and, especially, how to encode it. We decided to genderize the <forename>s rather than the <person>s and will explain this decision during our talk with reference to contemporary gender theory.
As far as the alignment of forenames and gender is concerned, this is a simple task, at least from a technical point of view. As described in detail on the tei2016app website (Andorfer and Hannesschläger 2016), we first looked for a comprehensive and structured list of forenames that had already been mapped to genders, e.g., a list of female forenames and a list of male forenames. Secondly, the tagged <forename> of the respective TEI abstract had to be checked against these lists.
The first list of gendered names we found was the one provided by Mark Kantrowitz, which is used, e.g., in the NLTK package (Kantrowitz 2017). This list was ingested into a Django-based web service and accessed by an XQuery script that iterated through all <forename> elements of the abstracts corpus, sent each forename to the service’s endpoint, and stored the returned answer.
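The lookup described above can be illustrated with a minimal sketch. It is written in Python rather than the XQuery and Django setup actually used, and it replaces the web service with two in-memory sets. The name sets, the sample abstract, and the function names are hypothetical, as is the choice to label a name found in both lists (or in neither) as "nomatch".

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature lists standing in for a real list of gendered names.
FEMALE_NAMES = {"vanessa", "anna"}
MALE_NAMES = {"peter", "mark"}

# Hypothetical, heavily shortened TEI abstract used only for illustration.
SAMPLE_ABSTRACT = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <author><persName><forename>Vanessa</forename></persName></author>
    <author><persName><forename>Peter</forename></persName></author>
    <author><persName><forename>Xyzzy</forename></persName></author>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""


def lookup_gender(forename):
    """Return 'female', 'male', or 'nomatch' for a single forename."""
    key = forename.strip().lower()
    in_female = key in FEMALE_NAMES
    in_male = key in MALE_NAMES
    # Assumption: names in both lists or in neither are left unassigned.
    if in_female and not in_male:
        return "female"
    if in_male and not in_female:
        return "male"
    return "nomatch"


def check_forenames(tei_xml):
    """Iterate over all <forename> elements and pair each with a label."""
    root = ET.fromstring(tei_xml)
    return [
        (el.text, lookup_gender(el.text or ""))
        for el in root.iterfind(".//tei:forename", TEI_NS)
    ]


results = check_forenames(SAMPLE_ABSTRACT)
```

In a setup closer to the one described, `lookup_gender` would send the forename to a service endpoint instead of consulting local sets; the iteration over the elements would stay the same.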
While simple from a technical viewpoint, from a gender studies viewpoint this approach was questionable because Kantrowitz does not provide information on how the list was compiled or what criteria were applied to group names into the categories female, male, and pet.
Other sources, such as genderize.io, not only provide more data but also give information about how the data was gathered and categorized. The most important argument for this data source was genderize.io’s claim that its data was assembled by scraping social network profiles, where people can declare their gender themselves.[1]
Solving the issue of finding an adequate data source led to the question of how to encode this scraped information in a useful and TEI-conformant way. The <sex> tag only allows encoding assumptions about a person’s sex, and <gender> only the morphological gender of a lexical item, but neither of these fits our needs, as we wanted to encode the gender a forename is most commonly associated with. As it turned out, the broader issue of how to encode a person’s sex has led to some lengthy debates in the TEI community,[2] none of which consider the distinction between sex and gender (West and Zimmerman 1987) or discuss the questionable practice of assigning either to a person other than oneself. The discussions focus on which values should be used (allowed) to encode a person’s sex but do not consider the question of whether and how a forename element could or should be gendered.
For the current project, we “solved” this issue by encoding the <forename>’s gender by means of a @type attribute. Concerning the values of this attribute, we came across the same issues that were discussed in the context of <sex>, e.g., whether we should encode gender information following some (ISO) standard or choose custom/arbitrary values. Finally, we decided to choose the values “female”, “male”, and “nomatch” (the latter for forenames that did not match any name gendered by genderize.io or by genderchecker.com, which was used to reconcile names not found in genderize.io).
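A minimal sketch of this encoding step, assuming the answers of the two sources have already been collected: each <forename> receives a @type of "female", "male", or "nomatch", with the second source consulted only when the first gives no usable answer. The function names, the sample fragment, and the dictionary of answers are hypothetical and do not reproduce the project's actual scripts.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_URI)


def type_value(primary, fallback=None):
    """Pick the @type value from a primary and an optional fallback answer."""
    for answer in (primary, fallback):
        if answer in ("female", "male"):
            return answer
    return "nomatch"


def annotate_forenames(tei_xml, answers):
    """Set @type on every <forename>; `answers` maps a lower-cased name
    to a (primary, fallback) pair of labels, either of which may be None."""
    root = ET.fromstring(tei_xml)
    for el in root.iter("{%s}forename" % TEI_URI):
        key = (el.text or "").strip().lower()
        primary, fallback = answers.get(key, (None, None))
        el.set("type", type_value(primary, fallback))
    return ET.tostring(root, encoding="unicode")


# Hypothetical input fragment and hypothetical source answers.
SAMPLE = (
    '<persName xmlns="http://www.tei-c.org/ns/1.0">'
    "<forename>Peter</forename><surname>Andorfer</surname></persName>"
)
HYPOTHETICAL_ANSWERS = {"peter": ("male", None)}

annotated = annotate_forenames(SAMPLE, HYPOTHETICAL_ANSWERS)
```

Keeping the decision logic in `type_value` separates the choice of attribute values from the XML handling, so a different value scheme (for instance, codes from a standard) would only require changing that one function.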
As a result, we can now say that 87 of the forenames of the contributors to the TEI conference were classified as male, 37 as female, and three as “nomatch”. Concerning authorship of the published papers, there were 39 abstracts with more male than female authors’ forenames, 14 abstracts with more female than male, eight abstracts with an equal distribution, and two abstracts with an unclear result (meaning that most names could not clearly be assigned a male or female gender).
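The per-abstract grouping can be sketched as follows. The exact rule used in the project is not spelled out here, so the sketch assumes that an abstract counts as "unclear" when more than half of its forenames are "nomatch"; the function name and category labels are likewise hypothetical. The reported figures are included only to check their totals (127 forenames, 63 abstracts).

```python
from collections import Counter


def classify_abstract(forename_types):
    """Group one abstract by the @type values of its authors' forenames."""
    counts = Counter(forename_types)
    female, male, nomatch = counts["female"], counts["male"], counts["nomatch"]
    # Assumption: "unclear" means most names could not be assigned,
    # i.e. more than half are "nomatch" (or there are no names at all).
    if not forename_types or nomatch * 2 > len(forename_types):
        return "unclear"
    if male > female:
        return "more male"
    if female > male:
        return "more female"
    return "equal"


# Figures as reported in the text, used only to compute the totals.
REPORTED_FORENAMES = {"male": 87, "female": 37, "nomatch": 3}
REPORTED_ABSTRACTS = {"more male": 39, "more female": 14, "equal": 8, "unclear": 2}

total_forenames = sum(REPORTED_FORENAMES.values())
total_abstracts = sum(REPORTED_ABSTRACTS.values())
```

Under this rule, an abstract with one male, one female, and one unmatched forename is grouped as "equal", since only one of three names is unassigned.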


  1. However, we do not have full confidence in the claim that the data was gathered from social networks, because genderize.io knows only two genders, whereas platforms like Facebook already offer many more choices.
  2. E.g., https://github.com/TEIC/TEI/issues/426