Sex in the TEI: The TEI 2016 gender check

Vanessa Hannesschläger (vanessa.hannesschlaeger@oeaw.ac.at)
Vanessa Hannesschläger is a researcher at the Austrian Centre for Digital Humanities of the Austrian Academy of Sciences (ACDH-OEAW), where she is responsible for legal issues. She is involved in several projects in which she works on data modelling and digital editing, and she also works in the outreach department. In addition, she is completing her PhD with the German department of the University of Vienna. Her research interests include legal frameworks of digital research, biography theory, archive theory, modern Austrian literature, and the contemporary developments of gender issues in society. For more information, please visit http://vanessahannesschlaeger.wordpress.com/.

Peter Andorfer (peter.andorfer@oeaw.ac.at)
Peter Andorfer studied history at Innsbruck University, where he finished a PhD in history with a thesis on the works of the Tyrolean peasant Leonhard Millinger (1753–1834). During an extended research period at the Herzog August Bibliothek in Wolfenbüttel (Lower Saxony, Germany), financed by a Digital-Humanities Scholarship, he published an online edition of Millinger’s main work, The Depiction of the World. He has also worked on the topics of research data and scientific collections in DARIAH-DE and maintains the webpage www.digital-archiv.at for developing and deploying different kinds of DH projects.

TEI Consortium

Creative Commons Attribution 4.0 International

No source, born digital.

TEI 2017 Conference Abstracts.

en paper gender conference contributors data enrichment Tracey El Hajj encoded the file


The abstracts of the TEI Conference and Members’ Meeting 2016 were published by the hosts (Austrian Centre for Digital Humanities / Austrian Academy of Sciences) as TEI-encoded XML documents on GitHub (Hannesschläger and Schopper 2016). This, together with the fact that the documents were published under a CC-BY-SA-4.0 license, made it possible to take these data and experiment with them, for instance by building a web application that both publishes and analyses the data.

Among other things, the editors tagged the forenames of the authors with the corresponding forename element. This allowed us to ask about the gender distribution among the contributors to the conference. What started as a playful exercise in data mining, processing, and analysis led to categorical questions about how to assign gender information to persons and, especially, how to encode it. We decided to genderize the forenames rather than the persons and will explain this decision during our talk with reference to contemporary gender theory.

As far as the alignment of forenames and gender is concerned, this is a simple task, at least from a technical point of view. As described in detail on the tei2016app website (Andorfer and Hannesschläger 2016), we first looked for a comprehensive and structured list of forenames that had already been mapped to genders, e.g., a list of female forenames and a list of male forenames. Secondly, the tagged forename in each TEI abstract had to be checked against these lists.

The first list of gendered names we found was the one provided by Mark Kantrowitz, which is used, e.g., in the NLTK package (Kantrowitz 2017). This list was ingested into a Django-based web service and accessed by an XQuery script that iterated through all forename elements of the abstracts corpus, sent each forename to the service’s endpoint, and stored the returned answer.
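The two steps described above (a gendered name list, then a check of every tagged forename against it) can be sketched as follows. This is a minimal illustration in Python, not the project's Django service or XQuery script: the name sets, the sample document, and the function names `lookup_gender` and `check_forenames` are hypothetical, and treating a name that appears on both lists as unmatched is an assumption made here.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature name lists standing in for the full gendered lists.
FEMALE_NAMES = {"vanessa", "tracey", "robin"}
MALE_NAMES = {"peter", "daniel", "robin"}


def lookup_gender(forename):
    """Return 'female', 'male', or None if the name is absent or on both lists."""
    key = forename.strip().lower()
    in_female = key in FEMALE_NAMES
    in_male = key in MALE_NAMES
    if in_female and not in_male:
        return "female"
    if in_male and not in_female:
        return "male"
    return None


def check_forenames(tei_xml):
    """Look up every tei:forename element of the document, in document order."""
    root = ET.fromstring(tei_xml)
    results = []
    for element in root.iterfind(".//tei:forename", TEI_NS):
        name = (element.text or "").strip()
        results.append((name, lookup_gender(name)))
    return results


# Hypothetical sample document; the real abstracts are far richer.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <persName><forename>Vanessa</forename><surname>Hannesschläger</surname></persName>
  <persName><forename>Peter</forename><surname>Andorfer</surname></persName>
  <persName><forename>Robin</forename><surname>Example</surname></persName>
</TEI>"""

print(check_forenames(SAMPLE))
```

In the sketch the lookup is a local function; in a service-based setup the same call would be replaced by a request to the service's endpoint.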

While simple from a technical viewpoint, from a gender studies viewpoint this approach was questionable because Kantrowitz does not provide information on how the list was compiled or what criteria were applied to group names into the categories female, male, and pet.

Other sources, such as genderize.io, not only provide more data but also give information about how the data was gathered and categorized. The most important argument for this data source was genderize.io’s claim that its data was assembled by scraping social network profiles, where people can declare their gender themselves.

However, it has to be mentioned that we are not fully confident that the data was indeed gathered from social networks, because genderize.io only knows two genders, whereas platforms like Facebook already offer many more choices.

Solving the issue of finding an adequate data source led to the question of how to encode this scraped information in a useful and TEI-conformant way. The sex element only allows encoding assumptions about a person’s sex, and gender encodes the morphological gender of a lexical item, but neither of these fits our needs, as we wanted to encode the gender a forename is most commonly associated with. As it turned out, the broader issue of how to encode a person’s sex has led to some lengthy debates in the TEI community, none of which consider the distinction between sex and gender (West and Zimmerman 1987) or discuss the questionable practice of assigning either to a person other than oneself. The discussions focus on which values should be used (allowed) to encode a person’s sex but do not consider the question of whether and how a forename element could or should be gendered.

For the current project, we solved this issue by encoding the forename’s gender with the help of a type attribute. Concerning the values of this attribute, we came across the same issues that were discussed in the context of sex, e.g., should we encode gender information following some (ISO) standard or choose custom, arbitrary values? Finally, we decided to use the values female, male, and nomatch. The latter is used for forenames that did not match any name gendered by genderize.io or by genderchecker.com, which was used to reconcile names not found in genderize.io.
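As an illustration of this encoding, the following Python sketch writes one of the three values female, male, or nomatch into a type attribute on each forename element. It is a minimal sketch under stated assumptions, not the project's actual enrichment code: the `GENDERS` mapping, the sample fragment, and the function name `annotate_forenames` are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI)

# Hypothetical lookup results; a real run would take them from the name service.
GENDERS = {"vanessa": "female", "peter": "male"}


def annotate_forenames(tei_xml, genders):
    """Set type='female' | 'male' | 'nomatch' on every forename element."""
    root = ET.fromstring(tei_xml)
    for element in root.iter(f"{{{TEI}}}forename"):
        key = (element.text or "").strip().lower()
        element.set("type", genders.get(key, "nomatch"))
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample fragment.
SAMPLE = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <persName><forename>Vanessa</forename></persName>
  <persName><forename>Peter</forename></persName>
  <persName><forename>Xyzzy</forename></persName>
</listPerson>"""

print(annotate_forenames(SAMPLE, GENDERS))
```

Keeping the value on the forename element rather than on the person reflects the decision to genderize names, not persons.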

As a result, we can now say that 87 forenames of the contributors to the TEI conference were male, 37 were female, and three were no-matches. Concerning authorship of the published papers, there were 39 abstracts with more male than female author forenames, 14 abstracts with more female than male, eight abstracts with an equal distribution, and two abstracts with an unclear result (meaning that most names could not clearly be assigned a male or female gender).
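The per-abstract grouping reported above can be sketched as a small classification function. This is a hedged illustration only: the function name `classify_abstract` and the category labels are hypothetical, and reading "unclear" as "more than half of the forenames are nomatch" is an assumption about how the rule was applied.

```python
from collections import Counter


def classify_abstract(labels):
    """Classify one abstract from the type values of its authors' forenames."""
    counts = Counter(labels)
    female = counts["female"]
    male = counts["male"]
    nomatch = counts["nomatch"]
    total = female + male + nomatch
    # Assumption: 'unclear' means most names could not be matched.
    if nomatch * 2 > total:
        return "unclear"
    if male > female:
        return "more male"
    if female > male:
        return "more female"
    return "equal"


# Hypothetical input: one list of forename labels per abstract.
abstracts = [
    ["male", "male", "female"],
    ["female"],
    ["male", "female"],
    ["nomatch", "nomatch", "male"],
]
print(Counter(classify_abstract(labels) for labels in abstracts))
```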

Andorfer, Peter, and Vanessa Hannesschläger. 2016. Gender distribution among the contributors to TEI 2016. tei2016app.

Hannesschläger, Vanessa, and Daniel Schopper. 2017. Book of Abstracts in TEI XML. TEI Conference and Members’ Meeting 2016.

Kantrowitz, Mark. 2017. Name Corpus: List of Male, Female, and Pet names. CMU Artificial Intelligence Repository. Last modified: 02-Apr-1997.

West, Candace, and Don H. Zimmerman. 1987. Doing Gender. Gender and Society 1(2): 125–151.