Abstracts 2024 Edition

Location: rooms 02.28 (Thursday, November 28th) and 00.28 (Friday, November 29th), KU Leuven, Mgr. Sencie Instituut, Erasmusplein 2, 3000 Leuven (Belgium) & online. All times indicated are CET.

Abstracts November 28th, 2024

14.30-15.15 Marijke Beersmans and Evelien de Graaf (KU Leuven)
Automating NEL for Ancient Greek and Latin: Initial Proposals and Experiments
Named Entity Linking (NEL) or Disambiguation, often considered a sub-problem of Entity Linking (EL), focuses on the unambiguous identification of entities in text. For Ancient Greek and Latin, NEL has primarily been performed manually on individual texts or small corpora, as no fully automated pipeline has been published for this task.
In this presentation, we summarise our initial proposals and experiments in developing an automated NEL method for Ancient Greek and Latin. A short overview of the current state of NEL in this field will highlight the lack of standard annotation practices and the absence of consensus on a resource for identifying and linking unique individuals. We then explain our rationale for choosing Paulys Realencyclopädie der classischen Altertumswissenschaft (RE) as a Knowledge Base (KB) for disambiguating individuals. An evaluation of the RE for the manual annotation of literary Greek and Latin texts will paint a clear picture of its coverage, potential, and limitations.
The second part of our presentation will focus on our experiments with Machine Learning methods for NEL. We create a small NEL dataset by working backwards from mentions of Ancient Greek text sources in the RE, linking them back to editions of the original texts in the GLAUx corpus using various Trismegistos and GLAUx identifiers. We then use this dataset to train and test a deep learning entity linking model, BLINK. This model has the advantage of being conceptually simple while, in theory, remaining applicable in the challenging setting of this project, which is simultaneously multilingual and low-resource, and involves linking to a custom knowledge base. This presentation will detail preliminary results.
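The retrieval step of a BLINK-style bi-encoder can be illustrated in miniature: the mention in context and every KB entry are encoded as vectors, and entries are ranked by dot-product score. The vectors below are toy values for illustration, not outputs of the actual model or entries of the RE:

```python
import numpy as np

def rank_candidates(mention_vec, entity_vecs):
    """Rank KB entries by dot-product score against a mention encoding,
    the core retrieval step of a bi-encoder entity linker."""
    scores = entity_vecs @ mention_vec
    return np.argsort(-scores)

# Toy illustration: three hypothetical KB entries, one mention.
entity_vecs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
mention_vec = np.array([0.9, 0.1])
ranking = rank_candidates(mention_vec, entity_vecs)
print(ranking[0])  # entry 0 scores highest
```

In the full model, a second cross-encoder stage typically re-ranks the top retrieved candidates.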

15.15-16.00 Julia Jennifer Beine (University of Vienna)
Gotcha! Catching Schemers in Roman Comedy and its Receptions
In this talk, I will show how to create a data profile of scheming slaves and servants in ancient and early modern comedies using network analysis. This method will be exemplified by analysing texts incorporated in DraCor, a digital infrastructure for programmable corpora. I will especially reflect on the potential and limitations of network analysis for the study of European drama. Finally, I will elaborate on how network analysis allows for studying cross-epochal phenomena such as classical receptions.
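As a minimal sketch of the kind of analysis involved (character names and weights are invented, not drawn from DraCor), a co-presence network can be built and queried for its most central figure:

```python
import networkx as nx

# Toy co-presence network for a hypothetical comedy: an edge links two
# characters who share a scene; the weight counts shared scenes.
G = nx.Graph()
G.add_weighted_edges_from([
    ("slave", "master", 5),
    ("slave", "rival", 3),
    ("master", "rival", 1),
    ("slave", "lover", 2),
])

# Weighted degree is one simple proxy for a character's centrality.
wdeg = dict(G.degree(weight="weight"))
print(max(wdeg, key=wdeg.get))  # the scheming slave dominates the network
```

DraCor itself exposes such character networks programmatically, so the same measures can be computed across whole corpora of plays.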

16.30-17.15 Thea Sommerschield (University of Nottingham)
From Antiquity to ACL: following an inscription’s journey through machine learning methods, tools and risks
In this talk I will survey the many aspects of the study of ancient inscriptions which can today be augmented or assisted by AI tools, and what new directions of study this nascent field might take. We will trace the evolution of AI techniques available to historical researchers, from early systems for reading the stylus texts of Vindolanda to the most recent advancements presented at the ACL 2024 workshop ‘Machine Learning for Ancient Languages’ held in Bangkok, Thailand. Throughout this journey through tasks, modes and languages, we will focus on what challenges in AI for epigraphy are yet to be tackled, what risks in the available data we should be aware of, and on the principles of responsible AI use in the context of the ancient world.

17.15-18.00 Giuseppe G. A. Celano (Leipzig University)
Opera Graeca Adnotata and Opera Latina Adnotata: the creation of multi-layer corpora for Ancient Greek and Latin
The contribution presents the creation of two multi-layer corpora for Ancient Greek and Latin, i.e., Opera Graeca Adnotata (OGA) and Opera Latina Adnotata (OLA), with reference to their design and the challenges of representing different kinds of annotations across thousands of texts and millions of tokens. A multi-layer corpus is a digital linguistic resource in which textual data is annotated across multiple levels of linguistic information in a scalable way, thus allowing for cross-searches. OGA and OLA employ the PAULA XML formalism, in which different levels of token-based annotations are connected in a graph structure. I will focus on the morphosyntactic annotation of the texts and the issues related to the development of machine learning models for it, which include annotation scheme inconsistencies and the comparison of different models. A hands-on demo will also show how to query annotations with ANNIS, a browser-based search tool.
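The idea behind token-based multi-layer annotation can be pictured as independent layers keyed by token index, which is what makes cross-layer searches possible. The sketch below is a simplified illustration (with simplified lemmatisation), not OGA/OLA's actual PAULA XML representation:

```python
# Each layer maps token indices to labels; layers stay independent but
# share the token index, so they can be queried jointly.
tokens = ["Arma", "virumque", "cano"]
layers = {
    "lemma": {0: "arma", 1: "uir", 2: "cano"},   # clitic -que omitted for brevity
    "upos":  {0: "NOUN", 1: "NOUN", 2: "VERB"},
}

def cross_search(layers, **conditions):
    """Return token indices matching labels on several layers at once."""
    hits = set(range(len(tokens)))
    for layer, label in conditions.items():
        hits &= {i for i, v in layers[layer].items() if v == label}
    return sorted(hits)

print(cross_search(layers, upos="NOUN", lemma="uir"))  # [1]
```

Tools such as ANNIS express the same kind of cross-layer constraint declaratively, in its AQL query language.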


Abstracts November 29th, 2024

09.00-09.45 Francesco Mambrini (Catholic University, Milan)
Harmonizing Ancient Greek treebanks in Universal Dependencies. Challenges and perspectives
Over the past decade, Ancient Greek treebanks have grown significantly, now surpassing 1 million words with morphosyntactic annotations spanning various genres and periods of Greek literature (from the Homeric poems to documentary papyri). Moreover, treebanks have already been used in a series of studies on the language and style of Ancient Greek texts and on language teaching. The usability of the Greek treebanks, however, is limited by a series of factors. The major collections (PROIEL, Perseus’ AGLDT and Pedalion) adopt different formalisms of dependency grammar and different tagsets. More insidiously, they show a wide variance in the interpretation of the same morphosyntactic phenomena, both across the different projects and within the same collection.
Universal Dependencies (UD) provides a unified annotation framework, with a set of common guidelines suitable for annotating morphosyntactic phenomena in both ancient and modern languages; it is therefore well-suited to addressing the fragmentation of the Ancient Greek treebanks. However, while several projects have either adopted the UD standard for annotation (PTNK) or provide versions converted to UD (PROIEL, Perseus’ AGLDT), the harmonization problem is far from solved. The most serious problems are due to: incomplete conversion of the texts, outdated conversions (not revised according to the latest version of the guidelines), lack of annotation for relevant phenomena (e.g. ellipsis), and inconsistent treatment of many morphosyntactic phenomena.
This talk outlines the primary challenges encountered in aligning existing treebanks with the UD annotation guidelines. We discuss the main types of inconsistencies found across the different treebank collections and within the same corpora. Additionally, we consider the different approaches attempted to detect inconsistency in annotated corpora and discuss the results of applying some of these methods to the available UD treebanks.
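One simple heuristic for surfacing annotation inconsistencies is to flag identical forms that received different dependency relations across a corpus, so they can be queued for manual review. The CoNLL-U fragment below is invented for illustration; real variation-detection methods also condition on context:

```python
from collections import defaultdict

# Invented CoNLL-U lines: the same form annotated with two different deprels.
conllu = """\
1\tMousas\tMousa\tNOUN\t_\t_\t2\tobj\t_\t_
1\tMousas\tMousa\tNOUN\t_\t_\t2\tnsubj\t_\t_
1\taeide\taeido\tVERB\t_\t_\t0\troot\t_\t_
"""

deprels = defaultdict(set)
for line in conllu.splitlines():
    cols = line.split("\t")
    form, deprel = cols[1], cols[7]   # FORM and DEPREL columns
    deprels[form].add(deprel)

# Forms carrying more than one relation are candidates for manual review.
suspects = sorted(f for f, rels in deprels.items() if len(rels) > 1)
print(suspects)  # ['Mousas']
```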

09.45-10.30 Daniela Santoro and Chiara Zanchi (University of Pavia)
WordNets for ancient languages: state of the art and the contribution of LLMs
In our presentation, after briefly introducing WordNets and their architecture, we focus on the research conducted within the framework of the Linked WordNets for Ancient Indo-European Languages project. Our lexical databases combine WordNet’s neo-structuralist view of meaning with Cognitive Semantics, while also incorporating language-specific features. Within this project, we are pursuing various lines of research: (i) we manually validate existing synsets generated using the so-called expand method; (ii) we focus on specific areas of the lexicon, such as temperature terms; and (iii) we explore the potential of using Large Language Models (LLMs) to refine the synsets of our WordNets. In particular, we will present a study that explores the integration of LLMs into the Latin WordNet for the automatic generation of Latin synsets. Utilizing Mistral-7B for its balance between performance and efficiency, the research initially employed prompt tuning in zero-shot and few-shot modes to evaluate the model’s flexibility in task adaptation. The implementation also includes fine-tuning with LoRA (Low-Rank Adaptation), leveraging existing data from the Latin WordNet. These methodologies aim to optimize synset generation, addressing the challenge of limited training data availability.
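The few-shot setup can be sketched as a prompt assembled from existing WordNet entries; the prompt wording and the example synonym lists below are invented for illustration, and no model call is shown (the project itself uses Mistral-7B with prompt tuning and LoRA fine-tuning):

```python
# Hypothetical few-shot examples: (lemma, gloss, known synset members).
examples = [
    ("amor", "love, affection", ["caritas", "dilectio"]),
    ("bellum", "war", ["proelium", "pugna"]),
]

def build_prompt(lemma, gloss, shots):
    """Assemble a few-shot prompt asking the model to extend a synset."""
    lines = ["List Latin synonyms for the target lemma."]
    for lem, gl, syns in shots:
        lines.append(f"Lemma: {lem} ({gl}) -> {', '.join(syns)}")
    lines.append(f"Lemma: {lemma} ({gloss}) ->")   # model completes this line
    return "\n".join(lines)

prompt = build_prompt("gaudium", "joy", examples)
print(prompt.splitlines()[-1])  # Lemma: gaudium (joy) ->
```

In the zero-shot mode, the same prompt is issued without the worked examples; fine-tuning instead bakes such input–output pairs into the model's weights.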

11.00-11.45 Silvia Stopponi and Saskia Peels-Matthey (University of Groningen)
Evaluating lexical semantic change detection for Ancient Greek with Word Embeddings
Lexical semantic change detection for Ancient Greek remains a relatively underexplored area. Prior to Stopponi et al. (2024), no study employed word embedding models or applied specific metrics to identify semantic shifts in this language.
In Stopponi et al. (2024), we trained diachronic word embedding models and detected semantic change by means of Vector Coherence (VC) and the J measure, two metrics previously adopted by Cassani et al. (2021). Manual evaluation of the results demonstrated that both VC and J effectively capture semantic stability, while VC proved more reliable than J in identifying instances of semantic change.
However, a systematic assessment of the performance of these measures, including a comparison with external resources, was still missing. For this purpose, we built a benchmark by extracting attested cases of semantic change in Ancient Greek from existing scholarly work. We then used this resource, a list of Ancient Greek words whose meaning changed over time, to assess whether the VC measure can detect such changes. In this talk we present the building of the benchmark, including its challenging aspects, and the results of the evaluation, including their statistical significance.
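The underlying intuition of embedding-based change detection can be conveyed with a generic cosine-based shift score over aligned diachronic embeddings. The vectors below are toy values, and this is a simplified illustration rather than the exact VC or J formulations:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy aligned embeddings for one word in two periods (invented values);
# a low cosine between periods flags a candidate semantic shift.
archaic = np.array([1.0, 0.0, 0.0])
classical = np.array([0.0, 1.0, 0.1])
shift_score = 1.0 - cosine(archaic, classical)
print(round(shift_score, 2))  # 1.0: maximal shift for these toy vectors
```

A benchmark of attested changes then lets one check whether words known to have shifted actually receive higher scores than stable control words.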

11.45-12.30 Andrea Farina (King’s College London)
Annotating preverbed motion verbs in Latin and Ancient Greek. Quantitative studies and future directions
Preverbs, morphemes prefixed to verbal bases, play a key role in many languages, including Latin and Ancient Greek. Traditionally, studies on preverbs have relied on qualitative approaches, focusing on individual examples and theoretical interpretations. This talk aims to present a data-driven, corpus-based quantitative analysis of preverbs, providing new insights through computational methods. The corpus (541,620 tokens) spans Latin and Ancient Greek texts from the 8th century BCE to the 2nd century CE. I will discuss the challenges of manually annotating over 2,800 occurrences of preverbed motion verbs, with annotations performed at multiple levels (morphological, syntactic, semantic). Data extraction and analysis were carried out using automatic methods, which allowed for the systematic processing of large-scale textual data. This quantitative approach has revealed significant patterns in preverb usage, which I will present and discuss. Beyond the linguistic focus, I will also comment on how these analyses can contribute to cultural analytics, offering insights into the ancient world. Finally, I will outline the structure of PrevNet, a digital resource in development. PrevNet will provide extensive linguistic data on Latin and Ancient Greek preverbs and motion verbs, supporting future research and enabling cross-linguistic comparative studies. This resource will be useful not only for linguists but also for scholars interested in the broader cultural and historical contexts of ancient languages.
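The kind of frequency pattern such a corpus study surfaces can be sketched with a simple count over annotated preverb–verb pairs; the pairs below are invented examples, not data from the 541,620-token corpus:

```python
from collections import Counter

# Hypothetical annotated occurrences: (preverb, preverbed motion verb).
annotated = [
    ("ad", "advenio"), ("ex", "exeo"), ("ad", "adeo"),
    ("in", "ineo"), ("ex", "exsilio"), ("ad", "adgredior"),
]
counts = Counter(prev for prev, verb in annotated)
print(counts.most_common(1))  # [('ad', 3)]
```

On the real multi-level annotations, the same tabulation can be broken down by language, period, or semantic class of the motion verb.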

13.45-14.30 Colin Swaelens (Ghent University)
Similarity Detection: A Starting Point for Greek
Antique literature survived thanks to scribes painstakingly copying texts from one manuscript to another, prior to the art of printing. Occasionally, these scribes added metrical paratexts to the manuscripts, i.e. texts standing next to the main text (Genette, 1987), introduced into Byzantine scholarship by Lauxtermann (2003) as book epigrams. Ghent University’s Database of Byzantine Book Epigrams (Ricceri et al., 2023) stores more than 12,000 such epigrams as verbatim transcriptions, precisely as they are found in the manuscripts. This entails that the Greek of these epigrams is interspersed with orthographic inconsistencies, mainly due to phonetic changes such as itacism. These verbatim transcriptions are called occurrences and are grouped under one or more so-called types, readable representations of their occurrences in standardised, classical Greek. Eventually, we aim to develop a dynamic system to group hemistichs, verses and epigrams based on distinct similarity measures, so that scholars can find all kinds of similar texts instead of only the ones that come to mind. While developing these similarity measures, as with any other algorithm, evaluation is an essential part of the process. However, a gold standard for the evaluation of verse similarity measures does not exist. At this point, we have already conducted a pilot study on the pairwise annotation of 2 verses with 10 annotators. Each verse was set alongside six pairs of verses, from which the annotator had to mark the one they considered most similar. The inter-annotator agreement (IAA) was 57.69%, which counts as moderate agreement (Landis & Koch, 1977). This agreement score is the arithmetic mean of the agreement between each pair of annotators, as all annotators annotated the exact same set of verses. Despite the rather modest size of this pilot study, it is possible to unravel the distinct lines of reasoning of the annotators.
They did not receive detailed instructions for the annotation process, so each annotator was free to choose their own focal point. The most remarkable of those focal points was the metre: one annotator based their judgement on the number of syllables in a verse. The majority, however, seemed to take syntax as the decisive factor in determining the most similar verse; semantics was decisive only if the syntax of both options was identical. While the gold standard is being annotated, we have already started computing similarity between words. These similarities will, in a next stage, be used to compute similarity between (half) verses. The main goal of the experiment is to find out whether transformer embeddings take into account enough context to find identical or similar words with deviant orthography.
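The arithmetic-mean pairwise agreement described above can be computed as follows; the annotator choices here are invented, and with a shared item set this mean coincides with simple pairwise percent agreement:

```python
from itertools import combinations

# Toy annotations: each annotator picks the most similar option (A-F)
# for the same two target verses.
annotations = {
    "ann1": ["A", "C"],
    "ann2": ["A", "D"],
    "ann3": ["B", "C"],
}

def pairwise_agreement(a, b):
    """Fraction of items on which two annotators made the same choice."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Arithmetic mean over every annotator pair.
pairs = list(combinations(annotations.values(), 2))
iaa = sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)
print(round(iaa, 2))  # 0.33 for these toy annotations
```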

14.30-15.15 Dominique Longrée and Valérie Thon (University of Liège)
Detection of Textual Motifs and Multichannel Deep Learning with LASLA lemmatized and tagged files: two case studies
We wish to explore the difference between Deep Learning applied simply to the lexical forms of a text and Multichannel Deep Learning, which also takes into account lemmas and morphosyntax. More precisely, the idea is to better understand the criteria, that is, the ‘textual motifs’, on which Deep Learning relies for its classification. We will propose two case studies that also illustrate the possibilities of the Hyperdeep software: a study of intertextuality (Ovid) and an attempt to date some letters written by Peter Damian.

15.45-16.30 Thibault Clérice (Inria Paris & Federico II University, Naples)
Sentence classification and semantics: identifying sentences with sexual semantics in Latin from 300 BCE to 900 CE
In this study, we propose to evaluate the use of deep learning methods for semantic classification at the sentence level, in order to accelerate the process of corpus building in the humanities and linguistics, a traditional and time-consuming task. We introduce a novel corpus of around 2,500 sentences spanning 300 BCE to 900 CE that involve sexual semantics (medical, erotic, etc.), based on the foundational work of J. N. Adams. We evaluate various sentence classification approaches and different input embedding layers, and show that all consistently outperform simple token-based searches. We explore the integration of idiolectal and sociolectal metadata embeddings (century, author, type of writing), but find that this leads to overfitting.
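The token-based search baseline that the classifiers are compared against can be sketched as a seed-lexicon lookup over lemmatised sentences; the seed terms and sentences below are invented illustrations, not the study's corpus:

```python
# Hypothetical seed lexicon of Latin lemmas in the target semantic field.
seed_lemmas = {"coitus", "futuo"}

def token_baseline(lemmas):
    """Flag a sentence if any of its lemmatised tokens hits the lexicon."""
    return any(l in seed_lemmas for l in lemmas)

sentences = [
    ["de", "coitus", "animal", "scribo"],    # lemmatised, invented
    ["Gallia", "sum", "omnis", "divido"],
]
print([token_baseline(s) for s in sentences])  # [True, False]
```

A lexicon lookup of this kind misses euphemistic or metaphorical uses with no seed term present, which is precisely where sentence-level classifiers can outperform it.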

16.30-17.15 Frederick Riemenschneider (Heidelberg University)
Ira ex machina. Multilingual Models and Emotion Analysis in Classical Texts
Computational approaches to Latin and Ancient Greek present unique challenges due to the nature of these languages, their closed corpora, and the specialized research questions they often provoke. In this talk, I will showcase how multilingual language models can address some of these challenges.
I will first discuss whether, when, and to what extent cross-lingual transfer aids in building effective methods for Latin and Greek, as well as the obstacles researchers face in adapting modern tools to process ancient texts.
In the second part of my talk, I will introduce RAGE (Roman and Greek Emotions), a recent project that uses Semantic Role Labeling to analyze emotions in Latin and Greek literature. I will demonstrate how we employ multilingual language models within RAGE, using both large pre-trained models with algorithmically optimized prompts and smaller models fine-tuned through active learning. Moreover, the project integrates Named Entity Recognition and Linking to identify and connect characters within these texts, offering new ways to interpret emotions in Latin and Greek works.