Ms. Xi Wangzhi will present her paper “Semantic Entity Recognition in Historical Texts: Extracting Information from Late Imperial China’s Inscriptions on Material Infrastructure” at the DH Benelux Conference on June 6 in Session 5a: Textual Analysis and Stylometry. The session starts at 10.30 am in Conference Room 2.
This research undertakes a Semantic Entity Recognition (SER) task, an extension of traditional Named Entity Recognition (NER), focusing on inscriptions on material infrastructure in late imperial China. SER, in this context, is not limited to identifying named entities such as persons, locations, and organizations. Instead, it encompasses a more comprehensive range of entities, including spatial, temporal, descriptive, quantitative, qualitative, and conceptual ones. This broadened scope is essential for extracting and interpreting multi-layered information in historical sources. Recent advancements in large pre-trained language models have highlighted their potential in various natural language processing (NLP) tasks, even with small labeled datasets.
This work aims to extract information automatically from our annotated data using NER methods. Our primary sources are stele inscriptions related to city walls, bridges, and roads recorded in fangzhi 方志, commonly translated as “local gazetteers”. Our current corpus consists of 467 annotated inscriptions from various provinces in China, totaling 280,823 characters and containing 31,455 occurrences of 16 different entity types. Preliminary experiments using classical Chinese BERT models have shown the potential of SER on our dataset.
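To illustrate how span annotations of this kind are commonly prepared for a BERT-style token classifier, the following minimal sketch converts character-level entity spans into BIO tags. It is an assumption for illustration only: the abstract does not state the annotation format or tagging scheme used, and the sample phrase and the labels `TIME` and `STRUCTURE` are invented rather than taken from the 16 entity types in the corpus.

```python
from typing import List, Tuple

# (start, end_exclusive, label): a hypothetical span format, not the project's own.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert character-level entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span ({start}, {end}) overlaps another span")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


if __name__ == "__main__":
    # Invented example phrase; the labels are placeholders.
    sample = "萬曆三年修城"
    annotations = [(0, 4, "TIME"), (5, 6, "STRUCTURE")]
    for char, tag in zip(sample, spans_to_bio(sample, annotations)):
        print(char, tag)
```

Tagging per character rather than per word avoids a separate word-segmentation step, which is a common choice for classical Chinese; whether the project does this is not stated in the abstract.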