Sögulegi íslenski trjábankinn (The Icelandic Historical Treebank)

  • Eiríkur Rögnvaldsson
  • Anton Karl Ingason
  • Einar Freyr Sigurðsson
  • Joel C. Wallenberg
Keywords: Treebank

Abstract

This article describes the background to and construction of the Icelandic Parsed Historical Corpus (IcePaHC), a million-word parsed historical corpus of Icelandic that has just been completed (Wallenberg et al. 2011). The corpus contains fragments of 60 texts ranging from the late twelfth century to the present day and serves a dual purpose: it is a cornerstone of Icelandic language technology and an invaluable tool for research on Icelandic diachronic syntax. The corpus is unusual in several ways. First, it was designed to serve as a tool for both language technology and syntactic research, and was developed by scholars with research experience in both diachronic syntax and computational linguistics. Secondly, the corpus spans almost ten centuries: the oldest texts were written in the final decades of the twelfth century and the youngest are from the first decade of the present century. Thirdly, the corpus contains over one million words and is thus among the largest parsed corpora published for any language. Fourthly, access to the corpus is completely open and free, requiring no registration or paperwork; the same is true of all the software used in its construction and of the other software developed within the project. In the present paper, the Introduction is followed by a description of the background to the treebank, whose origins lie in three different projects. Several aspects of the material in the treebank are then discussed: the selection of texts, their quality, and their conversion to modern Icelandic spelling. We then explain our decision to build a Penn-style treebank and offer an overview of the annotation process. Following a case study that shows how the treebank can be used to investigate historical differences in the use of the numeral/indefinite pronoun einn “one, a”, we present our open-source policy and set out “10 basic types of user freedom” for language resources.
The corpus has been made available via free download (https://linguist.is/icelandic_treebank/Download). Both the software and the corpus itself are distributed under the LGPL license (https://www.gnu.org/licenses/lgpl.html).
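As a rough illustration of how a Penn-style bracketed treebank could be queried in a study like the one on einn, the following minimal Python sketch parses a labeled bracketed tree and counts the part-of-speech tags under which a given lemma occurs. It is not the project's own tooling. The toy tree, the tag names, the `form-lemma` leaf convention, and the function names `parse_tree`, `leaves`, and `count_lemma` are all assumptions made for this example, and the parser expects every bracket to carry a label.

```python
from collections import Counter


def parse_tree(text):
    """Parse one labeled bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]          # assumes every bracket has a label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf token
                pos += 1
        pos += 1                     # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Yield (tag, token) pairs for every leaf in the tree."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield label, child
        else:
            yield from leaves(child)


def count_lemma(trees, lemma):
    """Count, per tag, the leaves whose lemma equals `lemma`.

    Assumes leaves are written as form-lemma (an assumption of this sketch).
    """
    counts = Counter()
    for tree in trees:
        for tag, token in leaves(tree):
            form, sep, lem = token.rpartition("-")
            if sep and lem == lemma:
                counts[tag] += 1
    return counts


# Hypothetical toy tree, invented for this example (not taken from the corpus).
TOY = "(IP-MAT (NP-SBJ (ONE-N einn-einn) (N-N maður-maður)) (VBDI kom-koma))"

counts = count_lemma([parse_tree(TOY)], "einn")
print(dict(counts))
```

Running the same count separately over texts grouped by century would give the kind of diachronic comparison the case study describes, although the study itself may rely on different tools and queries.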

Published
2021-06-21
Section
Peer-Reviewed