OCR of Swedish texts

Date

16 May 2024, 13:00–15:00

Location

C201A

Event type

Workshop

Open

Open to the public

In this workshop we share our experiences and talk about our ongoing work on Optical Character Recognition (OCR) applied to Swedish texts. 

13:00–13:30 OCR at KBLab (Abstract)
Robin Kurtz, The National Library of Sweden (Kungliga Biblioteket, KB)

13:30–14:00 OCR challenges and solutions in historical document digitalization (Abstract)
Erik Lenas and Viktoria Löfgren, The National Archives

14:00–14:30 OCR error correction of Swedish printed texts at Språkbanken Text (Abstract)
Dana Dannélls, Språkbanken Text

14:30–15:00 Coffee break

15:00–16:00 Discussions

Abstracts

OCR at KBLab 

Robin Kurtz, The National Library of Sweden

The National Library of Sweden (Kungliga Biblioteket, KB) collects, preserves, and gives access to almost everything that is published in Sweden. To facilitate access for both researchers and the general public, KB aims to digitize most of its collections. Because of the sheer size of these collections, optical character recognition (OCR) is applied to make them searchable and usable for researchers interested in statistical analysis or language technology. KBLab aims to develop its own OCR model to redo and improve upon the existing datasets, but this is a large-scale undertaking given the size of KB's collections and does not allow for rapid iteration. Instead, we focus on post-OCR correction to improve the OCR quality iteratively, with an emphasis on newspaper texts and older governmental proceedings.

OCR challenges and solutions in historical document digitalization 

Erik Lenas and Viktoria Löfgren, The National Archives

In this workshop we will present an upcoming OCR project at the Swedish National Archives, in which we aim to produce high-quality OCR of 5 billion words from 19th-century Swedish newspapers (kubhist2). We will explore two methodologies. The first revisits an existing OCR of this corpus and tries to improve it, using transformer models for post-correction. The second starts from scratch, fine-tuning modern OCR transformer architectures and layout models on in-domain data and running them on the entire corpus. We assume that the choice between post-correcting an older OCR and redoing it from scratch is affected by the fact that the text is historical, but this assumption needs to be tested within the project. We will briefly go through what we have done so far, both with post-correction using transformer models and with training our own OCR transformer models on the kubhist2 corpus. We will then present a few of the challenges whose solution we see as crucial for the success of the project. We will also touch on the importance of this project and, more generally, of obtaining high-quality digital text from scanned historical documents.

OCR error correction of Swedish printed texts at Språkbanken Text  

Dana Dannélls, Språkbanken Text

Språkbanken Text serves as a national center for the collection and refinement of language resources, primarily Swedish text corpora and lexicon resources. These collections are freely available to the general public for research, teaching, and other purposes. Searches are facilitated by an underlying machine-readable text annotation layer, enabling users to search for specific content. The annotations are accurate for the majority of texts, ensuring reliable search results. However, historical texts, originally converted from scanned historical documents using Optical Character Recognition (OCR) techniques, often contain errors that significantly impair the ability to search within the collections. To address this issue, Språkbanken Text has, over the past decade, explored various post-correction methods. The most recent and effective method has now been incorporated into Strix, Språkbanken Text's research platform, allowing for more accurate processing of historical texts and better search results.