OCR for Historical German Dialect Documents | Natural Language Processing Group

(The data for this project is in German!)

OCR (Optical Character Recognition), i.e. converting images to text, is often framed as a solved problem. For historical documents, however, transcribing handwriting is often still a challenge, especially where language modelling cannot be used to correct recognition errors, as is the case for texts written in dialect. The goal of this project is to develop an OCR model that converts high-quality scans of historical German dialect handwriting to text. High-quality scans and annotated training data are available from the Research Center Deutscher Sprachatlas at Marburg University.