View all open action items here.
The Board welcomed Matt, who joined the meeting virtually via Google Hangouts. The Board discussed the state of Arabic OCR software. Frederick noted the use of Sakhr OCR and RDI among institutions with large collections of Arabic materials, and the high rate of individual character-level corrections their output requires. Matt described the Open Islamicate Texts Initiative (OpenITI) and his positive experiences with the open-source Kraken OCR engine. OpenITI is a large multi-institutional effort to construct the first machine-actionable scholarly corpus of premodern Islamicate texts.
On mapping Arabic in ALTO, Art asked about the similarity between tashkil marks in Arabic text and ruby symbols in Asian languages. Matt will upload some sample Arabic materials to the ALTO GitHub site.
In response to a question from Matt, Ashok clarified the relationship between Tesseract and the OCR support in the Cloud Vision API. Google supports Tesseract development, and the API team assisted with the recent neural network support in Tesseract. Ashok described the current workflow of the Cloud Vision API, which now includes handwriting recognition support.
Matt had another commitment and had to step out of the meeting at the two-hour point. Ashok suggested a contact for Asian language expertise, and Art will follow up. There was some discussion of the OCR correction proposal in IIIF and how it may be an important development in furthering the use of ALTO in digitization projects. The meeting wrapped up after 2.5 hours. Thanks to all involved.