Agenda
Attending members
Minutes
wrt agenda item 3. Review recent schema issues:
ALTO 4.0: adaptation of “Processing” substructure - The revision has been available since September. Art will do a final check with Clemens and then ask the Board to vote on GitHub.
Schema documentation about use of <SP> needs clarification - Based on the Board’s last discussion, there is support for adding a Content attribute to the SP element.
Add LANG and ROTATION attributes to Page element - Art will ask for a use case and link the issue to the discussion on shape-element usage.
wrt agenda item 4. Touch base on high priority schema issues:
ALTO - PAGE xml: Object mapping and possible transformation generation - Christian’s document is a good start at listing the element mapping between ALTO and PAGE. Stefan feels that this may no longer be a high priority. Ashok asked whether ALTO should aim to be the unified format for text representation, as an alternative to encodings like PAGE. Stefan noted that PAGE captures aspects of documents that ALTO does not, and that ALTO reflects its origins in the context of METS, where METS encodes the higher-level logical structure of a document.
Confidence value calculation (CC - WC - PC) - annotation extension - Jo pointed out that as the number of OCR engines that produce ALTO increases, so does the need for clarity on confidence values. Stefan asked if it would be possible to use identifiers to indicate which algorithm was in use, and noted that there might be great variation in how metrics for glyph and word accuracy are used. Ashok described how his team uses an error rate calculated at the character level from a development set, which is then aggregated to coarser granularities. Frederick suggested that the Board provide a standard that OCR providers can follow.
Reading Order (IMPACT) - As with ALTO-PAGE, this issue reflects a desire for a higher-level logical encoding. PAGE supports this functionality, and this issue again raises the question of whether there should be overlap between ALTO and representations like PAGE and METS.
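As an illustration of the confidence-value discussion above, the following minimal Python sketch shows three ways character-level confidences might be aggregated into a word-level value. The function name, the method names and the 0.0 to 1.0 scale are assumptions made for this example; they are not defined by ALTO and do not describe Ashok’s team’s method or any particular OCR engine.

```python
from math import prod


def word_confidence(char_confidences, method="mean"):
    """Aggregate per-character confidences (assumed 0.0-1.0) into one word-level value.

    Hypothetical helper for illustration only; not part of any ALTO tooling.
    """
    if not char_confidences:
        raise ValueError("need at least one character confidence")
    if method == "mean":
        # Arithmetic mean: every character contributes equally.
        return sum(char_confidences) / len(char_confidences)
    if method == "min":
        # Weakest link: the word is only as reliable as its worst character.
        return min(char_confidences)
    if method == "product":
        # Treats characters as independent; longer words score lower.
        return prod(char_confidences)
    raise ValueError(f"unknown method: {method}")


# Made-up confidences for a three-character word.
glyphs = [0.9, 0.8, 1.0]
for name in ("mean", "min", "product"):
    print(name, round(word_confidence(glyphs, name), 3))
```

The three methods give different values for the same input, which is why an identifier for the algorithm in use, as Stefan suggested, would help consumers compare values across engines.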
wrt agenda item 5. Language Discussion:
Hany described some of his experience with Arabic OCR. He did extensive work on font libraries at the Library of Alexandria, and identified variations in OCR results between bitonal images and color scans. At the Qatar National Library, assigning the proper font library has proved to be an important step in digitizing materials, and the workflows there can include machine learning and processes that use the first few pages of a book for initial testing and training. They have agreed to OCR 8,000 books from the University of New York’s Arabic collection. Hany noted that the SHARIAsource project, which has a kick-off meeting in the next week, may surface issues associated with encoding Arabic OCR in ALTO. Frederick would be interested in learning more about the University of New York project.
Cally provided information on NewspaperSG at the National Library of Singapore. NewspaperSG has been in operation for nearly a decade and covers over 200 titles across multiple languages. They have not yet identified an effective OCR solution for Tamil and Javanese language materials. Raju will send some options, noting that there are some OCR companies in India with this expertise. He also pointed out that some Indian languages, such as Urdu and Hindi, use letters from the Arabic alphabet.
wrt agenda item 6. Follow up on upcoming F2F meeting at DATeCH 2019:
Art will touch base with Clemens about inviting Günter Mühlberger to join the meeting.
The ALPS Schema may provide some ideas on how to represent lattice structures in XML. Ashok will check whether any of the internal representation of the lattice structure used by the Cloud Vision team can be shared. The lattice is stored internally in a protocol buffer and represents multiple hypotheses, which may have value for handwriting representation. Ashok will be able to step through it in person at the meeting at DATeCH.
It was agreed that a Schema issue will be created on the use of ALTO for handwriting in order to provide a starting point for discussion.
wrt agenda item 8. The next meeting is tentatively scheduled for March 22, 2019 at 9am EST.