2018-11-29 ALTO Board Meeting Minutes


  1. Welcome new Board Members. [All]
  2. Find and tell a non-offensive, maybe self-deprecating joke before the meeting begins and/or after it ends. [All]
  3. Arabic OCR and Text Representation: experiences and challenges - we had a great discussion on this topic at the f2f meeting in Las Vegas and I think it would be useful for the larger group. [All]
  4. Spring/Summer F2F Meeting Opportunities - the survey for this would/will normally go out in January but it would be useful to get a heads up on conferences/gatherings that should be in the list. [All]
  5. Review schema issues and other updates. [Art/Clemens/Jo]
  6. Planning for special meeting on handwriting support in ALTO [All] - following up on the suggestion at the 2018-07-26 ALTO Board Meeting that we should invite Günter Mühlberger and his group at the University of Innsbruck to meet with the Board about their work on handwriting.
  7. Addition of two new Board Members/expansion of Board. [Frederick]
  8. Other business. [All]
  9. Next meeting date. [All]

Attending members

* - lost connection to call early in the meeting


wrt agenda item 3. Matt gave some background on the Open Islamicate Texts Initiative (OpenITI). It began a few years ago at a conference in Leipzig, as a result of discussions between Matt, Maxim Romanov, Sarah Bowen Savant and others on the need for a machine-actionable scholarly corpus of Islamicate texts. Arabic materials were assembled, and Persian texts, which are rarer, initially came from the Ganjoor poetry collection. The texts are published on GitHub in a Markdown format, and it has become clear that stronger OCR options are needed. Benjamin Kiessling, who had worked on OCR for German Fraktur and was at the initial meeting, assisted in a series of pilot projects using his open-source kraken system. The use of kraken has been very effective, and the tests have been published, with similar results reported from a recent project with JSTOR. OpenITI is interested in a highly retrainable and customizable OCR platform, and is working with the SHARIAsource project of Harvard Law School on the creation of a digital text pipeline tool called CorpusBuilder.

Frederick asked if there are issues that Matt has observed which might make the use of ALTO problematic for Arabic materials. Matt has none to flag at this point but may know more in a few months. Work will begin using the pipeline on manuscripts in January. Clemens noted that his group has a similar workflow pipeline project that Ben has also been involved in, and will send the details to Matt. Frederick asked that Ahmed be included in the email and noted that his understanding is that the Library of Alexandria has encountered challenges in utilizing ALTO. He believes that most OCR processing there involves the use of Sakhr.

Matt reported that his work with JSTOR has largely involved hOCR, even though ALTO is produced natively by kraken. Clemens pointed out that there are tools for round-tripping between hOCR and ALTO in the GitHub repository he maintains for OCR conversions. This repository includes PAGE conversion tools, and Clemens may take up the ALTO - PAGE issue, currently marked as high priority, in the context of the OCR-D project.

Art asked about the use of XSLT for OCR conversions. Clemens is not the biggest fan of using XSLT but recognizes its platform-independent advantage. Stefan noted that XSLT is most effective when there is a straight mapping of elements, but that it is sometimes better to use dedicated importers and exporters.
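
To illustrate what a straight mapping of elements can look like, here is a minimal Python sketch that turns one hOCR word span into an ALTO String element. It is an illustration only, not code from the repository Clemens maintains; the function name `hocr_word_to_alto_string` and the sample span are hypothetical, and it assumes the word carries a `bbox` property in its `title` attribute.

```python
import xml.etree.ElementTree as ET


def hocr_word_to_alto_string(span):
    """Map one hOCR word span to an ALTO String element (hypothetical helper)."""
    # hOCR keeps the bounding box in the title attribute as "bbox x0 y0 x1 y1",
    # optionally followed by other ";"-separated properties.
    props = span.get("title", "").split(";")
    bbox = next(p.split()[1:] for p in props if p.strip().startswith("bbox"))
    x0, y0, x1, y1 = (int(v) for v in bbox)
    # ALTO expresses the same box as a position plus width and height.
    return ET.Element("String", {
        "HPOS": str(x0),
        "VPOS": str(y0),
        "WIDTH": str(x1 - x0),
        "HEIGHT": str(y1 - y0),
        "CONTENT": "".join(span.itertext()).strip(),
    })


# Made-up sample input for illustration.
sample = '<span class="ocrx_word" title="bbox 10 20 110 50; x_wconf 93">kraken</span>'
alto = hocr_word_to_alto_string(ET.fromstring(sample))
print(ET.tostring(alto, encoding="unicode"))
```

A one-to-one mapping like this is the case where XSLT would also work well; layout hierarchy, reading order and metadata are where dedicated importers and exporters tend to be needed.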

wrt agenda item 4. The DATECH conference, to be held in Brussels, Belgium on May 8-10, 2019, may be the strongest candidate for the Spring/Summer F2F meeting. Several Board members will be there and it may provide an opportunity to meet with members of Günter Mühlberger’s group at the University of Innsbruck to talk about the use of ALTO for handwriting.

wrt agenda item 5. Some specific issues were identified for discussion by the Board:

wrt agenda item 6. As mentioned in item 4, Günter Mühlberger’s group is likely to be at DATECH and there is a consensus among the Board that exploring the use of ALTO for handwriting is worth pursuing. Clemens noted that Günter and his team are moving away from static OCR encoding to confidence matrices for recognized characters and words, an approach similar to Google’s lattice processing. With this approach, the single best hypothesis from an OCR engine may emerge only downstream, as confidence builds within the domain. Clemens and Stefan are both involved in the DATECH conference and will reach out to Günter to confirm availability.
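
A toy Python sketch of the difference between a static encoding and a confidence matrix follows. It is not the method used by Günter's group or by Google; the matrix values, the lexicon and the `bonus` weighting are all made-up assumptions, chosen only to show how a downstream step can overturn the engine's top choice when the alternatives are kept.

```python
from itertools import product
from math import prod

# Hypothetical confidence matrix: one dict of alternatives per character position.
matrix = [
    {"h": 0.9, "b": 0.1},
    {"o": 0.55, "a": 0.45},
    {"n": 1.0},
    {"d": 0.8, "a": 0.2},
]


def static_best(matrix):
    """Static encoding: keep only the top-scoring character at each position."""
    return "".join(max(alts, key=alts.get) for alts in matrix)


def rescored_best(matrix, lexicon, bonus=2.0):
    """Score every path through the matrix, boosting words found in a lexicon."""
    def score(path):
        word = "".join(ch for ch, _ in path)
        conf = prod(p for _, p in path)
        return conf * (bonus if word in lexicon else 1.0)

    best = max(product(*(alts.items() for alts in matrix)), key=score)
    return "".join(ch for ch, _ in best)


print(static_best(matrix))
print(rescored_best(matrix, {"hand", "band", "bond"}))
```

Enumerating every path is only feasible for a toy example of this size; the point is that the alternatives and their confidences must survive in the encoding for any later rescoring to be possible.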

wrt agenda item 7. Frederick outlined the strengths that the two recent candidates bring to the ALTO Board, including expertise in Chinese, and there was general agreement that the Board should be expanded to accommodate their membership. Frederick will send an email to Board members asking for an ACCEPT or REJECT to changing the membership criteria. Expanding a Board can make quorum problematic, but the ALTO Board largely avoids these kinds of governance issues by virtue of using GitHub and email for voting. Once the Board expansion is approved, a vote will go ahead on the two candidates.

wrt agenda item 9. The next meeting is tentatively scheduled for January 25, 2019 at 9am EST.