
2018-01-18 ALTO Board Meeting Minutes

Updated draft agenda for January 18, 2018 ALTO board teleconference. If you have changes or additions, please email them to everyone as soon as possible.

1. Find and tell a non-offensive, maybe self-deprecating joke before the meeting begins and/or after it ends. [All]
2. Choose a board chair. Each year on or before Dec 31, the board must choose a chair from amongst its members for the following year (see board rules). All (continuing) members are eligible. This is carried over from the last meeting because several board members did not attend. [Clemens leads discussion.]
3. Report on ALTO version 4.0, the ALTO XML schema README, and its CC-BY-SA 4.0 license. [Joachim, Evelien, Clemens, and Stefan lead discussion.]
4. Report on registration of MIME type for ALTO. See issue 40. [Jean-Philippe and Frederick lead discussion.]
5. Report on issue 32. [Art leads discussion.]
6. Report on ALTO reference examples; see issue 1. [Stefan leads discussion.]
7. Report (if needed) on integration of ALTO with the International Image Interoperability Framework (IIIF); see issue 45. [Jean-Philippe and Clemens lead discussion.]

View all open action items here.

Attending members

Joachim Bauer
Raju Buddharaju
Brian Geiger
Jukka Kervinen
Evelien Ket
Ralph Marschall
Jean-Philippe Moreux
Clemens Neudecker
Stefan Pletschacher
Ashok Popat
Art Rhyno
Nate Trail
Frederick Zarndt

wrt agenda item 2. Frederick will continue as Chair for 2018 and transition Art to the position for 2019.

wrt agenda item 3. Jo has done the final formatting of the schema for version 4 and showed the text of a draft announcement for the ALTO mailing list. Frederick noted that the Creative Commons (CC) license should be included. Jo added it to the announcement, and it will be added to the schema comments as well. It was agreed that there will be a public review for 2 weeks and that, if no concerns are raised, an official release will follow. Jo’s announcement points to the GitHub location for the schema.

Clemens asked about the README where schema changes are specified, and there was some discussion of the layout of the GitHub repository. In response to a question from Jo about the reference_samples folder, it was noted that the intent was to have a place for use cases where images might be provided but not ALTO files. Clemens will create a new folder called novel_use_cases for this. It was agreed that the GitHub site often lacks documentation on the purpose of the various folders, and Clemens volunteered to help sort this out.

Jo announced that he has received sample content in Chinese, with ALTO files, from the National Library of Medicine, with metadata generated by docWorks. It was agreed that the best location for this type of material is the reference_samples folder, and that it should be organized by schema version. Jo asked about cases involving large files, since the sample content resides in a 400 MB zipped archive. The ALTO GitHub site has limited space, but Frederick has an enterprise Google Drive account with unlimited storage that can be used for linking to the content. Stefan noted that it is important for developers to be able to acquire full samples easily, and there was some discussion about the value of providing METS as well as ALTO to show the structure of the content.

It was recognized that the discussion had drifted into agenda item 6, and Stefan asked about putting processes in place to validate samples uploaded to GitHub. Jo has provided some samples generated by the Oxygen XML editor and showed some of the parameters that Oxygen offers. For example, Oxygen can generate all elements, a minimum set, or be configured to produce a representative sample for specific elements. Jo will upload the settings file to GitHub.
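As an illustration of the kind of validation process Stefan asked about, a lightweight pre-check could be run on each sample before it is uploaded. The sketch below is only one possible approach, not something the board agreed on. It assumes that version 4 samples use the namespace http://www.loc.gov/standards/alto/ns-v4#, and the function name check_alto_sample is hypothetical. It checks only well-formedness, the root element, and the presence of a Layout element; full validation against the XSD would need a schema-aware validator.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI used by ALTO version 4 documents.
ALTO_V4_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def check_alto_sample(xml_text, expected_ns=ALTO_V4_NS):
    """Return a list of problems found in one sample; an empty list means it passed.

    This is a pre-check only: it does not validate against the ALTO XSD.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    # ElementTree writes namespaced tags as "{namespace}localname".
    if root.tag != f"{{{expected_ns}}}alto":
        return [f"unexpected root element: {root.tag}"]

    problems = []
    if root.find(f"{{{expected_ns}}}Layout") is None:
        problems.append("missing Layout element")
    return problems


if __name__ == "__main__":
    sample = f'<alto xmlns="{ALTO_V4_NS}"><Layout/></alto>'
    print(check_alto_sample(sample))
```

A check like this could be run over every file in a samples folder, with the schema-aware validation done separately in a tool such as Oxygen.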

Sample images are important as well, and Jo noted that the work on glyphs is an area where sample images would be particularly useful for a developer. Ashok asked how the glyph variants described in the ALTO 4.0 schema relate to image regions. Google’s internal representation uses a lattice structure that supports variants for a region of an image. Stefan noted that there had been a long discussion around variant support in ALTO, and that the current syntax reflects the capabilities of common OCR engines. There is room for the variants support to evolve.
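To make the relationship between glyph variants and image regions concrete, the following sketch reads a small hand-written fragment and lists, for each glyph, its region and the alternative readings with their confidences. It assumes the ALTO 4.0 element and attribute names Glyph, Variant, CONTENT, GC, VC, HPOS, VPOS, WIDTH and HEIGHT; the sample fragment and the function name glyph_candidates are made up for illustration and are not taken from the reference samples.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI used by ALTO version 4 documents.
NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hand-written illustrative fragment, not a real reference sample.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="e" HPOS="10" VPOS="20" WIDTH="8" HEIGHT="12">
      <Glyph CONTENT="e" GC="0.7" HPOS="10" VPOS="20" WIDTH="8" HEIGHT="12">
        <Variant CONTENT="c" VC="0.2"/>
        <Variant CONTENT="o" VC="0.1"/>
      </Glyph>
    </String>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def _to_float(value):
    """Convert an attribute value to float, keeping None for missing attributes."""
    return float(value) if value is not None else None


def glyph_candidates(xml_text):
    """Return [(region, [(text, confidence), ...]), ...] for every Glyph.

    region is (HPOS, VPOS, WIDTH, HEIGHT); the first candidate is the glyph's
    own CONTENT/GC, followed by its Variant children (CONTENT/VC).
    """
    root = ET.fromstring(xml_text)
    result = []
    for glyph in root.iterfind(".//a:Glyph", NS):
        region = tuple(
            _to_float(glyph.get(name)) for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        candidates = [(glyph.get("CONTENT"), _to_float(glyph.get("GC")))]
        for variant in glyph.iterfind("a:Variant", NS):
            candidates.append((variant.get("CONTENT"), _to_float(variant.get("VC"))))
        result.append((region, candidates))
    return result


if __name__ == "__main__":
    for region, candidates in glyph_candidates(SAMPLE):
        print(region, candidates)
```

In this syntax each set of variants is tied to a single glyph region, which is the limitation relative to a lattice that the discussion touched on.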

Jo pointed out that even the metrics provided by OCR engines vary, and that docWorks has to recalculate numbers from sources like the ABBYY SDK to come up with a common scale. Frederick reminded the group of issue 23, currently assigned to Nate and Ashok. The lattice structure used by Google is a computer representation that might not be workable for an XML syntax, but Ashok agreed to describe it in more detail at an upcoming meeting. He noted that Google has moved beyond isolated characters to the probabilities of certain groupings, “N” versus “RN”, for example, and exposes different segmentations using protocol buffers that can be serialized, allowing end users to apply domain-specific dictionaries.
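The general idea of a segmentation lattice can be sketched in a few lines. This is a toy model only: the minutes do not describe Google's actual representation, and the edge format, the probabilities, and the names EDGES, paths and best_reading are all invented for illustration. Each edge covers a span of the image between two cut positions and carries one reading with a probability, so competing segmentations of the same region (one wide glyph versus two narrow ones) coexist until a reader picks a path, optionally with the help of a domain-specific dictionary.

```python
# Toy lattice: (start_cut, end_cut, text, probability). All values are invented.
# The span between cuts 2 and 4 can be read as one glyph "m" or as "r" + "n".
EDGES = [
    (0, 2, "bu", 1.0),
    (2, 4, "m", 0.6),
    (2, 3, "r", 0.5),
    (3, 4, "n", 0.8),
]


def paths(edges, start, end):
    """Yield (text, probability) for every path from start to end.

    Assumes every edge moves forward (end_cut > start_cut), so recursion terminates.
    """
    if start == end:
        yield "", 1.0
        return
    for edge_start, edge_end, text, prob in edges:
        if edge_start == start:
            for rest_text, rest_prob in paths(edges, edge_end, end):
                yield text + rest_text, prob * rest_prob


def best_reading(edges, start, end, dictionary=None):
    """Pick the most probable reading, preferring dictionary words when any match."""
    candidates = list(paths(edges, start, end))
    if dictionary:
        in_dictionary = [c for c in candidates if c[0] in dictionary]
        # Fall back to all candidates if no reading is a dictionary word.
        candidates = in_dictionary or candidates
    return max(candidates, key=lambda c: c[1])[0]


if __name__ == "__main__":
    print(best_reading(EDGES, 0, 4))
    print(best_reading(EDGES, 0, 4, dictionary={"burn"}))
```

Without a dictionary the single-glyph reading wins on probability alone; with a dictionary that contains only the two-glyph word, the lower-probability segmentation is chosen instead.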

To finish up agenda item 6, it was noted that real-world examples from the European newspapers ground truth data set will be provided, including images and some of the extensive work done on named-entity recognition (NER). Frederick asked Raju about Chinese samples that are particularly challenging, for example, articles that end and then continue in a non-prescribed location in the text. Raju has some samples that might raise copyright concerns, but agreed to try to make this content available.

wrt agenda item 4. Frederick reported that the text required by IETF for the MIME registration and the changes coming out of the face-to-face meeting in Dresden have been put into the document, but that uploading the file to IETF has been thwarted by errors coming from the IETF website. It was suggested that using a different browser or submitting by curl might get past this. Jo and Jean-Philippe will take a look; the text is available in the shared Google folder. Frederick expects that the registration will be complete in the next couple of days.

wrt agenda item 5. Art described some of the difficulties in identifying baselines in written text, and supports the original proposal in the issue. Jo noted the need to address baselines as a result of the transcription tool that is widely used, particularly in German-speaking countries. The general support for handwriting in ALTO was discussed, and it was agreed that this should be part of a more philosophical discussion, perhaps at a face-to-face meeting. Frederick noted that supporting OCR in video is another case where the board must decide what ALTO should expand to cover.

Frederick will put together a poll of upcoming conference locations for the next face-to-face meeting. Ashok pointed out that the DAS conference is coming up in April in Vienna. The group agreed that the next virtual meeting for the ALTO Board might best take place in the middle of February.