Version 16-01-2014
ALTO Board
Copyright © 2014 ALTO Board. All rights reserved.
No part of this publication may be reproduced, stored in databases, or transferred in any form (electronically, photo-mechanically, chemically, manually, or otherwise) without the express written permission of the ALTO Board. The software described in this manual is licensed software that may be used only in compliance with the licensing terms and conditions. The ALTO Board reserves the right to make changes to the content of this manual without notice. The ALTO Board makes no guarantee regarding the accuracy of the information provided in this manual. Microsoft, MS-DOS, and Windows are registered trademarks of the Microsoft Corporation. Parts of the software use the Duden-Proof-Factory of the Brockhaus Duden Neue Medien GmbH for syllable separation.
Product or company names that are mentioned may be trademarks or registered trademarks of the respective company. The ALTO Board uses these names and trademarks in the following manual merely for explanatory purposes and for the benefit of the respective user, and such use does not imply trademark infringement. Under this software license, you are only permitted to reproduce materials that are not protected by copyright laws. This excludes only materials where you hold the copyright and/or legal permission to reproduce copyrighted materials. If you are uncertain about the copyright status of certain materials then please seek legal counsel. The ALTO Board holds no liability over copyright violations resulting from the use of this software.
Last updated: 15.01.2014
ALTO version 2.1 introduces a new mechanism: tags (or annotations). This mechanism addresses several change requests concerning the ALTO format.
In XML terms, the version 2.1 schema adds a new element group, "Tags/Tag". Content elements that refer to these tag elements through the TAGREFS attribute can carry additional information:
<Tags>
<NamedEntityTag ID="NE15" LABEL="Location" DESCRIPTION="Lexington"/>
…
</Tags>
<Layout>
…
<String CONTENT="Lexington" WC="1.0" TAGREFS="NE15" HPOS… VPOS…>
…
</Layout>
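The lookup from a content element to its tags can be sketched in Python. This is only an illustration under simplifying assumptions, not part of the ALTO specification: the sample document has no XML namespace, the positional attributes are left out, and the helper name `resolve_tagrefs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified sample (assumption): no namespace, HPOS/VPOS omitted.
SAMPLE = """<alto>
  <Tags>
    <NamedEntityTag ID="NE15" LABEL="Location" DESCRIPTION="Lexington"/>
  </Tags>
  <Layout>
    <String CONTENT="Lexington" WC="1.0" TAGREFS="NE15"/>
  </Layout>
</alto>"""


def resolve_tagrefs(root):
    """Pair each tagged String's CONTENT with the tag elements it references."""
    # Index every child of <Tags> by its ID attribute.
    tags = {tag.get("ID"): tag for tag in root.find("Tags")}
    resolved = []
    for string in root.iter("String"):
        # TAGREFS holds a whitespace-separated list of tag IDs.
        refs = string.get("TAGREFS", "").split()
        if refs:
            resolved.append((string.get("CONTENT"), [tags[r] for r in refs]))
    return resolved


result = resolve_tagrefs(ET.fromstring(SAMPLE))
for content, tag_list in result:
    for tag in tag_list:
        print(content, tag.tag, tag.get("LABEL"))
```

A real ALTO file declares a namespace, so the element names passed to `find` and `iter` would need to be namespace-qualified.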
For readability, IDs have been omitted in the following examples.
Before the actual character recognition starts, OCR software analyzes the page structure to identify regions. Modern OCR software supports more region types than ALTO can record: tables, graphics, music scores, etc. Some of these regions can be categorized so that the OCR software applies a specific algorithm or process when recognizing their content; they can also be identified by manual labelling.
ALTO tag element: <LayoutTag>
Tag attributes:
ID
LABEL
TYPE (optional)
DESCRIPTION (optional)
URI (optional)
The intent of layout tagging is to identify types of content. Sometimes, specific types of content can also have a functional value (sidebars, separators, etc.) and the question of using a layout tag or a structural tag may be raised.
<ComposedBlock>, <TextBlock>, <TextLine> or <String> elements.
<LayoutTag LABEL="Sidebar"
DESCRIPTION="Textual box"/>
<LayoutTag LABEL="Table"
DESCRIPTION="Table of numbers "/>
<LayoutTag TYPE="Formula"
LABEL="MathFormula"
DESCRIPTION="4 maths formulas"/>
On <TextLine> elements
On <String> elements
<LayoutTag TYPE="Formula"
LABEL="ChemFormula"
DESCRIPTION="Ka formula"/>
<LayoutTag TYPE="Formula"
LABEL="PhyFormula"
DESCRIPTION="e and v formulas"/>
<LayoutTag
LABEL="TextStamped"
DESCRIPTION="PAR"/>
<LayoutTag TYPE="Typesetting"
LABEL="Handwriting"
DESCRIPTION="43g8"/>
<LayoutTag TYPE="Typesetting"
LABEL="Handwriting"
DESCRIPTION="N Roret"/>
<GraphicalElement>
<LayoutTag TYPE="Typesetting"
LABEL="Manuscript"
DESCRIPTION="Old French manuscript"/>
<LayoutTag TYPE="Typesetting"
LABEL="ScriptFonts"
DESCRIPTION="Script font content"/>
<LayoutTag TYPE="Typesetting"
LABEL="NonLatinFont"
DESCRIPTION="Chinese ideograms"/>
<LayoutTag LABEL="Masterhead"
DESCRIPTION="L’AURORE – avril 1942"/>
<LayoutTag LABEL="Advertisements"
DESCRIPTION="AVIS"/>
<LayoutTag LABEL="SmallAds"
DESCRIPTION="Petites annonces"/>
<ComposedBlock> or <Illustration> elements
<LayoutTag LABEL="Map"/>
<LayoutTag LABEL="Engraving"/>
<LayoutTag LABEL="Graphic"/>
<LayoutTag LABEL="Chart"/>
<LayoutTag LABEL="Linedrawing"/>
<LayoutTag LABEL="Photo"/>
<LayoutTag TYPE="Formula"
LABEL="ChemFormula"
DESCRIPTION="Aspartame formula"/>
<LayoutTag LABEL="MusicalScore"
DESCRIPTION="Maurice Ravel"/>
<ComposedBlock>, <GraphicalElement> elements
<LayoutTag LABEL="TransitionSep"/>
<LayoutTag LABEL="FootnoteSep"/>
<LayoutTag LABEL="Stamp"
DESCRIPTION="Dépôt légal Vosges 1891"/>
<LayoutTag LABEL="DropCap"
DESCRIPTION="A"/>
<LayoutTag LABEL="Illegible"
DESCRIPTION="scan quality problem"/>
On <TextBlock> or <TextLine> elements.
<LayoutTag LABEL="Illegible"
DESCRIPTION="illegible word, may be ‘Scribe’"/>
On <String> element.
<LayoutTag LABEL="Noise"
DESCRIPTION="scan noise"/>
<LayoutTag LABEL="Unknown"
DESCRIPTION="text?"/>
The ALTO format captures the layout and the full text of a page. One purpose of OCR is full-text retrieval, which can benefit from labelling additional elements such as page headers (running titles), page numbers, signature marks, etc. These structural elements differ from intellectual entities, which record the intellectual structure of a document. (The intellectual structure is not recorded in ALTO, but in a container format such as METS.)
If these elements are labelled, a full-text search can either be restricted to them or exclude them. For example, a running title that appears on every page could distort the ranking of a resource or deliver invalid hits. Other labelled information, such as the page number, could be used for quality assurance, document navigation, etc.
This structural tagging can be implemented in a mass digitisation workflow to extract structural or functional features (TOC entries, headings, headers, footnotes, article headings, etc.) or as an editing and correction facility for the improvement of already digitised books. Apart from METS/ALTO, enhanced PDFs and ebooks can also be generated if some structural information is available.
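As an illustration of excluding labelled elements from a search index, the following Python sketch drops every word that belongs to a running title or a page number. It is a minimal sketch under stated assumptions: the sample has no XML namespace, the body text "Tobacco prices" is invented for the example, and `searchable_words` and `EXCLUDED_LABELS` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Simplified sample (assumption): no namespace, invented body text.
SAMPLE = """<alto>
  <Tags>
    <StructureTag ID="ST1" TYPE="Functional" LABEL="RunningTitle"/>
    <StructureTag ID="ST2" TYPE="Functional" LABEL="PageNumber" DESCRIPTION="937"/>
  </Tags>
  <Layout>
    <TextLine TAGREFS="ST1">
      <String CONTENT="THE"/>
      <String CONTENT="WINCHESTER"/>
      <String CONTENT="NEWS"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Tobacco"/>
      <String CONTENT="prices"/>
    </TextLine>
    <TextLine>
      <String CONTENT="937" TAGREFS="ST2"/>
    </TextLine>
  </Layout>
</alto>"""

# Labels whose content should not be indexed (hypothetical choice).
EXCLUDED_LABELS = {"RunningTitle", "PageNumber"}


def searchable_words(root, excluded=EXCLUDED_LABELS):
    """Return the words that are not covered by an excluded structure tag."""
    labels = {tag.get("ID"): tag.get("LABEL") for tag in root.find("Tags")}

    def is_excluded(element):
        refs = element.get("TAGREFS", "").split()
        return any(labels.get(ref) in excluded for ref in refs)

    words = []
    for line in root.iter("TextLine"):
        if is_excluded(line):  # the whole line is tagged, skip all its words
            continue
        for string in line.iter("String"):
            if not is_excluded(string):
                words.append(string.get("CONTENT"))
    return words


words = searchable_words(ET.fromstring(SAMPLE))
print(words)
```

The sketch only checks tags on `<TextLine>` and `<String>`; tags placed on enclosing blocks would need the same test one level up.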
ALTO tag element: <StructureTag>
Tag attributes:
ID
LABEL
TYPE (optional)
DESCRIPTION (optional)
XmlData (optional)
URI (optional)
Structural tagging in ALTO can be considered awkward (compared to the classic and recommended METS/ALTO solution), as ALTO does not provide for storing any semantic information. But it could be useful:
<ComposedBlock>, <TextBlock>, <TextLine> or <String> elements
<StructureTag TYPE="Functional"
LABEL="Cover"/>
<StructureTag TYPE="Functional"
LABEL="cover">
<XmlData>
<Scheme>epub</Scheme>
</XmlData>
</StructureTag>
Description of the vocabulary used.
<StructureTag TYPE="Functional"
LABEL="TitlePage"/>
<StructureTag TYPE="Functional"
LABEL="Catalog"/>
…
<StructureTag TYPE="Functional"
LABEL="CopyrightPage"/>
<StructureTag TYPE="Functional"
LABEL="Foreword"/>
<StructureTag TYPE="Functional"
LABEL="Notice"/>
…
<StructureTag TYPE="Functional"
LABEL="TOC"/>
Global tables of content
Internal tables of content (chapter beginnings, etc.)
<StructureTag TYPE="Structural"
LABEL="BodyMatter"/>
Beginning of the main content
<StructureTag TYPE="Functional"
LABEL="LOI"/>
<StructureTag TYPE="Functional"
LABEL="Appendix"/>
<StructureTag TYPE="Functional"
LABEL="LOT"/>
<StructureTag TYPE="Functional"
LABEL="Conclusion"/>
<StructureTag TYPE="Functional"
LABEL="Glossary"/>
<StructureTag TYPE="Functional"
LABEL="Bibliography"/>
…
<StructureTag TYPE="Functional"
LABEL="Index"/>
<StructureTag TYPE="Functional"
LABEL="RunningTitle"/>
<StructureTag TYPE="Structural"
LABEL="Part"
DESCRIPTION="I"/>
<StructureTag TYPE="Structural"
LABEL="Chapter"
DESCRIPTION="I"/>
<StructureTag TYPE="Structural"
LABEL="FullTitle"/>
<StructureTag TYPE="Structural"
LABEL="Title1"/>
<StructureTag TYPE="Structural"
LABEL="FullTitle"/>
<StructureTag TYPE="Structural"
LABEL="Title1"
DESCRIPTION="I"/>
<StructureTag TYPE="Structural"
LABEL="Title2"
DESCRIPTION="I"/>
<StructureTag TYPE="Structural"
LABEL="Title2"
DESCRIPTION="II"/>
<StructureTag TYPE="Functional"
LABEL="FootnoteReference"
DESCRIPTION="1"/>
<StructureTag TYPE="Functional"
LABEL="Footnote"
DESCRIPTION="1"/>
<StructureTag TYPE="Reference"
LABEL="ReferenceToFootnote"
DESCRIPTION="1"/>
<StructureTag TYPE="Functional"
LABEL="Marginalia"/>
<StructureTag TYPE="Functional"
LABEL="FigureCaption"/>
<StructureTag TYPE="Functional"
LABEL="FigureReference"
DESCRIPTION="9"/>
<StructureTag TYPE="Reference"
LABEL="ReferenceToFigure"
DESCRIPTION="9"/>
<StructureTag TYPE="Functional"
LABEL="TableCaption"/>
<StructureTag TYPE="Functional"
LABEL="TableReference"
DESCRIPTION="1"/>
<StructureTag TYPE="Reference"
LABEL="ReferenceToTable"
DESCRIPTION="1"/>
<StructureTag TYPE="Functional"
LABEL="PageNumber"
DESCRIPTION="937"/>
<StructureTag TYPE="Reference"
LABEL="ReferenceToPage"
DESCRIPTION="8"/>
<StructureTag TYPE="Functional"
LABEL="UL"/>
<StructureTag TYPE="Functional"
LABEL="OL"/>
<StructureTag TYPE="Functional"
LABEL="Masterhead"/>
<StructureTag TYPE="Functional"
LABEL="FullTitle"
DESCRIPTION="THE WINCHESTER NEWS"/>
<StructureTag TYPE="Functional"
LABEL="Date"
DESCRIPTION="October 31, 1910"/>
<StructureTag TYPE="Structural"
LABEL="SectionHeading"
DESCRIPTION="SOCIETY"/>
<StructureTag TYPE="Structural"
LABEL="SectionHeading"
DESCRIPTION="Advertisements"/>
<StructureTag TYPE="Structural"
LABEL="Heading"
DESCRIPTION="Advertisement"/>
<StructureTag TYPE="Structural"
LABEL="SectionHeading"
DESCRIPTION="SmallAdds"/>
<StructureTag TYPE="Structural"
LABEL="ArticleTitle"/>
<StructureTag TYPE="Structural"
LABEL="ArticleSubTitle"/>
or
<StructureTag TYPE="Structural"
LABEL="Heading"/>
<StructureTag TYPE="Structural"
LABEL="SubHeading"/>
or
<StructureTag TYPE="Structural"
LABEL="ArticleTitle1"/>
<StructureTag TYPE="Structural"
LABEL="ArticleTitle2"/>
<StructureTag TYPE="Functional"
LABEL="Place"
DESCRIPTION="CLAY CITY, Ky"/>
<StructureTag TYPE="Functional"
LABEL="Date"
DESCRIPTION="Oct. 31."/>
<StructureTag TYPE="Functional"
LABEL="ArticleAuthor"
DESCRIPTION="Léon Lafage"/>
<RoleTag>
(see section 5.).
<StructureTag TYPE="Functional"
LABEL="Illustration"
DESCRIPTION="John W. Langley">
<XmlData>
<!-- bibliographic metadata of the photograph -->
</XmlData>
</StructureTag>
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic textual elements into predefined categories (names of persons, organizations, locations, expressions of time, quantities, etc.).
Representing the result of NER processing in ALTO has several advantages:
ALTO tag element: <NamedEntityTag>
Tag attributes:
ID
LABEL
TYPE (optional)
DESCRIPTION (optional)
URI (optional)
<ComposedBlock>, <TextBlock>, <TextLine> or <String> elements
<!-- Simple tagging -->
<NamedEntityTag LABEL="Location"
DESCRIPTION="Lexington"/>
<!-- URI Authority -->
<NamedEntityTag LABEL="Location"
DESCRIPTION="Lexington"
URI="http://www.geonames.org/4941935"/>
<!-- Extra attributes -->
<NamedEntityTag LABEL="Location"
DESCRIPTION="Kentucky">
<XmlData>
<NEC>1.0</NEC> <!-- NE confidence value -->
<Variants>Kenekuke Kentaki Kentákii
Hahoodzo Kèntòki Kentórk Kentuki
Kentukia Kentukio Kentukis Kentukki
Kentukki Kéntukki Shitati Khén-thap-kî
</Variants>
</XmlData>
</NamedEntityTag>
<!-- Multiple Authorities -->
<NamedEntityTag ID="NE15a"
LABEL="Location"
DESCRIPTION="Louisville"
URI="http://www.geonames.org/4299276"/>
<NamedEntityTag ID="NE15b"
LABEL="Location"
DESCRIPTION="Louisville"
URI="mygeonames:louisville"/>
…
<String CONTENT="Louisville" WC="1.0"
TAGREFS="NE15a NE15b"/>
<!-- Simple tagging -->
<NamedEntityTag LABEL="Person"
DESCRIPTION="Dr. Reynolds"/>
<!-- URI Authority -->
<NamedEntityTag LABEL="Person"
DESCRIPTION="Mr. Norris"
URI="http://catalogue.bnf.fr/servlet/autorite?ID=11916925"/>
<!-- embedded authority description -->
<NamedEntityTag LABEL="Person"
DESCRIPTION="James M Bigstaff"><XmlData>
<MADS xmlns="http://www.loc.gov/mads/">
<authority>
<name type="personal" authority="naf">
<namePart>Bigstaff, James M</namePart>
<namePart type="date">1835-1882</namePart>
</name>
</authority>
</MADS></XmlData></NamedEntityTag>
<!-- Simple tagging -->
<NamedEntityTag ID="O20"
LABEL="Organization"
DESCRIPTION="Central Kentucky Tobacco Warehouse Company"/>
<!-- Nested tags -->
<NamedEntityTag ID="L11"
LABEL="Location"
DESCRIPTION="Kentucky"/>
ALTO content:
…
<String CONTENT="Central"
TAGREFS="O20"/>
</TextLine>
<TextLine>
<String CONTENT="Kentucky"
TAGREFS="O20 L11"/>
<String CONTENT="Tobacco"
TAGREFS="O20"/>
<String CONTENT="Warehouse"
TAGREFS="O20"/>
<String CONTENT="Company"
TAGREFS="O20"/>
…
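Because a multi-word entity is spread over several `<String>` elements, possibly on different lines, a consumer has to reassemble it from the shared tag ID. The Python sketch below shows one way to do this. It assumes a namespace-free sample and that each tag ID marks a single occurrence of an entity; the function name `words_by_tag` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified sample (assumption): no namespace, positional attributes omitted.
SAMPLE = """<alto>
  <Tags>
    <NamedEntityTag ID="O20" LABEL="Organization"
                    DESCRIPTION="Central Kentucky Tobacco Warehouse Company"/>
    <NamedEntityTag ID="L11" LABEL="Location" DESCRIPTION="Kentucky"/>
  </Tags>
  <Layout>
    <TextLine>
      <String CONTENT="Central" TAGREFS="O20"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Kentucky" TAGREFS="O20 L11"/>
      <String CONTENT="Tobacco" TAGREFS="O20"/>
      <String CONTENT="Warehouse" TAGREFS="O20"/>
      <String CONTENT="Company" TAGREFS="O20"/>
    </TextLine>
  </Layout>
</alto>"""


def words_by_tag(root):
    """Join, in document order, the words that reference each tag ID."""
    grouped = defaultdict(list)
    for string in root.iter("String"):
        # A nested entity lists several IDs in TAGREFS, so the word is
        # added to every tag it references.
        for ref in string.get("TAGREFS", "").split():
            grouped[ref].append(string.get("CONTENT"))
    return {ref: " ".join(words) for ref, words in grouped.items()}


entities = words_by_tag(ET.fromstring(SAMPLE))
for tag_id, text in entities.items():
    print(tag_id, text)
```

If the same tag ID were reused for several occurrences of an entity, the words would have to be split into runs of adjacent strings first.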
Role tagging makes it possible to describe the people involved in creating the content: author, editor, etc. This can be done with an open vocabulary or a dedicated vocabulary, such as the MARC Code List for Relators.
ALTO tag element: <RoleTag>
Tag attributes:
ID
LABEL
TYPE (optional)
DESCRIPTION (optional)
URI (optional)
A specific name tagged with a role tag can also be identified with a named-entity tag.
<ComposedBlock>, <TextBlock>, <TextLine> or <String> elements
<RoleTag
LABEL="Author"
DESCRIPTION="book’s author"/>
<RoleTag
LABEL="Illustrator"
DESCRIPTION="drawings creator"/>
<RoleTag
LABEL="Publisher"/>
<!-- Description of the vocabulary used -->
<RoleTag LABEL="ill"
DESCRIPTION="drawings creator">
<XmlData>
<Scheme>marc:relators</Scheme>
</XmlData></RoleTag>
<RoleTag LABEL="aui"
DESCRIPTION="author of the introduction">
<XmlData>
<Scheme>marc:relators</Scheme>
</XmlData></RoleTag>
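A consumer of role tags needs to know which vocabulary a LABEL comes from. The Python sketch below reads the `<Scheme>` element used in the examples above and falls back to a placeholder when it is absent. The placeholder value `"open"`, the function name `role_vocabularies`, the tag IDs, and the namespace-free sample are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified sample (assumption): no namespace, hypothetical IDs R1 and R2.
SAMPLE = """<Tags>
  <RoleTag ID="R1" LABEL="Author" DESCRIPTION="book's author"/>
  <RoleTag ID="R2" LABEL="ill" DESCRIPTION="drawings creator">
    <XmlData><Scheme>marc:relators</Scheme></XmlData>
  </RoleTag>
</Tags>"""


def role_vocabularies(tags_root, default="open"):
    """Map each RoleTag ID to a (scheme, label) pair."""
    result = {}
    for role in tags_root.iter("RoleTag"):
        # findtext returns the default when XmlData/Scheme is missing.
        scheme = role.findtext("XmlData/Scheme", default=default)
        result[role.get("ID")] = (scheme.strip(), role.get("LABEL"))
    return result


roles = role_vocabularies(ET.fromstring(SAMPLE))
for tag_id, (scheme, label) in roles.items():
    print(tag_id, scheme, label)
```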
Any other kind of tag can be described.
ALTO tag element: <OtherTag>
Tag attributes:
ID
LABEL
TYPE (optional)
DESCRIPTION (optional)
URI (optional)