ALTO tag use cases

Version 16-01-2014

ALTO Board

Introduction

ALTO version 2.1 introduces a new mechanism: tags (or annotations). This mechanism addresses several change requests concerning the ALTO format:

Layout labelling
ALTO format should record various region types (tables, graphics, maths formulas, music scores, etc.)
Logical Labelling Of Structural Elements
Structural tagging of documents offers a large number of benefits. Full text search can be done in a much more focused way, reprinting of digitised books could be made easier, etc.
Named Entity markup
Post-OCR processing on page images often includes named-entity recognition attempts.

XML schema

This mechanism was introduced in the ALTO version 2.1 schema. In XML terms, a new element group "Tags/Tag" holds the tag definitions; content elements reference these tag elements and thereby carry additional information:

		

<Tags>
<NamedEntityTag ID="NE15" LABEL="Location" DESCRIPTION="Lexington"/>
…
</Tags>
<Layout>
…
<String CONTENT="Lexington" WC="1.0" TAGREFS="NE15" HPOS… VPOS…>
…
</Layout>
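
To make the referencing mechanism concrete, here is a minimal Python sketch of how a consumer might resolve TAGREFS values back to their tag definitions. It is only an illustration: the sample document is hypothetical, the ALTO namespace and the position attributes are omitted for brevity, and the helper name `resolve_tagrefs` is not part of any ALTO tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free ALTO fragment (layout attributes omitted).
SAMPLE = """
<alto>
  <Tags>
    <NamedEntityTag ID="NE15" LABEL="Location" DESCRIPTION="Lexington"/>
  </Tags>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Lexington" WC="1.0" TAGREFS="NE15"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def resolve_tagrefs(root):
    """Yield (element, [tag elements]) for every element carrying TAGREFS."""
    tags_parent = root.find("Tags")
    tags_by_id = {}
    if tags_parent is not None:
        tags_by_id = {tag.get("ID"): tag for tag in tags_parent}
    for elem in root.iter():
        refs = elem.get("TAGREFS")
        if refs:
            # TAGREFS is a whitespace-separated list of tag IDs.
            yield elem, [tags_by_id[r] for r in refs.split() if r in tags_by_id]


root = ET.fromstring(SAMPLE)
resolved = [
    (elem.get("CONTENT"), [(tag.tag, tag.get("LABEL")) for tag in tags])
    for elem, tags in resolve_tagrefs(root)
]
print(resolved)
```

Unknown IDs are silently skipped in this sketch; a stricter consumer might prefer to report them.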
		
	

Good practices, limitations

Use cases

For readability, IDs have been omitted in the following examples.

Layout tagging

Before the actual character recognition starts, OCR software analyzes the page structure to identify regions. Modern OCR software supports various region types that ALTO could record: tables, graphics, music scores, etc. Some of these regions can be categorized so that the OCR software applies a specific algorithm or process when recognizing their characteristics; they can also be identified by manual labelling.

XML schema

ALTO tag element: <LayoutTag>

Tag attributes:

ID
Tag id
LABEL
Region type
TYPE (optional)
Supertype or category of the region type
DESCRIPTION (optional)
Description of the region, description of the region content, or the content itself
XmlData (optional)
Any XML encoded metadata helping to describe the tag, or the vocabulary used to describe the tag
URI (optional)
Any authority URI relevant for the tag
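
As an illustration of the attribute list above, the following Python sketch checks that each tag carries the attributes not marked as optional. Treating ID and LABEL as mandatory is an assumption drawn from that list; the function name and the sample tags are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: attributes not marked "(optional)" in the list above are required.
REQUIRED = ("ID", "LABEL")


def missing_attributes(tags_xml):
    """Return (element name, ID, missing attribute names) for incomplete tags."""
    root = ET.fromstring(tags_xml)
    problems = []
    for tag in root:
        missing = [name for name in REQUIRED if not tag.get(name)]
        if missing:
            problems.append((tag.tag, tag.get("ID"), missing))
    return problems


# Hypothetical tag definitions; the second one lacks a LABEL.
SAMPLE_TAGS = """
<Tags>
  <LayoutTag ID="LT1" LABEL="Table" DESCRIPTION="Table of numbers"/>
  <LayoutTag ID="LT2" TYPE="Formula" DESCRIPTION="4 maths formulas"/>
</Tags>
"""

problems = missing_attributes(SAMPLE_TAGS)
print(problems)
```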

Good practices, limitations

The intent of layout tagging is to identify types of content. Sometimes, specific types of content can also have a functional value (sidebars, separators, etc.) and the question of using a layout tag or a structural tag may be raised.

Text tags, to be applied on <ComposedBlock>, <TextBlock>, <TextLine> or <String> elements.

Frames, boxes, sidebars

<LayoutTag LABEL="Sidebar" 
DESCRIPTION="Textual box"/>
Tables

<LayoutTag LABEL="Table" 
DESCRIPTION="Table of numbers "/>
Mathematical formulas

<LayoutTag TYPE="Formula" 
LABEL="MathFormula"
DESCRIPTION="4 maths formulas"/>

On <TextLine> elements

On <String> elements

Chemical formulas (text)

<LayoutTag TYPE="Formula"
LABEL="ChemFormula"
DESCRIPTION="Ka formula"/>
Physics formulas

<LayoutTag TYPE="Formula"
LABEL="PhyFormula"
DESCRIPTION="e and v formulas"/>
Text content under a stamp

<LayoutTag 
LABEL="TextStamped"
DESCRIPTION="PAR"/>
See also “Stamps” in the Graphical Element section
Handwriting contents, annotations, initials, etc.

<LayoutTag TYPE="Typesetting"
LABEL="Handwriting"
DESCRIPTION="43g8"/>

<LayoutTag TYPE="Typesetting"
LABEL="Handwriting"
DESCRIPTION="N Roret"/>
Handwriting could also be described with a <GraphicalElement>
Manuscripts

<LayoutTag TYPE="Typesetting"
LABEL="Manuscript"
DESCRIPTION="Old French manuscript"/>
Script fonts

<LayoutTag TYPE="Typesetting"
LABEL="ScriptFonts"
DESCRIPTION="Script font content"/>
			
Non-latin fonts

<LayoutTag TYPE="Typesetting"
LABEL="NonLatinFont"
DESCRIPTION="Chinese ideograms"/>
Masterhead

<LayoutTag LABEL="Masterhead"
DESCRIPTION="L’AURORE – avril 1942"/>
The masthead could also be described with a structural tag (see section 3).
Advertisements, announcements, public relation pages

<LayoutTag LABEL="Advertisements"
DESCRIPTION="AVIS"/>
Small ads, obituaries, weather forecast, train tables, etc.

<LayoutTag LABEL="SmallAds"
DESCRIPTION="Petites annonces"/>

Graphical tags, to be applied on <ComposedBlock> or <Illustration> elements

Graphics, charts or graphs, line drawings, maps, photographs, engravings, etc.

<LayoutTag LABEL="Map"/>

<LayoutTag LABEL="Engraving"/>

<LayoutTag LABEL="Graphic"/>
<LayoutTag LABEL="Chart"/>
<LayoutTag LABEL="Linedrawing"/>
<LayoutTag LABEL="Photo"/>
Chemical formulas (graphic)

<LayoutTag TYPE="Formula"
LABEL="ChemFormula"
DESCRIPTION="Aspartame formula"/>
Musical notations

<LayoutTag LABEL="MusicalScore"
DESCRIPTION="Maurice Ravel"/>

Other graphical tags, to be applied on <ComposedBlock> or <GraphicalElement> elements

Ornaments, tail pieces
No tag needed.
Horizontal separator lines between paragraphs

<LayoutTag LABEL="TransitionSep"/>
Horizontal separator lines between text block and footnotes

<LayoutTag LABEL="FootnoteSep"/>
Stamps

<LayoutTag LABEL="Stamp"
DESCRIPTION="Dépôt légal Vosges 1891"/>
Dropped initials

<LayoutTag LABEL="DropCap"
DESCRIPTION="A"/>

Misc, to be applied on any elements

Illegible contents

<LayoutTag LABEL="Illegible"
DESCRIPTION="scan quality problem"/>

On <TextBlock> or <TextLine> elements.


<LayoutTag LABEL="Illegible"
DESCRIPTION="illegible word, maybe ‘Scribe’"/>

On <String> element.

Noise

<LayoutTag LABEL="Noise"
DESCRIPTION="scan noise"/>
Noise regions: no real data, only artifacts on the document or scanner noise.
Unknown content

<LayoutTag LABEL="Unknown"
DESCRIPTION="text?"/>
To be used if the region type cannot be ascertained.
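
Once regions carry layout tags, a consumer can select them by LABEL or TYPE. The Python sketch below shows one way this might look; the sample document, the block IDs and the helper `regions_with_layout_tag` are all hypothetical, and the ALTO namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free ALTO fragment with three tagged regions.
SAMPLE = """
<alto>
  <Tags>
    <LayoutTag ID="LT1" TYPE="Formula" LABEL="MathFormula"/>
    <LayoutTag ID="LT2" TYPE="Formula" LABEL="ChemFormula"/>
    <LayoutTag ID="LT3" LABEL="Noise"/>
  </Tags>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="B1" TAGREFS="LT1"/>
        <Illustration ID="B2" TAGREFS="LT2"/>
        <TextBlock ID="B3" TAGREFS="LT3"/>
        <TextBlock ID="B4"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def regions_with_layout_tag(root, *, label=None, type_=None):
    """Return IDs of layout elements whose TAGREFS point at a matching LayoutTag."""
    wanted = {
        tag.get("ID")
        for tag in root.iter("LayoutTag")
        if (label is None or tag.get("LABEL") == label)
        and (type_ is None or tag.get("TYPE") == type_)
    }
    layout = root.find("Layout")
    return [
        elem.get("ID")
        for elem in layout.iter()
        if wanted & set(elem.get("TAGREFS", "").split())
    ]


root = ET.fromstring(SAMPLE)
formula_ids = regions_with_layout_tag(root, type_="Formula")
noise_ids = regions_with_layout_tag(root, label="Noise")
print(formula_ids, noise_ids)
```

Filtering on TYPE retrieves every formula region regardless of its specific LABEL, which is the practical benefit of the optional supertype.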

Structural tagging

The ALTO format captures the layout and the full text of a page. One purpose of OCR is full text retrieval, but full text retrieval may benefit from labelling additional elements such as page headers (running titles), page numbers, signature marks, etc. These structural elements differ from the intellectual entities that record the intellectual structure of a document. (The intellectual structure is not recorded in ALTO, but in a container format such as METS.)

If these elements are labelled, a full text search could either be restricted to these elements, or exclude them. A running title that appears on every page could, for example, skew the ranking of a resource or deliver invalid hits. Other labelled information such as the page number could be used for quality assurance, document navigation, etc.
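
The exclusion scenario can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the sample page is hypothetical, the ALTO namespace is omitted, and the sketch assumes that a tag referenced on a block also applies to every String inside that block.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free ALTO page with a running title and a page number.
SAMPLE = """
<alto>
  <Tags>
    <StructureTag ID="ST1" TYPE="Functional" LABEL="RunningTitle"/>
    <StructureTag ID="ST2" TYPE="Functional" LABEL="PageNumber" DESCRIPTION="937"/>
  </Tags>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock TAGREFS="ST1">
          <TextLine>
            <String CONTENT="THE"/><String CONTENT="WINCHESTER"/><String CONTENT="NEWS"/>
          </TextLine>
        </TextBlock>
        <TextBlock>
          <TextLine>
            <String CONTENT="Tobacco"/><String CONTENT="prices"/><String CONTENT="rise"/>
          </TextLine>
        </TextBlock>
        <TextBlock>
          <TextLine><String CONTENT="937" TAGREFS="ST2"/></TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def indexable_text(root, excluded_labels):
    """Join String contents, skipping anything under an excluded structural tag."""
    excluded_ids = {
        tag.get("ID")
        for tag in root.iter("StructureTag")
        if tag.get("LABEL") in excluded_labels
    }
    words = []

    def walk(elem, skip):
        # Assumption: a tag set on a block is inherited by its descendants.
        skip = skip or bool(excluded_ids & set(elem.get("TAGREFS", "").split()))
        if elem.tag == "String" and not skip:
            words.append(elem.get("CONTENT", ""))
        for child in elem:
            walk(child, skip)

    walk(root.find("Layout"), False)
    return " ".join(words)


root = ET.fromstring(SAMPLE)
body_text = indexable_text(root, {"RunningTitle", "PageNumber"})
full_text = indexable_text(root, set())
print(body_text)
```

Restricting a search to the labelled elements instead would only require inverting the `skip` test.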

This structural tagging can be implemented in a mass digitisation workflow to extract structural or functional features (TOC entries, headings, headers, footnotes, article headings, etc.) or as an editing and correction facility for the improvement of already digitised books. Apart from METS/ALTO, enhanced PDFs and ebooks can also be generated if some structural information is available.

XML schema

ALTO tag element: <StructureTag>

Tag attributes:

ID
Tag id
LABEL
Structural tag
TYPE (optional)
Supertype or category of the tag
DESCRIPTION (optional)
Description of the tag, description of the content, or the content itself
XmlData (optional)
Any XML encoded metadata helping to describe the tag or the vocabulary used to describe the tag.
URI (optional)
Any authority URI relevant for the tag element

Good practices, limitations

Structural tagging in ALTO can be considered awkward (compared to the classic and recommended METS/ALTO solution), as ALTO does not foresee the storing of any semantic information. But it could be useful:

Text tags, to be applied on <ComposedBlock>, <TextBlock>, <TextLine> or <String> elements

Covers

<StructureTag TYPE="Functional"
LABEL="Cover"/>



<StructureTag TYPE="Functional"
LABEL="cover">
<XmlData>
<Scheme>epub</Scheme>
</XmlData>
</StructureTag>

Description of the vocabulary used.

Title pages

<StructureTag TYPE="Functional"
LABEL="TitlePage"/>
Frontmatter (foreword, prologue, notice to the reader, warning, advertisements, copyright page, publisher catalogs, etc.)

<StructureTag TYPE="Functional"
LABEL="Catalog"/>

…
<StructureTag TYPE="Functional"
LABEL="CopyrightPage"/>
<StructureTag TYPE="Functional"
LABEL="Foreword"/>
<StructureTag TYPE="Functional"
LABEL="Notice"/>
…
Tables of content

<StructureTag TYPE="Functional"
LABEL="TOC"/>

Global tables of content


Internal tables of content (chapter beginnings, etc.)

Body matter

<StructureTag TYPE="Structural"
LABEL="BodyMatter"/>

Beginning of the main content

Backmatter (afterword, appendix, illustrations list, tables list, conclusion, glossary, bibliography, colophon, etc.)

<StructureTag TYPE="Functional"
LABEL="LOI"/>

<StructureTag TYPE="Functional"
LABEL="Appendix"/>
<StructureTag TYPE="Functional"
LABEL="LOT"/>
<StructureTag TYPE="Functional"
LABEL="Conclusion"/>
<StructureTag TYPE="Functional"
LABEL="Glossary"/>
<StructureTag TYPE="Functional"
LABEL="Bibliography"/>
…
Index

<StructureTag TYPE="Functional"
LABEL="Index"/>
Running titles

<StructureTag TYPE="Functional"
LABEL="RunningTitle"/>
Volumes, parts, chapters

<StructureTag TYPE="Structural"
LABEL="Part"
DESCRIPTION="I"/>

<StructureTag TYPE="Structural"
LABEL="Chapter"
DESCRIPTION="I"/>
Titles, subtitles

<StructureTag TYPE="Structural"
LABEL="FullTitle"/>

<StructureTag TYPE="Structural"
LABEL="Title1"/>

<StructureTag TYPE="Structural"
LABEL="FullTitle"/> 

<StructureTag TYPE="Structural"
LABEL="Title1"
DESCRIPTION="I"/>

<StructureTag TYPE="Structural"
LABEL="Title2"
DESCRIPTION="I"/>

<StructureTag TYPE="Structural"
LABEL="Title2"
DESCRIPTION="II"/>
Footnote references

<StructureTag TYPE="Functional"
LABEL="FootnoteReference"
DESCRIPTION="1"/>
Footnotes

<StructureTag TYPE="Functional"
LABEL="Footnote"
DESCRIPTION="1"/>
References to footnote

<StructureTag TYPE="Reference"
LABEL="ReferenceToFootnote"
DESCRIPTION="1"/>
Marginalia

<StructureTag TYPE="Functional"
LABEL="Marginalia"/>
Figure captions

<StructureTag TYPE="Functional"
LABEL="FigureCaption"/>
Figure references

<StructureTag TYPE="Functional"
LABEL="FigureReference"
DESCRIPTION="9"/>
References to figure

<StructureTag TYPE="Reference"
LABEL="ReferenceToFigure"
DESCRIPTION="9"/>
Table captions

<StructureTag TYPE="Functional"
LABEL="TableCaption"/>
Table references

<StructureTag TYPE="Functional"
LABEL="TableReference"
DESCRIPTION="1"/>
References to table

<StructureTag TYPE="Reference"
LABEL="ReferenceToTable"
DESCRIPTION="1"/>
Page numbers

<StructureTag TYPE="Functional"
LABEL="PageNumber"
DESCRIPTION="937"/>
Reference to page

<StructureTag TYPE="Reference"
LABEL="ReferenceToPage"
DESCRIPTION="8"/>
Bulleted lists, numbered lists

<StructureTag TYPE="Functional"
LABEL="UL"/>

<StructureTag TYPE="Functional"
LABEL="OL"/>
Masterhead, imprint

<StructureTag TYPE="Functional"
LABEL="Masterhead"/>
Newspaper title

<StructureTag TYPE="Functional"
LABEL="FullTitle"
DESCRIPTION="THE WINCHESTER NEWS"/>
Issue number, date

<StructureTag TYPE="Functional"
LABEL="Date"
DESCRIPTION="October 31, 1910"/>
Section headings, rubrics, column titles, etc.

<StructureTag TYPE="Structural"
LABEL="SectionHeading"
DESCRIPTION="SOCIETY"/>
Advertisements, announcements, public relation pages

<StructureTag TYPE="Structural"
LABEL="SectionHeading"
DESCRIPTION="Advertisements"/>
Individual content item of advertisements, announcements, public relation pages

<StructureTag TYPE="Structural"
LABEL="Heading"
DESCRIPTION="Advertisement"/>
Small ads, obituaries, weather forecast, train tables, etc.

<StructureTag TYPE="Structural"
LABEL="SectionHeading"
DESCRIPTION="SmallAds"/>
Article titles, subtitles

<StructureTag TYPE="Structural"
LABEL="ArticleTitle"/>
<StructureTag TYPE="Structural"
LABEL="ArticleSubTitle"/>

or


<StructureTag TYPE="Structural"
LABEL="Heading"/>
<StructureTag TYPE="Structural"
LABEL="SubHeading"/>

or


<StructureTag TYPE="Structural"
LABEL="ArticleTitle1"/>
<StructureTag TYPE="Structural"
LABEL="ArticleTitle2"/>
Location, spatial information, place name

<StructureTag TYPE="Functional"
LABEL="Place"
DESCRIPTION="CLAY CITY, Ky"/>
Coverage note about a content
Date, dateline

<StructureTag TYPE="Functional"
LABEL="Date"
DESCRIPTION="Oct. 31."/>
Coverage note about a content
Article authors, copyright note, etc.

<StructureTag TYPE="Functional"
LABEL="ArticleAuthor"
DESCRIPTION="Léon Lafage"/>
Authors could also be tagged with a <RoleTag> (see section 5.).
Illustration

<StructureTag TYPE="Functional"
LABEL="Illustration"
DESCRIPTION="John W. Langley">
<XmlData>
<!-- bibliographic metadata of the photograph -->
</XmlData>
</StructureTag>
If the intellectual value of an illustration is high, it can be captured with a structural tag.

Named Entities tagging

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic textual elements into predefined categories (names of persons, organizations, locations, expressions of time, quantities, etc.).

Representing the results of NER processing in ALTO has several advantages:

Simplicity
no need to use another format for NE storage, no need to make links between this format and the ALTO block coordinates; no need to use a METS-ALTO approach.
Autonomy
contents and NE are stored in the same file.

XML schema

ALTO tag element: <NamedEntityTag>

Tag attributes:

ID
Tag id
LABEL
Named entity category (Person, Organization, Location, etc.)
TYPE (optional)
Supertype or category of the named entity type
DESCRIPTION (optional)
Named entity textual content
XmlData (optional)
Any XML encoded metadata helping to describe the tag or the vocabulary used to describe the tag
URI (optional)
Any authority URI relevant for the tag element: NE authority repositories, gazetteers, etc.

Good practices, limitations

Text tags, to be applied on <ComposedBlock>, <TextBlock>, <TextLine> or <String> elements

Locations

<!-- Simple tagging -->
<NamedEntityTag LABEL="Location"
DESCRIPTION="Lexington"/>

<!-- URI authority -->
<NamedEntityTag LABEL="Location"
DESCRIPTION="Lexington"
URI="http://www.geonames.org/4941935"/>

<!-- Extra attributes -->
<NamedEntityTag LABEL="Location"
DESCRIPTION="Kentucky">
<XmlData>
<NEC>1.0</NEC> <!-- NE confidence value -->
<Variants>Kenekuke Kentaki Kentákii
Hahoodzo Kèntòki Kentórk Kentuki
Kentukia Kentukio Kentukis Kentukki
Kentukki Kéntukki Shitati Khén-thap-kî
</Variants>
</XmlData>
</NamedEntityTag>

<!-- Multiple authorities -->
<NamedEntityTag ID="NE15a"
LABEL="Location"
DESCRIPTION="Louisville"
URI="http://www.geonames.org/4299276"/>
<NamedEntityTag ID="NE15b"
LABEL="Location"
DESCRIPTION="Louisville"
URI="mygeonames:louisville"/>
…
<String CONTENT="Louisville" WC="1.0"
TAGREFS="NE15a NE15b"/>
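
When one string references several tags that each point to a different authority, a consumer can gather all the URIs for that string. A small Python sketch, assuming a namespace-free document and hypothetical tag IDs; the helper name `authority_uris` is illustrative only:

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment: one string, two authorities.
SAMPLE = """
<alto>
  <Tags>
    <NamedEntityTag ID="NE15a" LABEL="Location" DESCRIPTION="Louisville"
                    URI="http://www.geonames.org/4299276"/>
    <NamedEntityTag ID="NE15b" LABEL="Location" DESCRIPTION="Louisville"
                    URI="mygeonames:louisville"/>
  </Tags>
  <Layout>
    <Page><PrintSpace><TextBlock><TextLine>
      <String CONTENT="Louisville" WC="1.0" TAGREFS="NE15a NE15b"/>
    </TextLine></TextBlock></PrintSpace></Page>
  </Layout>
</alto>
"""


def authority_uris(root):
    """Map each tagged String's CONTENT to the URIs of the tags it references."""
    uri_by_id = {
        tag.get("ID"): tag.get("URI")
        for tag in root.iter("NamedEntityTag")
        if tag.get("URI")
    }
    result = {}
    for string in root.iter("String"):
        refs = string.get("TAGREFS", "").split()
        uris = [uri_by_id[r] for r in refs if r in uri_by_id]
        if uris:
            result[string.get("CONTENT")] = uris
    return result


uris = authority_uris(ET.fromstring(SAMPLE))
print(uris)
```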
Persons

<!-- Simple tagging -->
<NamedEntityTag LABEL="Person"
DESCRIPTION="Dr. Reynolds"/>

<!-- URI authority -->
<NamedEntityTag LABEL="Person"
DESCRIPTION="Mr. Norris"
URI="http://catalogue.bnf.fr/servlet/autorite?ID=11916925"/>

<!-- embedded authority description -->
<NamedEntityTag LABEL="Person"
DESCRIPTION="James M Bigstaff">
<XmlData>
<MADS xmlns="http://www.loc.gov/mads/">
<authority>
<name type="personal" authority="naf">
<namePart>Bigstaff, James M</namePart>
<namePart type="date">1835-1882</namePart>
</name>
</authority>
</MADS>
</XmlData>
</NamedEntityTag>
Organizations

<!-- Simple tagging -->
<NamedEntityTag ID="O20"
LABEL="Organization"
DESCRIPTION="Central Kentucky Tobacco
Warehouse Company"/>

<!-- Nested tags -->
<NamedEntityTag ID="L11"
LABEL="Location"
DESCRIPTION="Kentucky"/>

ALTO content:
…
<String CONTENT="Central" 
TAGREFS="O20"/>
</TextLine>
<TextLine>
<String CONTENT="Kentucky" 
TAGREFS="O20 L11"/>
<String CONTENT="Tobacco" 
TAGREFS="O20"/>
<String CONTENT="Warehouse" 
TAGREFS="O20"/>
<String CONTENT="Company" 
TAGREFS="O20"/>
…
Nested tags are implemented with multiple tag refs.
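
The nested example above can be read back by grouping strings on the tag IDs they reference. The Python sketch below is a minimal illustration: the sample is a hypothetical, namespace-free version of the example, and it assumes that each tag ID denotes a single entity occurrence, so every String referencing that ID belongs to the same entity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free version of the nested example.
SAMPLE = """
<alto>
  <Tags>
    <NamedEntityTag ID="O20" LABEL="Organization"/>
    <NamedEntityTag ID="L11" LABEL="Location"/>
  </Tags>
  <Layout>
    <Page><PrintSpace><TextBlock>
      <TextLine>
        <String CONTENT="the"/>
        <String CONTENT="Central" TAGREFS="O20"/>
      </TextLine>
      <TextLine>
        <String CONTENT="Kentucky" TAGREFS="O20 L11"/>
        <String CONTENT="Tobacco" TAGREFS="O20"/>
        <String CONTENT="Warehouse" TAGREFS="O20"/>
        <String CONTENT="Company" TAGREFS="O20"/>
      </TextLine>
    </TextBlock></PrintSpace></Page>
  </Layout>
</alto>
"""


def entity_texts(root):
    """Return {tag ID: (LABEL, joined text)} for every referenced entity tag."""
    labels = {tag.get("ID"): tag.get("LABEL") for tag in root.iter("NamedEntityTag")}
    words = {}
    for string in root.iter("String"):  # document order, across text lines
        for ref in string.get("TAGREFS", "").split():
            if ref in labels:
                words.setdefault(ref, []).append(string.get("CONTENT"))
    return {ref: (labels[ref], " ".join(parts)) for ref, parts in words.items()}


entities = entity_texts(ET.fromstring(SAMPLE))
print(entities)
```

The string "Kentucky" contributes to both entities, which is how the nesting of the location inside the organization name is recovered.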

Role tagging

Role tagging makes it possible to describe the people involved in the creation of the content: author, editor, etc. This can be done with an open vocabulary or a dedicated vocabulary, like the MARC Code List for Relators.

XML schema

ALTO tag element: <RoleTag>

Tag attributes:

ID
Tag id
LABEL
Tag type or category
TYPE (optional)
Supertype or category of the type
DESCRIPTION (optional)
Description of the tag, description of the tag content, or the content itself
XmlData (optional)
Any XML encoded metadata helping to describe the tag or the vocabulary used to describe the tag.
URI (optional)
Any authority URI relevant for the tag element.

Good practices, limitations

A specific name tagged with a Role tag can also be identified with an NE tag.

Text tags, to be applied on <ComposedBlock>, <TextBlock>, <TextLine> or <String> elements

Title pages

<RoleTag
LABEL="Author"
DESCRIPTION="book’s author"/>

<RoleTag
LABEL="Illustrator"
DESCRIPTION="drawings creator"/>

<RoleTag
LABEL="Publisher"/>

<!-- Description of the vocabulary used -->
<RoleTag LABEL="ill"
DESCRIPTION="drawings creator">
<XmlData>
<Scheme>marc:relators</Scheme>
</XmlData>
</RoleTag>
Foreword, etc.

<RoleTag LABEL="aui"
DESCRIPTION="author of the introduction">
<XmlData>
<Scheme>marc:relators</Scheme>
</XmlData></RoleTag> 
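
When a role tag declares its vocabulary, a consumer can expand the code in LABEL into a readable name. The Python sketch below assumes the `<Scheme>` convention used in the examples above; the lookup table is a small hand-copied excerpt that should be checked against the published MARC Code List for Relators, and the helper `role_name` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-copied excerpt; verify against the published MARC Code List for Relators.
MARC_RELATORS = {
    "aut": "Author",
    "aui": "Author of introduction, etc.",
    "ill": "Illustrator",
    "pbl": "Publisher",
}

# Hypothetical tag definitions: one coded label, one open-vocabulary label.
SAMPLE_TAGS = """
<Tags>
  <RoleTag ID="R1" LABEL="ill" DESCRIPTION="drawings creator">
    <XmlData><Scheme>marc:relators</Scheme></XmlData>
  </RoleTag>
  <RoleTag ID="R2" LABEL="Publisher"/>
</Tags>
"""


def role_name(role_tag):
    """Expand a MARC relator code; fall back to the raw LABEL otherwise."""
    label = role_tag.get("LABEL")
    scheme = role_tag.findtext("XmlData/Scheme")
    if scheme == "marc:relators":
        return MARC_RELATORS.get(label, label)
    return label


names = [role_name(tag) for tag in ET.fromstring(SAMPLE_TAGS).iter("RoleTag")]
print(names)
```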

Other tags

Any other kind of tags can be described.

XML schema

ALTO tag element: <OtherTag>

Tag attributes:

ID
Tag id
LABEL
Tag type or category
TYPE (optional)
Supertype or category of the type
DESCRIPTION (optional)
Description of the tag, description of the tag content, or the content itself
XmlData (optional)
Any XML encoded metadata helping to describe the tag or the vocabulary used to describe the tag.
URI (optional)
Any authority URI relevant for the tag element.

Reference documents