TEI Encoding Guidelines

About This Document

This document describes the markup practices for keyboarding/encoding vendors to follow when producing TEI documents for the University of Virginia Library.

I. Introduction

The DTD

The University of Virginia Library (U.Va. Library) uses a customization of TEI version P4.

Note: This document only describes encoding practices specific to U.Va. Library. It is not a tutorial on TEI encoding.

Obtaining the DTD: We recommend using URLs for validating XML documents produced for us, but if you prefer to use local copies of our DTD-related files rather than URLs, download copies of the files as follows:

Invoking the DTD: The main DTD driver file is tei2.dtd. The required U.Va.-specific modification files are uva-dl-tei.ent and uva-dl-tei.dtd. Together they constitute a customization of TEI P4. To invoke the DTD, refer to the main DTD driver file in the DOCTYPE declaration, and include ENTITY declarations for the modification files in the internal subset:

<!DOCTYPE TEI.2 SYSTEM "http://text.lib.virginia.edu/dtd/tei/tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "http://text.lib.virginia.edu/dtd/tei/uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "http://text.lib.virginia.edu/dtd/tei/uva-dl-tei/uva-dl-tei.dtd">
]>

or, if you are using local copies of our DTD-related files:

<!DOCTYPE TEI.2 SYSTEM "tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "uva-dl-tei/uva-dl-tei.dtd">
]>

As the preceding example declaration shows, the U.Va. Library modification files and the unmodified TEI P4 files can reside in different directories.

uva-dl-tei vs. TEI Lite: Encoders familiar with TEI Lite (or with unmodified TEI P4) will notice that our DTD (uva-dl-tei) is generally more rigorous or strict than TEI Lite. For example:

  • The content models for essential structural elements (text, group, front, body, and back) are significantly simplified.
  • All content must be encoded within a <div1> or lower-level division. No content is allowed directly within <front>, <body>, or <back>. The <div> and <div0> elements are not used.
  • uva-dl-tei disallows most of the logical elements normally used to encode the semantic meaning of highlighted words and phrases (such as emph, distinct, soCalled, term, mentioned). Instead, changes in typeface must be encoded with the appropriate physical element (using convenience elements <i> for italic, <b> for bold, etc.). The exception to this practice is foreign words or phrases, which should be encoded using the usual <foreign> element.
  • Many elements in uva-dl-tei have required attributes, where in TEI Lite the same attributes are optional (for example, the type attribute on <div#> elements).
  • Many attributes in uva-dl-tei have enumerated vocabularies of allowed values, where in TEI Lite the same attributes can take any value (again, the type attribute on <div#> elements is a good example).
  • In TEI Lite, the global rend attribute can contain any value. In uva-dl-tei, the global rend attribute is limited to indicating display (“block” or “inline”), alignment (“center”, “left”, or “right”), and indentation (“indent”, “indent2” etc., or “hang”), not typographic changes. For example, where TEI Lite allows <p rend="italic">…</p> (or rend="italics" or rend="ital" or rend="i"), uva-dl-tei requires <p><i>…</i></p>.
  • For elements that require additional values for the rend attribute, such as <hi>, the possible values are still limited to an enumerated list. If none of the allowed values is applicable, use the value “other” and use the other attribute to supply the needed value, as in <hi rend="other" other="…">.
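
As a rough sketch of the last two points (the sample wording is invented, and “blackletter” is a hypothetical value, chosen on the assumption that it is not in the enumerated list for <hi>), alignment goes in rend while a change of typeface goes in an element:

<p rend="center"><i>Printed for the Author</i></p>
<p>The word <hi rend="other" other="blackletter">Whereas</hi> opens the deed.</p>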

uva-dl-tei also adds several convenience elements not available in TEI Lite:

  • <i>, <b>, etc. for italics, bold, and other typographic changes
  • <cb> for column breaks, and <cols> for marking abrupt changes in columnar layout
  • <quotedLetter> for block quotations requiring an <opener> and/or <closer>
  • <ps> for postscripts in letters

General Guidelines

XML: All documents should be encoded in the XML expression of TEI (not SGML).

The transcription: The electronic text should contain an exact character-for-character transcription of the text of the print source.

Including all content: With very few exceptions (see Exceptions immediately below), all content from the print source must be included in the electronic text. All textual data must be included in the transcription, and all non-textual data must be included in the markup as figures.

Exceptions: The only exceptions to the “include all content” rule are:

  • Running page headers: Exclude the running headers that often appear at the top of each page in a printed book. (These headers are typically very repetitive and only contain content already available elsewhere in the electronic transcription, such as the title of the book or the title of the current chapter.)

    Note: In rare cases, running page headers will contain unique content (such as a summary of the content of the current page). In such cases U.Va. Library will require that the running headers be included in the electronic text. See Page Breaks for information on encoding running headers.

  • Gaps: Gaps in the transcription are necessary in some cases. See Omitted Words or Phrases.

Providing notes: While transcribing and encoding even the most typical, straightforward materials, some problems or uncertainties are likely to arise. Any unusual circumstances encountered in the transcription or encoding process should be noted in a document that gets delivered to U.Va. Library along with the encoded XML files themselves. We prefer to receive one such “Notes” document for each electronic text (rather than one document for an entire encoding project). The document should include information on these kinds of situations:

  • oddities, errors, or unusual features in the print source, and how they have been handled in the electronic text
  • non-obvious markup decisions, such as decisions on how to encode ambiguous or problematic parts of the printed text

II. Major Structure

Essential Structure

Follow the essential structural markup common to most TEI documents:

<TEI.2>
    <teiHeader>
        . . . [metadata section supplied by U.Va. Library to the keyboarding vendor]
    </teiHeader>
    <text>
        <front>
            . . . [front matter: title page, acknowledgments, dedication, table of contents, preface, etc.]
        </front>
        <body>
            . . . [main body of text]
        </body>
        <back>
            . . . [back matter: endnotes, index, appendices, etc.]
        </back>
    </text>
</TEI.2>

TEI header: The <teiHeader> element will be supplied by U.Va. Library to the keyboarding vendor.

Composite texts: In rare cases, U.Va. Library will request that a particular document be marked as a composite text, in which the usual <body> element is replaced with the <group> element, which then contains multiple <text> elements, each with its own <front>, <body>, and <back>. (This can occur with anthologies or collected works, where each work has its own front and/or back matter.)

<TEI.2>
    <teiHeader>
        . . . [metadata section supplied by U.Va. Library to the keyboarding vendor]
    </teiHeader>
    <text>
        <front> . . . [front matter for the collection] </front>
        <group>
            <text>
                <front> . . . [front matter of first text] </front>
                <body> . . . [main body of first text] </body>
                <back> . . . [back matter of first text] </back>
            </text>
            <text>
                <front> . . . [front matter of second text] </front>
                <body> . . . [main body of second text] </body>
                <back> . . . [back matter of second text] </back>
            </text>
        </group>
        <back> . . . [back matter for the collection] </back>
    </text>
</TEI.2>

This kind of structure should be used only when specifically requested by U.Va. Library for a particular document.


Structural Divisions

Starting with <div1>: Although TEI Lite allows them, we do not use the <div> or <div0> elements. Instead, top-level structural divisions must be encoded as <div1>.

All content within a div: The <front>, <body>, and <back> elements must contain only <div1> elements. That is, all content must be enclosed by a <div1> or lower-level division, not placed directly within the <front>, <body>, or <back> elements. This holds true even for page breaks.

Determining division boundaries: When determining the start- and end-points of divisions and the nesting of lower-level divisions, refer to the printed table of contents for the work. Typically the table of contents is an accurate guide to the hierarchical structure of the work.

type attribute: Each <div#> tag must have a type attribute indicating the kind of division. If the division has no obvious type, the generic term “section” may be used; if “section” has already been used for a higher-level division, use “subsection”. Typical type values for divisions within <body> are:

  • act
  • article
  • book
  • chapter
  • editorial — for opinion pieces in newspapers
  • entry — for journal entries or encyclopedia/dictionary entries
  • essay
  • letter
  • part
  • plates — one or more full-page illustrations, often unnumbered or numbered independently of main pagination
  • poem
  • scene
  • section
  • speech — for a transcript of an oration, not for a piece of dialog in a dramatic work (for which use <sp>)
  • story
  • subsection
  • volume
  • work

Some type values are appropriate within <body> or <front>, depending on context:

  • abstract
  • prologue

Others are appropriate within <body> or <back>, depending on context:

  • epilogue
  • notes
  • summary

n attribute: If the division is numbered or otherwise labeled in the print source, record the number or label in the n attribute. Typically the number associated with a division is obvious from the division’s header. If the division does not have a number associated with it, do not include the n attribute.

[example of a straightforward div structure]
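
As an illustrative sketch only (the headings are invented, not taken from any source text), a numbered chapter containing two unnumbered sections might be encoded as follows. The n attribute appears only on the division that is numbered in print, and the page break sits inside the <div1> rather than directly within <body>:

<body>
    <div1 type="chapter" n="1">
        <pb n="1"/>
        <head>CHAPTER 1.</head>
        <p>. . .</p>
        <div2 type="section">
            <head>Early Years</head>
            <p>. . .</p>
        </div2>
        <div2 type="section">
            <head>Schooling</head>
            <p>. . .</p>
        </div2>
    </div1>
</body>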

Division headings: A division will usually (though not always) have a heading that announces it. The heading should, of course, be marked with <head>.

It is fairly common for a division to have more than one heading. In such cases, use multiple <head> elements, and include the type attribute to distinguish the headings and identify their roles. Use the same type values used for the <titlePart> element, namely “main”, “sub”, “desc” (for descriptive), and “alt” (for alternative).

When a division has only one <head>, it is necessarily the main one, so the type attribute is unnecessary and should not be used.

[example]
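
A minimal sketch of a division with two headings, using invented heading text; the type values are the ones listed above:

<div1 type="essay">
    <head type="main">THE JOURNEY NORTH</head>
    <head type="sub">A Narrative of Three Weeks</head>
    <p>. . .</p>
</div1>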

Half-titles, fly-titles, and divisional titles: A common feature in many books is a separate page containing the title of the work (or the title of a section of the work). There are three main types of such features:

  • A page preceding the title page and bearing the title of the work, perhaps with a series title or volume number, is a “half-title” and should be marked as <div1 type="half-title"> within <front>.
  • A page similar to a half-title page but occurring between the front matter and the body is a “fly-title” and should be marked as <div1 type="fly-title"> as the last division within <front> (not as the first division of the <body>).
  • A page, or just an initial heading preceding other headings, similar to a half-title but occurring within the body of the work, to announce the beginning of a major section, is a “divisional title”. In contrast to half-title and fly-title pages within the front matter, a divisional title should not be marked with its own <div#>. Instead, the divisional title should be incorporated into the <div#> that it precedes, as a <head> element.

[example showing a fly-title and a divisional title]
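
The following sketch, with invented titles, shows where the two features might sit. Recording the fly-title text in a <p> is an assumption (the DTD may call for a different element); the divisional title “PART 1.” becomes the <head> of the division it announces:

<front>
    . . . [earlier front matter]
    <div1 type="fly-title">
        <pb/>
        <p rend="center">A HISTORY OF THE COUNTY</p>
    </div1>
</front>
<body>
    <div1 type="part" n="1">
        <pb/>
        <head>PART 1.</head>
        <div2 type="chapter" n="1">
            <pb n="1"/>
            <head>CHAPTER 1.</head>
            <p>. . .</p>
        </div2>
    </div1>
</body>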


Informal divisions: It is fairly common (especially in poems, but also in prose works) to see informal divisions, indicated by a string of asterisks or periods, or by a horizontal line. Normally, such features do not indicate the beginning of a new <div#>. Instead, mark them as <ornament/> elements.

For strings of asterisks, periods, etc., set the type attribute to “characters”, and include the characters as the content of the <ornament> element. For example:

<ornament type="characters">* * * * * * * *</ornament>

For horizontal lines, set type to “line” and leave the content of the element empty:

 <ornament type="line"/> 

The same approach should be used for printer’s ornaments.

Quotations of verse within prose: In cases where a brief section of verse is quoted within a predominantly prose text, the verse should be marked as a block quotation (<q>), not as a distinct <div#>.
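
A hedged sketch of quoted verse inside a prose division, with invented text. It assumes that <q> may contain <lg> and may stand between paragraphs, as in unmodified TEI P4:

<p>The speaker closed with a couplet:</p>
<q>
    <lg type="group">
        <l>First line of the quoted verse,</l>
        <l>Second line of the quoted verse.</l>
    </lg>
</q>
<p>The audience then dispersed.</p>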


Front and Back Matter

Use <div1> to mark the main sections of the front matter and back matter. The exception to this rule is the main title page of the work, for which the <titlePage> element is used. Use the type attribute on the <div1> element to indicate the type of division. Typical values for front matter include:

  • acknowledgements
  • advertisement
  • book-plate
  • castlist — for list of characters preceding a dramatic work
  • chronology — for biographical or historical timelines
  • contents — for tables of contents and for lists of illustrations, etc.
  • dedication
  • errata — for lists of printing errors; also called corrigenda
  • fly-title — like a half-title page, but occurs between the front matter and the body; treat as last page of front matter (not first page of body)
  • foreword
  • frontispiece — technically, an illustration facing the title page; may also be used for any full-page illustration in the front matter, or for an illustration facing the first page of a major division within the body
  • half-title — a page preceding the title page bearing the title of the work, perhaps with a series title or volume number
  • introduction
  • masthead — a block of matter in a newspaper or other periodical indicating title of publication, address, list of editors or other contributors, etc.
  • preface

Typical type values for back matter include:

  • advertisement
  • appendix
  • bibliography
  • bio — for brief biographic sketches of authors or other contributors
  • colophon — a section at or near the end of a book, containing printing information such as name of printer (as distinct from publisher), typefaces used, etc.
  • glossary
  • index

Images for Book Spine, Covers, and Edges

Normally, the page images delivered to the keyboarding vendor by U.Va. Library will include images of the spine, front cover, back cover, top edge, bottom edge, and front edge. Each of these images should be marked as a <pb/> element immediately following the opening <front> tag. For example:

<front>
<pb n="Spine"/>
<pb n="Front Cover"/>
<pb n="Back Cover"/>
<pb n="Top Edge"/>
<pb n="Bottom Edge"/>
<pb n="Front Edge"/>

Title Page

Within <front>, the title page is, of course, marked up using the <titlePage> element rather than <div1>.


Title types: When using the <titlePart> element to mark the parts of the title, include the type attribute, assigning one of these values:

  • main — main title of the work
  • sub — subtitle
  • desc — descriptive title
  • alt — alternative title
  • volume — volume information

Include verso: The content on the verso (reverse side) of the title page should be included within the <titlePage> element.

Volume information: The volume information should go in a <titlePart type="volume">. This is true even if the volume information is separated from the title by the byline or other elements (<titlePart> is allowed outside <docTitle>).

. . . </docTitle>
<byline>By <docAuthor>BOOKER T. WASHINGTON</docAuthor></byline>
<titlePart type="volume">VOLUME I</titlePart>

[example: complete title page with verso]
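
As a sketch only, a title page with verso might look like the following. The title, author, and imprint details are invented or elided, and the treatment of the verso (a <pb/> followed by a second <docImprint>) is an assumption, since these guidelines do not say which element should hold verso content:

<titlePage>
    <pb/>
    <docTitle>
        <titlePart type="main">A HISTORY OF THE COUNTY</titlePart>
        <titlePart type="sub">WITH SKETCHES OF ITS EARLY SETTLERS</titlePart>
    </docTitle>
    <byline>By <docAuthor>A. N. AUTHOR</docAuthor></byline>
    <titlePart type="volume">VOLUME I</titlePart>
    <docImprint><pubPlace>. . .</pubPlace> <publisher>. . .</publisher> <docDate>. . .</docDate></docImprint>
    <pb/>
    <docImprint>Copyright . . .</docImprint>
</titlePage>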

[another example]


III. Genres

Letters

When encoding letters, prefaces, and other such personal writings, use the appropriate elements for the opening and closing sections:

  • opener — groups together dateline, salutation, etc. at the beginning of a letter or other division
  • closer — groups together salutation, signature, etc. at the end of a letter or other division

Openers and closers, in turn, contain one or more of these elements:

  • dateline — contains a brief description of the place, date, time, etc. at which the letter, preface, or similar piece was written
  • date — contains a date in any format. Use the value attribute to provide the date in standard yyyy-mm-dd format. If part of the date is unknown, omit that part.
    <date value="1901-08-04">4 Aug., 1901</date>
    <date value="1870-03">March 1870</date>
    
  • name — contains a proper name; use the type attribute to indicate “person” or “place”.
  • salute — salutation at the beginning (e.g. “Dear Sir”) or end (e.g. “Yours sincerely”) of a letter
  • signed — signature at the end of a letter, preface, etc.
  • ps — postscripts

[example: letter with opener and closer, including a postscript]
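
A minimal sketch of a letter, with invented names and wording. Placing <ps> inside <closer> follows the list above; whether <ps> holds text directly or requires a <p> is an assumption to check against the DTD:

<div1 type="letter">
    <opener>
        <dateline><name type="place">Charlottesville, Va.</name>, <date value="1901-08-04">4 Aug., 1901</date></dateline>
        <salute>Dear Sir:</salute>
    </opener>
    <p>. . .</p>
    <closer>
        <salute>Yours sincerely,</salute>
        <signed><name type="person">J. Smith</name></signed>
        <ps>P.S. . . .</ps>
    </closer>
</div1>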


Verse

Line groups: All verse (including poems without distinct stanzas, as well as verse quoted within a block of prose) should be encoded with the <lg> element. The type attribute is required; if the lines of verse have no obvious grouping (such as “stanza”), use type="group".

Indentation: If a line of verse is indented more than the surrounding lines, use <l rend="indent">…</l>. [example]

Line breaks: When encoding verse it is important to distinguish between logical lines of verse and the physical presentation of those lines on the printed page. In cases where a line of verse is too long to fit on the printed page, and for that reason is continued on a second line, use <l> to mark the logical line of verse and <lb/> to mark the physical line break. [example]
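
The verse conventions above can be sketched with an invented stanza. The second and fourth lines are indented in print, and the third line was too long for the page, so its physical break is recorded with <lb/> inside a single <l>:

<lg type="stanza">
    <l>A first line of verse that is short,</l>
    <l rend="indent">An indented second line,</l>
    <l>A third line of verse so long that the printer carried its <lb/>
    final words over to the next line,</l>
    <l rend="indent">And an indented fourth.</l>
</lg>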


Drama

Use the standard TEI elements for encoding dramatic works:

  • act and scene divisions should be marked with <div#> elements
  • speeches should be marked with <sp> elements, with speakers marked as <speaker>
  • stage directions should be marked with <stage>
  • castlists should be marked as <div# type="castlist">
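
A short sketch of these elements in combination, using invented dialogue. Marking the speech content with <p> assumes a prose speech; a verse speech would presumably use <l> or <lg> instead:

<div1 type="act" n="1">
    <head>ACT 1.</head>
    <div2 type="scene" n="1">
        <head>SCENE 1.</head>
        <stage>Enter two travellers.</stage>
        <sp>
            <speaker>FIRST TRAVELLER.</speaker>
            <p>How far is it to the inn?</p>
        </sp>
        <sp>
            <speaker>SECOND TRAVELLER.</speaker>
            <p>Not a mile further.</p>
        </sp>
    </div2>
</div1>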

Newspapers

[Newspaper markup example]

Files and file naming: Each issue (each day) of the newspaper should be encoded as a single <TEI.2> document contained in a single XML file. For the Cavalier Daily project, each file should be named “CavDaily_yyyymmdd.xml”. For example, the issue for March 11, 1969 would be named:

CavDaily_19690311.xml

In cases where the vendor creates page images from microfilm, the page images should be named with the same base filename as the issue to which the page images pertain, followed by a two-digit page sequence number. For example:

CavDaily_19690311_01.tif
CavDaily_19690311_02.tif
CavDaily_19690311_03.tif
CavDaily_19690311_04.tif

ENTITY declaration: When declaring the DTD, include an ENTITY declaration named NEWSPAPER with value “INCLUDE”. This enables the newspaper-specific features of the DTD, which are disabled by default.

<!DOCTYPE TEI.2 SYSTEM "tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "uva-dl-tei/uva-dl-tei.dtd">
<!ENTITY % NEWSPAPER "INCLUDE">
]>

TEI header: For the Cavalier Daily project, use the following TEI header template. This template can be inserted into the TEI file without any modifications or enhancements:

<teiHeader>
<fileDesc>
<titleStmt>
<title>The Cavalier Daily</title>
</titleStmt>
<publicationStmt>
<publisher>University of Virginia Library</publisher>
</publicationStmt>
<sourceDesc><bibl/></sourceDesc>
</fileDesc>
<profileDesc>
<langUsage>
<language id="eng">English</language>
</langUsage>
</profileDesc>
</teiHeader>

Page layout: The columnar layout of a newspaper page is typically very complex, with multiple levels of column breaks. The column breaks should not be recorded. Instead, all page breaks must be recorded, and the articles and other content on that page should be transcribed in the order in which they appear on the page, moving primarily from top to bottom and secondarily from left to right.

In other words, transcribe all the pieces along the top of the page first, moving left to right; then the pieces in the various mid-sections (there will usually be more than one level) of the page, moving left to right; and finally the pieces along the bottom of the page, moving left to right.

[Example of page layout sequence]

Associating related sections: Because newspaper articles are often broken up and printed in two (or more) sections on different pages, it is necessary to associate the first section of the article with the subsequent section(s). Use the id, next and prev attributes on <div#> to achieve this.

<!-- start tag for first part of article -->
<div1 type="article" id="a1.2" next="a3.1">

<!-- start tag for second part of article -->
<div1 type="article" id="a3.1" prev="a1.2">

When assigning identifiers for the id attribute, use this scheme:

  • Begin the ID with “a” (for “article”)
  • Provide the page number, followed by a dot
  • Assign a sequence number to the article indicating whether it is the first article to be encoded for that page, the second article, the third, etc.

For example, the first article (or partial article) on the first page should be assigned id="a1.1". On the third page, the fourth article (or partial article) to be encoded for that page (moving from left to right and from top to bottom, as instructed above) should be assigned id="a3.4". Please assign an ID to all <div1 type="article"> elements, even if the article is not split into two non-contiguous sections. (Other division types do not require an ID.)

When an article is broken into two or more sections on different pages, each section of the article must be enclosed in its own <div1> element. This is necessary because other articles or partial articles will occur between the two sections.

The “jump” line — the phrase indicating where to look for the continuation of the article — should be encoded in a <ref> element, with a target attribute containing the ID of the div1 containing the article continuation.

<ab type="ref"><ref target="a3.1">(see Players, p. 3)</ref></ab>
</div1>
<!-- Intervening content... -->
<div1 type="article" id="a3.1" prev="a1.2">
<pb n="3"/>
<head type="main">Players Present Grievances Tonight</head>
<ab type="ref" rend="center"><ref target="a1.2">(continued from p.1)</ref></ab>

Photographs: When encoding photographs, illustrations, or other graphic elements using <figure>, the <figure> should be placed in its own <div1> if the photograph is not contained within a particular article. (See example markup, photo on page 1 with caption “Bill Gibson Encourages His Players…”.) If the photograph appears within the text of a particular article (usually contained within a single column of text), then the <figure> should be encoded directly within the <div#> in which it occurs. (See markup example, photo on page 3 with caption “Coach Bill Gibson”.)

The credit or byline accompanying a photograph or other graphic should be encoded with a <byline> element within <figure>, for example:

<figure>
<byline>Joe Smith</byline>
<head>THE ROTUNDA AT DAWN</head>
<p>Students converse on steps as sun rises.</p>
</figure>

Division types: When encoding newspapers, the most common kind of major structural division will be “article” (not “story”, which is intended for encoding prose fiction short stories).

<div1 type="article">

A brief section (typically only one or two paragraphs) enclosed in a box and placed at the end of a column of text is called a “filler” and should be marked as <div1 type="filler">. (See newspaper markup example for two examples of this.)

The main heading on the top of the first page, showing the name of the newspaper, is called a “nameplate” and should be marked as <div1 type="nameplate">. The details of publication should be marked as follows:

  • volume number (sometimes expressed as a year, as in “79th Year”): <num type="volume" value="…">…</num>
  • place of publication: <name type="place">…</name>
  • date of publication: <date value="yyyy-mm-dd">…</date>
  • issue number: <num type="number" value="…">…</num>

For example:

<text>
<body>
<div1 type="nameplate">
<pb/>
<head>THE CAVALIER DAILY</head>
<ab><num type="volume" value="79">79th YEAR</num> 
<name type="place">UNIVERSITY OF VIRGINIA, CHARLOTTESVILLE</name>, 
<date value="1969-03-11">TUESDAY, MARCH 11, 1969</date> 
<num type="number" value="92">NUMBER 92</num></ab>
</div1>

The blocks of content appearing on the editorial page of the newspaper — which is almost always the second page of the newspaper — require special division types:

  • The listing of newspaper staff should be marked as <div1 type="masthead">.

    For the Cavalier Daily project, the masthead is often divided into two parts, although both appear on the same page (page 2). In these cases, both parts should be marked as div1 type="masthead", but they should also have an id attribute and use the next and prev attributes to indicate that they are related:

    <div1 type="masthead" id="m1" next="m2">
    ...
    </div1>
    .
    .
    .
    <div1 type="masthead" id="m2" prev="m1">
    ...
    </div1>
    
  • The section with the heading “Letters to the Editor” should be marked as <div1 type="op-ed">, while each individual letter should be marked as <div2 type="letter">.
    <div1 type="op-ed">
    <head>Letters To The Editor</head>
    <div2 type="letter">
    <opener>
    <salute>Dear Sir:</salute>
    </opener>
    ...
    </div2>
    <!-- more <div2 type="letter"> divisions here... -->
    </div1>
    
  • Other pieces on the editorial page should be marked as <div1 type="op-ed"> (“op-ed” is short for opinion/editorial).

For the Cavalier Daily project, most issues will have a “University Notices” section and a classified advertisements section. These sections are handled similarly. The “University Notices” typically appear first and should be marked as <div1 type="univ-notices">. Each category, such as “TODAY” or “MISCELLANEOUS”, should be marked as <div2 type="section">, and then each individual notice should be marked with <p>. For example:

<div1 type="univ-notices">
<head><i>University Notices</i></head>
<div2 type="section">
<head>TODAY</head>
<p>FELLOWSHIP of Christian Ath- <lb/>
letes meeting at 7:30 p.m., Wesley <lb/>
Foundation.</p>

Similarly, the classifieds section should be marked as <div1 type="classifieds">. Each category, such as “FOR SALE” or “WANTED”, should be marked as <div2 type="section">, and then each individual classified ad should be marked as <p>. For example:

<div1 type="classifieds">
<head>CLASSIFIEDS</head>
<div2 type="section">
<head>FOR RENT</head>
<p>Apartment for rent...</p>

If, but only if, none of the division types described above is appropriate to a particular block of content, use the generic type “section” (or, if “section” has already been used for a higher-level division, use “subsection”).

Special kinds of gaps: When encoding newspapers, some kinds of content should be excluded from the electronic transcription (due to copyright restrictions or other editorial reasons not pertaining to physical damage to the print source). Rather than using <gap desc="…" reason="editorial"/> for these gaps, use the special convenience elements provided by the DTD:

<ad/> for advertisements
<cartoon/> for cartoons appearing in a single frame or box
<comicStrip/> for cartoons appearing in multiple frames or boxes
<puzzle/> for crossword or other puzzles
<wireArticle/> for articles with a wire-service credit
<wirePhoto/> for photographs with a wire-service credit

Wire-service articles and photographs are identified by one of the following phrases in the byline or dateline:

  • AP
  • Associated Press
  • UPI
  • United Press International
  • UP
  • United Press

Such phrases are usually enclosed in parentheses. The following article dateline is typical:

WASHINGTON (UPI)

Wire-service articles are a special case. Instead of replacing the entire article with a <wireArticle/> element (in the spot where a <div1 type="article"> element would go if it were not a wire-service article), we want to capture the headline, but not the article content. For example:

<div1 type="article" id="a1.4">
<head>Finch Warns Cut-Off <lb/>
Of Grants to Rioters</head>
<wireArticle/>
<p/>
</div1>

[Newspaper markup example]


Journal Articles

[No special encoding requirements for journal articles at this time.]


Encyclopedias and Dictionaries

Encyclopedias

Encyclopedia entries typically consist mainly of prose paragraphs and do not normally pose any special markup issues. Each encyclopedia entry is a <div#> containing one or more headings followed by paragraphs.

Dictionaries

Dictionary entries should be encoded using the TEI additional tagset for print dictionaries. For detailed information on the use of these elements, see Chapter 12, Print Dictionaries, in the TEI Guidelines.

To enable the dictionary-specific features of the DTD, declare an entity named DICTIONARY with value “INCLUDE”, like this:

<!DOCTYPE TEI.2 SYSTEM "tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "uva-dl-tei/uva-dl-tei.dtd">
<!ENTITY % DICTIONARY "INCLUDE">
]>

In the simplest case, a dictionary entry has minimal grammatical information and only one definition:

After-night, n.   The time after it becomes night.

<entry>
<form><orth><b>After-night,</b></orth></form>
<gramGrp><pos><i>n.</i></pos></gramGrp>
<def>The time after it becomes night.</def>
</entry>

An entry may include alternative spellings for the main word, in which case the alternative spellings should be marked with <orth type="alt">:

Again, conj.   Agen; agin: By the time
that, untill: “I’ll have
        it ready
agin you come.”

<entry rend="hang">
<form><orth><b>Again,</b></orth></form>
<gramGrp><pos><i>conj.</i></pos></gramGrp>
<form><orth type="alt"><i>Agen; agin:</i></orth></form>
<def>By the time that, untill:</def>
<eg><q rend="inline">"I'll have <lb/>
it ready <i>agin</i> you come."</q></eg>
</entry>

The preceding example also includes a usage example, marked with <eg>. The <eg> element does not allow character data; instead, <eg> must contain <q> (for examples with no attributed source) or <cit> (for examples that include an attribution of the author or source text).

Note: Because U.Va. Library normally uses <q> only for block quotations, when using <q> in a dictionary entry please indicate <q rend="inline">, as shown above.

More complex dictionary entries may include more than one form of the same word — that is, multiple homographs (words identical in spelling but different in meaning or pronunciation), each marked with <hom>. Entries may also include more than one meaning for the same word, in which case the information (definitions, examples, etc.) for each meaning should be grouped as a <sense>. If the senses are labeled with numbers or letters in the print source, include the label in the n attribute:

Against, prep.   In resistance to; or defense from “They
        marched against the Spaniards.” (2.) Opposite. “Over
        against a point called Sandy Point.” Against, conj. “Keep
        ’em against I come.”

<entry rend="hang">
<hom>
<form><orth><b>Against,</b></orth></form>
<gramGrp><pos><i>prep.</i></pos></gramGrp>
<sense>
<def>In resistance to; or defense from</def>
<eg><q rend="inline">"They <lb/>
marched <i>against</i> the Spaniards."</q></eg>
</sense>
<sense n="2">
(2.) <def>Opposite.</def>
<eg><q rend="inline">"Over <lb/>
<i>against</i> a point called Sandy Point."</q></eg>
</sense>
</hom>
<hom>
<form><orth>Against,</orth></form>
<gramGrp><pos><i>conj.</i></pos></gramGrp>
<eg><q rend="inline">"Keep <lb/>
'em <i>against</i> I come."</q></eg>
</hom>
</entry>

[page image for the preceding examples]

In cases where words with identical spellings (homographs) receive separate entries in the dictionary (rather than being included within a single entry), each entry should be marked as an <entry> as usual, but then the group of entries should be wrapped in a <superEntry> element. [example]
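
As a minimal sketch of the <superEntry> wrapping described above (the headword and definitions are invented for illustration, not taken from any source text, and the markup inside each <entry> simply follows the patterns shown earlier in this section):

<superEntry>
<entry rend="hang">
<form><orth><b>Bear,</b></orth></form>
<gramGrp><pos><i>n.</i></pos></gramGrp>
<def>A large, heavy animal.</def>
</entry>
<entry rend="hang">
<form><orth><b>Bear,</b></orth></form>
<gramGrp><pos><i>v.</i></pos></gramGrp>
<def>To carry or support.</def>
</entry>
</superEntry>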


IV. Block-level Features

Block Quotations

The <q> element should be used to mark quotations that are set off typographically from the surrounding text (not quotations that are printed in the same typographic style and are indicated only by double quotation marks), as indicated by one or more of these typographic changes:

  • set off from the surrounding text by line breaks
  • indented more than the surrounding text
  • in a smaller typeface than the surrounding text

The <q> element should never be used to replace quotation marks. If the quotation is both set off from the surrounding text and enclosed in quotation marks, use the <q> element and also include the quotation marks.

[example]
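
A minimal sketch of a block quotation that is both set off from the surrounding text and enclosed in quotation marks. The wording is invented for illustration, and wrapping the quoted text in <p> inside <q> is an assumption rather than a stated requirement:

<p>The report closed with these words:</p>
<q>
<p>"The harvest was the poorest <lb/>
in living memory."</p>
</q>
<p>No further comment was offered.</p>

Note that the quotation marks printed in the source are retained inside the <q> element.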

Block quotations with openers and closers: If a block quotation contains an opener and/or closer, as in the case of quoted letters, newspaper articles, etc., use the <quotedLetter> element. (The <q> element does not allow <opener> or <closer>.) [example]
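
A hedged sketch of a quoted letter. It assumes that <quotedLetter> accepts an <opener>, one or more paragraphs, and a <closer>, in that order; the text above confirms only that openers and closers are permitted, so check the DTD for the exact content model. The letter itself is invented:

<quotedLetter>
<opener>
<dateline>Richmond, May 4, 1862.</dateline>
<salute>Dear Sir,</salute>
</opener>
<p>Your letter of the 1st instant has been received.</p>
<closer>
<salute>Yours truly,</salute>
<signed>J. Smith.</signed>
</closer>
</quotedLetter>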

See also Letters


Figures

Captions and associated text: When using the <figure> element to indicate non-textual content (illustrations, photographs, maps, etc.), use the <head> element to record the caption of the figure (if any). Use the <p> element to record text (if any) that is associated with the figure but is not part of the caption. [example]
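
A minimal sketch of a figure with a caption and a line of associated text. The caption and credit line are invented for illustration:

<figure>
<head>THE OLD MILL.</head>
<p>From a photograph by the author.</p>
</figure>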


Printer’s ornaments: Printer’s ornaments do not qualify as figures. Instead, ornaments should be marked with:

<ornament type="ornament"/>

See also Informal divisions.


Tables

Header cells: For cells that contain a label or heading, rather than data, use <cell role="label">. (For cells containing data, there is no need to include the role attribute; “data” is the default.)

Spanning rows or columns: If a cell occupies more than one row or column, use the rows or cols attribute, respectively, on the <cell> start-tag. (This is equivalent to the use of the rowspan and colspan attributes on <td> in HTML.)

Tables vs. lists: In some cases, the choice between <table> and <list> may not be obvious, but typically any items of text that are intended to line up vertically should be encoded as a <table>. A table of contents, list of illustrations, etc. should almost always be marked up as a table. Ask U.Va. Library for further guidance if needed.

[example]
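
A minimal sketch combining header cells and a spanning cell. The table content is invented; the second header cell spans two columns, so the data row beneath it has three cells:

<table>
<row>
<cell role="label">Item</cell>
<cell role="label" cols="2">Price</cell>
</row>
<row>
<cell>Flour</cell>
<cell>$5</cell>
<cell>per barrel</cell>
</row>
</table>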


Lists

Note that lists can be nested (a list <item> can contain a <list>). A common use of nested lists is for indexes where each entry contains indented sub-entries.
[example]
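
A minimal sketch of a nested list, modeled on an index entry with indented sub-entries. The entries and page numbers are invented for illustration:

<list>
<item>Agriculture
<list>
<item>crops, 12</item>
<item>tools, 15</item>
</list>
</item>
<item>Commerce, 40</item>
</list>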


Notes

Note reference vs. note body: By note reference we mean the anchor point for the annotation within the flow of the main text, typically indicated with a superscript number or symbol. By note body we mean the content of the annotation. The most common locations for the note body are:

  • the bottom margin of the page (footnotes)
  • a separate “Notes” section at the end of a major structural division (endnotes)
  • the left or right margin of the page (marginal notes)

Marking the note reference

In cases where the note reference is indicated by a number or symbol, as is almost always true of footnotes and endnotes, use <ref> to encode the note reference. [example]

In cases where no number or other referencing symbol is present (as is common for marginal notes, where the physical placement of the note on the page indicates which line or paragraph the annotation refers to), use <ptr/> to supply an anchor point for the annotation. [example]

Whether using <ref> or <ptr/>, always use the target attribute, the value of which must match the id attribute of the corresponding <note>.

Marking the note body

Use <note> to encode the note body.

Location within the XML document: With the exception of endnotes, which are already located in a separate section and should not be moved, the <note> element should occur at the point of the note’s attachment in the main text — that is, immediately after the <ref> or <ptr/> element.

Note symbols: When the note body includes the referencing symbol (a number, *, †, etc.), record this symbol using <seg type="note-symbol"> as the first element within <note>.

[example]
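
A minimal sketch of a numbered footnote, showing the note reference, the note body placed immediately after it, and the note symbol recorded as the first element within <note>. The text is invented, and the id and place attributes are the required attributes described in this section:

<p>The deed was recorded the following spring.<ref target="n1">1</ref>
<note id="n1" place="foot"><seg type="note-symbol">1</seg> See the county records for that year.</note>
No dispute arose afterward.</p>

The target value on <ref> matches the id value on <note>.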

Required attributes: Always include the id attribute, which must contain an ID that is unique within the XML document, and the place attribute, indicating the placement of the note on the printed page:

  • end — note appears in a separate division containing all the notes for a given section
  • foot — note appears in bottom margin
  • left — note appears in left margin
  • right — note appears in right margin
  • head — note appears in top margin
  • inline — note appears within the main body of the text
  • above — note appears between the lines of the main text, above the line to which it refers
  • below — note appears between the lines of the main text, below the line to which it refers

Note: When creating IDs for notes, use a simple, human-readable numbering scheme. For notes that are already numbered in the print source, include the number in the ID. For example:

  • If the notes are numbered sequentially throughout the entire work, use “n1”, “n2”, etc. (where “n” is short for “note”).
  • If the note numbering starts over at 1 in each chapter, create a unique ID by including the chapter number as well as the note number. For example, the note IDs for the third chapter would be “n3.1”, “n3.2”, etc.
  • If the note numbers or symbols start over on each page, create a unique ID by numbering the notes sequentially within each chapter. For example, if the 35th footnote in chapter VI is indicated by an asterisk (*), the note body would be marked: <note id="n6.35" place="foot"><seg type="note-symbol">*</seg>...</note>

Unanchored notes: If the note body has no corresponding referencing symbol (notes for which <ptr/> is used for the note reference, rather than <ref>; typically marginal notes), include the anchored attribute with a value of “no”. [example]
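
A minimal sketch of an unanchored marginal note. The text and the ID are invented for illustration; <ptr/> supplies the anchor point, and the note body follows it immediately:

<p>The fleet sailed at dawn.<ptr target="n2.4"/>
<note id="n2.4" place="left" anchored="no">Departure of the fleet.</note>
By noon it was out of sight.</p>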

Multiple references to a single note: In cases where a printed note (typically a footnote) is pointed to by more than one note reference, the printed note should be transcribed once (immediately following the first <ref> element), with the remaining <ref> elements pointing to that single <note> element. The <note> element should not be repeated for each <ref> element. [example]
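
A minimal sketch of two note references pointing to a single footnote. The text and the ID are invented; the <note> appears once, after the first <ref>, and the second <ref> reuses the same target value:

<p>The first estimate was disputed.<ref target="n5">*</ref>
<note id="n5" place="foot"><seg type="note-symbol">*</seg> Both figures are taken from the same ledger.</note>
The second estimate was accepted.<ref target="n5">*</ref></p>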


Other Features

Arguments, bibliographic citations, epigraphs, and trailers should be encoded as such using the appropriate TEI elements.

For example, an epigraph containing a quotation and a citation of its source should be marked up in this manner:

<epigraph>
<cit>
<q>"I have sworn upon the altar of God <lb/>
eternal hostility against every form of tyranny <lb/>
over the mind of man."</q>
<bibl><author> &mdash; <i>Thomas Jefferson.</i></author></bibl>
</cit>
</epigraph>

An argument (a leading section containing a summary of the content that follows it) is often presented as a series of topics separated by long dashes. An argument should be marked as an <argument>, not as a second <head>. [example]
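
A minimal sketch of an argument following a chapter heading. The heading and topics are invented, and the <div1> wrapper and the use of <p> inside <argument> are assumptions for illustration:

<div1 type="chapter" n="I">
<head>CHAPTER I.</head>
<argument>
<p>Departure from home&mdash;The voyage&mdash;Arrival in port.</p>
</argument>
<p>We left the farm before sunrise.</p>
</div1>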


V. Phrase-level Features

Changes in Typeface

With one exception, changes in typeface should be marked with the appropriate physical element, not with a logical element such as <emph>, <title>, <term>, <mentioned>, etc. The exception is foreign phrases, which should be marked with <foreign> (see Foreign Phrases). When marking changes in typeface, use the following elements:

  • <i> … </i> : italics
  • <b> … </b> : bold
  • <u> … </u> : underline
  • <sup> … </sup> : superscript
  • <sub> … </sub> : subscript
  • <smcap> … </smcap> : Small Caps
  • <hi rend="…"> … </hi> : All other typographic changes

Note on <smcap>: Text that is printed in small caps should be transcribed using both upper-case and lower-case letters, not all upper-case letters.
[example]
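
As a minimal sketch, a name printed in small caps with larger initial capitals (the name here is only an illustration) would be transcribed in mixed case:

<smcap>Thomas Jefferson</smcap>

rather than:

<smcap>THOMAS JEFFERSON</smcap>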

When using <hi> for changes in typeface, use one of the following values for the rend attribute:

  • gothic
  • line-through
  • overline
  • red-letter
  • roman – assumed and not normally necessary
  • other – indicate rendering using the other attribute, as
    in: <hi rend="other" other="...">

[example of gothic typeface]
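
A minimal sketch of <hi> with one of the rend values listed above. The sentence is invented for illustration:

<p>The notice was headed <hi rend="gothic">To the Public</hi> in heavy type.</p>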


Alignment and Indentation

When indicating alignment or indentation, use the globally available rend attribute, either on structural elements (<p>, <l>, <cell>, <item>, etc.) or on <hi>, as appropriate to the situation.

For indicating alignment, the rend value may be:

  • center
  • left – assumed and not normally necessary
  • right

For indicating indentation, rend may be:

  • indent
  • indent2, indent3, indent4, indent5 – for cases where more than one level of indentation needs to be recorded (Use these values sparingly, and only when “indent” has already been used. Normally these values are only needed when encoding lines of verse.)
  • hang – for hanging indentation

Default alignment: Some elements have a presumed or default alignment and do not normally require explicit alignment markup:

  • head – center
  • table – center
  • figure – center
  • elements within <titlePage> – center
  • trailer – center
  • ornament – center
  • dateline – right
  • salute within <opener> – left
  • salute within <closer> – right, with some indentation toward the left
  • signed – right
  • all other elements – left

These elements should contain alignment markup only when the layout of the element on the printed page differs from the defaults listed above.
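
A minimal sketch of alignment and indentation markup. The text is invented, and the <lg> wrapper around the verse lines is an assumption for illustration. The centered paragraph needs explicit markup because paragraphs default to left alignment:

<p rend="center">THE END.</p>
<lg>
<l>First line of the stanza,</l>
<l rend="indent">second line, indented,</l>
<l rend="indent2">third line, indented further.</l>
</lg>

A centered <head>, by contrast, would carry no rend attribute, since center is its default.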


Foreign Phrases

Using <foreign>: Words or phrases that are both (a) typographically distinct (usually in italics), and (b) not in the main language of the text (almost always English), should be marked with the <foreign> element. Whenever possible, include the lang attribute, using one of the standard ISO 639-2 three-character language codes. Occasionally, the language will not be obvious, in which case encode the phrase with <foreign> but without the lang attribute. Commonly used ISO 639-2 codes include:

fre French
ger German
grc Greek, ancient (to 1453)
gre Greek, modern (1453- )
heb Hebrew
ita Italian
lat Latin
rus Russian
spa Spanish

Each language identified by a lang attribute (on <foreign>, or on any other element) must be declared in a <language> element within <teiHeader><profileDesc><langUsage> in order for the XML document to validate.

[example]
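
A hedged sketch of the two pieces that must agree. It assumes that, as in unmodified TEI P4, the lang attribute refers to the id of a <language> element; the sentence is invented for illustration. In the header:

<profileDesc>
<langUsage>
<language id="eng">English</language>
<language id="lat">Latin</language>
</langUsage>
</profileDesc>

And in the text:

<p>He acted, as the lawyers say, <foreign lang="lat"><i>bona fide</i></foreign>.</p>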

Retaining typographic distinction: Using the <foreign> element does not obviate the need to encode the change in typeface. Since foreign phrases are usually italicized, typical markup for a foreign phrase will be: <foreign lang="…"><i>…</i></foreign>

Non-roman characters: For languages requiring non-roman characters, see Non-roman characters.


Punctuation

Standard keyboard punctuation: Most common punctuation characters can and should be represented using their ASCII keyboard characters:

  • exclamation point !
  • dollar sign $
  • percent sign %
  • asterisk *
  • opening and closing parentheses ( )
  • hyphen -
  • opening and closing square brackets [ ]
  • opening and closing braces { }
  • colon :
  • semicolon ;
  • double quotation mark "
  • single quotation mark and apostrophe '
  • comma ,
  • period .
  • solidus (forward slash) /
  • question mark ?

Non-ASCII punctuation: Non-ASCII punctuation marks must be represented using either standard character entities or numeric XML character references. For example:

  • ampersand &amp; (as required by XML)
  • em dash &mdash; or &#x2014; (long dash)
  • en dash &ndash; or &#x2013; (medium-length dash, often used to indicate a range, e.g. 1783&ndash;1804)

An ellipsis (a series of dots or asterisks indicating deliberately omitted text) should be indicated by a series of keyboard-character periods or asterisks. Simply use the same number of periods or asterisks used in the print source.

If the print source contains an exceptionally long space that needs to be preserved (for example, to indicate a word deliberately omitted by the author), use a series of em space (&emsp; or &#x2003;) characters.

Many other marks of punctuation are available in the iso-num.ent, iso-pub.ent, and iso-tech.ent character entity sets. See Special Characters below.

Spacing between sentences: Use one space character between sentences, not two, regardless of the apparent spacing in the print source.


VI. Reference Systems

Page Breaks

Use <pb/> to mark page breaks — that is, to mark the point at which a page begins.

Always at top of page: The <pb/> element should always be placed at the top or beginning of the page, regardless of the position of the printed page number in the print source.

Page numbers: If the page contains a printed page number, record it in the n attribute; if not, do not include the n attribute. Exception: For <pb/> elements indicating spine, covers, and edges, include the n attribute with an appropriate label, such as n="Front Cover". See Images for Book Spine, Covers, and Edges.

Always within a div: Page breaks must be placed within a <div#> element, never between divisions. Therefore, when a division starts on a new page, the <pb/> is the first element in the division, immediately following the opening <div#> tag (preceding even the division <head>, if there is one). For example:

</div2>
<div2 type="chapter" n="II">
<pb/>
<head>II&mdash;APPELLATIONS.</head>

Blank pages: There must be one <pb/> element for every page in the set of page images for the work. This is true even for pages that have no text or other printed content on them.

If the blank page occurs between divs, place the blank page’s <pb/> element as the last page of the preceding div, not as the first page of the new div.

<!-- end of last chapter --></p>
<pb/> <!-- blank page between last chapter and bibliography -->
</div1>
</body>
<back>
<div1 type="bibliography">
<pb/>
<head>BIBLIOGRAPHY</head>

Running page headers: Normally running page headers should be excluded from the electronic text (see Exceptions). In some cases, however, U.Va. Library will specifically request that the running headers be preserved for a particular book. To encode the running headers, use <fw type="header"> within <pb>:

<pb n="99"><fw type="header">APPEAL TO CHURCHES OF MASSACHUSETTS</fw></pb>

Column Breaks

Use the <cb/> element to mark column breaks — that is, to mark the point at which a column of text begins. Of course, many books have a single-column layout, in which case it is not necessary to mark the column at all. Other materials, however, such as dictionaries, encyclopedias, newspapers, and journals, are commonly printed in multiple columns and require the use of <cb/> to indicate the layout.

Always at top: Like page breaks (<pb/>), which should always be placed at the top of the page, <cb/> should always mark the top or beginning of the column of text.

n attribute: For <cb/>, use the n attribute to record the number of the column on the page. If each page contains two columns, the first (leftmost) column on each page is <cb n="1"/>, and the second column is <cb n="2"/>.

Mixed column layouts: In cases where the number of columns changes mid-page, use the <cols/> element to indicate the point at which the number of columns changes. Use the n attribute to indicate the number of columns in the section that follows the <cols/> tag. For example, if the page layout shifts from single-column to double-column in the midst of the page, use <cols n="2"/> to indicate the point at which double-column layout begins (and then use <cb n="1"/> and <cb n="2"/> to mark the columns, as usual). At the point where the layout shifts back to single-column text, use <cols n="1"/> (after which no <cb/> elements are necessary, since the layout is single-column). [example]
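
A hedged sketch of a page that shifts from one column to two and back again. The page number and text are invented, and placing <cols/> and <cb/> between paragraphs is an assumption; in a real page the breaks may well fall inside a paragraph:

<pb n="12"/>
<p>Single-column text at the top of the page.</p>
<cols n="2"/>
<cb n="1"/>
<p>Text of the first column.</p>
<cb n="2"/>
<p>Text of the second column.</p>
<cols n="1"/>
<p>Single-column text resumes.</p>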

Note: A division <head> followed by a multi-column layout does not indicate a mixed-column layout and does not require <cols n="…"/>.
[example]


Line Breaks

Line breaks in running prose should be preserved in the electronic transcription by marking the end of each printed line with the standard <lb/> element.


Special Considerations

Illegible Words or Phrases

Use <unclear> to mark passages that cannot be transcribed with certainty, as happens when a character/word/phrase is present on the physical page but is unreadable (due to a printing error, a reader’s markings, mold spots, etc.).

When marking an illegible character/word/phrase as unclear, transcribe the readable characters, omit the unreadable characters, and mark the entire word or phrase with <unclear>…</unclear>. For example:

Correct:

lost, wounded or captured in a fruitless and <lb/> <unclear>opeless</unclear> assault. <lb/> 

Wrong:

lost, wounded or captured in a fruitless and <lb/> <unclear/>opeless assault. <lb/> 

Note: If one or more illegible characters come immediately before or after an end-of-line hyphen, include the entire hyphenated word within the <unclear> element. For example:

Correct:

cleared his throat, and quietly withdrew, <unclear>maintain- <lb/> ing</unclear> to the last his unprejudiced demeanour.</p>

Wrong:

cleared his throat, and quietly withdrew, maintain- <lb/> <unclear>ing</unclear> to the last his unprejudiced demeanour.</p>

If a word or phrase is so illegible that none of its characters can be read with certainty, use an empty <unclear/> element to mark the point at which the word or phrase occurs:

instead of his sargeant; and therefore no regiment <lb/> <unclear/> to be seen in which there are not soldiers in <lb/>

Omitted Words or Phrases

If <unclear> is not applicable, use <gap/> to mark any section (character, word, passage, page, etc.) that is being omitted from the transcription. There are two reasons for such omissions:

  • The section is missing from the physical page — typically due to a partially torn page or a completely missing page.
  • The section has been excluded deliberately, at the request of U.Va. Library, typically for non-roman characters in a language outside the vendor’s capabilities. In this case, see Non-roman characters.

When using the <gap/> element, include the desc (description) and reason attributes. The reason attribute accepts these values:

  • editorial – the section is omitted deliberately at the request of U.Va. Library (for example, non-roman characters)
  • damage – the section is omitted because of damage to the physical page (for example, a partially torn page)
  • missing – the section is missing entirely (for example, a completely missing page) [rarely used]
  • other – indicate the reason in the “other” attribute, as in: <gap desc="..." reason="other" other="..."/> [rarely used]

Example:

<gap desc="page 43, line 17 to end of page" reason="damage"/>

Arbitrary Sections

In texts with complex structure or layout, the encoder is likely to encounter block-level sections or phrase-level passages that are difficult to fit into any of the standard TEI elements. In such cases, it may be best to take advantage of TEI’s elements for arbitrary sections:

  • <ab> – (anonymous block) occurs at the block level (at same level as <p>, <table>, <list>, etc.)
  • <seg> – (segment) occurs at the phrase level (within <p>, table cells, list items, etc.)

Both of these elements accept the type attribute with any value (no predefined vocabulary).

Although these elements should be used sparingly, they are very useful when genuinely needed.

IMPORTANT: It is better to use <ab> or <seg>, when appropriate, than to inject inappropriate markup — such as <div#> elements that do not truly reflect the major structural divisions of the work, or <p> elements that are not really paragraphs — for the sake of “making it parse.”

If a work contains a particularly problematic feature for which the preferred encoding is not clear, ask U.Va. Library for further guidance.


Special Characters

When the XML document contains non-ASCII characters, convert those characters to ASCII using one of the following methods:

  • Use a numeric XML character reference indicating a Unicode character, using hexadecimal notation
  • Use a named character entity (which resolves to the same Unicode character, using hexadecimal notation)

For example, in a transcription containing the following:

afford us a glimpse of its præscience.</p>

the line can be represented using a hexadecimal character reference:

afford us a glimpse of its pr&#xE6;science.</p>

or using an entity:

afford us a glimpse of its pr&aelig;science.</p>

When using entities, note that character entity sets are not declared in the external DTD. Instead, you will need to declare and invoke any entity sets required by a given document in the document’s internal subset. For example:

<?xml version='1.0'?>
<!DOCTYPE TEI.2 SYSTEM "http://text.lib.virginia.edu/dtd/tei/tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "http://text.lib.virginia.edu/dtd/tei/uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "http://text.lib.virginia.edu/dtd/tei/uva-dl-tei/uva-dl-tei.dtd">

<!ENTITY % ISOlat1 SYSTEM "http://text.lib.virginia.edu/charent/iso-lat1.ent"> %ISOlat1;
<!ENTITY % ISOlat2 SYSTEM "http://text.lib.virginia.edu/charent/iso-lat2.ent"> %ISOlat2;
<!ENTITY % ISOnum  SYSTEM "http://text.lib.virginia.edu/charent/iso-num.ent"> %ISOnum;
<!ENTITY % ISOpub  SYSTEM "http://text.lib.virginia.edu/charent/iso-pub.ent"> %ISOpub;
<!ENTITY % ISOtech SYSTEM "http://text.lib.virginia.edu/charent/iso-tech.ent"> %ISOtech;
]>

If you prefer to use local copies of our DTD-related files rather than URLs, download copies of our character entity sets here:

uva-charent.zip

Do not use your own, local versions of the ISO 8879 (SGML) entity sets, as our versions may include corrections, as well as a supplementary set (uva-supp.ent) containing characters not available in the standard sets.

When working with character entity sets, you may want to refer to the following resource:

XML Character Entities
http://www.oasis-open.org/docbook/specs/wd-docbook-xmlcharent-0.3.html

Conveniently lists the characters in the standard ISO 8879 entity sets, with graphics showing the typical appearance of each character.

When working with Unicode characters in general, you may want to refer to the following resource:

Unicode Code Charts
http://www.unicode.org/charts/

Contains links to PDF code charts of Unicode characters, organized by category (that is, by Unicode block).

Non-roman characters: When transcribing words/phrases/sections containing non-roman characters, use markup appropriate to the vendor’s capabilities:

  • If the language is within the vendor’s capabilities to transcribe, the characters should be transcribed using either the appropriate character entities or XML character references with hexadecimal Unicode values (see Special Characters above). This is often the case for languages such as Greek, Hebrew, Russian, etc. that require non-roman characters, but are alphabetic, not ideographic.
    Greek: Refer to the iso-grk1.ent character entities, supplemented as needed by the accented characters in iso-grk2.ent
    Hebrew: Use the Hebrew block of Unicode (0590 – 05FF)
    Russian: Use the Cyrillic block of Unicode (0400 – 04FF)
  • If the language is not within the vendor’s capabilities to transcribe, omit the characters from the transcription, using <gap desc="..." reason="editorial"/>. Use one <gap/> element for each unbroken section of content that is omitted from the electronic transcription.

    Example:

    <gap desc="Chinese characters" reason="editorial"/>
    

    See also: Omitted Words or Phrases

Characters not in Unicode: In some cases, a particular character may not be available in Unicode. It is often possible, however, to indicate the needed character by using the Unicode combining diacritics (see Unicode blocks 0300-036F, 1DC0-1DFF, FE20-FE2F).

[example]
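
A hedged sketch of the combining-diacritic approach. It assumes a print source in which a macron appears over the letter m (an abbreviation mark in some early printed books) and that no single precomposed character is available for that combination. The combining macron (U+0304) is placed immediately after the base letter it modifies:

<p>for the com&#x0304;on good</p>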

Unknown characters: If the transcriptionists cannot identify a character, use a <gap/> element to indicate that a character has been omitted from the transcription:

<gap desc="unknown character" reason="editorial"/>

See also: Omitted Words or Phrases


Final Steps

Line endings: If the XML files contain two-byte line endings (carriage return + linefeed), please convert them to Unix-style, one-byte line endings (linefeed character) before delivering the finished files to U.Va. Library.