(c) Ian Lancashire 1994
Representative Poetry root files are encoded in Standard Generalized Markup Language (SGML), an ISO-sponsored syntax for the tagging of electronic documents, but are converted to Hypertext Markup Language (HTML) for display on the World Wide Web. SGML tags are useful for the exchange of electronic documents across many different computers and for the enrichment of texts with information. Although employing the syntax of SGML, HTML tags are few in number and are designed to display files for browsing on the World Wide Web.
You have access on the World Wide Web only to the HTML encoded file.
SGML-encoded Root Files
Here is a typical SGML-encoded file, Blake's "Little Lamb":
SGML syntax requires that any tag be delimited or enclosed by angle or diamond brackets. Inside SGML tags, you will always find an element name (e.g., `pmdv1'), and sometimes attributes for that element (e.g., `type="poem"'). Most tags come in pairs, an opener and a closer. The closer tag normally consists only of the element name, without attributes, preceded by a slash (e.g., `</pmdv1>'). Each tag names a feature that applies to all words between the opener tag and the closer tag. Thus the sequence
<heading> THE LAMB
</heading>
indicates that the words `THE LAMB' are a heading. Tags relate to tagged text as variables do to values.
Some tags used in the poem `Little Lamb' mark divisions that nest inside one another. Thus the top-most division (`pmdv1') marks the entire poem. Its closer tag appears at the end. This tag contains the middle division, designating the stanzas repeated in the poem (tagged as `pmdv2'). The lowest level is the verse line, which always appears inside a stanza and is generally repeated (`pmdv3').
This literary structure, then, has verse lines always and only occurring within stanzas, and stanzas always and only occurring within poems.
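The nesting rule can be checked mechanically. The following Python sketch is not part of the Representative Poetry tools. It assumes opener tags of the form `<pmdv1 ...>` and closer tags of the form `</pmdv1>`, and the sample fragment (including the `type="stanza"` attribute) is invented for illustration rather than copied from the root file.

```python
import re

# Assumed nesting order, outermost first: poem, stanza, verse line.
LEVELS = ["pmdv1", "pmdv2", "pmdv3"]
TAG = re.compile(r"<(/?)(pmdv[123])\b[^>]*>")

def check_nesting(text):
    """Return True if every pmdv tag opens directly inside its parent
    level and every closer matches the most recent opener."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if closing:
            if not stack or stack.pop() != name:
                return False
        else:
            depth = LEVELS.index(name)
            # A level-n division may open only when exactly n levels are open.
            if len(stack) != depth:
                return False
            stack.append(name)
    return not stack

# Hypothetical fragment, not the actual root file.
sample = (
    '<pmdv1 type="poem"><pmdv2 type="stanza">'
    "<pmdv3>Little Lamb, who made thee?</pmdv3>"
    "<pmdv3>Dost thou know who made thee?</pmdv3>"
    "</pmdv2></pmdv1>"
)
print(check_nesting(sample))
```

A verse line placed directly inside the poem division, with no stanza around it, makes the function return False.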
There are two anchor tags:
<a href="utel_rp_sourcebib.html#Blake 1789">Source reference.</a>
<a name="8"> { }{ } { }{ } Little Lamb, who made thee?</a>
These mark cross-references. The anchor tag with an `href' attribute gives an address of another file to which the enclosed text refers. The anchor tag with a `name' attribute provides an address for any number of cross-references to it. Thus the first anchor tag refers to a file with another anchor tag having a `name' attribute with the value `Blake 1789'. The second illustrative anchor tag may be used by other anchor tags with `href' attributes as the target for a cross-reference. Here the anchor tag with the `href' attribute enables a reader to move quickly from the short source reference to a full bibliographical citation in a separate document; and the anchor tag with the `name' attribute marks the first line of the poem so that a first-line index has a target at which to aim a cross-reference.
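How an `href' value pairs with a `name' value can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the content of the bibliography file is a placeholder, since that file is not reproduced here.

```python
import re

def split_href(href):
    """Split an href value into (file, target name); the part after '#'
    names an anchor inside the file."""
    filename, _, target = href.partition("#")
    return filename, target

def find_named_anchors(html):
    """Return the set of values of name attributes on anchor tags."""
    return set(re.findall(r'<a\s+name="([^"]*)"', html))

href = "utel_rp_sourcebib.html#Blake 1789"
filename, target = split_href(href)

# Placeholder for the bibliography file; its real content is not shown here.
sourcebib = '<a name="Blake 1789">(full citation)</a>'

print(filename)
print(target in find_named_anchors(sourcebib))
```

A cross-reference is sound when the target after the `#' appears among the `name' values of the file named before it.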
Finally, note that Representative Poetry uses a few letters or other symbols not available in many computer character sets. These special characters are represented either by simple codes within single braces, or by SGML entities. Which notation appears depends on the browser. SGML root files always use the brace notation.
For example, the single space is tagged as `{ }' in Blake's `Little Lamb' above but is converted to `&nbsp;' (the HTML entity for a non-breaking single space) for the purpose of display by HTML browsers. Some browsers, like Lynx, support few special characters. Others, like Mosaic and Netscape, support most of the following entity references.
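The conversion from brace codes to entities amounts to a table lookup. The library itself does this conversion with sed; the Python sketch below only illustrates the idea, and its table holds the one mapping stated in the text. The table and function names are hypothetical.

```python
# Only the mapping given in the text is listed; other brace codes would
# be added to this table in the same way.
BRACE_TO_ENTITY = {
    "{ }": "&nbsp;",
}

def braces_to_entities(line):
    """Replace each known brace code in an SGML root line with its HTML entity."""
    for code, entity in BRACE_TO_ENTITY.items():
        line = line.replace(code, entity)
    return line

root_line = '<a name="8"> { }{ } { }{ } Little Lamb, who made thee?</a>'
print(braces_to_entities(root_line))
```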
Here are the most important special characters:
Greek is transliterated into English letters and placed within
<lang> ... </lang> reference tags. Other characters,
e.g., e-macron and oe-ligature, have no entity references yet in the
basic Latin character set used above.
HTML Markup
This simple encoding system employs SGML (Standard Generalized Markup Language) syntax. HTML is partly indebted to the Text-Encoding Initiative Guidelines, edited by Lou Burnard and Michael Sperberg-McQueen (1994).
The online literature on HTML is sizable.
HTML tags employed in Representative Poetry may be either single, such as <br> (indicating a line-break) or <p> (indicating a paragraph), or paired. The single tags stand for several unprintable characters and, when interpreted by Lynx, act on the text directly. Where the tag <br> appears, for instance, a line-break occurs. The paired tags surround a passage of text and characterize it in some way. The most important tag-pair, <html> </html>, encloses the entire text. Note that the closing tag of this pair is identical to the opening tag except for the added virgule or forward-slash /. This feature characterizes all HTML paired tags.
Here follows an HTML file generated by a sed script on the SGML file of Blake's `Little Lamb'.
Here are the HTML tags employed in this library:
The NCSA's
Beginner's Guide to HTML gives easy-to-understand instructions on
how to encode a document with these and other HTML tags.
COCOA- or TACT-style Markup
A third tagging system exists, suitable for DOS-based text-analysis software like Oxford Concordance Program and TACT. These tags normally consist of an opening angle bracket, an unchanging variable (e.g., "author" in "<author Shakespeare>"), a space, a changing value (e.g., "Shakespeare"), and a closing angle bracket.
In the following list, COCOA- or TACT-style tag values are indicated by "xxxx".
Although simple, this tagging scheme attempts to characterize the text faithfully.
All COCOA- and TACT-style tags are single. They hold true until
another tag of the same type--with a different value--appears. Thus
the tag
The "xxxx" value of the <tt xxxx> tag gives the type of text for
all words following (until the next <tt xxxx> tag). These values include
"epigraph", "nt:leftmargin" (note in left margin), "RPheading" (poem
title or heading in Representative Poetry), "RPsubheading"
(poem subtitle or subheading in RP), "RPheadingno" (the number
given to the canto, stanza, etc., in RP), "sppfx" (speech
prefix occurring in dramatic poems), "stagedir" (stage direction), and
"text" (the poem itself). This tag ensures that text-analysis programs
can retrieve words according to type.
Every line in a poem has a prefixed number that identifies both
the in-sequence number of the poem in RP, and the line number
in the poem. In any <lx yyyy> tag, "x" is the in-sequence number and
"yyyy" the poem's line-number. Besides serving as a useful method
of reference, exhaustive lineation of this kind ensures that the entire
poem is included in the file. Any disruption in lineation indicates a
corrupted file. Remove this lineation at your own risk.
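A disruption in lineation can be detected automatically. The Python sketch below is an illustration, not one of the library's programs. It assumes that each tag looks like `<l8 1>` (poem 8, line 1) and that numbering starts at 1 in each poem and rises by 1; the sample text is invented.

```python
import re

LINE_TAG = re.compile(r"<l(\d+) (\d+)>")

def lineation_gaps(text):
    """Return (poem, expected, found) for each break in line numbering.

    Numbering is assumed to start at 1 in each poem and rise by 1."""
    gaps = []
    expected = {}  # poem number -> next expected line number
    for poem, line in LINE_TAG.findall(text):
        poem, line = int(poem), int(line)
        want = expected.get(poem, 1)
        if line != want:
            gaps.append((poem, want, line))
        expected[poem] = line + 1
    return gaps

# Hypothetical fragment: line 3 of poem 8 is missing.
sample = "<l8 1> first\n<l8 2> second\n<l8 4> fourth\n<l9 1> first\n"
print(lineation_gaps(sample))
```

An empty list means that the lineation is unbroken under these assumptions.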
A standard reference for any word or phrase retrieved from
this file may be taken from the current values of the
<author>, <poemtitle>, <subtitle>, <copytext>, and <lx> tags.
These give the poet's name, the title and subtitle of the poem,
the volume and page reference in Representative Poetry, and
the verse line-number.
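Because each tag holds until the next tag of its type, a program can build a reference for any word by keeping the current value of every tag as it reads. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, the tagged fragment and its values are invented, and the `<lx yyyy>` tag is stored under the invented keys "poem" and "line".

```python
import re

# A COCOA- or TACT-style tag: a variable, a space, then a value.
COCOA_TAG = re.compile(r"<(\w+) ([^>]*)>")

def words_with_references(text):
    """Yield (word, state) pairs, where state holds the tag values in
    force at that word. Each tag holds until the next tag of its type."""
    state = {}
    pos = 0
    for match in COCOA_TAG.finditer(text):
        for word in text[pos:match.start()].split():
            yield word, dict(state)
        name, value = match.group(1), match.group(2)
        if re.fullmatch(r"l\d+", name):
            # <lx yyyy>: x is the poem's sequence number, yyyy the line.
            state["poem"], state["line"] = name[1:], value
        else:
            state[name] = value
        pos = match.end()
    for word in text[pos:].split():
        yield word, dict(state)

# Hypothetical tagged fragment; the values are invented for illustration.
sample = (
    "<author William Blake> <poemtitle The Lamb> <tt text>\n"
    "<l8 1> Little Lamb, who made thee?\n"
    "<l8 2> Dost thou know who made thee?\n"
)
for word, ref in words_with_references(sample):
    if word.startswith("Dost"):
        print(ref["author"], ref["poemtitle"], ref["line"])
```

The same state dictionary also carries the current `<tt xxxx>` value, so words can be filtered by type of text in the same pass.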
In the Context of an Electronic Library
The poems in Representative Poetry are split into poem files,
but any library in English literary history consists of
books and manuscripts, not authors. For this reason, the
RP author files
include, from time to time, hypertext links to the earliest copy
of the source on which their texts are based.
For example, Representative Poetry includes John Dryden's poem
entitled "To the Pious Memory of the Accomplished Young Lady Mrs. Anne
Killigrew." This poem first appeared in Killigrew's Poems,
published posthumously in 1686. An electronic copy of this text,
encoded in HTML but also following a more complicated encoding system for
Renaissance texts, also exists online. HTML <a> tags permit us
to move back and forth between the modernized edition in
Representative Poetry and the source edition of 1686.
The RP author files give much of the best poetry
in English literature up to the late 19th century, but they also
lead readers back into the source literature so that they may make
their own selection. The hypertext links possible in an electronic
library only do--albeit faster, less expensively, and more conveniently--
what the editors did at the end of the volume.
Sample Tagging Programs
It makes sense to automate as much of the work in preparing an online
library as possible. Scripts (collections of commands
that can be executed in sequence automatically, like DOS batch files)
can set sed (the UNIX stream editor), perl (an easy-to-use
programming language), fgrep, and other UNIX utilities
to transform large numbers of files or to extract information from
them for indexes. Here are some examples.
None of these programs is exemplary in form or structure, but they all
did exactly what I wanted them to do.
The first example is a UNIX script that generates the `poet files' in
the electronic Representative Poetry. First I edited a
template for this kind of file and stored it as `rptemplate0.html'.
This template included simple codes for types of information, e.g., `xxx
(aaa)' for the poet's name, followed by life-dates.
Then I put the following script into a text file. For each poet in the
collection, the script copies the template into a file `0' and runs on
it a sed script whose name begins with `0', follows with the poet's
name, and ends with the extension `.ctrl'. The transformed text is
output to the `poet file' (e.g., `arnold0.html').
#script to make poet0.html files
#for each poet: copy the template to `0', edit it with sed, remove `0'
cp rptemplate0.html 0
sed -f "0arnold.ctrl" 0 > arnold0.html
rm 0
cp rptemplate0.html 0
sed -f "0blake.ctrl" 0 > blake0.html
rm 0
cp rptemplate0.html 0
sed -f "0browning.ctrl" 0 > browning0.html
rm 0
....
The `0arnold.ctrl' file follows. It contains simple editing commands
to make five substitutions and to append a list of anchor tags for a
poem index.
#sed script to insert fields in the poet0 file
s/xxx/Matthew Arnold/
s/yyy/H. Kerpneck/
s/zzz/arnold/
s/bbb/Arnold/
s/aaa/1822-1888/
/<ol>/a\
<li><a href="utel_rp_poems_arnold14.html">Bacchanalia</a>\
<li><a href="utel_rp_poems_arnold6.html">Consolation</a>\
<li><a href="utel_rp_poems_arnold21.html">Dover Beach</a>\
<li><a href="utel_rp_poems_arnold19.html">Immortality</a>\
<li><a href="utel_rp_poems_arnold15.html">Isolation: To Marguerite</a>\
<li><a href="utel_rp_poems_arnold7.html">Lines Written in Kensington Gardens</a>\
<li><a href="utel_rp_poems_arnold3.html">Memorial Verses April 1850</a>\
<li><a href="utel_rp_poems_arnold23.html">Palladium</a>\
<li><a href="utel_rp_poems_arnold11.html">Philomela</a>\
<li><a href="utel_rp_poems_arnold10.html">Requiescat</a>\
<li><a href="utel_rp_poems_arnold22.html">Rugby Chapel</a>\
<li><a href="utel_rp_poems_arnold4.html">Self-Dependence</a>\
<li><a href="utel_rp_poems_arnold2.html">Shakespeare</a>\
<li><a href="utel_rp_poems_arnold12.html">Sohrab and Rustum</a>\
<li><a href="utel_rp_poems_arnold13.html">Stanzas from the Grande Chartreuse</a>\
<li><a href="utel_rp_poems_arnold8.html">The Buried Life</a>\
<li><a href="utel_rp_poems_arnold1.html">The Forsaken Merman</a>\
<li><a href="utel_rp_poems_arnold5.html">The Future</a>\
<li><a href="utel_rp_poems_arnold9.html">The Scholar-Gipsy</a>\
<li><a href="utel_rp_poems_arnold17.html">Thyrsis: A Monody, to Commemorate the Author's Friend, Arthur Hugh Clough</a>\
<li><a href="utel_rp_poems_arnold16.html">To Marguerite: Continued</a>\
<li><a href="utel_rp_poems_arnold20.html">Worldly Place</a>\
<li><a href="utel_rp_poems_arnold18.html">Youth and Calm</a>
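For readers unfamiliar with sed, the effect of such a control file can be approximated in Python. This sketch is an illustration and not the method used for the library: the function name and the miniature template are hypothetical, only two of the five codes are shown, and the codes are treated as plain strings where sed treats them as patterns.

```python
def fill_template(template, fields, poem_items):
    """Mimic the sed control file: replace the first occurrence of each
    code on every line, then append the poem list after any <ol> line."""
    out = []
    for line in template.splitlines():
        for code, value in fields:
            line = line.replace(code, value, 1)
        out.append(line)
        if "<ol>" in line:
            out.extend(poem_items)
    return "\n".join(out) + "\n"

# Hypothetical template; the real rptemplate0.html is not reproduced here.
template = "<h1>xxx (aaa)</h1>\n<ol>\n</ol>\n"
fields = [("xxx", "Matthew Arnold"), ("aaa", "1822-1888")]
items = ['<li><a href="utel_rp_poems_arnold21.html">Dover Beach</a>']
print(fill_template(template, fields, items), end="")
```

Like the sed `s' command without a `g' flag, the sketch replaces only the first occurrence of a code on each line.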
The first perl program renumbers verse lines in sequence where
it finds the string `00', restarting at `1' again each time it happens
on either `h1' or `subhead' tags. During execution the program asks
one for the input and output filenames. These could be given in
parameters, but I wanted to be quite sure what I was doing and so used
this somewhat tedious procedure. The input file contains all poems by
a given author.
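#!/usr/bin/perl
# Renumber verse lines: each ` 00>' placeholder becomes the next line
# number, and numbering restarts at 1 after an <h3 or <subhead tag.
use strict;
use warnings;

# Ask for the input filename and give one chance to correct it.
print "Unnumbered filename?\n";
chomp(my $infile = <STDIN>);
print "Is `$infile' the right filename? (y/n)\n";
chomp(my $answer = <STDIN>);
if ($answer eq "n") {
    print "Unnumbered filename, eh?\n";
    chomp($infile = <STDIN>);
} else {
    print "ok\n";
}

# Ask for the output filename in the same way.
print "Output numbered filename?\n";
chomp(my $outfile = <STDIN>);
print "Is `$outfile' the right output file? (y/n)\n";
chomp($answer = <STDIN>);
if ($answer eq "n") {
    print "Numbered filename, eh?\n";
    chomp($outfile = <STDIN>);
} else {
    print "ok\n";
}

open(IN, "<", $infile) or die "Cannot read $infile: $!\n";
open(OUT, ">", $outfile) or die "Cannot write $outfile: $!\n";
my $n = 1;
while (<IN>) {
    if (/<h3/ || /<subhead/) {
        # A new heading restarts the line count.
        $n = 1;
    } elsif (/ 00>/) {
        # Overwrite the first `00>' on the line with the current number.
        my $target = index($_, "00>");
        substr($_, $target, 3) = "$n>";
        ++$n;
    }
    print OUT $_;
}
close(IN);
close(OUT);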
The second perl program extracts single poems from this
numbered file (as output by the preceding program) and writes each
to its own file, named after the author file and numbered in sequence.
Note that I supply the
input filename on the command line as the first argument. In this way
I could make a script containing commands to extract the poems from
all the author files at one stroke.
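#!/usr/bin/perl
# Split a numbered author file into one file per poem; each poem
# starts at an <h1> tag. The author file is the first argument.
use strict;
use warnings;

# Split the input filename at its last period.
my $period = rindex($ARGV[0], ".");
my $head = substr($ARGV[0], 0, $period);
my $tail = substr($ARGV[0], $period + 1);
print "head= '$head', tail = '$tail'\n";
my $n = 0;
my $poemfilename;
while (<>) {
    if (/<h1>/) {
        # A new poem begins: close the previous file, open the next.
        close(POEMFILE) if $n > 0;
        ++$n;
        $poemfilename = "$head$n.sgml";
        open(POEMFILE, ">", $poemfilename)
            or die "Cannot write $poemfilename: $!\n";
    }
    # Lines before the first <h1> belong to no poem and are skipped.
    print POEMFILE $_ if $n > 0;
}
close(POEMFILE) if $n > 0;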
Department of English, New College
Centre for Computing in the Humanities
Robarts Library
University of Toronto
Toronto, Ont. M5S 1A1