Wednesday, February 8, 2012

Digital scholarship must be technology-agnostic


As smart phones and tablets assume an ever-larger role in browsing the web, “responsive design” has become a hot topic among web designers. How far is it possible to design a single web site that can adapt its display depending on the characteristics of the reading device? Are there times when it’s simply necessary to maintain separate resources for phones vs. large-screen computers?

Designers of digital scholarship face even more demanding requirements. We know that we will replace our digital technologies, but it’s part of our responsibility to preserve and transmit the scholarly record we work with. Our predecessors have not always set an ideal example for us. The work of Hellenistic scholars of the Iliad like Aristarchus of Samothrace was originally composed for papyrus scrolls. By the time of our earliest complete manuscripts of the Iliad, in the tenth and eleventh centuries, the standard form of “publication” was the codex, or manuscript book. In a large codex, the wide margins offered invitingly convenient space to annotate the Iliadic text with selected notes from earlier scholars, as we see in the famous Venetus A manuscript.

As a consequence, virtually all ancient scholarship on the Iliad ceased to be copied as separate texts, and is today known to us only from the snippets preserved in these marginal notes, or scholia. The convenience of this early “hypertext” technology led directly to the loss of important scholarly work.

This illustrates a fundamental and somewhat paradoxical principle that should guide all our work on digital scholarship: it must be technology-agnostic. Well-designed digital work will be machine-actionable, but will also be capable of expressing its content when moved to other media, even non-digital media.

One area where we must apply this principle rigorously is in our citation practice. It is tempting to yield to the convenience of using a URL to refer to on-line work: after all, with a URL we can immediately see some kind of response in a web browser.

But this convenience is as dangerous as the medieval scribes’ use of the margins of manuscripts for scholia. URLs are addresses: they will change or vanish; more fundamentally, the web that they point to will ultimately vanish (and, on a time scale that looks back to Aristarchus of Samothrace and other scholars of the library at Alexandria, it will certainly vanish sooner rather than later).

I’ve worked over the past several years with colleagues at the Center for Hellenic Studies to develop a URN notation for citing texts. (Some formal documentation is beginning to appear here.) URNs offer a formally specified notation for referring to some kind of resource, without reference to any particular technology. One of my favorite examples is the ISBN, which can be expressed with URN syntax. Many computer applications work with ISBNs: sales clerks in book stores read them with bar-code scanners, and you can search Amazon or bookfinder.com by ISBN, for example. But until a few years ago, I routinely filled out request forms at my college bookstore by hand-writing ISBNs on a paper form, and they functioned perfectly well in that analog environment.

The Canonical Text Service URN (or CTS URN), like an ISBN, is a formally specified machine-parseable reference, but at the same time a simple text string that can be read by human beings and used outside of a digital environment. I have successfully disseminated URNs using chalk on blackboards, and pen on the back of a napkin. But since a CTS URN is also machine-actionable, it can be passed to a Canonical Text Service to retrieve cited passages of text. When our form of citation is not tied to a specific technology, we are free to imagine previously unforeseen re-uses of that material. Would it be handy if the printed copy of a book you want to carry with you were augmented with URNs represented as QR codes you could point your smart phone at to read a cited text? I don’t know, but it would not be difficult to implement. The QR code at the top of this blog entry represents the CTS URN

urn:cts:greekLit:tlg0012.tlg001:1.1

Here is a link passing the same URN to a Canonical Text Service.
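To make "machine-parseable" concrete, here is a minimal Python sketch that splits a CTS URN into its parts. It assumes only the shape visible in the example above: a `urn:cts:` prefix, a namespace, a dot-separated work identifier, and an optional dot-separated passage reference. The names `CtsUrn` and `parse_cts_urn` are hypothetical, not part of any CTS library, and the sketch does not attempt full validation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    """Components of a CTS URN (illustrative structure, not an official API)."""
    namespace: str
    work: Tuple[str, ...]      # e.g. ("tlg0012", "tlg001")
    passage: Optional[str]     # e.g. "1.1", or None if no passage is cited


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a colon-delimited CTS URN string into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work = tuple(parts[3].split("."))
    # The passage component is optional; treat an empty one as absent.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, work, passage)


iliad = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")
print(iliad.namespace, iliad.work, iliad.passage)
```

The same string that this function parses can still be written on a blackboard or a napkin; nothing in the notation depends on the parser.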

Saturday, February 4, 2012

Ancient Greek is broken

It is 2012, and it is not possible to edit an original document from archaic or classical Greece digitally.

The inscriptions recording the construction of the Parthenon cannot be edited digitally; the Athenian Tribute Lists reflecting the annual payments members of the Delian League made to Athens in the fifth century B.C.E. cannot be edited digitally; votive offerings to Apollo at Delphi, dipinti on classical Greek pottery, graffiti scratched by Greek mercenaries on the colossal statues at Abu Simbel in Egypt — none can be edited digitally.

We are prevented from fully and accurately editing archaic and classical Greek by inadequate or erroneous technical standards defining the representation of languages, writing systems and digital character encodings. Unlike Claude Rains’ famously pretended reaction in Casablanca, I am genuinely shocked that most of the standards keeping us from editing classical Greek have been adopted unmodified from recommendations by professional classicists. (Think about that the next time you want to evaluate the state of digital scholarship in the humanities.)

Each of these three shortcomings is worth discussing separately, so I plan to post more detailed comments on them individually, but here is a brief summary of the problem.

1. Language

A text must identify what languages its content represents. We do that with International Organization for Standardization (ISO) codes for languages. The registration authority for the ongoing work to develop a comprehensive set of three-letter codes for languages is SIL International.

While some language codes are organized in families (so that related dialects or languages can be recognized by software to process the contents appropriately), archaic and classical Greek are lumped under a single grc code. (This at least is an improvement on the previous ISO 639-2 list of codes, where Mycenaean Greek written in Linear B could not be distinguished from classical Greek!)

We tell students reading Plato that the text is in the Attic dialect, and would not ask them to consider interpretations that are only possible in other dialects. The string τό, for example, might be a form of the relative pronoun in Ionic Greek, but in Attic it can only be the definite article (“the”).

We should treat our software equally kindly, by explicitly encoding the dialectal variant of ancient Greek used in a text.

2. Writing system

If we are editing an ancient Greek document, we must identify the document’s writing system, since archaic and classical Greek city states used a variety of distinct alphabets. In 403 B.C.E., the Athenians voted to adopt as their official writing system the alphabet used in Ionia, replacing the Attic alphabet they had used up to that time. The language spoken in Athens did not change, but the writing system did.

The Ionian alphabet is the direct ancestor of the modern Greek alphabet. In this alphabet, the letter epsilon represents a short vowel that is contrasted with a long vowel represented by the letter eta. In the classic Attic alphabet, on the other hand, the two sounds that were distinct in the Ionian alphabet were represented by the single letter epsilon. A glyph essentially identical in appearance to the Ionic eta instead represented a consonant, pronounced like a modern English H (or like the “rough breathing” in modern writing of ancient Attic). Any reader (or any computer program) that tries to interpret a text written in the “old Attic” alphabet as though it were written in the modern, Ionic alphabet will fail spectacularly, even though the language is unchanged.
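A toy Python sketch of why software needs to know the writing system: the same glyph has to be read differently depending on the alphabet declared for the document. The alphabet labels `"ionic"` and `"old-attic"` and the function name `interpret` are hypothetical, chosen only for illustration; no current standard supplies such identifiers, which is exactly the problem.

```python
# Readings of two glyphs under two alphabets, as described in the text above.
# Keys use Greek capital epsilon (U+0395) and Greek capital eta (U+0397).
READINGS = {
    "ionic": {
        "\u0395": "short e vowel",
        "\u0397": "long e vowel (eta)",
    },
    "old-attic": {
        "\u0395": "e vowel, short or long",
        "\u0397": "consonant h",
    },
}


def interpret(glyph: str, alphabet: str) -> str:
    """Return the reading of a glyph under an explicitly declared alphabet."""
    return READINGS[alphabet][glyph]


for alphabet in ("ionic", "old-attic"):
    print(alphabet, "->", interpret("\u0397", alphabet))
```

Without the `alphabet` argument there is no correct answer to return, which is the situation of any program handed an "old Attic" inscription with no machine-readable statement of its writing system.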

ISO standard 15924 defines codes to identify the writing system of a text. The current version includes no way to distinguish archaic and classical Greek alphabets from the alphabet of modern printed texts.

3. Digital character set

Once we have identified the language and the writing system of our text, we have to record its contents. The Unicode Consortium defines the standard that is by far the most comprehensive and widely supported digital character set today.

Of the sections of the Unicode specification that I have looked at closely, few are as misconceived as the ancient Greek section. I’ll save a fuller catalog of its problems for a separate post, but for now I can briefly contrast it with one example of the clean design of the Arabic section of Unicode.

In Arabic, a single letter might have distinct forms when written separately, initially, medially or finally. A free-standing letter kaf ك looks quite different from the first letter of the word “book”

كتاب

for example. Software following the Unicode specification can represent all instances of kaf with the same code point: the different letter forms are treated as presentational variants depending on the position of the letter in relation to other letters.

Now use this tool to search the Unicode specification for the term “sigma”. We have two distinct upper-case sigmas, and no fewer than three lower-case sigmas, with a lunate form and a terminal sigma being given distinct code points.

While medial and terminal sigma are, like the different forms of Arabic kaf, contextually determined variant glyphs, lunate sigma is simply a font choice used by editors who do not wish to distinguish a final form of sigma from other forms (often because they are editing fragmentary texts like papyri where it might be difficult to decide where word breaks occur in a handful of isolated letters). In all cases, an editor should be able to encode a simple sigma, and searching or parsing of the digital text would work on any form of sigma, while publishers who preferred the papyrologists’ lunate form of the letter could use a font with that glyph for sigma; publishers preferring a text with the two traditional print forms could use a font with a variant form of terminal sigma.

Because of the false definition of lunate sigma as a distinct character, however, you now have to check manually for lunate forms of sigma versus other forms of sigma if you want to parse or search a text encoded in Unicode Greek. Do you want to do that? Do you want to rely on the authors of your software having to do that?
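Here is a minimal Python sketch of the extra step that software has to take today: folding every sigma variant onto the plain sigma of the same case before comparing strings. The names `fold_sigma` and `contains` are hypothetical helpers. The code points are the final sigma (U+03C2), the lunate sigma symbol (U+03F2) and the capital lunate sigma symbol (U+03F9), mapped onto small sigma (U+03C3) and capital sigma (U+03A3).

```python
# Translation table: each sigma variant maps to the plain sigma of its case.
SIGMA_FOLD = str.maketrans({
    "\u03c2": "\u03c3",  # final sigma          -> small sigma
    "\u03f2": "\u03c3",  # lunate sigma symbol  -> small sigma
    "\u03f9": "\u03a3",  # capital lunate sigma -> capital sigma
})


def fold_sigma(text: str) -> str:
    """Replace variant sigmas so that all forms compare as equal."""
    return text.translate(SIGMA_FOLD)


def contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores which form of sigma was typed."""
    return fold_sigma(needle) in fold_sigma(haystack)


print_form = "\u03bb\u03bf\u03b3\u03bf\u03c2"    # "logos" with a final sigma
papyrus_form = "\u03bb\u03bf\u03b3\u03bf\u03f2"  # "logos" with a lunate sigma
print(print_form == papyrus_form)                          # raw comparison
print(fold_sigma(print_form) == fold_sigma(papyrus_form))  # folded comparison
```

Every search, index and parser that touches Greek text has to remember to apply something like `fold_sigma` first, and a tool that forgets will silently miss matches.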

Solutions?

International standards processes are slow. While it’s reasonable for standards bodies like ISO to rely on the recommendations of professional organizations with expertise in a specific domain, in a field like classics this can be problematic. The American Philological Association is a professional organization often thought to represent the field of classics, but its role in recommendations to international standards bodies like the Unicode Consortium, and its complete absence from discussions like the ongoing revision of international language codes, suggest that, because of what I’ve called the recursive arithmetic of tenure, it institutionalizes conventional wisdom and obsolete assumptions, and helps sustain cargo-cult scholarship.

But in recent months we’ve seen example after example of traditional institutions that have been overtaken by motivated groups using the internet to organize. Can we form enough of an on-line community to move better standards through ISO and the Unicode Consortium, in alliance with or independently of existing professional groups?


Friday, February 3, 2012

Unplanned reuse


There’s really only one thing you can do with a book: read it. You can learn from it, cite it or feel that your life has been changed by it, but you can’t directly reuse it (well, apart from making it an accessory piece of furniture, but that doesn’t make use of the contents of the book). One of the distinctive differences of digital scholarship is that, if it is well designed, it can be used for purposes the original author may not have foreseen. The original author may even discover unintended reuse for digital work, as I did recently.

I had been working on an image service using a URN notation to retrieve and view images of the famous Archimedes Palimpsest. Using a URN like

urn:cite:hmt:chsimg.081v–088r_Arch03v_Sinar_pseudo_no-veil

the service lets you do things like

  • Retrieve a binary image at a given size. This is bifolio 81v–88r at 50 pixels wide.
  • Retrieve a region of interest. This extracts from the same image a region with a mathematical figure, the construction of Archimedes, Floating Bodies 1.proposition.1.
  • Open a pannable/zoomable version of the image in a web browser, either with or without a highlighted region of interest. Try these two links to the same bifolio illustrated in the static images above:
    1. with no highlighted region
    2. including highlighting of the mathematical figure

For a course I taught in English translation, I put together a text service, allowing you to retrieve passages of text by canonical reference. With a URN like this

urn:cts:greekLit:tlg0552.tlg008.chs03:1.proposition.1

the service lets you retrieve archival XML source for a passage. This request gets the XML source for Archimedes, Floating Bodies, postulate 1, which is not necessarily a thing of beauty to the casual reader of Archimedes. But it’s trivial to associate an XSLT stylesheet with the archival XML to format it for reading in a browser, so here is the same passage associated with a stylesheet for easy reading.
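A request like the one described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the base URL `https://example.org/cts` is a placeholder, not the address of the actual service, and the parameter names `request` and `urn` with the value `GetPassage` reflect my understanding of the Canonical Text Services protocol and should be checked against the service you are using. The function name `get_passage_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the address of a real text service.
BASE_URL = "https://example.org/cts"


def get_passage_url(urn: str, base_url: str = BASE_URL) -> str:
    """Build a URL asking a text service for the passage named by a CTS URN."""
    # Assumed parameter names; keep ':' unescaped so the URN stays readable.
    query = urlencode({"request": "GetPassage", "urn": urn}, safe=":")
    return f"{base_url}?{query}"


print(get_passage_url("urn:cts:greekLit:tlg0552.tlg008.chs03:1.proposition.1"))
```

Note that the URL is derived from the URN and never the other way around. If the service moves or is replaced, only `BASE_URL` changes, and every citation remains valid.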

At some point, the penny dropped, and I realized it would also be trivial to mash up the two services. When I started work on the image service, I had not imagined that the digital images of the Greek palimpsest would be of any interest to Greekless readers of Archimedes, but the mathematical figures in the manuscript are extremely important even if you’re reading Thomas Heath’s public-domain English translation.

A minor addition to the XSLT stylesheet uses the markup indicating the presence of canonically identified figures in Heath’s translation to embed references to the image service.

Try this view of book 1, proposition 1, where any reader (Greek scholar or not) now gets to follow the text in Heath’s translation together with images in the only surviving Greek manuscript of Floating Bodies. Images of regions are embedded in the text, and are linked to the zoomable view of the whole bifolio.