Monday, August 19, 2013

Milk and honey in Leipzig

I took part this month in the Leipzig "Open Philology Workshop" organized by Greg Crane.  While I was only able to take in some of the events in its ever-changing three-ring circus, I got a view onto the promised land.  Out of the many highlights of the workshop, here are three that are individually significant and that, taken together, will have enormous consequences for classicists.

1. A billion words of Greek

I worked with a large team planning to digitize a billion words of Greek.  Thanks in no small part to work by Bruce Robertson and Federico Boschetti improving OCR of polytonic Greek, we designed a detailed workflow automating many of the steps in moving from a physical volume in a library to an openly licensed, citable digital edition.
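To give a rough sense of the shape of that workflow, here is a minimal sketch in Python.  Every function in it is a hypothetical stand-in of my own, not the project's code: scanning, OCR tuned for polytonic Greek, human proofreading, TEI encoding, and CTS URN assignment would each wrap real tooling.

    # Illustrative only: every stage below is a placeholder for real tooling.
    def scan_pages(volume):
        return [f"{volume}, page {n}" for n in range(1, 4)]    # page images

    def ocr_polytonic(pages):
        return " ".join(f"[text of {p}]" for p in pages)       # OCR output

    def proofread(text):
        return text                                            # human correction pass

    def encode_tei(text):
        return f"<TEI><text><body><p>{text}</p></body></text></TEI>"

    def assign_citation(tei, urn_base):
        return {f"{urn_base}:1.1": tei}                        # citable passages

    edition = assign_citation(
        encode_tei(proofread(ocr_polytonic(scan_pages("a printed volume")))),
        urn_base="urn:cts:greekLit:tlg0000.tlg000",            # hypothetical URN base
    )
    print(edition)   # an openly licensed, citable unit, ready to publish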

We live in a very different world than just a few years ago.  When the costs of digitization were extremely high, both private interests (like publishers) and academic projects (even projects with the sponsorship of professional organizations and funding from national agencies) successfully persuaded individuals and libraries to give up their scholarly freedom (along with, of course, exorbitant licensing fees) for access to proprietary data banks of texts.  Without the same barriers of cost, we can now insist instead on digital corpora comprising the kinds of texts we should always have demanded: structured for scholarly citation, and licensed for scholarly reuse.  At this point, whether the Billion Words project literally achieves its goal of digitizing 10^9 words of Greek over the next five years is immaterial:  when the first digital edition comes out of that pipeline, we can begin to put behind us the historically brief but shameful aberration when we thought it was acceptable to trade away our freedom to read and share classical texts in exchange for more convenient access to ancient Greek for a privileged few.

2. Perseus lexical inventory and morphology services

Bridget Almas and Marie-Claire Beaulieu are extending the Perseus lexical inventory and morphological services to keep each in sync with the other as they are dynamically edited.

This is exceptionally important, and indeed urgent, precisely because of the Billion Words project.  Because the contents of its new digital editions can be automatically tested, we will be able to extend the lexicon when unattested material appears, and to improve the morphological analyzer when it fails to recognize valid forms.  The benefit runs in both directions: repeating the automated tests against the iteratively updated inventory and analyzer will allow the Billion Words project to state with unprecedented clarity what level of validation each work in its corpus has passed.
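A minimal sketch of that testing cycle might look like the following; analyze() and in_lexicon() are hypothetical stand-ins for the morphological analyzer and the lexical inventory lookup, not actual Perseus APIs.

    def validate(tokens, analyze, in_lexicon):
        """Test a corpus against the analyzer and lexicon; report what each misses."""
        unanalyzed = []   # surface forms the analyzer failed to recognize
        unattested = []   # lemmata the analyzer found but the lexicon lacks
        for token in tokens:
            lemmata = analyze(token)
            if not lemmata:
                unanalyzed.append(token)        # candidate analyzer improvement
                continue
            for lemma in lemmata:
                if not in_lexicon(lemma):
                    unattested.append(lemma)    # candidate lexicon addition
        coverage = 1 - len(unanalyzed) / max(len(tokens), 1)
        return {"coverage": coverage, "unanalyzed": unanalyzed, "unattested": unattested}

    # Toy run with stand-in data:
    report = validate(
        ["μῆνιν", "ἄειδε", "θεά"],
        analyze=lambda t: {"μῆνιν": ["μῆνις"], "ἄειδε": ["ἀείδω"]}.get(t, []),
        in_lexicon=lambda lemma: lemma in {"μῆνις"},
    )
    print(report)   # coverage ≈ 0.67; "θεά" unanalyzed; "ἀείδω" unattested

Each pass of a loop like this both feeds improvements back into the tools and leaves behind a quantifiable record of the validation the corpus has undergone.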

3. A text citation tool

I was caught completely by surprise by Hugh Cayless' work on a JavaScript tool letting users select arbitrary pieces of (or even points in) a TEI document displayed in a web browser.  While the CTS URN notation can easily express such arbitrary ranges of text, the challenge of building an interface to highlight spans that cross multiple XML element boundaries, and that may begin and end inside different elements so that the selection itself is not well-formed XML, is so great that I would have said it was impossible to implement practically for real, complex texts.
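To see why this matters, recall the kind of reference CTS URNs allow (the passage values below are my own illustrations, not drawn from Hugh's demo):

    urn:cts:greekLit:tlg0012.tlg001:1.1
    urn:cts:greekLit:tlg0012.tlg001:1.1-1.7
    urn:cts:greekLit:tlg0012.tlg001:1.1@μῆνιν[1]-1.2@οὐλομένην[1]

The first cites a whole line of the Iliad, the second a range of lines, and the third an arbitrary span that begins and ends in the middle of different citation units by attaching subreferences to the endpoints.  Selections of that last kind are exactly what a browser-based tool has to capture from a user's highlighting, even when the highlighted span is not itself well-formed XML.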

Characteristically, Hugh showed a working implementation that was visually appealing, very responsive, and worked flawlessly on exceptionally complex passages from Servius' commentary on the Aeneid.  So much for my scepticism.  Equally characteristically, while Hugh's initial use case was a very limited application, he recognized the generality of the problem he had solved, and plans to fork the citation tool as a separate project that can express selections as CTS URNs.  Chris Blackwell and I look forward to packaging Hugh's TEI Text Citation Tool along with Chris' Image Citation Tool as part of the standard suite of CITE services and utilities that we work with on the Homer Multitext project.

A whole greater than the sum of the parts

Bruce Robertson, Bridget Almas, and Hugh Cayless have long track records as three of the most talented contributors to the digital study of classics I have ever seen, so I suppose it is unsurprising that they would each, yet again, contribute something remarkable.  What was different in Leipzig in August 2013 was the synergy that their work illustrates.  The internet can facilitate many kinds of collaboration, but nothing can fully replicate what happens when people sit in the same room, talk over coffee or dinner, and have unscheduled opportunities to follow up easily in further face-to-face conversations.  While each of the three highlights I've chosen here deserves more discussion in future posts, consider their connections to each other: we can see the real beginnings of a vast digital corpus of Greek; the corpus is being automatically tested, and related to a citation-based inventory of Greek vocabulary and to a morphological analyzer that can relate surface forms in the texts to lexical entries in the inventory; and the moment a digital edition appears, a UI that runs in any web browser will let users cite any part of the corpus with technology-independent canonical citations.

Is there another discipline in the humanities that offers this kind of digital foundation in 2013?  Perhaps, but I am not familiar with anything rivaling what I saw happening in Leipzig.



