Friends, Romans, countrymen: why haven’t we
built a text-mining tool to facilitate one of our favourite occupations within
the field, namely the spotting of intertext?
Most of the time, I imagine most of us
establish intertext by dint of knowing the one (or two, or three) author(s) of
our specialisms extremely well – well enough that certain phrases may trigger
our recognition so that we connect the two. Sometimes the idea of intertext has
to be discarded if no meaningful engagement can be argued; sometimes we craft
entire articles around what the allusion by one author to the words of another
can mean. I’m not saying there is anything wrong with this. What I am wondering
is whether we are missing opportunities.
The current approach produces results which,
though valid in the individual instances they establish, cannot say
anything about the extent of borrowing and engagement between two authors more
widely. We may recognize, if we know our Plautus well, a Ciceronian borrowing
of him in a particular speech that we’re working on, but there it ends. Is it
the only speech in which he engages with Plautus? Does he do this frequently?
Does he draw on some passages or works more than others? Does he borrow from Plautus,
and within that from particular works, more than he borrows from other authors
who predate him? The questions are much more difficult to answer without a
deep and long immersion in both texts, with no guarantee of further results.
(I’m not saying this would be ‘wasted’ time in the absence of results, but
within the current realities of academic knowledge production and career
assessment, time must to a certain
extent ‘pay off’.)
People have written theses around
constructing these sorts of relationships on the basis of individual occurrences,
and may even in recent years have started feeding their authors of choice into
text parsing programmes to help with the work, but the results are still
limited to a pre-selected set of authors which we either know well or suspect
of such intertextual engagement, and therefore informed by our bias and
expectations from the beginning.
So why can’t we build a database tool that
makes connections across all of Latin literature, and then select from the
results what strikes us as interesting? Granted, it would throw up a lot that
is misleading or of no use on account of being meaningless -- fixed expressions
being one of the obvious pitfalls (if 80% of authors from 3rd
century BC Plautus to 4th century AD Ausonius used the phrase
‘getting up at the crack of dawn’ we can probably safely discard the
possibility of intertext between them all). But by widening the net I suspect
we would notice a lot that no one has before, simply because not many of us are
sufficiently closely conversant with more than a handful of authors or texts in
the course of our lifetimes. We would also eliminate the possibility that we’d
just ‘missed’ stuff, even within our own field of expertise.
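The 'fixed expression' filter described above could be sketched quite simply: count how many authors in the corpus use each n-word phrase, and discard any phrase shared by more than some threshold of them, on the assumption that it is a stock expression rather than an allusion. A minimal sketch in Python, with an invented toy corpus and an illustrative threshold (both pure assumptions for demonstration):

```python
def ngrams(words, n=3):
    """All consecutive n-word sequences in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def stock_phrases(corpus, n=3, threshold=0.8):
    """Return n-grams used by at least `threshold` of the authors."""
    counts = {}
    for words in corpus.values():
        for gram in ngrams(words, n):
            counts[gram] = counts.get(gram, 0) + 1
    cutoff = threshold * len(corpus)
    return {gram for gram, c in counts.items() if c >= cutoff}

# Toy corpus, author -> tokenised text (invented for illustration only).
corpus = {
    "Plautus":  "prima luce surgere solebat miles".split(),
    "Cicero":   "prima luce surgere constitui iudices".split(),
    "Tacitus":  "prima luce surgere iussit exercitum".split(),
    "Ausonius": "carmina nocte scribere malui".split(),
}
common = stock_phrases(corpus, n=3, threshold=0.75)
# 'prima luce surgere' appears in 3 of the 4 authors, so it is flagged
# as a stock phrase and can be excluded from intertext candidates.
```

A real version would need far more care (lemmatisation, word order, phrase length), but the principle is just document frequency across authors.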
What I would greatly enjoy using, should it
ever exist, would be an online data-churning resource for the entire body of
Latin classical texts (yes, the chronological parameters would require some
thinking) which compares syntax and vocab across these texts and flags up
passages of similarity, possibly with a percentage indicator judging how great
the overlap is. There are already programmes designed to flag up plagiarism, such
as CrossCheck/iThenticate,
which do similar things. I never dealt with them much during my time in publishing,
nor with Turnitin in
my academic teaching so far, so sadly I don’t know much about the detail.
Of course these programmes operate slightly
differently: you put in a chunk of text, say a submitted journal article or a
student’s work, and the programme compares this chunk to the whole body of
articles in its database. But even if we simply copied this format, this would
already be extremely useful for classical scholars who, as I said above, still
start from a point where they pre-select the material they’d like to work with.
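The 'paste in a chunk, compare it to the whole database' format could be sketched as follows: score each text in the corpus by what share of the passage's word pairs it contains, and rank the results. The database entries here are two genuine opening phrases (Caesar, *BG* 1.1 and Tacitus, *Ann.* 1.61) standing in for a full corpus; the scoring method (bigram overlap) is one simple assumption among many possible ones:

```python
def bigrams(text):
    """The set of consecutive word pairs in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + 2]) for i in range(len(words) - 1)}

def similarity(passage, text):
    """Percentage of the passage's bigrams that also occur in `text`."""
    p = bigrams(passage)
    if not p:
        return 0.0
    return 100.0 * len(p & bigrams(text)) / len(p)

# Stand-in database: title -> text (two real phrases, for illustration).
database = {
    "Caesar, BG 1.1":    "gallia est omnis divisa in partes tres",
    "Tacitus, Ann 1.61": "medio campi albentia ossa ut fugerant",
}

query = "gallia est omnis divisa"
ranked = sorted(database, key=lambda k: similarity(query, database[k]),
                reverse=True)
# The Caesar entry ranks first: every bigram of the query occurs in it.
```

The percentage indicator mentioned above falls out of this for free; the hard part, as with the plagiarism checkers, is making the scoring robust to inflection and word order.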
After all, I know of no projects centered
around questions as broad as ‘let’s see what comes up if we compare everything ever’. I suspect the
processing power required to do this, depending on the database size, would be
too much, and building an intelligible and user-friendly interface for viewing the
results would be impossible. But ‘let’s see what comes up if I copy and paste
Tacitus’ description of the battlefield of the Varian disaster in the
Teutoburgerwald’ would already produce wider-reaching results than the human
mind is capable of and I, for one, would be extremely interested in seeing
them. The machine would not draw your conclusions for you, but you could get
straight to the interpreting without having to slog through the data in a
human, and thus flawed, manner.
(Amusingly, about a year ago I was at my
funder the AHRC’s annual skills training conference,
about which I blogged here.
I remember having a conversation with an artist from Falmouth named Dane Watkins at the time, who
had an interest in such things and told me that as I was already doing some of
this stuff anyway (as indeed I was, in the third chapter of my thesis, which I
actually wrote first and must have told him about at the time) I should do it
more systematically. I wish I remembered more of the conversation, but it
clearly didn’t prompt me into action or even serious consideration at the
time.)
So why doesn’t this kind of resource exist
yet? I’ve had a long old think about this, and I think it’s possible to split the
reasons into the two broad categories of ‘technical’ and ‘emotional’.
Classicists don’t tend to be techies. As a
species we are not educated to become masters of digital skills, but rather
trained as linguists, historians, translators, and critical thinkers, in a way which has
not so far required any (or not much) external mediation between us and our
source material. With the advent of Digital Humanities thankfully many people
have woken up to the possibilities and importance of IT to the field, with
lovely results such as my personal favourites ORBIS or PHI Latin Texts. Though pretty low-grade
in its looks (I don’t know about the back-end), I am also rather fond of the
texts on Perseus,
as well as its dictionary
and word study tools. The
most exciting one of all that I’ve come across is Diogenes,
made by classics scholar and digital humanist Dr P. J. Heslin at the
University of Durham, which draws on various other databases of classical texts
to allow for rather complicated searching, such as (in Dr Heslin’s own example
from this
talk) all possible declensions of the word ‘Caesar’ in the works of Cicero,
or only certain declensions of the word ‘Athena’ in texts marked in the
database as belonging to the genre ‘epic’. What I can’t ascertain at the
moment, because I don’t yet understand how to use the thing, is whether it is
set up to do what I have outlined above, even though that seems to me
only a moderate expansion of what it already does.
So why haven’t
we gone for full-on text-mining, given that some of us have
the technical skills, and those of us who do not have clearly found suitable partners
to help them build things? This brings me to the ‘emotional’ reasons.
I’ve already said we do intertext all the
time, and we like making connections. Is there, however, some unspoken or even
unconscious feeling that these things should be spotted through hard graft
rather than automated comparison? Do we feel it's somehow at odds with the
literary nature of these texts? Does it demean their art to involve a level of
automation in our engagement with them? Do we fear it may undermine past research
on intertextual connections which have been argued to be unique but may in fact
occur elsewhere, in authors one doesn't happen to be familiar with or
interested in? Do we think it would throw up too much that is irrelevant so
that it would be too labour-intensive to use? Have others had this idea (I
struggle to imagine they haven’t) but haven't had the time, money, inclination,
technical skills, or the right contacts to pursue it? Is someone somewhere already
working on this in silence?
Or is it because we like our current method
of selecting the direction of our research before starting it, as opposed to
seeing what the data throws up and then selecting what we'd like to get our
teeth into? Would narrowing down what to follow up on be too difficult? An admittedly
quick google (but, in fact, on DuckDuckGo,
because they don’t track your searches) on text mining in the humanities threw
up only these two examples, on text mining ancient Chinese
medical literature and classical
music scores.
Why are we not doing this? (If we are,
please tell me.)
IT savvy friends have assured me that it
wouldn’t be very difficult at all to build such a data-churning tool. Much of
our raw data is already out there in digital form (Perseus has most extant
Latin literary texts uploaded, as does PHI, as does even a bare-bones,
quick-reference resource such as The
Latin Library), and it’s not as if we have to worry about copyright.
I imagine the fact that Latin is a heavily inflected
language could be a bit of a problem, but there must be ways round that. Texts
on Perseus, for example, are coded so that if you click on any word in the
Latin text it will take you through to a list identifying the word’s morphology.
Which means the back end of the resource recognizes word stems and dictionary
entry forms as well as their possible modifications when declined or conjugated.
It would have to be explored whether building a supra-engine for text mining
which drew on the databases of these already existing resources is possible,
and whether the institutions which host them would give
permission, or under what conditions/for what remuneration they would do so.
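The way round the inflection problem hinted at above is lemmatisation: normalise every word to its dictionary headword before comparing texts, so that ‘Caesarem’ and ‘Caesaris’ both count as matches for ‘Caesar’. A real tool would draw on full morphological data of the kind behind Perseus; the tiny hand-written lemma table below is purely illustrative:

```python
# Toy lemma table: inflected form -> dictionary headword (illustrative only;
# a real system would use a complete morphological database).
LEMMAS = {
    "caesar": "caesar", "caesaris": "caesar", "caesari": "caesar",
    "caesarem": "caesar", "caesare": "caesar",
    "bellum": "bellum", "belli": "bellum", "bello": "bellum",
}

def lemmatise(text):
    """Map each word to its headword, leaving unknown words unchanged."""
    return [LEMMAS.get(w, w) for w in text.lower().split()]

a = lemmatise("Caesarem belli")
b = lemmatise("Caesaris bello")
# Both normalise to ['caesar', 'bellum'], so the two differently inflected
# phrases now register as a match.
```

With this normalisation step in front of it, any of the comparison machinery sketched earlier would see through case endings and conjugations, which is exactly what a naive string match cannot do.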
The Diogenes ‘Thanks’
page refers to Perseus’ Creative Commons licencing which has allowed Dr Heslin
to draw on their database, so again I can’t imagine it would be very hard.
If sufficient collaborations could not be
established, having to duplicate the data entry work would be a disadvantage in
terms of the time and money required to set it up. But at the same time a
resource not reliant on the others’ continued hosting of compatible (!)
databases into the future would have the advantage of complete control over
both its future and its design. A newbuild could make its coding open source and
its licencing format CC-BY-NC,
allowing others to borrow (for example for other fields of literature), but not
for commercial purposes.
Techie people with a classics background
must be hard to find, so there would still be a large and important role for
classical ‘editors’ to test successive developments of the resource in order to
help refine the rules which determine results so that the output would be as
accurate and relevant as possible. Presumably, they’d also need to have people
on hand to explain to them, never mind to the programme, how Latin actually
works.
Is this the stuff that postdocs are made
of?
I need to do more research and then have a
long hard think. But really, I need to get on and write the fourth chapter of
my thesis.
(**The answer to the question of how I came
to have these thoughts is longer and less interesting than these loose
thoughts, but briefly: a Tacitean passage I’m working on struck me as very
Caesarian in ‘feel’. Leaving aside the difficulty of establishing which
criteria to adopt in order to potentially verify this, I also realized that I
didn’t have the time to get to know Caesar as well as I do Tacitus during the
scope of my current project, and this led me to think of the idea of the
digital tool I am ruminating on in this blog post.**)