ACAD - A Cambridge Alumni Database
by J. L. Dawson
(presented at the conference ALLC/ACH 2000, University of Glasgow)

Introduction
This project began in 1987 as a means of publishing an updated version of these volumes of biographies of Cambridge alumni. Some 20,000 cards of addenda and corrigenda had been accumulated over the years. Archival research had unearthed much more detail, and many more names, for the period up to 1500, and in 1963 Emden published his two-volume Biographical Register. Together, these twelve volumes cover approximately 180,000 names, with some overlap. It goes without saying that all this information is of the utmost importance for historical research, covering as it does a large proportion of the religious, legal, administrative, medical, and royal appointments in Britain, the Empire, and the Colonies, as well as many other countries. A good deal of social history is also included, albeit patchily. However, all these publications have a great defect for research: there is no index. In the 1970s, a group at Oxford did some analysis work on Emden's Biographical Registers of both Oxford and Cambridge[1]. However, they were not intending to reproduce the entire text in the form of a database. They remarked: "Even with modern systems of interrogation it is not certain that all available information will be included in the computerized database, and most projects will call for some degree of abbreviation and formatting of the data." They also warned: "The correct interpretation of the entries in the register called for a knowledge not only of the internal workings of the university and of academic terminology ... but also of a wide range of other fields, notably royal and ecclesiastical administration."
We set about the creation of an on-line database to make all this information accessible. Other sources, such as the Tripos Lists (lists of degrees awarded), and College Registers (especially those of the women's colleges, which were ignored by Venn) have been included. Funding for the project has been supplied by the American Friends of Cambridge University, and by several colleges. "Tripos" is a word unique to Cambridge University. Originally it was the title of the Bachelor of Arts appointed to dispute with the candidates for degrees, and he was so called from the three-legged stool on which he sat. Later, the term became applied to the university's final honours examinations. For many years we were unable to find a simple and reliable way to put the data into machine-readable form. Venn's books are in small hand-set type, printed on thick rough paper, and are full of italics, all of which proved completely intractable to the OCR packages available until recently. By chance, just as we had found suitable technology to cope with Venn's printing, we discovered that Ancestry.com had already prepared machine-readable versions of most of the volumes of Part II, for genealogical research. They have recently put the remaining volumes of Venn into the computer. Emden's Biographical Register, the Tripos Lists, and the registers of the women's colleges have proved relatively easy to read using OCR and the services of a good proofreader.
This shows the total numbers of names in the sources analysed so far. At first sight, the high number of men taking the Tripos who are not immediately matched by entries in Venn ([T] only) is alarming (all of the Tripos entries should appear in Venn, as the Tripos List was one of the sources the Venns used). However, making a cursory manual inspection of the letter A mismatches reduces the number there from 189 to 21. The main problem is one of dates: Venn occasionally includes a man who was admitted after 1900, so the name mismatches around the 1903-1905 period are very suspect. Many of the other mismatches are accounted for by the following types of problem.
A typical entry from Emden looks like this (with references abbreviated):
and has the following structure:
My first attempts at analysis were written in Perl[2], a widely available string-handling language which allows complex regular expressions. (A regular expression is just a pattern, rather like a piece of algebra, which is used to match parts of the data and extract those parts which can vary.) The regular expressions needed to recognise large-scale structures such as these entries are so complex that they used too much memory in Perl, and the programs frequently failed. At Cambridge we have a locally-written programmable text editor called NE[3] which has good regular expression handling. It may seem a retrograde step to use a one-off local program like NE in preference to a widely used standard such as Perl, but in our case only the product (the tagged text) is useful; the process used to make the product is different for each text analysed, so the ephemeral nature of the programs is not significant. It was clear that some type of formal, structured, but readable output would be needed in the first instance. This could then be converted automatically as input to any required database package. SGML provides an adequate structure for these needs, and is widely used by publishers of machine-readable databases. Glancing at the pages of Venn's biographies gives a first impression that they are very regular, with keywords such as "Matric." and "School" clearly signalling well-structured phrases. However, this is only what the human eye and brain make of the material! When an attempt is made to parse these sentences automatically, all sorts of horrors arise. In order to find out exactly what constructions are used in the text, I made concordances. Part of the concordance for the word "grace" is shown below and illustrates that, although the basic structure is similar, there are a lot of very different cases to consider when parsing.
First attempts at analysis were very heuristic, but served to clarify the problems in my mind. Writing a DTD (Document Type Definition) for the SGML structure was then very helpful, as it forced me to take decisions about nesting of fields, etc. Initially, my regular expressions tried to match complete events, including place names and dates, but the programs ran out of time or store, and the regular expression processor found the structure too complex to analyse.
References were extracted and replaced by numbered keys, as the structure of references is not to be analysed. Dates are central to the analysis, as they are almost unambiguously recognizable, and occur in almost every event. The day / month-word / year or month-word / day / year structure is converted into the form year : month-number : day so that it is readily searchable and sortable later.
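The date conversion just described can be illustrated with a minimal Python sketch. This is not the NE program actually used for the project; the names `DMY`, `MDY` and `normalise_dates`, the list of month abbreviations, and the zero-padding of month and day are all assumptions made for the illustration.

```python
import re

# Month words and abbreviations (an assumed list; the sources may use others).
MONTH = (r"(?P<month>Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|"
         r"June?|July?|Aug(?:ust)?|Sept?(?:ember)?|Oct(?:ober)?|"
         r"Nov(?:ember)?|Dec(?:ember)?)\.?")
DAY = r"(?P<day>\d{1,2})"
YEAR = r"(?P<year>\d{4})"

MONTH_NUMBER = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# day / month-word / year, e.g. "12 Mar. 1584"
DMY = re.compile(rf"\b{DAY}\s+{MONTH}\s+{YEAR}\b")
# month-word / day / year, e.g. "Sept. 3, 1601"
MDY = re.compile(rf"\b{MONTH}\s+{DAY},?\s+{YEAR}\b")


def _to_sortable(match):
    """Rewrite one matched date as year:month-number:day."""
    month = MONTH_NUMBER[match.group("month")[:3].lower()]
    year = int(match.group("year"))
    day = int(match.group("day"))
    # Zero-padding (an assumption) makes plain string sorting chronological.
    return f"{year:04d}:{month:02d}:{day:02d}"


def normalise_dates(text):
    """Convert both date layouts in the text to year:month:day."""
    text = DMY.sub(_to_sortable, text)
    return MDY.sub(_to_sortable, text)
```

For example, `normalise_dates("adm. 12 Mar. 1584")` gives `adm. 1584:03:12`. Words that merely begin like a month name, such as "Marston", are left alone because the pattern must match the whole month word.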
No attempt is made to modernise dates, so 1584-5 may mean either the range of years 1584 to 1585, or part of the old-style year from 1 January to 24 March 1585. Venn himself says of dates: "In taking a date from an ordinary history of the popular kind, we often do not know what the author means. Has he simply copied some contemporary record (parish register, tombstone, etc.) or has he tacitly substituted the modern reckoning? Wherever we can determine which he has done we have substituted the double date in order to avoid confusion. Sometimes, however, this is not possible, and then we have to leave the exact date ambiguous." We also find what might be termed pseudo-dates, such as "till death", "forthwith", "MT" (standing for Michaelmas Term). These must be converted into a suitable numerical form (which in some cases involves looking ahead for a subsequent date). Once all dates have been tagged, they act as natural right-hand delimiters of many events, which greatly simplifies subsequent processing.
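Returning to double dates such as 1584-5: one way to keep them searchable without modernising them is to store both years and leave the interpretation open. The following Python sketch shows the idea only; the function name `expand_double_year` and the choice of a two-year tuple are assumptions, not a description of the project's actual coding.

```python
import re

# A four-digit year, optionally followed by a hyphen or en-dash and the
# digits that replace the end of that year: "1584-5", "1599-600", "1584".
DOUBLE_YEAR = re.compile(r"^(\d{4})(?:[-\u2013](\d{1,4}))?$")


def expand_double_year(token):
    """Return (first_year, last_year) without deciding what the pair means."""
    m = DOUBLE_YEAR.match(token)
    if m is None:
        raise ValueError(f"not a year token: {token!r}")
    first_text, suffix = m.group(1), m.group(2)
    first = int(first_text)
    if suffix is None:
        return (first, first)
    # Overwrite the trailing digits of the first year with the suffix.
    last = int(first_text[: 4 - len(suffix)] + suffix)
    if last < first:
        # The suffix has rolled over, e.g. "1599-00" means 1599 and 1600.
        last += 10 ** len(suffix)
    return (first, last)
```

Whether the pair denotes a range of two years or the ambiguous January to March part of an old-style year is deliberately not decided here, in keeping with the policy of not modernising dates.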
It was always obvious that cross-references between entries would be needed, to allow for constructions such as: "Banks, Henry. ... Brother of Ralph (1716)." or "Perhaps the same as the next." These will be coded by creating a unique identifier for each person, based on their name and perhaps the first recorded date, with an additional number to disambiguate similar names. It later became clear that cross-references within entries are also needed, to cope with things like: "V. of Walton ... notwithstanding Walton".
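The identifier scheme for cross-references between entries might be sketched as follows. The class `PersonIdAllocator` and the identifier layout (`SURNAME_FORENAME_YEAR_N`) are hypothetical; the text above fixes only the ingredients (name, perhaps the first recorded date, and a disambiguating number), not the format.

```python
import re
from collections import defaultdict


class PersonIdAllocator:
    """Hands out unique identifiers built from name and first recorded year."""

    def __init__(self):
        # How many people have been seen so far for each name/year base.
        self._seen = defaultdict(int)

    def allocate(self, surname, forename, first_year=None):
        # Strip spaces and punctuation so "St John" and "St. John" agree.
        parts = [re.sub(r"[^A-Za-z0-9]+", "", p).upper()
                 for p in (surname, forename)]
        base = "_".join(parts)
        if first_year is not None:
            base = f"{base}_{first_year}"
        # The running number disambiguates people with similar names.
        self._seen[base] += 1
        return f"{base}_{self._seen[base]}"
```

With this sketch, a first Henry Banks whose earliest date is 1716 would receive `BANKS_HENRY_1716_1`, and a second man of the same name and year `BANKS_HENRY_1716_2`. A phrase such as "Brother of Ralph (1716)" could then be resolved to an identifier once the matching entry has been found, a step which would still need checking by hand.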
In a construction such as: The other problem is the sheer idiosyncratic peculiarity of (mainly English) place names. Just as a sample, we have:
and to add to the confusion between religious titles and place names, we have:
and even
Thankfully, Emden has carefully distinguished between Durham the city and diocese, which is never abbreviated, and Durham the county, which is always abbreviated to "Dur.". Venn, however, uses the full word Durham for the city, the county, the diocese, and the school! Looking at the list of place names ending in "Park", we see that Sir Denis Park might not be happy to be treated as a place name, and those of you unfamiliar with the history of African exploration may like to speculate on the place "Mungo Park".
In an attempt to pre-tag place names in a similar way to dates, I adopt this algorithm:
Place Name Algorithm
A place name is one of
SHOW pp.17 & 18. Point out: <ER> Religious Event; <keye> event key; <IR> Internal event, religious; <cond> cross-reference to earlier event key; <EG> Event, topic "grace" or dispensation; Walton not written out in full (needs manual linkage); [LIN] counties inserted in square brackets where not given in the text (more than half of all place names have no county: our Bartholomew's Gazetteer is well-thumbed, and our Swiss assistant's knowledge of British Empire geography has greatly enlarged!); <cond>this provision [p.18]: again, manual linkage back to earlier entry (key 60).
Show Fig.1. Point out: (Only about 30% of entries have an age mentioned.) Relatively few entering under the age of 17. Older entrants tended to be men already in holy orders.
Show Fig.2. Point out: Trinity and St John's were by far the largest colleges. (Trinity is still the largest.) General increase in the size of the colleges reflects the increase in size of the whole university.
Omit Fig.3.
Show Fig.4. Point out: huge increases in all of these smallest colleges from 1880. Huge influx into Queens' in 1820s. Almost no admissions to King's (1750s); Queens' (1870s); Sidney Sussex (1890s). We don't yet know why [remember that these results are based on surnames A-C only].
Show Fig.5. Most Cambridge colleges owned the "livings" of several parishes, and were therefore entitled to appoint their own nominees to religious appointments there. We thought it would be interesting to trace the counties in which Cambridge influence was most significant. Warning: these results should really be normalized by the total number of parishes per county.
Show Fig.6. Now we turn to the later Tripos results. These are interesting because they show the beginning of the rise of proper teaching and examinations in science. (The earliest Triposes were in Mathematics and Classics only.) Magdalene's concentration on Classics; Trinity Hall's dominance in Law; Queens' in Mathematics.
Show Fig.7. The Mathematics Tripos figures for the first hundred years are interesting, mainly because of the incredibly low take-up of the Tripos by some of the colleges. (Until 1751 King's had the privilege of exemption from university examinations, which in fact prevented its scholars from competing in the Tripos; and Trinity Hall was the lawyers' college.)
Show Fig.8 (vertically). The total number of Tripos exams taken in the last 60 years of our figures shows a distribution different from that of the sizes of the colleges. King's (not then a large college, and very late starting to take the Tripos) figures relatively highly.
Conclusion
We have made a start on processing the lives of some 150,000 people, spanning almost seven centuries. I hope that I have been able to demonstrate some small part of the information which will become available. Then it's up to you to pursue your own particular interest, using the tools which we have prepared. But don't hold your breath! There is a vast amount to do before we finish this project and publish the database. Linkage of names, identification of places, and coercing awkward narratives into a structured form will require a lot of human intervention, as well as a lot of programming skill. The result will be an incredibly valuable tool for historical research, which will serve many future generations of scholars.
Contact: LLCC@ucs.cam.ac.uk