ACAD - A Cambridge Alumni Database

by J. L. Dawson

(presented at the conference ALLC/ACH 2000, University of Glasgow)


Venn: Alumni Cantabrigienses, a biographical list of all known students, graduates and holders of office at the University of Cambridge, from the earliest times to 1751, 4 vols (1922–27).

Venn: Alumni Cantabrigienses, . . . 1752–1900, 6 vols (1940–54).

Emden: A Biographical Register of the University of Cambridge to 1500 (1963).

This project began in 1987 as a means of publishing an updated version of these volumes of biographies of Cambridge alumni. Some 20,000 cards of addenda and corrigenda had been accumulated over the years. Archival research had unearthed much more detail, and many more names, for the period up to 1500, and in 1963 Emden published his two-volume Biographical Register. Together, these twelve volumes cover approximately 180,000 names, with some overlap.

It goes without saying that all this information is of the utmost importance for historical research, covering as it does a large proportion of the religious, legal, administrative, medical, and royal appointments in Britain, the Empire, and the Colonies, as well as many other countries. A good deal of social history is also included, albeit patchily. However, all these publications have a great defect for research: there is no index.

In the 1970s, a group at Oxford did some analysis work on Emden's Biographical Registers of both Oxford and Cambridge[1]. However, they were not intending to reproduce the entire text in the form of a database. They remarked: "Even with modern systems of interrogation it is not certain that all available information will be included in the computerized database, and most projects will call for some degree of abbreviation and formatting of the data."

They also warned: "The correct interpretation of the entries in the register called for a knowledge not only of the internal workings of the university and of academic terminology ... but also of a wide range of other fields, notably royal and ecclesiastical administration."

                                                                         Men  Women
Emden (1198–1500)                                                       7594
Venn II, vols 1–4 (1752–1900): A–O                                     45629
Tripos Lists (men 1748–1910, women 1873–1910)                          20757   1657

Additional material only recently available (approx. figures)
Girton & Newnham Admissions Registers [women's colleges] (1869–1900)           2000
Venn I, vols 1–4 (1261–1751)                                           76000
Venn II, vols 5–6 (1752–1900): P–Z                                     14371
totals                                                                 90371   2000

We set about the creation of an on-line database to make all this information accessible. Other sources, such as the Tripos Lists (lists of degrees awarded), and College Registers (especially those of the women's colleges, which were ignored by Venn) have been included. Funding for the project has been supplied by the American Friends of Cambridge University, and by several colleges.

"Tripos" is a word unique to Cambridge University. Originally it was the title of the Bachelor of Arts appointed to dispute with the candidates for degrees, and he was so called from the three-legged stool on which he sat. Later, the term became applied to the university's final honours examinations.

For many years we were unable to find a simple and reliable way to put the data into machine-readable form. Venn's books are in small hand-set type, printed on thick rough paper, and are full of italics, all of which proved completely intractable to the OCR packages available until recently.

By chance, just as we had found suitable technology to cope with Venn's printing, we discovered that had already prepared machine-readable versions of most of the volumes of Part II, for genealogical research. They have recently put the remaining volumes of Venn into the computer.

Emden's Biographical Register, the Tripos Lists, and the registers of the women's colleges have proved relatively easy to read using OCR and the services of a good proofreader.

Number of Names in Sources Analysed So Far
(men with surnames A–O only)

Venn II, vols 1–4 (1752–1900): A–O [V]         45629
Tripos Lists (1748–1900): A–O [T]               7594
total [V] + [T]                                59557

Names in [T] only (of which 13 ambiguous)       1280
Names in [V] only (of which 1267 ambiguous)    31289
Names once in [T] and once in [V]              21706
Ambiguous names in both [T] and [V]             5249
Wrongly entered                                   33
total                                          59557

This shows the total numbers of names in the sources analysed so far. At first sight, the high number of men taking the Tripos who are not immediately matched by entries in Venn ([T] only) is alarming (all of the Tripos entries should appear in Venn, as the Tripos List was one of the sources the Venns used).

However, making a cursory manual inspection of the letter A mismatches reduces the number there from 189 to 21. The main problem is one of dates: Venn occasionally includes a man who was admitted after 1900, so the name mismatches around the 1903–1905 period are very suspect.

Many of the other mismatches are accounted for by the following types of problem.

Variant spellings of surnames or first names
    Abbott, James Raymond de Montmorency       Abbott, James Reymond De Montmorency
    Abbot, William                             Abbot or Abbott, William
    Adams, Francis                             Adams, Frank
    Airey, John Alfred Lumb                    Airey, John Alfred [Lumb]
    Archer, Charles Goodwin (Goodwyn)          Archer, Charles Goodwyn
    Atkinson, Edward Dupré                     Atkinson, Edward Dupre
Sometimes a name has been misread or misquoted
    Ames, Edward Cecil                         Ames, Edwin Cecil
    Arnold, William Laughton                   Arnold, William Langton
Styles and titles of the nobility are a problem
    Alford, Viscount                           (but what was his name?)
    Allsopp, Samuel Charles (Hindlip, Lord)    Allsopp, Samuel Charles
Arabic and Indian names seem to cause trouble
    Aftab Ahmad Khan                           Ahmád Khán, Sahibzádá Aftab
    Ahmed, Nizam-uddin                         Ahmed or Ahmad, Nizam-Uddin
Sometimes the surname is just changed
    Ackers [formerly Coops], James             Ackers, James
    Jobson [post Archbold], William Arthur     Archbold, William Arthur Jobson
Sometimes the old and new are combined
    Andrewes [post Uthwatt], Henry             Andrewes-Uthwatt, Henry
Sometimes a double-barreled surname is hyphenated
    Austen Leigh, Richard Arthur               Austen-Leigh, Richard Arthur
Sometimes parts are hyphenated
    Armytage, Joseph North Green               Green-Armytage, Joseph North Green
And a warning for anyone searching for Hodgson or Atkinson ancestors!
    Hodgson (post Archer-Hind), Richard Dacre  Archer-Hind, Richard Dacre
    Atkinson, Johnson (Busfield, J.)           Busfield or Busfeild, Johnson Atkinson
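
Some of these mismatches are purely mechanical (accents, hyphens, bracketed annotations, capitalisation) and could in principle be removed before names are compared. The following Python sketch illustrates the idea; it is not the procedure used in the project, and the function name `normalise_name` and its rules are assumptions made for this example. It deliberately does nothing about genuine variants such as Francis and Frank, which still need human judgement.

```python
import re
import unicodedata

def normalise_name(name: str) -> str:
    """Reduce a name to a rough comparison key (hypothetical helper)."""
    # Decompose accented letters and drop the accents: "Dupré" -> "Dupre".
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove annotations in square or round brackets, e.g. "[formerly Coops]".
    text = re.sub(r"\s*(?:\[[^\]]*\]|\([^)]*\))", "", text)
    # Treat hyphens as spaces, ignore case, and collapse runs of whitespace.
    text = text.replace("-", " ").lower()
    return " ".join(text.split())

pairs = [
    ("Atkinson, Edward Dupré", "Atkinson, Edward Dupre"),
    ("Austen Leigh, Richard Arthur", "Austen-Leigh, Richard Arthur"),
    ("Ackers [formerly Coops], James", "Ackers, James"),
    ("Adams, Francis", "Adams, Frank"),
]
for a, b in pairs:
    # Prints True when the two spellings reduce to the same key.
    print(normalise_name(a) == normalise_name(b), "|", a, "|", b)
```

Discarding bracketed text is a blunt instrument: it loses the information in "[Lumb]" or "(Goodwyn)", so a key of this kind is only suitable for proposing candidate matches, not for confirming them.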

A typical entry from Emden looks like this (with references abbreviated):

Dawson, John (Dauson).*
Entered in C.L. ET 1484;
grace that study for 6 yr in C. and Cn.L. suffice for entry in Cn.L. gr. 1488-9;
Inc. C.L., adm. June 1490 [Ref1];
R. of Debden, Essex, clk, adm. 17 May 1484; till death [Ref2].
Died 1492.
Will dated 10 Aug. 1492; proved 12 Feb. 1493 [Ref3].
Requested burial in S. Michaels, Cambridge.
    *             =  this man also appears in Venn, part I
    C.L.          =  Civil Law
    ET            =  Easter Term (i.e. summer term)
    grace         =  dispensation
    Cn.L.         =  Canon Law
    C. and Cn.L.  =  Civil and Canon Law
    gr.           =  granted
    Inc.          =  Incepted (took degree)
    D.C.L.        =  Doctor of Civil Law
    R.            =  Rector
    clk           =  clerk (i.e. in holy orders)

and has the following structure:

    heading             Dawson, John . . .
    event 1             Entered in C.L. . . .
    event 2             grace that study . . .
    . . .
where each event in general comprises:
    topic               e.g. academic
    type                e.g. entered
    place (sometimes)   Debden, Essex
    date(s)             adm. 17 May 1484
    reference(s)        e.g. [Ref2]

My first attempts at analysis were written in Perl[2], a widely available string-handling language which allows complex regular expressions. (A regular expression is just a pattern – like a piece of algebra – which is used to match parts of the data and extract those parts which can vary.)
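
To make the idea concrete, here is a small pattern of this kind written in Python (chosen only for illustration; it is neither the Perl nor the NE code used in the project). It matches the first event of the Emden entry shown earlier and extracts the parts that vary. The group names `subject`, `term` and `year` are hypothetical labels invented for this sketch.

```python
import re

# Pattern for events of the form "Entered in C.L. ET 1484".
# The group names are illustrative only, not the project's tag names.
ENTERED = re.compile(
    r"Entered in (?P<subject>[A-Z][A-Za-z.]*)"  # faculty abbreviation, e.g. C.L.
    r"(?: (?P<term>[A-Z]T))?"                    # optional term, e.g. ET or MT
    r" (?P<year>\d{4})"                          # four-digit year
)

def parse_entered(event: str):
    """Return the variable parts of an 'Entered in ...' event, or None."""
    m = ENTERED.search(event)
    return m.groupdict() if m else None

print(parse_entered("Entered in C.L. ET 1484;"))
print(parse_entered("Died 1492."))
```

The fixed words ("Entered in") anchor the match, while the named groups pick out the faculty, the term and the year. A pattern for one event type is manageable; the difficulty described in this paper arises when many such patterns are combined to describe a whole entry.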

The regular expressions needed to recognise large-scale structures such as these entries proved so complex that they used too much memory in Perl, and the programs frequently failed.

At Cambridge we have a locally-written programmable text editor called NE[3] which has good regular expression handling. It may seem a retrograde step to use a one-off local program like NE in preference to a widely used standard such as Perl, but in our case only the product (the tagged text) is useful; the process used to make the product is different for each text analysed, so the ephemeral nature of the programs is not significant.

It was clear that some type of formal, structured, but readable output would be needed in the first instance. This could then be converted automatically as input to any required database package. SGML provides an adequate structure for these needs, and is widely used by publishers of machine-readable databases.

Glancing at the pages of Venn's biographies gives a first impression that they are very regular, with keywords such as "Matric." and "School" clearly signalling well-structured phrases. However, this is only what the human eye and brain make of the material! When an attempt is made to parse these sentences automatically, all sorts of horrors arise. In order to find out exactly what constructions are used in the text, I made concordances. Part of the concordance for the word "grace" is shown below and illustrates that, although the basic structure is similar, there are a lot of very different cases to consider when parsing.

Concordance of phrases "grace . . . gr." from Emden
(i.e. all the following lines are preceded by the word "grace" and followed by "gr." meaning "granted")
about his proceeding B.Th.
ad eundem
allowing him 3 terms towards responding to the question
at Cambridge to enter in C.L.
concerning entering the Sentences
concerning entry in Cn.L.
concerning his form for entering the Sentences
concerning his commencement in Th.
concerning inception
concerning proceeding B.Th.
concerning qualification for inception in Arts
concerning qualifications for proceeding D.Th.
concerning lectura posteriorum
exempting him from lecturing on the Sentences after entry
exempting him from taking part in processions and other public ceremonies, as at Oxford
for admission in Th.
for admission to incept in Th.
for entering the Sentences
for entry in Cn.L.
for entry in Mus.
for exemption on proceeding D.Th.
for exemptions in proceeding D.Th.
for incepting D.Cn.L.
for inception as D.Th.
for inception
for inception in Th.
for incorpn as M.D.
for incorpn
in respect of inception in Arts
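
A concordance of this kind can be produced with very little code. The sketch below, in Python, is an illustration rather than the program actually used: it collects whatever stands between "grace" and "gr." within a single event and counts each distinct phrase. The names `GRACE` and `grace_concordance` are assumptions, and the sample text reuses phrases from the list above with invented dates.

```python
import re
from collections import Counter

# Capture the words between "grace" and the closing "gr." (granted).
# The match is non-greedy and is kept inside one event by excluding ';'.
GRACE = re.compile(r"\bgrace\s+([^;]*?)\s+gr\.")

def grace_concordance(text: str) -> list[tuple[str, int]]:
    """Return (phrase, count) pairs sorted alphabetically by phrase."""
    counts = Counter(m.group(1) for m in GRACE.finditer(text))
    return sorted(counts.items())

# Sample events; the years are made up for the purpose of the example.
sample = (
    "grace that study for 6 yr in C. and Cn.L. suffice for entry in Cn.L. gr. 1488-9; "
    "grace for inception in Th. gr. 1470; "
    "grace ad eundem gr. 1501; "
    "grace for inception in Th. gr. 1492."
)
for phrase, n in grace_concordance(sample):
    print(f"{n:3d}  {phrase}")
```

Sorting the phrases alphabetically, as in the list above, brings near-identical constructions together, which is what makes the range of cases visible at a glance.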

First attempts at analysis were very heuristic, but served to clarify the problems in my mind. Writing a DTD (Document Type Definition) for the SGML structure was then very helpful, as it forced me to take decisions about nesting of fields, etc.

Initially, my regular expressions tried to match complete events, including place names and dates, but the programs ran out of time or store, and the regular expression processor found the structure too complex to analyse.

Explanation of the DTD

document          comprises   one or more          entry
entry             comprises                        name
                              unique               personal key
name              comprises                        surname
                              possible             Venn marker
alias             comprises   one or more          surname(s)
cross-reference   comprises   either one or more   surname(s)
                              or                   personal key
event             comprises   one of the types
                                  academic   donation   religious
                                  family     grant      warning
                                  job        legal      land
                                  unknown    additional
                  in general, events contain some or all of
                                                   event key
                              either               college
                              or                   institution
                              or                   school
                              or                   religious inst.
                              or                   place
                              and usually          date(s)
religious inst.   comprises   possible             church
place             comprises   possible             city
date              comprises   possible             action

and up to three dates may appear with each event (e.g. appointed . . .; still in . . .; resigned . . .)
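
For readers more at home with programming languages than with SGML, the same structure can be pictured as a set of record types. The Python sketch below is only an analogy: the class and field names (`Entry`, `Event`, `Date`, `kind`, `location` and so on) are invented for this illustration, the attachment of events to an entry is an assumption, and none of it reproduces the actual DTD.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The event types listed in the explanation of the DTD.
EVENT_TYPES = {
    "academic", "donation", "religious", "family", "grant", "warning",
    "job", "legal", "land", "unknown", "additional",
}

@dataclass
class Date:
    value: str                     # normalised form, e.g. "1490:06"
    action: Optional[str] = None   # e.g. "adm."

@dataclass
class Event:
    kind: str                      # one of EVENT_TYPES
    key: str                       # event key (format assumed)
    location: Optional[str] = None # college, institution, school, etc.
    dates: List[Date] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.kind}")
        if len(self.dates) > 3:
            raise ValueError("an event carries at most three dates")

@dataclass
class Entry:
    key: str                       # unique personal key (format assumed)
    surname: str
    venn_marker: bool = False
    aliases: List[str] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)

entry = Entry(key="dawson.john.1484.1", surname="Dawson", venn_marker=True)
entry.aliases.append("Dauson")
entry.events.append(Event(kind="academic", key="e3", dates=[Date("1490:06", "adm.")]))
print(entry)
```

Writing the constraints down in this way has the same effect as writing the DTD: it forces decisions about what may nest inside what, and how many dates an event may carry.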

References were extracted and replaced by numbered keys, as the structure of references is not to be analysed.

Dates are central to the analysis, as they are almost unambiguously recognizable, and occur in almost every event. The day / month-word / year or month-word / day / year structure is converted into the form year : month-number : day so that it is readily searchable and sortable later.

    day  month-word  year          21 Oct. 1763    (Emden)
    month-word  day,  year         Oct. 21, 1763   (Venn)
    year : month-number : day      1763:10:21
Various phrases are used to modify dates. These include:
  bef.  by  c.  still in
which are incorporated into the date structure as modifiers.
So:  before June 29th 1475      becomes
Dates may also be associated with actions, such as:
adm.  vac.  exch.   which are also absorbed into the date.
So:  adm. c. Feb. 16th 1584–5    becomes
Events tend to have up to three dates:
adm. May 14th 1631, still in Sep. 1634, vac. before 1637.
  <dat2><act2>still in<val2>1634:09</dat2>
(here, the closing tags have been omitted for clarity).
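
The conversions described above can be sketched in Python as follows. This is a minimal illustration, not the NE program used in the project: the names `DATE`, `normalise` and `find_dates`, and the dictionary layout of the result, are assumptions made for this example and do not reproduce the project's SGML tags. Double years such as 1584–5 are passed through unchanged.

```python
import re

MONTHS = {name: i + 1 for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

_MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

DATE = re.compile(
    r"\b(?:(?P<action>adm|vac|exch)\.\s+)?"                 # optional action
    r"(?:(?P<modifier>bef\.|before|by|c\.|still in)\s+)?"   # optional modifier
    r"(?:(?:(?P<d1>\d{1,2})(?:st|nd|rd|th)?\s+(?P<m1>" + _MONTH + r")\.?"  # 21 Oct.
    r"|(?P<m2>" + _MONTH + r")\.?(?:\s+(?P<d2>\d{1,2})(?:st|nd|rd|th)?,?)?"  # Oct. 21,
    r")\s+)?"                                               # day and month optional
    r"(?P<year>\d{4}(?:[–-]\d{1,2}(?!\d))?)"                # 1763 or 1584–5
)

def normalise(match: re.Match) -> dict:
    """Convert one matched date into year:month:day form."""
    day = match["d1"] or match["d2"]
    month = match["m1"] or match["m2"]
    parts = [match["year"]]
    if month:
        parts.append(f"{MONTHS[month[:3].lower()]:02d}")
        if day:
            parts.append(f"{int(day):02d}")
    return {
        "action": match["action"],
        "modifier": match["modifier"],
        "value": ":".join(parts),
    }

def find_dates(text: str) -> list:
    """Return every date found in the text, in order of appearance."""
    return [normalise(m) for m in DATE.finditer(text)]

for d in find_dates("adm. May 14th 1631, still in Sep. 1634, vac. before 1637."):
    print(d)
```

Putting the year first means that a plain string comparison sorts dates chronologically, and a date known only to the month or year is simply a shorter string. For the three-date example, the middle date comes out with the value 1634:09, in agreement with the tagged fragment shown above.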

No attempt is made to modernise dates, so 1584–5 may mean either the range of years 1584 to 1585, or part of the old-style year from 1 January to 24 March 1585. Venn himself says of dates: "In taking a date from an ordinary history of the popular kind, we often do not know what the author means. Has he simply copied some contemporary record – parish register, tombstone, etc. – or has he tacitly substituted the modern reckoning? Wherever we can determine which he has done we have substituted the double date in order to avoid confusion. Sometimes, however, this is not possible, and then we have to leave the exact date ambiguous."

We also find what might be termed pseudo-dates, such as "till death", "forthwith", "MT" (standing for Michaelmas Term). These must be converted into a suitable numerical form (which in some cases involves looking ahead for a subsequent date).

Once all the dates have been tagged, they act as natural right-hand delimiters of many events, which greatly simplifies subsequent processing.

Banks, Henry. ... Brother of Ralph (1716).
Perhaps the same as the next.
V. of Walton, 1384; included in Univ. roll for pap. graces as petitioner for a cany of S. Pauls, London, notwithstanding Walton.

It was always obvious that cross-references between entries would be needed, to allow for constructions such as: "Banks, Henry. ... Brother of Ralph (1716)." or "Perhaps the same as the next." These will be coded by creating a unique identifier for each person, based on their name and perhaps the first recorded date, with an additional number to disambiguate similar names.
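
One possible way of generating such identifiers is sketched below in Python. The key format (surname, forenames, first recorded year, serial number, separated by full stops) and the names `PersonKeys` and `new_key` are assumptions made for this illustration; the format finally adopted may well differ.

```python
import re
from collections import Counter
from typing import Optional

class PersonKeys:
    """Issue unique keys of the (assumed) form surname.forenames.year.n"""

    def __init__(self) -> None:
        self._issued = Counter()   # how many keys share each base so far

    @staticmethod
    def _slug(text: str) -> str:
        # Lower-case, and replace anything that is not a letter by a hyphen.
        return re.sub(r"[^a-z]+", "-", text.lower()).strip("-")

    def new_key(self, surname: str, forenames: str,
                first_year: Optional[int] = None) -> str:
        year = str(first_year) if first_year else "nd"   # "nd" = no date
        base = ".".join([self._slug(surname), self._slug(forenames), year])
        self._issued[base] += 1
        # The trailing number disambiguates people with the same base.
        return f"{base}.{self._issued[base]}"

keys = PersonKeys()
print(keys.new_key("Dawson", "John", 1484))
print(keys.new_key("Dawson", "John", 1484))   # a second man of the same name
print(keys.new_key("Banks", "Henry"))
```

A relative reference such as "the next" can then be resolved to the key of the following entry once all entries have been numbered, while a reference such as "Brother of Ralph (1716)" has to be matched on surname, forename and date.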

It later became clear that cross-references within entries are also needed, to cope with things like: "V. of Walton ... notwithstanding Walton".

In a construction such as:
  V. of Reed, Herts., from 1729
we have a good clue that "Reed" is a place name because of its being immediately followed by the abbreviated county name "Herts." Would that all cases were so simple! Both Venn and Emden omit counties in many of the following situations: the place is in Cambridgeshire or London; the county is the same as in the immediately preceding place mentioned; the place is the seat of a bishop or archbishop; the place is "well-known" (that is, well-known to scholars of early church history); or they simply didn't know!

The other problem is the sheer idiosyncratic peculiarity of (mainly English) place names. Just as a sample, we have:

  • Bradwell juxta Mare (Essex)
  • Tavy S. Peter (Devon)
  • S. Martins le Grand (London)
  • Great S. Marys (Cambridge)
  • Stow cum Quy (Cambs)
  • S. Andrew by the Wardrobe (London)
  • Havering-atte-Bower (Essex)
  • S. Mary Somerset (London)
  • Cley-next-the-Sea (Norfolk)

and to add to the confusion between religious titles and place names, we have:

  • Swaffham Prior (Cambs)
  • Bishop Auckland (Durham)
  • Vernhams Dean (Hants)

and even

  • Dean Prior (Devon)

Thankfully, Emden has carefully distinguished between Durham the city and diocese, which is never abbreviated, and Durham the county, which is always abbreviated to "Dur.". Venn, however, uses the full word Durham for the city, the county, the diocese, and the school!

Looking at the list of place names ending in "Park", we see that Sir Denis Park might not be happy to be treated as a place name, and those of you unfamiliar with the history of African exploration may like to speculate on the place "Mungo Park".

Names ending "Park"
Finchcox Park
Finningley Park
Finsbury Park
Fitzroy Park
Foliejon Park
Fontmell Park
. . .
Hill Park
Holland Park
Holme Park
Holton Park
Howbury Park
Hungershall Park
Hyde Park
. . .
Mapperley Park
Middleton Park
Mildmay Park
Mileham Park
Monkstown Park
Moor Park
Moorend Park
Mount Park
Mungo Park
Mymms Park
Nevill Park
Neville Park
. . .
Princes Park
Prinknash Park
Queen's Park
Ravensbury Park
Ravenscourt Park
Ravenscroft Park
Raynes Park
Regent's Park
Richmond Park
Rotton Park
Roupell Park
Rushton Park
Salperton Park
Sefton Park
Sheen Park
Sibton Park
Sir Denis Park
Skelbrook Park
Sneyd Park
Somerby Park
South Park
Southall Park
Springfield Park
St Helen's Park
Stanmore Park
Stansted Park
Stapleton Park
Stock Park
Stoll Park
Stone Park
Stonebridge Park
Storrs Park
Stratton Park
Streatham Park
Sunbridge Park
Sutton Park
Swinton Park
Tehidy Park
Tew Park
Theobalds Park
Tollington Park
Totteridge Park
Toxteth Park
Trent Park
Tufnell Park
Tugdee Park
Upper Hyde Park
Upper Park
Upton Park
Vanbrugh Park
Victoria Park
Walters Park
. . .
Wexham Park
. . .
Wratting Park

In an attempt to pre-tag place names in a similar way to dates, I adopt this algorithm:

Place Name Algorithm

A place name is one of
     (a) a single capitalized word,
 or  (b) a sequence made up of
        (i) an initial capitalized word
                followed by any number of
        (ii) capitalized words, or
        (iii) the words in, and, the, de, le, cum, at, on,
            super, sub, with, by, juxta,
            portion, moiety
                terminated by
        (iv) a single capitalized word,
where a "word" may contain the characters hyphen and apostrophe.
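
The algorithm translates almost directly into a regular expression. The Python version below is a sketch for illustration (the project itself used NE); the names `WORD`, `LINK`, `PLACE`, `is_place_name` and `find_place_candidates` are assumptions introduced here.

```python
import re

# A capitalized word; hyphens and apostrophes are allowed inside it.
WORD = r"[A-Z][A-Za-z'-]*"
# The lower-case linking words permitted inside a place name.
LINK = r"(?:in|and|the|de|le|cum|at|on|super|sub|with|by|juxta|portion|moiety)"

# Either a single capitalized word, or an initial capitalized word followed
# by any number of capitalized or linking words and ending in a capitalized word.
PLACE = re.compile(rf"\b{WORD}(?:(?:\s+(?:{WORD}|{LINK}))*\s+{WORD})?")

def is_place_name(text: str) -> bool:
    """True if the whole string has the shape of a place name."""
    return PLACE.fullmatch(text) is not None

def find_place_candidates(text: str) -> list:
    """Return every substring that has the shape of a place name."""
    return [m.group(0) for m in PLACE.finditer(text)]

for name in ["Stow cum Quy", "Bradwell juxta Mare", "Havering-atte-Bower",
             "Sir Denis Park", "Stow cum"]:
    print(name, "->", is_place_name(name))

print(find_place_candidates("V. of Reed, Herts., from 1729"))
```

The pattern describes only the shape of a name, so it accepts "Sir Denis Park" as readily as "Stow cum Quy", and in the last line it offers the title "V" and the county "Herts" as candidates alongside "Reed". As stated, it also has no provision for the abbreviation "S." in names such as "Tavy S. Peter". The results are therefore candidates for checking, not identifications.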

Only about 30% of entries mention an age. Relatively few men entered under the age of 17. Older entrants tended to be men already in holy orders.

We have made a start on processing the lives of some 150,000 people, spanning almost seven centuries. I hope that I have been able to demonstrate some small part of the information which will become available. Then it's up to you to pursue your own particular interest, using the tools which we have prepared.

But don't hold your breath! There is a vast amount to do before we finish this project and publish the database. Linkage of names, identification of places, and coercing awkward narratives into a structured form will require a lot of human intervention, as well as a lot of programming skill. The result will be an incredibly valuable tool for historical research, which will serve many future generations of scholars.

