A Cambridge Alumni Database by J. L. Dawson
(presented at the conference ALLC/ACH 2000, University of Glasgow)
Venn: Alumni Cantabrigienses, a biographical list of all known students, graduates and holders of office at the University of Cambridge, from the earliest times to 1751, 4 vols (1922–27).
Venn: Alumni Cantabrigienses, . . . 1752–1900, 6 vols (1940–54).
Emden: A Biographical Register of the University of Cambridge to 1500 (1963).
This project began in 1987 as a means of publishing an updated version of these volumes of biographies of Cambridge alumni. Some 20,000 cards of addenda and corrigenda had been accumulated over the years. Archival research had unearthed much more detail, and many more names, for the period up to 1500, and in 1963 Emden published his two-volume Biographical Register. Together, these twelve volumes cover approximately 180,000 names, with some overlap.
It goes without saying that all this information is of the utmost importance for historical research, covering as it does a large proportion of the religious, legal, administrative, medical, and royal appointments in Britain, the Empire, and the Colonies, as well as many other countries. A good deal of social history is also included, albeit patchily. However, all these publications have a great defect for research: there is no index.
In the 1970s, a group at Oxford did some analysis work on Emden's Biographical Registers of both Oxford and Cambridge. However, they were not intending to reproduce the entire text in the form of a database. They remarked: "Even with modern systems of interrogation it is not certain that all available information will be included in the computerized database, and most projects will call for some degree of abbreviation and formatting of the data."
They also warned: "The correct interpretation of the entries in the register called for a knowledge not only of the internal workings of the university and of academic terminology ... but also of a wide range of other fields, notably royal and ecclesiastical administration."
|Venn II, vols 1–4 (1752–1900): A–O||45629|
|Tripos Lists (men 1748–1910, women 1873–1910)||20757||1657|
|Additional material only recently available (approx. figures)|
|Girton & Newnham Admissions Registers [women's colleges] (1869–1900)||2000|
|Venn I, vols 1–4 (1261–1751)||76000|
|Venn II, vols 5–6 (1752–1900): P–Z||14371|
We set about the creation of an on-line database to make all this information accessible. Other sources, such as the Tripos Lists (lists of degrees awarded), and College Registers (especially those of the women's colleges, which were ignored by Venn) have been included. Funding for the project has been supplied by the American Friends of Cambridge University, and by several colleges.
"Tripos" is a word unique to Cambridge University. Originally it was the title of the Bachelor of Arts appointed to dispute with the candidates for degrees, and he was so called from the three-legged stool on which he sat. Later, the term became applied to the university's final honours examinations.
For many years we were unable to find a simple and reliable way to put the data into machine-readable form. Venn's books are in small hand-set type, printed on thick rough paper, and are full of italics, all of which proved completely intractable to the OCR packages available until recently.
By chance, just as we had found suitable technology to cope with Venn's printing, we discovered that Ancestry.com had already prepared machine-readable versions of most of the volumes of Part II, for genealogical research. They have recently put the remaining volumes of Venn into the computer.
Emden's Biographical Register, the Tripos Lists, and the registers of the women's colleges have proved relatively easy to read using OCR and the services of a good proofreader.
|Number of Names in Sources Analysed So Far
(men with surnames A-O only)
|Venn II, vols 1–4 (1752–1900): A–O||[V] 45629|
|Tripos Lists (1748–1900): A–O||[T] 7594|
|total [V] + [T]||59557|
|Names in [T] only (of which 13 ambiguous)||1280|
|Names in [V] only (of which 1267 ambiguous)||31289|
|Names once in [T] and once in [V]||21706|
|Ambiguous names in both [T] and [V]||5249|
This shows the total numbers of names in the sources analysed so far. At first sight, the high number of men taking the Tripos who are not immediately matched by entries in Venn ([T] only) is alarming (all of the Tripos entries should appear in Venn, as the Tripos List was one of the sources the Venns used).
However, making a cursory manual inspection of the letter A mismatches reduces the number there from 189 to 21. The main problem is one of dates: Venn occasionally includes a man who was admitted after 1900, so the name mismatches around the 1903–1905 period are very suspect.
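The four categories in the table can be illustrated with a minimal sketch in Python. It assumes that a name counts as "ambiguous" when it occurs more than once in either source, which is my reading of the table rather than a definition given by the project; the function name `classify` and the sample data are purely illustrative ("Smith, John" is an invented entry, and which source carries which spelling of Ames is assumed).

```python
from collections import Counter

def classify(tripos_names, venn_names):
    """Sort names into four categories by exact string comparison.

    Assumption: a name is 'ambiguous' if it occurs more than once
    in either of the two sources.
    """
    t = Counter(tripos_names)
    v = Counter(venn_names)
    result = {"T only": [], "V only": [], "matched once": [], "ambiguous in both": []}
    for name in sorted(set(t) | set(v)):
        if name not in v:
            result["T only"].append(name)
        elif name not in t:
            result["V only"].append(name)
        elif t[name] == 1 and v[name] == 1:
            result["matched once"].append(name)
        else:
            result["ambiguous in both"].append(name)
    return result

# Illustrative data only.
tripos = ["Ames, Edwin Cecil", "Archer, Charles Goodwyn", "Smith, John", "Smith, John"]
venn = ["Ames, Edward Cecil", "Archer, Charles Goodwyn", "Smith, John"]

for category, names in classify(tripos, venn).items():
    print(category, names)
```

Because the comparison is on exact strings, a single misread forename (Edwin for Edward) produces one spurious "T only" and one spurious "V only" entry, which is the kind of mismatch that manual inspection removes.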
Many of the other mismatches are accounted for by the following types of problem.
|Variant spellings of surnames or first names|
|Abbott, James Raymond de Montmorency
Abbott, James Reymond De Montmorency
Abbot or Abbott, William
|Airey, John Alfred Lumb
Airey, John Alfred [Lumb]
|Archer, Charles Goodwin (Goodwyn)
Archer, Charles Goodwyn
|Atkinson, Edward Dupré
Atkinson, Edward Dupre
|Sometimes a name has been misread or misquoted|
|Ames, Edward Cecil
Ames, Edwin Cecil
|Arnold, William Laughton
Arnold, William Langton
|Styles and titles of the nobility are a problem|
(but what was his name?)
|Allsopp, Samuel Charles (Hindlip, Lord)
Allsopp, Samuel Charles
|Arabic and Indian names seem to cause trouble|
|Aftab Ahmad Khan
Ahmád Khán, Sahibzádá Aftab
Ahmed or Ahmad, Nizam-Uddin
|Sometimes the surname is just changed|
|Ackers [formerly Coops], James
|Jobson [post Archbold], William Arthur
Archbold, William Arthur Jobson
|Sometimes the old and new are combined|
|Andrewes [post Uthwatt], Henry
|Sometimes a double-barreled surname is hyphenated|
|Austen Leigh, Richard Arthur
Austen-Leigh, Richard Arthur
|Sometimes parts are hyphenated|
|Armytage, Joseph North Green
Green-Armytage, Joseph North Green
|And a warning for anyone searching for Hodgson or Atkinson ancestors!|
|Hodgson (post Archer-Hind), Richard Dacre
Archer-Hind, Richard Dacre
|Atkinson, Johnson (Busfield, J.)
Busfield or Busfeild, Johnson Atkinson
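Some of these mismatches are purely mechanical and could in principle be removed before comparison. The following Python sketch is not the project's method; it is a hypothetical normalisation step (the name `normalise` is mine) which ignores accents, hyphens and capitalisation when comparing two names.

```python
import unicodedata

def normalise(name):
    """Return a comparison key that ignores accents, hyphens and case."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Treat hyphens as spaces, lower-case, and collapse runs of whitespace.
    return " ".join(no_accents.replace("-", " ").lower().split())

pairs = [
    ("Atkinson, Edward Dupré", "Atkinson, Edward Dupre"),
    ("Austen Leigh, Richard Arthur", "Austen-Leigh, Richard Arthur"),
    ("Ames, Edward Cecil", "Ames, Edwin Cecil"),
]
for a, b in pairs:
    print(a, "|", b, "|", normalise(a) == normalise(b))
```

The first two pairs compare equal after normalisation; the third does not, and neither would a changed surname or a title of nobility. Those cases still need a human eye.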
A typical entry from Emden looks like this (with references abbreviated):
|Dawson, John (Dauson).*|
|Entered in C.L. ET 1484;|
|grace that study for 6 yr in C. and Cn.L. suffice for entry in Cn.L. gr. 1488-9;|
|Inc. C.L., adm. June 1490 [Ref1];|
|R. of Debden, Essex, clk, adm. 17 May 1484; till death [Ref2].|
|Will dated 10 Aug. 1492; proved 12 Feb. 1493 [Ref3].|
|Requested burial in S. Michaels, Cambridge.|
|* = this man also appears in Venn, part I|
|C.L. = Civil Law|
|ET = Easter Term (i.e. summer term)|
|grace = dispensation|
|Cn.L. = Canon Law|
|C. and Cn.L. = Civil and Canon Law|
|gr. = granted|
|Inc. = Incepted (took degree)|
|D.C.L. = Doctor of Civil Law|
|R. = Rector|
|clk = clerk (i.e. in holy orders)|
and has the following structure:
|heading||Dawson, John . . .|
|event 1||Entered in C.L. . . .|
|event 2||grace that study . . .|
|. . .|
|where each event in general comprises:|
|place (sometimes)||Debden, Essex|
|date(s)||adm. 17 May 1484|
My first attempts at analysis were written in Perl, a widely available string-handling language which allows complex regular expressions. (A regular expression is just a pattern like a piece of algebra which is used to match parts of the data and extract those parts which can vary.)
The regular expressions needed to recognise large-scale structures such as these entries are so complex that they used too much memory in Perl, and the programs frequently failed.
At Cambridge we have a locally-written programmable text editor called NE which has good regular expression handling. It may seem a retrograde step to use a one-off local program like NE in preference to a widely used standard such as Perl, but in our case only the product (the tagged text) is useful; the process used to make the product is different for each text analysed, so the ephemeral nature of the programs is not significant.
It was clear that some type of formal, structured, but readable output would be needed in the first instance. This could then be converted automatically as input to any required database package. SGML provides an adequate structure for these needs, and is widely used by publishers of machine-readable databases.
Glancing at the pages of Venn's biographies gives a first impression that they are very regular, with keywords such as "Matric." and "School" clearly signalling well-structured phrases. However, this is only what the human eye and brain make of the material! When an attempt is made to parse these sentences automatically, all sorts of horrors arise. In order to find out exactly what constructions are used in the text, I made concordances. Part of the concordance for the word "grace" is shown below and illustrates that, although the basic structure is similar, there are a lot of very different cases to consider when parsing.
|Concordance of phrases "grace . . . gr." from Emden|
|(i.e. all the following lines are preceded by the word "grace" and followed by "gr." meaning "granted")|
|about his proceeding B.Th.|
|allowing him 3 terms towards responding to the question|
|at Cambridge to enter in C.L.|
|concerning entering the Sentences|
|concerning entry in Cn.L.|
|concerning his form for entering the Sentences|
|concerning his commencement in Th.|
|concerning proceeding B.Th.|
|concerning qualification for inception in Arts|
|concerning qualifications for proceeding D.Th.|
|concerning lectura posteriorum|
|exempting him from lecturing on the Sentences after entry|
|exempting him from taking part in processions and other public ceremonies, as at Oxford|
|for admission in Th.|
|for admission to incept in Th.|
|for entering the Sentences|
|for entry in Cn.L.|
|for entry in Mus.|
|for exemption on proceeding D.Th.|
|for exemptions in proceeding D.Th.|
|for incepting D.Cn.L.|
|for inception as D.Th.|
|for inception in Th.|
|for incorpn as M.D.|
|in respect of inception in Arts|
First attempts at analysis were very heuristic, but served to clarify the problems in my mind. Writing a DTD (Document Type Definition) for the SGML structure was then very helpful, as it forced me to take decisions about nesting of fields, etc.
Initially, my regular expressions tried to match complete events, including place names and dates, but the programs ran out of time or store, and the regular expression processor found the structure too complex to analyse.
|Explanation of the DTD|
|document||comprises||one or more||entry|
|alias||comprises||one or more||surname(s)|
|cross-reference||comprises||either one or more||surname(s)|
|event||comprises||one of the types|
|in general, events contain some or all of|
|and up to three dates may appear with each event (e.g. appointed . . .; still in . . .; resigned . . .)|
References were extracted and replaced by numbered keys, as the structure of references is not to be analysed.
Dates are central to the analysis, as they are almost unambiguously recognizable, and occur in almost every event. The day / month-word / year or month-word / day / year structure is converted into the form year : month-number : day so that it is readily searchable and sortable later.
|day month-word year||21 Oct. 1763||(Emden)|
|month-word day, year||Oct. 21, 1763||(Venn)|
|year : month-number : day||1763:10:21|
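The conversion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NE program actually used: the names `to_sortable`, `DAY_FIRST` and `MONTH_FIRST` are mine, months are recognised by their first three letters, and month and day are zero-padded to two digits so that the strings sort correctly. Ordinals such as "29th" and double years are not handled here.

```python
import re

# Month lookup keyed on the first three letters of the month word.
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Emden style: day month-word year, e.g. "21 Oct. 1763"
DAY_FIRST = re.compile(r"(\d{1,2})\s+([A-Za-z]+)\.?\s+(\d{4})")
# Venn style: month-word day, year, e.g. "Oct. 21, 1763"
MONTH_FIRST = re.compile(r"([A-Za-z]+)\.?\s+(\d{1,2}),?\s+(\d{4})")

def to_sortable(text):
    """Convert a date in either style to 'year:month:day', or return None."""
    text = text.strip()
    m = DAY_FIRST.fullmatch(text)
    if m:
        day, month, year = m.groups()
    else:
        m = MONTH_FIRST.fullmatch(text)
        if not m:
            return None
        month, day, year = m.groups()
    number = MONTHS.get(month[:3].lower())
    if number is None:
        return None
    return f"{year}:{number:02d}:{int(day):02d}"

print(to_sortable("21 Oct. 1763"))   # Emden style
print(to_sortable("Oct. 21, 1763"))  # Venn style
```

Both calls give 1763:10:21, so entries from the two sources can be sorted and searched together.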
Various phrases are used to modify dates. These include:
bef. by c. still in
which are incorporated into the date structure as modifiers.
So: before June 29th 1475 becomes
Dates may also be associated with actions, such as:
adm. vac. exch. which are also absorbed into the date.
So: adm. c. Feb. 16th 1584–5 becomes
Events tend to have up to three dates:
adm. May 14th 1631, still in Sep. 1634, vac. before 1637.
(here, the closing tags have been omitted for clarity).
No attempt is made to modernise dates, so 1584–5 may mean either the range of years 1584 to 1585, or part of the old-style year from 1 January to 24 March 1585. Venn himself says of dates: "In taking a date from an ordinary history of the popular kind, we often do not know what the author means. Has he simply copied some contemporary record (parish register, tombstone, etc.) or has he tacitly substituted the modern reckoning? Wherever we can determine which he has done we have substituted the double date in order to avoid confusion. Sometimes, however, this is not possible, and then we have to leave the exact date ambiguous."
We also find what might be termed pseudo-dates, such as "till death", "forthwith", "MT" (standing for Michaelmas Term). These must be converted into a suitable numerical form (which in some cases involves looking ahead for a subsequent date).
Once all dates have been tagged, they act as natural right-hand delimiters of many events, which greatly simplifies subsequent processing.
| Banks, Henry. ... Brother of Ralph (1716).
Perhaps the same as the next.
V. of Walton, 1384; included in Univ. roll for pap. graces as petitioner for a cany of S. Pauls, London, notwithstanding Walton.
It was always obvious that cross-references between entries would be needed, to allow for constructions such as: "Banks, Henry. ... Brother of Ralph (1716)." or "Perhaps the same as the next." These will be coded by creating a unique identifier for each person, based on their name and perhaps the first recorded date, with an additional number to disambiguate similar names.
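One possible shape for such an identifier is sketched below in Python. The format (surname, initials, first recorded year, running number, separated by full stops) and the class name `IdAllocator` are assumptions made for illustration only; the project's actual coding scheme has not been fixed.

```python
import re
from collections import defaultdict

class IdAllocator:
    """Hands out identifiers of the hypothetical form SURNAME.INITIALS.YEAR.N"""

    def __init__(self):
        # How many times each stem has been issued so far.
        self._counts = defaultdict(int)

    def new_id(self, surname, forenames, first_year=None):
        # Keep only the letters of the surname, in upper case.
        base = re.sub(r"[^A-Za-z]+", "", surname).upper()
        initials = "".join(part[0].upper() for part in forenames.split())
        year = first_year if first_year is not None else "nd"  # nd = no date
        stem = f"{base}.{initials}.{year}"
        # The running number disambiguates people who share a stem.
        self._counts[stem] += 1
        return f"{stem}.{self._counts[stem]}"

ids = IdAllocator()
print(ids.new_id("Banks", "Ralph", 1716))
print(ids.new_id("Banks", "Ralph", 1716))  # a second man with the same stem
print(ids.new_id("Banks", "Henry"))        # no recorded date
```

A cross-reference such as "Brother of Ralph (1716)" could then be stored as a pointer to the identifier of the entry for Ralph, although deciding which Ralph is meant remains a matter for manual linkage.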
It later became clear that cross-references within entries are also needed, to cope with things like: "V. of Walton ... notwithstanding Walton".
In a construction such as:
V. of Reed, Herts., from 1729
we have a good clue that "Reed" is a place name because of its being immediately followed by the abbreviated county name "Herts." Would that all cases were so simple! Both Venn and Emden omit counties in many of the following situations: the place is in Cambridgeshire or London; the county is the same as in the immediately preceding place mentioned; the place is the seat of a bishop or archbishop; the place is "well-known" (that is, well-known to scholars of early church history); or they simply didn't know!
The other problem is the sheer idiosyncratic peculiarity of (mainly English) place names. Just as a sample, we have:
- Bradwell juxta Mare (Essex)
- Tavy S. Peter (Devon)
- S. Martins le Grand (London)
- Great S. Marys (Cambridge)
- Stow cum Quy (Cambs)
- S. Andrew by the Wardrobe (London)
- Havering-atte-Bower (Essex)
- S. Mary Somerset (London)
- Cley-next-the-Sea (Norfolk)
and to add to the confusion between religious titles and place names, we have:
- Swaffham Prior (Cambs)
- Bishop Auckland (Durham)
- Vernhams Dean (Hants)
- Dean Prior (Devon)
Thankfully, Emden has carefully distinguished between Durham the city and diocese, which is never abbreviated, and Durham the county, which is always abbreviated to "Dur.". Venn, however, uses the full word Durham for the city, the county, the diocese, and the school!
Looking at the list of place names ending in "Park", we see that Sir Denis Park might not be happy to be treated as a place name, and those of you unfamiliar with the history of African exploration may like to speculate on the place "Mungo Park".
|Names ending "Park"|
| Finchcox Park
. . .
. . .
. . .
| Princes Park
Sir Denis Park
St Helen's Park
| Stoll Park
Upper Hyde Park
. . .
. . .
In an attempt to pre-tag place names in a similar way to dates, I adopt this algorithm:
Place Name Algorithm
A place name is one of
(a) a single capitalized word,
or
(b) (i) an initial capitalized word
followed by any number of
(ii) capitalized word
(iii) in, and, the, de, le, cum, at, on,
super, sub, with, by, juxta,
followed by
(iv) a single capitalized word.
where a "word" may contain the characters hyphen and apostrophe.
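The algorithm can be expressed as a single regular expression. The Python sketch below is my reading of it, not the NE program itself: I assume that the multi-word case is an initial capitalized word, then any number of capitalized words or connecting words, then a final capitalized word. The names `WORD`, `CONNECT`, `PLACE` and `is_place_candidate` are illustrative, and abbreviations containing a full stop, such as "S.", are not covered.

```python
import re

# A capitalized "word" may contain hyphens and apostrophes.
WORD = r"[A-Z][A-Za-z'-]*"
# Lower-case connecting words allowed inside a place name.
CONNECT = r"(?:in|and|the|de|le|cum|at|on|super|sub|with|by|juxta)"
# Multi-word form first, then the single capitalized word.
PLACE = re.compile(rf"{WORD}(?:\s+(?:{WORD}|{CONNECT}))*\s+{WORD}|{WORD}")

def is_place_candidate(text):
    """True if the whole string has the shape of a place name."""
    return PLACE.fullmatch(text) is not None

for sample in ["Reed", "Stow cum Quy", "Bradwell juxta Mare",
               "Havering-atte-Bower", "Sir Denis Park", "Stow cum"]:
    print(sample, is_place_candidate(sample))
```

Every sample except the last is accepted, including "Sir Denis Park": the pattern finds candidates only, and telling a knight from a park still depends on context or on a gazetteer.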
SHOW pp.17 & 18 Point out: <ER> Religious Event; <keye> event key; <IR> Internal event, religious; <cond> cross-reference to earlier event key; <EG> Event, topic "grace" or dispensation; Walton not written out in full (needs manual linkage);
[LIN] counties inserted in square brackets where not given in the text (more than half of all place names have no county; our Bartholomew's Gazetteer is well-thumbed, and our Swiss assistant's knowledge of British Empire geography has been greatly enlarged!) <cond>this provision [p.18]: again, manual linkage back to earlier entry (key 60)
Show Fig.1 Point out: (Only about 30% of entries have an age mentioned.) Relatively few entering under the age of 17. Older entrants tended to be men already in holy orders.
Show Fig.2 Point out: Trinity and St John's were by far the largest colleges. (Trinity is still the largest.) General increase in the size of the colleges reflects the increase in size of the whole university.
Show Fig.4 Point out: huge increases in all of these smallest colleges from 1880. Huge influx into Queens' in 1820s. Almost no admissions to King's (1750s); Queens' (1870s); Sidney Sussex (1890s). We don't yet know why [remember that these results are based on surnames A-C only].
Show Fig.5 Most Cambridge colleges owned the "livings" of several parishes, and were therefore entitled to appoint their own nominees to religious appointments there. We thought it would be interesting to trace the counties in which Cambridge influence was most significant. Warning: these results should really be normalized by the total number of parishes per county.
Show Fig.6 Now we turn to the later Tripos results. These are interesting because they show the beginning of the rise of proper teaching and examinations in science. (The earliest Triposes were in Mathematics and Classics only.) Magdalene's concentration on Classics; Trinity Hall's dominance in Law; Queens' in Mathematics.
Show Fig.7 The Mathematics Tripos figures for the first hundred years are interesting, mainly because of the incredibly low take-up of the Tripos by some of the colleges. (Until 1751 King's had the privilege of exemption from university examinations, which in fact prevented its scholars from competing in the Tripos; and Trinity Hall was the lawyers' college.)
Show Fig.8 (vertically) The total number of Tripos exams taken in the last 60 years of our figures shows a distribution different from that of the sizes of the colleges. King's (not then a large college, and very late starting to take the Tripos) figures relatively highly.
We have made a start on processing the lives of some 150,000 people, spanning almost seven centuries. I hope that I have been able to demonstrate some small part of the information which will become available. Then it's up to you to pursue your own particular interest, using the tools which we have prepared.
But don't hold your breath! There is a vast amount to do before we finish this project and publish the database. Linkage of names, identification of places, and coercing awkward narratives into a structured form will require a lot of human intervention, as well as a lot of programming skill. The result will be an incredibly valuable tool for historical research, which will serve many future generations of scholars.
My first attempt at processing part of the letter 'G' of Venn part II is available here but should not be taken as in any way final.