ACAD - A Cambridge Alumni Database

by J. L. Dawson

(presented at the conference ALLC/ACH 2000, University of Glasgow)

Introduction

Venn: Alumni Cantabrigienses, a biographical list of all known students, graduates and holders of office at the University of Cambridge, from the earliest times to 1751, 4 vols (1922–27).

Venn: Alumni Cantabrigienses, . . . 1752–1900, 6 vols (1940–54).

Emden: A Biographical Register of the University of Cambridge to 1500 (1963).

This project began in 1987 as a means of publishing an updated version of these volumes of biographies of Cambridge alumni. Some 20,000 cards of addenda and corrigenda had been accumulated over the years. Archival research had unearthed much more detail, and many more names, for the period up to 1500, and in 1963 Emden published his two-volume Biographical Register. Together, these twelve volumes cover approximately 180,000 names, with some overlap.

It goes without saying that all this information is of the utmost importance for historical research, covering as it does a large proportion of the religious, legal, administrative, medical, and royal appointments in Britain, the Empire, and the Colonies, as well as many other countries. A good deal of social history is also included, albeit patchily. However, all these publications have a great defect for research: there is no index.

In the 1970s, a group at Oxford did some analysis work on Emden's Biographical Registers of both Oxford and Cambridge[1]. However, they were not intending to reproduce the entire text in the form of a database. They remarked: "Even with modern systems of interrogation it is not certain that all available information will be included in the computerized database, and most projects will call for some degree of abbreviation and formatting of the data."

They also warned: "The correct interpretation of the entries in the register called for a knowledge not only of the internal workings of the university and of academic terminology ... but also of a wide range of other fields, notably royal and ecclesiastical administration."

Sources
	Men	Women
Emden (1198–1500)	7594
Venn II, vols 1–4 (1752–1900): A–O	45629
Tripos Lists (men 1748–1910, women 1873–1910)	20757	1657
totals	73980	1657

Additional material only recently available (approx. figures)
	Men	Women
Girton & Newnham Admissions Registers [women's colleges] (1869–1900)		2000
Venn I, vols 1–4 (1261–1751)	76000
Venn II, vols 5–6 (1752–1900): P–Z	14371
totals	90371	2000

We set about the creation of an on-line database to make all this information accessible. Other sources, such as the Tripos Lists (lists of degrees awarded), and College Registers (especially those of the women's colleges, which were ignored by Venn) have been included. Funding for the project has been supplied by the American Friends of Cambridge University, and by several colleges.

"Tripos" is a word unique to Cambridge University. Originally it was the title of the Bachelor of Arts appointed to dispute with the candidates for degrees, and he was so called from the three-legged stool on which he sat. Later, the term became applied to the university's final honours examinations.

For many years we were unable to find a simple and reliable way to put the data into machine-readable form. Venn's books are in small hand-set type, printed on thick rough paper, and are full of italics, all of which proved completely intractable to the OCR packages available until recently.

By chance, just as we had found suitable technology to cope with Venn's printing, we discovered that Ancestry.com had already prepared machine-readable versions of most of the volumes of Part II, for genealogical research. They have recently put the remaining volumes of Venn into the computer.

Emden's Biographical Register, the Tripos Lists, and the registers of the women's colleges have proved relatively easy to read using OCR and the services of a good proofreader.

Number of Names in Sources Analysed So Far (men with surnames A-O only)
Venn II, vols 1–4 (1752–1900): A–O	[V] 45629
Tripos Lists (1748–1900): A–O	[T] 13928
total [V] + [T]	59557

Names in [T] only (of which 13 ambiguous)	1280
Names in [V] only (of which 1267 ambiguous)	31289
Names once in [T] and once in [V]	21706
Ambiguous names in both [T] and [V]	5249
Wrongly entered	33
total	59557

This shows the total numbers of names in the sources analysed so far. At first sight, the high number of men taking the Tripos who are not immediately matched by entries in Venn ([T] only) is alarming (all of the Tripos entries should appear in Venn, as the Tripos List was one of the sources the Venns used).

However, making a cursory manual inspection of the letter A mismatches reduces the number there from 189 to 21. The main problem is one of dates: Venn occasionally includes a man who was admitted after 1900, so the name mismatches around the 1903–1905 period are very suspect.

Many of the other mismatches are accounted for by the following types of problem.

Variant spellings of surnames or first names
Abbott, James Raymond de Montmorency	Abbott, James Reymond De Montmorency
Abbot, William	Abbot or Abbott, William
Adams, Francis	Adams, Frank
Airey, John Alfred Lumb	Airey, John Alfred [Lumb]
Archer, Charles Goodwin (Goodwyn)	Archer, Charles Goodwyn
Atkinson, Edward Dupré	Atkinson, Edward Dupre
Sometimes a name has been misread or misquoted
Ames, Edward Cecil	Ames, Edwin Cecil
Arnold, William Laughton	Arnold, William Langton
Styles and titles of the nobility are a problem
Alford, Viscount	(but what was his name?)
Allsopp, Samuel Charles (Hindlip, Lord)	Allsopp, Samuel Charles
Arabic and Indian names seem to cause trouble
Aftab Ahmad Khan	Ahmád Khán, Sahibzádá Aftab
Ahmed, Nizam-uddin	Ahmed or Ahmad, Nizam-Uddin
Sometimes the surname is just changed
Ackers [formerly Coops], James	Ackers, James
Jobson [post Archbold], William Arthur	Archbold, William Arthur Jobson
Sometimes the old and new are combined
Andrewes [post Uthwatt], Henry	Andrewes-Uthwatt, Henry
Sometimes a double-barreled surname is hyphenated
Austen Leigh, Richard Arthur	Austen-Leigh, Richard Arthur
Sometimes parts are hyphenated
Armytage, Joseph North Green	Green-Armytage, Joseph North Green
And a warning for anyone searching for Hodgson or Atkinson ancestors!
Hodgson (post Archer-Hind), Richard Dacre	Archer-Hind, Richard Dacre
Atkinson, Johnson (Busfield, J.)	Busfield or Busfeild, Johnson Atkinson
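
Some of these variants can be reconciled mechanically before the two lists are compared. The following is a minimal sketch in Python, not the method actually used in the project; the function name normalize_name is hypothetical. It folds case, strips accents and treats hyphens as spaces. That is enough for pairs such as Dupré/Dupre and Austen Leigh/Austen-Leigh, but not for misreadings such as Raymond/Reymond, which would still need manual inspection.

```python
import unicodedata


def normalize_name(name):
    """Return a crude comparison key for a name (illustrative only)."""
    # Decompose accented letters so the accents become separate marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks: "Dupré" becomes "Dupre".
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    # Fold case, treat hyphens as spaces, and collapse runs of spaces.
    return " ".join(no_accents.lower().replace("-", " ").split())


pairs = [
    ("Atkinson, Edward Dupré", "Atkinson, Edward Dupre"),
    ("Austen Leigh, Richard Arthur", "Austen-Leigh, Richard Arthur"),
    ("Abbott, James Raymond de Montmorency",
     "Abbott, James Reymond De Montmorency"),
]
for left, right in pairs:
    same = normalize_name(left) == normalize_name(right)
    print(left, "|", right, "|", "match" if same else "no match")
```

Changed surnames and titles of nobility are beyond any such normalization and call for explicit cross-references.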

A typical entry from Emden looks like this (with references abbreviated):

Dawson, John (Dauson).*
Entered in C.L. ET 1484;
grace that study for 6 yr in C. and Cn.L. suffice for entry in Cn.L. gr. 1488-9;
Inc. C.L., adm. June 1490 [Ref1];
D.C.L.
R. of Debden, Essex, clk, adm. 17 May 1484; till death [Ref2].
Died 1492.
Will dated 10 Aug. 1492; proved 12 Feb. 1493 [Ref3].
Requested burial in S. Michaels, Cambridge.

Notes
* = this man also appears in Venn, part I
C.L. = Civil Law
ET = Easter Term (i.e. summer term)
grace = dispensation
Cn.L. = Canon Law
C. and Cn.L. = Civil and Canon Law
gr. = granted
Inc. = Incepted (took degree)
D.C.L. = Doctor of Civil Law
R. = Rector
clk = clerk (i.e. in holy orders)

and has the following structure:

heading	Dawson, John . . .
event 1	Entered in C.L. . . .
event 2	grace that study . . .
. . .

where each event in general comprises:
topic	e.g. academic
type	e.g. entered
place (sometimes)	e.g. Debden, Essex
date(s)	e.g. adm. 17 May 1484
reference(s)	e.g. [Ref2]

My first attempts at analysis were written in Perl[2], a widely available string-handling language which allows complex regular expressions. (A regular expression is just a pattern – like a piece of algebra – which is used to match parts of the data and extract those parts which can vary.)


The regular expressions needed to recognize large-scale structures such as these entries were so complex that they used too much memory in Perl, and the programs frequently failed.

At Cambridge we have a locally-written programmable text editor called NE[3] which has good regular expression handling. It may seem a retrograde step to use a one-off local program like NE in preference to a widely used standard such as Perl, but in our case only the product (the tagged text) is useful; the process used to make the product is different for each text analysed, so the ephemeral nature of the programs is not significant.

It was clear that some type of formal, structured, but readable output would be needed in the first instance. This could then be converted automatically as input to any required database package. SGML provides an adequate structure for these needs, and is widely used by publishers of machine-readable databases.

Glancing at the pages of Venn's biographies gives a first impression that they are very regular, with keywords such as "Matric." and "School" clearly signalling well-structured phrases. However, this is only what the human eye and brain make of the material! When an attempt is made to parse these sentences automatically, all sorts of horrors arise. In order to find out exactly what constructions are used in the text, I made concordances. Part of the concordance for the word "grace" is shown below and illustrates that, although the basic structure is similar, there are a lot of very different cases to consider when parsing.

Concordance of phrases "grace . . . gr." from Emden
(i.e. all the following lines are preceded by the word "grace" and followed by "gr." meaning "granted")

about his proceeding B.Th.
ad eundem
allowing him 3 terms towards responding to the question
at Cambridge to enter in C.L.
concerning entering the Sentences
concerning entry in Cn.L.
concerning his form for entering the Sentences
concerning his commencement in Th.
concerning inception
concerning proceeding B.Th.
concerning qualification for inception in Arts
concerning qualifications for proceeding D.Th.
concerning lectura posteriorum
exempting him from lecturing on the Sentences after entry
exempting him from taking part in processions and other public ceremonies, as at Oxford
for admission in Th.
for admission to incept in Th.
for entering the Sentences
for entry in Cn.L.
for entry in Mus.
for exemption on proceeding D.Th.
for exemptions in proceeding D.Th.
for incepting D.Cn.L.
for inception as D.Th.
for inception
for inception in Th.
for incorpn as M.D.
for incorpn
in respect of inception in Arts
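
A concordance of this kind can be sketched in a few lines. The example below is an illustration in Python, not the program used at the time; the function name grace_concordance and the sample text (assembled from phrases quoted in this paper) are assumptions. It collects every phrase standing between "grace" and "gr." and sorts the distinct phrases.

```python
import re

# "grace", then the shortest possible phrase, then "gr." (granted).
GRACE_PHRASE = re.compile(r"\bgrace\s+(.*?)\s+gr\.")


def grace_concordance(text):
    """Return the sorted, distinct phrases found between 'grace' and 'gr.'."""
    return sorted({m.group(1) for m in GRACE_PHRASE.finditer(text)})


sample = (
    "grace that study for 6 yr in C. and Cn.L. suffice for entry "
    "in Cn.L. gr. 1488-9; grace for entry in Cn.L. gr.; "
    "grace ad eundem gr."
)
for phrase in grace_concordance(sample):
    print(phrase)
```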

First attempts at analysis were very heuristic, but served to clarify the problems in my mind. Writing a DTD (Document Type Definition) for the SGML structure was then very helpful, as it forced me to take decisions about nesting of fields, etc.

Initially, my regular expressions tried to match complete events, including place names and dates, but the programs ran out of time or store, and the regular expression processor found the structure too complex to analyse.

Explanation of the DTD
document	comprises	one or more	entry

entry	comprises		name
		possible	alias
		unique	personal key
		either	cross-reference
		or	event

name	comprises		surname
		possible	forename(s)
		possible	Venn marker

alias	comprises	one or more	surname(s)

cross-reference	comprises	either one or more	surname(s)
		or	personal key

event	comprises	one of the types
	academic	donation	religious
	family	grant	warning
	job	legal	land
		unknown	additional
	in general, events contain some or all of
			qualifier
			age
			condition
			event key
			type
		either	college
		or	institution
		or	school
		or	religious inst.
		or	place
		and usually	date(s)

religious inst.	comprises	possible	church
		possible	place

place	comprises	possible	city
		possible	county
		possible	country

date	comprises	possible	action
		possible	modifier
		possible	value
and up to three dates may appear with each event (e.g. appointed . . .; still in . . .; resigned . . .)
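
For readers more at home with program data structures than with DTDs, the nesting described above can be restated as follows. This is only a sketch in Python under my own assumptions: the class and field names are hypothetical, the DTD itself is not reproduced here, and several event fields (qualifier, age, condition, college and so on) are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EVENT_TOPICS = {"academic", "donation", "religious", "family", "grant",
                "warning", "job", "legal", "land", "unknown", "additional"}


@dataclass
class Date:
    action: Optional[str] = None     # e.g. "adm"
    modifier: Optional[str] = None   # e.g. "bef"
    value: Optional[str] = None      # e.g. "1484:05:17"


@dataclass
class Place:
    city: Optional[str] = None
    county: Optional[str] = None
    country: Optional[str] = None


@dataclass
class Event:
    topic: str
    type: Optional[str] = None
    event_key: Optional[int] = None
    place: Optional[Place] = None
    dates: List[Date] = field(default_factory=list)

    def __post_init__(self):
        if self.topic not in EVENT_TOPICS:
            raise ValueError(f"unknown event topic: {self.topic}")
        if len(self.dates) > 3:
            raise ValueError("an event carries at most three dates")


@dataclass
class Name:
    surname: str
    forenames: Optional[str] = None
    venn_marker: bool = False


@dataclass
class Entry:
    name: Name
    personal_key: str
    aliases: List[str] = field(default_factory=list)
    cross_reference: Optional[str] = None   # either this ...
    events: List[Event] = field(default_factory=list)  # ... or these

    def __post_init__(self):
        if self.cross_reference is not None and self.events:
            raise ValueError("an entry has a cross-reference or events")


# The key "DAWSON-J-1484-01" is an invented placeholder.
dawson = Entry(
    name=Name("Dawson", "John", venn_marker=True),
    personal_key="DAWSON-J-1484-01",
    aliases=["Dauson"],
    events=[Event(topic="religious", type="R.",
                  place=Place(city="Debden", county="Essex"),
                  dates=[Date(action="adm", value="1484:05:17")])],
)
print(dawson.events[0].place.county)
```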

References were extracted and replaced by numbered keys, as the structure of references is not to be analysed.

Dates are central to the analysis, as they are almost unambiguously recognizable, and occur in almost every event. The day / month-word / year or month-word / day / year structure is converted into the form year : month-number : day so that it is readily searchable and sortable later.

Dates
day month-word year	21 Oct. 1763	(Emden)
month-word day, year	Oct. 21, 1763	(Venn)
year : month-number : day	1763:10:21
Various phrases are used to modify dates. These include:
	bef.	by	c.	still in
which are incorporated into the date structure as modifiers. So:
	before June 29th 1475
becomes
	<dat><mod>bef</mod><val>1475:06:29</val></dat>

Dates may also be associated with actions, such as:
	adm.	vac.	exch.
which are also absorbed into the date. So:
	adm. c. Feb. 16th 1584–5
becomes
	<dat><act>adm</act><mod>c</mod><val>1584–5:02:16</val></dat>

Events tend to have up to three dates:
	adm. May 14th 1631, still in Sep. 1634, vac. before 1637.
becomes
	<dat><act>adm<val>1631:05:14</dat>
	<dat2><act2>still in<val2>1634:09</dat2>
	<dat3><act3>vac<mod3>bef<val3>1637</dat3>
(here, the inner closing tags have been omitted for clarity).
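
The conversion of a single date phrase can be sketched as follows. This is an illustration in Python under stated assumptions, not the NE program actually used: the function name tag_date and the tables of actions and modifiers are my own, only one date per phrase is handled, and "still in" and the numbered tags for second and third dates are left out.

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}
ACTIONS = {"adm", "vac", "exch"}
MODIFIERS = {"bef": "bef", "before": "bef", "by": "by", "c": "c"}

DAY = r"(?P<day>\d{1,2})(?:st|nd|rd|th)?"
MONTH = r"(?P<month>[A-Z][a-z]+)\.?"
YEAR = r"(?P<year>\d{4}(?:[-–]\d{1,2})?)"   # keeps double years like 1584–5

DATE_PATTERNS = [re.compile(p) for p in (
    rf"{DAY} {MONTH} {YEAR}",      # 21 Oct. 1763   (Emden)
    rf"{MONTH} {DAY},? {YEAR}",    # Oct. 21, 1763  (Venn)
    rf"{MONTH} {YEAR}",            # Sep. 1634
    YEAR,                          # 1637
)]


def tag_date(phrase):
    """Return the tagged form of one date phrase, or None if unrecognized."""
    words = phrase.strip(" .,;").split()
    act = mod = None
    # Peel off a leading action and/or modifier, in that order.
    while words:
        token = words[0].rstrip(".").lower()
        if act is None and mod is None and token in ACTIONS:
            act = token
        elif mod is None and token in MODIFIERS:
            mod = MODIFIERS[token]
        else:
            break
        words.pop(0)
    rest = " ".join(words)
    for pattern in DATE_PATTERNS:
        m = pattern.fullmatch(rest)
        if m is None:
            continue
        parts = m.groupdict()
        fields = [parts["year"]]
        if parts.get("month"):
            number = MONTHS.get(parts["month"][:3].lower())
            if number is None:
                continue           # a capitalized word that is not a month
            fields.append(f"{number:02d}")
            if parts.get("day"):
                fields.append(f"{int(parts['day']):02d}")
        tags = (f"<act>{act}</act>" if act else "") + \
               (f"<mod>{mod}</mod>" if mod else "")
        return f"<dat>{tags}<val>{':'.join(fields)}</val></dat>"
    return None


print(tag_date("before June 29th 1475"))
print(tag_date("adm. c. Feb. 16th 1584–5"))
```

Because the value is written year first with zero-padded month and day, ordinary string comparison sorts dates within the same century chronologically.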

No attempt is made to modernise dates, so 1584–5 may mean either the range of years 1584 to 1585, or part of the old-style year from 1 January to 24 March 1585. Venn himself says of dates: "In taking a date from an ordinary history of the popular kind, we often do not know what the author means. Has he simply copied some contemporary record – parish register, tombstone, etc. – or has he tacitly substituted the modern reckoning? Wherever we can determine which he has done we have substituted the double date in order to avoid confusion. Sometimes, however, this is not possible, and then we have to leave the exact date ambiguous."

We also find what might be termed pseudo-dates, such as "till death", "forthwith", "MT" (standing for Michaelmas Term). These must be converted into a suitable numerical form (which in some cases involves looking ahead for a subsequent date).

Once all the dates have been tagged, they act as natural right-hand delimiters of many events, which greatly simplifies subsequent processing.

Cross-references
Banks, Henry. ... Brother of Ralph (1716). Perhaps the same as the next. V. of Walton, 1384; included in Univ. roll for pap. graces as petitioner for a cany of S. Pauls, London, notwithstanding Walton.

It was always obvious that cross-references between entries would be needed, to allow for constructions such as: "Banks, Henry. ... Brother of Ralph (1716)." or "Perhaps the same as the next." These will be coded by creating a unique identifier for each person, based on their name and perhaps the first recorded date, with an additional number to disambiguate similar names.
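
One possible shape for such identifiers is sketched below in Python. The key format (surname, initials, first recorded year, running number) and the names KeyMaker and make_key are purely hypothetical; the format to be used in the database has not been fixed here.

```python
import re
from collections import defaultdict


class KeyMaker:
    """Hand out unique personal keys, numbering people who share a stem."""

    def __init__(self):
        self._seen = defaultdict(int)   # stem -> how many issued so far

    def make_key(self, surname, forenames="", first_year=None):
        # Keep only the letters of the surname, in capitals.
        base = re.sub(r"[^A-Za-z]", "", surname).upper()
        initials = "".join(w[0].upper() for w in forenames.split()) or "X"
        year = "0000" if first_year is None else str(first_year)
        stem = f"{base}-{initials}-{year}"
        self._seen[stem] += 1
        # The running number separates people with identical stems.
        return f"{stem}-{self._seen[stem]:02d}"


keys = KeyMaker()
print(keys.make_key("Dawson", "John", 1484))
print(keys.make_key("Banks", "Henry", 1384))
print(keys.make_key("Banks", "Henry", 1384))   # a second, similar man
```

A cross-reference such as "Perhaps the same as the next" can then point at the key of the following entry rather than at its name.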

It later became clear that cross-references within entries are also needed, to cope with things like: "V. of Walton ... notwithstanding Walton".

In a construction such as:
V. of Reed, Herts., from 1729
we have a good clue that "Reed" is a place name because it is immediately followed by the abbreviated county name "Herts." Would that all cases were so simple! Both Venn and Emden omit the county in any of the following situations: the place is in Cambridgeshire or London; the county is the same as that of the immediately preceding place; the place is the seat of a bishop or archbishop; the place is "well-known" (that is, well-known to scholars of early church history); or they simply didn't know!

The other problem is the sheer idiosyncratic peculiarity of (mainly English) place names. Just as a sample, we have:

Bradwell juxta Mare (Essex)
Tavy S. Peter (Devon)
S. Martins le Grand (London)
Great S. Marys (Cambridge)
Stow cum Quy (Cambs)
S. Andrew by the Wardrobe (London)
Havering-atte-Bower (Essex)
S. Mary Somerset (London)
Cley-next-the-Sea (Norfolk)

and to add to the confusion between religious titles and place names, we have:

Swaffham Prior (Cambs)
Bishop Auckland (Durham)
Vernhams Dean (Hants)

and even

Dean Prior (Devon)

Thankfully, Emden has carefully distinguished between Durham the city and diocese, which is never abbreviated, and Durham the county, which is always abbreviated to "Dur.". Venn, however, uses the full word Durham for the city, the county, the diocese, and the school!

Looking at the list of place names ending in "Park", we see that Sir Denis Park might not be happy to be treated as a place name, and those of you unfamiliar with the history of African exploration may like to speculate on the place "Mungo Park".

Names ending "Park"
Finchcox Park Finningley Park Finsbury Park Fitzroy Park Foliejon Park Fontmell Park . . . Hill Park Holland Park Holme Park Holton Park Howbury Park Hungershall Park Hyde Park . . . Mapperley Park Middleton Park Mildmay Park Mileham Park Monkstown Park Moor Park Moorend Park Mount Park Mungo Park Mymms Park Nevill Park Neville Park . . .	Princes Park Prinknash Park Queen's Park Ravensbury Park Ravenscourt Park Ravenscroft Park Raynes Park Regent's Park Richmond Park Rotton Park Roupell Park Rushton Park Salperton Park Sefton Park Sheen Park Sibton Park Sir Denis Park Skelbrook Park Sneyd Park Somerby Park South Park Southall Park Springfield Park St Helen's Park Stanmore Park Stansted Park Stapleton Park Stock Park	Stoll Park Stone Park Stonebridge Park Storrs Park Stratton Park Streatham Park Sunbridge Park Sutton Park Swinton Park Tehidy Park Tew Park Theobalds Park Tollington Park Totteridge Park Toxteth Park Trent Park Tufnell Park Tugdee Park Upper Hyde Park Upper Park Upton Park Vanbrugh Park Victoria Park Walters Park . . . Wexham Park . . . Wratting Park

In an attempt to pre-tag place names in a similar way to dates, I adopt this algorithm:

Place Name Algorithm

A place name is one of

     (a) a single capitalized word.

     (b)
        (i) an initial capitalized word
                followed by any number of
        (ii) capitalized word
        (iii) in, and, the, de, le, cum, at, on,
            super, sub, with, by, juxta,
            portion, moiety
                terminated by
        (iv) a single capitalized word.

where a "word" may contain the characters hyphen and apostrophe.
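
Expressed as a regular expression, the algorithm might look like the sketch below. It is written in Python for illustration, not in NE, and the names PLACE_NAME, is_place_name and find_place_candidates are my own. It follows the definition literally, so a "word" may not contain a full stop: names with "S." are not recognized, while abbreviations such as "V" and personal names such as "Mungo Park" are.

```python
import re

# A capitalized "word"; hyphens and apostrophes are allowed inside it.
CAP_WORD = r"[A-Z][A-Za-z'-]*"
CONNECTORS = ("in", "and", "the", "de", "le", "cum", "at", "on",
              "super", "sub", "with", "by", "juxta", "portion", "moiety")
CONNECTOR = "(?:" + "|".join(CONNECTORS) + ")"

# (a) a single capitalized word, or
# (b) an initial capitalized word, any number of capitalized words or
#     connectors, and a final capitalized word.
PLACE_NAME = re.compile(
    rf"\b{CAP_WORD}(?:(?: (?:{CAP_WORD}|{CONNECTOR}))* {CAP_WORD})?"
    r"(?![A-Za-z'-])"
)


def is_place_name(text):
    """True if the whole string has the shape of a place name."""
    return PLACE_NAME.fullmatch(text) is not None


def find_place_candidates(text):
    """Return every substring of text with the shape of a place name."""
    return [m.group(0) for m in PLACE_NAME.finditer(text)]


for name in ("Stow cum Quy", "Bradwell juxta Mare", "Cley-next-the-Sea",
             "Mungo Park", "Tavy S. Peter"):
    print(name, is_place_name(name))
print(find_place_candidates("V. of Reed, Herts., from 1729"))
```

The candidates found in this way are only that: each still has to be checked against a gazetteer or by hand.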

SHOW pp.17 & 18 Point out:
<ER> Religious Event;
<keye> event key;
<IR> Internal event, religious;
<cond> cross-reference to earlier event key;
<EG> Event, topic "grace" or dispensation;
Walton not written out in full (needs manual linkage);
[LIN] counties inserted in square brackets where not given in the text (more than half of all place names have no county – our Bartholomew's Gazetteer is well-thumbed, and our Swiss assistant's knowledge of British Empire Geography has been greatly enlarged!);
<cond>this provision [p.18]: again, manual linkage back to earlier entry (key 60).

Show Fig.1 Point out: (Only about 30% of entries have an age mentioned.) Relatively few entering under the age of 17. Older entrants tended to be men already in holy orders.

Show Fig.2 Point out: Trinity and St John's were by far the largest colleges. (Trinity is still the largest.) General increase in the size of the colleges reflects the increase in size of the whole university.


Show Fig.4 Point out: huge increases in all of these smallest colleges from 1880. Huge influx into Queens' in 1820s. Almost no admissions to King's (1750s); Queens' (1870s); Sidney Sussex (1890s). We don't yet know why [remember that these results are based on surnames A-C only].

Show Fig.5 Most Cambridge colleges owned the "livings" of several parishes, and were therefore entitled to appoint their own nominees to religious appointments there. We thought it would be interesting to trace the counties in which Cambridge influence was most significant. Warning: these results should really be normalized by the total number of parishes per county.

Show Fig.6 Now we turn to the later Tripos results. These are interesting because they show the beginning of the rise of proper teaching and examinations in science. (The earliest Triposes were in Mathematics and Classics only.) Magdalene's concentration on Classics; Trinity Hall's dominance in Law; Queens' in Mathematics.

Show Fig.7 The Mathematics Tripos figures for the first hundred years are interesting, mainly because of the incredibly low take-up of the Tripos by some of the colleges. (Until 1851 King's had the privilege of exemption from university examinations – this in fact prevented its scholars from competing in the Tripos; and Trinity Hall was the lawyers' college.)

Show Fig.8 (vertically) The total number of Tripos exams taken in the last 60 years of our figures shows a distribution different from that of the sizes of the colleges. King's (not then a large college, and very late starting to take the Tripos) figures relatively highly.

Conclusion

We have made a start on processing the lives of some 150,000 people, spanning almost seven centuries. I hope that I have been able to demonstrate some small part of the information which will become available. Then it's up to you to pursue your own particular interest, using the tools which we have prepared.

But don't hold your breath! There is a vast amount to do before we finish this project and publish the database. Linkage of names, identification of places, and coercing awkward narratives into a structured form will require a lot of human intervention, as well as a lot of programming skill. The result will be an incredibly valuable tool for historical research, which will serve many future generations of scholars.


Contact: LLCC@ucs.cam.ac.uk