Women at Cambridge
The Literary & Linguistic Computing Centre (part of the University of Cambridge Computing Service) is helping to prepare a searchable database of everyone who has ever been academically associated with Cambridge University from 1200 to 1900 (see /acad/ for details and /acad/dawson.html for a paper describing the project). This will cover the lives of some 150,000 people, spanning some seven centuries.
The aim is to help historians and archivists (particularly of the University and Colleges) and genealogists.
The sources of most of this information are the following:
Venn: Alumni Cantabrigienses, a biographical list of all known students, graduates and holders of office at the University of Cambridge, from the earliest times to 1751, 4 vols (1922-27).
Venn: Alumni Cantabrigienses, . . . 1752-1900, 6 vols (1940-54).
Emden: A Biographical Register of the University of Cambridge to 1500 (1963).
The 10 volumes of Venn were put into machine-readable form by the Mormons, primarily to help members of their Church with genealogy.
Additional material comprises 20,000 handwritten cards of addenda and corrigenda to Venn, handwritten additions and corrections to Emden, and Tripos results (taken from Previté-Orton, Index to Tripos Lists 1748-1910).
Most importantly, we have added the matriculation and staff registers of Girton and Newnham Colleges. Despite the presence of women students and staff in Cambridge from about 1869, the Venn volumes completely ignored them.
The end product of the analysis of all these biographies will be a set of XML-coded files, probably using the specifications of the Encoded Archival Context Initiative (EAC). This, however, is too verbose for initial coding, so a simpler scheme has been devised for editing the data.
The Venn entries (which range from three lines to half a column each) were carefully and consistently edited, and automatically processing them is reasonably straightforward (though many things need hand-tweaking after the automatic steps have done their best). Unfortunately, the Girton and Newnham registers have been heavily, inconsistently, and ambiguously abbreviated, which makes automatic processing much more difficult.
Hand-editing of all these women's entries is now complete, which is a cause for celebration.
In our editing, particular attention has been paid to the needs of genealogists. All names have (where possible) been completed and tagged, and aliases and cross-references have been inserted for women's married names and other changes of name. In the case of the entries for the men's colleges, this has increased the number of searchable names by a factor of more than two and a half.
Automatic processing of entries begins with tagging all dates, as these can be recognized almost unambiguously. (A few people named 'May' or 'August', or labelled 'Jun.', are partially turned into dates, but this is easily reversed!)
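The date-tagging step, and the false positives it produces, might be sketched as follows. This is only an illustration in Python: the `<date>` tag, the regular expression, and the function names `tag_dates` and `untag_bare_months` are assumptions made for the example, not the project's actual coding scheme.

```python
import re

# Hypothetical patterns: month names and abbreviations, and years from 1200 to 1999.
MONTH = (r"(?:January|February|March|April|May|June|July|August|September|"
         r"October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept|Sep|"
         r"Oct|Nov|Dec)\.?")
YEAR = r"1[2-9]\d\d(?:-\d{2,4})?"

# Either [day] [month] year, or a month word standing on its own.
DATE_RE = re.compile(rf"\b(?:(?:\d{{1,2}} )?{MONTH} )?{YEAR}\b|\b{MONTH}(?!\w)")


def tag_dates(text: str) -> str:
    """Wrap everything that looks like a date in an (assumed) <date> tag."""
    return DATE_RE.sub(lambda m: f"<date>{m.group(0)}</date>", text)


def untag_bare_months(tagged: str) -> str:
    """Crude reversal: remove the tag from 'dates' that contain no digit,
    such as the name 'May' or the label 'Jun.'."""
    return re.sub(r"<date>(\D*?)</date>", r"\1", tagged)


print(tag_dates("Born 3 May 1861; matriculated 1880."))
print(tag_dates("Daughter of May Smith."))
print(untag_bare_months(tag_dates("Daughter of May Smith.")))
```

The second call shows the problem described above: the name 'May' is tagged as a date. The reversal here simply untags anything without a digit, which would also untag a genuine month given without a year, so in practice such cases would still need checking by hand.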
Next come place names. If only all places had been entered with a county, life would have been so much simpler! Only about half of all place names have a county mentioned, so our assistant has diligently added county (and in some cases country) abbreviations to all unlabelled places. Her knowledge of British and Empire geography is now much expanded! A simple algorithm is then used to amalgamate all words which could form part of a place name (ending with a county abbreviation), and tag them as places. In one sample of 2,764 men, there are 13,837 place names mentioned.
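One way such an amalgamation algorithm could work is to find each county abbreviation and then walk backwards over the words that could belong to the place name. The sketch below assumes this approach; the `<place>` tag, the tiny `COUNTIES` and `CONNECTORS` lists, and the function names are all hypothetical, and the real algorithm may differ.

```python
# Hypothetical sample lists; a real run would need the full set of
# county and country abbreviations used in the registers.
COUNTIES = {"Yorks", "Lancs", "Surrey", "Cambs", "Kent"}
CONNECTORS = {"on", "upon", "le", "under", "by"}


def _could_be_place_word(token: str) -> bool:
    """True for capitalized words and for connectors such as 'upon'."""
    word = token.rstrip(",")
    return word in CONNECTORS or (word[:1].isupper() and word.isalpha())


def tag_places(text: str) -> str:
    tokens = text.split()
    spans = []  # (start, end) token indexes, end inclusive
    for i, tok in enumerate(tokens):
        if tok.rstrip(".,;") not in COUNTIES:
            continue
        prev_end = spans[-1][1] if spans else -1
        start = i
        # Walk backwards over words that could form part of the place name.
        while start > prev_end + 1 and _could_be_place_word(tokens[start - 1]):
            start -= 1
        # A place name should not begin with a connector.
        while tokens[start] in CONNECTORS:
            start += 1
        spans.append((start, i))

    out, pos = [], 0
    for start, end in spans:
        out.extend(tokens[pos:start])
        phrase = " ".join(tokens[start:end + 1])
        body = phrase.rstrip(";,")  # keep trailing ; or , outside the tag
        out.append(f"<place>{body}</place>{phrase[len(body):]}")
        pos = end + 1
    out.extend(tokens[pos:])
    return " ".join(out)


print(tag_places("Born at Kingston upon Thames, Surrey; educated at Bradford, Yorks."))
```

A sketch like this will absorb any capitalized word that directly precedes a place name (for example a sentence-initial 'Educated'), which is one reason hand-checking remains necessary.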
Finally, key words such as 'born', 'died', 'matriculated', 'son of' are used to identify the beginnings of what we call Events. An Event comprises an Action, a Person, a Place, and a Date (almost any part of an Event may be omitted, of course). The Person part of an Event can itself contain other Events, so the resulting structure is quite complex.
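The Event structure and the keyword step can be illustrated with a short Python sketch. The class and field names below are assumptions chosen to mirror the description (Action, Person, Place, Date); they are not the project's actual editing scheme, and the keyword splitter is deliberately flat, with no attempt to build the nesting automatically.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Person:
    name: Optional[str] = None
    # A Person may carry Events of their own, which is what makes the structure nest.
    events: list[Event] = field(default_factory=list)


@dataclass
class Event:
    # Almost any part of an Event may be omitted, so every field is optional.
    action: Optional[str] = None
    person: Optional[Person] = None
    place: Optional[str] = None
    date: Optional[str] = None


# Hypothetical keyword list, taken from the examples in the text.
KEYWORD_RE = re.compile(r"\b(born|died|matriculated|son of)\b")


def split_on_keywords(text: str) -> list[tuple[str, str]]:
    """Cut an entry into (keyword, following text) pieces at each keyword."""
    matches = list(KEYWORD_RE.finditer(text))
    pieces = []
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        pieces.append((m.group(1), text[m.end():end].strip(" ,;.")))
    return pieces


def count_events(events: list[Event]) -> int:
    """Count Events, including those nested inside a Person."""
    return sum(1 + (count_events(e.person.events) if e.person else 0) for e in events)


print(split_on_keywords("son of John Smith; born 1830, died 1901."))

# A hand-built nested example: the father has a 'died' Event of his own.
father = Person("John Smith", [Event(action="died", date="1850")])
entry = [
    Event(action="son of", person=father),
    Event(action="born", place="Leeds, Yorks.", date="1830"),
]
print(count_events(entry))
```

The nested example shows why the structure becomes complex: the 'son of' Event contains a Person who in turn carries a 'died' Event.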
Entries such as 'Assistant Mistress at Wimbledon High School 1883-84, Bradford Girls' Gr. School 1884-91' have to be split and the 'Assistant Mistress' tag repeated. All this makes the tagged files more than three times as large as the original data.
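Splitting such an entry and repeating the leading tag could be sketched as below. This Python example assumes that the role is separated from the institutions by ' at ' and that each institution is followed by a year or year range; the function name `split_posts` and the tuple layout are hypothetical.

```python
import re

# An institution followed by a year or year range, ending at a comma or end of text.
POST_RE = re.compile(r"\s*(.+?)\s+(\d{4}(?:-\d{2,4})?)(?:,|$)")


def split_posts(entry: str) -> list[tuple[str, str, str]]:
    """Split 'Role at Place1 dates1, Place2 dates2' into one
    (role, institution, dates) tuple per post, repeating the role."""
    role, sep, rest = entry.partition(" at ")
    if not sep:
        return []  # no ' at ' found: leave the entry for hand-editing
    return [(role, m.group(1), m.group(2)) for m in POST_RE.finditer(rest)]


entry = ("Assistant Mistress at Wimbledon High School 1883-84, "
         "Bradford Girls' Gr. School 1884-91")
for post in split_posts(entry):
    print(post)
```

An institution listed without dates would be merged with the one that follows it, so the output of a sketch like this would still need checking. Repeating the role for every post is also one source of the growth in file size mentioned above.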
We have made a start on processing the lives of some 150,000 people, spanning more than seven centuries. But don't hold your breath! There is a vast amount to do before we finish this project and publish the database. Linking names, identifying places, and coercing awkward narratives into a structured form will require a lot of human intervention, as well as a lot of programming skill. The result will be an incredibly valuable tool for historical and genealogical research, which will serve many future generations of scholars and family historians.
But the women are (almost) finished!!
Dr John Dawson