NON-TRADITIONAL AUTHORSHIP ATTRIBUTION STUDIES IN EIGHTEENTH CENTURY LITERATURE: STYLISTICS, STATISTICS, AND THE COMPUTER
Abstract
Non-traditional authorship attribution studies are those attribution studies that make use of the computer, statistics, and stylistics. The hypothesis behind these studies is that an author has a unique and identifiable style. The computer has now become ubiquitous in eighteenth century literary studies and is the main reason why non-traditional authorship studies have advanced to where they are. David Holmes gives a good overview of the field in The Analysis of Literary Style – A Review.[1] This article surveys a representative sample of authorship studies of eighteenth century literature and gives an exemplum of an ongoing study.
Because my research is on the canon of Daniel Defoe, the following representative survey of non-traditional authorship studies in eighteenth century literature focuses on literature written in English. However, many non-traditional authorship studies in other languages are important to researchers in English literature for reasons of theory and technique. Among others, Richard Frautschi has done stylistic and authorship work on many eighteenth century French authors: Diderot, d'Alembert, Jaucourt, Rousseau, Voltaire, and Perrault.[2] Estelle Irizarry has done work on Rodríguez Juliá imitating the hyperbolic style of the Spanish intellectuals of the eighteenth century.[3]
In addition to the non-English works, a few seminal studies that use stylistics and statistics but not the computer are mentioned for the same reasons of style and technique.
The following list is not exhaustive but is very representative. There are many authorship attribution problems that exist in the long eighteenth century. Griffin, in Anonymity and Authorship, gives a good synopsis of the why of the problem.[4]
As an aside, it is interesting to note that the English courts allowed language style as evidence for authorship as early as 1728 in the trial of William Hales.[5]
This list does not include stylistic studies on groups of authors such as the one that Burrows did on a group of twenty eighteenth century authors,[6] or the one that Sigelman et al. did on fifteen eighteenth century pamphleteers,[7] or the study that Kroeber did on five eighteenth century novelists.[8]
Two questions to keep in mind as you look at the following studies are: (1) Am I aware of these studies? (2) Have the results of the study become generally accepted by the profession?
• Swift
Louis Milic's study of the quantitative style of Swift, A Quantitative Approach to the Style of Jonathan Swift, is must reading for anyone working on non-traditional attribution studies.[9]
Along with Milic, Cynthia and William Matlack, in their A Statistical Approach to Problems of Attribution: »A Letter of Advice to a Young Poet«, deal with A Letter of Advice to a Young Poet.[10]
Corbett identifies several quantifiable style-markers and uses them to analyze A Modest Proposal.[11]
• Smollett
Barbara Laning Fitzpatrick did preliminary work for a non-traditional study of some essays that appeared in the British Magazine.[12] However, she decided (correctly, in my opinion) that important elements for a valid non-traditional study were missing and therefore ceased the study.[13]
• Samuel Johnson and/or Charlotte Lennox
Deborah McLeod studied the attribution of the penultimate chapter of The Female Quixote – Was it Johnson, Lennox, or was it a collaboration? »The results of the statistical analysis [...] are maddeningly contradictory.«[14]
Isobel Grundy discusses this same question and goes on to show the importance of the study of attribution in general and that of attributions to women in particular.[15]
• Sir Josiah Child
O'Brien and Darnell did a study to determine if Sir Josiah Child was Philopatris – the author of the Treatise[...] defending the East India Company.[16]
• Goldsmith
Mannion and Dixon have done non-traditional authorship studies on many of the over one hundred essays that have been attributed to Goldsmith since his death.[17]
• Patrick Henry
Stephen Olson did an authorship study on Patrick Henry's Liberty or Death speech. But, do historians or the general public believe that St. George Tucker rather than Patrick Henry wrote the speech?[18]
Olson also did a study, Computerized Thematic Analysis of Selected American Revolution Pamphlets. Unfortunately, all of the digital records, print copies, and notes were lost in a fire in 1995.[19]
• Charles Brockden Brown
Fritz Fleischmann is working on Brown's short fiction and essays that were published pseudonymously or anonymously. After spending years on a traditional study, he is set to begin the non-traditional part of his study.[20]
Larry Stewart is also doing quantitative analysis on Brockden Brown's work. He delivered a paper at the ALLC/ACH 2002 conference in Tübingen, Charles Brockden Brown: Quantitative Analysis and Narrative Voice.[21]
• The Junius Letters
Various practitioners have tried to answer the question: Did Sir Philip Francis, Johnson, or (Pick-a-name) write the Junius Letters?[22]
• The Federalist Papers
The study that arguably is the most famous and the most successful is the Mosteller and Wallace work on the twelve disputed Federalist Papers. Were they written by Hamilton or Madison?[23] It is interesting to note that Mosteller and Wallace used Bayes' Theorem in their study. Bayes was an eighteenth century minister and mathematician.[24] Gavin Budge delivered a paper at the 2002 American Society for Eighteenth Century Studies meeting titled, Bayesian Probability and the Crisis of Representation in Eighteenth Century Mathematics.[25] The following list of studies shows the influence that Mosteller and Wallace have had on non-traditional authorship studies. They all either test Mosteller and Wallace's results or use Mosteller and Wallace's methods and techniques in their own studies:
– Merriam's An Experiment with the »Federalist Papers«.[26]
– Särndal's On Deciding Cases of Disputed Authorship.[27]
– Tankard's Literary Detective.[28]
– Martindale and McKenzie's On the Utility of Content Analysis in Author Attributions: »The Federalist«.[29]
– Tweedie, Singh, and Holmes' Neural Network Applications in Stylometry: »The Federalist Papers«.[30]
– Kjell's Discrimination of Authorship Using Letter Pair Frequency Features with Neural Network Classifiers.[31]
– Kjell et al.'s Discrimination of Authorship Using Visualization.[32]
– Wachal's dissertation Linguistic Evidence, Statistical Inference, and Disputed Authorship.[33]
– Rokeach et al.'s A Value Analysis of the Disputed Federalist Papers.[34]
– McColly and Weier's Literary Attribution and Likelihood-Ratio Tests: The Case of the Middle English »Pearl«-Poems.[35]
– Bosch and Smith's Separating Hyperplanes and the Authorship of the Disputed »Federalist Papers«.[36]
– Khmelev and Tweedie's Using Markov Chains for Identification of Writers.[37]
– Fung and Mangasarian's The Disputed Federalist Papers: SVM Feature Selection via Concave Minimization.[38]
– Forsyth's dissertation Stylistic Structures: A Computational Approach to Text Classification.[39]
– Francis' An Exposition of a Statistical Approach to the »Federalist« Dispute.[40]
– Hilton and Holmes' An Assessment of Cumulative Sum Charts for Authorship Attribution.[41]
– Farringdon and Morton's Fielding and the »Federalist«.[42]
– Holmes and Forsyth's The »Federalist« Revisited.[43]
In addition to the above, almost every non-traditional authorship study (done in or out of the eighteenth century) cites Mosteller and Wallace for one reason or another.
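As a toy illustration of the kind of Bayesian discrimination Mosteller and Wallace pioneered, the sketch below scores a disputed text on a few function words under a simple Poisson model. The marker words and their per-1,000-word rates are invented for the example – they are not Mosteller and Wallace's published estimates, and their actual study used more refined (negative binomial) models.

```python
import math

# Illustrative rates per 1,000 words for a few marker words of the kind
# Mosteller and Wallace used.  These numbers are invented for the sketch.
RATES = {
    "upon":   {"Hamilton": 3.0,  "Madison": 0.2},
    "while":  {"Hamilton": 0.3,  "Madison": 0.05},
    "whilst": {"Hamilton": 0.05, "Madison": 0.5},
}

def log_odds(counts, n_words, prior_h=0.5):
    """Log-odds (base e) that the text is Hamilton's rather than Madison's,
    treating each marker-word count as Poisson with the author's rate."""
    ll = math.log(prior_h) - math.log(1 - prior_h)
    for word, rates in RATES.items():
        k = counts.get(word, 0)
        for author, sign in (("Hamilton", 1), ("Madison", -1)):
            lam = rates[author] * n_words / 1000.0
            # log Poisson pmf: k*log(lam) - lam - log(k!)
            ll += sign * (k * math.log(lam) - lam - math.lgamma(k + 1))
    return ll

# A hypothetical disputed 2,000-word text that uses "whilst" but never "upon"
disputed = {"upon": 0, "while": 0, "whilst": 2}
print(log_odds(disputed, 2000))  # negative => the evidence favours Madison
```

The sign of the log-odds indicates which author the evidence favours, and its magnitude can be turned into the kind of betting odds Mosteller and Wallace reported.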
• Henry Fielding
Michael and Jill Farringdon did a study of Fielding's translation of the Military History of Charles XII.[44]
Michael Farringdon also contributed a non-traditional authorship attribution section to Martin Battestin's work on Fielding's contributions to the Craftsman.[45]
Michael Farringdon and Andrew Morton worked on Fielding in their, Fielding and the Federalist.[46]
Hugh Amory discussed word usage in Fielding.[47]
Martin Battestin, in New Essays by Henry Fielding: His Contributions to the »Craftsman« (1734-1739) and other Early Journalism with a Stylometric Analysis by Michael Farringdon, discussed the CUSUM method used in authorship attribution in general and attribution in Fielding in particular.[48]
• Sarah Fielding
John Burrows did a study that showed Henry Fielding wrote the beginning of the history of Anna Boleyn and that he allowed Sarah to continue it for him. Burrows also showed that Henry either revised her ending, or added an ending in which he sought to imitate her style.[49]
An editorial review in The Scriblerian discussed Burrows' article and made the statement that »Computers instill confidence«. But the reviewer goes on to quote Burrows' concession that »statistical analysis never yields conclusive answers«.[50]
Sheridan Baker wrote a probing article, Did Fielding Write ›A Vision‹?, casting doubt on Burrows' contentions.[51]
M.W.A. Smith, in Attribution by Statistics: A Critique of Four Recent Studies, looks at Burrows' contributions to the study of Anna Boleyn and criticizes him for giving neither the origin of the methodology nor the origin of its underlying theory. Smith agrees that Burrows has shown that Henry's and Sarah's authorship can be differentiated, and he calls Burrows' results »impressive«.[52]
Burrows wrote a few articles to answer his critics and to explain how statistical evidence should be viewed.[53]
• Jane Austen
Karl Kroeber talked about the perils of quantification and used Emma as a case study.[54]
Although John Burrows' book, Computation into Criticism, is not an attribution study, many of his techniques have been referenced and used in authorship studies.[55]
Michael Hilton and David Holmes included a short study of five of Austen's novels in their An Assessment of Cumulative Sum Charts for Authorship Attribution.[56]
• Aphra Behn
John Burrows and Harold Love did a fine study – traditional and non-traditional – on Caesar's Ghost.[57]
• Shadwell
Burrows and Love also did a study on some works attributed to Shadwell – confirming some of the attributions but not all.[58]
• Defoe
As early as 1966, Edward McAdam used the computer to try to find out if it was Defoe who wrote about one hundred anonymous political tracts.[59] The results of McAdam's study are lost – not as the result of a fire as in Olson's case, but seemingly because no one published or preserved them when McAdam died.
Newsome did a study on the 1745 continuation of Roxana – I have never been able to find this in print. Fortunately, I have a 1987 e-mail pre-print from the author.[60] It is unfortunate that Furbank and Owens did not know of this study when they wrote their The ›Lost‹ Continuation of Defoe's »Roxana«.[61]
Steig Hargevik's monumental work, The Disputed Assignment of »Memoirs of an English Officer« to Daniel Defoe, is a good starting point for anyone doing non-traditional attribution work on Defoe. Hargevik did not use the computer in his 1974 study. However, he did use stylistics and statistics. He hand counted various stylistic traits in a text sample of over two million words.[62]
Irving Rothman looks at Hargevik's study, makes some points for expanding it to other Defoe works, and takes Furbank and Owens to task for ignoring important aspects of stylometrics.[63]
Maximillian Novak has used the computer to generate some concordances as an aid in his Defoe attribution studies – and, therefore, in the compilation of his bibliographies of Defoe's canon.[64] Novak, in a recent essay, re-attributed a de-attributed work by using a mixture of Rothman's work and historical scholarship.[65] Novak also takes Furbank and Owens to task for promising the use of new technology – stylometrics – and then, although finding the methodology inconclusive, abandoning any real analysis of language. Novak also finds fault with their not using the Eighteenth Century Short Title Catalogue (ECSTC).[66]
Furbank and Owens answered Novak's article but were silent on his charge about stylometrics, language analysis, and the ECSTC.[67]
Paula Backscheider, one of the world's preeminent Defoe scholars, made a serious effort to understand and employ non-traditional authorship techniques while working on her Daniel Defoe: His Life.[68] The results were not convincing enough to use as supporting evidence for her attributions based on traditional external evidence. Professor Backscheider continues to review the pertinent literature and believes that the non-traditional technology will eventually become valid and valuable.[69]
There are at least three dissertations on Defoe's style that are valuable aids in Defoe attribution studies:
– Horten – concentrated on Defoe's spelling, vowels, consonants, and punctuation.[70]
– Dill – concentrated on the quantifiable elements of Defoe's style: phrases, vocabulary, sentence structure, and prose rhythm.[71]
– Lannerd – concentrated on Defoe's use of the indefinite article, indefinite pronoun, and periphrastic tenses.[72]
I began working on attribution studies in Defoe's canon in the late 1970s. It soon became clear that there were monumental problems. Correspondence with Curtis, Furbank and Owens, and other Defoe scholars convinced me that a non-traditional approach offered the best hope for some kind of resolution. A large majority of my research time since the mid-eighties has been spent on the theory and techniques of non-traditional authorship attribution.[73]
The bibliography of this paper (embedded in the footnotes) contains the references to many non-traditional authorship attribution studies in the eighteenth century.
The following is an exemplum to give a little better idea of what non-traditional studies are about. It is a part of an ongoing project and, in a broad sense, will point out many of the techniques and potential dangers of using the computer in a non-traditional authorship attribution study.
The questioned anonymous work in this example is A Letter from Scotland to a Friend in London (Letter) – a 1705 political pamphlet that had significant impact on the union of England and Scotland. It is about intrigue, piracy, and revenge. The Letter was first attributed to Defoe by Moore[74] – who later wrote that he was, »[...] much less sure of his authorship than I was in 1939.«[75] I have found no other Defoe scholar who believes the Letter is by Defoe.
Now let me give a major caveat: before any non-traditional study is undertaken, a rigorous and complete traditional study must be done. Non-traditional methods are tools to be employed by traditional scholars – and surely not the most important tools.
The following, for obvious reasons, is necessarily short and incomplete. Two of my published articles that explain and expand on some of the problems and solutions of non-traditional authorship problems are The State of Authorship Attribution Studies: Some Problems and Solutions[76] and Non-Traditional Authorship Attribution Studies: Ignis Fatuus or Rosetta Stone.[77]
After the traditional study and all of the preliminary analysis are finished (e.g. are all of the Defoe and control texts valid? is the Letter a valid text?), the process of the study begins:
• Enter the text of the Letter into the computer and add the Text Encoding Initiative (TEI) coding.[78]
– I typed the Letter into the computer. The error rate of the optical character reader is too high – although it is down to about 60%. But, even with careful proofing of my typing, an error rate for typos must be calculated and folded into the final experimental error.
– The coding allows automated analysis of style-markers such as parts of speech ratios and phrasal type percentages.
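As a minimal sketch of what this encoding step produces, the snippet below wraps plain paragraphs in a bare TEI P5 skeleton using Python's standard library. The header contents are placeholders and the result is far from a complete, schema-valid encoding; a real transcription would record the source, edition, and editorial decisions, and would add the markup that the later style-marker analysis relies on.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def el(parent, tag, text=None):
    """Create a TEI-namespaced child element."""
    node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
    node.text = text
    return node

def tei_skeleton(title, paragraphs):
    """Wrap plain paragraphs in a minimal (placeholder) TEI skeleton."""
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(tei, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished transcription.")
    el(el(file_desc, "sourceDesc"), "p", "Transcribed from the 1705 printing.")
    body = el(el(tei, "text"), "body")
    for para in paragraphs:
        el(body, "p", para)
    return ET.tostring(tei, encoding="unicode")

print(tei_skeleton("A Letter from Scotland to a Friend in London",
                   ["My Lord, ...", "I am, &c."]))
```

The point is only that, once a text is in TEI, tagging for parts of speech and phrase types can be layered on and then queried mechanically.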
• Enter the sample of Defoe texts into the computer and add the TEI coding.
– This sample should consist of all political tracts written by Defoe and published within (+ or -) five years of the Letter, i.e. 1700-1710.
– Do not include any dubitanda – a certain and stylistically pure Defoe sample must be established, and all decisions must err on the side of exclusion. If there can be no certain Defoe touchstone, there can be no non-traditional authorship attribution studies on his canon, and no wide-ranging stylistic studies.
– Get texts from anywhere possible, then edit, and type in the rest.
– Keep out a random sample of this set as one type of control.
There is a danger in downloading texts from the internet. You must compare the electronic text with the printed copy you have chosen to ensure its integrity.
• Enter a substantial random sample of non-Defoe texts into the computer and add the TEI coding.
– Same genre and time constraints as the Defoe texts.
– Download texts, edit them, and type in the rest.
– This is a random sample of all of the other authors within the constraints. The larger the sample, the lower the statistical error on the result.
– This is one of the experimental controls.
• Analyze the texts stylistically
Textual problems must be dealt with before the analysis can begin. For example: quotes (of others, of Defoe's earlier works, of fictional authors), plagiarism, translations, mixed genre, editorial corruption, and accidentals (such as orthography).[79]
There is a danger in using canned computer packages such as TACT or TUSTEP.[80] The user must completely understand the assumptions and techniques that the authors of these programs employed. The user must really understand everything – even something as basic as what the system designers consider a ›word‹ or ›sentence‹.[81]
– Identify all of the style-markers (from the hundreds of thousands available) that Defoe uses consistently (e.g. type/token ratio, word length correlations, function word frequencies).[82]
– Compare Defoe's usage of these style-markers to the writers in the random sample.
– Compare Defoe's usage of these style-markers to the Letter.
• Analyze the results statistically
Again, users of canned packages must understand the assumptions behind them and the methodology that was used in creating them. Users must not let these packages dictate their research plans.
– Look at the results of the stylistic analysis the way you would look at a DNA autoradiogram.[83]
– Cull out the zeitgeist style-markers.
• Answer the question – Did Defoe write the Letter?
– Probabilistic.
No non-traditional authorship attribution study can say with 100% surety that an author wrote a given work. These studies can approach certainty the way that a DNA study can – giving odds such as one in a million that an author did or did not write the questionable work.
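To make the style-marker steps above concrete, here is a toy sketch of computing two of the markers mentioned (type/token ratio and function-word rates) plus a crude profile distance. The marker set and the distance measure are my illustrative choices, not any particular study's protocol.

```python
import re
from collections import Counter

# A small illustrative set of function words; marker choice is the
# researcher's, and hundreds of thousands of candidate markers exist.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "upon", "it", "that"]

def tokens(text):
    """A deliberately naive tokenizer (what counts as a 'word' is exactly
    the kind of assumption a canned package may hide from the user)."""
    return re.findall(r"[a-z]+", text.lower())

def style_markers(text):
    """Type/token ratio plus function-word rates per 1,000 tokens."""
    toks = tokens(text)
    counts = Counter(toks)
    markers = {"type_token_ratio": len(counts) / len(toks)}
    for w in FUNCTION_WORDS:
        markers[f"rate_{w}"] = 1000.0 * counts[w] / len(toks)
    return markers

def distance(a, b):
    """Crude Euclidean distance between two marker profiles; a real study
    would standardize each marker and test statistical significance."""
    return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5
```

In use, the disputed text's profile would be compared against the certain-Defoe profile and the control profile – and only markers that survive the zeitgeist cull should enter that comparison.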
The results of nearly all of the studies mentioned in the survey, if accepted at all, are accepted with a grain of salt.
Why are these studies looked at with such skepticism? Do these studies show non-traditional authorship attribution to be simply ›aspiration‹ and not a science, as Furbank and Owens claim?[84] Has stylistics, with the help of Milic and Dilligan, withstood the onslaught of Fish?[85]
Are most non-traditional authorship studies of eighteenth century literature valid? My answer is no. Much theoretical and experimental work must be done before this answer can change and these studies can take their place in mainstream bibliography.
Joseph Rudman
Department of English
Carnegie Mellon University
Pittsburgh, PA 15213
USA
jr20@andrew.cmu.edu