
How to Read 100 Million Blogs (and How to Classify Deaths without Physicians)

We develop a new method of computerized content analysis that gives approximately unbiased and statistically consistent estimates of quantities of theoretical interest to social scientists. With a small subset of documents hand coded into investigator-chosen categories, our approach can give accurate estimates of the proportion of text documents in each category in a larger population. The hand coded subset need not be a random sample, and may differ in dramatic but specific ways from the population. Previous methods require random samples, which are often infeasible in social science text analysis applications; they also attempt to maximize the percent of individual documents correctly classified, a criterion which leaves open the possibility of substantial estimation bias for the aggregate proportions of interest. We also correct, apparently for the first time, for the far less-than-perfect levels of inter-coder reliability that typically characterize human attempts to classify documents, an approach that will normally outperform even population hand coding when that is feasible. We illustrate the effectiveness of this approach by tracking the daily opinions of millions of people about candidates for the 2008 presidential nominations in online blogs, data we introduce and make available with this article. We demonstrate the broad applicability of our approach through additional evaluations in a variety of available corpora from other areas, including large databases of movie reviews and university web sites. We also offer easy-to-use software that implements all methods described.
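The idea of estimating category proportions directly, rather than classifying each document and then counting, can be illustrated with a small sketch. This is not the software mentioned above and should not be read as the authors' exact procedure. It assumes each document is reduced to a profile S of binary word indicators, that the distribution of profiles within each category, P(S | D), is the same in the hand-coded subset and in the population, and that the population profile distribution satisfies P(S) = P(S | D) P(D). The function names `profile_distribution` and `estimate_proportions` are hypothetical, and the clip-and-renormalize step is a simple stand-in for a properly constrained regression.

```python
import numpy as np


def profile_distribution(features):
    """Return the relative frequency of every binary word profile.

    `features` is an (n_documents, k) array of 0/1 word indicators.
    Each row is mapped to an integer in [0, 2**k) and the result is a
    length-2**k vector of profile frequencies.
    """
    features = np.asarray(features, dtype=int)
    k = features.shape[1]
    weights = 1 << np.arange(k)          # 1, 2, 4, ... one bit per word
    profile_ids = features @ weights     # integer code for each document
    counts = np.bincount(profile_ids, minlength=2 ** k)
    return counts / counts.sum()


def estimate_proportions(p_s_given_d, p_s):
    """Solve P(S) = P(S | D) P(D) for the category proportions P(D).

    `p_s_given_d` has one column per category (estimated from the
    hand-coded subset); `p_s` is the profile distribution in the
    unlabeled population. Least squares, then clip to non-negative
    values and renormalize so the proportions sum to one.
    """
    coef, *_ = np.linalg.lstsq(p_s_given_d, p_s, rcond=None)
    coef = np.clip(coef, 0.0, None)
    return coef / coef.sum()
```

In use, one would build one column of `p_s_given_d` per category by calling `profile_distribution` on the hand-coded documents in that category, build `p_s` from the unlabeled population, and pass both to `estimate_proportions`. No individual document is ever classified, which is why the hand-coded subset does not need to match the population's category proportions under the stated assumption.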

The methods for a key part of this paper build on King and Lu (2007), which the talk will also briefly cover. King and Lu (2007) offers a new method of estimating cause-specific mortality in areas without medical death certification from "verbal autopsy data" (symptom questionnaires given to caregivers). This method turned out to give considerably better estimates than the existing approaches, which include expensive and unreliable physician reviews (where three physicians spend 20 minutes with the answers to the symptom questions from each deceased to decide on the cause of death), expert rule-based algorithms, and model-dependent parametric statistical models.

Copies of the two papers are available at: