The analysis of public messages posted on social media websites such as Twitter is widely acknowledged to hold great promise for the measurement of public opinion. The existing literature, however, has yet to deliver on this promise. This paper proposes potential solutions to the two main challenges that must be overcome: the lack of the individual demographic information needed to correct for selection bias in who participates on Twitter, and the reliance on Twitter data sampled using keywords. First, I apply machine learning methods to infer Twitter users' sociodemographic traits (age, gender, party ID, race, income) using the text of their tweets and their social networks. To do so, I rely on a sample of 250,000 Twitter users in the U.S. matched with voter registration records and Zillow estimates of home values. This analysis reveals that network composition is more informative about users' personal traits than the text of their tweets. Second, I apply this method to estimate the characteristics of a panel of 500,000 Twitter users in the U.S. selected at random at the user level. Combining these two innovations, I explore whether weighted estimates of Twitter sentiment approximate two public opinion time series: presidential job approval and public opinion polls in the 2016 GOP presidential primary election.
The paper is available <a href="http://pablobarbera.com/static/less-is-more.pdf">here</a>.