Data analysis is critical to science, public policy, and business. Despite its importance, statistical analysis is difficult to author. Existing statistical tools prioritize mathematical expressivity and computational control, making them low-level, while analysts' motivating research questions and hypotheses are high-level. Analysts must translate their questions and hypotheses into low-level statistical code by considering possible statistical approaches, selecting an approach, and accurately formulating a statistical model. This process is error-prone and requires statistical expertise beyond that of many analysts.
In this talk, I will introduce two new statistical tools that prevent common analysis mistakes and enable statistical non-experts to author valid analyses. Tea helps analysts directly express their hypotheses and select statistical tests for assessing them. Tea's key insight is that statistical test selection can be cast as a constraint satisfaction problem. Tisane enables statistical non-experts to author generalized linear models, with or without mixed effects, which are difficult for even statistical experts to author. Tisane derives statistical models from conceptual models that analysts express in a high-level domain-specific language: it translates their conceptual models into causal DAGs and engages analysts in a disambiguation process to arrive at an output statistical model. Real-world researchers have already used these tools to conduct analyses in published research that pushes their own disciplines forward.
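To make the constraint-satisfaction framing concrete, here is a minimal sketch (not Tea's actual implementation, which relies on a constraint solver): each candidate test declares preconditions on the data, and a test is applicable exactly when the data's properties satisfy all of them. The test names, property keys, and data properties below are illustrative assumptions, not Tea's API.

```python
# Minimal sketch: statistical test selection as constraint satisfaction.
# Each test declares preconditions on the data; a test is applicable
# iff every precondition holds for the data's properties.

TESTS = {
    "students_t":   {"dtype": "ratio",   "normal": True,  "groups": 2, "equal_variance": True},
    "welchs_t":     {"dtype": "ratio",   "normal": True,  "groups": 2, "equal_variance": False},
    "mann_whitney": {"dtype": "ordinal", "normal": False, "groups": 2, "equal_variance": False},
}

def select_tests(data_props):
    """Return every test whose preconditions the data satisfies."""
    return [name for name, preconditions in TESTS.items()
            if all(data_props.get(prop) == val
                   for prop, val in preconditions.items())]

# Ratio-scaled, normally distributed data in 2 groups with unequal variances:
props = {"dtype": "ratio", "normal": True, "groups": 2, "equal_variance": False}
print(select_tests(props))  # → ['welchs_t']
```

A real system adds many more tests and properties, checks assumptions (e.g., normality) against the data itself, and can return every applicable test rather than forcing a single choice, which is part of what makes the constraint formulation attractive.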
These systems serve as platforms for future research into making the entire data lifecycle more approachable for statistical non-experts. They also exemplify the promise of combining techniques from human-computer interaction, programming languages/software engineering, and statistics to tackle challenges in data analysis for complex data (e.g., multiple data tables, multi-modal sensor inputs) across multiple domains (e.g., computer vision, digital humanities, fabrication).