Interview: Andrew Gelman, statistician

In which we talk about how to make empirical research a little less bad.

Mar 04, 2022

By Schutz - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=48666043

The past few decades have seen a flowering of empirical research and statistical methodologies. But that empirical revolution has severe growing pains — a replication crisis that has researchers in numerous fields questioning how much they should trust their colleagues’ results and taxpayers wondering how much wasted effort they should fund.

Of all the methodological disciplinarians who have emerged to fight the tide of bad statistics, perhaps none is so fierce or so well-regarded as Columbia University’s Andrew Gelman. While his research output is copious, his well-read blog may be even more influential, having become a hub for critiques of empirical research in a variety of fields. (Andrew and I have even had a couple of arguments of our own over the years!)

In this interview, we talk about both the problems with modern empirical research and about some of the exciting new methods that have been developed in recent years. This just might be the nerdiest interview I’ve done so far, so…enjoy!

N.S.: In the past few years, you've become sort of the scourge of bad statistical papers. I've heard it said that the scariest seven words in academia are: "Andrew Gelman just blogged about your paper". Do you feel that this sort of thing has made empirical researchers more careful about their methodologies?

A.G.: I don't know if I want researchers to be more careful! It's good for people to try all sorts of ideas with data collection and analysis without fear. Indeed, I suspect that common statistical mistakes--for example, reliance on statistical significance and refusal to use prior information--arise from researchers being too careful to follow naive notions of rigor. What's important is not to try to avoid error but rather to be open to criticism and to learn from our mistakes.

N.S.: Ultimately, is the quality of statistical research more fundamentally a matter of opinion and judgment than, say, research in physics or biology?

What's an example of a recent paper you felt was well-done, vs. a recent one you thought was poorly done?

A.G.: I can't really answer your first question here because I'm not sure what you mean by "statistical research." Are you referring to research within the field of statistics, such as a paper in statistical methods or theory? Or are you referring to applied research that uses statistical analysis?

In answer to your second question: Unfortunately, I end up reading many more bad papers than good papers, in part because people keep sending me bad stuff! Back in the early days of the blog, they would send me papers with bad graphs, but now it's often papers with bad statistical analyses. I hate to name just one because then I feel like I'd be singling it out, so let me split the difference and point to two papers that I think were well done in many ways but still I don't buy their conclusions. The first was a meta-analysis of nudge interventions which claimed to find positive effects but I think was made essentially useless by relying on studies that were themselves subject to selection bias; see discussion here: https://statmodeling.stat.columbia.edu/2022/01/10/the-real-problem-of-that-nudge-meta-analysis-is-not-that-it-include-12-papers-by-noted-fraudsters-its-the-gigo-of-it-all/ The second was a regression-discontinuity study reporting that winners of close elections for governors of U.S. states lived five to ten years longer than the losers of the elections. This study again was clearly written with open data, but I don't believe the claims; I think they are artifacts of open-ended statistical analysis, the thing that happens all the time when studying small effects with highly noisy data; see here: https://statmodeling.stat.columbia.edu/2020/07/02/no-i-dont-believe-that-claim-based-on-regression-discontinuity-analysis-that/ I think both these papers have many good features, and I appreciate the effort the authors put into the work, but sometimes your data are just too biased or noisy to allow researchers, no matter how open and scrupulous, to find the small effects that they are looking for. 
Paradoxically, I think there's often a mistaken tendency to think a paper is good because of its methods (in the aforementioned examples, a comprehensive meta-analysis and a clean regression discontinuity) without realizing that it is doomed because of low data quality and weak substantive theory. One of the problems with focusing on gaudy examples of researchers cheating is that we forget that honesty and transparency are not enough (http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics14.pdf).

OK, you asked me to also give an example of a recent paper I felt was well done. I should probably spend more time reading the good stuff, actually! I didn't want to just respond by pointing you to a paper by one of my friends, so to answer your question I went over to the website of the American Political Science Review. The forthcoming articles on their home page look reasonable to me. That said, most of them are not written quite the way I would. Some of this is silly details such as reporting estimates and standard errors to a ridiculous number of decimal places; some of this is an over-reliance on models and estimates without displays of raw data. Scatterplots make me happy, and I feel that many social science research papers make the mistake of considering a table of regression coefficients to be a culmination of their project rather than just part of the story. But I guess that's partly the point: for an empirical research paper to be good, it doesn't have to be a tour de force, it should just add to our understanding of the world, and often that can be done with a realistic view of what the contribution can represent. For example, consider this abstract from "The Curse of Good Intentions: Why Anticorruption Messaging Can Encourage Bribery," by Nic Cheeseman and Caryn Peiffer:

"Awareness-raising messages feature prominently in most anticorruption strategies. Yet, there has been limited systematic research into their efficacy. There is growing concern that anticorruption awareness-raising efforts may be backfiring; instead of encouraging citizens to resist corruption, they may be nudging them to “go with the corrupt grain.” This study offers a first test of the effect of anticorruption messaging on ordinary people’s behavior. A household-level field experiment, conducted with a representative sample in Lagos, Nigeria, is used to test whether exposure to five different messages about (anti)corruption influence the outcome of a “bribery game.” We find that exposure to anticorruption messages largely fails to discourage the decision to bribe, and in some cases it makes individuals more willing to pay a bribe. Importantly, we also find that the effect of anticorruption messaging is conditioned by an individual’s preexisting perceptions regarding the prevalence of corruption."

I like this abstract: It argues for the relevance of the work without making implausible claims. Maybe part of this is that their message is essentially negative: in contrast to much of the work on early childhood intervention (for example, see discussion here: https://statmodeling.stat.columbia.edu/2013/11/05/how-much-do-we-trust-this-claim-that-early-childhood-stimulation-raised-earnings-by-42/), say, they're not promoting a line of research, which makes it easier for them to report their findings dispassionately. I'm not saying that this particular article on anticorruption messaging, or the other recent APSR articles that I looked at, are perfect, just that they are examples of how we can learn from quantitative data. The common threads seem to be good data and plausible effect sizes.

N.S.: Sorry, I should have been more concrete. By "statistical research" I mean "either empirical research or theoretical research into statistical methods". But anyway, I think you answered my question perfectly!

Zooming out a bit, I'm wondering, are there any new kinds of mistakes you see lots of researchers making in empirical work? As in, are there any recently popular techniques that people are misapplying or overapplying? As a possible example of what I mean, I've been seeing a number of regression discontinuity papers whose data plots look like totally uninformative clouds, but who manage to find large effects at the discontinuity only because they assume bizarre, atheoretical trends before and after the discontinuity. I think you've taken apart a couple of these papers.

A.G.: Oh yeah, there's lots of bad regression discontinuity analysis out there; I discussed this in various posts, for example "Another Regression Discontinuity Disaster and what can we learn from it" (https://statmodeling.stat.columbia.edu/2019/06/25/another-regression-discontinuity-disaster-and-what-can-we-learn-from-it/) and "Regression discontinuity analysis is often a disaster. So what should you do instead? Here’s my recommendation:" (https://statmodeling.stat.columbia.edu/2021/03/11/regression-discontinuity-analysis-is-often-a-disaster-so-what-should-you-do-instead-do-we-just-give-up-on-the-whole-natural-experiment-idea-heres-my-recommendation/) and "How to get out of the credulity rut (regression discontinuity edition): Getting beyond whack-a-mole" (https://statmodeling.stat.columbia.edu/2020/01/13/how-to-get-out-of-the-credulity-rut-regression-discontinuity-edition-getting-beyond-whack-a-mole/) and "Just another day at the sausage factory . . . It’s just funny how regression discontinuity analyses routinely produce these ridiculous graphs and the authors and journals don’t even seem to notice." (https://statmodeling.stat.columbia.edu/2021/11/21/just-another-day-at-the-sausage-factory-its-just-funny-how-regression-discontinuity-analyses-routinely-produce-these-ridiculous-graphs-and-the-authors-and-journals-dont-even-seen-to-notice/). I don't actually think regression discontinuity is worse than other methods (we even have a section on the method, with an example, in Regression and Other Stories!). Rather, I think the problem is that a sense of causal identification makes researchers overconfident, and then they forget that ultimately what they're trying to do is learn from observational data, and that needs assumptions--not just mathematical "conditions," but real-world assumptions.
It's similar to how all those psychologists fooled themselves: they were doing randomized experiments and that gave them causal identification, but they didn't realize that this didn't help if they weren't estimating a stable quantity. They were trying to nail a jellyfish to the wall. I will say, though, that bad regression discontinuity analyses have their own special annoyance or amusement in that they are often presented in the published paper with a graph that reveals how ridiculous the fitted model is. It's kind of amazing when an article contains its own implicit refutation, which the authors and editors never even noticed. They're so convinced of the rightness of their method that they don't see what's right in front of them.

N.S.: Are there any other similar examples from recent years, of methods that have been applied in an overly "push-button" way?

A.G.: I'm sure lots of methods have been applied in an overly push-button way. Regression discontinuity is just particularly easy to notice because the graph showing the ridiculousness of the fitted model is often conveniently included in the published paper!

OK, so what do I think are the statistical methods that have been applied in an overly push-button way? Three statistical methods come to mind:

1. Taking an estimated regression coefficient and using it as a parameter estimate going forward. This is so standard we don't even think of it as a choice: the estimate is the estimate, right? That's what we do in our textbooks; it's what everybody does in their textbooks. It's an "unbiased" estimate or something like it, right? Wrong. When you report the estimate that comes out of your fitted model, you're not accounting for selection: the selection in what gets attention, the selection in what gets published, and, before that, the selection in what you decide to focus on, amid all your results. Even the simplest model of selection, saying that the only things that get published are estimates that are more than 2 standard errors away from zero, can correspond to a huge bias. In pages 17-18 of this paper (http://www.stat.columbia.edu/~gelman/research/published/failure_of.pdf), I looked at a much-publicized study of the effects of early childhood intervention on adult earnings, and under reasonable assumptions about possible effect sizes, the bias in the published estimate can be larger than the true effect size! And this has implications if you want to use these estimates to guide cost-benefit analyses and policy decisions (see here: https://statmodeling.stat.columbia.edu/2017/07/20/nobel-prize-winning-economist-become-victim-bog-standard-selection-bias/).

Anyway, that's just one example but this particular statistical error must be happening a million times a year--just about every time a published paper reports a numerical estimate of some effect or comparison. Sometimes maybe it's no big deal because the magnitude of the effect doesn't matter, but (a) often the magnitude _does_ matter, and (b) overestimates of effect sizes percolate forward through the literature, as follows: Suppose you're designing a new study. You want to power it to be large enough to detect realistic effects. The trouble is, you grab those "realistic effects" from old studies, which are subject to bias. Then you conduct a new study that's hopelessly noisy--but you don't know it, as your seemingly rigorous calculations led you to believe you have "80% power," that is, an 80% chance that your target analysis will be statistically significant at a conventional level (https://statmodeling.stat.columbia.edu/2017/12/04/80-power-lie/). But you don't really have 80% power--you really have something like 6% power (https://statmodeling.stat.columbia.edu/2014/11/17/power-06-looks-like-get-used/)--so the comparison you were planning to look at probably won't come out a winner, and this motivates you to go along the forking paths of data processing and analysis until you find something statistically significant. Which doesn't feel like cheating, because, after all, your study had 80% power, so you were supposed to find something, right?? Then this new inflated estimate gets published, and so on. And eventually you have an entire literature filled with overestimates, which you can then throw into a meta-analysis to find apparently conclusive evidence of huge effects (as here: https://statmodeling.stat.columbia.edu/2022/01/10/the-real-problem-of-that-nudge-meta-analysis-is-not-that-it-include-12-papers-by-noted-fraudsters-its-the-gigo-of-it-all/).

2. Using a statistical significance threshold to summarize inferences. Here I'm not talking about selection of what to report, but rather about how inferences are interpreted. Unfortunately, it's standard practice to make these divisions, what the epidemiologist Sander Greenland calls "dichotomania" (https://statmodeling.stat.columbia.edu/2019/09/13/deterministic-thinking-dichotomania/). Again, this error is so common as to be nearly invisible. At one level, this is a problem that people are very aware of, as is evidenced by the common use of terms such as "p-hacking," but I think people often miss the point. To me, the problem is not with p-values or so-called type 1 error rates but with this dichotomization, what I sometimes call the premature collapsing of the wavefunction. I'd rather just accept the uncertainty that we have. Sometimes people say that this is impractical, but my colleagues and I disagree; see for example here (http://www.stat.columbia.edu/~gelman/research/published/abandon.pdf). Approaches to do "statistical significance" better through multiple comparisons adjustments or preregistrations etc. . . . I think that's all missing the point. Examples of this problem come up every day: here's one from the medical literature that we looked into a couple years ago, where a non-statistically significant comparison was reported as a null effect (http://www.stat.columbia.edu/~gelman/research/published/Stents_published.pdf).

One perspective that might help in thinking about this problem--how to summarize a statistical result in a way that could be useful to decision makers--is to consider the problem of A/B testing in industry. The relevant question is not, "Is B better than A?"--a question which, indeed, may have no true answer, given that effects can and will change over time and an intervention can be effective in some scenarios and useless or even counterproductive in others--but, rather, "What can we say about what might happen if A or B is implemented?" Any realistic answer to such a question will have uncertainty--even if your past sample size is a zillion, you'll have uncertainty when extrapolating to the future. I'm not saying that such A/B decisions are easy, just that it's foolish to dichotomize based on the data. Summarizing results based on statistical significance is just a way of throwing away information.

3. Bayesian inference. Lots has been written about this too. The short story here is that the posterior probability is supposed to represent your uncertainty about your unknowns--but this is only as good as your model, and we often fill our models with conventional and unrealistic specifications. A simple example: suppose you do a clean randomized experiment and you get an estimate of, ummm, 0.2 (on some meaningful scale) with a standard error of 0.2. If you used a flat or so-called noninformative prior, this would imply that your posterior is approximately normally distributed with mean 0.2 and standard deviation 0.2, which implies an 84% posterior probability that the underlying effect is positive. So: you get an estimate that's 1 standard error from 0, consistent with pure noise, but it leads you to an 84% probability, which if you take it seriously implies you'd bet with 5-to-1 odds that the true effect is greater than 0. To offer 5-to-1 odds based on data that could easily be explainable by chance alone, that's ridiculous. As Yuling and I discuss in section 3 of our article on Holes in Bayesian Statistics (http://www.stat.columbia.edu/~gelman/research/published/physics.pdf), the problem here is in the uniform prior, which upon reflection doesn't make sense but which people use by default--hell, we use it by default in our Bayesian Data Analysis book!

In that case, how is it that Bayesians who read our book (or others) aren't wandering the streets in poverty, having lost all their assets in foolish 5-to-1 bets on random noise? The answer is that they know not to trust all their inferences. Or, at least, they know not to trust _some_ of their inferences. The trouble is that this approach, of carefully walking through the posterior as if it were a minefield, avoiding the obviously stupid inferences that would blow you up, won't necessarily help you avoid the less-obviously mistaken inferences that can still hurt you. The problem comes from the standard Bayesian ideology which states that you should be willing to bet on all your probabilities.

I think that Bayesian errors are less common than the other two errors listed above, only because Bayesian methods are used less frequently. But when we make probabilistic forecasts, we pretty much have to think Bayesianly, and in that case we have to wrestle with where are we, relative to the boundaries of our knowledge. We discussed this in the context of election forecasts of the 2020 election; see here (https://statmodeling.stat.columbia.edu/2020/10/28/concerns-with-our-economist-election-forecast/), here (https://statmodeling.stat.columbia.edu/2020/07/31/thinking-about-election-forecast-uncertainty/), and here (https://statmodeling.stat.columbia.edu/2020/10/24/reverse-engineering-the-problematic-tail-behavior-of-the-fivethirtyeight-presidential-election-forecast/). We got some pushback on some of this, but the point is to start with the recognition that your model will be wrong and to perturb it to get a sense of how wrong it is. Rather than walking around the minefield or carefully tiptoeing through it, we grab a stick and start tapping wherever we can, trying to set off some explosions so we can see what's going on.
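To make the arithmetic in the flat-prior example concrete, here is a minimal sketch in Python. It assumes a normal likelihood, so that a flat prior gives a normal posterior centered at the estimate with standard deviation equal to the standard error, as described above. The function names and the informative prior (mean 0, standard deviation 0.1) are illustrative assumptions and are not taken from the interview or the linked article.

```python
from statistics import NormalDist


def posterior_normal(estimate, std_error, prior_mean=None, prior_sd=None):
    """Posterior mean and sd for a normal likelihood with a flat or normal prior."""
    if prior_sd is None:
        # Flat prior: the posterior mirrors the estimate and its standard error.
        return estimate, std_error
    # Conjugate normal update: precisions (1 / variance) add.
    data_precision = 1 / std_error ** 2
    prior_precision = 1 / prior_sd ** 2
    post_var = 1 / (data_precision + prior_precision)
    post_mean = post_var * (data_precision * estimate + prior_precision * prior_mean)
    return post_mean, post_var ** 0.5


def prob_positive(mean, sd):
    """Posterior probability that the effect is greater than zero."""
    return 1 - NormalDist(mean, sd).cdf(0)


# The example from the text: estimate 0.2, standard error 0.2, flat prior.
flat_mean, flat_sd = posterior_normal(0.2, 0.2)
p_flat = prob_positive(flat_mean, flat_sd)
odds_flat = p_flat / (1 - p_flat)

# Hypothetical informative prior (an assumption for illustration only):
# effects are expected to be small, centered at 0 with sd 0.1.
inf_mean, inf_sd = posterior_normal(0.2, 0.2, prior_mean=0.0, prior_sd=0.1)
p_inf = prob_positive(inf_mean, inf_sd)

print(f"flat prior:        P(effect > 0) = {p_flat:.2f}, odds = {odds_flat:.1f} to 1")
print(f"informative prior: P(effect > 0) = {p_inf:.2f}")
```

Under these assumptions the flat prior gives roughly an 84% probability and odds a little above 5 to 1, matching the numbers in the text, while the hypothetical prior pulls the posterior toward zero and the probability down to about two in three. The specific prior is arbitrary; the point is only how strongly the conclusion depends on it.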

Taking all these examples together, along with the regression discontinuity thing I talked about earlier, I see a common feature, which is an apparent theoretical rigor leading to overconfidence and a lack of reflection. The theory says you have causal identification, or unbiased estimation, or a specified error rate, or coherent posterior probabilities, so then you just go with that, (a) without thinking about the ways in which the assumptions of the theory aren't satisfied, and (b) without thinking about the larger goals of the research.
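The first point above, that a significance filter inflates published estimates and makes nominal power calculations misleading, can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of any analysis in the linked papers: the true effect (0.1) and standard error (0.4) are hypothetical values chosen so that the power lands near the 6% figure mentioned in the text, and the publication rule is the simple one described there (publish only estimates more than 2 standard errors from zero).

```python
import random
from statistics import NormalDist, mean


def power(true_effect, std_error, alpha=0.05):
    """Chance that a two-sided z-test at level alpha comes out significant."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    z = true_effect / std_error
    return (1 - std.cdf(z_crit - z)) + std.cdf(-z_crit - z)


def simulate_published(true_effect, std_error, n_studies=100_000, seed=1):
    """Simulate many studies and keep only estimates beyond 2 standard errors."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > 2 * std_error:
            published.append(estimate)
    return published


TRUE_EFFECT = 0.1  # hypothetical small true effect
STD_ERROR = 0.4    # hypothetical noisy study

design_power = power(TRUE_EFFECT, STD_ERROR)
published = simulate_published(TRUE_EFFECT, STD_ERROR)

# How much larger is the typical published magnitude than the true effect?
exaggeration = mean(abs(e) for e in published) / TRUE_EFFECT
# What share of published estimates point in the wrong direction?
share_wrong_sign = sum(e < 0 for e in published) / len(published)

print(f"power: {design_power:.3f}")
print(f"published: {len(published)} of 100000 simulated studies")
print(f"exaggeration ratio: {exaggeration:.1f}")
print(f"share with wrong sign: {share_wrong_sign:.2f}")
```

With these assumed values every published estimate is, by construction, at least eight times the true effect in magnitude, and a noticeable share have the wrong sign. A new study powered against such published numbers would look well designed on paper while having power of only around 6%.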

N.S.: About that third one...What you said reminds me of this old post by Stephen Senn, which argues that researchers can't do true Bayesian inference, in the philosophical subjective sense, because they can't quantify their own prior. In fact, I recall that you liked that post a lot. So basically, if researchers can't write down what their own prior really is, then while the inference they're doing may use Bayes' Rule and a so-called "prior", it's not really Bayesian inference. So if that's true, do any of the arguments that we typically see for Bayesian over frequentist inference -- for example, the Likelihood Principle -- really hold? And if so, is there any general reason researchers should keep using so-called "Bayesian" methods, when they're cumbersome and unwieldy?

A.G.: Sure, in that case researchers can't do true inference of any sort, as in practice it's rare that the mathematical assumptions of our models are satisfied. We rarely have true probability sampling, we rarely have clean random assignment, and we rarely have direct measurements of what we ultimately care about. For example, an education experiment will be performed in schools that permit the study to be done, not a random sample of all schools; the treatment will be assigned differently in different places and can be altered by the teachers on the ground; and outcome measures such as test scores do not fully capture long-term learning. That's fine; we do our best. I see no reason to single out Bayesian inference here. Many writers on statistics strain at the gnat of the prior distribution while swallowing the camel of the likelihood. All those logistic regressions, independence assumptions, and models with constant parameters: where did they all come from, exactly? In your sentences above, I request that you replace the word "prior" everywhere with the word "model." "If researchers can't write down what their own model really is," etc. As someone with economics training, you will be aware that models are valuable, they can be complicated, and they are in many ways conventional, constructed from available building blocks such as additive utility models, normal distributions, and so forth. Different models have different assumptions, but it would be naive to think that the model you happen to use for a particular problem is, or could be, "what your own model really is."

Regarding some of the specifics: You say if bla bla bla then "it's not really Bayesian inference." I disagree. Bayesian inference is the math; it's the mapping from prior and data model to posterior, it's the probabilistic combination of information. Bayesian inference can give you bad results--I talked about that in my answer to your previous question--but it's still Bayesian inference. We could say the same thing about arithmetic. Suppose I think I have a pound of flour and I give you 6 ounces, I should have 10 ounces left. If my original weighing was wrong and I only had 15 ounces to start, then my analysis is wrong, as I will only have 9 ounces left. But the problem is not with the math, it's with my assumptions.

OK, at this point it might sound like I'm saying that Bayesian inference can't fail; it can only be failed. But that's not what I'm trying to say. I just said that Bayesian inference is the math, but it's also what goes into it. This has been a lot of what statistics research has been about: constructing families of models that work in more general situations. As Hal Stern says, the most important thing about a statistical model is not what it does with the data but what data it uses, and often what makes a statistical method useful is that it can make use of more data. An example from my own work is how we use Mister P (MRP, multilevel regression and poststratification) to make population inferences: we're using information from the structure of the data in the survey or experiment at hand, and also including information about the population. Another example would be modern machine learning methods that use overparameterization and regularization, which allows more predictors to be flexibly included in their webs (see section 1.3 of this article: http://www.stat.columbia.edu/~gelman/research/published/stat50.pdf). The point is that statistical methods exist within a social context: it's the method and also how it's used.
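As a rough illustration of how population information enters in MRP, the sketch below shows only the poststratification step: cell-level estimates are reweighted by population counts rather than by how often each cell happens to appear in the sample. The cells, estimates, and counts are all hypothetical, and the multilevel regression that would normally produce the cell estimates is not implemented here; the estimates are simply taken as given.

```python
# Hypothetical cell-level estimates (in practice these would come from a
# fitted multilevel regression), with sample sizes and population counts.
cells = [
    {"cell": "age 18-34", "estimate": 0.60, "sample_n": 50, "population_n": 3000},
    {"cell": "age 35-64", "estimate": 0.50, "sample_n": 150, "population_n": 4000},
    {"cell": "age 65+", "estimate": 0.40, "sample_n": 300, "population_n": 3000},
]


def weighted_average(cells, weight_key):
    """Average the cell estimates using the given field as weights."""
    total_weight = sum(c[weight_key] for c in cells)
    return sum(c["estimate"] * c[weight_key] for c in cells) / total_weight


# Weighting by sample size reproduces the unadjusted sample average,
# which over-represents the oldest group in this made-up survey.
sample_weighted = weighted_average(cells, "sample_n")

# Poststratification: weight each cell by its share of the population.
poststratified = weighted_average(cells, "population_n")

print(f"sample-weighted estimate: {sample_weighted:.2f}")
print(f"poststratified estimate:  {poststratified:.2f}")
```

In this made-up example the sample over-represents the oldest group, so the unadjusted average sits below the poststratified one. The multilevel model matters most when some cells have few or no respondents, which this sketch does not attempt to show.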

A couple more points. You ask about the likelihood principle. I think the likelihood principle is kinda bogus. We discuss this in our Bayesian Data Analysis book (http://www.stat.columbia.edu/~gelman/book/). In short, the likelihood comes from the data model--in Bayesian terms, the probability distribution of the data given the parameters. But we don't know this model--it's just an idealization that we construct, so we have to check its fit to data. The likelihood principle is only relevant conditional on the model, which we don't actually know.

Finally, you ask whether there is any general reason researchers should keep using Bayesian methods, when they're cumbersome and unwieldy. Ummm, sure, if you can get results that are just as good with less effort, go for it! I guess you're thinking of problems where you have a small number of parameters, tons of data, and low uncertainty. The problems I've seen often have the opposite characteristics! Look at it the other way: if Bayesian methods are really so "cumbersome and unwieldy," why do we use them at all? Are we just gluttons for punishment? Actually, for many problems a Bayesian approach is fast, direct, and simple, while comparable non-Bayesian methods are cumbersome, requiring awkward approximations that necessitate lots of concern. For example, recently my colleagues and I have been working on a Bayesian model for calibration in laboratory assays. Classical approaches get hung up on measurements that are purportedly above or below detection limits, and they end up throwing away information and giving estimates that don't make sense. Another example is this project (http://www.stat.columbia.edu/~gelman/research/published/chickens.pdf) where we used Bayesian methods to partially adjust for experimental control data in a way that would be difficult using other approaches--indeed, we were motivated by an application where the standard approach using significance testing was horribly wasteful of information. This is not to say that Bayes is always best. I've given some Bayesian success stories; there are lots of success stories of other methods too. To understand why researchers use a method, it makes sense to look at where it has solved problems that cannot easily be solved in other ways.

N.S.: Gotcha. OK, so let me go one step further regarding methods here. In 2001, Leo Breiman wrote a very provocative essay entitled "Statistical Modeling: The Two Cultures", in which he argued that in many instances, researchers should stop worrying so much about modeling the data in an explicable way, and focus more on prediction. He shows how some basic early machine-learning type approaches were already able to yield consistently better predictions than traditional data models. And then a decade or so later, deep learning sort of explodes on the scene, and starts accomplishing all these magical feats -- beating human champions at Go, generating text that sounds as if a human wrote it, revolutionizing the world of protein folding, etc. Does this revolution mean that classical statistics needs to change its ways? Was Breiman basically vindicated? Should statisticians move more toward algorithmic-type approaches, or focus on problems where data is sparse enough that lots of theoretical assumptions are needed, and thus classical modeling approaches still work best?

A.G.: Hey--I wrote something about that Breiman paper (see here: http://www.stat.columbia.edu/~gelman/research/published/gelman_breiman.pdf). Short story is: dude had some blind spots. But we all have blind spots; what's important is what we do, not what we can't or won't do. Anyway, yes, there's been a revolution in overparametrized models and regularization, with computer Go champions and all sorts of amazing feats. It's good to have these sorts of tools available; at the same time, traditional concerns of statistical design and analysis remain important for lots of important problems. At one end of things, problems in areas ranging from pharmacology to psychometrics involve latent parameters (concentrations of a drug within bodily compartments, or individual abilities and traits), and I think that to do inference for such problems, you need to do some modeling: pure prediction has fundamental limitations when your goal is to learn about latent properties. At the other end, lots of applied social science has statistical challenges arising from high variability and weak theory (consider, for example, those studies of early childhood intervention that I discussed earlier): for these, the big problems involve adjusting for bias, combining information from multiple sources, and handling uncertainty, which are core problems of statistics.

You ask whether statisticians should move toward algorithmic approaches. I'd say that statistics has always been algorithmic. There's a duality between models and algorithms: start with a model and you'll need an algorithm to fit it; start with an algorithm and you'll want to understand how it can fail; this is modeling. Lots of us who do applied statistics spend lots of time developing algorithms, not necessarily because that's what we want to do, but because existing algorithms are designed for old problems and won't always work on our new ones.

N.S.: I love that you've already written about practically every question I have! I just hope you don't mind repeating yourself! Anyway, one other thing I wanted to get your thoughts on was the publication system and the quality of published research. The replication crisis and other skeptical reviews of empirical work have got lots of people thinking about ways to systematically improve the quality of what gets published in journals. Apart from things you've already mentioned, do you have any suggestions for doing that?

A.G.: I wrote about some potential solutions on pages 19-21 of this article from a few years ago (http://www.stat.columbia.edu/~gelman/research/published/failure_of.pdf). But it's hard to give more than my personal impression. As statisticians or methodologists we rake people over the coals for jumping to causal conclusions based on uncontrolled data, but when it comes to science reform, we're all too quick to say, "Do this" or "Do that." Fair enough: policy exists already and we shouldn't wait on definitive evidence before moving forward to reform science publication, any more than journals waited on such evidence before growing to become what they are today. But we should just be aware of the role of theory and assumptions in making such recommendations. Eric Loken and I made this point several years ago in the context of statistics teaching (http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics2.pdf), and Berna Devezer et al. published an article last year critically examining some of the assumptions that have at times been taken for granted in science reform (https://royalsocietypublishing.org/doi/10.1098/rsos.200805).

When talking about reform, there are so many useful directions to go, I don't know where to start. There's post-publication review (which, among other things, should be much more efficient than the current system for reasons discussed here: https://statmodeling.stat.columbia.edu/2016/12/16/an-efficiency-argument-for-post-publication-review/), there are all sorts of things having to do with incentives and norms (for example, I've argued that one reason scientists act so defensive when their work is criticized is how they're trained to react to referee reports in the journal review process: https://statmodeling.stat.columbia.edu/2018/01/13/solution-puzzle-scientists-typically-respond-legitimate-scientific-criticism-angry-defensive-closed-non-scientific-way/), and there are various ideas adapted to specific fields.

One idea I saw recently that I liked was from the psychology researcher Gerd Gigerenzer, who wrote that we should consider stimuli in an experiment as being a sample from a population rather than thinking of them as fixed rules (https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/we-need-to-think-more-about-how-we-conduct-research/DFAE681F3EEF581CEE80139BB63DFF6F). That's an interesting idea in part because of its connection to issues of external validity or out-of-sample generalization, which are so important when trying to make statements about the outside world.

N.S.: OK, last questions. What are some of the interesting problems you're working on now -- can you give us a taste? And also, for young people getting started in your field, do you have any key pieces of advice?

A.G.: What am I working on now? Mostly teaching and textbooks! My colleagues and I have been trying to integrate modern ideas of statistics (involving modeling, measurement, and inference) with ideas of student-centered learning. The idea is that students spend their time in class working in pairs figuring things out, and I can walk around the room seeing what they're doing and helping them when they get stuck. In creating these courses, we're trying to put together all the pieces of the puzzle, including creating class-participation activities for every class period. And this has been making me think a lot about workflow and some fundamental questions of what we are doing when we do statistical data analysis. It looks a lot like science, in that we develop theories, make conjectures, and do experiments.

Stepping back a bit to consider methods, my colleagues and I have been thinking a lot about MRP (multilevel regression and poststratification), poststratifying to estimate population quantities and causal effects, poststratifying on non-census variables, priors for models with deep interactions, computation for all these models, and leveraging the concentration property by which, as our problems become larger, distributions become closer to normal, allowing approximate computation to be more effective. That brings us to methods that we're working on for validating approximate computation, along with methods for predictive model averaging and computational tools for statistical workflow (https://statmodeling.stat.columbia.edu/2021/11/19/drawing-maps-of-model-space-with-modular-stan/). For some of the general ideas, see our papers, Bayesian Workflow (http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf) and Toward a taxonomy of trust for probabilistic machine learning (http://www.stat.columbia.edu/~gelman/research/unpublished/taxonomy.pdf)--I'm lucky to have some great collaborators!

And what's it all for? Mostly it's for other people--users of Stan and other probabilistic programming languages, readers of our textbook, pollsters, laboratory researchers, policy analysts, etc. It's also motivated by the studies we are doing on political polarization and various projects related to survey research. I guess you can get some idea of what I've been working on by going to the published and unpublished articles on my home page, as they're listed in reverse chronological order.

Finally, what advice do I have for young people getting started? I don't know! I think that they can get better career advice from people closer to their own situation. I'm happy to offer statistical advice, though. From appendix B of Regression and Other Stories, here are our 10 quick tips to improve your regression modeling:

1. Think about variation and replication

2. Forget about statistical significance

3. Graph the relevant and not the irrelevant

4. Interpret regression coefficients as comparisons

5. Understand statistical methods using fake-data simulation

6. Fit many models

7. Set up a computational workflow

8. Use transformations

9. Do causal inference in a targeted way, not as a byproduct of a large regression

10. Learn methods through live examples
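
Tip 5 lends itself to a short illustration. The sketch below is not from the interview or from Regression and Other Stories; it is a minimal Python example, with hypothetical function names and made-up parameter values, of what a fake-data simulation can look like: simulate data from a regression with known coefficients, fit the model, and repeat to see how much the estimated slope varies from one simulated dataset to the next.

```python
import random
import statistics


def simulate_fake_data(n, intercept, slope, sigma, rng):
    """Draw n (x, y) pairs from y = intercept + slope * x + normal noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares fit with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def slope_estimates(n_sims, n, intercept, slope, sigma, seed=0):
    """Repeat simulate-then-fit n_sims times and collect the slope estimates."""
    rng = random.Random(seed)  # fixed seed so the simulation is reproducible
    estimates = []
    for _ in range(n_sims):
        xs, ys = simulate_fake_data(n, intercept, slope, sigma, rng)
        estimates.append(fit_line(xs, ys)[1])
    return estimates


if __name__ == "__main__":
    # All values here are arbitrary choices for illustration.
    true_slope = 2.0
    estimates = slope_estimates(
        n_sims=1000, n=50, intercept=1.0, slope=true_slope, sigma=3.0
    )
    print("true slope:", true_slope)
    print("mean of estimates:", statistics.mean(estimates))
    print("sd of estimates:", statistics.stdev(estimates))
```

Because the true slope is known, the spread of the estimates shows directly how far a single fit could land from the truth at this sample size and noise level, which also speaks to tip 1.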

5 Comments

Alex S

Mar 6, 2022

How do you conduct these interviews? I would've expected every question provided at the start considering the long answers - and if someone was sending me one question at a time I'd wonder if they were leading me somewhere.

But from the back and forth it does sound like it went one email at a time.

Carl Mosk

Mar 5, 2022

This is a brilliant interview/exposition. Get away from the fancy implausible econometric tricks. Stick to problems where the data set you are using is robust enough to do the heavy lifting. Pay attention to the inherent quality and reliability of your data. Use common sense. Bravo!
