Tal Yarkoni recently published an article arguing that psychological science suffers from a generalizability crisis. Although this article has caused quite a stir in the field (and quite a bit of confusion), the issue Yarkoni discusses is by no means new. It has been known to methodologists for at least a couple of decades. It is also intimately connected to the problems that led to the demise of positivism and falsificationism in the philosophy of science. But Yarkoni has provided a new statistical formulation of this problem and brought it to the attention of mainstream researchers in psychology.

I discuss the basic methodological issue, divorced from the statistical formulation, below.

The fundamental problem

Research questions, hypotheses, and conclusions in psychological science tend to be expressed in very general terms, without contextual qualifications. The implicit assumption is often that the research addresses human nature and general principles of behavior. There are at least two reasons for this.

  • The first is the nomothetic ideal that stems from our discipline’s positivist heritage and encourages the pursuit of context-free laws.
  • The second is that an article today comes across as more important (or “cooler”), and has a greater chance of being published in a high-impact journal, if it purports to unveil a basic truth about human psychology than if it only covers the behavior of, say, American college students in a particular artificial situation.

But are such broad formulations really justifiable? Is it defensible to hide or ignore potential contextual contingencies?

To draw valid conclusions from empirical research, our claims need to match the observations that were made. This means that all the theoretical concepts that are implicitly or explicitly used need to match concrete aspects of the study. This is known as the problem of construct validity in the methodological literature. Following Campbell’s classical typology, we need to obtain construct validity in terms of participants (or whatever the units of research are), instruments, manipulations, and situations.

The problem that researchers commonly rely exclusively on WEIRD (Western, Educated, Industrialized, Rich, Democratic) participant samples is well known (see my previous blog post). It is less well known that we also need samples of instruments, manipulations, and situations that cover the constructs, causes, and situations we want to make claims about. Research articles are full of sweeping claims about broad theoretical constructs, although what the studies have looked at are really just specific, often arbitrarily selected operationalizations of these constructs out of a broad universe of possible operationalizations. Nor do researchers typically use a statistical (multilevel) model that permits generalizations to populations of instruments, manipulations, and situations.
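To make the last point concrete, here is a toy simulation. It is only a sketch under assumptions of my own, not Yarkoni's actual model: the function name, the number of stimuli, the sample sizes, and the standard deviations are all made up for illustration. The true average effect across the population of stimuli is set to zero, but each sampled stimulus has its own effect. The pooled analysis ignores the stimuli and treats every observation as independent; the by-stimulus analysis treats the stimulus means as the unit of analysis, which is a crude stand-in for a multilevel model with random stimulus effects.

```python
import random
import statistics

N_STIMULI = 8     # assumed number of stimuli sampled per study
Z_CRIT = 1.96     # two-sided 5% normal cutoff (approximation for the large pooled sample)
T_CRIT = 2.365    # two-sided 5% t cutoff for df = N_STIMULI - 1 = 7


def simulate_false_positive_rates(n_studies=1000, n_obs=50,
                                  stimulus_sd=0.5, noise_sd=1.0, seed=1):
    """Return (pooled_rate, by_stimulus_rate) when the true mean effect is zero."""
    rng = random.Random(seed)
    pooled_hits = 0
    by_stimulus_hits = 0
    for _ in range(n_studies):
        all_scores = []
        stimulus_means = []
        for _ in range(N_STIMULI):
            # Each stimulus has its own effect, drawn around a population mean of 0.
            effect = rng.gauss(0.0, stimulus_sd)
            scores = [effect + rng.gauss(0.0, noise_sd) for _ in range(n_obs)]
            all_scores.extend(scores)
            stimulus_means.append(statistics.fmean(scores))

        # Pooled test: ignores stimuli, treats all observations as independent.
        pooled_se = statistics.stdev(all_scores) / len(all_scores) ** 0.5
        if abs(statistics.fmean(all_scores) / pooled_se) > Z_CRIT:
            pooled_hits += 1

        # By-stimulus test: one-sample t-test on the stimulus means.
        stim_se = statistics.stdev(stimulus_means) / N_STIMULI ** 0.5
        if abs(statistics.fmean(stimulus_means) / stim_se) > T_CRIT:
            by_stimulus_hits += 1

    return pooled_hits / n_studies, by_stimulus_hits / n_studies


if __name__ == "__main__":
    pooled_rate, by_stimulus_rate = simulate_false_positive_rates()
    print("pooled false-positive rate:", pooled_rate)
    print("by-stimulus false-positive rate:", by_stimulus_rate)
```

Under these assumed settings, the pooled analysis should reject the (true) null hypothesis far more often than the nominal 5%, whereas the by-stimulus analysis should stay close to it. The pooled analysis only licenses a claim about these particular eight stimuli, not about the construct they were sampled from.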

Solving the problem

This problem might seem insurmountable at first glance. How would it be possible to obtain representative samples of all features of the study? How would we specify the total universe of instruments, manipulations, and situations?

We would do it in the same way we obtain representative samples from the total population of individuals. Obtaining a completely random sample is not feasible in a psychology that aspires to make broad claims about all of humanity. Contra positivist philosophy of science, making broad inductivist generalizations that are completely uncontaminated by theoretical assumptions is not possible. We inevitably rely on theoretical assumptions about relevant variations among the instances of the population when we generalize. We might, for example, try to make the sample of participants representative in terms of age, gender, education, income, nation, political identity, ethnicity, culture, or geographic region—that is, we are assuming that these specific variations are relevant to the phenomena we are studying. In the same way, we need to think carefully about theoretically relevant variations in the instruments (e.g., self-report, peer report, or observations), manipulations (e.g., techniques, frequency, or duration), or situations (e.g., anonymous or non-anonymous) and take these into consideration in our studies.

Would it perhaps be the case, as some commentators have thought, that the generalizability problem does not even emerge on a falsificationist philosophy of science?

Popper argued that (a) science is nomothetic, and (b) universal claims can be deductively falsified (through Modus Tollens) but never verified, and therefore (c) science is characterized by falsification. Would it perhaps then be enough to test one prediction that operationalizes the hypothesis out of many possible predictions? After all, logic dictates that the hypothesis cannot be true if any prediction that follows from it turns out to be false. The problem with this line of reasoning is that Popper’s deductivist account of falsification never worked out. When a result conflicts with the hypothesis, this does not necessarily mean that the hypothesis is false; it might be the case that some auxiliary assumption about, for instance, the instruments, participants, or manipulations, rather than the hypothesis, was false (this is also known as the Duhem-Quine thesis). What is deductively refuted is the conjunction of the hypothesis and all of the myriad auxiliary assumptions that we need to rely on to test it. Therefore, as Popper was forced to acknowledge, a falsification is in the end a practical decision about what to attribute the results to, and a complex one, rather than a simple deductive relation between hypothesis and observation statement. To rigorously falsify or corroborate a hypothesis in a falsificationist spirit, we need to repeat the study with different populations, instruments, and situations, guided by theoretical assumptions and critical thinking about relevant variations, and analyze the extent to which the result is robust versus sensitive to different auxiliary assumptions.
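The logical point about auxiliary assumptions can be written out schematically. The symbols are my own shorthand: H is the hypothesis, A_1 to A_n are the auxiliary assumptions, and P is the prediction that was tested.

```latex
\begin{align*}
\text{Premise 1:}\quad & (H \land A_1 \land \dots \land A_n) \rightarrow P \\
\text{Premise 2:}\quad & \neg P \\
\text{Conclusion:}\quad & \neg (H \land A_1 \land \dots \land A_n)
  \;\equiv\; \neg H \lor \neg A_1 \lor \dots \lor \neg A_n
\end{align*}
```

Modus Tollens delivers only the disjunction on the right. Logic alone does not tell us which of the disjuncts is to blame, and this is where the practical decision comes in.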


The generalizability problem is a real problem in psychological science and the philosophy of science; it cannot be dissolved by simple philosophical maneuvers. At the same time, it does not amount to an insurmountable crisis that should lead to the abandonment of the entire enterprise of quantitative psychology. A more constructive move is for researchers, editors, and reviewers to get better at promoting careful, nuanced, contextually contingent generalizations over wild, hyped-up overgeneralizations, and at promoting tests of the robustness of results across relevant variations in participants, instruments, manipulations, and situations. Scientific inquiry is, for all its successes, inevitably fallible, imperfect, and limited. It should always be subject to critical scrutiny and improvement.