Noise and statistical analysis

Tyler Cowen quotes a story in the Atlantic about a study that finds that genetic variation can explain 11 percent of the variation in educational attainment.

“consider that household income explains just 7 percent of the variation in educational attainment, which is less than what genes can now account for.”

In any statistical study of this sort, I think it helps to think about the sources and impacts of mis-measurement of key variables.

For example, if you measure “educational attainment” as years of schooling, you are ignoring differences in quality. You also have to deal with artificial factors that push people toward specific numbers. If you count graduating high school as 12, then a lot of people who are actually below that level will have been carried along to that point, and some people who continue to teach themselves outside of school will be stuck at that point.

Also, household income is a noisy measure of overall economic status. Maybe this year the household earned much less than usual, or much more.

Measurement error in either the independent variable, the dependent variable, or both, will drive down the percentage of variation that is “explained” by the independent variable.
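A small simulation can make this concrete. The sketch below is only an illustration under assumptions of my own choosing, not anything taken from the study: the variables are made up, the true share of variation explained is set to 25 percent, and the added noise has the same variance as the true variable (so each noisy measure is half signal, half noise). The function names `r_squared` and `simulate_attenuation` are hypothetical.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation: the share of variation in ys 'explained' by xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def simulate_attenuation(n=50_000, seed=0):
    rng = random.Random(seed)
    # Assumed "true" variables, each with variance 1; the true R^2 is 0.25.
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y_true = [0.5 * x + rng.gauss(0, 0.75 ** 0.5) for x in x_true]
    # Observed versions: add independent noise with variance 1,
    # so half of each observed variable's variance is measurement error.
    x_obs = [x + rng.gauss(0, 1) for x in x_true]
    y_obs = [y + rng.gauss(0, 1) for y in y_true]
    return {
        "no_error": r_squared(x_true, y_true),
        "error_in_x": r_squared(x_obs, y_true),
        "error_in_y": r_squared(x_true, y_obs),
        "error_in_both": r_squared(x_obs, y_obs),
    }


if __name__ == "__main__":
    for label, value in simulate_attenuation().items():
        print(f"{label:>14}: R^2 = {value:.3f}")
```

Under these assumptions the explained share should come out near 0.25 with no error, near 0.125 with error in one variable, and near 0.0625 with error in both. Each noisy measure multiplies the explained share by its reliability (here one half), and the underlying relationship is the same in all four cases.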

Also, genetic characteristics and household income may be correlated. If the former is measured more precisely than the latter, then the former will appear to matter more.
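Here is a rough sketch of that last point, again under made-up assumptions rather than anything from the study. Two hypothetical factors, labeled `genes` and `income`, have a correlation of 0.5 and exactly the same true effect (0.3 each) on a hypothetical `schooling` outcome. The only asymmetry is that income is observed with noise while genes are observed exactly.

```python
import random


def cov(xs, ys):
    """Population covariance of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def r_squared(xs, ys):
    """Share of variation in ys 'explained' by xs alone."""
    return cov(xs, ys) ** 2 / (cov(xs, xs) * cov(ys, ys))


def two_predictor_slopes(x1, x2, y):
    """Slopes from regressing y on x1 and x2 (with intercept), via the normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def simulate_unequal_precision(n=50_000, seed=0):
    rng = random.Random(seed)
    # Assumed setup: both factors have variance 1 and correlation 0.5.
    genes = [rng.gauss(0, 1) for _ in range(n)]
    income = [0.5 * g + rng.gauss(0, 0.75 ** 0.5) for g in genes]
    # Equal true effects of 0.3 each; the residual variance makes var(schooling) about 1.
    schooling = [0.3 * g + 0.3 * w + rng.gauss(0, 0.73 ** 0.5)
                 for g, w in zip(genes, income)]
    # Income is measured with noise (variance 1); genes are measured exactly.
    income_obs = [w + rng.gauss(0, 1) for w in income]
    return {
        "true_slopes": two_predictor_slopes(genes, income, schooling),
        "observed_slopes": two_predictor_slopes(genes, income_obs, schooling),
        "r2_genes_alone": r_squared(genes, schooling),
        "r2_income_obs_alone": r_squared(income_obs, schooling),
    }


if __name__ == "__main__":
    for label, value in simulate_unequal_precision().items():
        print(label, value)
```

With the true values, the two slopes should both come out near 0.3. With noisy income, the income slope should shrink to roughly 0.13 while the genes slope should rise to roughly 0.39, because the precisely measured factor picks up part of the effect of its noisily measured correlate. Taken alone, genes should "explain" about 20 percent of the variation and observed income about 10 percent, even though by construction the two matter equally.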

6 thoughts on “Noise and statistical analysis”

  1. It would be nice to see people compete in all sorts of tournaments in order to prove what they are capable of. Without prizes, it’s hard to tell. We know the ranking of competitive chess players, and marathoners, and how much the Kardashians seem to make, but not much about how “smart” someone is after graduating with a B.A., or how “smart” they are leaving high school.

    I wrote a long discursive post on this sometime in the past couple of years, perhaps a half-baked post. For example, to actually see how much students learn in four years of college, they should take demanding exams with large prizes when they enter and then again when they graduate. Otherwise all we know is that they attended for four years. If people could make, say, $5,000.00 taking a test that day, people might compete.

    Some fields are different: passing the CPA exam or the bar exam, for example. But there are not enough of them.

    I think I got the original idea from reading _The Bell Curve_, in which Herrnstein and Murray said it would be good to administer seriously competitive exams with large prizes on a national basis for high school seniors, with immediate, few-or-no-strings cash prizes, independent of need and regardless of what you do afterward (go to college, for example). We do it, but not enough. They thought the extra money to fund such an exercise would be useful.

    = – = – = – =

    Herrnstein and Murray suggested a national test, with federal funding. A private foundation with deep pockets might manage it. What are all these new plutocrats spending their money on? Thiel academy shows the way, in part…

    • Interestingly, people taking the CPA or bar exam after graduating usually take time off to study and/or take a specialized course geared to passing the exam. If one were naive, one might suppose their years of school have already prepared them.

      • The CPA and bar courses taken in preparation for those exams are normally referred to as “review” courses, meant to help someone recall things from a course taken years earlier. Having taken the CPA exam, I can say that without the knowledge gained from accounting courses taken in college, my chances of passing the CPA exam would have been slim. I have also taken a couple of bar exams. Those prep courses are still review courses, and often focus on individual state law differences, but law school does a much poorer job of preparing one for those tests.

  2. An important thing to understand here is that the specific genetic model their analysis produced predicts 11%. Genetic models of highly polygenic traits are not very good, and usually explain only a fraction of the genetic contribution implied by twin studies. For example, twin studies suggest that genetics are responsible for 80-90% of the variation in height, but GWAS models can only explain about 25%. I suspect that the reason for this is that in highly polygenic traits, individual gene variants interact in complicated, nonlinear ways, but am not highly confident in that explanation.

    I agree that the noisiness of permanent family income as estimated by a single year biases the estimate of the contribution of family income to educational attainment (and other traits) downwards, but I suspect that the much greater source of bias here is that GWAS models capture only a fraction of the actual genetic contribution.

  3. “I suspect that the reason for this is that in highly polygenic traits, individual gene variants interact in complicated, nonlinear ways, but am not highly confident in that explanation.”

    From first principles this seems like a reasonable guess, but I doubt it’s what’s going on for two reasons related to what we’re actually observing.

    reason A: Heritability is a linear calculation, so it seems to me that those highly nonlinear effects shouldn’t tend to show up as a huge heritability. Or more precisely, shouldn’t show up as a huge heritability except under special circumstances (which are hard to speculate about concretely because there are so many different ways for multivariate interactions to be nonlinear).

    reason B: AFAIK, no one has found many examples of such nonlinear gene interactions in highly polygenic traits. (And many practitioners, not just ivory tower researchers jockeying for the Nobel prize but also breeders trying to breed fatter chickens, should be highly motivated to find them.)

    Also, as I understand it there are at least two less-gee-whiz reasons for GWAS to fall short of capturing all of heritability.

    reason C: For any given finite data set, a gene with sufficiently tiny effect is not going to be reliably detected. If a lot of the heritability is in the long tail of zillions of genes with tiny effects, this can be a big deal.

    reason D: For a finite data set, it tends to be hard to determine the effect of rare variants with small effect. E.g., consider five individuals whose heights are changed 0.6% by a variant which was created when a bit of ionizing radiation struck their great-grandmother’s ovary, and which exists nowhere else in the population.

    Note that reason D is a first-principles argument, as well: we have reason to expect that the total effect of the equilibrium level of slightly-deleterious effects of this sort (only rarely generated, but also only very slowly eliminated by natural selection) should be pretty significant. (Eliminating a trait with only a tiny fitness disadvantage takes a long time, and copying error rates are low but significant. And the nervous system tends to be very vulnerable to things going wrong, not just specifically-nervous-system genes going wrong: note how common it is for toxins and infections and other problems — shock, heat stroke, hypoxia… — to make you woozy or spacy or unconscious well before they kill you.)
