FITs No. 25

The post is here. I express my (possibly unfounded) worries about polygenic scores for outcomes like educational attainment.

I can’t help thinking that the “genes” for educational attainment, after you control for the genes for intelligence, might not be causal factors. I worry about culture as a confounding factor. For example, suppose that Asian parents successfully encourage their children to do well in school, and this accounts for all of the educational attainment of Asians over and above what you would expect based on their intelligence. Then you would find that genes that are associated with Asian-ness help to “explain” school attainment, even though the causal factor is actually cultural. (Note that if you are one of those people who insists that IQ itself is culturally determined, then this would lead you to question the interpretation of polygenic scores for IQ. I myself am willing to believe that psychometricians have figured out a way to measure IQ so that it picks up genetic causes rather than cultural causes. But I really don’t believe that one can do that with educational attainment or with income.)

What does consumer sentiment tell us?

A newspaper story says,

America has already slipped into a recession that could be as bad as the 2008 financial meltdown according to key consumer data, a Dartmouth College professor has warned.

David Blanchflower, of Dartmouth, and Alex Bryson, of University College London, say that every slump since the 1980s has been foreshadowed by 10-point drops in consumer indices from the Conference Board and University of Michigan.

The abstract of the Blanchflower/Bryson working paper says,

We show consumer expectations indices from both the Conference Board and the University of Michigan predict economic downturns up to 18 months in advance in the United States, both at national and at state-level. All the recessions since the 1980s have been predicted by at least 10 and sometimes many more point drops in these expectations indices. A single monthly rise of at least 0.3 percentage points in the unemployment rate also predicts recession, as does two consecutive months of employment rate declines. The economic situation in 2021 is exceptional, however, since unprecedented direct government intervention in the labor market through furlough-type arrangements has enabled employment rates to recover quickly from the huge downturn in 2020. However, downward movements in consumer expectations in the last six months suggest the economy in the United States is entering recession now (Autumn 2021) even though employment and wage growth figures suggest otherwise.

When I worked on forecasting, which was at the Fed almost 40 years ago, we thought that those consumer sentiment indices were too volatile to be useful in forecasting. Some possibilities:

1. We were stupid. We just did not understand that you could filter out volatility by looking just at large drops.

2. The consumer sentiment surveys were bad indicators in the 1970s and early 1980s, but they are better indicators now. Perhaps households have gotten better about sensing when firms are getting ready to initiate layoffs, so that their sentiment is now a good leading indicator. Perhaps the survey methods have improved.

3. The relationship that they found is mostly coincidental. Consider that “all the recessions since the 1980s” is not a particularly big sample. Since 1982, we’ve only had four. With so few observations, you cannot confirm that a statistical relationship is reliable.

My money is on (3). I wouldn’t call this work “research.” If you want to verify a statistical relationship that you arrived at by trial and error, you need two samples of reasonable size. One sample is used to hunt for a relationship. The other is used to verify that you did not discover a coincidence. You can’t follow such a procedure with only the recessions since the 1980s to work from.
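A rough simulation sketch makes the point. The sentiment series and recession dates below are made up, not calibrated to the actual Michigan or Conference Board data. With a volatile index, only four recessions to explain, and the freedom to look back as far as 18 months, a “ten-point drop” rule can fit every recession purely by chance a sizable share of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MONTHS = 480        # roughly forty years of monthly data
N_RECESSIONS = 4      # recessions since 1982, per the point above
DROP = 10             # the ten-point drop rule
LEAD = 18             # "up to 18 months in advance"
N_SIMS = 2000

def max_drop_before(index, start, lead):
    """Largest peak-to-trough decline in the window leading up to month `start`."""
    window = index[max(0, start - lead):start + 1]
    peak, worst = -np.inf, 0.0
    for x in window:
        peak = max(peak, x)
        worst = max(worst, peak - x)
    return worst

hits = 0
for _ in range(N_SIMS):
    # a made-up sentiment index: a volatile random walk around 90
    index = 90 + np.cumsum(rng.normal(0, 3, N_MONTHS))
    # four recession start dates scattered at random through the sample
    starts = rng.choice(np.arange(24, N_MONTHS), size=N_RECESSIONS, replace=False)
    if all(max_drop_before(index, s, LEAD) >= DROP for s in starts):
        hits += 1

print(f"Rule fits all four recessions by pure chance in {hits / N_SIMS:.0%} of simulations")
```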

Blanchflower makes it sound as though their results imply a 100 percent probability that the U.S. economy is entering a recession. I think that the true probability is less than 25 percent.

The latest Nobel Prize in economics

Timothy Taylor has coverage.

The award announcement says that one-half of the award is given to David Card “for his empirical contributions to labour economics,” while the other half is given jointly to Joshua D. Angrist and Guido W. Imbens “for their methodological contributions to the analysis of causal relationships.”

All three are involved in what Angrist and Jörn-Steffen Pischke dubbed the “credibility revolution” in empirical economics. As Alex Tabarrok puts it,

Almost all of the empirical work in economics that you read in the popular press (and plenty that doesn’t make the popular press) is due to analyzing natural experiments using techniques such as difference in differences, instrumental variables and regression discontinuity.

Last year, I suggested that a Nobel should go to Edward Leamer, the economist who set off the credibility revolution.

The Leamer critique caused economists to largely abandon ordinary multiple regression and to instead employ more credible research designs, such as natural experiments.

Leamer in 2010 pointed out that the newer methods also come with limitations. But, alas, he was bypassed by the Nobel committee.

Noah Smith goes way overboard in praise of the new laureates. He makes it sound as though the results that David Card and Alan Krueger claimed about the minimum wage were only controversial because they were surprising. But they were also controversial because they were wrong.

Science: method vs. sayings

Emily Oster writes,

Back in spring 2020, many of us (here’s my particular take) argued that the only way to really track case rates (and serious illness rates and so on) was to engage in a program of random testing. Some countries, like the U.K., did that. But the U.S. did not. And we still do not have such a program. And while our tracking of the pandemic has improved tremendously with better testing and better data reporting, this would still be helpful.

Consider this. Right now South Dakota has low case rates relative to most places with its vaccination rates. Is this because of natural immunity from very, very high infection rates in the winter? Or is it, more plausibly, a lack of testing? We have no idea, since without random sample testing, we cannot know.

Rigorous thinkers, like Oster, naturally employ a scientific mindset. Apparently, many people trained in epidemiology don’t. Epidemiological “science” seems more like a set of sayings that students learn. They learn to scold people based on those sayings.

Even when they do studies, epidemiologists do not seem to think like scientists. Hence you have many irreproducible and unreliable results.
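Oster’s random-testing point is easy to make concrete. Here is a back-of-the-envelope sketch, with an assumed sample size and positive count rather than figures from any real survey: a random sample yields an unbiased prevalence estimate with a known margin of error, which counts from voluntary testing cannot provide, because the people who show up to be tested are not a random draw from the population.

```python
import math

sample_size = 5_000      # assumed size of a random-testing panel
positives = 60           # assumed number of positives in that sample

prevalence = positives / sample_size
margin = 1.96 * math.sqrt(prevalence * (1 - prevalence) / sample_size)

print(f"Estimated prevalence: {prevalence:.2%} +/- {margin:.2%}")
```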

Don’t trust doctors as statisticians

1. Daniel J. Morgan and others write,

These findings suggest that many practitioners are unaccustomed to using probability in diagnosis and clinical practice. Widespread overestimates of the probability of disease likely contribute to overdiagnosis and overuse.

Pointer from Tyler Cowen.

In 2008, Barry Nalebuff and Ian Ayres in Super Crunchers also reported that doctors tend to do poorly in basic probability. When I taught AP statistics in high school, I always used an example of a bad experience inflicted on me by a Harvard-trained physician who did not know Bayes’ Theorem.
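For readers who have not seen it, here is the sort of calculation at issue, with illustrative numbers that are not taken from the Morgan et al. paper: even a fairly accurate test implies only a modest probability of disease when the condition is rare, and that is the step practitioners tend to get wrong.

```python
prevalence = 0.01      # assumed share of patients who actually have the condition
sensitivity = 0.90     # P(positive test | disease)
specificity = 0.95     # P(negative test | no disease)

# Bayes' Theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.1%}")   # roughly 15%
```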

2. From a 2014 paper by Ralph Pelligra and others

The Back to Sleep Campaign was initiated in 1994 to implement the American Academy of Pediatrics’ (AAP) recommendation that infants be placed in the nonprone sleeping position to reduce the risk of the Sudden Infant Death Syndrome (SIDS). This paper offers a challenge to the Back to Sleep Campaign (BTSC) from two perspectives: (1) the questionable validity of SIDS mortality and risk statistics, and (2) the BTSC as human experimentation rather than as confirmed preventive therapy.

The principal argument that initiated the BTSC and that continues to justify its existence is the observed parallel declines in the number of infants placed in the prone sleeping position and the number of reported SIDS deaths. We are compelled to challenge both the implied causal relationship between these observations and the SIDS mortality statistics themselves.

I thank Russ Roberts for the pointer. This specific issue has been one of my pet peeves since 2016. See this post, for example. I think that back-sleeping is a terrible idea, and I never trusted the alleged evidence for it. Doctors do not understand the problem of inferring causation from non-experimental data, etc.

UPDATE: A commenter points to this Emily Oster post, reporting on a more careful study that supports back-sleeping.

Administrative data in economic research

Timothy Taylor writes,

economic research often goes well beyond these extremely well-known data sources. One big shift has been to the use of “administrative” data, which is a catch-all term to describe data that was not collected for research purposes, but instead developed for administrative reasons. Examples would include tax data from the Internal Revenue Service, data on earnings from the Social Security Administration, data on details of health care spending from Medicare and Medicaid, and education data on teachers and students collected by school districts. There is also private-sector administrative data about issues from financial markets to cell-phone data, credit card data, and “scanner” data generated by cash registers when you, say, buy groceries.

Vilhuber writes: “In 1960, 76% of empirical AER [American Economic Review] articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use …”

The quote is from a paper by Lars Vilhuber.

My thoughts:

1. The Census Bureau has procedures in place for allowing researchers to use administrative data while making sure that no individual record can be identified. I know this because one of my daughters worked at Census in this area for a few years.

2. The private firms that collect data as part of their business are not going to waste resources making sure that the data is free from coding mistakes and other errors. Researchers who are used to assuming that they are working with clean data are going to be surprised.

3. The data surrounding COVID are particularly treacherous. I found this out first-hand back when I was trying to follow the virus crisis closely.

4. This course should be taught more widely. As a point of trivia, the professor, Robert McDonald, was an undergraduate roommate of Russ Roberts (now the host of EconTalk) and also shared an apartment with me our first year of graduate school.

Intervening for racial equality

Glenn Loury says,

I must address myself to the underlying fundamental developmental deficits that impede the ability of African Americans to compete. If, instead of doing so, I use preferential selection criteria to cover for the consequences of the historical failure to develop African American performance fully, then I will have fake equality. I will have headcount equality. I will have my-ass-is-covered-if-I’m-the-institution equality. But I won’t have real equality.

I recommend the entire interview.

Meanwhile, Lilah Burke reports,

In 2013, the University of Texas at Austin’s computer science department began using a machine-learning system called GRADE to help make decisions about who gets into its Ph.D. program — and who doesn’t. This year, the department abandoned it.

Before the announcement, which the department released in the form of a tweet reply, few had even heard of the program. Now, its critics — concerned about diversity, equity and fairness in admissions — say it should never have been used in the first place.

The article does not describe GRADE well enough for me to say whether or not it was a good system. For me, the key question is how well it predicts student performance in computer science.

I draw the analogy with credit scoring. If a credit scoring system correctly separates borrowers who are likely to repay loans from borrowers who are likely to default, and its predictions for black applicants are accurate, then it is not racially discriminatory, regardless of whether the proportion of good scores among blacks matches the proportion among whites.
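As a concrete version of that test, here is a minimal calibration check on synthetic data. The scores and outcomes are simulated, not from any real lender or scoring model, and the data are generated so that the score is accurate by construction; the point is the check itself: within each score band, predicted and observed repayment rates should line up for each group, even if the groups have different score distributions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
group = rng.choice(["A", "B"], size=n)
score = rng.uniform(0, 1, size=n)            # the model's predicted repayment probability
repaid = rng.uniform(0, 1, size=n) < score   # outcomes generated to match the score

df = pd.DataFrame({"group": group, "score": score, "repaid": repaid})
df["band"] = pd.cut(df["score"], bins=[0, 0.25, 0.5, 0.75, 1.0])

# Within each score band, compare the average predicted probability with the
# observed repayment rate, separately by group.
print(df.groupby(["band", "group"], observed=True)[["score", "repaid"]].mean())
```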

David Arnold and co-authors find that

Estimates from New York City show that a sophisticated machine learning algorithm discriminates against Black defendants, even though defendant race and ethnicity are not included in the training data. The algorithm recommends releasing white defendants before trial at an 8 percentage point (11 percent) higher rate than Black defendants with identical potential for pretrial misconduct, with this unwarranted disparity explaining 77 percent of the observed racial disparity in algorithmic recommendations. We find a similar level of algorithmic discrimination with regression-based recommendations, using a model inspired by a widely used pretrial risk assessment tool.

That does seem like a bad algorithm. On the face of it, the authors believe that they have a better model for predicting pretrial misconduct than that used by the city’s algorithm. The city should be using the authors’ model, not the algorithm it actually chose.

I take Loury as saying that intervening for racial equality late in life, at the stage where you are filling positions in the work place or on a college campus, is wrong, especially if you are lowering standards in order to do so. Instead, you have to do the harder work of improving the human capital of the black population much earlier in their lives.

It seems to me that Loury’s warning about the harms of affirmative action is being swamped these days by a tsunami of racialist ideology. Consider the way that a major Jewish movement seeks to switch religions.

In order to work toward racial equality through anti-racism, we must become aware of the many facets of racial inequality created by racism in the world around us and learn how to choose to intervene. Join us as we explore:

– How race impacts our own and each others’ experiences of the world

– The choice as bystander to intervene or overlook racist behavior

– How to be an anti-racist upstander

There is more of this dreck at the link.

I foresee considerable damage coming from this. Institutions and professions where I want to see rigor and a culture of excellence are being degraded. Yascha Mounk, who doesn’t think of himself as a right-wing crank, recently wrote Why I’m Losing Trust in the Institutions.

Finally, this seems like as good a post as any to link to an essay from last June by John McWhorter on the statistical evidence concerning police killings.

Vaccine testing and the trolley problem

As you know, I am not ready to proclaim a vaccine a success based on what I see as a small number of cases in the sample population. A lot of you think I am wrong about that, and I hope I am.

But my intuition is still that there is something unsatisfying about the testing protocol.

The actual protocol seems to be to give the vaccine to a large sample and wait for people to get exposed to the virus as they go about their normal business. My inclination is to deliberately expose a smaller sample to the virus and see how well the vaccine works.

One argument against deliberate exposure is that the response of people to deliberate exposure may not be the same as their response to normal-business exposure.

But in favor of deliberate exposure:

–you can clearly test how well the vaccine does in specific populations, such as people with and without obesity.
–you can clearly test how well the vaccine does at two levels of viral load.
–you can get results quickly, rather than waiting for months for people in the sample population to become exposed to the virus.
–you could identify the main contacts of people deliberately exposed to the virus and make sure that the vaccine reduces spreading.

Whether you deliberately expose 100 people or wait for 100 cases to emerge, you end up with 100 cases either way. It seems to me that deliberate exposure is equivalent to throwing the switch in the trolley problem: the harm is the same, it’s just that in the wait-and-see protocol the experimenter doesn’t specifically choose who bears it.

I think that the case for deliberate exposure as a testing protocol ends up being pretty strong. What am I missing?

UPDATE: a reader points me to a story about a proposed challenge trial.

Challenge trials are controversial because of the risks involved with infecting patients with a potentially lethal virus.

But again I ask, what is the difference between infecting a group of people and waiting for a group of people to become infected?

Further thoughts on a vaccine

UPDATE: Thanks to a commenter for finding the study protocol for the Pfizer study. That helps a lot.

1. If you assume that everyone is identical, and if you want to try to eradicate the virus by giving everyone the vaccine, then you don’t need to show that it is 90 percent effective. My guess is that 20 percent effectiveness will do the trick. The problem you face is a PR problem. Most people hear the word “vaccine” and think “super-power that keeps me from getting sick.” But to be honest you would have to say “You might still get sick, but if everyone takes the vaccine and also exercises some degree of precaution, the virus will die out and eventually you won’t have to worry about it.”
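Here is a back-of-the-envelope version of that eradication arithmetic. The reproduction number is assumed purely for illustration, and the calculation treats everyone as identical, as in the paragraph above.

```python
def effective_r(r0, efficacy, coverage=1.0):
    """Effective reproduction number with a leaky vaccine at the given coverage."""
    return r0 * (1 - coverage * efficacy)

# Suppose precautions already hold transmission to around R0 = 1.2 (an
# assumption for illustration, not an estimate). The virus dies out whenever
# the effective reproduction number falls below 1.
for eff in (0.2, 0.5, 0.9):
    print(f"efficacy {eff:.0%}: effective R = {effective_r(1.2, eff):.2f}")
# 20 percent efficacy already pushes R below 1 under these assumptions.
```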

2. But if your goal with the virus is to target high-risk populations or people whose working conditions might expose them to high viral loads and make them safe, then you want the super-power story to be true. I would want to see that high viral load does not degrade the effectiveness of the vaccine, and I would want to see that being in a high-risk category does not degrade the effectiveness of the vaccine.

3. The most I can find out about the Pfizer study is from this press release.

The first primary objective analysis is based on 170 cases of COVID-19, as specified in the study protocol, of which 162 cases of COVID-19 were observed in the placebo group versus 8 cases in the BNT162b2 group. Efficacy was consistent across age, gender, race and ethnicity demographics. The observed efficacy in adults over 65 years of age was over 94%.

There were 10 severe cases of COVID-19 observed in the trial, with nine of the cases occurring in the placebo group and one in the BNT162b2 vaccinated group.

There are a number of comparisons you could make between the placebo group and the treatment group.

–number of people who tested positive for the virus
–number of people who tested positive and showed symptoms (this appears to be what they used)
–number of “severe cases” (this was reported in the press release)
–number of hospitalizations (is this the same as “severe cases”?)
–number of deaths

Suppose that nobody in either group died from the virus. A headline that says “Vaccine prevents zero deaths” would not be very inspiring, would it?

When a study can look at many possible outcome measures and chooses to report only those that favor the drug, this is known as p-hacking. I don’t know that Pfizer was p-hacking, but I don’t know that they weren’t.

UPDATE: the protocol specified the outcome measures in advance, so no p-hacking.

The context here is one in which people who spread the virus differ greatly in their ability to spread it (at least, that seems like a good guess), and people who come in contact with the virus differ greatly in the extent to which they get sick. In that context, a ratio of 162/22000 compared to 8/22000 is promising but not definitive. I would be much more impressed if it were 1620/22000 and 80/22000. With the smaller numbers, I can think of a dozen ways to get those results without the vaccine actually being effective.
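For reference, here is the simple arithmetic behind the press-release numbers, under the standard assumption that cases arrive independently and that the two arms are roughly equal in size.

```python
from scipy.stats import beta

vaccine_cases, placebo_cases = 8, 162          # the split reported in the press release

# Point estimate: efficacy = 1 - ratio of attack rates (equal-sized arms assumed).
print(f"Point estimate of efficacy: {1 - vaccine_cases / placebo_cases:.1%}")

# A crude interval: with a flat prior, the share of cases falling in the
# vaccine arm has a Beta(9, 163) posterior; convert its tails into efficacy.
lo, hi = beta.ppf([0.025, 0.975], vaccine_cases + 1, placebo_cases + 1)
print(f"Roughly {1 - hi / (1 - hi):.1%} to {1 - lo / (1 - lo):.1%} efficacy")
```

The reservation above is precisely that the independence assumption behind this arithmetic may not hold.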

UPDATE: Now that I have seen the protocol, I can go back to the first two points. For the purpose of trying to eradicate the virus with universal vaccination, you don’t need much efficacy. But I think you want to be really, really confident that there is some efficacy, because otherwise you will have blown it with the public if your vaccine campaign does not eradicate the virus. If I were the public official in charge of making the decision, I would want to see a larger number of cases in the sample before I would make this kind of a bet.

For the purpose of protecting vulnerable populations, you need to conduct a different protocol, in my opinion. Again, you want to know how the vaccine does as viral load goes up and as vulnerability of the individual goes up. I think that would argue for a different study protocol altogether.

Testing a vaccine

A follow-up/clarification to my earlier post:

I believe in what I call the Avalon-Hill model of how the virus affects people. That is, it depends on a combination of viral load and patient vulnerability. Accordingly, I would like to see a vaccine tested on various combinations of these factors. That means that the experimenter should control the viral load rather than leave it to chance in the context of selection bias (people who volunteer for the trial may be behaving in ways that reduce their probability of being exposed to high viral load).
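A minimal sketch of the factorial design this implies, with the viral-load levels and vulnerability strata as illustrative labels rather than anything from an actual protocol:

```python
from itertools import product

arms = ["vaccine", "placebo"]
viral_loads = ["low", "high"]              # set by the experimenter, not left to chance
vulnerability = ["low-risk", "high-risk"]  # e.g., by age or comorbidities

# Every combination becomes a cell in which efficacy can be estimated directly,
# rather than averaged over whatever exposures happen to occur.
for cell in product(arms, viral_loads, vulnerability):
    print(cell)
```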

In principle, that means assigning a high viral load to some high-risk subjects in both the treatment group and the placebo group. That could discomfit the experimenter, not to mention the experimental subjects.

But if you don’t do that, what have you learned? If the most severe cases in the real world come from people exposed to high viral loads, and almost no one in your trial was exposed to a high viral load, then you have at best shown that the vaccine is effective under circumstances where it is least needed.