Saturday, September 19, 2020

Hydroxychloroquine as a treatment for covid-19: summary of the best data we have as of Sept 2020 and an explanation of the statistics

I'd like to start this post by drawing attention to the elephant in the room. It's clear that hydroxychloroquine as a treatment for covid-19 has become highly politicized. I try to avoid consuming media on that sort of thing, to preserve my sanity, but I do get the general idea that at least some on the political right have hailed it as a miracle cure, and some on the left have painted it as having deadly side effects. Early on, there was evidence that it might be a good drug candidate, and there was also evidence that it might have side effects that would preclude its use in covid patients, so each side had at least some grounds for their argument. Now that the randomized studies have been done, however, it appears that neither is correct.

Many of the studies mention that patients receiving HCQ had more side effects than the control group, but those side effects were not severe enough to have prevented the drug from being used if it had helped with recovery from covid.

But if there is a benefit to covid patients in taking hydroxychloroquine, it's so tiny that we would need a huge trial with tens of thousands of participants to actually see the effect.  Given the difficulties involved with setting up such a trial, and the small potential returns, it just doesn't make sense. But of course, I'm not asking you to take my word for it.  I'm going to show you lots of data.  

A search (on Google Scholar) of randomized, controlled trials for HCQ as a treatment for covid-19 came up with ten studies.  Randomized trials are superior to non-randomized trials because in a non-randomized trial, people who are more likely to recover without the drug are often the ones who are more likely to choose the drug.  People who are more educated are much more likely to trust the scientists enough to sign up for an experimental trial, and more educated people have better health outcomes all the way around.  People who are healthy enough to advocate for attention from medical professionals, or who have someone who is looking out for their interests and can advocate for them, or those with doctors who are highly motivated to get them the best care are more likely to end up with the drug.  Because of this, positive results in a non-randomized trial of a drug are taken as a reason to do further experiments, but not to blindly accept the drug as effective.

Placebo-controlled trials are superior to open-label ones, because taking a placebo, even if you know it's a placebo, makes people feel better.  I am including both placebo-controlled and open-label trials in this summary, but noting their category in my summary table.

These ten randomized controlled trials include a wide range of doses; one study adds zinc and one adds azithromycin. The subjects ranged widely in how far along they were in the course of the disease, from people who had been exposed but had not developed symptoms, to those who were intubated in an ICU. There was also a range of outcomes observed, from the time it takes for patients to feel better, to how many patients died.

Of those ten studies, only two had a positive clinical outcome, and those were two of the three studies with fewer than 100 subjects.  Smaller studies are more subject to statistical error, and given how many larger studies we have that show no benefit, those two were most likely statistical flukes.  One of my tasks in this post is to explain how this can happen. 

What I do want to do, aside from presenting the data from those ten studies, is to explain why it looked at first like the drug was working, and why it then turned out that it didn't.  I can imagine that non-scientists might feel like someone is pulling something over on them somewhere, but as someone who has done early anti-viral drug discovery and design studies, I can tell you that this sort of thing is very much the norm.  Most of the drugs that look promising early on turn out to be false leads.

P-values

In order to understand why a few smaller experiments found a positive result for HCQ, we need to talk about p-values. Technically, the p-value of an experiment is "the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct." (Wikipedia).  But let's do an example with coin-flipping that hopefully will help clarify what this means.

Suppose someone gives you a coin, and tells you that it may or may not be weighted to give either heads or tails more than 50% of the time. Your job is to figure out if this is the case, by conducting coin-flipping experiments.  In this case the "null hypothesis" is that the coin is just an ordinary fair coin. 

You flip once and get heads.  You can say that 1/1=100% of the flips were heads.  But of course, you don't want to conclude that the coin is weighted, because a fair coin will give a result this extreme 50% of the time. In this experiment, you would say you got 100% heads, but the p-value is 0.50, so it doesn't count as good evidence of a weighted coin.  If we accepted this as evidence, half of genuinely fair coins flipped once would be described as weighted, and we don't want that.

So what if you did an experiment with two flips, and the coin came up heads both times?  That's a bit better.  You can say that 2/2=100% were heads.  A fair coin would give a result this extreme 25% of the time, for a p-value of 0.25.  But we still would not want to use this standard, because it would mean that a quarter of fair coins flipped twice would be called weighted. That's not good.  

With a three-flip experiment, if 3/3 came up heads, that's 100% with a p-value of 0.125.  One out of 8 fair coins would be described as weighted.

With a four-flip experiment, with 4/4 heads, the p-value is 0.0625. One out of 16 fair coins would be described as weighted.

With a five-flip experiment, with 5/5 heads, the p-value is 0.03.  Now we have crossed a threshold where this experiment would be considered publishable.  That threshold is p=0.05. If your experiment crosses this threshold, you can say that your 100% heads result is statistically significant.   
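
If you want to double-check these numbers yourself, here is a minimal Python sketch (my own illustration, not something from any of the papers) that computes the p-value for an all-heads result, which is simply the chance that a fair coin produces that result on its own:

    # Chance that a fair coin gives heads on every one of n flips.
    # Because "all heads" is the most extreme result in the heads direction,
    # this is also the one-sided p-value for that outcome.
    def p_value_all_heads(n_flips: int) -> float:
        return 0.5 ** n_flips

    for n in range(1, 6):
        print(f"{n} flips, all heads: p = {p_value_all_heads(n):.4f}")
    # prints 0.5000, 0.2500, 0.1250, 0.0625, 0.0312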

Note that as our experiments got bigger and therefore better, the p-values got smaller.  Lower p-values are better because they mean the odds that the result came from a fair coin giving an unusual (but possible) result are lower. Go back to our definition of the p-value: "the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct."  In this case, the null hypothesis is that the coin is just an ordinary fair coin, and because 100% heads is the most extreme result possible, the math is easy: the p-value is just the probability that a fair coin gives heads on every flip.

Also note that statistical significance is not the same as ordinary significance.  We say that something is significant when it is notable or important.  It's possible that you would get a result that is 51% heads and the divergence from 50% is statistically significant, but that result is not very significant in the ordinary sense of the word. 
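
To make that concrete, here is a rough sketch of the arithmetic (my own example; it assumes SciPy 1.7 or later, which provides scipy.stats.binomtest). Getting 5,100 heads out of 10,000 flips is only 51% heads, but with that many flips the difference from a fair coin is already statistically significant:

    from scipy.stats import binomtest

    # 5,100 heads out of 10,000 flips: barely more than half, but the sample
    # is so large that a fair coin would rarely drift this far above 50%.
    result = binomtest(5100, 10000, p=0.5, alternative="greater")
    print(f"one-sided p-value: {result.pvalue:.4f}")  # roughly 0.02

So the coin is statistically distinguishable from a fair one, but a coin that gives 51% heads is still not much of a coin to bet on, and the same goes for a drug with a tiny but real effect.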

There is nothing magical about p=0.05. It's just a standard that scientists (mostly) agree that you must meet to draw conclusions from your data.  If you have a result that has a p-value higher than 0.05, but still low enough to be suspicious, you can report it and say that there is a "trend."  But if you want to argue that the coin is weighted in favor of heads, and you send in a paper with p=0.07, the reviewers are (rightly) going to send it back to you and say "do more flips!" 

Now, there are some scientists who argue that the standard ought to be stricter than p=0.05. Here's why. If you give fair coins to 100 scientists around the world, and have them each do a five-flip experiment, there is about a 3% chance that any individual scientist will get 5/5 heads, and therefore about a 97% chance that they will get something other than five heads.  But the chance that all 100 of them will get something other than five heads is 0.97 raised to the 100th power. That's about 5%. Turn it around, and the chance that at least one of the 100 people doing the experiment will get five out of five heads is about 95%.
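
The arithmetic in that paragraph is easy to reproduce. Here is a short sketch (again, just my own illustration) using both the rounded numbers above and the exact probability of 5/5 heads:

    # Rounded version: each scientist has a 97% chance of NOT getting 5/5 heads.
    p_all_miss_rounded = 0.97 ** 100            # about 0.048, roughly 5%

    # Exact version: the chance of 5/5 heads is 1/32 = 0.03125.
    p_all_miss_exact = (1 - 0.5 ** 5) ** 100    # about 0.042

    print(f"chance that nobody gets 5/5 heads:       {p_all_miss_rounded:.3f}")
    print(f"chance that at least one gets 5/5 heads: {1 - p_all_miss_rounded:.3f}")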

What if the 100 scientists around the world are all doing their experiments without knowing the others are doing the same experiment?  The majority, who get an uninteresting result, will do nothing, but the few who find the coin apparently behaving strangely will publish their results, and the media will put out clickbait headlines about the extreme result. Remember, the math here is for a fair coin giving a result that is unusual if you do the experiment once, but not unusual if you do the experiment many times and cherry-pick the one result that stands out.  If you have enough people doing independent coin-flipping experiments, there is a very high probability that a few of them will conclude that a fair coin is actually weighted, if the standard for calling it weighted is a p-value less than 0.05.
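
If the arithmetic feels too abstract, a quick simulation (my own sketch, not anything taken from the trials) makes the same point. Give a perfectly fair coin to 100 independent experimenters, let each do a five-flip study, and count how many end up with a "publishable" result:

    import random

    random.seed(1)  # fixed seed so the example is reproducible

    n_scientists = 100
    n_flips = 5

    # The only 5-flip outcome with a p-value below 0.05 is 5/5 heads, so count
    # how many scientists get all heads from a genuinely fair coin.
    lucky = sum(
        all(random.random() < 0.5 for _ in range(n_flips))
        for _ in range(n_scientists)
    )
    print(f"{lucky} of {n_scientists} got a 'significant' result with a fair coin")
    # On average about 3 of the 100 will (100 / 32).

If only those few write up their results, the published record ends up looking like evidence for a weighted coin even though every coin in the simulation was fair.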

This is probably what happened with HCQ. Back in April, an experiment was published that looked really good.  It was a double-blind, placebo-controlled experiment.  That's good experimental design.  It wasn't very big, but it did reach statistical significance.  The people who got HCQ were more likely than the placebo group to show improvement on their lung scans. The p-value for the difference was 0.047, which is just below the p=0.05 threshold.  What we don't know is how many people tried similar experiments early on and found a negative result that they deemed not worth publishing.  There had been some early hints that the drug inhibited growth of the virus in a petri dish, it was already an approved drug, and in non-randomized trials it looked promising, so it was definitely something worth checking out, and every scientist on the planet was aware of covid. So I wouldn't be surprised if there were quite a few.  And even if there were only one, there is a 4.7% chance that a difference this large would show up purely by chance, even if the drug did nothing at all.

That was the first randomized trial to be released. Then two more studies, both larger than the first, found p-values of 0.34 and 0.35.  One of those was a placebo-controlled, double-blind study done in the US and Canada with over 800 participants. This was the study that was stopped early because it became clear that the drug wasn't helping.  At that point (in early June) doctors stopped using the drug.  During the big outbreak in NYC, hydroxychloroquine was commonly used as part of a "covid cocktail" of drugs that were promising but not fully tested. Despite its use in New York, the fatalities there were high, which is not surprising given the results of the studies.

In late June, another small trial was released that had 48 people divided into three groups. It found a p-value of 0.28 for days to get off oxygen (not statistically significant) but p=0.013 for chest CT score.  That is a better p-value, but remember that you will get this result by chance 1.3% of the time, assuming that the drug does nothing. It was also open-label, so the people reading the CT scans may have known which patients got the drug and which didn't.

There was a third study with a statistically significant result, but it looks like a testing error.  They did a PCR test for shedding of viral RNA at day 7 and day 14.  At day 7 the HCQ group had more negative PCR results (with a statistically significant difference), but at day 14, the placebo group had more negative results.  So it looks like some people in the HCQ group tested negative on day 7, but positive on day 14.  In that study there weren't any clinical differences in disease severity between the groups, so there was probably just some sort of problem with their PCR testing. 

Of the seven studies that found no statistically significant benefit, most found a slight trend toward a benefit, which means that if someone did a huge study, they might find a small benefit that reaches statistical significance.  But it would be the equivalent of a coin that is weighted enough to give heads 51% of the time.  Not a miracle drug.  And there is one study that came out with a trend in the opposite direction. The UK did a large study in which the outcome they were looking for was death, not just how quickly those with mild or moderate illness improved.  In this case, the people given HCQ were more likely to die. The p-value for this difference was 0.18.  This study gave a fairly high dosage (the second highest on the list of ten studies) and it was the only study on the list giving the drug to people with more severe covid.  So it may be that a lower dose given to people with mild or moderate illness helps a little bit, while at the same time a higher dose given to those with severe illness leads to increased deaths.  But again those are trends, not statistically significant given the size of the studies.  

I've made a chart of the ten randomized trials of HCQ as a treatment for covid.  These studies are listed in the order in which they were released to the public.

Columns include:

  • The title of the study
  • The name of the first author of the study
  • The date the data was released to the public
  • The country where each study was done
  • Whether it was placebo-controlled or open-label
  • The number of subjects in the study
  • The characteristics of the subjects in the study (e.g., were they people who had been exposed and were given the drug in the hope that it would prevent early infection, or people with more severe illness?)
  • The groups the subjects were divided into, including information about dosage of the HCQ.  For reference, I found typical doses for already-established uses of the drug:
    • 200-400 mg/day for lupus
    • 400-600 mg/day for RA
    • 800 mg (620 mg base), followed by 400 mg (310 mg base) at 6, 24, and 48 hours after the initial dose, for uncomplicated malaria (2,000 mg hydroxychloroquine sulfate or 1,550 mg base total)
  • Total length of time the drug (or placebo) was given
  • Summarized results. Any p-values less than 0.05 are shown in red type