> Setting a p-value threshold of 0.05 is equivalent to saying: "I’m willing to accept a 5% chance of shipping something that only looked good by chance."
No, it means "I’m willing to ship something that if it was not better than the alternative it would have had only a 5% chance of looking as good as it did.”
layer8 3 minutes ago [-]
Shouldn’t that be “…at least as good as it did”?
phaedrus441 3 hours ago [-]
This! I see this all the time in medicine.
wavemode 2 hours ago [-]
Can you elaborate on the difference between your statement and the author's?
datastoat 34 minutes ago [-]
Author: "5% chance of shipping something that only looked good by chance". One philosophy of statistics says that the product either is better or isn't better, and that it's meaningless to attach a probability to facts, which the author seems to be doing with the phrase "5% chance of shipping something".
Parent: "5% chance of looking as good as it did, if it were truly no better than the alternative." This accepts the premise that the product quality is a fact, and only uses probability to describe the (noisy / probabilistic) measurements, i.e. "5% chance of looking as good".
Parent is right to pick up on this, if we're talking about a single product (or, in medicine, if we're talking about a single study evaluating a new treatment). But if we're talking about a workflow for evaluating many products, and we're prepared to consider a probability model that says some products are better than the alternative and others aren't, then the author's version is reasonable.
sweezyjeezy 50 minutes ago [-]
This is a subtle point that even a lot of scientists don't understand. A p value of < 0.05 doesn't mean "there is less than a 5% chance the treatment is not effective". It means that "if the treatment was only as effective as (or worse than) the original, we'd have < 5% chance of seeing results this good". Note that in the second case we're making a weaker statement - it doesn't directly say anything about the particular experiment we ran and whether it was right or wrong with any probability, only about how extreme the final result was.
Consider this example - we don't change the treatment at all, we just update its name. We split into two groups and run the same treatment on both, but under one of the two names at random. We get a p value of 0.2 that the new one is better. Is it reasonable to say that there's a >= 80% chance it really was better, knowing that it was literally the same treatment?
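A minimal simulation of that renamed-treatment (A/A) scenario, assuming a binary conversion metric and a one-sided two-proportion z-test (the rate, sample sizes, and helper function are all invented for illustration):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    true_rate = 0.10   # same underlying conversion rate for both "treatments"
    n = 5000           # users per arm

    def two_prop_z(conv_a, n_a, conv_b, n_b):
        # One-sided z-test for "B converts better than A".
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        return 1 - norm.cdf((conv_b / n_b - conv_a / n_a) / se)

    p_values = [two_prop_z(rng.binomial(n, true_rate), n,
                           rng.binomial(n, true_rate), n)
                for _ in range(10_000)]

    # Under the null the p-values are roughly uniform: about 20% of these
    # A/A comparisons land below p = 0.2, yet "B is better with >= 80%
    # probability" is never true here -- it is literally the same treatment.
    print(np.mean(np.array(p_values) < 0.2))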
ghkbrew 52 minutes ago [-]
The chance that a positive result is a false positive depends on the false positive rate of your test and on total population statistics.
E.g. imagine your test has a 5% false positive rate for a disease only 1 in 1 million people has. If you test 1 million people you expect 50,000 false positives and 1 true positive. So the chance that one of those positive results is a false positive is 50,000/50,001, not 5/100.
Using a p-value threshold of 0.05 is similar to saying: I'm going to use a test that will call a false result positive 5% of the time.
The author said: chance that a positive result is a false positive == the false positive rate.
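The same arithmetic as a quick sketch (numbers taken from the comment above; sensitivity is assumed to be perfect):

    population = 1_000_000
    prevalence = 1 / 1_000_000        # 1 true case per million
    false_positive_rate = 0.05        # the test's threshold / alpha
    sensitivity = 1.0                 # assume no missed true cases

    true_positives = population * prevalence * sensitivity                  # 1
    false_positives = population * (1 - prevalence) * false_positive_rate   # ~50,000

    # Probability that a given positive result is false: ~0.99998, not 0.05.
    print(false_positives / (false_positives + true_positives))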
likecarter 1 hour ago [-]
Author: 5% chance it could be same or worse
Parent: 5% chance it could be same
esafak 34 minutes ago [-]
@wavemode: In other words, the probability of it being exactly the same is typically (for continuous random variables) zero, so we consider the tail probability; that of it being the same or more extreme.
Palmik 7 hours ago [-]
This isn't just a startup thing. This is common also at FAANG.
Not only are experiments commonly multi-arm, you also repeat your experiment (usually after making some changes) if the previous experiment failed / did not pass the launch criteria.
This is further complicated by the fact that launch criteria are usually not well defined ahead of time. Unless it's a complete slam dunk, you won't know until your launch meeting whether the experiment will be approved for launch or not. It's mostly vibe based, determined by tens or hundreds of "relevant" metric movements, often decided on the whim of the stakeholder sitting at the launch meeting.
netcan 1 hours ago [-]
Is this terrible?
The idea is not to do science. The idea is to loosely systematize and conceptualize innovation. To generate options and create a failure tolerant system.
I'm sure improvements could be made... but this isn't about being a valid or invalid experiment.
setgree 3 hours ago [-]
You're describing conditioning analyses on data. Gelman and Loken (2013) put it like this:
> The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p values. We discuss in the context of several examples of published papers where data-analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not pre specified and, as a result, were contingent on data.
simonw 15 hours ago [-]
On the one hand, this is a very nicely presented explanation of how to run statistically significant A/B style tests.
It's worth emphasizing though that if your startup hasn't achieved product market fit yet this kind of thing is a huge waste of time! Build features, see if people use them.
cdavid 6 hours ago [-]
A/B testing does not have to involve micro optimization. If done well, it can reduce the risk / cost of trying things. For example, you can A/B test something before investing in full prod development, etc. When pushing for some ML-based improvements (e.g. a new ranking algo), you also want to use it.
This is why the cover of the reference A/B test book for product dev has a hippo: A/B testing is helpful against just following the Highest Paid Person's Opinion. The practice is ofc more complicated, but that's more organizational/politics.
simonw 1 hour ago [-]
In my own career I've only ever seen it increase the cost of development.
The vast majority of A/B test results I've seen showed no significant win in one direction or the other, in which case why did we just add six weeks of delay and twice the development work to the feature?
Usually it was because the Highest Paid Person insisted on an A/B test because they weren't confident enough to move on without that safety blanket.
There are other, much cheaper things you can do to de-risk a new feature. Build a quick prototype and run a usability test with 2-3 participants - you get more information for a fraction of the time and cost of an A/B test.
noodletheworld 14 hours ago [-]
“This kind of thing” being running AB tests at all.
There’s no reason to run AB / MVT tests at all if you’re not doing them properly.
freehorse 6 hours ago [-]
I do not understand what the first tests are supposed to do. The author says the:
> Your hypothesis is: layout influences signup behavior.
I would expect that then the null hypothesis is that *layout does not influence signup behavior*. I would then expect an ANOVA (or an equivalent linear model) to be what tests this hypothesis, where you test the 4 layouts (or the 4 new layouts plus a control?) in one factor. If you get a significant p-value (no multiple tests required) you go on with post-hoc tests to look into comparisons between the different layouts (for 4 layouts, it should be 6 tests). But then you can use ways to control for multiple comparisons that are not as strict as just dividing your threshold by the number of comparisons, e.g. with Tukey's test.
But here I assume there is a control (as in some of the users are still presented the old layout?) and each layout is compared to that control? If I saw that distribution of p-values I would just intuitively think that the experiment is underpowered. P-values from null tests are supposed to be distributed uniformly between 0 and 1, while these cluster around 0.05. It rather seems like a situation where it is hard to make inferences because of issues in designing the experiment itself.
For example, I would rather have fewer layouts, driven by some expert design knowledge, than a lot of randomish layouts. The first increases statistical power, because the fewer tests you investigate, the less you have to adjust your p-values. But also, the fewer layouts you have, the more users you have per group (as the test is between groups), which also increases statistical power. The article is not wrong overall about how to control p-values etc, but I think that this knowledge is important not just to "do the right analysis" but, even more importantly, to understand the limitations of an experimental design and structure it in a way that it may succeed in telling you something. To this end, g*power [0] is a useful tool that e.g. can let one calculate sample size in advance based on predicted effect size and power required.
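A sketch of that workflow (omnibus test first, post-hoc comparisons only afterwards), using scipy and statsmodels on made-up signup data. For a 0/1 signup outcome a chi-squared test or logistic regression would be the more natural omnibus test, so treat the ANOVA here as the rough linear-model equivalent the comment mentions; all rates and sample sizes are invented:

    import numpy as np
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(1)

    # Hypothetical per-user 0/1 signup outcomes for four layouts.
    rates = {"A": 0.10, "B": 0.13, "C": 0.10, "D": 0.11}
    signups = {k: rng.binomial(1, p, size=2000) for k, p in rates.items()}

    # Omnibus test of "layout influences signup behavior" (one p-value).
    _, p_omnibus = f_oneway(*signups.values())
    print(f"omnibus p = {p_omnibus:.4f}")

    # Post-hoc pairwise comparisons only if the omnibus test is significant.
    # Tukey's HSD controls the family-wise error rate over the 6 comparisons
    # less conservatively than dividing the threshold by 6 (Bonferroni).
    if p_omnibus < 0.05:
        y = np.concatenate(list(signups.values()))
        groups = np.repeat(list(signups.keys()), [len(v) for v in signups.values()])
        print(pairwise_tukeyhsd(y, groups, alpha=0.05))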
The fraction of A/B tests I've seen personally that mentioned ANOVA at all is very small. Or thought that critically about experiment design. Understanding of p values is also generally poor; prob/stat education in engineering and business degrees seems to be the least-covered-or-respected type of math.
Even at places that want to ruthlessly prioritize velocity over rigor I think it would be better to at least switch things up and worry more about effect size than p-value. Don't bother waiting to see if marginal effects are "significant" statistically if they aren't significant from the POV of "we need to do things that can 10x our revenue since we're a young startup."
fho 4 hours ago [-]
> mentioned ANOVA at all is very small
That's because nobody learns how to do statistics and/or those who do are not really interested in it.
I taught statistics to biology students. Most of them treated the statistics (and programming) courses like chores. Out of 300-ish students per year we had one or two that didn't leave uni mostly clueless about statistics.
TeMPOraL 4 hours ago [-]
FWIW, universities are pitching statistics the same way as every other subject, i.e. not at all. They operate under a delusion that students are desperately interested in everything and grateful for the privilege of being taught by a prestigious institution. That may have been the case 100 years ago, but it hasn't been for decades now.
For me, stats was something I had to re-learn years after graduating, after I realized their importance (not just practical, but also epistemological). During university years, whatever interest I might have had, got extinguished the second the TA started talking about those f-in urns filled with colored balls.
fho 1 hour ago [-]
Also part of the problem:
> those f-in urns filled with colored balls.
I did my Abitur [1] in 2005, back then that used to be high school material.
When I was teaching statistics we had to cut more and more content from the courses in favor of getting people up to speed on content that they should have known from school.
And you didn't have the mental capacity to abstract from the colored balls to whatever application domain you were interested in? Does everything have to come pre-digested for students so they don't have to do their own thinking?
stirfish 2 hours ago [-]
Hey yusina, that's pretty rude. What's a different way you could ask your question?
fn-mote 3 hours ago [-]
> They operate under a delusion that students are desperately interested in everything
In the US, students are the paying customers. The consequence for not learning everything is lowered skills available for the job market (engineering) or life (philosophy?).
To me it is preferable that students who do not understand are not rated highly by the university (=do not get top marks), but “forcing” the students to learn statistics? That doesn’t make much sense.
Also, there’s nothing wrong with learning something after uni. Every skill I use in my job was developed post-degree. Really.
enaaem 2 hours ago [-]
Instead of trying to make p-values work, what if we just stopped teaching p-values and confidence intervals, and taught Bayesian credible intervals and log odds ratios instead?
Are there problems that can only be solved with p-values?
zug_zug 36 minutes ago [-]
Yes this is real, and even happens at larger companies.
The surface issue is that when somebody has an incentive to self-measure their success then they have an incentive to overestimate (I increased retention by 14% by changing the shade of the "About Us" button!).
Which means the root-cause issue is managers who create environments where improvements get self-reported without any rigor or any contrary perspective. Ultimately they are the ones foot-gunning themselves (by letting their team focus on false vanity metrics).
aDyslecticCrow 2 hours ago [-]
Whenever working with this kind of probability, I always throw in a python rand() as a comparison. It sanity checks the calculation of the threshold with very low risk of miscalculation.
Of course calculating the threshold properly needs to be done as well... but a rand() is so quick and simple to add as a napkin check.
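A sketch of that sanity check, assuming a continuous metric and a plain t-test (names and numbers are invented): add a synthetic arm that is pure noise by construction and run it through the exact same analysis; if it "wins" about as often as your real variant across reruns, the threshold calculation is suspect.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)

    control      = rng.normal(loc=10.0, scale=3.0, size=1000)  # measured metric per user
    real_variant = rng.normal(loc=10.3, scale=3.0, size=1000)
    noise_arm    = rng.normal(loc=10.0, scale=3.0, size=1000)  # the "rand()" arm: identical to control by construction

    for name, arm in [("real variant", real_variant), ("noise arm", noise_arm)]:
        p = ttest_ind(arm, control).pvalue
        print(f"{name}: p = {p:.3f}")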
b0a04gl 52 minutes ago [-]
even the proposed fix (preregistration or sticking to single-metric tests) assumes your metric design is clean to begin with. in practice, i've seen product metrics get messy, nested, and full of indirect effects. i might preregister a metric like activation rate, but it's influenced by onboarding UX, latency, cohort time and external traffic spikes. so even if i avoid p-hacking structurally, i'm still overfitting to a proxy i don't fully control. that's the blindspot. how do i preregister a test when the metric itself isn't stable across runs? doesn't it just overcomplicate the process? it's new to me but context plays a bigger role ig
Jemaclus 15 hours ago [-]
> This isn't academic nit-picking. It's how medical research works when lives are on the line. Your startup's growth deserves the same rigor.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. Most companies don't cost peoples' lives when you get it wrong. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
Note: I'm not advocating stopping tests as soon as something shows trending in the right direction. The third scenario on the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
travisjungroth 13 hours ago [-]
Completely agree. The sign up flow for your startup does not need the same rigor as medical research. You don’t need transportation engineering standards for your product packaging, either. They’re just totally different levels of risk.
I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.
At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate correction in their heads.
For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there, whether you realize it or not. (People will look at p-values in consideration of prior evidence).
0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.
Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (content). It may be really high, it may be net negative!
But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.
If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your data reflect that.
yorwba 9 hours ago [-]
> You don’t need all the status quo bias of null hypothesis testing.
You don't have to make the status quo be the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting the null, revert the change.
Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."
scott_w 9 hours ago [-]
This is a reasonable approach, particularly when you’re looking at moving towards a bigger redesign that might not pay off right away. I’ve seen it called “non-inferiority test,” if you’re curious.
parpfish 10 hours ago [-]
Especially for startups with a small user base.
Not many users means that getting to stat sig will take longer (if at all).
Sometimes you just need to trust your design/product sense and assert that some change you’re making is better and push it without an experiment. Too often people use experimentation for CYA reasons so they can never be blamed for making a misstep
scott_w 9 hours ago [-]
100% this. I’ve seen people get too excited to A/B test everything even when it’s not appropriate. For us, changing prices was a common A/B test when the relatively low number of conversions meant the tests took 3 months to run! I believe we’ve moved away from that, now.
The company has a large user base, it’s just SaaS doesn’t have the same conversion # as, say, e-commerce.
bigfudge 8 hours ago [-]
The idea you should be going after bigger wins than .05 misses the point. The p value is a function of the effect size and the sample size. If you have a big effect you’ll see it even with small data.
Completely agree on the Bayesian point though, and the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.
epgui 14 hours ago [-]
It does, if you assume you care about the validity of the results or about making changes that improve your outcomes.
The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.
renjimen 14 hours ago [-]
But there’s an opportunity cost that needs to be factored in when waiting for a stronger signal.
epgui 12 hours ago [-]
Even if you have to be honest with yourself about how much you care about being right, there’s still a place for balancing priorities. Two things can be true at once.
Sometimes someone just has to make imperfect decisions based on incomplete information, or make arbitrary judgment calls. And that’s totally fine… But it shouldn’t be confused with data-driven decisions.
The two kinds of decisions need to happen. They can both happen honestly.
scott_w 8 hours ago [-]
There is but you can decide that up front. There’s tools that will show you how long it’ll take to get statistical significance. You can then decide if you want to wait that long or have a softer p-value.
Nevermark 14 hours ago [-]
One solution is to gradually move instances to your most likely solution.
But continue a percentage of A/B/n testing as well.
This allows for a balancing of speed vs. certainty.
imachine1980_ 13 hours ago [-]
do you use any tool for this, or simply crank up the dial slightly each day?
travisjungroth 13 hours ago [-]
There are multi armed bandit algorithms for this. I don’t know the names of the public tools.
This is especially useful for something where the value of the choice is front loaded, like headlines.
jjmarr 12 hours ago [-]
Can this be solved by setting p=0.50?
Make your expectations explicit instead of implicit. 0.05 is completely arbitrary. If you are comfortable with a 50/50 chance of being right, make your threshold less rigorous.
scott_w 8 hours ago [-]
I think at that point you may as well skip the test and just make the change you clearly want to make!
bigfudge 8 hours ago [-]
Or collect some data and see if the net effect is positive.
It’s possibly worth collecting some data though to rule out negative effects?
scott_w 6 hours ago [-]
Absolutely, you can still analyse the outcomes and try to draw conclusions. This is true even for A/B testing.
scott_w 9 hours ago [-]
> The consequences of getting it wrong are... you sell fewer widgets?
If that’s the difference between success and failure then that is pretty important to you as a business owner.
> do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive
That’s a reasonable, and in plenty of contexts the absolute best, approach to take. But don’t call it A/B testing, because it’s not.
yusina 7 hours ago [-]
I see where you are coming from, and overtesting is a thing, but I really believe that the baseline of quality of all software out there is terrible. We are just so used to it and it's been normalized. But there is really no day that goes by during which I'm not annoyed by a bug that somebody with more attention to quality would not have let through.
It's not about space rocket type of rigor, but it's about a higher bar than the current state.
(Besides, Elon's rockets are failing left and right, in contrast to what NASA achieved in the 60s, so there are some lessons there too.)
brian-armstrong 13 hours ago [-]
The thing is though, you're just as likely to be not improving things.
I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.
scott_w 9 hours ago [-]
Allow me to rephrase what I think you’re saying:
Startups need to ship because they need to have a habit of moving constantly to survive. Stasis is death for a startup.
psychoslave 9 hours ago [-]
> We aren't shooting rockets into space.
Most of us don't, indeed. So, still aligned with your perspective, it's good to take into consideration what we are currently working on, and what the possible implications will be. Sometimes the line is not so obvious though. If we design a library or framework which is not very specific to an inconsequential outcome, it's no longer obvious what policy makes more sense.
BrenBarn 11 hours ago [-]
The other thing is that in those medical contexts, the choice is often between "use this specific treatment under consideration, or do nothing (i.e., use existing known treatments)". Is anyone planning to fold their startup if they can't get a statistically significant read on which website layout is best? Another way to phrase "do no harm" is to say that a null result just means "there is no reason to change what you're doing".
bobbruno 3 hours ago [-]
It's not a matter of life and death, I agree - to some extent. Startups have very limited resources, and ignoring inconclusive results in the long term means you're spending these resources without achieving any bottom line results. If you do that too much/too long, you'll run out of funding and the startup will die.
The author didn't go into why companies do this (ignoring or misreading test results). Putting lack of understanding aside, my anecdotal experience from the time I worked as a data scientist boils down to a few major reasons:
- Wanting to be right. Being a founder requires high self-confidence, that feeling of "I know I'm right". But feeling right doesn't make one right, and there's plenty of evidence around that people will ignore evidence against their beliefs, even rationalize the denial (and yes, the irony of that statement is not lost on me);
- Pressure to show work: doing the umpteenth UI redesign is better than just saying "it's irrelevant" in your performance evaluation. If the result is inconclusive, the harm is smaller than not having anything to show - you are stalling the conclusion that your work is irrelevant by doing whatever. So you keep on pushing them and reframing the results into some BS interpretation just to get some more time.
Another thing that is not discussed enough is what all these inconclusive results would mean if properly interpreted. A long sequence of inconclusive UI redesign experiments should trigger a hypothesis like "does the UI matter"? But again, those are existentially threatening questions for the people in the best position to come up with them. If any company out there were serious about being data-driven and scientific, they'd require tests everywhere, have external controls on quality and rigour of those and use them to make strategic decisions on where they invest and divest. At the very least, take them as a serious part of their strategy input.
I'm not saying you can do everything based on tests, nor that you should - there are bets on the future, hypothesis making on new scenarios and things that are just too costly, ethically or physically impossible to test. But consistently testing and analysing test results could save a lot of work and money.
syntacticsalt 8 hours ago [-]
> Most companies don't cost peoples' lives when you get it wrong.
True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw-men.
We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.
As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.
> A lot of companies are, arguably, _too rigorous_ when it comes to testing.
My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.
Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.
> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data in order to satisfy the preconditions of the test needed for the nominal design criteria of the test -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.
Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.
> I do like their proposal for "peeking" and subsequent testing.
What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
> We're shipping software. We can change things if we get them wrong.
That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.
> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.
While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.
> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.
I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."
> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.
AIFounder 13 hours ago [-]
[dead]
kgwgk 4 hours ago [-]
> Users are randomly assigned to one of the four layouts and you track their activity. Your hypothesis is: layout influences signup behavior.
> You plan ship the winner if the p-value for one of the layout choices falls below the conventional threshold of 0.05.
  Tests         P-value
  B is winner   0.041
  A is winner   0.051
  D is winner   0.064
  C is winner   0.063
What kind of comparison between the results for the four options makes each of them a likely winner? They all rank very well in whatever metric is being used!
Or maybe they are being compared with a fifth, much worse, alternative.
vjerancrnjak 4 hours ago [-]
This kind of ranking is also not correct. You have to compare the outcomes, not their p-values. Ranking by p-values is just silly, just like ranking by the average metric is silly.
Startups in general have to make decisions with high signal; thinking that 100 improvements of 1% at p=0.05 will actually compound in an environment with so much noise is delusion.
I’d say doing this kind of silliness in a startup is just ceremonial, helpful long term if people feel they are doing a good job optimizing a compounding metric, even though it never materializes.
begemotz 2 hours ago [-]
> Your hypothesis is: layout influences signup behavior.
This might be your hypothesis but this isn't the hypothesis that the p-value is related to.
>Setting a p-value threshold of 0.05 is equivalent to saying: "I’m willing to accept a 5% chance of shipping something that only looked good by chance."
P-values don't provide any information about "chance occurrence" but rather they test the probability of observing a particular outcome assuming a particular state of the world (i.e. the null hypothesis).
> Bonferonni
Besides being a very aggressive (i.e. conservative) correction, I would imagine that in industry, just as in science, the motivation to observe results will mean it won't be used. There are other, way more reasonable corrections.
> Avoid digging through metrics post hoc
The reasonable solution is not to ignore the results that you found but to interpret them appropriately. Maybe the result actually does signal improved retention - alternatively maybe it is noise. Treat it as exploratory data. If that improved retention was real, is this important? Important enough to appropriately retest for?
ec109685 14 hours ago [-]
Keep in mind that frequent A/B tests burn statistical “credit.” Any time you ship a winner at p = 0.05 you’ve spent 5% of your false-positive budget. Do that five times in a quarter and the chance at least one is noise is 1 − 0.95⁵ ≈ 23%.
There are several approaches you can take to reduce that source of error:
Quarterly alpha ledger
Decide how much total risk you want this quarter (say 10 %). Divide the remaining α by the number of experiments left and make that the threshold for the next launch. Forces the “is this button-color test worth 3 % of our credibility?” conversation. More info: “Sequential Testing in Practice: Why Peeking Is a Problem and How to Fix It” (https://medium.com/@aisagescribe/sequential-testing-in-pract...).
Benjamini–Hochberg (BH) for metric sprawl
Once you watch a dozen KPIs, Bonferroni buries real lifts. BH ranks all the p-values at the end, then sets the cut so that, say, only 5 % of declared winners are false positives. You keep power, and you can run the same BH step on the primary metric from every experiment each quarter to catch lucky launches. More info: “Controlling False Discoveries: A Guide to BH Correction in Experimentation” (https://www.statsig.com/perspectives/controlling-false-disco...).
Bayesian shrinkage + 5% “ghost” control for big fleets
FAANG-scale labs run hundreds of tests and care about 0.1 % lifts. They pool everything in a simple hierarchical model; noisy effects get pulled toward the global mean, so only sturdy gains stay above water. Before launch, they sanity-check against a small slice of traffic that never saw any test. Cuts winner’s-curse inflation by ~30 %. Clear explainer: “How We Avoid A/B Testing Errors with Shrinkage” (https://eng.wealthfront.com/2015/10/29/how-we-avoid-ab-testi...) and (https://www.statsig.com/perspectives/informed-bayesian-ab-te...)
<10 tests a quarter: alpha ledger or yolo; dozens of tests and KPIs: BH; hundreds of live tests: shrinkage + ghost control.
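For the BH step specifically, a minimal sketch with statsmodels (the p-values are placeholders standing in for one quarter's primary-metric results):

    from statsmodels.stats.multitest import multipletests

    # End-of-quarter p-values, one per experiment's primary metric (made up).
    p_values = [0.0004, 0.003, 0.012, 0.041, 0.049, 0.051, 0.20, 0.38]

    # Benjamini-Hochberg: keep the expected share of false discoveries
    # among declared winners at 5%, instead of Bonferroni's blanket cut.
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
    for p, q, keep in zip(p_values, p_adj, reject):
        print(f"raw p = {p:.4f}   BH-adjusted = {q:.4f}   winner: {keep}")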
akoboldfrying 12 hours ago [-]
> the chance at least one is noise is 1 – 0.95⁵ ≈ 23 %
Yes, but that's not really the big deal that you're making it out to be, since it's (usually) not an all-or-nothing thing. Usually, the wins are additive. The chance of each winner being genuine is still 95% (assuming no p-hacking), and so the expected number of wins out of those 5 will be 0.95 * 5 = 4.75 wins (by linearity of expectation), which is a solid win rate.
kgwgk 5 hours ago [-]
>> the chance at least one is noise is 1 – 0.95⁵ ≈ 23 %
> The chance of each winner being genuine is still 95%
Not really. It depends on what’s the unknown (but fixed in a frequentist analysis like this one) difference between the options - or absence thereof.
If there is no real difference it’s 100% noise and each winner is genuine with probability 0%. If the difference is huge the first number is close to 0% and the second number is close to 100%.
ec109685 12 hours ago [-]
Good point. The 23% in the example refers to the worst case where 5 tests are all null throughout the period.
kylecazar 14 hours ago [-]
I like the points and I'll probably link to this.
I'll add one from my experience as a PM dealing with very "testy" peers in early stage startups: don't do any of this if you don't have {enough} users -- rely on intuition and focus on the core product.
physix 10 hours ago [-]
I was waiting for that comment to appear.
If your core product isn't any good, A/B testing seems like rearranging the deck chairs on the Titanic.
kdamica 11 hours ago [-]
Hard disagree with this. Unlike medical experiments, the cost of being wrong in startup experiments is very low: you thought there was a small effect and there was none. It’s usually just a matter of pushing one variant vs another and moving on.
There are certainly scenarios where more rigor is appropriate, but usually those come from trying to figure out why you’re seeing a certain effect and how that should affect your overall company strategy.
My advice for startups is to run lots of experiments, do bad statistics, and know that you’re going to have some false positives so that you don’t take every result as gospel.
bravesoul2 11 hours ago [-]
The danger, I think, is less the numbers and more whether what you are measuring makes sense. E.g. sure, your A beats B in click-through rate. But if the person then thinks "fuck, I was duped" and closes the browser, then that's no good.
tmoertel 14 hours ago [-]
When reading this article, be aware that there are some percent signs missing, and their absence might cause confusion. For example:
> After 9 peeks, the probability that at least one p-value dips below 0.05 is: 1 − (1 − 0.05)^9 = 37.
There should be a percent sign after that 37. (Probabilities cannot be greater than one.)
kookamamie 7 hours ago [-]
…or, you could have a product that does not hinge on some micro-optimization of a website layout.
roncesvalles 7 hours ago [-]
Exactly. This is what micro-optimization looks like on the product side.
stared 11 minutes ago [-]
I think we should declare a moratorium on the use of p-values.
If you don't understand what a p-value is, you shouldn't use it. If you do understand p-values, you're probably already moving away from them. The Bayesian approach makes much more sense. I highly recommend David MacKay's "Information Theory, Inference and Learning Algorithms," as well as "Bayesian Methods for Hackers": https://github.com/CamDavidsonPilon/Probabilistic-Programmin....
At the same time, startups aren't science experiments. Our goal is not necessarily to prove conclusively whether something is statistically "better". Rather, our goal is to solve real-world problems.
Suppose we run an A/B test, and its result indicates B is better according to whatever statistical test we've used. In this scenario, we will likely select B—frankly, regardless of whether B is truly better or merely indistinguishable from A.
However, what truly matters in practice are the metrics we choose to measure. Picking the wrong metric can lead to incorrect conclusions. For example, suppose our data shows that users spend, on average, two more seconds on our site with the new design (with p < 0.001 or whatever). That might be a positive result—or it could simply mean the new design causes slower loading or more confusion, frustrating users instead of benefiting them.
welpo 7 hours ago [-]
On the third point (peeking at p-values), I created an A/A test simulator that compares peeking vs not peeking in terms of false positive rate: https://stop-early-stopping.osc.garden/
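A stripped-down version of what such a simulator shows, assuming a continuous metric, plain t-tests, and 9 evenly spaced peeks (everything here is invented for illustration):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(7)
    n_total, n_peeks, n_sims = 9000, 9, 2000
    peek_points = np.linspace(n_total // n_peeks, n_total, n_peeks, dtype=int)

    false_positives = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_total)   # A/A test: both arms draw from the same distribution
        b = rng.normal(size=n_total)
        # Stop and declare a "winner" if any interim look shows p < 0.05.
        if any(ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in peek_points):
            false_positives += 1

    # Far above the nominal 5% (though below the 37% independence bound,
    # since successive peeks are correlated).
    print(false_positives / n_sims)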
dooglius 11 hours ago [-]
If the goal is for the company to maximize profit from having the best page, this is an instance of a very well-studied problem https://en.m.wikipedia.org/wiki/Multi-armed_bandit?useskin=v... and one can do much better than statistical significance testing. (If the goal is to validate scientific theories, or there are other extenuating factors, things may be different.)
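A minimal Thompson-sampling sketch of that bandit framing, for two page variants with Bernoulli conversions (the "true" rates are invented and only used to simulate visitors):

    import numpy as np

    rng = np.random.default_rng(3)
    true_rates = [0.10, 0.12]      # unknown in practice; here only to simulate users
    successes = np.zeros(2)
    failures = np.zeros(2)

    for _ in range(10_000):        # each iteration = one visitor
        # Draw a plausible conversion rate for each arm from its Beta
        # posterior and show the arm with the higher draw.
        samples = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(samples))
        converted = rng.random() < true_rates[arm]
        successes[arm] += converted
        failures[arm] += 1 - converted

    print("traffic per arm: ", successes + failures)
    print("observed rates:  ", successes / (successes + failures))

Traffic drifts toward the better arm as evidence accumulates, which is the profit-maximizing behavior being pointed at here, traded off against getting a clean fixed-horizon significance test.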
meindnoch 4 hours ago [-]
I've never seen AB-testing micro-optimisation affect the bottom line of a software company by more than 3%.
aDyslecticCrow 2 hours ago [-]
Whenever working with this kind of probability, I always throw in a python rand() as a comparison. It sanity checks the calculation of the threshold with very low risk of miscalculation.
Of course calculating the threshold properly needs to be done as well... but a rand() is so quick and simple to add as a napkin check, and would catch how silly the first analysis is.
snowstormsun 5 hours ago [-]
More often than not, the problem is simply that the product/vision itself is not as good as it was sold to shareholders. No PM likes to push back on this, so they're stuck with things like optimizing the landing page, because "clearly the business vision is flawless, it must be that users don't understand it correctly".
andy99 15 hours ago [-]
> Imagine you're a product manager trying to optimize your website’s dashboard. Your goal is to increase user signups.
This would be Series B or later right? I don't really feel like it's a core startup behavior.
mo_42 9 hours ago [-]
> Back to the dashboard experiment: after you applied the Bonferroni correction you got... nothing.
I guess you got something: users are not sensitive to these changes, or any effect is too small to detect with your current sample size/test setup.
In a startup scenario, I'd quickly move on and possibly ship all developed options if good enough.
Also, running A/B tests might not be the most appropriate method in such a scenario. What about user-centric UX research methods?
ryan-duve 14 hours ago [-]
Good news: no p-value threshold needs to be passed to switch from one UI layout to another. As long as they all cost the same amount of money to host/maintain/whatever, the point estimate is sufficient. The reason is, at the end of the day, some layout has to be shown, and if each option had an equal number of visitors during the test, you can safely pick the one with the most signups.
When choosing one of several A/B test options, a hypothesis test is not needed to validate the choice.
ec109685 13 hours ago [-]
Yes, but assuming it was enhancing something already there, it was all pointless work.
blobbers 13 hours ago [-]
1 - (1-0.95)^9 = 64
Did they generate this blog post with AI? That math be hallucinating. Don’t need a calculator to see that.
larfus 9 hours ago [-]
Read a few more posts and it shouts GPT occasionally. Plus the author's (as I like to call them still) role is listed as 'Content Engineer' which isn't inspiring either. Too bad, the topics sounded interesting.
blobbers 13 hours ago [-]
I’m so confused by the math in this article. It’s also not 37. I can’t be the only person scratching their head.
Hmm, (1 - (1-0.95))^9 also = 63%. No idea why 64, closest I can see is 1-(0.95)^20 or 1-(1-0.05)^20 = 64%.
blobbers 11 hours ago [-]
Yeah, I thought he was talking about 1 out of 20 features, but that's kind of why I was wondering if AI had written it. Sometimes it'll have mis-aligned figures etc.
calrain 6 hours ago [-]
This feels like a billion dollar company problem, not a startup problem
psychoslave 9 hours ago [-]
P-values are something new for me, so the post starts from prerequisites that I'm missing. Though I can go search by myself, would anyone have some online resources to recommend that I can follow and test myself against, please?
Thanks, seems relevant. I would also appreciate resources at the epistemological level: when and where it was devised, and any historical context of its development.
Thanks again
shoo 15 hours ago [-]
related book: Trustworthy Online Controlled Experiments
I don't have any first hand experience with customer facing startups, SaaS or otherwise. How common is rigorous testing in the first place?
dayjah 15 hours ago [-]
As you scale it improves. More often at a small scale you ask users and they’ll give you invaluable information. As you scale you abstract folks into buckets. At about 1 million MAU I’ve found A/B testing and p-values start to make sense.
bcyn 15 hours ago [-]
Great read, thanks! Could you dive a little deeper into example 2 & pre-registration? Conceptually I understand how the probability of false positives increases with the number of variants.
But how does a simple act such as "pre-registration" change anything? It's not as if observing another metric that already existed changes anything about what you experimented with.
PollardsRho 15 hours ago [-]
If you have many metrics that could possibly be construed as "this was what we were trying to improve", that's many different possibilities for random variation to give you a false positive. If you're explicit at the start of an experiment that you're considering only a single metric a success, it turns any other results you get into "hmm, this is an interesting pattern that merits further exploration" and not "this is a significant result that confirms whatever I thought at the beginning."
It's basically a variation on the multiple comparisons, but sneakier: it's easy to spend an hour going through data and, over that time, test dozens of different hypotheses. At that point, whatever p-value you'd compute for a single comparison isn't relevant, because after that many comparisons you'd expect at least one to have uncorrected p = 0.05 by random chance.
noodletheworld 15 hours ago [-]
There are many resources that will explain this rigorously if you search for the term “p-hacking”.
The TLDR as I understand it is:
All data has patterns. If you look hard enough, you will find something.
How do you tell the difference between random variance and an actual pattern?
It’s simple and rigorously correct to only search the data for a single metric; other methods, e.g. the Bonferroni correction (divide the α threshold by k), exist, but are controversial (1).
Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful.
If you see a pattern in another metric, run another experiment.
Aside from the p-values, I don't understand the reasoning behind whatever "experiment" is being used for the A/B testing. What test is being done whose result is interpreted as "A is winner"? The discussion is about those being separate comparisons, and yeah, okay, but what are they comparisons of? Each group in isolation vs. all the others? If (as the article says) the hypothesis is "layout influences signup behavior" then it seems more reasonable to do a chi-squared test on a contingency table of layout vs. signed-up-or-didn't, which would give you one p-value for "is there anything here at all".
And then, if there isn't. . . it means you can just ship whatever you want! The real root cause of p-hacking is glossed over in the article: "Nobody likes arriving empty-handed to leadership meetings." This is the corporate equivalent of "no one will publish a null result", and is just as harmful here. The statistical techniques described are fine, but there's not necessarily a reason to fortify your stats against multiple comparisons rather than just accepting a null result.
And you can, because of the other thing I kept thinking when reading this: you have to ship something. There isn't really a "control" condition if you're talking about building a website from scratch. So whether the result is null doesn't really matter. It's not like comparing different medicines or fertilizers or something where if none of them work you just do nothing; there is no "do nothing" option in this situation. So why not just take a simple effect measurement (e.g., proportion who signed up) and pick the layout that performs best? If that result is statistically significant, great, it means you picked the best one, and if it's not, it just means it doesn't matter which one you pick, so the one you picked is still fine. (And if you have an existing design and you're trying to see if a new one will be better, the null result just means "there's no reason to switch", which means the existing design is also fine.)
vmesel 6 hours ago [-]
Congratulations on the content, Thais!
NoahZuniga 10 hours ago [-]
Why don't the p values in the first figure sum to 1?
cckolon 14 hours ago [-]
Example 01 is basically the “green jellybeans cause acne” problem
Don't do any of this. It's very outdated advice. And you're going to get it wrong anyway. These threshold adjustment methods were invented before we had access to reasonable computers.
You shuffle the data. Say you want to know if viewing time is affected by color. Literally randomly shuffle viewing time and color. Then, look at that distribution. Is the data that you observed significant?
As long as you shuffle everything related to your experiment and you don't double dip into the data you're going to get things right.
This also has the big advantage that it doesn't overcorrect like traditional methods which apply such strict corrections that eventually it's impossible to get significant results.
This post hasn't even begun scratching the surface on what can go wrong with traditional tests. Just don't use them.
This has nothing to do with speed or rigor. Permutation tests are much simpler to run and faster to analyze. Sadly we keep teaching crappy statistics to our students.
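A minimal sketch of the shuffling procedure described above, for "does color affect mean viewing time" with two groups (the data and effect size are invented):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical observed data: viewing time in seconds per user, by color.
    blue  = rng.normal(31.0, 10.0, size=400)
    green = rng.normal(33.0, 10.0, size=400)
    observed_diff = green.mean() - blue.mean()

    # If color had no effect, the labels are exchangeable: shuffle them and
    # see how often a difference at least this large appears by chance.
    pooled = np.concatenate([blue, green])
    n_green, n_shuffles, count = len(green), 10_000, 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = pooled[:n_green].mean() - pooled[n_green:].mean()
        count += abs(diff) >= abs(observed_diff)

    print("permutation p-value:", count / n_shuffles)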
syntacticsalt 8 hours ago [-]
Permutation tests don't account for family-wise error rate effects, so I'm curious why you would say that "it doesn't overcorrect like traditional methods".
I'm also curious why you say those "cover every case", because permutation tests tend to be underpowered, and also tend to be cumbersome when it comes to constructing confidence intervals of statistics, compared to something like the bootstrap.
Don't get me wrong -- I like permutation tests, especially for their versatility, but as one tool out of a bunch of methods.
cornel_io 9 hours ago [-]
Even though this post says exactly the thing that most Proper Analysts will say, and write long LinkedIn posts about where other Proper Analysts congratulate them on standing up for Proper Analysis in the face of Evil And Stupid Business Dummies who just want to make bad decisions based on too little data, it's wrong. The Jedi Bell Curve meme is in full effect on this topic, and I say this as someone who took years to get over the midwit hump and correct my mistaken beliefs.
The business reality is, you aren't Google. You can't collect a hundred million data points for each experiment that you run so that you can reliably pick out 0.1% effects. Most experiments will have a much shorter window than any analyst wants them to, and will have far too few users, with no option to let them run longer. You still have to make a damned decision, now, and move on to the next feature (which will also be tested in a heavily underpowered manner).
Posts like this say that you should be really, REALLY careful about this, and apply Bonferroni corrections and make sure you're not "peeking" (or if you do peek, apply corrections that are even more conservative), preregister, etc. All the math is fine, sure. But if you take this very seriously and are in the situation that most startups are in where the data is extremely thin and you need to move extremely fast, the end result is that you should reject almost every experiment (and if you're leaning on tests, every feature). That's the "correct" decision, academically, because most features lie in the sub 5% impact range on almost any metric you care about, and with a small number of users you'll never have enough power to pick out effects that small (typically you'd want maybe 100k, depending on the metric you're looking at, and YOU probably have a fraction of that many users).
But obviously the right move is not to just never change the product because you can't prove that the changes are good - that's effectively applying a very strong prior in favor of the control group, and that's problematic. Nor should you just roll out whatever crap your product people throw at the wall: while there is a slight bias in most experiments in favor of the variant, it's very slight, so your feature designers are probably building harmful stuff about half the time. You should apply some filter to make sure they're helping the product and not just doing a random walk through design space.
The best simple strategy in a real world where most effect sizes are small and you never have the option to gather more data really is to do the dumb thing: run experiments for as long as you can, pick whichever variant seems like it's winning, rinse and repeat.
Yes, you're going to be picking the wrong variant way more often than your analysts would prefer, but that's way better than never changing the product or holding out for the very few hugely impactful changes that you are properly powered for. On average, over the long run, blindly picking the bigger number will stack small changes, and while a lot of those will turn out to be negative, your testing will bias somewhat in favor of positive ones and add up over time. And this strategy will provably beat one that does Proper Statistics and demands 95% confidence or whatever equivalent Bayesian criteria you use, because it leaves room to accept the small improvements that make up the vast majority of feature space.
There's an equivalent and perhaps simpler way to justify this, which is to throw out the group labels: if we didn't know which one was the control and we had to pick which option was better, then quite obviously, regardless of how much data we have, we just pick the one that shows better results in the sample we have. Including if there's just a single user in each group! In an early product, this is TOTALLY REASONABLE, because your current product sucks, and you have no reason to think that the way it is should not be messed with. Late lifecycle products probably have some Chesterton's fence stuff going on, so maybe there's more of an argument to privilege the control, but those types of products should have enough users to run properly powered tests.
akoboldfrying 12 hours ago [-]
Yes! (Correct) pre-registration is everything. ("Correct" meaning: There's no point "pre-registering" if you fail to account for the number of tests you'll do -- but hopefully the fact that you have thought to pre-register at all is a strong indication that you should be performing such corrections.)
That said, I agree with the other poster here about how important this really is for startups. It's critical to know if the drug really improves lung function; it's probably not critical to know whether the accent colour on your landing page should be mauve or aqua blue.
Not only are experiments commonly multi-arm, you also repeat your experiment (usually after making some changes) if the previous experiment failed / did not pass the launch criteria.
This is further complicated by the fact that launch criteria are usually not well defined ahead of time. Unless it's a complete slam dunk, you won't know until your launch meeting whether the experiment will be approved for launch or not. It's mostly vibe-based, determined from tens or hundreds of "relevant" metric movements, often decided on the whim of the stakeholder sitting at the launch meeting.
The idea is not to do science. The idea is to loosely systematize and conceptualize innovation. To generate options and create a failure-tolerant system.
I'm sure improvements could be made... but this isn't about being a valid or invalid experiment.
> The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p values. We discuss in the context of several examples of published papers where data-analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not pre specified and, as a result, were contingent on data.
It's worth emphasizing though that if your startup hasn't achieved product market fit yet this kind of thing is a huge waste of time! Build features, see if people use them.
This is why the cover of the reference A/B testing book for product dev has a hippo: A/B testing is helpful against just following the Highest Paid Person's Opinion. The practice is of course more complicated, but that's more organizational/politics.
The vast majority of A/B test results I've seen showed no significant win in one direction or the other, in which case why did we just add six weeks of delay and twice the development work to the feature?
Usually it was because the Highest Paid Person insisted on an A/B test because they weren't confident enough to move on without that safety blanket.
There are other, much cheaper things you can do to de-risk a new feature. Build a quick prototype and run a usability test with 2-3 participants - you get more information for a fraction of the time and cost of an A/B test.
There’s no reason to run AB / MVT tests at all if you’re not doing them properly.
> Your hypothesis is: layout influences signup behavior.
I would expect the null hypothesis to be that *layout does not influence signup behavior*. I would then expect an ANOVA (or an equivalent linear model) to be what tests this hypothesis, where you test the 4 layouts (or the 4 new layouts plus a control?) in one factor. If you get a significant p-value (no multiple tests required) you go on with post-hoc tests to look into comparisons between the different layouts (for 4 layouts, that's 6 pairwise tests). You can then use ways to control for multiple comparisons that are not as strict as just dividing your threshold by the number of comparisons, e.g. Tukey's test.
But here I assume there is a control (as in some of the users are still presented the old layout?) and each layout is compared to that control? If I saw that distribution of p-values I would just intuitively think that the experiment is underpowered. P-values from null tests are supposed to be distributed uniformly between 0 and 1, while these cluster around 0.05. It rather seems like a situation where it is hard to make inferences because of issues in the design of the experiment itself.
For example, I would rather have fewer layouts, driven by some expert design knowledge, than a lot of randomish layouts. The first increases statistical power, because the fewer tests you investigate, the less you have to adjust your p-values. But also, the fewer layouts you have, the more users you have per group (as the test is between groups), which also increases statistical power. The article is not wrong overall about how to control p-values etc, but I think that this knowledge is important not just to "do the right analysis" but, even more importantly, to understand the limitations of an experimental design and structure it in a way that it may succeed in telling you something. To this end, G*Power [0] is a useful tool that can, e.g., let one calculate sample size in advance based on predicted effect size and required power.
[0] https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psy...
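To make that concrete, here is a minimal sketch of the omnibus-then-post-hoc flow in Python, assuming scipy/statsmodels and a made-up continuous per-user metric (for a binary signup outcome a chi-square test or logistic regression would be the analogue):

    import numpy as np
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)

    # Hypothetical per-user metric (e.g. minutes on site) for four layouts.
    layouts = {
        "A": rng.normal(5.0, 2.0, 400),
        "B": rng.normal(5.2, 2.0, 400),
        "C": rng.normal(4.9, 2.0, 400),
        "D": rng.normal(5.6, 2.0, 400),
    }

    # Omnibus test: does layout influence the metric at all?
    f_stat, p_omnibus = f_oneway(*layouts.values())
    print(f"ANOVA: F={f_stat:.2f}, p={p_omnibus:.4f}")

    # Only if the omnibus test is significant, look at the 6 pairwise
    # comparisons with Tukey's HSD, which controls the family-wise error
    # rate less harshly than dividing the threshold by 6.
    if p_omnibus < 0.05:
        values = np.concatenate(list(layouts.values()))
        groups = np.repeat(list(layouts.keys()), [len(v) for v in layouts.values()])
        print(pairwise_tukeyhsd(values, groups, alpha=0.05))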
Even at places that want to ruthlessly prioritize velocity over rigor I think it would be better to at least switch things up and worry more about effect size than p-value. Don't bother waiting to see if marginal effects are "significant" statistically if they aren't significant from the POV of "we need to do things that can 10x our revenue since we're a young startup."
That's because nobody learns how to do statistics and/or those who do are not really interested in it.
I taught statistics to biology students. Most of them treated the statistics (and programming) courses like chores. Out of 300-ish students per year we had one or two that didn't leave uni mostly clueless about statistics.
For me, stats was something I had to re-learn years after graduating, after I realized their importance (not just practical, but also epistemological). During university years, whatever interest I might have had, got extinguished the second the TA started talking about those f-in urns filled with colored balls.
> those f-in urns filled with colored balls.
I did my Abitur [1] in 2005, back then that used to be high school material.
When I was teaching statistics we had to cut more and more content from the courses in favor of getting people up to speed on content that they should have known from school.
[1] https://en.m.wikipedia.org/wiki/Abitur
In the US, students are the paying customers. The consequence for not learning everything is lowered skills available for the job market (engineering) or life (philosophy?).
To me it is preferable that students who do not understand are not rated highly by the university (=do not get top marks), but “forcing” the students to learn statistics? That doesn’t make much sense.
Also, there’s nothing wrong with learning something after uni. Every skill I use in my job was developed post-degree. Really.
The surface issue is that when somebody has an incentive to self-measure their success then they have an incentive to overestimate (I increased retention by 14% by changing the shade of the "About Us" button!).
Which means the root-cause issue is managers who create environments where improvements can be self-reported without any rigor or any contrary perspective. Ultimately they are the ones foot-gunning themselves (by letting their team focus on false vanity metrics).
Of course calculating the threshold properly needs to be done as well... but a rand() is so quick and simple to add as a napkin check.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. Getting it wrong at most companies doesn't cost people's lives. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
Note: I'm not advocating stopping tests as soon as something shows trending in the right direction. The third scenario on the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.
At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multivariate corrections in their heads.
For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there, whether you realize it or not. (People will look at p-values in consideration of prior evidence).
0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.
Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (content). It may be really high, it may be net negative!
But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.
If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your data reflect that.
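For anyone who wants the simplest possible Bayesian starting point, here is a minimal Beta-Binomial sketch (numpy only; the priors and the counts are made up, not a recommendation of specific numbers):

    import numpy as np

    rng = np.random.default_rng(5)

    # Hypothetical results: conversions / users for control (A) and variant (B).
    conv_a, n_a = 430, 5000
    conv_b, n_b = 472, 5000

    # Beta(1, 1) priors; the posterior for each rate is
    # Beta(1 + conversions, 1 + non-conversions). Sample both and compare.
    samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 200_000)
    samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 200_000)

    print(f"P(B beats A) = {(samples_b > samples_a).mean():.1%}")
    print(f"expected lift = {(samples_b - samples_a).mean() * 100:+.2f} percentage points")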
You don't have to make the status quo be the null hypothesis. If you make a change, you probably already think that your change is better or at least neutral, so make that the null. If you get a strong signal that your change is actually worse, rejecting the null, revert the change.
Not "only keep changes that are clearly good" but "don't keep changes that are clearly bad."
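A minimal sketch of that flipped test, assuming binary conversions and statsmodels (the counts are invented): the change stays unless the data strongly says it is worse.

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical counts: [variant, control] conversions and sample sizes.
    conversions = [410, 452]
    users = [5000, 5000]

    # One-sided test of H0: variant is at least as good, vs H1: variant is worse.
    # A small p-value is a strong signal of harm.
    stat, p_value = proportions_ztest(conversions, users, alternative="smaller")

    if p_value < 0.05:
        print(f"p={p_value:.3f}: strong signal the change is worse -- revert it")
    else:
        # A mild-looking dip (as here) is not enough to revert under this policy.
        print(f"p={p_value:.3f}: no strong signal of harm -- keep the change")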
Not many users means that getting to stat sig will take longer (if at all).
Sometimes you just need to trust your design/product sense and assert that some change you’re making is better and push it without an experiment. Too often people use experimentation for CYA reasons so they can never be blamed for making a misstep
The company has a large user base, it’s just SaaS doesn’t have the same conversion # as, say, e-commerce.
Completely agree on the Bayesian point though, and the importance of defining the loss function. Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time.
The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.
Sometimes someone just has to make imperfect decisions based on incomplete information, or make arbitrary judgment calls. And that’s totally fine… But it shouldn’t be confused with data-driven decisions.
The two kinds of decisions need to happen. They can both happen honestly.
But continue a percentage of A/B/n testing as well.
This allows for a balancing of speed vs. certainty
This is especially useful for something where the value of the choice is front loaded, like headlines.
Make your expectations explicit instead of implicit. 0.05 is completely arbitrary. If you are comfortable with a 50/50 chance of being right, make your threshold less rigorous.
If that’s the difference between success and failure then that is pretty important to you as a business owner.
> do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive
That’s a reasonable, and in plenty of contexts the absolute best, approach to take. But don’t call it A/B testing, because it’s not.
It's not about space rocket type of rigor, but it's about a higher bar than the current state.
(Besides, Elon's rockets are failing left and right, in contrast to what NASA achieved in the 60s, so there are some lessons there too.)
I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.
Startups need to ship because they need to have a habit of moving constantly to survive. Stasis is death for a startup.
Most of us don't, indeed. So, still aligned with your perspective, it's good to take into consideration what we're currently working on and what the possible implications will be. Sometimes the line isn't so obvious, though. If we're designing a library or framework that isn't specific to one inconsequential outcome, it's no longer obvious which policy makes more sense.
The author didn't go into why companies do this (ignoring or misreading test results). Putting lack of understanding aside, my anecdotal experience from the time I worked as a data scientist boils down to a few major reasons:
- Wanting to be right. Being a founder requires high self-confidence, that feeling of "I know I'm right". But feeling right doesn't make one right, and there's plenty of evidence around that people will ignore evidence against their beliefs, even rationalize the denial (and yes, the irony of that statement is not lost on me);
- Pressure to show work: doing the umpteenth UI redesign is better than just saying "it's irrelevant" in your performance evaluation. If the result is inconclusive, the harm is smaller than not having anything to show - you are stalling the conclusion that your work is irrelevant by doing whatever. So you keep on pushing them and reframing the results into some BS interpretation just to get some more time.
Another thing that is not discussed enough is what all these inconclusive results would mean if properly interpreted. A long sequence of inconclusive UI redesign experiments should trigger a hypothesis like "does the UI matter"? But again, those are existentially threatening questions for the people in the best position to come up with them. If any company out there were serious about being data-driven and scientific, they'd require tests everywhere, have external controls on quality and rigour of those and use them to make strategic decisions on where they invest and divest. At the very least, take them as a serious part of their strategy input.
I'm not saying you can do everything based on tests, nor that you should - there are bets on the future, hypothesis making on new scenarios and things that are just too costly, ethically or physically impossible to test. But consistently testing and analysing test results could save a lot of work and money.
True, but it usually costs money to fix it. I think the themes of "this only matters if lives are on the line" or "it's too rigorous" are straw-men.
We have limited resources -- time, money, people. We'd like to avoid deploying those resources badly. Statistical inference can be one way to give us more information so we avoid using our resources badly, but as you note, statistical inference also has costs: we have to spend resources to get the data we need to do the inference, plus other costs. We can estimate the costs of getting sufficient data using sample size estimation methods. For go/no-go decision-making, if the cost of getting the decision wrong isn't something like at least 10x the cost of doing the statistical inference, I don't think it's worth doing the inference. It may be worth doing the inference for _other_ reasons, but those reasons are out of scope.
As an example, a common use of statistical inference in medical research is to compare the efficacy of a treatment with a placebo. Some of the motivation is to decide whether to invest more resources in developing the treatment, not because people will die if they get a false positive stating that the treatment is effective when it isn't.
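As a rough illustration of that sample-size arithmetic, a sketch assuming statsmodels, with a made-up baseline rate, lift, and power target:

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    # Hypothetical scenario: baseline conversion of 10%, and we only care about
    # detecting a lift to 11% (a 10% relative improvement).
    effect_size = proportion_effectsize(0.11, 0.10)

    n_per_arm = NormalIndPower().solve_power(
        effect_size=effect_size,
        alpha=0.05,   # false positive rate
        power=0.80,   # chance of detecting the lift if it is real
        ratio=1.0,    # equal traffic split
        alternative="two-sided",
    )
    # Several thousand users per arm for this scenario; compare that number
    # (and the weeks of traffic it implies) to the cost of just deciding.
    print(f"~{n_per_arm:,.0f} users per arm")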
> A lot of companies are, arguably, _too rigorous_ when it comes to testing.
My experience in industry has been the opposite. Companies like the idea of data-driven decision-making, but then they discover pain points. They should have some idea of how much of a change they're looking to detect (i.e., an effect size). They should estimate how much data they're likely to need to run their tests (i.e., sample size estimation). They have to consider other issues like model misfit, calibration, multiple-testing corrections, and so on. Then they also have to rig up the infra to be able to _do_ the testing, collect the data, analyze the results, and communicate the results to their internal stakeholders. These pain points are why companies like Eppo and StatSig exist -- A/B testing ends up being more high-touch than developers expect.
Messing up any one of these issues can yield "flaky tests," which developers hate. Failing to gather a sufficiently large sample size for a given effect size is a pretty common failure mode.
> But to "maintain rigor," we waited 6 weeks before turning it... and the final numbers were virtually the same as the 48 hour numbers.
It's difficult to tell precisely what you mean by "maintain rigor" here. The only context I can gather is that whatever procedure you were using needed more data in order to satisfy the preconditions of the test needed for the nominal design criteria of the test -- usually, its nominal false positive rate. I don't think this is an issue of rigor -- it's an issue of statistical modeling and correctness.
Sometimes, it's possible to use different methods that may require less data at the cost of more (or different) modeling assumptions. Failing to satisfy the assumptions of a test can increase its false positive rate. Whether that matters is really up to you.
> I do like their proposal for "peeking" and subsequent testing.
What the post is suggesting is not a proposal, but a standard class of frequentist statistical inference methods called sequential testing. Daniël Lakens has a good online textbook (https://lakens.github.io/statistical_inferences/) that briefly discusses these methods in Chapter 10 and provides further references.
> We're shipping software. We can change things if we get them wrong.
That's usually true -- as long as you have the resources needed to make those changes, and are willing to spend them that way.
> IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals.
While I don't disagree with the sentiment, I think you're conflating rigor with correctness here.
> If its goals are "stat sig on every test", then sure, treat it like someone might die if you're wrong.
I think that's a false equivalence. Even the American Statistical Association has issued a statement on p-values (see https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf) that includes "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."
> But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
If those are your goals, just ship it; I don't think it makes sense to justify the effort to test in this situation, especially if, as you argue, it's financially feasible to roll back the change or pivot if it doesn't work.
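On the sequential-testing point above, a self-contained A/A simulation (numpy/scipy only; every number is made up) of why unplanned peeking needs correcting: with no real effect, repeatedly checking a fixed 0.05 threshold rejects far more than 5% of the time.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(42)

    def aa_test_with_peeks(n_total=9000, n_peeks=9, alpha=0.05):
        """One A/A experiment (no true difference): report whether any of the
        evenly spaced interim looks crosses the naive 0.05 threshold."""
        a = rng.normal(0, 1, n_total)
        b = rng.normal(0, 1, n_total)  # same distribution: any "win" is noise
        checkpoints = np.linspace(n_total // n_peeks, n_total, n_peeks, dtype=int)
        for n in checkpoints:
            diff = b[:n].mean() - a[:n].mean()
            se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
            if 2 * norm.sf(abs(diff / se)) < alpha:
                return True  # we'd have stopped here and declared a winner
        return False

    false_positives = sum(aa_test_with_peeks() for _ in range(1000))
    # Prints well above 5% (roughly in the 15-25% range), which is what
    # sequential methods fix by adjusting the boundary at each look.
    print(f"False positive rate with 9 peeks: {false_positives / 1000:.1%}")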
> You plan ship the winner if the p-value for one of the layout choices falls below the conventional threshold of 0.05.
What kind of comparison between the results for the four options makes each of them a likely winner? They all rank very well in whatever metric is being used! Or maybe they are being compared with a fifth - much worse - alternative.
Startups in general have to make decisions with high signal; thinking that 100 improvements of 1% at p=0.05 will actually compound in an environment with so much noise is delusion.
I’d say doing this kind of silliness in a startup is just ceremonial, helpful long term if people feel they are doing a good job optimizing a compounding metric, even though it never materializes.
This might be your hypothesis but this isn't the hypothesis that the p-value is related to.
>Setting a p-value threshold of 0.05 is equivalent to saying: "I’m willing to accept a 5% chance of shipping something that only looked good by chance."
P-values don't provide any information about "chance occurrence" but rather they test the probability of observing a particular outcome assuming a particular state of the world (i.e. the null hypothesis).
> Bonferonni
Besides being a very aggressive (i.e. conservative) correction, I would imagine that in industry, just as in science, the motivation to observe results will mean it won't be used. There are other, way more reasonable corrections.
> Avoid digging through metrics post hoc
The reasonable solution is not to ignore the results that you found but to interpret them appropriately. Maybe the result actually does signal improved retention - alternatively, maybe it is noise. Treat it as exploratory data. If that improved retention was real, is this important? Important enough to appropriately retest for?
There are several approaches you can take to reduce that source of error:
Quarterly alpha ledger
Decide how much total risk you want this quarter (say 10 %). Divide the remaining α by the number of experiments left and make that the threshold for the next launch. Forces the “is this button-color test worth 3 % of our credibility?” conversation. More info: “Sequential Testing in Practice: Why Peeking Is a Problem and How to Fix It” (https://medium.com/@aisagescribe/sequential-testing-in-pract...).
Benjamini–Hochberg (BH) for metric sprawl
Once you watch a dozen KPIs, Bonferroni buries real lifts. BH ranks all the p-values at the end, then sets the cut so that, say, only 5% of declared winners are false positives. You keep power, and you can run the same BH step on the primary metric from every experiment each quarter to catch lucky launches (a quick sketch of the BH step is at the end of this comment). More info: “Controlling False Discoveries: A Guide to BH Correction in Experimentation” (https://www.statsig.com/perspectives/controlling-false-disco...).
Bayesian shrinkage + 5% “ghost” control for big fleets
FAANG-scale labs run hundreds of tests and care about 0.1% lifts. They pool everything in a simple hierarchical model; noisy effects get pulled toward the global mean, so only sturdy gains stay above water. Before launch, they sanity-check against a small slice of traffic that never saw any test. Cuts winner’s-curse inflation by ~30%. Clear explainer: “How We Avoid A/B Testing Errors with Shrinkage” (https://eng.wealthfront.com/2015/10/29/how-we-avoid-ab-testi...) and (https://www.statsig.com/perspectives/informed-bayesian-ab-te...)
<10 tests a quarter: alpha ledger or yolo; dozens of tests and KPIs: BH; hundreds of live tests: shrinkage + ghost control.
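A minimal sketch of the BH step from the second item above, assuming statsmodels and a made-up pile of end-of-quarter p-values:

    from statsmodels.stats.multitest import multipletests

    # Hypothetical p-values from this quarter's experiments (primary metric only).
    p_values = [0.001, 0.008, 0.012, 0.049, 0.051, 0.20, 0.34, 0.47, 0.62, 0.91]

    # Benjamini-Hochberg keeps the false discovery rate (the share of declared
    # winners that are flukes) at ~5%, rather than Bonferroni's harsher
    # family-wise control.
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

    for p, p_adj, win in zip(p_values, p_adjusted, reject):
        print(f"raw p={p:.3f}  BH-adjusted p={p_adj:.3f}  declared winner: {win}")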
Yes, but that's not really the big deal that you're making it out to be, since it's (usually) not an all-or-nothing thing. Usually, the wins are additive. The chance of each winner being genuine is still 95% (assuming no p-hacking), and so the expected number of wins out of those 5 will be 0.95 × 5 = 4.75 wins (by linearity of expectation), which is a solid win rate.
> The chance of each winner being genuine is still 95%
Not really. It depends on what’s the unknown (but fixed in a frequentist analysis like this one) difference between the options - or absence thereof.
If there is no real difference it’s 100% noise and each winner is genuine with probability 0%. If the difference is huge the first number is close to 0% and the second number is close to 100%.
I'll add one from my experience as a PM dealing with very "testy" peers in early stage startups: don't do any of this if you don't have {enough} users -- rely on intuition and focus on the core product.
If your core product isn't any good, A/B testing seems like rearranging the deck chairs on the Titanic.
There are certainly scenarios where more rigor is appropriate, but usually those come from trying to figure out why you’re seeing a certain effect and how that should affect your overall company strategy.
My advice for startups is to run lots of experiments, do bad statistics, and know that you’re going to have some false positives so that you don’t take every result as gospel.
> After 9 peeks, the probability that at least one p-value dips below 0.05 is: 1 − (1 − 0.05)^9 = 37.
There should be a percent sign after that 37. (Probabilities cannot be greater than one.)
If you don't understand what a p-value is, you shouldn't use it. If you do understand p-values, you're probably already moving away from them. The Bayesian approach makes much more sense. I highly recommend David MacKay's "Information Theory, Inference and Learning Algorithms," as well as "Bayesian Methods for Hackers": https://github.com/CamDavidsonPilon/Probabilistic-Programmin....
At the same time, startups aren't science experiments. Our goal is not necessarily to prove conclusively whether something is statistically "better". Rather, our goal is to solve real-world problems.
Suppose we run an A/B test, and its result indicates B is better according to whatever statistical test we've used. In this scenario, we will likely select B—frankly, regardless of whether B is truly better or merely indistinguishable from A.
However, what truly matters in practice are the metrics we choose to measure. Picking the wrong metric can lead to incorrect conclusions. For example, suppose our data shows that users spend, on average, two more seconds on our site with the new design (with p < 0.001 or whatever). That might be a positive result—or it could simply mean the new design causes slower loading or more confusion, frustrating users instead of benefiting them.
Of course calculating the threshold properly needs to be done as well... but a rand() is so quick and simple to add as a napkin check, and would catch how silly the first analysis is.
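For anyone who wants to see the napkin check spelled out, a tiny sketch (numpy/scipy; all numbers invented): feed the exact same analysis a coin-flip split with no real change and see how often it finds a "winner".

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(7)

    def null_ab_test(n_users=2000, base_rate=0.10):
        """Split users at random and give BOTH groups the same product.
        Any 'significant' difference is pure noise."""
        group = rng.integers(0, 2, n_users)          # the rand() assignment
        converted = rng.random(n_users) < base_rate  # same rate everywhere
        table = [
            [np.sum(converted & (group == 0)), np.sum(~converted & (group == 0))],
            [np.sum(converted & (group == 1)), np.sum(~converted & (group == 1))],
        ]
        _, p, _, _ = chi2_contingency(table, correction=False)
        return p < 0.05

    wins = sum(null_ab_test() for _ in range(1000))
    print(f"'Winners' found with no real change: {wins / 1000:.1%}")  # ~5%, by construction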
This would be Series B or later right? I don't really feel like it's a core startup behavior.
I guess you got something: Users are not sensitive to these changes, or that any effect is too small to detect with your current sample size/test setup.
In a startup scenario, I'd quickly move on and possibly ship all developed options if good enough.
Also, running A/B tests might not be the most appropriate method in such a scenario. What about user-centric UX research methods?
When choosing one of several A/B test options, a hypothesis test is not needed to validate the choice.
Did they generate this blog post with AI? That math be hallucinating. Don’t need a calculator to see that.
Hmm, (1 - (1-0.95))^9 also = 63%. No idea why 64, closest I can see is 1-(0.95)^20 or 1-(1-0.05)^20 = 64%.
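A quick sanity check of those expressions (plain Python; no claim about which figure the article intended):

    # Probability of at least one p < 0.05 across k independent looks,
    # if each look alone has a 5% false positive rate.
    for k in (9, 20):
        print(f"{k} looks: 1 - 0.95**{k} = {1 - 0.95**k:.1%}")
    # 9 looks  -> ~37%; 20 looks -> ~64%.
    # 0.95**9 on its own is ~63%, i.e. the chance of NO false positive in 9 looks.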
https://pmc.ncbi.nlm.nih.gov/articles/PMC5187603/
Thanks again
https://experimentguide.com/
But how does a simple act such as "pre-registration" change anything? It's not as if observing another metric that already existed changes anything about what you experimented with.
It's basically a variation on the multiple comparisons, but sneakier: it's easy to spend an hour going through data and, over that time, test dozens of different hypotheses. At that point, whatever p-value you'd compute for a single comparison isn't relevant, because after that many comparisons you'd expect at least one to have uncorrected p = 0.05 by random chance.
The TLDR as I understand it is:
All data has patterns. If you look hard enough, you will find something.
How do you tell the difference between random variance and an actual pattern?
It’s simple and rigorously correct to only search the data for a single metric; other methods, e.g. the Bonferroni correction (divide the significance threshold α by the number of comparisons k), exist, but are controversial [1].
Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful.
If you see a pattern in another metric, run another experiment.
[1] - https://pmc.ncbi.nlm.nih.gov/articles/PMC1112991/
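To make the "look hard enough and you'll find something" point concrete, a small simulation under an assumed setup of 20 unrelated null metrics (numpy/scipy; all numbers invented):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    n_metrics, n_users, n_sims, alpha = 20, 1000, 500, 0.05

    hits_raw = hits_bonferroni = 0
    for _ in range(n_sims):
        # 20 unrelated metrics, identical distributions in both groups.
        a = rng.normal(0, 1, (n_metrics, n_users))
        b = rng.normal(0, 1, (n_metrics, n_users))
        p = ttest_ind(a, b, axis=1).pvalue
        hits_raw += np.any(p < alpha)                     # "we found *something*!"
        hits_bonferroni += np.any(p < alpha / n_metrics)  # corrected threshold

    print(f"at least one 'finding', raw 0.05 threshold: {hits_raw / n_sims:.0%}")         # ~64%
    print(f"at least one 'finding', Bonferroni:         {hits_bonferroni / n_sims:.0%}")  # ~5%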
https://en.wikipedia.org/wiki/Sequential_probability_ratio_t...
And then, if there isn't... it means you can just ship whatever you want! The real root cause of p-hacking is glossed over in the article: "Nobody likes arriving empty-handed to leadership meetings." This is the corporate equivalent of "no one will publish a null result", and is just as harmful here. The statistical techniques described are fine, but there's not necessarily a reason to fortify your stats against multiple comparisons rather than just accepting a null result.
And you can, because of the other thing I kept thinking when reading this: you have to ship something. There isn't really a "control" condition if you're talking about building a website from scratch. So whether the result is null doesn't really matter. It's not like comparing different medicines or fertilizers or something where if none of them work you just do nothing; there is no "do nothing" option in this situation. So why not just take a simple effect measurement (e.g., proportion who signed up) and pick the layout that performs best? If that result is statistically significant, great, it means you picked the best one, and if it's not, it just means it doesn't matter which one you pick, so the one you picked is still fine. (And if you have an existing design and you're trying to see if a new one will be better, the null result just means "there's no reason to switch", which means the existing design is also fine.)
https://xkcd.com/882/
There's a far simpler method that covers every case: permutation tests. https://bookdown.org/ybrandvain/Applied-Biostats/perm1.html
You shuffle the data. Say you want to know if viewing time is affected by color. Literally randomly shuffle viewing time and color. Then, look at that distribution. Is the data that you observed significant?
As long as you shuffle everything related to your experiment and you don't double dip into the data you're going to get things right.
This also has the big advantage that it doesn't overcorrect like traditional methods which apply such strict corrections that eventually it's impossible to get significant results.
This post hasn't even begun scratching the surface on what can go wrong with traditional tests. Just don't use them.
This has nothing to do with speed or rigor. Permutation tests are much simpler to run and faster to analyze. Sadly we keep teaching crappy statistics to our students.
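A minimal sketch of that shuffle in plain numpy (the metric, the colors, and every number are invented):

    import numpy as np

    rng = np.random.default_rng(3)

    def permutation_test(metric_a, metric_b, n_shuffles=10_000):
        """Two-sided permutation test on the difference in means: shuffle the
        group labels and count how often random labelings produce a gap at
        least as large as the one observed."""
        observed = metric_b.mean() - metric_a.mean()
        pooled = np.concatenate([metric_a, metric_b])
        n_a = len(metric_a)
        count = 0
        for _ in range(n_shuffles):
            rng.shuffle(pooled)  # break any real association with the labels
            diff = pooled[n_a:].mean() - pooled[:n_a].mean()
            if abs(diff) >= abs(observed):
                count += 1
        return observed, count / n_shuffles

    # Hypothetical viewing times (seconds) under two button colors.
    blue = rng.exponential(30, 500)
    green = rng.exponential(33, 500)
    observed, p = permutation_test(blue, green)
    print(f"observed difference: {observed:.1f}s, permutation p-value: {p:.3f}")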
I'm also curious why you say those "cover every case", because permutation tests tend to be underpowered, and also tend to be cumbersome when it comes to constructing confidence intervals of statistics, compared to something like the bootstrap.
Don't get me wrong -- I like permutation tests, especially for their versatility, but as one tool out of a bunch of methods.
The business reality is, you aren't Google. You can't collect a hundred million data points for each experiment that you run so that you can reliably pick out 0.1% effects. Most experiments will have a much shorter window than any analyst wants them to, and will have far too few users, with no option to let them run longer. You still have to make a damned decision, now, and move on to the next feature (which will also be tested in a heavily underpowered manner).
Posts like this say that you should be really, REALLY careful about this, and apply Bonferroni corrections and make sure you're not "peeking" (or if you do peek, apply corrections that are even more conservative), preregister, etc. All the math is fine, sure. But if you take this very seriously and are in the situation that most startups are in where the data is extremely thin and you need to move extremely fast, the end result is that you should reject almost every experiment (and if you're leaning on tests, every feature). That's the "correct" decision, academically, because most features lie in the sub-5% impact range on almost any metric you care about, and with a small number of users you'll never have enough power to pick out effects that small (typically you'd want maybe 100k users, depending on the metric you're looking at, and YOU probably have a fraction of that many users).
But obviously the right move is not to just never change the product because you can't prove that the changes are good - that's effectively applying a very strong prior in favor of the control group, and that's problematic. Nor should you just roll out whatever crap your product people throw at the wall: while there is a slight bias in most experiments in favor of the variant, it's very slight, so your feature designers are probably building harmful stuff about half the time. You should apply some filter to make sure they're helping the product and not just doing a random walk through design space.
The best simple strategy in a real world where most effect sizes are small and you never have the option to gather more data really is to do the dumb thing: run experiments for as long as you can, pick whichever variant seems like it's winning, rinse and repeat.
Yes, you're going to be picking the wrong variant way more often than your analysts would prefer, but that's way better than never changing the product or holding out for the very few hugely impactful changes that you are properly powered for. On average, over the long run, blindly picking the bigger number will stack small changes, and while a lot of those will turn out to be negative, your testing will bias somewhat in favor of positive ones and add up over time. And this strategy will provably beat one that does Proper Statistics and demands 95% confidence or whatever equivalent Bayesian criteria you use, because it leaves room to accept the small improvements that make up the vast majority of feature space.
There's an equivalent and perhaps simpler way to justify this, which is to throw out the group labels: if we didn't know which one was the control and we had to pick which option was better, then quite obviously, regardless of how much data we have, we just pick the one that shows better results in the sample we have. Including if there's just a single user in each group! In an early product, this is TOTALLY REASONABLE, because your current product sucks, and you have no reason to think that the way it is should not be messed with. Late lifecycle products probably have some Chesterton's fence stuff going on, so maybe there's more of an argument to privilege the control, but those types of products should have enough users to run properly powered tests.
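Here is a rough, self-contained simulation of that trade-off under one made-up prior over effect sizes (all numbers are assumptions, not data): true lifts are mostly tiny, samples are small, and we compare "ship whichever number is bigger" against "ship only at p < 0.05".

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(11)
    n_experiments, n_per_arm, base_rate = 500, 2000, 0.10

    # Assumed world: most changes move conversion by a fraction of a percentage
    # point, centered slightly above zero (designers beat a coin flip, barely).
    true_lifts = rng.normal(loc=0.0005, scale=0.003, size=n_experiments)

    shipped_pick_bigger = shipped_require_sig = 0.0
    for lift in true_lifts:
        a = rng.binomial(n_per_arm, base_rate) / n_per_arm
        b = rng.binomial(n_per_arm, base_rate + lift) / n_per_arm
        se = np.sqrt(a * (1 - a) / n_per_arm + b * (1 - b) / n_per_arm)
        p = 2 * norm.sf(abs(b - a) / se)

        if b > a:                   # policy 1: ship whichever looked bigger
            shipped_pick_bigger += lift
        if b > a and p < 0.05:      # policy 2: ship only with "significance"
            shipped_require_sig += lift

    # With these assumptions the dumb policy accumulates several times more
    # true lift, because it doesn't throw away the many small real wins.
    print(f"true lift shipped, pick the bigger number: {shipped_pick_bigger * 100:+.2f} points")
    print(f"true lift shipped, require p < 0.05:       {shipped_require_sig * 100:+.2f} points")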