From Data to Decisions: Experiment Design

Here’s a misconception that costs teams months of progress: running an experiment means you’re being scientific and data-driven. In reality, most experiments I review are flawed from the start. Not because the implementation is wrong, but because the design makes any result meaningless.

I’ve reviewed dozens of experiment proposals over the years. The pattern is depressingly consistent: teams jump straight to “let’s AB test this” without establishing what they’re actually trying to learn. They run experiments that can’t possibly answer their questions, then make major decisions based on ambiguous results.

The truth is that running experiments isn’t the hard part. It takes effort, but most teams have the technical capability. The hard part is designing experiments that actually inform decisions.

Core Process: Setting Up for Success

Start With a Falsifiable Hypothesis

“Let’s test whether users like the new design” isn’t a hypothesis. It’s a vague hope.

A proper hypothesis is falsifiable and specific: “Users who see the streamlined onboarding (treatment) will complete signup at a rate at least 10% higher than users who see the current onboarding (control), measured over two weeks with a minimum of 5,000 users per variant.”

Notice what that includes:

Specific treatment and control conditions
Measurable outcome (completion rate)
Minimum detectable effect (10%)
Time frame (two weeks)
Sample size consideration (5,000 users)

When Booking.com runs experiments, their hypotheses are brutally specific (I know, I failed the interview). They’re not testing “does urgency messaging work?” They’re testing “does showing ‘only 2 rooms left’ for properties with fewer than 3 rooms available increase booking rate by at least 2% without decreasing customer satisfaction scores?”

That specificity forces clarity. It prevents the slippery interpretation that turns every experiment into a “success” regardless of results.
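One way to make that specificity hard to wriggle out of is to write the hypothesis down as data before launch. The sketch below is illustrative only: the `Hypothesis` class and its field names are hypothetical, and it assumes the “10% higher” in the onboarding example means a relative lift. It also checks only the pre-registered threshold, not statistical significance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A pre-registered hypothesis (hypothetical structure, for illustration)."""
    treatment: str
    control: str
    metric: str
    min_relative_lift: float      # 0.10 means "at least 10% higher" (assumed relative)
    duration_days: int
    min_users_per_variant: int

    def is_supported(self, control_rate, treatment_rate, users_per_variant, days_run):
        """Return True only if the criteria agreed before launch are met."""
        if users_per_variant < self.min_users_per_variant or days_run < self.duration_days:
            # Refuse to judge an experiment that has not run as planned.
            raise ValueError("experiment has not reached its planned size or duration")
        observed_lift = (treatment_rate - control_rate) / control_rate
        return observed_lift >= self.min_relative_lift


onboarding = Hypothesis(
    treatment="streamlined onboarding",
    control="current onboarding",
    metric="signup completion rate",
    min_relative_lift=0.10,
    duration_days=14,
    min_users_per_variant=5000,
)
```

Because the record is frozen, nobody can quietly lower the bar after the results come in.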

The Pre-Mortem: What Could Go Wrong?

Before launching any experiment, I like to run a pre-mortem. I assume the experiment has failed spectacularly and work backwards: what went wrong?

This exercise surfaces assumptions I didn’t realise I was making. Common failure modes:

Novelty effects - Users engage more with anything new, regardless of whether it’s actually better. The solution: run experiments long enough for novelty to wear off. Two weeks minimum for most product changes, longer for features users don’t engage with daily.

Selection bias - Your treatment and control groups aren’t actually comparable. This happens when randomisation isn’t truly random, or when users can self-select into variants.

Metric interference - Your experiment affects metrics you didn’t intend to measure. Optimising for clicks might tank engagement downstream. Improving conversion might attract the wrong customers who churn quickly. Always define your guardrail metrics upfront.

Insufficient power - Your sample size is too small to detect the effect you care about. This is the most common mistake. Teams run experiments for a week, see no significant difference, and conclude the change doesn’t matter. Reality: their experiment couldn’t have detected anything less than a 50% improvement.

The pre-mortem doesn’t prevent all problems, but it dramatically reduces unpleasant surprises two weeks into your test.
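The insufficient-power failure mode is the easiest one to check before launch. Here is a rough sketch using the standard normal-approximation formula for comparing two proportions. The function name, the default 5% significance level and the 80% power are my assumptions, not fixed rules.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Example: a hypothetical 20% baseline and a 10% relative lift.
users_needed = required_sample_size(baseline_rate=0.20, relative_lift=0.10)
```

The answer depends heavily on the baseline rate and on the size of the effect: halving the lift you want to detect roughly quadruples the sample you need. Run the numbers before you commit to a timeline.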

Instrumentation Before Implementation

Here’s a pattern I see: a team builds a feature, launches an experiment, then realises they’re not tracking the right metrics. They scramble to add tracking, but by then the first week of data is useless.

Instrument first. Before you write a line of production code, set up tracking in your development environment and verify it’s capturing what you need.

For every experiment, document:

What events you’re tracking
What properties each event includes
How events relate to your hypothesis
What your analysis query will look like
Edge cases that might affect measurement
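The documentation above can double as an executable check in your development environment. This is a minimal sketch: the event names (`signup_started`, `signup_completed`), the property names and the variant labels are all hypothetical placeholders for whatever your own tracking plan specifies.

```python
# Hypothetical tracking plan: event name -> properties every event must carry.
REQUIRED_PROPERTIES = {
    "signup_started": {"user_id", "variant", "timestamp"},
    "signup_completed": {"user_id", "variant", "timestamp"},
}
VALID_VARIANTS = {"control", "treatment"}


def validate_event(event):
    """Return a list of problems with one logged event (empty list means OK)."""
    name = event.get("name")
    if name not in REQUIRED_PROPERTIES:
        return [f"unknown event name: {name!r}"]
    properties = event.get("properties", {})
    missing = sorted(REQUIRED_PROPERTIES[name] - set(properties))
    problems = [f"{name}: missing property {key!r}" for key in missing]
    # An event that cannot be tied to a variant is useless for analysis.
    if "variant" in properties and properties["variant"] not in VALID_VARIANTS:
        problems.append(f"{name}: unexpected variant {properties['variant']!r}")
    return problems
```

Running a check like this against a handful of events from a test session catches missing properties before launch rather than a week into the experiment.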

Advanced Techniques: Beyond Simple AB Tests

Sequential Testing for Faster Decisions

Traditional AB testing requires you to decide your sample size upfront and wait until you reach it. But what if you could make a decision as soon as you have enough evidence, without waiting for an arbitrary endpoint?

Sequential testing lets you check your results continuously and stop early if you detect a clear winner. This isn’t peeking at results and calling it early when you see what you want (that’s p-hacking). It’s a mathematically rigorous approach that adjusts for multiple looks at the data.
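To make “adjusts for multiple looks” concrete, here is a deliberately conservative sketch: fix the number of interim checks in advance and split the significance level evenly across them (a Bonferroni correction). This is one simple option rather than the standard one. Production systems more often use alpha-spending boundaries or always-valid p-values, which lose less power and allow truly continuous monitoring. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def first_significant_look(looks, alpha=0.05, planned_looks=5):
    """Return the 1-based look at which to stop, or None if never significant.

    `looks` holds cumulative (conv_a, n_a, conv_b, n_b) counts at each
    pre-planned interim check. Each look is tested at alpha / planned_looks,
    which keeps the overall false-positive rate at or below alpha.
    """
    threshold = alpha / planned_looks
    for look_number, counts in enumerate(looks[:planned_looks], start=1):
        if two_proportion_p_value(*counts) < threshold:
            return look_number
    return None
```

The important property is that the stopping rule is fixed before the data arrives. Checking at will against the usual 0.05 threshold is exactly the peeking problem described above.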

Multi-Armed Bandits for Continuous Optimisation

AB tests are great for learning, but sometimes you want to optimise while learning. That’s where multi-armed bandits come in.

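Traditional AB testing splits traffic 50-50 and maintains that split throughout. If variant A is clearly winning, you’re still sending half your users to the inferior variant B until the test concludes. A bandit algorithm instead shifts traffic towards the better-performing variant as evidence accumulates, so fewer users see the weaker option while you’re still learning.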
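As a minimal sketch of a bandit, here is Thompson sampling with Beta posteriors over conversion rates. It is one of several bandit algorithms, and the class name, the `simulate` helper and the conversion rates in the example are all made up for illustration.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of variants."""

    def __init__(self, variants, seed=None):
        self._rng = random.Random(seed)
        # Every variant starts at Beta(1, 1), i.e. a uniform prior.
        self.successes = {v: 0 for v in variants}
        self.failures = {v: 0 for v in variants}

    def choose(self):
        """Draw a plausible rate for each variant and show the highest draw."""
        draws = {
            v: self._rng.betavariate(self.successes[v] + 1, self.failures[v] + 1)
            for v in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, variant, converted):
        """Record whether the user who saw `variant` converted."""
        if converted:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1


def simulate(true_rates, rounds, seed=0):
    """Simulate traffic allocation given (normally unknown) true rates."""
    rng = random.Random(seed)
    sampler = ThompsonSampler(true_rates, seed=seed + 1)
    shown = {v: 0 for v in true_rates}
    for _ in range(rounds):
        variant = sampler.choose()
        shown[variant] += 1
        sampler.update(variant, rng.random() < true_rates[variant])
    return shown


allocation = simulate({"A": 0.10, "B": 0.05}, rounds=5000)
```

In a simulation like this, most traffic ends up on the stronger variant. The trade-off is that unequal, shifting allocation makes it harder to get a clean estimate of the effect size, which is why I’d still reach for an AB test when the goal is learning.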

Stratification for Heterogeneous Effects

Not all users respond the same way to changes. A feature that delights power users might confuse newcomers. A price increase that doesn’t affect enterprise customers might kill SMB conversion.

Stratified analysis lets you understand these differences. Instead of reporting a single aggregate result, you analyse how your experiment performed across user segments.
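A bare-bones version of that analysis might look like the following. It assumes each row is a `(segment, variant, converted)` tuple with variants labelled `control` and `treatment`. Those labels and the function name are my own placeholders.

```python
from collections import defaultdict


def lift_by_segment(rows):
    """Relative lift of treatment over control, computed per segment.

    `rows` is an iterable of (segment, variant, converted) tuples.
    Returns {segment: lift}, with None where the lift cannot be computed.
    """
    # Per segment and variant: [conversions, users].
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for segment, variant, converted in rows:
        tally = counts[segment][variant]
        tally[0] += int(converted)
        tally[1] += 1

    results = {}
    for segment, by_variant in counts.items():
        control_conv, control_n = by_variant["control"]
        treat_conv, treat_n = by_variant["treatment"]
        if control_n == 0 or treat_n == 0 or control_conv == 0:
            results[segment] = None  # not enough data for a relative lift
            continue
        control_rate = control_conv / control_n
        treatment_rate = treat_conv / treat_n
        results[segment] = (treatment_rate - control_rate) / control_rate
    return results
```

Treat per-segment numbers with care: each segment has a smaller sample than the whole experiment, and the more segments you slice, the more likely one of them shows a large lift by chance. Decide which segments matter before you look.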

Getting Started: Your First Well-Designed Experiment

The Prerequisites Checklist

Before you run your first rigorous experiment, verify you have:

Reliable tracking infrastructure - Events are logged consistently, you can tie user actions to experiment variants, you can query your data without a three-day delay.

Sufficient traffic - Be realistic about your sample size. If you have 1,000 users per week and need 10,000 per variant to detect your target effect, a two-variant test is a 20-week experiment. Maybe rethink the hypothesis.

Stakeholder alignment - Everyone agrees on what metrics matter and what would constitute a successful outcome. Surprises after the experiment ends are expensive.

Technical ability to implement variants - Your engineering team can reliably deliver different experiences to different users without leaking between groups or creating a broken experience.

Time to run properly - You won’t be forced to conclude early because of a deadline. Cutting an experiment short because you need to ship invalidates everything.
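On the implementation item in the checklist above, a common way to keep users from leaking between groups is to derive the variant deterministically from a hash of the user ID, so the same user always gets the same experience. A minimal sketch, with a hypothetical function name and experiment key:

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment.

    Including the experiment name in the hash means a user's assignment in
    one experiment is independent of their assignment in another.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same user always lands in the same group for this experiment.
variant = assign_variant("user-123", "streamlined-onboarding")
```

No assignment table is needed, and any service that knows the user ID and experiment name will compute the same answer. It is still worth checking after launch that the observed split matches the intended one.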

The Minimal Viable Experiment

Your first experiment doesn’t need to be perfect. It needs to be rigorous enough to inform a decision.

Start with a simple change, clear hypothesis, and straightforward measurement. Don’t try to test three variables simultaneously. Don’t measure fifteen different outcomes. Don’t stratify across ten user segments.

Test one thing. Measure one primary metric. Maybe two secondary metrics if you must. Run it until you reach the sample size you planned upfront, not until the result happens to look significant. Analyse the results honestly.

I’ve seen teams build elaborate experimentation frameworks before they’ve successfully run a single clean experiment. Build the culture of rigorous testing first. The sophisticated infrastructure can come later.

Key Takeaways

Running experiments that actually inform decisions requires:

Falsifiable, specific hypotheses - Vague questions lead to ambiguous results. Define your treatment, control, metrics, and success criteria upfront. If you can interpret any result as positive, your experiment is flawed.
Pre-mortem thinking - Identify failure modes before you launch. What assumptions are you making? What could invalidate your results? Surface these early.
Instrumentation before implementation - Set up tracking first, verify it’s working, then build your feature. Scrambling to add analytics mid-experiment wastes time and data.
Match your method to your question - AB tests for learning, bandits for optimisation, stratification for heterogeneous effects. Different techniques serve different purposes.
Start simple and iterate - Don’t build the perfect experimentation framework before running your first experiment. Build the discipline of rigorous testing first.

Closing Thoughts

The difference between teams that learn from experiments and teams that just run tests is discipline. The technical implementation is straightforward. The hard part is the rigour: defining clear hypotheses, resisting the urge to peek early, accepting negative results, admitting when your experiment was flawed.

Netflix famously ran an experiment on their recommendation algorithm that showed no statistically significant improvement. They shipped it anyway because the improvement — though not reaching their significance threshold — was consistent across every segment and geography. The rigour of their experimentation gave them confidence to trust a weak signal.

That’s what good experiment design enables: not just binary ship/don’t-ship decisions, but nuanced understanding that informs better judgment.

Start with one experiment. Make it tight. Learn from it. Then do it again. Within six months, you’ll be running experiments that actually drive decisions instead of just validating them.

Have questions or thoughts? Get in touch - I’d love to hear from you!