Better Products Start with Experiment Tracking

Everything you need to know about experiment tracking. Frameworks, examples, and actionable advice.

Piotr Ciechowicz

I ran 17 experiments last year for one of my products. I can tell you the results of exactly two of them.

Not because I didn’t care. Because I had no systematic way to track what I’d tested, what I’d learned, and how those learnings influenced subsequent decisions. I was conducting science without keeping a lab notebook.

Turns out, running experiments is the easy bit. Remembering what you learned is the hard bit.

Why Experiment Tracking Actually Matters

Time for a story: a team ran an onboarding experiment in Q1. Results were inconclusive - small lift in activation, not statistically significant. They moved on.

Q3, the product team redesigned the same onboarding flow. Ran a similar experiment. Similar inconclusive results. Nobody remembered the Q1 test. Nobody thought to check if we’d tested this before.

Q4, customer success suggested we improve onboarding. You can see where this is going.

Three quarters, three teams, essentially the same experiment run three times. Zero cumulative learning. Each team starting from scratch because there was no institutional memory.

The problem wasn’t that they were running experiments. The problem was that they weren’t building knowledge. They were generating data exhaust.

Understanding the Fundamentals

Core Concepts: What Experiment Tracking Actually Is

Experiment tracking isn’t a tool. It’s a practice of systematically recording what you’re testing, why you’re testing it, what you expect to happen, what actually happened, and what you’re doing about it.

Most teams track experiments in whatever’s handy - a Notion page, a Google Sheet, tickets in Jira, comments in their analytics tool. This is like keeping your research notes on napkins and the backs of envelopes. Technically works, practically useless.

Here’s what you actually need to track:

Hypothesis: Not just “let’s try X.” I mean a proper hypothesis with a because clause. “We believe that adding social proof to the pricing page will increase conversion because users report uncertainty about whether established companies use our product.”

We required every experiment ticket to have a hypothesis. First few weeks, everyone complained. Then something interesting happened: the quality of experiments improved dramatically. Because writing a good hypothesis forces you to think about mechanism, not just intervention.

Success criteria defined upfront: Decide what success looks like, numerically, before you run the test. Not “improve conversion.” Something like “increase trial-to-paid conversion from 12% to 14%, measured over 1000 trials, with 95% confidence.”

I’ve seen first-hand a team that ran an experiment, saw a 3% lift, and debated for 45 minutes whether that was good. If you don’t define success criteria beforehand, you’re just doing motivated reasoning with statistics. If you don’t know what the success criteria should be, then, well, you should think longer. And harder.
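To make "defined upfront" a bit more concrete, here is a rough sketch of a pre-test sample-size check for a criterion like the 12% to 14% example. The 80% power figure, the two-sided test, and the function name are my assumptions for illustration, not part of the criterion itself. It uses the standard normal-approximation formula for comparing two proportions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_per_arm(baseline, target, confidence=0.95, power=0.80):
    """Approximate users needed in EACH arm to detect baseline -> target.

    Normal-approximation formula for a two-sided, two-proportion test.
    power=0.80 is an assumed convention, not something the criterion states.
    """
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for confidence
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    pooled = (baseline + target) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_beta * sqrt(baseline * (1 - baseline) + target * (1 - target))
    ) ** 2
    return ceil(numerator / (target - baseline) ** 2)


n = required_sample_per_arm(0.12, 0.14)
print(f"Trials needed per arm: {n}")
```

Under these assumptions the answer comes out at several thousand trials per arm, which is a good reason to run a check like this before you write a sample size into the success criteria.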

What actually happened: Raw results, not interpretation. Conversion went from 12% to 12.4%, p-value 0.23. Sometimes you want to add context, but keep the facts separate from the story.

What you concluded and what you’re doing next: This is where most teams stop. They record results but not conclusions. Six months later, nobody remembers whether 12.4% vs 12% meant “promising direction, need more data” or “dead end, try something else.”

At an adtech company I worked with, we added a mandatory “so what” field. One sentence: what does this mean for our product strategy? Sounds simple. Changed everything about how useful our experiment log became.
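If your log lives somewhere scriptable, the fields above could be captured in a structure like the one below. Treat it as a sketch: the class name, the field names, and the `missing_fields` helper are hypothetical, and the sample values simply reuse the illustrative examples from this section.

```python
from dataclasses import dataclass, fields


@dataclass
class ExperimentRecord:
    hypothesis: str        # "We believe X will cause Y because Z"
    success_criteria: str  # numeric target, written before the test runs
    results: str           # raw facts only, no interpretation
    conclusion: str        # what we decided and what happens next
    so_what: str           # one sentence: what this means for strategy

    def missing_fields(self):
        """Return the names of fields left blank, so gaps are visible."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Illustrative values only, stitched together from the examples in the text.
record = ExperimentRecord(
    hypothesis=(
        "We believe that adding social proof to the pricing page will "
        "increase conversion because users report uncertainty about whether "
        "established companies use our product."
    ),
    success_criteria="Increase conversion from 12% to 14%, 95% confidence",
    results="Conversion went from 12% to 12.4%, p-value 0.23",
    conclusion="",
    so_what="",
)
print(record.missing_fields())  # ['conclusion', 'so_what']
```

The point of the helper is the same as the mandatory field: a record with results but no conclusion is flagged as unfinished rather than quietly accepted.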

Why This Matters for Product Managers

Product management is applied learning. You form beliefs about users and markets, test those beliefs, update your worldview, make better decisions.

Without experiment tracking, you’re learning things once and forgetting them. With experiment tracking, you’re building a knowledge base that compounds.

I can pull up experiments from 18 months ago and see:

  • We tested personalised recommendations three times; never moved the needle
  • Social proof works on marketing pages, doesn’t work in product
  • Users respond to deadline pressure for upgrades but not for feature adoption
  • Our enterprise customers care about security certifications way more than we expected

These aren’t insights from one experiment. They’re patterns across dozens of experiments that only became visible because we tracked systematically.

Common Pitfalls and How to Avoid Them

Mistake One: Tracking Experiments But Not Learnings

Most experiment tracking systems are actually experiment result databases. Lists of tests with outcomes. Useful for “did we test this before?” Not useful for “what do we know about user behaviour?”

The shift: organise by learning, not by test.

At one company, we restructured our tracking. Instead of a chronological list of experiments, we created learning themes:

  • What we know about onboarding
  • What we know about pricing psychology
  • What we know about feature discovery
  • What we know about retention drivers

Each experiment got tagged to one or more themes. When you pulled up “what we know about onboarding,” you saw a synthesis of all related experiments with pattern analysis.
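Here is a minimal sketch of what "organise by learning, not by test" can look like in data terms. The experiment entries and the `experiments_by_theme` name are hypothetical. Only the theme names come from the list above.

```python
from collections import defaultdict

# Hypothetical log entries: (experiment name, themes it is tagged with)
experiments = [
    ("Shorter signup form", ["onboarding"]),
    ("Annual plan discount banner", ["pricing psychology", "retention drivers"]),
    ("Tooltip tour for new users", ["onboarding", "feature discovery"]),
]

# Invert the log: one entry per theme, listing every experiment tagged to it.
experiments_by_theme = defaultdict(list)
for name, themes in experiments:
    for theme in themes:
        experiments_by_theme[theme].append(name)

print(experiments_by_theme["onboarding"])
# ['Shorter signup form', 'Tooltip tour for new users']
```

An experiment can sit under several themes at once, which is what lets a single test contribute to more than one "what we know about X" page.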

Suddenly our experiment log became something people actually referenced when making decisions.

Mistake Two: No Connection Between Experiments and Decisions

Run an experiment. Learn something. Make a decision six weeks later that ignores what you learned. I’ve done this more times than I’d like to admit.

The missing link: explicit decision documentation that references experiment evidence.

Implement a simple rule: any product decision affecting more than 100 users needs to reference supporting evidence in the decision doc. Could be user research, could be experiments, could be usage data. But something empirical.

Forces you to check: have we tested anything related to this? What did we learn? Is this decision consistent with prior evidence?

Sometimes the answer could be “we’re contradicting prior evidence because market conditions changed” or “the prior test wasn’t quite measuring what we thought.” Fine. But it will be an explicit choice, not an oversight.

Mistake Three: Only Tracking Successful Experiments

Teams love recording wins. Experiments that worked, features that moved metrics, tests that validated hypotheses. The learnings from failures get lost.

Failed experiments are often more informative than successful ones. They tell you what doesn’t work, which constrains the solution space for future problems.

Oh man, at one mobile app I ran an experiment adding gamification to our product. Badges, points, the whole works. Hypothesis was that recognition would increase engagement.

Results: no effect on engagement. Slight negative effect on perceived professionalism among users. Killed the feature.

Two quarters later, someone suggested we add achievements to encourage feature adoption. I pulled up the gamification experiment: “We tested this. Didn’t work. Made users think we weren’t serious. Let’s not do this again.”

That experiment saved us six weeks of development on a feature we already knew wouldn’t work.

Putting It Into Practice

Implementation: Start Simple, Scale Up

You don’t need sophisticated tooling to start tracking experiments. You need discipline.

Week one: Create a Google Sheet or Notion database. Six columns:

  1. Date
  2. What we tested (one sentence)
  3. Why we tested it (hypothesis)
  4. Success criteria
  5. Results
  6. Conclusion and next steps

Every experiment, one row. That’s it. Took us 20 minutes to set up at one company. Used it for eight months before we needed anything more sophisticated.
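If you would rather start in a plain file than in a spreadsheet app, the same six columns work as a CSV. This is a sketch only: the file name, the column keys, and the `log_experiment` helper are made up for illustration, and the sample row borrows the gamification experiment described earlier, with a placeholder where the article gives no success criteria.

```python
import csv
from datetime import date
from pathlib import Path

# The six columns, in the order listed above (key names are hypothetical).
COLUMNS = ["date", "what_we_tested", "hypothesis",
           "success_criteria", "results", "conclusion_and_next_steps"]


def log_experiment(path, row):
    """Append one experiment as one row, writing the header on first use."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


log_experiment("experiment_log.csv", {
    "date": date.today().isoformat(),
    "what_we_tested": "Gamification: badges and points",
    "hypothesis": "Recognition would increase engagement",
    "success_criteria": "(placeholder: numeric target set before launch)",
    "results": "No effect on engagement; slight negative effect on "
               "perceived professionalism",
    "conclusion_and_next_steps": "Killed the feature",
})
```

Appending rather than rewriting keeps the habit cheap: one call per experiment, and the file stays readable in any spreadsheet tool when you outgrow the script.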

Month three: If you’re still using it, add tags for learning themes. Start clustering related experiments so you can see patterns.

Month six: If it’s becoming unwieldy (you’ll know - people stop using it because it’s hard to find things), consider dedicated tools. Optimizely has experiment tracking. Amplitude has it. There are purpose-built tools like Eppo.

But don’t start with tooling. Start with practice. The tool doesn’t create discipline. Discipline creates value for tools.

Making It Stick: The Cultural Bit

Tracking experiments only works if people actually do it. In my experience, most teams that start experiment tracking stop within three months.

What works:

Make it a deployment gate: Imagine if you couldn’t mark an experiment ticket “done” until the experiment log was updated. Annoying? Yes. Effective? Also yes.

Review experiments in regular ceremonies: We did monthly “experiment reviews” where teams presented their most interesting learning from the past month. Not results - learning. What changed about our understanding of users?

Made experiment tracking valuable, not just required. People wanted their experiments on the log because that’s what got discussed.

Tie it to product strategy: Every quarter we’d review the experiment log for each product area and extract key learnings. “What did we learn about user onboarding in the past six months?” That synthesis informed quarterly planning.

When experiment tracking influences actual decisions, people maintain it. When it’s just documentation for documentation’s sake, it atrophies.

Measuring Success: How You Know This Is Working

Three signs your experiment tracking is actually valuable:

People reference it in decision discussions: “I remember we tested something similar - let me check the log.” That’s the behaviour you want.

You stop retesting the same things: Track how many experiments you run that are similar to past experiments. Should decrease over time as institutional memory builds.
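A deliberately crude sketch of the "have we tested this before?" check: a keyword-overlap search over past experiment descriptions. The function name, the stop-word list, and the log entries are hypothetical (the entries echo examples from this article). A real log would likely want something better than exact word matching.

```python
def find_related(log, query):
    """Return past experiments sharing at least one meaningful word with the query."""
    stop_words = {"the", "a", "an", "to", "for", "of", "in", "on", "and", "we"}
    query_words = {w for w in query.lower().split() if w not in stop_words}
    return [entry for entry in log if query_words & set(entry.lower().split())]


past_experiments = [
    "Gamification badges and points for engagement",
    "Social proof on pricing page",
    "Deadline pressure for upgrades",
]

print(find_related(past_experiments, "add achievements and badges"))
# ['Gamification badges and points for engagement']
```

Counting how often a new proposal returns a non-empty result is one rough way to watch the retest rate over time.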

Strategy documents cite experiment evidence: When you write specs, PRDs, strategy docs, do they reference experiments as supporting evidence? If not, your tracking isn’t influencing decisions.

Key Takeaways

Let’s make this actionable:

  • Track hypothesis, criteria, results, and conclusions - Not just what you tested, but why, what you expected, what happened, and what it means.
  • Organise by learning themes, not chronologically - Make it easy to answer “what do we know about X?” not just “what did we test in March?”
  • Failed experiments are as valuable as successful ones - They tell you what doesn’t work, preventing future waste. Track them religiously.
  • Start with a simple spreadsheet - Don’t wait for perfect tools. Six columns in a Google Sheet beats no tracking at all.
  • Connect experiments to decisions - Require product decisions to reference supporting evidence. Creates accountability and ensures learnings get used.
  • Review experiments regularly in team ceremonies - Monthly learning reviews, quarterly synthesis for planning. Make tracking valuable, not just required.

Final Thoughts

The best product teams I’ve worked with aren’t necessarily the ones running the most experiments. They’re the ones building institutional knowledge from the experiments they run.

Your experiment from six months ago should inform your decision today. That only happens if you’ve systematically captured what you learned and made it easy to find.

Start simple. Log your next experiment in a spreadsheet. Include hypothesis, success criteria, results, conclusion. Do it for ten experiments and watch what happens when someone asks “Have we tested this before?”

The value isn’t in the log. It’s in not repeating yourself.

Have questions or thoughts? Get in touch - I’d love to hear from you!
