Measurement & Analytics

Backend product analytics that tell the truth — percentiles instead of averages, pre-aggregated rollups for large date ranges, and a graceful fallback to raw queries.

Problem

A single number is easy to trust and easy to be fooled by. An average response time can look healthy while a meaningful slice of users sits in a long tail the average hides. And the moment you ask about that tail over a large date range, the queries that answer get slow and heavy, so the honest view is also the expensive one.
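
To make that concrete, here is a minimal sketch with made-up numbers. The helper `nearest_rank_percentile` and the sample latencies are hypothetical, chosen only to show how an average can sit comfortably between a fast majority and a slow tail.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in ms: 80 fast requests and 20 slow ones.
latencies_ms = [100] * 80 + [1500] * 20

average = mean(latencies_ms)                         # 380
p70 = nearest_rank_percentile(latencies_ms, 70)      # 100
p90 = nearest_rank_percentile(latencies_ms, 90)      # 1500

print(f"average={average} ms, p70={p70} ms, p90={p90} ms")
```

The average of 380 ms describes nobody: most requests took 100 ms and one in five took 1500 ms. Reading p70 and p90 side by side shows both groups.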

Approach

I built the analytics layer to report the truth rather than the most flattering number. Instead of plain averages it reports percentiles (p70/p90), with weighted aggregation so combining buckets stays correct. Empty time periods are filled rather than skipped, so a gap reads as "nothing happened" instead of silently distorting a trend. To keep large ranges fast, results come from pre-aggregated rollup tables, with a deliberate fallback to raw queries whenever a rollup is missing, empty, or stale, so speed never comes at the cost of a wrong answer. The same validation runs across the public API and the internal admin, so both surfaces agree on what a valid question even is.
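
The gap-filling step can be sketched in a few lines. This is an illustration under assumptions, not the production code: the function name `fill_empty_days`, the daily granularity, and the choice of 0 as the fill value for a count metric are all hypothetical. For a latency percentile, an explicit "no data" marker would probably fit better than 0.

```python
from datetime import date, timedelta


def fill_empty_days(counts_by_day, start, end):
    """Return one (day, count) pair for every day from start to end
    inclusive, using 0 for days that have no row in the query result."""
    filled = []
    day = start
    while day <= end:
        filled.append((day, counts_by_day.get(day, 0)))
        day += timedelta(days=1)
    return filled


# Hypothetical sparse query result: only days with activity have rows.
sparse = {date(2024, 1, 1): 5, date(2024, 1, 4): 2}

series = fill_empty_days(sparse, date(2024, 1, 1), date(2024, 1, 4))
counts = [count for _, count in series]  # [5, 0, 0, 2]
```

Without the fill, a chart would draw a straight line from 5 to 2 and imply activity on the two quiet days.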

Key decisions

Percentiles (p70/p90), not just averages
Weighted aggregation across buckets
Empty periods filled, not skipped
Pre-aggregated rollups for large date ranges
Graceful fallback to raw queries when rollups are empty or stale
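
As an illustration of why the weighting matters, the sketch below combines two buckets both ways. It assumes the weight is each bucket's sample count, which the list above does not spell out; the field names `value` and `count` and both function names are hypothetical. Count-weighting is exact for means and sums. For percentiles it is only an approximation, since a percentile cannot be rebuilt exactly from per-bucket percentiles.

```python
def naive_combine(buckets):
    """Unweighted mean of per-bucket values: every bucket counts the
    same, no matter how many samples it holds."""
    return sum(b["value"] for b in buckets) / len(buckets)


def weighted_combine(buckets):
    """Mean of per-bucket values weighted by sample count.
    Returns None when there are no samples at all."""
    total = sum(b["count"] for b in buckets)
    if total == 0:
        return None
    return sum(b["value"] * b["count"] for b in buckets) / total


# Hypothetical hourly buckets: a busy fast hour and a quiet slow hour.
hourly = [
    {"value": 100.0, "count": 900},
    {"value": 1000.0, "count": 100},
]

naive = naive_combine(hourly)        # 550.0
weighted = weighted_combine(hourly)  # 190.0
```

The unweighted result lets 100 slow samples outvote 900 fast ones. The weighted result matches what a single query over all 1000 samples would return for the mean.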

What made it hard

The hard part wasn't writing the queries. It was deciding what an honest number is. An average flatters the truth; a percentile reports it — but a percentile is harder to explain and harder to compute consistently. The deeper challenge was the rollups: pre-aggregation makes big ranges fast, but it creates a second copy of the truth that has to stay in step with the raw data. The real work was making sure a dashboard never quietly showed a stale or empty rollup as if it were a real measurement.
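
One way to keep a stale or empty rollup from passing as a real measurement is to make the freshness check explicit and return the source alongside the value. The sketch below is framework-free and built on assumptions: the `Rollup` fields, the rule "stale means computed before the newest raw event in the range", and the `run_raw_query` callback are hypothetical stand-ins for whatever the real tables and queries look like.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional, Tuple


@dataclass
class Rollup:
    value: float
    sample_count: int
    computed_at: datetime


def is_usable(rollup: Optional[Rollup], newest_raw_event_at: datetime) -> bool:
    """A rollup is usable only if it exists, holds samples, and was
    computed no earlier than the newest raw event it should cover."""
    if rollup is None or rollup.sample_count == 0:
        return False
    return rollup.computed_at >= newest_raw_event_at


def read_metric(
    rollup: Optional[Rollup],
    newest_raw_event_at: datetime,
    run_raw_query: Callable[[], float],
) -> Tuple[float, str]:
    """Return (value, source). The raw query runs only on fallback."""
    if is_usable(rollup, newest_raw_event_at):
        return rollup.value, "rollup"
    return run_raw_query(), "raw"


newest = datetime(2024, 1, 2, 12, 0)
fresh = Rollup(value=420.0, sample_count=1000, computed_at=datetime(2024, 1, 2, 13, 0))
stale = Rollup(value=390.0, sample_count=900, computed_at=datetime(2024, 1, 2, 11, 0))


def raw_query() -> float:
    return 433.0  # stand-in for the slow raw query


fast_answer = read_metric(fresh, newest, raw_query)      # (420.0, "rollup")
fallback_answer = read_metric(stale, newest, raw_query)  # (433.0, "raw")
```

Returning the source label means a caller can tell which path produced the number, so a fallback shows up in logs or on the dashboard.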

Trade-offs

Rollups buy speed at the cost of a second source of truth to keep in sync; the raw-query fallback buys correctness at the cost of being slower when it kicks in. Reporting percentiles is more honest than a single average, but it asks more of the reader — one number is comfortable, a distribution is true. The choice each time was to favour an answer that stayed correct over one that was merely fast or simple.

What I learned

An average is the story a dataset tells about itself; a percentile is closer to what people actually felt. Most of measurement isn't computing the number — it's making the honest number the easy one to see, and making sure a fast answer is never quietly a wrong one.

What I worked on

Percentiles vs averages (p70/p90) · Weighted aggregation · Rollup / pre-aggregation tables · Raw-query fallback · Django · PostgreSQL
