ETA Display Experimentation
TL;DR
Led experimentation on ETA display formats to find the optimal balance between user expectations and business outcomes. Discovered that widening ETA ranges by 5 minutes had no measurable conversion impact while reducing late deliveries by 9.28% and saving $0.05 per delivery in care costs. Key insight: users prefer reliability over precision.
My Role: Senior Product Manager
- Identified ETA display as a growth lever — shifting logistics from cost center to conversion driver
- Decomposed ETA ranges into testable components (upper bound, lower bound, width)
- Designed scrappy A/B experiment infrastructure when standard tooling didn't fit
- Collaborated with logistics and diner data science teams on experiment design and power analysis
- Designed expected-value ETA system with RL framing (transitioned before implementation)
The Challenge
There was a strategic push at Grubhub to grow orders and acquire new diners. Logistics had typically been viewed as an enabler and cost center for the business — we fulfilled orders, we didn't drive them.
I saw an opportunity to change that. ETA could be a growth lever in two ways:
- Drive conversion — Lower/more appealing ETAs increase checkout rates
- Improve retention — Fewer late deliveries mean better user experiences and fewer care contacts
We also saw competitors like DoorDash and UberEats experimenting with dynamic ETA ranges at different funnel stages. The feature had market validation — we just hadn't quantified it ourselves.
The Core Tension
ML-based prediction systems contain inherent uncertainty. The ETA we show users is a prediction, not a guarantee. This creates a product design challenge:
| If we show tighter ranges... | If we show wider ranges... |
|---|---|
| ✅ More appealing (precision signals confidence) | ✅ Fewer "late" deliveries (more buffer) |
| ❌ Higher risk of missing the window | ✅ Lower care costs |
| ❌ More care contacts when we miss | ❌ May signal uncertainty, hurt conversion |
The internal assumption was that users wanted precision — that wider ranges would feel like hedging. But we hadn't validated that assumption.
Why This Is an ML Product Problem
This wasn't just UX optimization. It was fundamentally about how to present probabilistic outputs to users.
ML predictions have error distributions. The range we show is a product decision about how much of that uncertainty to expose. Too narrow and we'll miss expectations. Too wide and we might hurt conversion. The experiment would tell us where the actual trade-off lies.
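To make that concrete, here is a toy sketch (the normal error distribution and every number are assumptions for illustration, not our production model): the upper bound we choose to display directly sets the expected late rate.

```python
from statistics import NormalDist

# Toy numbers: pretend the actual delivery time is normally distributed around
# a 40-minute point prediction with a 6-minute standard deviation (both values
# invented for this sketch). The displayed upper bound then determines how
# often a delivery lands "late" (actual time > displayed upper bound).
delivery_time = NormalDist(mu=40, sigma=6)

for upper_bound in (45, 48, 50, 55):
    p_late = 1 - delivery_time.cdf(upper_bound)
    print(f"upper bound {upper_bound} min -> P(late) ≈ {p_late:.1%}")
```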
The Approach
Decomposing the Levers
An ETA range like "35-45 min" has multiple components. Rather than testing "wider vs. narrower" as a single variable, I decomposed it:
| Component | What It Is | Effect of Increasing |
|---|---|---|
| Lower bound | "Best case" scenario | ↓ CTR/conversion (worse best case), ↑ bundling slack |
| Upper bound | "Worst case" scenario | ↓ Late %, ↓ care costs, potentially ↓ conversion |
| Range width | Uncertainty signal | Signals less precision, but more realistic expectations |
| Target point | What we aim for internally | Affects operational planning |
We had been targeting the lower bound of our standard 10-minute range — optimizing for the "best case" while accepting a higher risk of lateness.
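A minimal sketch of how those levers compose into a displayed range (the function, offsets, and numbers are illustrative, not the production logic):

```python
def display_range(predicted_min: float, lower_offset: float, upper_offset: float) -> str:
    """Build the displayed ETA range around a point prediction.

    lower_offset and upper_offset are the independently testable levers:
    widening the range means raising upper_offset; a lower (more appealing)
    best case means raising lower_offset.
    """
    lower = round(predicted_min - lower_offset)
    upper = round(predicted_min + upper_offset)
    return f"{lower}-{upper} min"

# Control: a 10-minute range ("38-48 min" for a 40-minute prediction).
print(display_range(40, lower_offset=2, upper_offset=8))
# Treatment: same lower bound, upper bound widened by 5 minutes ("38-53 min").
print(display_range(40, lower_offset=2, upper_offset=13))
```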
Experiment Design: Being Scrappy
I designed A/B/n tests experimenting with different combinations of these levers across funnel stages: home page, search, menu, and checkout.
The constraint: Our logistics systems weren't designed to run user-level A/B tests. We typically used switchback tests (exposing treatments in time-based windows across geographies) to account for network effects. But for this experiment, I needed consistent user experiences — the same diner should see the same ETA format throughout their session.
The workaround: I collaborated with both logistics and diner data science teams to design a creative solution:
- Assigned treatments based on diner IDs (ensuring consistent experience per user)
- Selected a set of regions that were a representative sample of the national market
- Ran the A/B test within those regions only
This let us gather valid data while adding minimal code on top of our existing infrastructure. We validated the assumption scrappily before investing in more sophisticated tooling.
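A minimal sketch of what that assignment might look like (the hashing scheme, region list, and treatment names are all assumptions for illustration):

```python
import hashlib
from typing import Optional

# Illustrative values only; the real region list and treatment arms differed.
TEST_REGIONS = {"region_a", "region_b", "region_c"}
TREATMENTS = ["control", "wider_upper_bound", "wider_range"]

def assign_treatment(diner_id: str, region: str) -> Optional[str]:
    """Deterministically bucket a diner so they see the same ETA format all session.

    Diners outside the selected regions are excluded from the experiment;
    inside them, the diner ID hashes to a stable treatment bucket.
    """
    if region not in TEST_REGIONS:
        return None
    digest = hashlib.sha256(diner_id.encode("utf-8")).hexdigest()
    return TREATMENTS[int(digest, 16) % len(TREATMENTS)]
```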
The metrics:
- Click-through rate (CTR) at each funnel stage
- Cart-to-checkout conversion (CVR)
- Late delivery % (actual dropoff time > upper bound of displayed range; see the sketch after this list)
- Care costs per delivery
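A minimal sketch of the late-delivery metric, assuming hypothetical field names:

```python
def late_rate(deliveries: list[dict]) -> float:
    """Share of deliveries whose actual dropoff exceeded the displayed upper bound."""
    late = sum(d["actual_min"] > d["displayed_upper_min"] for d in deliveries)
    return late / len(deliveries)

# Illustrative records (field names are assumptions, not the real schema).
sample = [
    {"actual_min": 44, "displayed_upper_min": 48},
    {"actual_min": 51, "displayed_upper_min": 48},
    {"actual_min": 47, "displayed_upper_min": 53},
]
print(f"late rate: {late_rate(sample):.0%}")  # 33%
```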
We also worked closely with the diner team to ensure our ETA experiments didn't confound any ongoing frontend experiments.
Results
| Metric | Result |
|---|---|
| CTR impact | No significant change |
| CVR impact | No significant change |
| Late delivery reduction | 9.28% |
| Care cost reduction | $0.05 per delivery |
The Key Insight
A wider range that we consistently hit was better than a narrow range we often missed. This challenged the internal assumption that users wanted precision. They wanted reliability.
"The average diner did not mind increased uncertainty as long as those expectations weren't broken."
Vision: Expected-Value ETA System
The experiment validated a core assumption. But I saw a path to something more sophisticated — and chose to validate scrappily first before building that complexity.
The Ideal System
Rather than static range formats, I envisioned an ETA system that dynamically optimizes based on expected value. I framed this as a reinforcement learning problem:
State: The user's position in the funnel and the characteristics of their potential cart. "Potential" because a user on the home or search page hasn't selected a restaurant yet — every impression represents a possible cart with different characteristics (restaurant prep time, distance, current demand).
Action: Adjust the ETA's upper and lower bounds independently.
Reward: Expected value of the ETA decision:
E[Value] = P(convert | ETA) × Revenue − P(late | ETA) × Care_Cost − Fulfillment_Cost(ETA)
The reward function could be weighted based on business priorities — growth mode (weight conversion) vs. profitability mode (weight efficiency).
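A sketch of how that reward might be scored per impression (the signature and weights are my illustration of the growth-vs-profitability knob, not a production spec):

```python
def eta_expected_value(p_convert: float, p_late: float, revenue: float,
                       care_cost: float, fulfillment_cost: float,
                       w_growth: float = 1.0, w_efficiency: float = 1.0) -> float:
    """E[Value] = P(convert)·Revenue − P(late)·Care_Cost − Fulfillment_Cost.

    w_growth and w_efficiency are illustrative knobs for tilting the reward
    toward conversion (growth mode) or toward cost (profitability mode).
    """
    return (w_growth * p_convert * revenue
            - p_late * care_cost
            - w_efficiency * fulfillment_cost)

# Invented numbers: a wider upper bound that cuts P(late) from 15% to 6% at the
# same conversion rate scores higher, even with slightly higher fulfillment cost.
print(eta_expected_value(p_convert=0.12, p_late=0.15, revenue=6.00,
                         care_cost=0.50, fulfillment_cost=0.25))
print(eta_expected_value(p_convert=0.12, p_late=0.06, revenue=6.00,
                         care_cost=0.50, fulfillment_cost=0.27))
```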
Why Validate First
The RL system had many moving parts. Before building that complexity, I needed to validate the core assumption: are users actually sensitive to range changes?
The scrappy A/B experiment answered that. Users weren't sensitive to wider ranges. This validated that we had room to optimize the upper bound without conversion risk — a key input to any future RL system's reward function.
Status
I initiated design work on this system but transitioned to Cornell Tech before implementation. The experiments shipped and delivered value. The RL vision showed where we were heading.
What I Learned
Designing for ML Uncertainty
ML predictions have error. How much of that error to expose to users is a product decision with real trade-offs. The default assumption — that users want precision — didn't hold up when we tested it.
The insight applies broadly: For any ML-powered feature, the question isn't just "how accurate is the model?" but "how do we present the model's uncertainty in a way that serves users?"
Reliability Over Precision
Users preferred a prediction they could trust over one that sounded precise but missed. This is a crucial insight for any prediction-based product.
When building on top of ML systems, optimizing for "accuracy" might be the wrong goal. Optimizing for "expectations met" may matter more for user outcomes.
Validate Before Building Complexity
The RL system would have been elegant. It also would have taken months to build. The scrappy A/B experiment validated the core assumption in weeks, shipped real value, and de-risked the larger investment.
This is the right sequence: lightweight validation → ship incremental value → build toward the sophisticated system.