ETA Display Experimentation
TL;DR
Led experimentation on ETA display formats to find the optimal balance between user expectations and business outcomes. Discovered that widening ETA ranges by 5 minutes had no measurable conversion impact while reducing late deliveries by 9.28% and saving $0.05 per delivery in care costs. Key insight: users prefer reliability over precision.
My Role: Senior Product Manager
- Identified ETA display as a growth lever — shifting logistics from cost center to conversion driver
- Decomposed ETA ranges into testable components (upper bound, lower bound, width)
- Designed scrappy A/B experiment infrastructure when standard tooling didn't fit
- Collaborated with logistics and diner data science teams on experiment design and power analysis
- Designed expected-value ETA system with RL framing (transitioned before implementation)
The Challenge
There was a strategic push at Grubhub to grow orders and acquire new diners. Logistics had typically been viewed as an enabler and cost center for the business — we fulfilled orders, we didn't drive them.
I saw an opportunity to change that. ETA could be a growth lever in two ways:
- Drive conversion — Lower/more appealing ETAs increase checkout rates
- Improve retention — Fewer late deliveries mean better user experiences and fewer care contacts
We also saw competitors like DoorDash and UberEats experimenting with dynamic ETA ranges at different funnel stages. The feature had market validation — we just hadn't quantified it ourselves.
The Core Tension
ML-based prediction systems contain inherent uncertainty. The ETA we show users is a prediction, not a guarantee. This creates a product design challenge:
| If we show tighter ranges... | If we show wider ranges... |
|---|---|
| ✅ More appealing (precision signals confidence) | ✅ Fewer "late" deliveries (more buffer) |
| ❌ Higher risk of missing the window | ✅ Lower care costs |
| ❌ More care contacts when we miss | ❌ May signal uncertainty, hurt conversion |
The internal assumption was that users wanted precision — that wider ranges would feel like hedging. But we hadn't validated that assumption.
Why This Is an ML Product Problem
This wasn't just UX optimization. It was fundamentally about how to present probabilistic outputs to users.
ML predictions have error distributions. The range we show is a product decision about how much of that uncertainty to expose. Too narrow and we'll miss expectations. Too wide and we might hurt conversion. The experiment would tell us where the actual trade-off lies.
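To make that concrete, here is a toy sketch (the normal error distribution and every number are assumptions for illustration, not our production model): the upper bound we choose to display directly sets the expected late rate.

```python
from statistics import NormalDist

# Toy numbers: pretend the actual delivery time is normally distributed around
# a 40-minute point prediction with a 6-minute standard deviation (both values
# invented for this sketch). The displayed upper bound then determines how
# often a delivery lands "late" (actual time > displayed upper bound).
delivery_time = NormalDist(mu=40, sigma=6)

for upper_bound in (45, 48, 50, 55):
    p_late = 1 - delivery_time.cdf(upper_bound)
    print(f"upper bound {upper_bound} min -> P(late) ≈ {p_late:.1%}")
```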
The Approach
Decomposing the Levers
An ETA range like "35-45 min" has multiple components. Rather than testing "wider vs. narrower" as a single variable, I decomposed it:
| Component | What It Is | Effect of Increasing |
|---|---|---|
| Lower bound | "Best case" scenario | ↓ CTR/conversion (worse best case), ↑ bundling slack |
| Upper bound | "Worst case" scenario | ↓ Late %, ↓ care costs, potentially ↓ conversion |
| Range width | Uncertainty signal | Signals less precision, but more realistic expectations |
| Target point | What we aim for internally | Affects operational planning |
We had been targeting the lower bound of our standard 10-minute range — optimizing for the "best case" while accepting a higher risk of lateness.
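A minimal sketch of how those levers compose into a displayed range (the function, offsets, and numbers are illustrative, not the production logic):

```python
def display_range(predicted_min: float, lower_offset: float, upper_offset: float) -> str:
    """Build the displayed ETA range around a point prediction.

    lower_offset and upper_offset are the independently testable levers:
    widening the range means raising upper_offset; a lower (more appealing)
    best case means raising lower_offset.
    """
    lower = round(predicted_min - lower_offset)
    upper = round(predicted_min + upper_offset)
    return f"{lower}-{upper} min"

# Control: a 10-minute range ("38-48 min" for a 40-minute prediction).
print(display_range(40, lower_offset=2, upper_offset=8))
# Treatment: same lower bound, upper bound widened by 5 minutes ("38-53 min").
print(display_range(40, lower_offset=2, upper_offset=13))
```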
Experiment Design: Being Scrappy
I designed A/B/n tests experimenting with different combinations of these levers across funnel stages: home page, search, menu, and checkout.
The constraint: Our logistics systems weren't designed to run user-level A/B tests. We typically used switchback tests (exposing treatments in time-based windows across geographies) to account for network effects. But for this experiment, I needed consistent user experiences — the same diner should see the same ETA format throughout their session.
The workaround: I collaborated with both logistics and diner data science teams to design a creative solution:
- Assigned treatments based on diner IDs (ensuring consistent experience per user)
- Selected a set of regions that were a representative sample of the national market
- Ran the A/B test within those regions only
This let us gather valid data while adding minimal code on top of our existing infrastructure. We validated the assumption scrappily before investing in more sophisticated tooling.
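A minimal sketch of what that assignment might look like (the hashing scheme, region list, and treatment names are all assumptions for illustration):

```python
import hashlib
from typing import Optional

# Illustrative values only; the real region list and treatment arms differed.
TEST_REGIONS = {"region_a", "region_b", "region_c"}
TREATMENTS = ["control", "wider_upper_bound", "wider_range"]

def assign_treatment(diner_id: str, region: str) -> Optional[str]:
    """Deterministically bucket a diner so they see the same ETA format all session.

    Diners outside the selected regions are excluded from the experiment;
    inside them, the diner ID hashes to a stable treatment bucket.
    """
    if region not in TEST_REGIONS:
        return None
    digest = hashlib.sha256(diner_id.encode("utf-8")).hexdigest()
    return TREATMENTS[int(digest, 16) % len(TREATMENTS)]
```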
The metrics:
- Click-through rate (CTR) at each funnel stage
- Cart-to-checkout conversion (CVR)
- Late delivery % (actual dropoff time > upper bound of displayed range; see the sketch after this list)
- Care costs per delivery
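A minimal sketch of the late-delivery metric, assuming hypothetical field names:

```python
def late_rate(deliveries: list[dict]) -> float:
    """Share of deliveries whose actual dropoff exceeded the displayed upper bound."""
    late = sum(d["actual_min"] > d["displayed_upper_min"] for d in deliveries)
    return late / len(deliveries)

# Illustrative records (field names are assumptions, not the real schema).
sample = [
    {"actual_min": 44, "displayed_upper_min": 48},
    {"actual_min": 51, "displayed_upper_min": 48},
    {"actual_min": 47, "displayed_upper_min": 53},
]
print(f"late rate: {late_rate(sample):.0%}")  # 33%
```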
We also worked closely with the diner team to ensure our ETA experiments didn't confound any ongoing frontend experiments.
Results
| Metric | Result |
|---|---|
| CTR impact | No significant change |
| CVR impact | No significant change |
| Late delivery reduction | 9.28% |
| Care cost reduction | $0.05 per delivery |
The Key Insight
A wider range that we consistently hit was better than a narrow range we often missed. This challenged the internal assumption that users wanted precision. They wanted reliability.
"The average diner did not mind increased uncertainty as long as those expectations weren't broken."
Vision: Expected-Value ETA System
The experiment validated a core assumption. But I saw a path to something more sophisticated — and chose to validate scrappily first before building that complexity.
The Ideal System
Rather than static range formats, I envisioned an ETA system that dynamically optimizes based on expected value. I framed this as a reinforcement learning problem:
State: The user's position in the funnel and the characteristics of their potential cart. "Potential" because a user on the home or search page hasn't selected a restaurant yet — every impression represents a possible cart with different characteristics (restaurant prep time, distance, current demand).
Action: Adjust the ETA's upper and lower bounds independently.
Reward: Expected value of the ETA decision:
E[Value] = P(convert | ETA) × Revenue − P(late | ETA) × Care_Cost − Fulfillment_Cost(ETA)
The reward function could be weighted based on business priorities — growth mode (weight conversion) vs. profitability mode (weight efficiency).
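A sketch of how that reward might be scored per impression (the signature and weights are my illustration of the growth-vs-profitability knob, not a production spec):

```python
def eta_expected_value(p_convert: float, p_late: float, revenue: float,
                       care_cost: float, fulfillment_cost: float,
                       w_growth: float = 1.0, w_efficiency: float = 1.0) -> float:
    """E[Value] = P(convert)·Revenue − P(late)·Care_Cost − Fulfillment_Cost.

    w_growth and w_efficiency are illustrative knobs for tilting the reward
    toward conversion (growth mode) or toward cost (profitability mode).
    """
    return (w_growth * p_convert * revenue
            - p_late * care_cost
            - w_efficiency * fulfillment_cost)

# Invented numbers: a wider upper bound that cuts P(late) from 15% to 6% at the
# same conversion rate scores higher, even with slightly higher fulfillment cost.
print(eta_expected_value(p_convert=0.12, p_late=0.15, revenue=6.00,
                         care_cost=0.50, fulfillment_cost=0.25))
print(eta_expected_value(p_convert=0.12, p_late=0.06, revenue=6.00,
                         care_cost=0.50, fulfillment_cost=0.27))
```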
Why Validate First
The RL system had many moving parts. Before building that complexity, I needed to validate the core assumption: are users actually sensitive to range changes?
The scrappy A/B experiment answered that. Users weren't sensitive to wider ranges. This validated that we had room to optimize the upper bound without conversion risk — a key input to any future RL system's reward function.
Status
I initiated design work on this system but transitioned to Cornell Tech before implementation. The experiments shipped and delivered value. The RL vision showed where we were heading.
What I Learned
Designing for ML Uncertainty
ML predictions have error. How much of that error to expose to users is a product decision with real trade-offs. The default assumption — that users want precision — didn't hold up when we tested it.
The insight applies broadly: For any ML-powered feature, the question isn't just "how accurate is the model?" but "how do we present the model's uncertainty in a way that serves users?"
Reliability Over Precision
Users preferred a prediction they could trust over one that sounded precise but missed. This is a crucial insight for any prediction-based product.
When building on top of ML systems, optimizing for "accuracy" might be the wrong goal. Optimizing for "expectations met" may matter more for user outcomes.
Validate Before Building Complexity
The RL system would have been elegant. It also would have taken months to build. The scrappy A/B experiment validated the core assumption in weeks, shipped real value, and de-risked the larger investment.
This is the right sequence: lightweight validation → ship incremental value → build toward the sophisticated system.