Improving Delivery Time Predictions
TL;DR
Led initiative to improve food prep prediction accuracy by identifying ground truth data quality issues and proposing a new modeling approach. Sized at $1.8M annual opportunity. Experiments showed promising results, but production revealed that improving one ML component without holistic evaluation can create trade-offs that negate the gains.
My Role: Senior Product Manager
- Conducted error analysis correlating driver wait time with prediction error
- Identified ground truth data quality issue and proposed new target variable
- Wrote PRD with functional/technical requirements for two-model solution
- Designed manual tuning controls for human-in-the-loop adjustments
- Designed A/B and switchback experiments to measure accuracy vs. conversion trade-offs
- Collaborated with data science on model architecture; got buy-in from engineering and operations
The Problem
Driver-in-restaurant (DinR) wait time was our #1 user complaint across all three sides of the marketplace.
Drivers: Idle time reduced throughput and earnings. Drivers were paid primarily for distance, not time, so waiting earned them little. In the worst cases, drivers unassigned orders because of excessive waits.
Restaurants: Drivers crowding the space, constant questions about order readiness, operational disruption.
Diners: Late deliveries, frustration seeing driver "not moving" on the tracking screen, increased care contacts.
Business Impact: Higher fulfillment cost (care contacts, driver compensation), cascading effects on bundled orders, driver supply issues from frustrated couriers leaving the platform.
Root Cause Analysis
I conducted an error analysis of our food prep prediction model using SQL and ReDash.
The model was an XGBoost regressor that predicted the time until food would be ready for pickup, using features such as the restaurant's historical prep times, item count, region, and time of day. The analysis surfaced four findings (a sketch of the analysis logic follows the list):
- Wide error distribution — calculated RMSE, MAE, 10th and 90th percentile errors
- Strong correlation between DinR wait time and food prep prediction error
- Ground truth data quality issue — the model trained on a restaurant tablet button press that was low volume, often "fast-tapped" too early, or forgotten and pressed too late
- Missing handoff phase — our ETA pipeline used Food Prep (ML model) → Handoff (static heuristic) → Driver Departure. The static handoff couldn't capture variance across restaurant types (drive-throughs, parking, etc.)
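A minimal sketch of that error analysis, rewritten in Python rather than the original SQL/ReDash queries; the file and column names (`predicted_prep_min`, `actual_ready_min`, `dinr_wait_min`) are hypothetical stand-ins for the production schema.

```python
import numpy as np
import pandas as pd

# Hypothetical export of order-level predictions joined to outcomes;
# column names are illustrative, not the production schema.
orders = pd.read_csv("prep_predictions.csv")  # predicted_prep_min, actual_ready_min, dinr_wait_min

# Signed error: positive means we under-predicted prep time (driver waits).
error = orders["actual_ready_min"] - orders["predicted_prep_min"]

metrics = {
    "rmse": np.sqrt((error ** 2).mean()),
    "mae": error.abs().mean(),
    "p10": error.quantile(0.10),
    "p90": error.quantile(0.90),
}
print(metrics)

# Correlation between prediction error and driver-in-restaurant wait time.
print(orders["dinr_wait_min"].corr(error))
```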
Quantifying the Opportunity
I used driver departure as a proxy for "food ready" and compared it against the model's predictions. This revealed significant error and an opportunity to improve accuracy by changing the target variable and explicitly modeling the handoff phase.
Sized opportunity: $1.8M annualized, based on estimated reductions in care costs and late deliveries.
The Solution
Proposed Approach
I proposed a two-model solution:
- Improved food prep model — retrain on driver departure as the proxy ground truth (see the sketch after this list)
- New handoff model — explicitly model the handoff phase instead of a static value
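A minimal sketch of that retraining step, assuming a pandas feature frame; the feature names, file path, and `driver_departure_min` target column are all hypothetical, and the real pipeline ran on the team's H2O platform rather than the `xgboost` package shown here.

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical training extract; in reality the features lived in the team's
# feature store and training ran on the H2O platform.
df = pd.read_parquet("orders_features.parquet")

# Illustrative feature names (restaurant history, item count, region, time of day).
features = ["hist_prep_p50_min", "item_count", "region_id", "hour_of_day"]

# New target: minutes until driver departure, used as a proxy for "food ready"
# instead of the noisy tablet button press.
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["driver_departure_min"], test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```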
PRD and Requirements
- Two new predictions integrated into ETA pipeline
- Combined prediction replaces existing food prep estimate
- Must integrate with existing H2O framework (platform constraint)
- Latency requirements for real-time prediction
- Manual tuning knobs to adjust assumptions if production behavior diverged (sketched after this list)
  - Rationale: ground truth would still be imperfect; we needed the ability to course-correct
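A minimal sketch of what those tuning controls could look like, assuming a thin adjustment layer applied on top of the model output; the parameter names and bounds are illustrative, not the production implementation.

```python
from dataclasses import dataclass

@dataclass
class PrepTuning:
    """Operator-adjustable knobs applied after the model prediction (hypothetical)."""
    global_offset_min: float = 0.0   # additive shift, e.g. +1.5 minutes
    scale: float = 1.0               # multiplicative adjustment
    floor_min: float = 2.0           # never predict an implausibly short prep
    cap_min: float = 60.0            # clamp runaway predictions

def adjust_prediction(raw_prep_min: float, knobs: PrepTuning) -> float:
    """Apply human-in-the-loop adjustments to a raw model prediction."""
    adjusted = raw_prep_min * knobs.scale + knobs.global_offset_min
    return min(max(adjusted, knobs.floor_min), knobs.cap_min)

# Example: ops notices systematic under-prediction in a market and bumps the offset.
print(adjust_prediction(12.0, PrepTuning(global_offset_min=2.0)))
```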
Implementation
I collaborated with data science on the model architecture. They handled feature exploration and training; I owned the PRD, requirements, and experiment design.
Food Prep Model: Retrained with driver departure as target variable, same XGBoost architecture.
Handoff Model: Implemented as a lookup table keyed by region, day of week, and time of day. Intentionally simple: it beat the static baseline and validated the approach before we invested in more complexity.
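A minimal sketch of that lookup-table approach, assuming a pandas frame of observed handoff durations; the column names and fallback default are hypothetical, with the default standing in for the old static heuristic.

```python
import pandas as pd

# Hypothetical historical handoff durations (driver arrival -> driver departure).
handoffs = pd.read_parquet("handoff_observations.parquet")  # region, dow, hour, handoff_min

# Median handoff time per (region, day of week, hour of day) bucket.
lookup = (
    handoffs.groupby(["region", "dow", "hour"])["handoff_min"]
    .median()
    .to_dict()
)

STATIC_DEFAULT_MIN = 4.0  # fallback comparable to the old static heuristic

def predict_handoff(region: str, dow: int, hour: int) -> float:
    """Return the bucketed median handoff time, falling back to the static value."""
    return lookup.get((region, dow, hour), STATIC_DEFAULT_MIN)
```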
Stakeholder Alignment
I pitched the approach to engineering, operations, and data science leads and got buy-in based on its potential to reduce care contacts, driver unassignments, and late deliveries.
Experiment Results
The experiments showed promising signals:
- DinR reduction observed (~1-3 minutes in testing)
- Prediction error improved significantly on the new target variable
I designed both A/B tests (for conversion impact) and switchback tests (for network-level fulfillment cost) to measure trade-offs.
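A minimal sketch of the switchback assignment idea, assuming randomization over region × time-window units; the window length, unit keys, and salt are illustrative, not the experiment's actual design.

```python
import hashlib

WINDOW_HOURS = 2  # hypothetical switchback window length

def switchback_arm(region: str, epoch_hour: int, salt: str = "handoff-v1") -> str:
    """Deterministically assign a region x time-window unit to control or treatment.

    All orders in the same region during the same window share one arm, so
    network effects (dispatch, bundling) are measured at the unit level rather
    than leaking across individually randomized orders.
    """
    window = epoch_hour // WINDOW_HOURS
    digest = hashlib.sha256(f"{salt}:{region}:{window}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Example: every order in region "nyc-manhattan" during hours 100-101 shares one arm.
print(switchback_arm("nyc-manhattan", epoch_hour=100))
```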
The Trade-off
The models made ETAs more accurate but also longer on average — they shifted the E2E ETA distribution right. This created tension:
| Positive | Negative |
|---|---|
| ✅ Lower DinR | ❌ Lower conversion (longer ETAs shown) |
| ✅ Fewer late deliveries | ❌ Longer engaged time for drivers |
| ✅ Fewer care contacts | |
The question was whether fulfillment savings would offset conversion losses.
What Happened in Production
I transitioned to Cornell Tech during the phased rollout. The models were deployed, but from what I understand, they have not delivered the expected business value.
My Hypothesis (Reasoning Post-Hoc)
I no longer have visibility into the production data, but based on the system dynamics, I believe the benefits and costs roughly canceled out:
1. DinR reduction doesn't generate much business value at Grubhub — driver pay is heavily distance-based, not time-based. Reducing wait time saves less than expected.
2. Exception: regulated markets — In NYC, minimum pay laws require ~$17.96/hr for engaged time. California's Prop 22 requires 120% of local minimum wage for engaged time. DinR reduction matters more in these markets, where drivers must be paid for waiting, but they are a subset of the network (rough arithmetic after this list).
3. Longer ETAs hurt conversion — even though predictions were more accurate, showing longer ETAs at the top of the funnel likely reduced checkout rates.
4. Net effect: neutral — fulfillment savings may have been offset by conversion losses.
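A rough back-of-envelope for points 1 and 2, using only the NYC engaged-time rate cited above; this is a sketch of the reasoning, not production data.

```python
# Back-of-envelope: value of one minute of DinR reduction per order in a
# regulated market, using the NYC engaged-time rate cited above.
NYC_ENGAGED_RATE_PER_HR = 17.96

savings_per_order_minute = NYC_ENGAGED_RATE_PER_HR / 60  # ~$0.30 per order per minute
print(f"~${savings_per_order_minute:.2f} saved per order for each minute of wait removed")

# Outside regulated markets, driver pay is mostly distance-based, so the
# comparable per-minute savings are much closer to zero.
```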
To be clear: this is my hypothesis based on understanding the system, not confirmed with production data. The actual root cause may differ.
What I Learned
Evaluate ML Systems Holistically
The mistake: I optimized for food prep accuracy in isolation. But the food prep model was one input to the E2E ETA, which affected conversion, dispatch decisions, bundling efficiency, late delivery rate, care costs, and driver satisfaction.
Improving one component created trade-offs elsewhere that may have negated the gains.
What I do now: Before scoping any ML initiative, I map the component's dependencies:
- What systems does this feed into?
- What metrics will be affected downstream?
- How will we measure the full system impact, not just the component?
This ensures experiments are designed for holistic evaluation from the start.
Pressure-Test the Business Case
The mistake: DinR was the #1 complaint, so I assumed solving it was valuable. But the business impact was limited by driver pay structure (distance > time) and trade-offs with conversion.
What I do now: Before investing in a solution, I ask:
- Is this user problem tied to meaningful business value?
- What are the second-order effects of solving it?
- Are there structural reasons the impact might be smaller than it appears?
Not all user problems are equally valuable to solve. Validating the business case is as important as validating the solution.
Design for Production Uncertainty
The mistake: Experiment results showed promise, but production performance differed. Distribution shifts, system interactions, and edge cases can erode gains.
What I do now: I build in mechanisms for production uncertainty:
- Manual tuning controls (which I did include in this project)
- Phased rollouts with clear rollback criteria
- Monitoring dashboards that track the full system, not just the component
This project reinforced why those safeguards matter — even when experiments look good.