Improving Delivery Time Predictions
TL;DR
Led initiative to improve food prep prediction accuracy by identifying ground truth data quality issues and proposing a new modeling approach. Sized at $1.8M annual opportunity. Experiments showed promising results, but production revealed that improving one ML component without holistic evaluation can create trade-offs that negate the gains.
My Role: Senior Product Manager
- Conducted error analysis correlating driver wait time with prediction error
- Identified ground truth data quality issue and proposed new target variable
- Wrote PRD with functional/technical requirements for two-model solution
- Designed manual tuning controls for human-in-the-loop adjustments
- Designed A/B and switchback experiments to measure accuracy vs. conversion trade-offs
- Collaborated with data science on model architecture; got buy-in from engineering and operations
The Problem
Driver-in-restaurant (DinR) wait time was our #1 user complaint across all three sides of the marketplace.
Drivers: Idle time reduced throughput and earnings. Drivers were paid primarily for distance, not time, so waiting earned them little. In the worst cases, drivers unassigned orders because of excessive waits.
Restaurants: Drivers crowding the space, constant questions about order readiness, operational disruption.
Diners: Late deliveries, frustration seeing driver "not moving" on the tracking screen, increased care contacts.
Business Impact: Higher fulfillment cost (care contacts, driver compensation), cascading effects on bundled orders, driver supply issues from frustrated couriers leaving the platform.
Root Cause Analysis
I conducted an error analysis of our food prep prediction model using SQL and ReDash.
The model was an XGBoost regressor that predicted the time until food would be ready for pickup, using features such as the restaurant's historical prep times, item count, region, and time of day. The analysis surfaced four findings (a sketch of the analysis logic follows the list):
- Wide error distribution — calculated RMSE, MAE, 10th and 90th percentile errors
- Strong correlation between DinR wait time and food prep prediction error
- Ground truth data quality issue — the model trained on a restaurant tablet button press that was low volume, often "fast-tapped" too early, or forgotten and pressed too late
- Missing handoff phase — our ETA pipeline used Food Prep (ML model) → Handoff (static heuristic) → Driver Departure. The static handoff couldn't capture variance across restaurant types (drive-throughs, parking, etc.)
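A minimal sketch of that error analysis, rewritten in Python rather than the original SQL/ReDash queries; the file and column names (`predicted_prep_min`, `actual_ready_min`, `dinr_wait_min`) are hypothetical stand-ins for the production schema.

```python
import numpy as np
import pandas as pd

# Hypothetical export of order-level predictions joined to outcomes;
# column names are illustrative, not the production schema.
orders = pd.read_csv("prep_predictions.csv")  # predicted_prep_min, actual_ready_min, dinr_wait_min

# Signed error: positive means we under-predicted prep time (driver waits).
error = orders["actual_ready_min"] - orders["predicted_prep_min"]

metrics = {
    "rmse": np.sqrt((error ** 2).mean()),
    "mae": error.abs().mean(),
    "p10": error.quantile(0.10),
    "p90": error.quantile(0.90),
}
print(metrics)

# Correlation between prediction error and driver-in-restaurant wait time.
print(orders["dinr_wait_min"].corr(error))
```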
Quantifying the Opportunity
I used driver departure as a proxy for "food ready" and compared it against the model's predictions. This revealed significant error and an opportunity to improve accuracy by changing the target variable and explicitly modeling the handoff phase.
Sized opportunity: $1.8M annualized, based on estimated reductions in care costs and late deliveries.
The Solution
Proposed Approach
I proposed a two-model solution:
- Improved food prep model — retrain on driver departure as the proxy ground truth (see the sketch after this list)
- New handoff model — explicitly model the handoff phase instead of a static value
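A minimal sketch of that retraining step, assuming a pandas feature frame; the feature names, file path, and `driver_departure_min` target column are all hypothetical, and the real pipeline ran on the team's H2O platform rather than the `xgboost` package shown here.

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical training extract; in reality the features lived in the team's
# feature store and training ran on the H2O platform.
df = pd.read_parquet("orders_features.parquet")

# Illustrative feature names (restaurant history, item count, region, time of day).
features = ["hist_prep_p50_min", "item_count", "region_id", "hour_of_day"]

# New target: minutes until driver departure, used as a proxy for "food ready"
# instead of the noisy tablet button press.
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["driver_departure_min"], test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```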
PRD and Requirements
- Two new predictions integrated into ETA pipeline
- Combined prediction replaces existing food prep estimate
- Must integrate with existing H2O framework (platform constraint)
- Latency requirements for real-time prediction
- Manual tuning knobs to adjust assumptions if production behavior diverged (sketched after this list)
  - Rationale: ground truth would still be imperfect; we needed the ability to course-correct
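A minimal sketch of what those tuning controls could look like, assuming a thin adjustment layer applied on top of the model output; the parameter names and bounds are illustrative, not the production implementation.

```python
from dataclasses import dataclass

@dataclass
class PrepTuning:
    """Operator-adjustable knobs applied after the model prediction (hypothetical)."""
    global_offset_min: float = 0.0   # additive shift, e.g. +1.5 minutes
    scale: float = 1.0               # multiplicative adjustment
    floor_min: float = 2.0           # never predict an implausibly short prep
    cap_min: float = 60.0            # clamp runaway predictions

def adjust_prediction(raw_prep_min: float, knobs: PrepTuning) -> float:
    """Apply human-in-the-loop adjustments to a raw model prediction."""
    adjusted = raw_prep_min * knobs.scale + knobs.global_offset_min
    return min(max(adjusted, knobs.floor_min), knobs.cap_min)

# Example: ops notices systematic under-prediction in a market and bumps the offset.
print(adjust_prediction(12.0, PrepTuning(global_offset_min=2.0)))
```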
Implementation
I collaborated with data science on the model architecture. They handled feature exploration and training; I owned the PRD, requirements, and experiment design.
Food Prep Model: Retrained with driver departure as target variable, same XGBoost architecture.
Handoff Model: Implemented as a lookup table keyed by region, day of week, and time of day. Intentionally simple: it beat the static baseline and validated the approach before we invested in more complexity.
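A minimal sketch of that lookup-table approach, assuming a pandas frame of observed handoff durations; the column names and fallback default are hypothetical, with the default standing in for the old static heuristic.

```python
import pandas as pd

# Hypothetical historical handoff durations (driver arrival -> driver departure).
handoffs = pd.read_parquet("handoff_observations.parquet")  # region, dow, hour, handoff_min

# Median handoff time per (region, day of week, hour of day) bucket.
lookup = (
    handoffs.groupby(["region", "dow", "hour"])["handoff_min"]
    .median()
    .to_dict()
)

STATIC_DEFAULT_MIN = 4.0  # fallback comparable to the old static heuristic

def predict_handoff(region: str, dow: int, hour: int) -> float:
    """Return the bucketed median handoff time, falling back to the static value."""
    return lookup.get((region, dow, hour), STATIC_DEFAULT_MIN)
```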
Stakeholder Alignment
I pitched the approach to engineering, operations, and data science leads and got buy-in based on its potential to reduce care contacts, driver unassignments, and late deliveries.
Experiment Results
The experiments showed promising signals:
- DinR reduction observed (~1-3 minutes in testing)
- Prediction error improved significantly on the new target variable
I designed both A/B tests (for conversion impact) and switchback tests (for network-level fulfillment cost) to measure trade-offs.
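A minimal sketch of the switchback assignment idea, assuming randomization over region × time-window units; the window length, unit keys, and salt are illustrative, not the experiment's actual design.

```python
import hashlib

WINDOW_HOURS = 2  # hypothetical switchback window length

def switchback_arm(region: str, epoch_hour: int, salt: str = "handoff-v1") -> str:
    """Deterministically assign a region x time-window unit to control or treatment.

    All orders in the same region during the same window share one arm, so
    network effects (dispatch, bundling) are measured at the unit level rather
    than leaking across individually randomized orders.
    """
    window = epoch_hour // WINDOW_HOURS
    digest = hashlib.sha256(f"{salt}:{region}:{window}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Example: every order in region "nyc-manhattan" during hours 100-101 shares one arm.
print(switchback_arm("nyc-manhattan", epoch_hour=100))
```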
The Trade-off
The models made ETAs more accurate but also longer on average — they shifted the E2E ETA distribution right. This created tension:
| Positive | Negative |
|---|---|
| ✅ Lower DinR | ❌ Lower conversion (longer ETAs shown) |
| ✅ Fewer late deliveries | ❌ Longer engaged time for drivers |
| ✅ Fewer care contacts | |
The question was whether fulfillment savings would offset conversion losses.
What Happened in Production
I transitioned to Cornell Tech during the phased rollout. The models were deployed, but from what I understand, they have not delivered the expected business value.
My Hypothesis (Reasoning Post-Hoc)
I no longer have visibility into the production data, but based on the system dynamics, I believe the benefits and costs roughly canceled out:
1. DinR reduction doesn't generate much business value at Grubhub — driver pay is heavily distance-based, not time-based. Reducing wait time saves less than expected.
2. Exception: regulated markets — In NYC, minimum pay laws require ~$17.96/hr for engaged time. California's Prop 22 requires 120% of local minimum wage for engaged time. DinR reduction matters more in these markets, where drivers must be paid for waiting, but they are a subset of the network (rough arithmetic after this list).
3. Longer ETAs hurt conversion — even though predictions were more accurate, showing longer ETAs at the top of the funnel likely reduced checkout rates.
4. Net effect: neutral — fulfillment savings may have been offset by conversion losses.
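A rough back-of-envelope for points 1 and 2, using only the NYC engaged-time rate cited above; this is a sketch of the reasoning, not production data.

```python
# Back-of-envelope: value of one minute of DinR reduction per order in a
# regulated market, using the NYC engaged-time rate cited above.
NYC_ENGAGED_RATE_PER_HR = 17.96

savings_per_order_minute = NYC_ENGAGED_RATE_PER_HR / 60  # ~$0.30 per order per minute
print(f"~${savings_per_order_minute:.2f} saved per order for each minute of wait removed")

# Outside regulated markets, driver pay is mostly distance-based, so the
# comparable per-minute savings are much closer to zero.
```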
To be clear: this is my hypothesis based on understanding the system, not confirmed with production data. The actual root cause may differ.
What I Learned
Evaluate ML Systems Holistically
The mistake: I optimized for food prep accuracy in isolation. But the food prep model was one input to the E2E ETA, which affected conversion, dispatch decisions, bundling efficiency, late delivery rate, care costs, and driver satisfaction.
Improving one component created trade-offs elsewhere that may have negated the gains.
What I do now: Before scoping any ML initiative, I map the component's dependencies:
- What systems does this feed into?
- What metrics will be affected downstream?
- How will we measure the full system impact, not just the component?
This ensures experiments are designed for holistic evaluation from the start.
Pressure-Test the Business Case
The mistake: DinR was the #1 complaint, so I assumed solving it was valuable. But the business impact was limited by driver pay structure (distance > time) and trade-offs with conversion.
What I do now: Before investing in a solution, I ask:
- Is this user problem tied to meaningful business value?
- What are the second-order effects of solving it?
- Are there structural reasons the impact might be smaller than it appears?
Not all user problems are equally valuable to solve. Validating the business case is as important as validating the solution.
Design for Production Uncertainty
The mistake: Experiment results showed promise, but production performance differed. Distribution shifts, system interactions, and edge cases can erode gains.
What I do now: I build in mechanisms for production uncertainty:
- Manual tuning controls (which I did include in this project)
- Phased rollouts with clear rollback criteria
- Monitoring dashboards that track the full system, not just the component
This project reinforced why those safeguards matter — even when experiments look good.