DoorDash Rebuilt Search Engine Twice to Overcome Speed's Hidden Costs

Original Title: Scaling Search Engineering at DoorDash: From Monoliths to Custom Search Engines with Satish Saley

The Hidden Costs of Speed: Why DoorDash Rebuilt Its Search Engine, Twice

In the hyper-growth world of DoorDash, the pursuit of speed and scale revealed a fundamental truth: quick fixes often create complex, long-term problems. This conversation with Satish Saley, a senior software engineer who spearheaded two major transformations of DoorDash's search platform, uncovers the non-obvious implications of relying on off-the-shelf solutions and the strategic necessity of building custom infrastructure. It highlights how seemingly small engineering decisions cascade into significant impacts on product velocity, operational cost, and competitive advantage. Engineers, product managers, and CTOs grappling with scaling challenges will find valuable lessons in understanding when to "buy" and when the long, difficult path of "building" is the only route to true, sustainable performance.

The Unseen Drag of Hosted Solutions

The initial journey at DoorDash, as Satish describes, was a classic startup sprint. The mission was clear: extract search from a monolithic application and make it scale. This led to the first major rebuild, introducing Kafka, Flink, and Elasticsearch. While a significant upgrade, it was always viewed as a stepping stone. The core problem wasn't just about handling more data; it was about the speed at which the business could iterate on its search experience. When a change to search behavior, like adapting to Gen Z slang or optimizing for convenience store item searches, meant changing the index and then spending months re-indexing, it directly throttled product innovation.

"So let's say if someone, if, if someone has an idea about like, 'Okay, how do you, how do I want to test out a change in the index? I want to use like a different kind of analyzers.' So those things like require you re-indexing a lot of data and that used to take like months before and that literally translates into, let's say a product problem."

This highlights a critical systems-level consequence: the engineering bottleneck directly became a business bottleneck. The "obvious" solution--using a hosted search engine like Elasticsearch--provided immediate functionality but created a hidden cost in development velocity. The time spent waiting for data re-indexing was time not spent on customer-facing improvements or new feature development. Compounded over time, this delay meant DoorDash responded to market shifts and user behavior more slowly than it could have with more agile indexing capabilities. The conventional wisdom of leveraging established tools, while sensible for initial traction, eventually created a drag that necessitated a fundamental shift.
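
To make the re-indexing cost concrete, here is a minimal sketch of why an analyzer change forces every document to be processed again. It is a toy inverted index in Python, not DoorDash's pipeline or the Elasticsearch API; the function names, the sample documents, and the slang table are all hypothetical and exist only for illustration. The terms stored in the index are produced by the analyzer at index time, so a new analyzer has no effect on documents until they are analyzed again.

```python
from collections import defaultdict

def whitespace_analyzer(text):
    """Baseline analyzer: lowercase, then split on whitespace."""
    return text.lower().split()

# Hypothetical slang table, invented for this example only.
SYNONYMS = {"za": "pizza", "bev": "beverage"}

def synonym_analyzer(text):
    """Baseline tokens plus the expanded form of any known slang token."""
    tokens = []
    for tok in whitespace_analyzer(text):
        tokens.append(tok)
        if tok in SYNONYMS:
            tokens.append(SYNONYMS[tok])
    return tokens

def build_index(docs, analyzer):
    """Build term -> set of doc ids. Every document passes through the analyzer."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return index

def search(index, query, analyzer):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyzer(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical catalog entries.
docs = {1: "Large pepperoni za", 2: "Cheese pizza slice", 3: "Cold bev"}

old_index = build_index(docs, whitespace_analyzer)
# Adopting the new analyzer means a full rebuild over all documents.
new_index = build_index(docs, synonym_analyzer)

print(search(old_index, "pizza", whitespace_analyzer))
print(search(old_index, "pizza", synonym_analyzer))
print(search(new_index, "pizza", whitespace_analyzer))
```

In this sketch, the old index only matches document 2 for "pizza" no matter which analyzer handles the query, because document 1 was stored under the term "za". Only the rebuilt index returns both documents. With three documents the rebuild is instant; over a large catalog the same full pass is the wait Satish describes.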

From Stepping Stone to Stumbling Block: The Limits of Elasticsearch

As DoorDash’s business matured, so did its complexity. The move from store search to item search, particularly with the explosion of convenience and grocery items, presented a scale problem that Elasticsearch struggled to meet efficiently. Thousands of items per store, compared to dozens of menu items, demanded a different approach to information retrieval. The "buy" decision that had served them well initially began to show its limitations.

The team realized that operating a critical system like search within a hosted environment meant relinquishing control over crucial parameters. This lack of fine-grained control became a significant barrier to achieving the desired performance and cost efficiencies. It became clear that scaling search 10x with the existing technology was simply not feasible. This wasn't just about adding more servers; it was about fundamental architectural constraints.

This realization paved the way for the second, more ambitious rebuild: replacing Elasticsearch with a custom search engine built on Apache Lucene. This decision, while more resource-intensive upfront, offered the promise of solving the core problem of iteration speed and operational control. The team leveraged internal expertise and a successful proof-of-concept on the autocomplete use case to build a compelling case. This demonstrates how a deep understanding of system limitations, coupled with practical validation, can justify a significant "build" investment. The custom solution, though initially more demanding, offered the potential for vastly superior performance and cost savings--a delayed payoff that creates a durable competitive advantage.

The Migration Maze: Where Value Is Truly Created

Both rebuilds underscore a profound insight often overlooked in technology: migrations are where the real engineering value is often created. It's not just about building new technology; it's about the painstaking, often unsung, work of transitioning customers--both internal and external--to a better system. Satish emphasizes that a successful migration isn't just about technical superiority (latency, availability) but also about enhanced operability and the ability to deliver greater value than the previous system.

"Unless you add that value addition, it's, it's really difficult for your customers to go to the new system, motivate them into the new system, using the new system. So yeah, I think that is something we, we like whenever someone is building something new, this is something that that should be on the radar. And there has to be like a very specific timeline on deprecation on the movement."

This perspective frames the rebuilds not as mere technical exercises, but as strategic initiatives to unlock future business potential. The custom search engine wasn't just faster; it was designed to be more tunable, more operable, and capable of handling nuanced use cases like contextual autocomplete and understanding abbreviations. This focus on enabling internal customers (product teams) to derive more value, faster, is a powerful example of systems thinking. The immediate pain of a complex rebuild is directly linked to the long-term advantage of a more agile and performant platform. It’s about recognizing that the "buy" option, while expedient, can eventually lead to a dependency that stifles growth, whereas "build" can create a unique, defensible capability.

The Unsung Value of Legacy

Satish’s reflection on his early days at DoorDash offers a crucial counterpoint to the common disdain for legacy systems. He notes that systems like the initial monolith, while perhaps unpleasant to work with, were instrumental in propelling the company to its current state. This perspective is vital for engineers and leaders:

"You, you don't have to be critical of the, the, the legacy systems because those were the systems which propelled your company, your product till that point. So they were supposed, they were doing what they were supposed to do. And it's an, it's an opportunity for you to taking that and like molding it in a different way and write the, write the next story for your team, for your company."

This mindset shift--from criticism to opportunity--is foundational for effective system evolution. It recognizes that past decisions were likely optimal given the context and constraints of their time. The challenge, and the opportunity, lies in understanding that context and building the next story. This acceptance of history, combined with a forward-looking approach to innovation, is key to navigating complex technical landscapes. It means appreciating that even seemingly "poor engineering" served a purpose and provides a foundation upon which to build something better, rather than starting from scratch in a vacuum.


Key Action Items

  • Prioritize Velocity: When evaluating technology choices, weigh the impact on development velocity as heavily as direct monetary cost. If a "buy" solution significantly slows down iteration, consider the long-term implications. (Immediate Action)
  • Validate "Buy" Limitations: Regularly assess whether off-the-shelf solutions meet evolving business needs, especially regarding control, tunability, and operational overhead. Don't let initial convenience become a long-term constraint. (Ongoing Assessment)
  • Frame Rebuilds for Business Impact: When proposing significant technical overhauls, translate engineering challenges into clear business benefits (e.g., faster feature delivery, reduced operational costs, improved customer experience) to secure leadership buy-in. (Strategic Communication)
  • Invest in Migration Planning: Allocate significant resources and time not just to building new systems, but to the crucial phase of migrating users and data, ensuring value addition and clear deprecation paths. (Project Planning)
  • Leverage Internal Expertise for POCs: For ambitious "build" decisions, use targeted Proofs of Concept (POCs) on specific use cases to gather practical data and demonstrate feasibility before committing to a full rebuild. (Technical Validation)
  • Appreciate Legacy Systems: Approach existing systems with an understanding of their historical context and the value they provided. Frame future work as an evolution and improvement, not just a replacement. (Cultural Mindset)
  • Contextualize AI Tooling: Utilize AI tools for learning and onboarding, but remember that understanding the unique history, culture, and existing systems of a large organization like LinkedIn is a critical, non-AI-driven process. (Onboarding Strategy)

---
This content is a personally curated review and synopsis derived from the original podcast episode.