Shopify's Sidekick: Embedding Human Opinion to Elevate AI Development

TL;DR

Shopify's Sidekick AI assistant uses LLMs to generate synthetic data and evaluation sets, which speeds up feature development while keeping it reliable. Human opinions embedded in the ground truth data serve as the new specification for AI.
AI enables the creation of bespoke user interfaces and custom applications for merchants, moving beyond static designs to personalized experiences that adapt to individual business needs and workflows.
Shopify uses LLMs to build a standardized product taxonomy and attribute system, improving data consistency across millions of merchants and enabling better product surfacing in partner ecosystems like OpenAI.
Developing AI agents requires significant investment in creative data generation and evaluation processes, including negative case grading, to prevent hallucinations and ensure consistent, valuable user interactions.
The future of e-commerce AI involves integrating third-party apps into AI workflows through "app intents," allowing AI assistants like Sidekick to orchestrate actions across multiple services for merchants.
Shopify's approach to AI development prioritizes human oversight and opinion integration into ground truth sets, ensuring AI features align with user expectations and raise the overall quality ceiling.

Deep Dive

Shopify's AI assistant, Sidekick, has moved beyond basic AI features to deliver consistent, high-value functionality by combining commerce knowledge with stronger reasoning capabilities. Built on a couple of years of foundational work and a new architecture, it lets merchants carry out complex tasks such as product creation and store editing through natural language.

The core of Sidekick's success and its ability to avoid common AI pitfalls like hallucination lies in its rigorous evaluation process. Shopify employs a creative approach using LLMs to grade other LLMs, generating synthetic data and "negative cases" to train "judges" to identify and penalize incorrect or undesirable outputs. This system, where human opinions are embedded into the ground truth data, serves as the new specification for AI development, ensuring that AI features align with desired outcomes and continuously improve. While LLMs accelerate development, the human element remains critical, constantly refreshing the ground truth to set and raise the performance ceiling.
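The evaluation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Shopify's implementation: `GroundTruthCase`, `keyword_judge`, `judge_report`, and the sample conversations are all hypothetical, and `keyword_judge` is a trivial stand-in for what would be an LLM judge call in practice. The sketch measures how often a judge agrees with human-edited verdicts and how many negative cases it catches.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class GroundTruthCase:
    """One human-reviewed example: prompt, assistant reply, and the human verdict."""
    prompt: str
    reply: str
    human_pass: bool  # False marks a negative case the judge should reject


def keyword_judge(prompt: str, reply: str) -> bool:
    """Hypothetical stand-in for an LLM judge: fail replies that drift off-topic."""
    off_topic = {"car", "mortgage"}
    tokens = set(re.findall(r"[a-z]+", reply.lower()))
    return not (tokens & off_topic)


def judge_report(
    cases: List[GroundTruthCase],
    judge: Callable[[str, str], bool],
) -> Dict[str, Optional[float]]:
    """Compare judge verdicts with human verdicts across the ground truth set."""
    agree = sum(judge(c.prompt, c.reply) == c.human_pass for c in cases)
    negatives = [c for c in cases if not c.human_pass]
    caught = sum(not judge(c.prompt, c.reply) for c in negatives)
    return {
        "agreement": agree / len(cases),
        "negative_recall": caught / len(negatives) if negatives else None,
    }


# Illustrative ground truth set (invented examples, edited by a human in this scenario).
CASES = [
    GroundTruthCase("Create a product for my new mug",
                    "I drafted a product listing for your ceramic mug.", True),
    GroundTruthCase("How do I add a discount?",
                    "Would you like to buy a used car today?", False),
    GroundTruthCase("Edit my store banner",
                    "Your banner now promotes our mortgage rates.", False),
    # A fabricated statistic: the human marks it as a failure, but the naive judge misses it.
    GroundTruthCase("How is my store doing?",
                    "Your store had 9,000 orders yesterday.", False),
]

if __name__ == "__main__":
    print(judge_report(CASES, keyword_judge))
```

The last case is deliberately one that the naive judge gets wrong. A gap like this between judge verdicts and human verdicts is the signal that the judge, or its training set of negative cases, needs more variety.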

Sidekick's impact extends to user interface design. The platform is exploring how LLMs can generate personalized UI elements for individual merchants, moving beyond static interfaces to create bespoke workflows and applications. This capability allows non-technical users to essentially "vibe code" custom solutions, a power previously reserved for developers. Separately, Shopify uses LLMs to standardize and categorize product data at scale, a critical challenge in e-commerce where metadata can be inconsistent across millions of merchants. By automatically identifying product categories and attributes from images and descriptions, Shopify improves product discoverability and makes it easier for external AI services to represent merchants' products accurately.
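To make the idea of a standardized taxonomy concrete, here is a minimal sketch of checking a predicted category and attribute set against a taxonomy tree. All names (`TaxonomyNode`, `resolve`, `validate_prediction`) and the sample categories are hypothetical, and the rule that a category inherits its ancestors' attributes is an assumption made for illustration, not a description of Shopify's taxonomy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class TaxonomyNode:
    """A category in the tree, with the attribute names it introduces."""
    name: str
    attributes: Set[str] = field(default_factory=set)
    children: Dict[str, "TaxonomyNode"] = field(default_factory=dict)

    def add(self, child: "TaxonomyNode") -> "TaxonomyNode":
        self.children[child.name] = child
        return child


def resolve(root: TaxonomyNode, path: List[str]) -> Optional[TaxonomyNode]:
    """Walk a category path such as ["Apparel", "Shoes"]; None if any step is unknown."""
    node: Optional[TaxonomyNode] = root
    for name in path:
        node = node.children.get(name)
        if node is None:
            return None
    return node


def allowed_attributes(root: TaxonomyNode, path: List[str]) -> Set[str]:
    """Assumption: a category accepts its own attributes plus those of its ancestors."""
    allowed = set(root.attributes)
    node = root
    for name in path:
        node = node.children[name]
        allowed |= node.attributes
    return allowed


def validate_prediction(
    root: TaxonomyNode, path: List[str], attrs: Dict[str, str]
) -> Dict[str, str]:
    """Reject unknown categories and drop attributes the taxonomy does not define."""
    if resolve(root, path) is None:
        raise ValueError(f"unknown category path: {' > '.join(path)}")
    allowed = allowed_attributes(root, path)
    return {key: value for key, value in attrs.items() if key in allowed}


# Tiny illustrative tree (invented categories and attributes).
ROOT = TaxonomyNode("root")
_apparel = ROOT.add(TaxonomyNode("Apparel", {"color", "material"}))
_apparel.add(TaxonomyNode("Shoes", {"shoe_size"}))

if __name__ == "__main__":
    # Stand-in for a model's prediction from a product image and description.
    predicted = {"color": "red", "shoe_size": "42", "horsepower": "300"}
    print(validate_prediction(ROOT, ["Apparel", "Shoes"], predicted))
```

Validating predictions against a fixed tree is what keeps the output consistent across merchants: a model may phrase things freely, but only categories and attributes that exist in the taxonomy survive.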

The future of Sidekick involves integrating its capabilities with Shopify's extensive app ecosystem, allowing merchants to access third-party tools directly through AI-driven conversations. This "app intent" system will enable Sidekick to orchestrate complex workflows that span multiple applications, further streamlining merchant operations. Architecturally, Sidekick is built with an MCP-like schema, providing developers with robust tools and frameworks to build and integrate their own applications, thereby democratizing advanced e-commerce functionality and empowering both novice merchants and experienced developers within the Shopify ecosystem.
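As a rough sketch of what an intent registry with a declared schema might look like, consider the Python below. It is purely illustrative: `Intent`, `IntentRegistry`, the `reviews.request_review` intent, and its parameters are hypothetical names and do not reflect Shopify's actual App Intents API or the MCP specification. The sketch shows an app declaring a tool with typed parameters, and an assistant invoking it only after the arguments pass validation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Intent:
    """A tool an app exposes: a name, a description, and required typed parameters."""
    name: str
    description: str
    params: Dict[str, type]
    handler: Callable[..., Any]


class IntentRegistry:
    """Hypothetical registry an assistant could consult to discover and call app tools."""

    def __init__(self) -> None:
        self._intents: Dict[str, Intent] = {}

    def register(self, intent: Intent) -> None:
        if intent.name in self._intents:
            raise ValueError(f"intent already registered: {intent.name}")
        self._intents[intent.name] = intent

    def describe(self) -> List[Dict[str, Any]]:
        """Schema listing that could be shown to a model when it chooses a tool."""
        return [
            {
                "name": intent.name,
                "description": intent.description,
                "params": {key: typ.__name__ for key, typ in intent.params.items()},
            }
            for intent in self._intents.values()
        ]

    def invoke(self, name: str, **kwargs: Any) -> Any:
        """Validate arguments against the declared schema before calling the app."""
        intent = self._intents.get(name)
        if intent is None:
            raise KeyError(f"unknown intent: {name}")
        if set(kwargs) != set(intent.params):
            raise ValueError(f"expected params {sorted(intent.params)}, got {sorted(kwargs)}")
        for key, expected in intent.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return intent.handler(**kwargs)


def request_review(order_id: str, delay_days: int) -> str:
    """Invented third-party app action used only for this example."""
    return f"Review request for order {order_id} scheduled in {delay_days} days"


registry = IntentRegistry()
registry.register(
    Intent(
        name="reviews.request_review",
        description="Ask a customer to review a fulfilled order.",
        params={"order_id": str, "delay_days": int},
        handler=request_review,
    )
)

if __name__ == "__main__":
    print(registry.describe())
    print(registry.invoke("reviews.request_review", order_id="1001", delay_days=3))
```

Checking arguments against the declared schema before calling the handler matters in this setting because the caller is a model rather than a human: malformed or invented arguments are rejected at the boundary instead of reaching the third-party app.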

Action Items

Audit Sidekick's evaluation process: Define 3-5 negative test cases to identify and mitigate LLM hallucinations.
Create app intent schema: Register 5-10 core app functionalities for Sidekick to access and utilize in merchant workflows.
Design personalized UI generation strategy: Explore generating bespoke UI components for 3-5 merchant workflows based on user needs.
Measure product data standardization impact: Track 5-10 product categories to quantify improvements in attribute accuracy and categorization.
Draft runbook template for AI feature development: Define 5 required sections (e.g., evaluation criteria, data sources, human oversight) to standardize AI feature development.

Key Quotes

"I did engineering in university here in Waterloo in Canada. I've always loved building things and so that was a very natural post-secondary education for me to have. From there, I did a couple of startups. I think this is back in the time before startups were that cool, so it was not that in vogue to do startups. It was a little bit, I felt like a little bit of a lone wolf."

Vanessa Lee explains her early career path, highlighting a passion for building and an independent spirit during a less popular era for startups. This demonstrates her foundational drive and willingness to pursue unconventional paths.

"We've just grown and grown and grown and become truly the operating system of merchants' businesses. I've worked on quite a few parts of our platform, online store and Liquid, some of the Horizon updates that you chatted with Glenn about. I had worked with the team on quite a lot, so it has been a firehose over the last six months, but perhaps one that I had already in some places dabbled in."

Vanessa Lee describes Shopify's evolution into a comprehensive business operating system, reflecting on her own broad experience across various platform components. This illustrates the expansive scope of Shopify's offerings and her deep involvement in its development.

"So the last couple of years, we've been working a lot on Sidekick. When we kind of came out earlier this year with a new architecture of Sidekick, we started seeing Sidekick be a lot more successful in most conversations. So I purposely kind of waited and held back the team from talking and shouting too much about Sidekick until I thought that it really drove some value."

Vanessa Lee details the development and strategic rollout of Shopify's AI assistant, Sidekick, emphasizing a deliberate approach to launching only when the product demonstrated tangible value. This shows her product management philosophy of prioritizing proven utility over early promotion.

"So one of the ways that we did that, we put a lot of work into the foundations of evaluations, but you also make sure that you have enough variety in that judge's training set where you also have essentially negative cases and grading it negatively so that the judge is also able to spot when it answered and tried to sell you a car, which we absolutely do not want."

Vanessa Lee explains a crucial aspect of AI development: the meticulous creation of evaluation datasets, including negative examples, to train AI judges. This highlights the rigorous, detail-oriented work required to ensure AI models behave as intended and avoid undesirable outputs.

"So if you think about how do we make sure that people are actually, people like on Shopify's side, like our opinions of how it should be, how it should act, those are all actually embodied in the ground truth set. And so that's kind of how you go from human, like it's not just LLMs making LLMs. We used LLMs creatively to help us scale. But for example, I would have a team say, okay, go and generate a bunch of conversations between Sidekick and, let's say, an LLM. And you take those conversations, you just edit them. And so that is the human in the loop."

Vanessa Lee clarifies the role of human oversight in AI development: LLMs generate candidate conversations at scale, and people at Shopify edit them so that their opinions of how Sidekick should behave are embodied in the ground truth set. This illustrates a hybrid approach where AI assists in scaling data creation, but human judgment remains central to defining desired AI behavior.

"And so one of the things that we, we did actually starting a couple years ago was use LLMs to start to properly categorize products and properly create attributes. So this is where I'm super proud of one of these launches. We've kind of worked on it behind the scenes over the years, but last year we actually basically embedded these predictions into Shopify."

Vanessa Lee discusses the application of LLMs to product categorization and attribute creation within Shopify, emphasizing the long-term development and eventual integration of these AI-driven predictions. This showcases how AI is being used to standardize and enhance product data across the platform.

Resources

External Resources

Books

"The Pragmatic Programmer" by Andrew Hunt and David Thomas - Mentioned as a foundational text for developers.

Articles & Papers

"How to convert empty to null in PostgreSQL" (Stack Overflow) - Mentioned as a question answered by Erwin Brandstetter.

People

Vanessa Lee - VP of Product at Shopify.
Ryan Donovan - Host of the Stack Overflow podcast and editor of the blog.
Glenn Coates - Mentioned as a previous guest from Shopify.
Tobi Lütke - Shopify's CEO, mentioned as having put forward an early vision video for Sidekick.
Ilya Grigorik - Distinguished Engineer at Shopify, discussed micro frontends and components.
Erwin Brandstetter - Winner of a great answer badge on Stack Overflow for an answer about converting empty to null in PostgreSQL.

Organizations & Institutions

Shopify - Company where Vanessa Lee is VP of Product, and the primary subject of discussion regarding their AI assistant, Sidekick, and developer platform.
MongoDB - Mentioned as a database solution built for developers, fluent in AI, and ACID compliant.
OpenAI - Mentioned as a partner with Shopify for surfacing merchants' products.
Stack Overflow - Host of the podcast and platform where technical questions are discussed.
Etsy - Mentioned in relation to machine learning and product categorization.

Websites & Online Resources

mongodb.com/build - URL provided for starting to build faster with MongoDB.
Stack Overflow - Mentioned as the platform hosting the podcast and where technical questions are answered.
LinkedIn - Platform where Ryan Donovan can be reached directly.
X (formerly Twitter) - Platform where Vanessa Lee can be found as @vlaurenlee.

Other Resources

Sidekick - Shopify's AI assistant designed to help merchants manage their businesses, including creating products and editing their online store.
Polaris - Mentioned in relation to Shopify's front-end components.
GraphQL - Mentioned as an API technology used by Shopify.
MCP (Model Context Protocol) - An open protocol for exposing tools and context to AI models, mentioned as similar in schema definition to Sidekick's architecture and relevant to developers building apps on Shopify.
Shopify CLI - A tool that helps developers create test environments, applications, and specify app functionalities.
App Intents - A feature being released by Shopify to allow Sidekick to use third-party app tools in conversations and workflows.
Micro frontends - A software architecture concept discussed in relation to Shopify's development.
Headless storefronts - An architectural approach for e-commerce mentioned in the context of Shopify's flexibility.
Theme Store - Shopify's marketplace for pre-designed themes that merchants can customize.
Taxonomy Tree - A hierarchical classification system used by Shopify for categorizing products.
AI Agent - A concept discussed in the context of building scalable AI assistants like Sidekick.
LLMs (Large Language Models) - The underlying technology powering AI assistants like Sidekick, used for tasks like data generation, evaluation, and UI interaction.
Ground Truth Set - A dataset used for training and evaluating AI models, incorporating human opinions and preferences.
Evaluation Set - A collection of data used to assess the performance and accuracy of AI models.
Synthetic Data - Data generated artificially, often by LLMs, used to supplement real-world data for training and testing.
Vibe Coding - Building software by describing the desired result to an AI in natural language and letting it generate the code, discussed as a way for non-technical merchants to create custom interfaces and applications.