
ElevenLabs' Parallel Research and Product Strategy Drives Voice AI Innovation

TL;DR

  • ElevenLabs parallelizes research and product development, giving customers a 6-12 month research head start while the company simultaneously builds a robust product layer and ecosystem.
  • The company's focus on foundational audio models, rather than solely on end-user applications, creates a defensible advantage by enabling seamless, human-controllable voice experiences.
  • Voice AI's potential to break down global language barriers through real-time dubbing and translation will unlock immense market opportunities beyond static content delivery.
  • Agentic AI is shifting from reactive customer support to proactive, integrated experiences that guide users through discovery, checkout, and personalized interactions.
  • The future of education will be revolutionized by personalized AI tutors, blending on-demand learning with essential human-to-human interaction for holistic development.
  • ElevenLabs' approach to voice quality emphasizes not just technical benchmarks but also a "voice sommelier" service to match specific branding and customer needs.
  • The company is investing in fused speech-to-speech models to achieve more expressive conversational interactions, moving beyond today's cascaded approaches.

Deep Dive

ElevenLabs is fundamentally reshaping human-technology interaction by building foundational audio models that enable seamless voice creation and understanding, positioning voice as the ultimate interface for a wide range of applications. The company has achieved rapid growth, reaching $300 million in annual recurring revenue with a 50/50 split between its self-serve creative platform and enterprise agent platform, serving over 5 million monthly active users on the creative side and thousands of enterprise customers.

The core of ElevenLabs' strategy lies in its parallel development of cutting-edge research and practical product applications, a dual approach necessitated by the limitations of existing audio models. Initial attempts to use off-the-shelf models for narration and dubbing produced sub-par, robotic speech, prompting the company to invest heavily in in-house research. This research has yielded significant advancements in text-to-speech, speech-to-text, and orchestration mechanisms, allowing ElevenLabs to outperform competitors on benchmarks and create more human-like, controllable, and expressive audio experiences. The company structures its efforts into specialized "labs"--such as a voice lab and an agent lab--which combine research, engineering, and operational expertise to tackle specific problems, starting with voice creation and then expanding into interactive agents and other modalities like music and visual integration.

ElevenLabs' impact extends across multiple domains through its product offerings. The creative platform empowers individuals and businesses to generate high-quality audio for narrations, voiceovers, and dubbing, with the vision of breaking down language barriers through real-time, natural-sounding translation that preserves original intonation and emotion. The agent platform focuses on elevating customer experience, moving beyond reactive support to proactive engagement. Use cases range from e-commerce navigation and personalized recommendations to immersive media experiences, such as enabling millions of Fortnite players to interact with Darth Vader, and revolutionizing education through personalized AI tutors that can embody historical figures or expert instructors like chess grandmasters or FBI negotiators. The company is also exploring the potential of AI in government services, partnering with Ukraine's Ministry of Digital Transformation to develop digital agents for citizen support and information dissemination.
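To make the dubbing workflow concrete, here is a minimal sketch of how such a pipeline could be wired together, assuming hypothetical stub functions for transcription, translation, and synthesis (they are placeholders, not ElevenLabs API calls): transcribe and diarize the source audio, translate each segment, then re-synthesize it in the original speaker's cloned voice within the original timing.

```python
# Illustrative sketch of a dubbing pipeline that preserves each original speaker's voice.
# All helpers are hypothetical stubs standing in for real STT, translation, and TTS
# services; this is not the ElevenLabs API.
from dataclasses import dataclass


@dataclass
class Segment:
    speaker_id: str   # which original speaker the segment belongs to
    start_s: float    # start time in the source audio, in seconds
    end_s: float      # end time, in seconds
    text: str         # source-language transcript


def transcribe_audio(audio_path: str) -> list[Segment]:
    """Stub: speech-to-text with speaker diarization."""
    raise NotImplementedError("plug in a real STT service here")


def translate_text(text: str, target_lang: str) -> str:
    """Stub: machine translation."""
    raise NotImplementedError("plug in a real translation model here")


def synthesize_speech(text: str, voice_id: str, max_duration_s: float) -> bytes:
    """Stub: TTS conditioned on a cloned voice, constrained to the original timing."""
    raise NotImplementedError("plug in a real TTS service here")


def dub_episode(audio_path: str, target_lang: str) -> list[bytes]:
    """Transcribe, translate, and re-synthesize each segment in the original
    speaker's voice so identity, intonation, and pacing carry across languages."""
    clips = []
    for seg in transcribe_audio(audio_path):
        translated = translate_text(seg.text, target_lang)
        clips.append(synthesize_speech(translated, seg.speaker_id, seg.end_s - seg.start_s))
    return clips
```

Carrying the speaker ID and segment timing alongside the text is what lets identity and pacing survive the language switch; the voice model itself is responsible for preserving intonation and emotion.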

The company's strategic differentiation lies in its focus on building a robust ecosystem around its foundational models. While acknowledging the eventual commoditization of base models, ElevenLabs emphasizes the enduring value of its product layer, integrations, brand, and distribution. This approach allows them to offer a unique advantage by parallelizing research and product development, ensuring that cutting-edge capabilities are rapidly translated into functional applications for customers. Their ongoing research initiatives include controllable text-to-speech and speech-to-text models for numerous languages, music generation, and the integration of audio with visual modalities. For agents, the focus is on optimizing real-time speech processing, developing new orchestration mechanisms that incorporate emotional context, and exploring fused speech-to-speech approaches for more expressive and conversational interactions.
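The cascaded approach mentioned above can be pictured as a simple per-turn loop: transcribe the caller's audio, pass both the transcript and a rough read on how it was said to the LLM, then synthesize the reply in a matching style. The sketch below is a minimal illustration under those assumptions; the stub functions stand in for real speech-to-text, emotion-detection, LLM, and text-to-speech components and do not correspond to any vendor's actual API. A fused speech-to-speech model would collapse these stages into a single model, which is what makes it attractive for expressiveness.

```python
# Minimal sketch of one turn in a cascaded voice agent: STT -> LLM -> TTS.
# The four stubs below are hypothetical placeholders, not real library calls;
# a production system would stream audio incrementally rather than work per turn.

def transcribe(audio: bytes) -> str:
    raise NotImplementedError("speech-to-text goes here")


def detect_emotion(audio: bytes) -> str:
    raise NotImplementedError("paralinguistic classifier goes here, e.g. returns 'frustrated'")


def generate_reply(prompt: str) -> str:
    raise NotImplementedError("LLM call goes here")


def synthesize(text: str, style: str) -> bytes:
    raise NotImplementedError("text-to-speech with a delivery-style hint goes here")


def agent_turn(caller_audio: bytes) -> bytes:
    """One request/response cycle. Passing *how* something was said (the emotion
    label) alongside *what* was said lets the reply match the caller's tone."""
    words = transcribe(caller_audio)
    tone = detect_emotion(caller_audio)
    reply = generate_reply(
        f"The caller sounds {tone}. They said: {words!r}. "
        "Respond helpfully and in an appropriate tone."
    )
    return synthesize(reply, style=tone)
```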

Ultimately, ElevenLabs foresees a future where voice is the primary interface for technology, driving significant shifts in education, entertainment, and daily life. While acknowledging the rise of AI companions, the company's excitement centers on the potential for "super assistants" that provide proactive, personalized support. They believe AI tutors will become a cornerstone of education, complementing human interaction, and that robots will increasingly become personified, with voice serving as their primary communication channel. The most profound yet unrealized impact, in their view, will be in education, where AI-powered, on-demand tutors will offer personalized learning experiences, potentially even embodying beloved historical educators.

Action Items

  • Audit audio models: Evaluate 3-5 core text-to-speech and speech-to-text models for accuracy and latency across 100 languages.
  • Create voice quality evaluation rubric: Define 5-7 criteria for assessing audio model output, focusing on naturalness and emotional context (see the sketch after this list).
  • Implement agent orchestration framework: Design a system to integrate speech-to-text, LLMs, and text-to-speech for real-time interactive agents.
  • Develop proactive customer engagement strategy: Pilot an agent-led discovery and checkout experience for 2-3 e-commerce product categories.
  • Build educational AI tutor prototype: Create a voice-based tutor for a specific subject (e.g., chess, negotiation) to test interactive learning efficacy.
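
As a starting point for the evaluation rubric item above, here is a minimal sketch of a weighted scoring scheme. The five criteria and their weights are assumptions chosen for illustration, not an established ElevenLabs rubric; in practice each criterion would be scored per test utterance by human raters or automated proxy metrics.

```python
# Illustrative voice-quality rubric with weighted criteria (assumed values, not official).
from dataclasses import dataclass, field


@dataclass
class VoiceQualityRubric:
    # Each criterion is rated 1-5 by a human listener or an automated proxy metric.
    weights: dict[str, float] = field(default_factory=lambda: {
        "naturalness": 0.30,        # does it sound human rather than robotic?
        "emotional_context": 0.25,  # does the delivery match the intended emotion?
        "pronunciation": 0.15,      # names, numbers, domain-specific terms
        "pacing": 0.15,             # pauses, rhythm, sentence boundaries
        "latency": 0.15,            # time-to-first-audio, critical for interactive agents
    })

    def score(self, ratings: dict[str, int]) -> float:
        """Weighted average on a 1-5 scale; fails loudly if a criterion is missing."""
        missing = set(self.weights) - set(ratings)
        if missing:
            raise ValueError(f"missing ratings for: {sorted(missing)}")
        return sum(self.weights[c] * ratings[c] for c in self.weights)


# Example: score one candidate model's aggregated ratings.
rubric = VoiceQualityRubric()
print(rubric.score({"naturalness": 4, "emotional_context": 5,
                    "pronunciation": 4, "pacing": 4, "latency": 3}))  # -> 4.1
```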

Key Quotes

"we build foundational audio models so models in the space to help you create speech that sounds human understand speech in a much better way or orchestrate all those components to make it interactive and then built products on top of that foundational models and we have our creative product which is a platform for helping you with narrations for audiobooks with voiceovers for ads or movies or dubs of those movies to other languages and our agent's platform product which is effectively an offering to help you elevate customer experience built an agent for personal ai education new ways of immersive immersive media but all this kind of under under light of that mission of solving how we can interact with technology on our terms in a better way"

Mati Staniszewski explains that ElevenLabs develops core audio models to enable human-like speech creation and understanding, which then form the basis for their products. These products include a creative platform for audio content and an agents platform designed to enhance customer experiences through AI-powered interactions. Staniszewski emphasizes their mission is to improve human-technology interaction on user terms.


"the original belief came from poland it's a it's a very peculiar thing but if you if you watch a movie in polish language a foreign movie in polish language all the voices whether it's a male voice or female voice are narrated with one single character so you have like a flat delivery for everything in a movie that's a terrible experience it is a terrible experience and it's still you like if you grow up there the the as soon as you learn english you're like switch off and you don't want to watch content in this way and it is crazy that it's still happens until today in this way for majority of of content combining that and i worked with palantir my co founder worked at google we knew that that will change in the future and that all the information will be available globally and then as we started digging further we realized in every language in a high quality way exactly and the big thing was like instead of having it just translated um could you have the original voice original motions original intonation carried across so like imagine having this podcast but say people could switch it over to spanish and they still hear sarah they still hear madi and the same voice the same the same delivery which is kind of exactly what we did with lex back when he interviewed narendra modi and you could like kind of immerse yourself in that story a lot better"

Mati Staniszewski shares the origin of ElevenLabs' vision, stemming from a poor dubbing experience in Poland where foreign films used a single, flat voice. He highlights the desire to move beyond simple translation to preserving original intonation and emotion, using the example of a podcast that could be heard in Spanish with the original speakers' voices intact. Staniszewski believed this would be a significant future development for global information access.


"we realized pretty quickly that the models that existed just produced such a robotic and and not not good speech that people would not want to listen to it and that's where my co founder's genius came in where he was able to assemble the team and and do a lot of the research himself to actually create new version of of creating that work but like to your question i think the the way we are kind of organizing internally and how we think about sequencing a lot of that was looking at the first problem and then creating effectively a lab around that problem which is like a combination of mighty researchers engineers operators to go after that problem and the first problem was the problem of voice so how can we recreate the voice and like i say it needs to have that research expertise to be able to do that well so we started with effectively a voice lab which was that mission of can we narrate the work in in in better way"

Mati Staniszewski explains that ElevenLabs initially found existing speech models to be too robotic and unsatisfactory for their intended use cases. He credits his co-founder's expertise in assembling a team and conducting research to develop superior models. Staniszewski describes their approach of creating specialized "labs" to tackle specific problems, starting with a "voice lab" focused on improving narration quality through dedicated research.


"and then a lot of the technology wasn't possible for you to be able to um recreate a specific voice or be able to create that in that high quality way and then of course as we dived into further and and shifted away from the static piece the whole interactive piece is still crazy in the way it functions where most of us see this technological evolution over the last decades but you still will spend most of your time on a keyboard you will look at a screen and that interface feels broken it should be where you can communicate with the devices through through speech through the most natural interface there is one that can started when the humanity started and um and we realized we want to we want to solve that and i think now fast forward from 2022 i feel like many people will carry that belief too and that voice is the interface of the future as you think about the devices around us whether it's smartphones with computers whether it's robots speech will be one of the key ones but i think in 2022 it wasn't and um and as we think about the market for the creative side or whether for that interactive side it was like very clear it will be a huge a huge huge one"

Mati Staniszewski discusses the limitations of past technology in recreating specific voices with high quality and the persistent reliance on keyboard and screen interfaces. He argues that speech is the most natural interface for human-device communication and that this will become increasingly prevalent, even for devices like robots. Staniszewski asserts that by 2022, voice was poised to become a dominant interface, indicating a massive market potential for both creative and interactive applications.


"and then on the agent side you you you are like some of these things that of course customers that we work with or partners will will want to integrate which is we want integrations with xyz systems but then there are like other parts that might not be as easy to predict of as you interactive technology of course want to understand what's happening but you also want to understand how the things are being said and bring that into the fold which would be something we tried to prioritize on our side so then the people when they actually interact with the technology they realize oh expressive thing is actually so so much more enjoyable and beneficial and helpful"

Mati Staniszewski highlights that while customers often request specific system integrations for their agent platforms, ElevenLabs also prioritizes understanding the nuances of how things are said, not just what is said. He explains that by incorporating this emotional context and expressiveness into interactions, users find the technology more enjoyable, beneficial, and helpful. This focus on expressive communication is a key differentiator for their agent platform.


"i think this still the biggest one that hasn't yet kicked into the the system is like how education will be on i think this will be like i think learning with ai will with voice where it's like on your headphone or in a speaker it's just going to be such a big thing where you have like your own teacher on demand and who understands you very personified and kind of delivers the right content through your life i think this will

Resources

External Resources

Books

  • "The Hitchhiker's Guide to the Galaxy" - Mentioned as a conceptual reference for real-time translation technology.

People

  • Mati Staniszewski - Co-founder and CEO of ElevenLabs, discussed for his insights on voice AI, its applications, and the company's growth.
  • Elad Gil - Co-host of the No Priors podcast, mentioned as someone to follow on Twitter.
  • Sarah Guo - Co-host of the podcast "No Priors: Artificial Intelligence | Technology | Startups," who interviewed Mati Staniszewski.
  • Narendra Modi - Mentioned in relation to a past interview where voice technology was used.
  • Magnus Carlsen - Mentioned as a potential AI tutor for chess education.
  • Hikaru Nakamura - Mentioned as a potential AI tutor for chess education.
  • Chris Voss - Mentioned as an FBI negotiator whose MasterClass lesson can be experienced interactively.
  • Richard Feynman - Mentioned as a potential historical figure to deliver lecture notes through AI.

Organizations & Institutions

  • ElevenLabs - Company discussed for its voice AI technology, foundational audio models, and products like the creative platform and agents platform.
  • Google - Mentioned as a company that Mati Staniszewski's co-founder previously worked at.
  • Palantir - Mentioned as a company where Mati Staniszewski previously worked, and as a potential partner for digital transformation.
  • Midjourney - Mentioned as a company in the creation space alongside ElevenLabs.
  • Suno - Mentioned as a company in the creation space alongside ElevenLabs.
  • Hesian - Mentioned as a company in the creation space alongside ElevenLabs.
  • Cisco - Mentioned as a company elevating customer support with AI.
  • Twilio - Mentioned as a company elevating customer support with AI.
  • Telus Digital - Mentioned as a company elevating customer support with AI.
  • Myntra - Mentioned as a large e-commerce shop in India using ElevenLabs' agents platform for customer experience.
  • Square - Mentioned as a company enabling businesses to use ElevenLabs' platform for discovery experiences.
  • Epic Games - Mentioned for working with ElevenLabs to bring Darth Vader's voice to Fortnite.
  • Chess.com - Mentioned for working with ElevenLabs to create AI chess tutors.
  • MasterClass - Mentioned for working with ElevenLabs to create interactive educational experiences.
  • Ministry of Digital Transformation (Ukraine) - Mentioned for creating the first government agent using ElevenLabs' technology.
  • Sierra - Mentioned as a use-case oriented company in the AI agent space.
  • OpenAI - Mentioned as a foundation model company and a potential competitor/alternative for AI capabilities.
  • Microsoft - Mentioned as a company that may optimize for things other than foundational models.

Websites & Online Resources

  • No Priors: Artificial Intelligence | Technology | Startups - The podcast where the discussion took place.
  • Twitter - Mentioned as a platform for following individuals and the podcast.
  • Apple Podcasts - Mentioned as a platform to subscribe to the podcast.
  • Spotify - Mentioned as a platform to subscribe to the podcast.
  • NoPriors.com - Website for finding transcripts and signing up for new episodes.

Other Resources

  • Foundational Audio Models - Core technology developed by ElevenLabs for creating human-like speech and understanding speech.
  • Creative Product (ElevenLabs) - A platform for narrations, audiobooks, voiceovers, and movie dubs.
  • Agents Platform (ElevenLabs) - An offering to elevate customer experience, build personal AI, and enable new forms of media.
  • Agentic AI - Discussed in the context of shifting from reactive to proactive support.
  • LLMs (Large Language Models) - Mentioned in relation to combining with speech-to-text and text-to-speech for interactive agents.
  • Speech-to-Text - A core technology component for ElevenLabs' agent platform.
  • Text-to-Speech - A core technology component for ElevenLabs' agent platform.
  • Real-time Translation - A future application of voice AI discussed.
  • AI Personal Tutors - A future application of voice AI discussed for education.
  • Agentic Government Services - A potential future application of voice AI discussed.
  • Voice Cloning - Discussed in the context of creating personalized voices.
  • Voice Sommelier - A role within ElevenLabs that helps clients select appropriate voices.
  • Celebrity Marketplace - A resource for acquiring iconic talent voices.
  • AI Companions - Discussed as a future development in human-AI interaction.
  • Personal AI - Discussed in the context of assistants and understanding individual needs.
  • Robots - Mentioned as a future interface for which voice will be a key input and output.
  • Dictation - Discussed as a key interface for interacting with technology, especially robots.
  • Scribe v2 - ElevenLabs' recently released speech-to-text model.
  • Music Model - A fully licensed music model developed by ElevenLabs.
  • Speech-to-Speech - A future area of investment and development for ElevenLabs.
  • Cascaded Approach - Chaining separate speech-to-text, LLM, and text-to-speech stages; currently the reliable choice for enterprise use cases.
  • Fused Approach - A single speech-to-speech model replacing the cascaded stages; a future direction that is potentially more expressive.
  • Turing Test - Mentioned in the context of achieving human-like conversation.
  • Ecosystem - Discussed as a long-term value driver, including brand, distribution, voices, integrations, and workflows.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.