Beyond Bounding Boxes: High-Skill Data for Frontier AI
Resources
Books
- "The Art of Computer Programming" by Donald Knuth - Mentioned as an example of a foundational work in computer science, implying a similar depth of understanding is needed for advanced AI data research.
Videos & Documentaries
- None
Research & Studies
- Llama 2 Paper - Referenced for its findings on the effectiveness of RLHF over SFT data in training language models.
Tools & Software
- CrowdFlower - Mentioned as a data collection company the host previously founded.
Articles & Papers
- None
People Mentioned
- Donald Knuth (Author of "The Art of Computer Programming") - Cited as a figure associated with foundational computer science knowledge.
- Terence Tao - Mentioned as an example of someone whose exceptional skill would make no difference when drawing a bounding box, highlighting the low complexity of such tasks.
- Albert Einstein - Mentioned as an example of someone whose exceptional skill would make no difference when drawing a bounding box, highlighting the low complexity of such tasks.
- Ernest Hemingway - Mentioned as an example of a great writer who did not have a PhD.
- Emily Dickinson - Mentioned as an example of a great poet who did not have a PhD.
Organizations & Institutions
- Twitter - Mentioned as a previous employer of the guest where he encountered data collection issues.
- Google - Mentioned as a company where the guest experienced similar data collection problems.
- Facebook - Mentioned as a company where the guest experienced similar data collection problems.
- Airbnb - Mentioned as an early customer of Surge.
- Meta - Mentioned in the context of potentially making its open-source Llama models closed-source.
- OpenAI - Mentioned as a major AI lab with a distinct culture and model.
- Anthropic - Mentioned as a major AI lab with a distinct culture and model.
- xAI (Elon Musk's AI company) - Mentioned as a major AI lab with a distinct culture and model.
- Stanford University - Mentioned as an example of where a professor might be working on frontier research.
Courses & Educational Resources
- None
Websites & Online Resources
- LMArena - Discussed as a popular LLM leaderboard that the guest believes has harmed the industry because of its flawed evaluation methods.
Other Resources
- Lidar model - Used as an example of a specific application where understanding underlying principles helps in data collection.
- Claude Code - Mentioned as an example of a coding collaborator.
- AWS - Mentioned as a service that could go down in a simulated RL environment.
- Gmail - Mentioned as a communication tool within a simulated RL environment.
- Slack - Mentioned as a communication tool within a simulated RL environment.
- Jira tickets - Mentioned as a tool within a simulated RL environment.
- GitHub PRs - Mentioned as a tool within a simulated RL environment.
- Codebases - Mentioned as a component within a simulated RL environment.
- SAT-style math problems - Used as an example of narrow, synthetic problems that models can become good at, but at the expense of broader capabilities.
- RL environments - Discussed as a new method for data collection, involving simulated universes with complex tasks.