Speed, Infrastructure, and Evals: The AI Agent Trifecta
Resources & Recommendations
Organizations & Institutions
- Groq - Benjamin Klieger's employer, an AI company focused on fast, affordable inference powered by its custom chip, the LPU (Language Processing Unit).
- Anthropic - An AI safety and research company known for its work on building effective agents and pushing agentic capabilities.
- Hugging Face - An AI company that develops tools for building, training, and deploying machine learning models, including the agent framework smolagents.
- OpenAI - A research organization that developed SimpleQA, a benchmark dataset for evaluating models' short-form factual accuracy.
- Chroma - The vector database company behind the "Context Rot" research report.
Tools & Software
- Compound - Groq's agent platform that enables models to perform tasks like web search, mathematical calculation, and code execution (see the API sketch after this list).
- GitHub Copilot - An AI pair programmer that assists developers by suggesting code and entire functions.
- LangChain - An agent framework for building LLM applications by chaining together components such as models, tools, and memory.
- smolagents - An agent framework from Hugging Face designed to be minimal, keeping developers close to the underlying implementation.
- Claude Code - Anthropic's agentic coding tool; its SDK can also be used to build general-purpose agents.
- Remote MCP - A Model Context Protocol (MCP) server setup that lets a model iteratively call tools hosted on remote servers, essentially turning the model into an agent.
- Wolfram Alpha - A computational knowledge engine used by Compound for mathematical calculations and factual queries.
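For a sense of how Compound is consumed, here is a minimal sketch of a call through Groq's OpenAI-compatible chat completions API. The `groq` Python client usage and the `compound-beta` model id are assumptions based on Groq's public documentation; verify both against the current docs.

```python
# Minimal sketch: asking Groq's Compound agentic system a question that
# requires both a web lookup and a calculation. Compound decides on its
# own whether to use built-in tools (web search, code execution).
# ASSUMPTIONS: the `groq` client package and the `compound-beta` model id.
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

response = client.chat.completions.create(
    model="compound-beta",  # assumed model id for Compound
    messages=[{
        "role": "user",
        "content": "What is the current population of Tokyo, and what "
                   "is its square root?",
    }],
)

print(response.choices[0].message.content)
```

The appeal of this shape is that the agentic loop stays server-side: the caller sends one ordinary chat request and receives a final answer, with fast inference doing the heavy lifting across the agent's intermediate steps.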
Research & Studies
- "Building Effective Agents" (Anthropic) - A paper/blog post that discussed limitations of abstraction in agent frameworks and advocated for understanding underlying implementation details.
- "Context Rot" (Roma) - A concept explaining that the more information provided in a large language model's context window, the lower its performance can become.
Websites & Online Resources
- OpenBench - Open-source evaluation infrastructure developed by Groq for running standardized, reproducible evaluations against models and systems.
- Berkeley Function Calling Leaderboard - An evaluation benchmark for assessing how well models can call tools and functions; a toy scorer in its spirit follows this list.
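As a rough illustration of what both of these pieces standardize, fixed datasets and deterministic scoring, the sketch below implements a toy function-calling evaluation: prompts paired with reference tool calls, scored by exact match. The record schema and the stand-in model are invented for illustration and do not reflect OpenBench's or the leaderboard's actual formats.

```python
# Toy function-calling eval in the spirit of the Berkeley Function
# Calling Leaderboard: compare a model's emitted tool call against a
# reference call. The schema here is illustrative, not the real one.
import json
from typing import Callable

DATASET = [
    {"prompt": "What's the weather in Paris, in celsius?",
     "expected": {"name": "get_weather",
                  "arguments": {"city": "Paris", "unit": "celsius"}}},
    {"prompt": "Book a table for two at 7pm.",
     "expected": {"name": "book_table",
                  "arguments": {"party_size": 2, "time": "19:00"}}},
]

def evaluate(run_model: Callable[[str], str]) -> float:
    """Run every prompt, parse the model's JSON tool call, and score
    exact match on function name and arguments. Fixed data plus a
    deterministic scorer is what makes runs reproducible."""
    correct = 0
    for record in DATASET:
        try:
            call = json.loads(run_model(record["prompt"]))
        except json.JSONDecodeError:
            continue  # unparseable output counts as a miss
        correct += (call.get("name") == record["expected"]["name"]
                    and call.get("arguments") == record["expected"]["arguments"])
    return correct / len(DATASET)

# Stand-in model that always emits the same call, to show the harness.
stub = lambda prompt: json.dumps(
    {"name": "book_table", "arguments": {"party_size": 2, "time": "19:00"}})
print(f"accuracy = {evaluate(stub):.2f}")  # 0.50 on this toy set
```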