Modular Agentic Architectures for Effective LLM Application Development

TL;DR

  • Generative AI models offer powerful unstructured data understanding and generation capabilities but incur high latency and computational costs, making them unsuitable for real-time, high-volume, low-latency systems like transaction scoring.
  • Applications not involving unstructured input, such as inventory optimization or structured predictive analytics, do not benefit from generative AI and are better served by traditional ML models.
  • High-stakes applications focused on understanding unstructured inputs, like fraud detection or medical diagnosis, should leverage accurate, controllable encoder models trained on task-specific data rather than generative models.
  • For knowledge-intensive generation tasks, retrieval-augmented generation (RAG) is the initial solution, grounding responses in factual data, with domain-specific fine-tuning of generative models as a more complex alternative.
  • Building LLM applications effectively requires a modular, agentic architecture with distinct components for planning, orchestration, memory, LLM reasoning, and specialized tools, rather than a single monolithic model.
  • Validation is critical, employing multi-layered defenses including input/output filtering, prompt safeguards, and human-in-the-loop mechanisms to manage the stochastic and unpredictable nature of generative systems.
  • Automated prompt optimization, using techniques like those in the DSPy framework, is shifting the focus from manual prompt engineering to automatically generating and refining prompts for complex LLM systems.

Deep Dive

The rapid evolution of Large Language Models (LLMs) and generative AI presents both transformative opportunities and critical decision points for application development. While LLMs offer powerful capabilities for understanding and creating unstructured content, their inherent latency, computational cost, and potential for hallucination necessitate a nuanced approach. Applications are best built not as monolithic black boxes, but as modular, agentic systems, where specialized components handle distinct tasks like planning, orchestration, memory, and tool execution, guided by the core LLM intelligence. Strategic use of Retrieval Augmented Generation (RAG) and carefully designed evaluation metrics are crucial for ensuring accuracy, safety, and efficacy, particularly when moving beyond simple generative tasks to knowledge-intensive applications.
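Since RAG recurs throughout the discussion, a minimal sketch helps fix the idea: retrieve the most relevant snippets first, then let the model generate only from them. The keyword-overlap retriever, the sample documents, and the call_llm stub below are illustrative assumptions, not a recommended retrieval stack.

```python
# Minimal RAG sketch: retrieve relevant snippets for a query, then generate an
# answer conditioned on them. Scoring and call_llm are placeholders only.
def call_llm(prompt: str) -> str:
    return "[answer grounded in the retrieved context]"  # hypothetical model call

DOCUMENTS = [
    "Refunds are issued within 14 days of a return request.",
    "Premium support is available on the enterprise plan only.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval standing in for BM25 or vector search."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(DOCUMENTS, key=overlap, reverse=True)[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")

print(answer("How long do refunds take?"))
```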

The effective integration of LLMs into applications hinges on a modular, agentic architecture that mirrors a well-coordinated project team. At its core, an LLM acts as the central intelligence, akin to an expert strategist, providing reasoning and guidance. This intelligence is directed by a planner, which breaks down overall requirements into actionable tasks, and an orchestrator, which manages the execution of these tasks by coordinating various components. Memory modules store context and decisions, enabling coherent operation across sessions, while a tool layer comprises specialized APIs or models that perform specific functions, such as retrieval of factual information or execution of actions. This architecture contrasts with simpler monolithic models or RAG systems, offering greater flexibility and control for complex applications requiring dynamic planning, diverse tool utilization, and iterative reflection.
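As a rough illustration of this division of responsibilities, the sketch below wires a planner, orchestrator, memory store, and tool layer around an LLM call. The call_llm function and the tool registry are hypothetical placeholders; the modular structure, not any specific API, is the point.

```python
# Sketch of a modular agentic loop: a planner decomposes the request, an
# orchestrator routes each task through tools and the LLM, and a memory object
# keeps context across steps. call_llm and the tools are placeholders.
from dataclasses import dataclass, field
from typing import Callable

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in any provider's client here."""
    return f"[LLM response to: {prompt[:60]}...]"

@dataclass
class Memory:
    events: list[str] = field(default_factory=list)
    def remember(self, item: str) -> None:
        self.events.append(item)
    def context(self) -> str:
        return "\n".join(self.events[-10:])  # keep only the recent window

def plan(request: str) -> list[str]:
    """Planner: ask the LLM to break the request into discrete tasks."""
    raw = call_llm(f"Break this request into numbered tasks:\n{request}")
    return [line for line in raw.splitlines() if line.strip()]

# Tool layer: specialized functions behind a simple registry.
TOOLS: dict[str, Callable[[str], str]] = {
    "retrieve": lambda q: f"[documents matching '{q}']",
    "guardrail": lambda text: text,  # e.g. filter unsafe or off-policy output
}

def orchestrate(request: str) -> str:
    """Orchestrator: run the plan, consulting memory and tools at each step."""
    memory = Memory()
    for task in plan(request):
        evidence = TOOLS["retrieve"](task)
        answer = call_llm(f"Context:\n{memory.context()}\n{evidence}\nTask: {task}")
        memory.remember(f"{task} -> {answer}")
    return TOOLS["guardrail"](call_llm(f"Summarize:\n{memory.context()}"))

print(orchestrate("Draft a grounded answer to a customer question"))
```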

Crucially, the application of LLMs is not universally beneficial. They are ill-suited for tasks that do not involve unstructured input, such as structured predictive analytics or inventory optimization. High-stakes applications focused solely on understanding unstructured data, like fraud detection or medical diagnosis, are better served by highly accurate, controllable encoder models trained on task-specific data than by generative models prone to hallucination. Furthermore, the inherent computational cost and latency of generative models make them impractical for real-time, high-volume, low-latency systems such as transaction scoring or product search ranking. Where accurate solutions already exist and latency requirements are strict, adopting LLMs may not be optimal.

When building LLM-powered applications, a structured approach to model selection and evaluation is paramount. This involves identifying candidate models based on input/output modalities and then assessing both qualitative factors (e.g., open vs. closed access, licensing, copyright liabilities, cost, throughput limits) and quantitative metrics. Quantitative evaluation should focus on task-specific response quality: precision and recall for understanding and retrieval, fluency and realism for open-ended world-knowledge generation, and factual accuracy and completeness for knowledge-intensive generation. Beyond these, metrics for out-of-scope detection, safety, and compliance are vital, especially in regulated domains. Establishing clear, business-driven success criteria with representative test datasets, and automating evaluation mechanisms, allows for continuous monitoring and efficient model swapping as new LLMs emerge.
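A minimal evaluation harness in this spirit might score a candidate model against a labeled test set on both quality and safety dimensions. The metric choices and the generate stub below are illustrative assumptions, not a prescribed toolkit.

```python
# Sketch of an automated evaluation loop over a representative test set.
# `generate` stands in for whichever candidate LLM is being assessed.
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected_facts: set[str]   # facts a correct answer must mention
    forbidden_terms: set[str]  # terms that violate safety or compliance rules

def generate(prompt: str) -> str:
    return "placeholder answer"  # hypothetical model call

def evaluate(cases: list[TestCase]) -> dict[str, float]:
    completeness, safe = [], []
    for case in cases:
        answer = generate(case.prompt).lower()
        hits = sum(1 for fact in case.expected_facts if fact.lower() in answer)
        completeness.append(hits / max(len(case.expected_facts), 1))
        safe.append(not any(t.lower() in answer for t in case.forbidden_terms))
    return {
        "factual_completeness": sum(completeness) / len(cases),
        "safety_pass_rate": sum(safe) / len(cases),
    }

cases = [TestCase("What does RAG add to an LLM?", {"retrieval", "grounding"}, {"guaranteed"})]
print(evaluate(cases))
```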

Prompting, the primary interface for communicating with LLMs, has evolved from manual crafting to automated optimization. While techniques like zero-shot, few-shot, and chain-of-thought prompting remain valuable for guiding LLM behavior, modern models increasingly incorporate these methods internally. The focus is shifting towards automated prompt optimization, where systems like DSPy can learn to generate and refine prompts based on labeled data, effectively treating prompts as tunable parameters. This development streamlines the process of adapting LLMs to specific tasks and architectures, reducing the burden of manual prompt engineering.
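The core idea can be illustrated without the framework itself: treat the instruction as a tunable parameter, score candidate prompts on a small labeled set, and keep the best one. Everything below (the candidate list, the scoring rule, the call_llm stub) is an illustrative assumption that mirrors the idea behind frameworks like DSPy, not their actual API.

```python
# Sketch of automated prompt optimization as search over candidate instructions,
# scored on labeled examples. call_llm is a hypothetical model client.
def call_llm(prompt: str) -> str:
    return "positive"  # placeholder response

LABELED = [("The update fixed my issue!", "positive"),
           ("Still crashes after the patch.", "negative")]

CANDIDATES = [
    "Classify the sentiment of the text as positive or negative.",
    "You are a support analyst. Answer only 'positive' or 'negative':",
    "Read the message and reply with its sentiment, one word, lowercase.",
]

def score(instruction: str) -> float:
    """Accuracy of a candidate instruction on the labeled set."""
    correct = 0
    for text, label in LABELED:
        prediction = call_llm(f"{instruction}\n\nText: {text}").strip().lower()
        correct += int(prediction == label)
    return correct / len(LABELED)

best = max(CANDIDATES, key=score)
print("Selected instruction:", best)
```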

The future of LLM applications points towards multi-sensory integration and enhanced safety measures. Innovations like Meta's ImageBind demonstrate the ability to unify representations across diverse modalities including touch and thermal signals, crucial for robotics and embodied AI. Advancements in artificial smell and taste sensors, coupled with concepts like the "Internet of Senses," promise richer contextual understanding and immersive experiences. However, this expansion also raises concerns about AI perceiving the world beyond human capacity. Consequently, robust validation remains critical. Layered defense mechanisms, including input and output validation, prompt-level safeguards, and human-in-the-loop review, are essential for mitigating risks such as offensive content, copyright infringement, and factual inaccuracies, ensuring responsible AI deployment.
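A layered defense might look like the following sketch: an input gate rejects out-of-scope or injection-style requests, an output gate screens the generated text, and anything ambiguous is routed to a human reviewer. The specific checks, patterns, and the generate stub are assumptions for illustration only.

```python
# Sketch of layered validation around a generative call: input filtering,
# output filtering, and a human-in-the-loop fallback for uncertain cases.
BLOCKED_INPUT = ("ignore previous instructions", "system prompt")
BLOCKED_OUTPUT = ("credit card number",)  # stand-ins for real policy checks

def generate(prompt: str) -> str:
    return "A grounded, policy-compliant answer."  # hypothetical model call

def needs_human_review(text: str) -> bool:
    # Placeholder heuristic; in practice this might be a classifier score.
    return "not sure" in text.lower()

def guarded_generate(user_input: str) -> str:
    lowered = user_input.lower()
    if any(pattern in lowered for pattern in BLOCKED_INPUT):
        return "Request rejected by input filter."
    answer = generate(user_input)
    if any(pattern in answer.lower() for pattern in BLOCKED_OUTPUT):
        return "Response withheld by output filter."
    if needs_human_review(answer):
        return "Escalated to human reviewer."
    return answer

print(guarded_generate("Summarize our refund policy for a customer email."))
```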

Key Quotes

"So ML is sort of nested inside AI though nowadays they've become nearly synonymous now within machine learning the current dominant subclass of algorithms are deep learning methods and these are methods based on multi layered neural networks and these include models such as RNNS recurrent neural networks and CNNS convolutional neural networks and also the most recent crop of generative AI models."

Srujana Merugu explains the hierarchical relationship between Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning. She clarifies that ML is a subset of AI, and Deep Learning is a dominant subclass within ML, characterized by multi-layered neural networks. This establishes a foundational understanding of the terminology used in the field.


"The transformer architecture that is the core model architecture powering today's LLMs so this was first proposed in 2017 in the famous attention is all you need paper and it's designed around the idea of attention which is essentially a mechanism that lets the model focus on the most relevant parts of the input when it's trying to either create an embedding or generate predictions."

Srujana Merugu highlights the significance of the transformer architecture, introduced in the "Attention Is All You Need" paper, as the foundational technology behind modern Large Language Models (LLMs). She emphasizes the "attention mechanism" as the key innovation, enabling models to prioritize relevant input information for generating embeddings or predictions.
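For reference, the scaled dot-product attention introduced in that paper weights each value by how well its key matches the query, where Q, K, and V are the query, key, and value projections of the input tokens and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```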


"So if you think about it language is discrete and sequential so words come in a fixed order and that results in a distinct meaning and so predicting the next token made a lot of sense images on the other hand are continuous and spatial and the meaning emerges from patterns across pixels rather than their order so transformer architectures which excelled at handling dependencies across a sequence of tokens were a natural fit for language but for image models we needed architectures that could deal with noise and spatial coherence so it turned out that diffusion models which were first studied in physics were a good fit."

Srujana Merugu contrasts the nature of language and images to explain the development of different AI models. She notes that language's discrete and sequential nature made transformer architectures suitable, while images' continuous and spatial properties necessitated different approaches like diffusion models, which are adept at handling noise and spatial coherence.


"The first one is really where it started for most of us and that is chatbots or software knowledge copilots like chat gpt gemini claude and these are systems designed to augment human intelligence and productivity what makes them so powerful is their dual capability on one hand these copilots interact with humans in our natural modalities like text audio images or even video and on the other hand they can also talk to automated tools through api calls."

Srujana Merugu categorizes chatbots and knowledge copilots as a primary use case for Generative AI (GenAI), emphasizing their role in augmenting human intelligence and productivity. She points to their powerful dual capability of interacting naturally with humans across various modalities and communicating with automated tools via APIs as a key differentiator.


"Broadly speaking, there are three groups of applications where GenAI models turn out to be a poor fit when we do this evaluation of benefits and let me call out these three groups the first one is applications that do not involve understanding or generation over unstructured input like text and images and obviously because the models' forte is this unstructured input so if we are not dealing with it maybe we don't need them as much and examples here include inventory optimization or structured predictive analytics based on tabular numeric data."

Srujana Merugu identifies specific scenarios where GenAI models are not optimal. She explains that applications not involving the understanding or generation of unstructured data, such as inventory optimization or predictive analytics on tabular data, are poor fits because these tasks fall outside the core strengths of GenAI models.


"The thing that's like working for a lot of people when it comes to building real large scale applications is this agentic approach and most of the applications that we build typically will not you know be the right fit for one big monolithic model you know because you can't just have one model serving you know your text and images so it's you can't control it as efficiently so when you craft it well it is a modular agentic system with multiple components and together they take care of planning execution memory reasoning actions and even reflection and learning over time."

Srujana Merugu advocates for an agentic approach in building large-scale GenAI applications, suggesting that monolithic models are often insufficient. She describes agentic systems as modular, composed of multiple components that collaboratively manage planning, execution, memory, reasoning, and learning, offering greater control and efficiency than a single model.

Resources

External Resources

Articles & Papers

  • "Attention Is All You Need" (2017) - Mentioned as the paper that first proposed the transformer architecture.

People

  • Srujana Merugu - AI researcher with decades of experience, guest on the podcast.
  • Priyanka Raghavan - Host of Software Engineering Radio.

Organizations & Institutions

  • Software Engineering Radio (SE Radio) - Podcast for professional software developers.
  • IEEE Computer Society - Sponsor of Software Engineering Radio.
  • IEEE Software Magazine - Sponsor of Software Engineering Radio.
  • Yahoo - Former research lab Srujana Merugu worked with.
  • IBM - Former research lab Srujana Merugu worked with.
  • Google - Former research lab Srujana Merugu worked with; provider of LLM fine-tuning.
  • Amazon - Former research lab Srujana Merugu worked with.
  • University of Texas at Austin - Srujana Merugu's alma mater for PhD.
  • Indian Institute of Technology Madras - Srujana Merugu's alma mater for Bachelor's.
  • OpenAI - Provider of LLM fine-tuning; developer of GPT models.
  • Microsoft - Developer of Microsoft Copilot.
  • GitHub - Developer of GitHub Copilot.
  • Meta - Developer of ImageBind.
  • Stanford NLP - Associated with the DSPy framework.

Tools & Software

  • GPT (Generative Pre-trained Transformer) - OpenAI's family of decoder-based transformer models.
  • LLM (Large Language Model) - Models that ingest inputs and generate artifacts like text.
  • RNN (Recurrent Neural Network) - Type of deep learning method.
  • CNN (Convolutional Neural Network) - Type of deep learning method.
  • BERT - Model that uses only the encoder part of the transformer architecture.
  • T5 - Model that has both encoder and decoder components.
  • LoRA (Low-Rank Adaptation) - Lightweight technique for fine-tuning models.
  • RLHF (Reinforcement Learning from Human Feedback) - Technique used for model alignment.
  • PPO (Proximal Policy Optimization) - Technique used for model alignment.
  • DPO (Direct Preference Optimization) - Technique used for model alignment.
  • Diffusion Models - Architecture used for image generation.
  • Vision Transformers (ViTs) - Architecture used for image understanding.
  • Diffusion Transformers (DiTs) - Merged diffusion and transformer architectures.
  • DALL-E - Model for generating images.
  • Stable Diffusion - Model for generating images.
  • DreamFusion - Model for generating 3D scenes.
  • ChatGPT - Chatbot and software knowledge copilot.
  • Gemini - Chatbot and software knowledge copilot.
  • Claude - Chatbot and software knowledge copilot.
  • Cursor AI - Code-centric GenAI tool.
  • Microsoft Copilot - Personal assistant for knowledge workers.
  • Assistants API - Framework for building autonomous agents.
  • AutoGPT - Framework for building autonomous agents.
  • CrewAI - Framework for building autonomous agents.
  • AlphaFold3 - Model helping design protein structures and molecules.
  • Elasticsearch - Tool for retrieval.
  • TF-IDF (Term Frequency-Inverse Document Frequency) - Retrieval method.
  • BM25 - Retrieval method.
  • Hugging Face - Platform maintaining public leaderboards.
  • Artificial Analysis AI - Platform maintaining public leaderboards.
  • LMArena - Platform maintaining public leaderboards.
  • Notion - Tool for note-taking and organization.
  • NotebookLM - Tool for note-taking and organization.
  • ImageBind - Unified representation across multiple data modalities.
  • DSPy - Declarative self-improving language programs framework.

Other Resources

  • Generative AI - AI models that ingest inputs and generate artifacts.
  • Deep Learning - Subclass of machine learning algorithms based on multi-layered neural networks.
  • AI (Artificial Intelligence) - Encompasses all kinds of intelligence that do not have a natural origin.
  • Machine Learning (ML) - Methods where patterns are learned from data.
  • Pre-training - First stage of generative model training, giving broad education.
  • Fine-tuning - Second stage of generative model training, tweaking model parameters for a specific task.
  • Alignment - Third stage of generative model training, guiding model behavior.
  • Transformer Architecture - Core model architecture powering today's LLMs.
  • Attention Mechanism - Mechanism within transformers that lets the model focus on relevant input parts.
  • Encoder - Component of the transformer architecture that processes input tokens.
  • Decoder - Component of the transformer architecture that generates output text.
  • Embeddings - Numeric vectors representing token meaning.
  • Multimodal AI - AI that handles multiple data modalities like text, images, and audio.
  • Diffusion Models - Architecture that learns to create images through iterative denoising.
  • Vision Transformers (ViTs) - Architecture that treats images like sentences by breaking them into patches.
  • Chatbots - Systems designed to augment human intelligence and productivity.
  • Software Knowledge Copilots - Systems designed to augment human intelligence and productivity.
  • Process Automation - Use case involving autonomous agents that plan and execute multi-step tasks.
  • Agentic Systems - Autonomous agents that plan and execute multi-step tasks.
  • Multimodal Content Design and Creation - Creative frontier of GenAI for generating images, video, 3D models, and audio.
  • Scientific Discovery - Use case where GenAI acts as an instrument of science for custom models and synthetic datasets.
  • Retrieval-Augmented Generation (RAG) - System that retrieves relevant information before generating a response.
  • Agentic Approach - System design with multiple components for planning, execution, memory, reasoning, and reflection.
  • Planner - Component in agentic systems that understands requirements and breaks them into tasks.
  • Orchestrator - Component in agentic systems that coordinates execution between components.
  • Memory - Component in agentic systems responsible for maintaining context and decisions.
  • Tool Layer - Component in agentic systems that performs specific jobs via APIs or external models.
  • Retriever Tool - Specialized tool for fetching information from databases or repositories.
  • Safety/Guardrails Tool - Tool that ensures input is within scope and output is safe and aligned with policies.
  • Zero-shot Prompting - Direct instruction given to an LLM.
  • Few-shot Prompting (In-context Learning) - Providing examples to an LLM to guide its task execution.
  • Chain-of-Thought Prompting - Instructing an LLM to think step-by-step.
  • Chain-of-Verification - Prompting technique where the model generates an answer and then critiques it.
  • Prompt Chaining - Breaking down complex tasks into multiple prompts where the output of one feeds into the next.
  • Automated Prompt Optimization - Methods that automatically tweak prompts based on examples.
  • Declarative Self-Improving Language Programs - Framework for declaring system architecture and optimizing prompts.
  • Multi-sensory AI - AI models that integrate multiple human senses.
  • Internet of Senses (IoS) - Concept in which an AI model serves as a cognitive hub connecting and interpreting sensory inputs.
  • Electronic Noses - Systems using chemical sensors to detect organic compounds.
  • Lifelong Learning - Continuous learning throughout one's career.
  • Curated Knowledge Sources - Selecting trustworthy sources for information.
  • Positional Placement - Research on the importance of prompt element placement.
  • Gradient Descent - Learning algorithm for updating parameters.
  • Text Gradients - Techniques for updating prompts based on loss metrics.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.