Shifts AI systems from stateless prompt-response generation to goal-driven autonomous loops in which an agent perceives its environment, plans multi-step actions, invokes external tools, reflects on outcomes, and iterates until the goal is reached.
Category
Abstraction level
Operation level
2024
Research agents
Automation of office and knowledge work
Assistants executing end-to-end tasks
Agent workflows and task orchestration
Handling processes that require planning and action
The agentic system receives a goal, then independently plans steps, selects tools, gathers data, executes actions, and evaluates intermediate results. In simpler variants, a single agent handles this using tool use; in more advanced configurations, multiple agents collaborate on subtasks within a shared workflow.
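The loop described above can be sketched in a few lines. This is a minimal single-agent sketch: `fake_model` and the calculator tool are illustrative stand-ins, not a real model API.

```python
# Minimal single-agent loop sketch. `fake_model` and TOOLS are toy
# stand-ins for an LLM call and a tool registry.

def fake_model(context: list[str]) -> dict:
    """Stand-in for an LLM call: decides the next action from context."""
    if any("42" in msg for msg in context):
        return {"action": "finish", "answer": "The result is 42."}
    return {"action": "call_tool", "tool": "calculator", "args": {"expr": "6*7"}}

# Toy tool registry; never eval untrusted input in real systems.
TOOLS = {"calculator": lambda expr: str(eval(expr))}

def run_agent(goal: str, max_steps: int = 10) -> str:
    context = [f"GOAL: {goal}"]                           # perception/input
    for _ in range(max_steps):                            # hard step limit
        decision = fake_model(context)                    # plan + decide
        if decision["action"] == "finish":
            return decision["answer"]                     # loop termination
        result = TOOLS[decision["tool"]](**decision["args"])  # act
        context.append(f"OBSERVATION: {result}")          # feed result back
    return "max_steps reached"

print(run_agent("What is 6 times 7?"))
```

The same skeleton underlies multi-agent variants; only the "model" and the tool set change.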
Traditional generative models handle single prompts well but struggle with extended tasks that require planning, working memory, tool use, and adaptation to changing context. Agentic AI addresses this by combining reasoning, planning, and action execution.
01
Perception / Input Layer
Receives and encodes environmental inputs into the model's context window.
Modular
Accepts observations from the environment (user messages, tool results, file contents, API responses) and formats them as context for the base model. This may include RAG retrieval to fetch relevant documents.
02
Planning Module
Goal decomposition into actions and execution plan generation
Modular
Decomposes a high-level goal into a sequence of subgoals or actions. The agent may generate an explicit plan or reason step by step using chain-of-thought.
03
Memory
State and history management across agent loop steps
Modular
Stores and retrieves information between steps within a session (short-term memory) and optionally across sessions (long-term memory).
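The two memory tiers can be sketched as a simple class; the names and the dict-backed long-term store are illustrative assumptions (a real system would use a vector database or key-value store).

```python
# Sketch of two-tier agent memory: a short-term in-context buffer plus a
# long-term store that survives across sessions (a plain dict here).

class AgentMemory:
    def __init__(self) -> None:
        self.short_term: list[str] = []      # cleared each session
        self.long_term: dict[str, str] = {}  # persists across sessions

    def remember(self, event: str) -> None:
        self.short_term.append(event)        # step-by-step loop history

    def persist(self, key: str, value: str) -> None:
        self.long_term[key] = value          # e.g. DB-backed in practice

    def new_session(self) -> None:
        self.short_term.clear()              # long-term memory survives

mem = AgentMemory()
mem.remember("step 1: searched the web")
mem.persist("user_timezone", "UTC+1")
mem.new_session()
print(len(mem.short_term), mem.long_term["user_timezone"])
```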
04
Tools / Actions Layer
Extends the model's action space with calls to external systems.
Modular
The agent is provided with callable external functions: web search, code execution, database queries, file operations, API calls, and browser control. Tool interfaces are defined through schemas such as JSON Schema, OpenAPI, and MCP.
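A tool interface in the JSON Schema style might look like the following; the exact envelope (field names, nesting) varies by provider, and `web_search` here is a hypothetical tool.

```python
# Illustrative tool definition in the JSON Schema style used by function
# calling APIs; field names vary between providers.
web_search_tool = {
    "name": "web_search",
    "description": "Search the web and return top result snippets.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "max_results": {"type": "integer", "minimum": 1, "default": 5},
        },
        "required": ["query"],
    },
}

def has_required_args(tool_def: dict, args: dict) -> bool:
    """Minimal pre-dispatch check that required arguments are present."""
    required = tool_def["input_schema"].get("required", [])
    return all(key in args for key in required)

print(has_required_args(web_search_tool, {"query": "agentic AI"}))  # True
print(has_required_args(web_search_tool, {"max_results": 3}))       # False
```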
05
Reflection / Evaluation
Output quality control and decision to continue or terminate the loop.
Modular
Evaluates whether the current result meets the success criterion. Triggers a retry, replanning, or loop termination. Corresponds to the evaluator-optimizer pattern described by Anthropic.
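An evaluator step can be reduced to a three-way decision; the criteria below are placeholders for whatever success signal the task defines (test results, rubric scores, validator output).

```python
# Sketch of an evaluator-optimizer decision: accept, retry, or replan.
# The string-matching criterion is a stand-in for a real success check.

def evaluate(result: str, success_criterion: str) -> str:
    """Return 'accept', 'retry', or 'replan' for the current loop step."""
    if success_criterion in result:
        return "accept"              # criterion met: terminate the loop
    if not result:
        return "replan"              # nothing usable: back to planning
    return "retry"                   # partial output: retry this subgoal

print(evaluate("tests pass: 12/12", "tests pass"))  # accept
print(evaluate("", "tests pass"))                   # replan
```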
06
Orchestrator
Coordinates multi-agent collaboration and manages task flow.
Modular
In multi-agent systems, it directs sub-agents, assigns tasks, and aggregates results. The orchestrator can be an LLM or a statically coded deterministic controller.
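The orchestrator-worker pattern reduces to fan-out and aggregation. In this sketch the workers are plain functions; in a real system each would be an LLM sub-agent with its own role prompt.

```python
# Orchestrator-worker sketch: split a goal into subtasks, fan them out to
# worker "agents" (plain functions here), and aggregate the results.

def worker(subtask: str) -> str:
    return f"done: {subtask}"                  # stand-in for a sub-agent run

def orchestrate(goal: str, subtasks: list[str]) -> str:
    results = [worker(t) for t in subtasks]    # assign tasks to workers
    return f"{goal} -> " + "; ".join(results)  # aggregate

print(orchestrate("write report", ["gather data", "draft", "review"]))
```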
Time
…
N = number of agent loop steps; C_step = cost of a single LLM inference call (typically O(L²·d) for a Transformer with context length L). Tool call costs are added on top of this and vary independently of the model.
Agentic AI's time complexity is not an intrinsic property of the paradigm — it depends entirely on the underlying LLM and the number of reasoning–action iterations. Multi-step tasks multiply cost linearly by the number of steps, and the growing context window (accumulated history + tool outputs) increases per-step cost on each iteration. Multi-agent systems with fan-out can parallelize parts of the work, but the critical path remains sequential.
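The interaction between step count and growing context can be made concrete with a back-of-the-envelope cost model; the numbers below are arbitrary illustrations, not measurements.

```python
# Toy cost model: per-step cost is taken as O(L^2 * d) for a vanilla
# Transformer, and L grows each step as history and tool outputs accumulate.

def total_cost(n_steps: int, base_ctx: int, growth_per_step: int, d: int = 1) -> int:
    """Sum of L_i^2 * d over loop steps, with L_i growing per iteration."""
    cost = 0
    for i in range(n_steps):
        L = base_ctx + i * growth_per_step   # accumulated context at step i
        cost += L * L * d
    return cost

# Doubling the steps more than doubles total cost, because later steps
# run with longer contexts.
print(total_cost(5, 1000, 500))    # 22_500_000
print(total_cost(10, 1000, 500))   # 126_250_000
```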
Memory complexity
…
L_ctx = current LLM context window size (in tokens); S_mem = size of external memory store (e.g. vector database) if used.
The memory required by the agentic loop itself is modest (state structures, action history), but grows linearly with the model's context window length. Systems with persistent long-term memory add separate storage costs independent of a single step.
Bottleneck: LLM inference per action step
Each step of the agent loop requires at least one LLM inference call. Multi-step tasks with long context windows and multiple tool calls multiply latency and computational cost linearly.
Parallelism
Conditionally parallel
Parallelism is achievable when subtasks are independent (e.g., parallel web searches, concurrent subagent execution). Sequential loops are required when each step depends on the results of previous tool calls.
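Independent tool calls can be fanned out with ordinary concurrency primitives; here the searches are simulated with sleeps, so the timing only illustrates the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Independent tool calls (e.g. parallel web searches) run concurrently;
# a sleep stands in for network latency.

def fake_search(query: str) -> str:
    time.sleep(0.1)
    return f"results for {query}"

queries = ["agent loops", "tool use", "MCP"]
start = time.time()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(fake_search, queries))
elapsed = time.time() - start

print(results)
print(f"elapsed ~{elapsed:.2f}s (vs ~0.3s sequentially)")
```

When each call depends on the previous observation, this fan-out is impossible and the loop stays sequential.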
Paradigm
Conditional
Input dependent
The execution path is not predetermined — it is determined at runtime through the model's reasoning over accumulated context. Workflows with predefined paths represent a degenerate case.
Toolkit
Critical
web_search + code_execution
Typical of research agents.
file_read + file_write + bash
Typical of coding agents.
The set of external tools available to an agent (web search, code execution, file operations, APIs, browser control). It defines the space of possible actions.
Maximum Number of Steps
Standard
10
Conservative limit for short tasks.
50–200
Used in long-running coding and research agents.
A hard limit on the number of reasoning-action iterations before forced termination. Guards against infinite loops.
Memory Type
Standard
in_context_only
in_context + vector_store
Whether the agent relies solely on in-context memory or also on external persistent storage (vector database, key-value store).
Number of Agents (Single vs. Multi-Agent)
Standard
1
Single-agent loop.
2–10+
Multi-agent orchestrator-worker system.
Whether the system uses a single agent or a network of specialized agents coordinated by an orchestrator.
Human-in-the-Loop Checkpoints
Standard
none
Fully autonomous.
before_irreversible_actions
Recommended for safety-critical deployments.
Whether and at which steps the agent pauses to await human confirmation before taking irreversible actions.
Context Window Size
Standard
128k tokens
1M tokens
Required for very long-term tasks.
The maximum number of tokens processed by the underlying LLM in a single call. This limits the amount of accumulated history, tool outputs, and instructions that can fit within a single inference step.
Common pitfalls
Hallucinations in action
CRITICAL
Model may invoke tools with fabricated parameters or claim to have performed actions it never actually executed — leading to silent failures in multi-step pipelines.
Validate all tool calls against schemas before execution; use deterministic parsers; introduce explicit confirmation steps for irreversible actions.
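A minimal pre-execution check can be hand-rolled; production code would use a full JSON Schema validator, and the `SCHEMA` below is a made-up example for a file tool.

```python
# Validate a model-proposed tool call against a schema before execution.
# Hand-rolled minimal checks; a real system would use a JSON Schema library.

SCHEMA = {
    "required": ["path"],
    "properties": {"path": {"type": "string"}, "recursive": {"type": "boolean"}},
}
TYPE_MAP = {"string": str, "boolean": bool, "integer": int}

def validate_call(args: dict, schema: dict) -> list[str]:
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        prop = schema["properties"].get(key)
        if prop is None:
            errors.append(f"unknown argument: {key}")   # fabricated parameter
        elif not isinstance(value, TYPE_MAP[prop["type"]]):
            errors.append(f"wrong type for {key}")
    return errors

print(validate_call({"path": "/tmp/x", "recursive": True}, SCHEMA))  # []
print(validate_call({"paht": "/tmp/x"}, SCHEMA))  # typo'd key caught
```

Rejecting the call (rather than silently dropping bad arguments) surfaces the hallucination instead of letting it fail downstream.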
Infinite loops
HIGH
Without a hard step limit or an effective termination criterion, an agent can loop indefinitely, consuming computational resources and hitting API rate limits.
Set explicit max_steps limits; implement loop detection based on repeated action signatures; use an evaluator to enforce stopping conditions.
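Loop detection by action signature can be implemented by hashing each (tool, arguments) pair and counting repeats; the threshold of 3 is an arbitrary illustration.

```python
import hashlib
import json

# Loop detection sketch: hash each (tool, args) pair into a signature and
# abort when the same signature repeats too many times.

def action_signature(tool: str, args: dict) -> str:
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def detect_loop(actions: list[tuple[str, dict]], threshold: int = 3) -> bool:
    counts: dict[str, int] = {}
    for tool, args in actions:
        sig = action_signature(tool, args)
        counts[sig] = counts.get(sig, 0) + 1
        if counts[sig] >= threshold:
            return True          # agent is repeating itself: abort
    return False

history = [("web_search", {"q": "x"})] * 3
print(detect_loop(history))      # True
print(detect_loop(history[:2]))  # False
```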
Prompt injection via observed content
CRITICAL
Malicious instructions embedded in tool outputs (web pages, documents, emails) can hijack agent behavior by impersonating system-level instructions.
Isolate untrusted content from system instructions; require explicit user confirmation before acting on instructions found in observed content; apply content filtering.
Context window overflow
HIGH
Accumulated tool outputs and conversation history can exceed the model's context window, causing earlier steps to be silently truncated.
Implement context compaction/summarization; use external memory stores; monitor the token budget at each step.
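Token-budget monitoring can be sketched as follows; the 4-characters-per-token ratio is a crude heuristic (not a tokenizer), and a real system would replace dropped messages with an LLM-generated summary rather than a stub marker.

```python
# Context compaction sketch: approximate token counts and drop the oldest
# history once the budget is exceeded, leaving a stub summary marker.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)            # rough heuristic, not a tokenizer

def compact(history: list[str], budget: int) -> list[str]:
    compacted = list(history)
    while sum(approx_tokens(m) for m in compacted) > budget and len(compacted) > 1:
        compacted = compacted[1:]             # drop the oldest message
    if len(compacted) < len(history):
        compacted = ["[earlier steps compacted]"] + compacted
    return compacted

history = ["observation " + "x" * 28] * 10    # ~10 tokens per message
kept = compact(history, budget=50)
print(len(kept), kept[0])
```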
Tool misuse and irreversible side effects
CRITICAL
Agents with access to write-enabled tools (file deletion, email sending, database writes) can cause real-world harm when acting on faulty reasoning.
Use tool sets with minimal permission scope; require human confirmation for irreversible actions; prefer reversible operations where possible.
Creeping complexity — building agents where a workflow suffices
MEDIUM
Using agentic autonomy for deterministic, well-defined tasks introduces latency, unpredictability, and failure modes that a simple workflow would avoid.
Use predefined workflows by default; introduce agentic autonomy only when a task genuinely requires dynamic decision-making across multiple unpredictable steps.
1995
Russell and Norvig formalize rational agents as entities that perceive their environment and take goal-directed actions. BDI (Belief-Desire-Intention) agent architectures are established.
2022
ReAct: Reasoning + Acting with LLMs
breakthrough
Yao et al. (2022) propose ReAct — interleaving chain-of-thought reasoning traces with action execution in LLMs, demonstrating that language models can serve as a reasoning engine within tool-augmented agentic loops.
2023
API for tool calling and first commercial agentic systems
breakthrough
OpenAI introduced function calling in GPT-4 in June 2023. AutoGPT, BabyAGI, and LangChain agent abstractions gained widespread adoption. The term "Agentic AI" entered common industry usage.
2024
Four Agentic AI Design Patterns by Andrew Ng
Andrew Ng's series of blog posts identifies four fundamental design patterns — Reflection, Tool Use, Planning, and Multi-Agent Collaboration — widely cited as a practical taxonomy of agentic systems.
Anthropic "Building Effective Agents" — compositional patterns for production
breakthrough
Anthropic published practical guidelines distinguishing workflows (predefined paths) from agents (model-driven execution) and formalized five compositional patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer.
Model Context Protocol (MCP) standardizes tool connectivity
Anthropic publishes MCP as an open standard for connecting LLMs to external tool servers, enabling interoperable agentic ecosystems across providers.
2025
Agentic AI in Robotics — Embodied Agent Loops
LLM-based planners drive robotic actions through perception-planning-action loops, extending agentic paradigms to physical systems and connecting Agentic AI with real-world motor execution.
Hardware agnostic
PRIMARY
Agentic AI is an architectural paradigm, not a specific computational kernel. Hardware requirements are entirely determined by the underlying LLM and tools, not by the agent loop itself.
GPU tensor cores or TPU are required by the underlying LLM for efficient inference; the agent orchestration layer (routing, tool calls, memory management) runs on CPU.
BUILT ON
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results.
In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars.
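The zero-shot variant is simple enough to show directly; this only constructs the prompt string in the style of Kojima et al. (2022), with no model call.

```python
# Zero-shot CoT prompt construction (Kojima et al., 2022): append a
# reasoning trigger phrase to the question before sending it to the model.

def zero_shot_cot(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot("If a train travels 60 km in 30 minutes, what is its speed?")
print(prompt)
```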
CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data.
Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
Tool-augmented LLM is an architectural pattern in which a large language model is equipped with access to one or more external tools that it can invoke during inference by generating structured function-call or API-call outputs. The model learns when and how to call tools by producing special tokens or structured output (e.g., JSON function calls) that are intercepted by a host runtime, executed against the tool, and whose results are returned to the model as new context for continued generation.
The canonical formalization appeared in the Toolformer paper (Schick et al., Meta AI, 2023), which demonstrated that LLMs can learn to self-supervise their own tool-use through API call annotation without requiring large labeled datasets. Toolformer showed that models trained this way can decide which tools to call, when, and with which arguments, and that tool use substantially improves performance on tasks requiring fresh information, arithmetic, multilingual lookup, and question answering.
The pattern encompasses several mechanisms: (1) in-context tool specification, where tool interfaces are described in the system prompt or context (JSON Schema, OpenAPI, natural language); (2) function calling APIs, where the model produces structured output matched to a defined schema and the host dispatches the call; (3) ReAct-style interleaving, where the model alternates reasoning traces with tool-use observations; and (4) parallel tool calling, where the model emits multiple tool calls simultaneously to be executed concurrently.
Key implementations include OpenAI function calling (GPT-4, June 2023), Anthropic tool use (Claude, 2023), Google Gemini function calling, and the Model Context Protocol (MCP, 2024) which standardizes tool server connectivity.
Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. (2020) as a general-purpose fine-tuning recipe combining pre-trained parametric memory (a seq2seq language model, specifically BART in the original paper) with non-parametric memory (a dense vector index of Wikipedia, accessed via Dense Passage Retrieval, DPR). In the original formulation, both the retriever and the generator are fine-tuned end-to-end: given an input query x, the retriever retrieves top-k documents z from the corpus, and the generator produces an output y conditioned on x and z. Two formulations were proposed: RAG-Sequence (the same retrieved documents condition the full output sequence) and RAG-Token (different documents may be used per generated token, marginalized during generation).
In widespread contemporary usage (post-2022, with the growth of LLM applications), 'RAG' has expanded to describe a broader class of retrieve-then-generate pipelines, typically with a frozen LLM, a vector store containing pre-computed dense embeddings of document chunks, and a retrieval step that fetches top-k relevant chunks based on embedding similarity to the query. The retrieved chunks are appended to the prompt as context before the LLM generates a response. This non-trainable pipeline variant is technically distinct from the original Lewis et al. formulation but is the dominant practical interpretation of RAG as of 2023–2025.
The canonical modern RAG pipeline consists of an offline indexing phase (document chunking, embedding computation, storage in a vector database) and an online query phase (query embedding, approximate nearest neighbor search, context-augmented generation). Key design decisions include: chunk size and overlap, embedding model choice, retrieval strategy (dense, sparse/BM25, or hybrid), number of retrieved documents k, and context integration method (prepend to prompt, cross-attention injection, or fusion-in-decoder).
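The online query phase reduces to similarity search plus prompt assembly. Below is a toy version with hand-made three-dimensional "embeddings"; real pipelines use a learned embedding model and an approximate nearest neighbor index.

```python
import math

# Toy dense retrieval: cosine similarity over hand-made embeddings, then
# prepend the retrieved chunk to the prompt (context-augmented generation).

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

index = {  # chunk -> pre-computed embedding (toy values)
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "The Transformer uses attention.": [0.1, 0.9, 0.1],
    "RAG retrieves then generates.":   [0.0, 0.2, 0.9],
}

def retrieve(query_emb: list[float], k: int = 1) -> list[str]:
    ranked = sorted(index.items(), key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

query_emb = [0.05, 0.1, 0.95]      # pretend embedding of "what is RAG?"
context = retrieve(query_emb, k=1)
prompt = f"Context: {context[0]}\nQuestion: what is RAG?"
print(prompt)
```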
RAG addresses two fundamental limitations of parametric-only LLMs: the knowledge cutoff problem (inability to access post-training information) and hallucination (generation of factually incorrect content). However, RAG introduces its own failure modes, including retrieval of irrelevant or misleading context and the LLM's susceptibility to being distracted by retrieved content that contradicts its parametric knowledge.
Model Context Protocol (MCP) is an open protocol developed by Anthropic and released in November 2024. It addresses the M×N integration problem in AI systems: connecting M different LLM applications to N different external tools previously required M×N bespoke connectors. MCP defines a standardized client-host-server architecture where hosts (LLM applications) manage one or more clients, each maintaining a stateful session with a specific server. Servers expose capabilities as three primitives: Resources (structured data for context), Prompts (templated instructions), and Tools (executable functions). Clients expose two primitives: Roots (filesystem entry points) and Sampling (server-initiated LLM completions). Communication is based on JSON-RPC 2.0. Capability negotiation occurs at session initialization. The protocol is transport-agnostic and has been implemented in Python, TypeScript, C#, and Java SDKs. In December 2025, Anthropic donated MCP governance to the Agentic AI Foundation (AAIF) under the Linux Foundation.
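The wire format can be illustrated with a single tool invocation. The `jsonrpc`/`method`/`params` envelope follows JSON-RPC 2.0 and the `tools/call` method name follows the MCP specification; the `web_search` tool and its arguments are made up.

```python
import json

# Shape of an MCP tool invocation as a JSON-RPC 2.0 request.

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",                     # MCP tool-invocation method
    "params": {
        "name": "web_search",                   # hypothetical tool name
        "arguments": {"query": "agentic AI"},
    },
}
wire = json.dumps(request)                      # serialized for transport
decoded = json.loads(wire)
print(decoded["method"], decoded["params"]["name"])
```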
Multi-Agent Systems (MAS) are a paradigm in Distributed Artificial Intelligence in which multiple autonomous software entities — agents — interact within a shared environment to achieve individual or collective goals. Each agent perceives its environment through sensors or interfaces, reasons about its state, and acts through actuators or API calls. In the context of LLM-based MAS (emerging prominently from 2023 onward), agents are powered by large language models that provide the cognitive core (planning, reasoning, natural language communication), supplemented by memory modules, tool-use interfaces, and role-specific prompts. The system architecture defines how agents coordinate: coordination topologies include sequential pipelines, hierarchical orchestration (orchestrator-worker), parallel fan-out/fan-in, publish-subscribe messaging, and decentralized peer-to-peer communication. Core agent properties, as defined by Wooldridge and Jennings (1995), include autonomy, social ability, reactivity, and pro-activeness. In LLM-based systems, key components are: the agent (an LLM with a system prompt defining its role), a communication channel (natural language messages, structured function calls, or shared memory), an orchestrator or coordinator (managing task decomposition, routing, and state), tool-use interfaces (external APIs, code execution, web search), and a memory subsystem (short-term context, long-term vector storage). Prominent frameworks implementing LLM-based MAS include AutoGen (Microsoft, 2023), CAMEL (2023), MetaGPT (2023), CrewAI, and LangGraph.
A reasoning model (also: large reasoning model, LRM, reasoning language model, RLM) is a type of large language model that has been specifically post-trained to solve complex multi-step problems by explicitly generating intermediate reasoning steps before committing to a final response. Unlike standard LLMs that generate a direct response in a single forward pass, reasoning models allocate additional computation at inference time — a property known as test-time compute scaling — by producing a long internal chain of thought (CoT). The reasoning trace typically includes steps such as problem decomposition, hypothesis generation, self-verification, reflection, and correction of errors.
The defining characteristics of reasoning models are: (1) post-training via large-scale reinforcement learning (RL) using reward signals based on final answer correctness (and sometimes intermediate step quality via process reward models); (2) the emergence of extended, often hidden, reasoning traces that precede the final answer; (3) a consistent empirical relationship between the length or computational budget allocated to the reasoning trace and final answer quality (test-time scaling law); (4) superior performance on verifiable tasks requiring multi-step logic, such as mathematics, competitive programming, and scientific reasoning.
The term 'reasoning model' was introduced as a product category by OpenAI in September 2024 with the release of the o1-preview model. OpenAI described o1 as trained via a large-scale RL algorithm teaching the model to use chain of thought productively. The approach does not rely on explicit tree search algorithms; instead, implicit search emerges via RL-trained CoT generation. In January 2025, DeepSeek published the first detailed open technical description of this class of models in the DeepSeek-R1 paper (arXiv:2501.12948), demonstrating that reasoning capabilities can be incentivized via pure RL without supervised fine-tuning, using Group Relative Policy Optimization (GRPO) as the RL framework.
Reasoning models typically employ the same base Transformer decoder architecture as standard LLMs, with the key difference residing entirely in the post-training pipeline: RL replaces or augments standard RLHF/SFT, and reward signals are grounded in verifiable outcomes. The resulting models generate substantially longer token sequences during inference (reasoning tokens), which are often hidden from end users but incur real compute costs. Performance consistently improves with both more training-time RL compute and more inference-time thinking budget.