Tool-augmented LLM
Components
A formal definition of a tool interface passed to the model — typically in JSON Schema or OpenAPI format — describing the tool's name, its parameters, and input data types. Provided in the system prompt or a dedicated API field.
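For illustration, a tool definition in this style might look like the following sketch. The get_weather tool and its field names (name, input_schema) are hypothetical; exact field names vary between providers:

```python
# A hypothetical tool definition in the JSON Schema style described above.
# Field names follow a common function-calling convention, but providers differ.
get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```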
The module responsible for generating a structured tool call within the LLM's output, typically as special tokens, a JSON block, or a function_call object. The model decides when to invoke a tool, and with what arguments, based on context.
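The generated call is structured text that the host must parse before execution. A minimal sketch, assuming the model emits a JSON block (the get_weather call is hypothetical):

```python
import json

# Hypothetical raw model output containing a JSON tool-call block.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

# The host parses the block to recover the tool name and its arguments.
call = json.loads(model_output)
tool_name = call["name"]
tool_args = call["arguments"]
```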
The runtime environment outside the model that intercepts tool calls generated by the LLM, executes them (by calling APIs, running code, or querying a database), and returns the results to the model as new context.
The mechanism for returning tool execution results to the model's context window, enabling the model to continue generation incorporating the retrieved data. Results may be injected as a tool_result block, a new message, or special tokens.
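Putting the components together, the host-side loop can be sketched as follows. fake_llm and the add tool are stand-ins for a real model and tool registry; the message shapes are illustrative, not any provider's actual API:

```python
# Sketch of the host-side loop: intercept a tool call, execute it, and
# append the result to the conversation so the model can continue.
TOOLS = {"add": lambda a, b: a + b}  # hypothetical tool registry

def fake_llm(messages):
    # Stand-in model: requests one tool call, then answers using its result.
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"tool_call": {"name": "add", "arguments": {"a": 2, "b": 3}}}
    return {"text": f"The sum is {tool_msgs[-1]['content']}."}

def run_turn(messages):
    while True:
        reply = fake_llm(messages)
        if "tool_call" not in reply:
            return reply["text"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])          # host executes
        messages.append({"role": "tool", "content": str(result)})  # result fed back

answer = run_turn([{"role": "user", "content": "What is 2 + 3?"}])
```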
Implementation
The model may generate tool calls with fabricated or incorrect parameters, such as fictitious function names, wrong data types, or invalid date or identifier formats. Without host-side validation, these cause execution errors or silently wrong results.
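A common mitigation is validating every call against the declared schema before execution. A minimal sketch, with an illustrative SCHEMAS registry in place of full JSON Schema validation:

```python
# Host-side argument validation before execution, guarding against
# fabricated tool names and wrong argument types. Schemas are illustrative.
SCHEMAS = {
    "get_weather": {"required": {"city": str}, "optional": {"unit": str}},
}

def validate_call(name, args):
    """Return a list of problems; an empty list means the call is safe to run."""
    if name not in SCHEMAS:
        return [f"unknown tool: {name}"]
    schema = SCHEMAS[name]
    problems = []
    for key, typ in schema["required"].items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], typ):
            problems.append(f"wrong type for {key}")
    for key in args:
        if key not in schema["required"] and key not in schema["optional"]:
            problems.append(f"unexpected argument: {key}")
    return problems
```

A call that fails validation is rejected and the error list is returned to the model, which can retry with corrected arguments.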
Results returned by tools (web pages, documents, API responses) may contain malicious instructions that the model treats as system commands — a classic prompt injection attack via observed content.
Without hard limits, a model can invoke tools in a loop — for example, repeatedly searching the web for information unavailable in any source — exhausting the token budget and generating unnecessary API costs.
Results from external APIs or search engines can be very long — HTML pages, JSON responses with many fields — quickly filling the context window and causing earlier conversational context to be lost.
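A common mitigation is truncating oversized results before injecting them into context. A minimal sketch, character-based for simplicity (real systems budget in tokens, and the 200-character limit is illustrative):

```python
# Truncate an oversized tool result, keeping the head and tail and
# marking how much was dropped, before injecting it into the context.
def truncate_result(text, limit=200):
    if len(text) <= limit:
        return text
    head, tail = limit // 2, limit // 4
    omitted = len(text) - head - tail
    return text[:head] + f"\n[... {omitted} chars omitted ...]\n" + text[-tail:]

long_page = "x" * 10_000          # stand-in for a huge HTML/JSON response
short = truncate_result(long_page)
```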
Models may invoke tools (e.g., a search engine) for information already present in their parametric knowledge, unnecessarily increasing latency and cost. This is especially relevant for models with a low confidence threshold for tool invocation.
Evolution
Nakano et al. (OpenAI) introduce WebGPT, which augments GPT-3 with the ability to search the web via a text-based browser interface. It was the first demonstration that an LLM could be trained, via reinforcement learning from human feedback, to use an external information source.
Parisi et al. (Google) propose TALM (Tool Augmented Language Models), in which an LLM iteratively expands its training set of tool calls, keeping those whose results improve task performance, an early step toward self-supervised learning of tool use.
Yao et al. (Princeton / Google) propose ReAct: an LLM alternately generates reasoning traces (Thought) and tool calls (Action), receiving observations (Observation) from the environment. The work establishes the interleaved reasoning + tool use pattern.
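The ReAct pattern can be sketched as a loop over Thought / Action / Observation steps. Below, scripted_model is a scripted stand-in for a real LLM, and the calculator tool is illustrative:

```python
# Sketch of the ReAct interleaving: the transcript alternates Thought /
# Action / Observation until the model emits a final answer.
def calculator(expr):
    # Toy tool; eval on trusted, scripted input only.
    return str(eval(expr, {"__builtins__": {}}))

def scripted_model(transcript):
    # Stand-in model: one tool call, then an answer based on the observation.
    if "Observation:" not in transcript:
        return "Thought: I need to compute 12 * 7.\nAction: calculator[12 * 7]"
    return "Thought: I have the result.\nFinal Answer: 84"

def react_loop(question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = scripted_model(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return transcript
        expr = step.split("Action: calculator[")[1].rstrip("]")
        transcript += f"Observation: {calculator(expr)}\n"   # environment feedback
    return transcript

trace = react_loop("What is 12 * 7?")
```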
Schick et al. (Meta AI) introduce Toolformer — a model trained via self-annotated API call insertions in text, without large manually labeled datasets. The model learns when and how to invoke external tools (calculator, search engine, translator, QA system) and how to integrate their outputs.
OpenAI introduces function calling in GPT-4 and GPT-3.5 Turbo (June 2023) — a structured API enabling the model to generate function calls in JSON Schema format. This becomes the de facto industry standard for tool augmentation.
Anthropic introduces tool use in the Claude API with support for parallel tool calling, enabling the model to generate multiple tool calls that the host executes simultaneously.
Anthropic publishes the Model Context Protocol (MCP) as an open standard connecting models to external tool servers, analogous to the Language Server Protocol for developer tooling. MCP standardizes both the tool description format and the communication protocol between LLM hosts and tool servers.
Technical details
Hyperparameters (configurable axes)
A set of tools made available to the model, defining the space of possible calls. Tools may include search engines, calculators, code interpreters, databases, external APIs, and system utilities.
Whether the model can generate multiple tool calls simultaneously in a single turn (parallel tool calling), which the host executes in parallel before returning the results.
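Host-side parallel execution can be sketched with a thread pool. The slow_tool function below simulates API latency and is purely illustrative:

```python
# Independent tool calls from one model turn run concurrently on the host;
# results come back in call order. Sleeps simulate slow external APIs.
import time
from concurrent.futures import ThreadPoolExecutor

def slow_tool(name, delay=0.2):
    time.sleep(delay)                 # simulated API latency
    return f"result of {name}"

calls = ["search", "weather", "calendar"]   # three independent tool calls

start = time.monotonic()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(slow_tool, calls))
elapsed = time.monotonic() - start
# Wall time is roughly one call's latency, not the sum, since calls overlap.
```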
The format in which tools are described to the model affects the precision of generated calls and compatibility with the host.
A limit on the number of tool calls within a single conversational turn, guarding against infinite call loops.
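Such a budget can be enforced entirely on the host side by withdrawing tool access once the limit is reached, forcing a text answer. A minimal sketch; looping_model is a stand-in for a model stuck retrying an unanswerable query:

```python
# Per-turn call budget: after max_calls tool invocations the host stops
# offering tools, so the model must answer in text and the loop breaks.
def looping_model(messages, tools_allowed=True):
    if tools_allowed:
        return {"tool_call": {"name": "search", "arguments": {"q": "unfindable"}}}
    return {"text": "I could not find a reliable answer."}

def run_turn_with_budget(messages, max_calls=3):
    calls_made = 0
    while True:
        reply = looping_model(messages, tools_allowed=calls_made < max_calls)
        if "tool_call" not in reply:
            return reply["text"], calls_made
        calls_made += 1
        messages.append({"role": "tool", "content": "no results"})

answer, n_calls = run_turn_with_budget([])
```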
Compute bottleneck
The primary bottlenecks are latency from external tool calls (API response times, code execution) and the linear growth of context length as tool results are injected — which increases the cost of each subsequent LLM inference step.
Execution paradigm
The base LLM remains dense — all parameters are active at every step. The conditional nature applies to external tool invocation, not to the model's internal structure.
The model decides during decoding whether to invoke a tool (and which one) or continue generating text, based on the context content and tool specifications. This decision is endogenous: it arises from the probability distribution over output tokens.
Parallelism
Parallelism is possible when the model generates multiple tool calls in a single turn with no dependencies between them. LLM execution is sequential; parallelism applies only to tool execution by the host.
Hardware requirements
Tool-augmented LLM is a runtime architectural pattern — hardware requirements are determined solely by the underlying LLM and the tools themselves, not by the tool augmentation mechanism.