AI Agent Skills: Tools, MCP, Memory & RAG

What is an AI Agent?

A large language model (LLM) on its own is a stateless text predictor. It receives tokens, predicts the next token, and stops. It cannot:

  • Search the web or query a database.
  • Read or write files.
  • Call an API or run code.
  • Remember what you told it yesterday.
  • Act on the world in any way.

An AI agent wraps an LLM with a control loop that grants it these capabilities through skills: structured interfaces between the LLM's reasoning and the outside world.

Without agent skills:
User: "What is Apple's stock price right now?"
LLM: "I don't have access to real-time data..." ❌

With agent skills:
User: "What is Apple's stock price right now?"
Agent: [calls get_stock_price("AAPL") tool] → $189.45
LLM: "Apple (AAPL) is currently trading at $189.45." ✅

The agent loop

Every agent, regardless of framework, runs the same fundamental loop:

┌────────────────────────────────────────────────────┐
│                     Agent Loop                     │
│                                                    │
│  User Input                                        │
│      │                                             │
│      ▼                                             │
│  LLM Thinks   ──→ "I need tool X with args Y"      │
│      │                                             │
│      ▼                                             │
│  Execute Tool ──→ Result returned to LLM           │
│      │                                             │
│      ▼                                             │
│  LLM Thinks   ──→ "I have enough info"             │
│      │                                             │
│      ▼                                             │
│  Final Answer ──→ User                             │
└────────────────────────────────────────────────────┘

The loop continues until the LLM decides it has enough information to give a final answer, or until a max-step limit is reached.
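The loop can be sketched without any framework or API key. In this minimal illustration, `fake_llm`, `TOOLS`, and `run_loop` are hypothetical names, and the "model" is a stub that asks for one tool call and then answers; a real agent would replace it with an actual LLM call.

```python
from typing import Callable

# Hypothetical tool registry: maps a tool name to a plain Python function.
TOOLS: dict[str, Callable[..., str]] = {
    "get_stock_price": lambda ticker: "189.45",
}

def fake_llm(history: list[dict]) -> dict:
    """Stand-in for a real model call (an assumption for illustration).

    Requests a tool on the first turn, then answers once a tool result
    is present in the history.
    """
    tool_results = [m for m in history if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool_call", "name": "get_stock_price", "args": {"ticker": "AAPL"}}
    return {"type": "final", "text": f"AAPL is trading at ${tool_results[-1]['content']}."}

def run_loop(user_input: str, llm=fake_llm, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        decision = llm(history)
        if decision["type"] == "final":
            return decision["text"]
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[decision["name"]](**decision["args"])
        history.append({"role": "tool", "content": result})
    return "Stopped: step limit reached before a final answer."

print(run_loop("What is Apple's stock price right now?"))
# → AAPL is trading at $189.45.
```

The `max_steps` bound is what keeps a model that never stops requesting tools from looping forever.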

Types of agent skills

| Skill category | What it does | Examples |
| --- | --- | --- |
| Tool use | Executes actions in the world | Search web, call API, run SQL |
| Memory | Stores and retrieves information across time | Vector DB, key-value store, context window |
| RAG | Grounds answers in external knowledge | Document search, code search, knowledge graph |
| Code execution | Runs code safely in a sandbox | Python REPL, shell commands |
| Multi-agent delegation | Routes sub-tasks to specialised agents | Researcher → Writer → Editor pipeline |

Tool Use (Function Calling)

Why LLMs need function calling

LLMs generate text. They cannot reach outside their token stream. Function calling is a protocol that lets an LLM signal intent: it outputs a structured JSON request saying "please call this function with these arguments", and the host application executes the actual code.

LLM output (raw text) → no tools:
"The weather in Hanoi is probably warm and humid." ← guessing

LLM output (function call) → with tools:
{ "tool": "get_weather", "args": { "city": "Hanoi" } }
Host executes → { "temp": 33, "humidity": 82, "condition": "Partly cloudy" }
LLM: "Hanoi is currently 33°C with 82% humidity and partly cloudy skies." ✅

Step-by-step flow

1. Developer declares tools (name, description, JSON Schema for parameters)
        │
        ▼
2. User sends a message → LLM sees prompt + tool definitions
        │
        ▼
3. LLM decides a tool is needed → returns structured tool call JSON (not text)
        │
        ▼
4. Host application intercepts → executes the real function
        │
        ▼
5. Result is sent back to LLM as a "tool result" message
        │
        ▼
6. LLM generates the final text response using the real data

Declaring tools: the JSON Schema contract

A tool declaration is a JSON Schema that tells the LLM:

  • What the tool does (description: the LLM uses this to decide when to call it)
  • What it needs (parameters: type-safe inputs)
  • What is required (required: which args are mandatory)
{
  "name": "get_stock_price",
  "description": "Fetch the current stock price for a given ticker symbol. Use this when the user asks about a company's stock price or market value.",
  "parameters": {
    "type": "object",
    "properties": {
      "ticker": {
        "type": "string",
        "description": "The stock ticker symbol, e.g. AAPL for Apple, GOOG for Google"
      },
      "currency": {
        "type": "string",
        "enum": ["USD", "EUR", "VND"],
        "description": "The currency to return the price in. Defaults to USD.",
        "default": "USD"
      }
    },
    "required": ["ticker"]
  }
}
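The schema is only a contract; the host still has to check what the LLM sends before running anything. The sketch below is a hand-rolled check covering just the keywords used in the declaration above (`required`, `type`, `enum`, `default`). The names `SCHEMA` and `check_args` are illustrative, and applying `default` values is a choice made here, since JSON Schema treats `default` as an annotation only. A full validator such as the `jsonschema` package would be the usual choice in practice.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "ticker": {"type": "string"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "VND"], "default": "USD"},
    },
    "required": ["ticker"],
}

# Maps the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def check_args(args: dict, schema: dict) -> dict:
    """Return args with defaults applied, or raise ValueError on a mismatch."""
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            raise ValueError(f"missing required argument: {name}")
    unknown = set(args) - set(props)
    if unknown:
        raise ValueError(f"unexpected arguments: {sorted(unknown)}")
    checked = {}
    for name, spec in props.items():
        if name not in args:
            if "default" in spec:
                checked[name] = spec["default"]
            continue
        value = args[name]
        if not isinstance(value, JSON_TYPES[spec["type"]]):
            raise ValueError(f"{name} must be of type {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name} must be one of {spec['enum']}")
        checked[name] = value
    return checked

print(check_args({"ticker": "AAPL"}, SCHEMA))
# → {'ticker': 'AAPL', 'currency': 'USD'}
```

Returning the validation message to the LLM as a tool error, rather than raising, lets it correct the arguments and retry.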
The description is the most important field

The LLM decides which tool to call largely from the tool's name and description. A vague or misleading description causes the LLM to call the wrong tool or miss a tool entirely. Write descriptions that explain what the tool does, when to use it, and what its inputs mean, as if explaining to a smart colleague who has never seen the tool.

Complete implementation example

import anthropic
import json

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# ── Step 1: Declare tools ──────────────────────────────────────
tools = [
    {
        "name": "get_stock_price",
        "description": "Fetch the current stock price for a given ticker. Use when the user asks about a stock's current price or market value.",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Stock ticker symbol, e.g. AAPL, GOOG, MSFT"
                }
            },
            "required": ["ticker"]
        }
    },
    {
        "name": "search_news",
        "description": "Search for recent news articles about a company or topic. Use when the user asks about recent events or news.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query string"},
                "max_results": {"type": "integer", "description": "Number of results, default 5"}
            },
            "required": ["query"]
        }
    }
]

# ── Step 2: Implement tool functions ───────────────────────────
def get_stock_price(ticker: str) -> dict:
    # In production: call a real financial API (Alpha Vantage, Yahoo Finance, etc.)
    mock_prices = {"AAPL": 189.45, "GOOG": 175.20, "MSFT": 420.30}
    price = mock_prices.get(ticker.upper())
    if price is None:
        return {"error": f"Ticker '{ticker}' not found"}
    return {"ticker": ticker.upper(), "price": price, "currency": "USD"}

def search_news(query: str, max_results: int = 5) -> dict:
    # In production: call NewsAPI, Bing News Search, etc.
    return {"articles": [{"title": f"Mock article about {query}", "url": "https://..."}]}

# ── Step 3: Tool dispatcher ────────────────────────────────────
def execute_tool(tool_name: str, tool_input: dict) -> str:
    """Routes a tool call to the correct implementation and returns a string result."""
    try:
        if tool_name == "get_stock_price":
            result = get_stock_price(**tool_input)
        elif tool_name == "search_news":
            result = search_news(**tool_input)
        else:
            result = {"error": f"Unknown tool: {tool_name}"}
        return json.dumps(result)
    except Exception as e:
        # Report the failure to the LLM instead of crashing the loop
        return json.dumps({"error": str(e)})

# ── Step 4: Agent loop ─────────────────────────────────────────
def run_agent(user_message: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]

    # Bounded loop: a model that keeps requesting tools cannot run forever
    for _ in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

        # If the LLM wants to use a tool
        if response.stop_reason == "tool_use":
            # Add the LLM's tool call to the conversation history
            messages.append({"role": "assistant", "content": response.content})

            # Execute every tool the LLM requested (may be multiple)
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })

            # Return the tool results to the LLM
            messages.append({"role": "user", "content": tool_results})
            # Loop: the LLM will now see the results and decide the next step
            continue

        # Any other stop reason (end_turn, max_tokens, ...) ends the run:
        # return whatever text the LLM produced
        return "".join(b.text for b in response.content if b.type == "text")

    return "Stopped: reached the max-step limit without a final answer."

# Usage
answer = run_agent("What is Apple's stock price today?")
print(answer)
# → e.g. "Apple (AAPL) is currently trading at $189.45 USD."

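Parallel tool calls

Many current LLMs can request several tool calls in a single response when the calls are independent of each other:

# User: "What are the stock prices of Apple, Google, and Microsoft?"

# LLM returns three tool calls at once (not sequentially):
tool_calls = [
    {"id": "tc_1", "name": "get_stock_price", "input": {"ticker": "AAPL"}},
    {"id": "tc_2", "name": "get_stock_price", "input": {"ticker": "GOOG"}},
    {"id": "tc_3", "name": "get_stock_price", "input": {"ticker": "MSFT"}}
]

# Execute concurrently: for I/O-bound tools, total latency approaches that of
# the slowest call rather than the sum of all three
import asyncio

async def execute_tool_async(name: str, tool_input: dict) -> str:
    # Run the blocking execute_tool() from the previous example in a worker thread
    return await asyncio.to_thread(execute_tool, name, tool_input)

async def execute_parallel(tool_calls):
    # tool_calls is a list of dicts, so fields are read by key
    tasks = [execute_tool_async(tc["name"], tc["input"]) for tc in tool_calls]
    results = await asyncio.gather(*tasks)  # results keep the order of tool_calls
    return results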

Writing Production-Grade Skills

The difference between a demo tool and a production skill is reliability, safety, and observability.

The anatomy of a well-written skill

import json
import logging
import os
import time
from functools import wraps

import requests  # third-party HTTP client

logger = logging.getLogger(__name__)

# Placeholder: the environment variable name is an assumption
API_KEY = os.environ.get("FINANCE_API_KEY", "")

# ── 1. Input validation ────────────────────────────────────────
def validate_ticker(ticker: str) -> str:
    """Validate and normalise a stock ticker symbol."""
    if not ticker or not isinstance(ticker, str):
        raise ValueError("Ticker must be a non-empty string")
    ticker = ticker.strip().upper()
    if not ticker.isalpha() or len(ticker) > 5:
        raise ValueError(f"Invalid ticker format: '{ticker}'. Must be 1–5 letters.")
    return ticker

# ── 2. Retry with exponential backoff ──────────────────────────
# requests raises its own Timeout / ConnectionError classes, which do not
# inherit from the built-in ones, so both families are listed
TRANSIENT_ERRORS = (TimeoutError, ConnectionError,
                    requests.Timeout, requests.ConnectionError)

def with_retry(max_attempts: int = 3, backoff_seconds: float = 1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except TRANSIENT_ERRORS as e:
                    if attempt == max_attempts - 1:
                        raise
                    wait = backoff_seconds * (2 ** attempt)  # 1s, 2s, 4s, ...
                    logger.warning(f"Tool {fn.__name__} attempt {attempt+1} failed: {e}. Retrying in {wait}s")
                    time.sleep(wait)
        return wrapper
    return decorator

# ── 3. Observability ───────────────────────────────────────────
def observable_tool(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            duration_ms = (time.perf_counter() - start) * 1000
            logger.info(f"Tool '{fn.__name__}' succeeded in {duration_ms:.1f}ms",
                        extra={"tool": fn.__name__, "duration_ms": duration_ms, "status": "success"})
            return result
        except Exception as e:
            duration_ms = (time.perf_counter() - start) * 1000
            logger.error(f"Tool '{fn.__name__}' failed after {duration_ms:.1f}ms: {e}",
                         extra={"tool": fn.__name__, "duration_ms": duration_ms, "status": "error"})
            raise
    return wrapper

# ── 4. Safe error response: never crash the agent loop ─────────
def safe_tool_result(fn):
    """Catch all exceptions and return a structured error JSON instead of raising.
    This ensures one bad tool call doesn't crash the entire agent."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except ValueError as e:
            return json.dumps({"error": "invalid_input", "message": str(e)})
        except (TimeoutError, requests.Timeout):
            return json.dumps({"error": "timeout", "message": "The tool timed out. Try again."})
        except Exception as e:
            logger.error(f"Unexpected tool error in {fn.__name__}: {e}", exc_info=True)
            return json.dumps({"error": "internal_error", "message": "An unexpected error occurred."})
    return wrapper

# ── 5. Full production skill ───────────────────────────────────
# Decorators apply bottom-up: retry runs innermost, then timing/logging,
# and safe_tool_result converts whatever still escapes into error JSON
@safe_tool_result
@observable_tool
@with_retry(max_attempts=3)
def get_stock_price(ticker: str, currency: str = "USD") -> str:
    ticker = validate_ticker(ticker)
    # Call real API with timeout
    response = requests.get(
        f"https://api.financialdatasource.com/quote/{ticker}",
        params={"currency": currency},
        timeout=5.0,
        headers={"Authorization": f"Bearer {API_KEY}"}
    )
    response.raise_for_status()
    data = response.json()
    return json.dumps({
        "ticker": ticker,
        "price": data["latestPrice"],
        "currency": currency,
        "timestamp": data["latestUpdate"]
    })

Tool description best practices

The description field is the main signal the LLM uses to decide when and how to call your tool. Write it like documentation, not a label. A weak declaration looks like this:

{
  "name": "search",
  "description": "Search for things",
  "parameters": {
    "q": { "type": "string", "description": "query" }
  }
}

Problems:

  • "Search for things": what things? Web? Database? Files?
  • "query": what format? Max length? Keywords or natural language?
  • No guidance on when to use this vs other tools.
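For contrast, here is one way the weak declaration could be rewritten to address each of those problems. Everything specific in it is an illustrative assumption: the tool name `search_documents`, the internal knowledge base, the 200-character limit, and the separate web search tool it points to.

```python
import json

# Hypothetical improved declaration; the tool name, limits, and the sibling
# tool mentioned in the description are illustrative assumptions.
search_documents_tool = {
    "name": "search_documents",
    "description": (
        "Search the internal knowledge base for documents matching a query. "
        "Use this when the user asks about company policies or internal docs. "
        "Do NOT use this for public web content; use a web search tool instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language question or keywords, at most 200 characters.",
            }
        },
        "required": ["query"],
    },
}

print(json.dumps(search_documents_tool, indent=2))
```

The description now says what is searched, when to use the tool, and when not to; the parameter description states the expected format and length.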

Skill design rules

| Rule | Why | Example |
| --- | --- | --- |
| One responsibility per tool | Easier for LLM to reason about; easier to test | search_documents and create_document, not manage_documents |
| Always return JSON strings | Consistent parsing; LLM can reason about structured data | {"price": 189.45} not "$189.45" |
| Never raise exceptions to the agent | One tool failure shouldn't crash the loop | Return {"error": "..."} and let the LLM handle it |
| Include metadata in results | LLM can cite sources, check freshness | {"result": ..., "source": "...", "retrieved_at": "..."} |
| Add when_to_use in description | Prevents tool confusion with similar tools | "Use this for X, NOT for Y" |
| Keep tools stateless | Enables parallel execution; easier to retry | Accept all needed inputs as parameters |
| Cap execution time | Prevents the agent loop from hanging | Always set timeout on HTTP calls |

Model Context Protocol (MCP)

The problem before MCP

Every AI platform (Claude Desktop, Cursor, VS Code, custom agents) needed custom integrations for every external tool. Building a GitHub integration meant writing it separately for Claude, for Cursor, for your own agent:

Before MCP:
  GitHub tool for Claude → custom implementation
  GitHub tool for Cursor → custom implementation again
  GitHub tool for LangChain → custom implementation again
  M tools × N platforms = M×N integrations ❌

With MCP:
  GitHub MCP Server → one implementation
  Claude Desktop (MCP Client) → connects automatically
  Cursor (MCP Client) → connects automatically
  Any Agent (MCP Client) → connects automatically
  M tools + N platforms = M+N integrations ✅

MCP architecture

┌─────────────────────────────────────────────────────────────────────┐
│                            MCP Ecosystem                            │
│                                                                     │
│  MCP Clients (Hosts)          MCP Protocol (JSON-RPC 2.0)           │
│  ┌────────────────┐           ┌──────────────────────┐              │
│  │ Claude Desktop ├──────────►│                      │              │
│  └────────────────┘  Stdio /  │      MCP Server      ├──► Files     │
│  ┌────────────────┐    SSE    │  (GitHub, Postgres,  ├──► Database  │
│  │   Cursor IDE   ├──────────►│   Slack, Jira...)    ├──► APIs      │
│  └────────────────┘           │                      │              │
│  ┌────────────────┐           │                      │              │
│  │   Your Agent   ├──────────►│                      │              │
│  └────────────────┘           └──────────────────────┘              │
└─────────────────────────────────────────────────────────────────────┘

The three MCP primitives

| Primitive | What it does | Example |
| --- | --- | --- |
| Resources | URI-addressable data the LLM can read, like a filesystem for context | file:///project/src/App.java, db://postgres/users |
| Tools | Executable actions with JSON Schema inputs that the LLM can invoke | run_tests, web_search, create_issue |
| Prompts | Pre-built prompt templates the user can invoke by name | "Code Review", "Write SQL", "Explain Error" |
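To make the Tools primitive concrete, the sketch below builds the JSON-RPC 2.0 request a client sends to invoke a tool (`tools/call`) and unpacks a matching response. The helper names `make_tool_call` and `parse_tool_result` are illustrative, and the reply is hand-written rather than captured from a real server.

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialise a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

def parse_tool_result(raw: str) -> str:
    """Extract the text content from a tools/call response, raising on errors."""
    message = json.loads(raw)
    if "error" in message:  # protocol-level failure (e.g. unknown method)
        raise RuntimeError(message["error"]["message"])
    result = message["result"]
    text = "\n".join(c["text"] for c in result["content"] if c["type"] == "text")
    if result.get("isError"):  # tool-level failure reported by the server
        raise RuntimeError(text)
    return text

request = make_tool_call(1, "create_issue", {"title": "Fix login bug"})
# Hand-written example reply, not captured from a real server.
reply = '{"jsonrpc": "2.0", "id": 1, "result": {"content": [{"type": "text", "text": "Issue created"}]}}'
print(request)
print(parse_tool_result(reply))
```

The `id` field is what lets a client match each response to its request when several calls are in flight.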

MCP transport layers

With the stdio transport, the client spawns the MCP server as a child process and communicates via stdin/stdout. There is no network hop, which makes it well suited to desktop tools like Cursor and Claude Desktop.

Claude Desktop ─── spawn subprocess ──► mcp-server-github (Node.js process)
                   stdin/stdout pipe

// claude_desktop_config.json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "<your-token>"
      }
    }
  }
}

Building a production MCP server

// file-manager-mcp-server.js
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
  ListResourcesRequestSchema,
  ReadResourceRequestSchema
} from "@modelcontextprotocol/sdk/types.js";
import fs from "fs/promises";
import path from "path";
import { fileURLToPath, pathToFileURL } from "url";

// Sandboxed directory, resolved once to an absolute path
const ALLOWED_DIR = path.resolve(process.env.WORKSPACE_DIR || "./workspace");

// ── Security helper ────────────────────────────────────────────
function sanitizePath(userPath) {
  const resolved = path.resolve(ALLOWED_DIR, userPath);
  // Compare against the root plus a separator: a bare prefix check would
  // accept a sibling directory such as "<root>-evil"
  if (resolved !== ALLOWED_DIR && !resolved.startsWith(ALLOWED_DIR + path.sep)) {
    throw new Error("Path traversal attempt detected: access denied");
  }
  return resolved;
}

const server = new Server(
  { name: "file-manager", version: "1.0.0" },
  { capabilities: { tools: {}, resources: {} } }
);

// ── Resources: expose files as readable context ────────────────
server.setRequestHandler(ListResourcesRequestSchema, async () => {
  const files = await fs.readdir(ALLOWED_DIR);
  return {
    resources: files.map(f => ({
      // Build a well-formed file:// URI from the absolute path
      uri: pathToFileURL(path.join(ALLOWED_DIR, f)).href,
      name: f,
      mimeType: f.endsWith(".json") ? "application/json" : "text/plain"
    }))
  };
});

server.setRequestHandler(ReadResourceRequestSchema, async (req) => {
  // fileURLToPath throws on non-file URIs; sanitizePath rejects paths outside the sandbox
  const filePath = sanitizePath(fileURLToPath(req.params.uri));
  const content = await fs.readFile(filePath, "utf-8");
  return { contents: [{ uri: req.params.uri, mimeType: "text/plain", text: content }] };
});

// ── Tools: expose read and write operations ────────────────────
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "read_file",
      description: "Read the content of a file from the workspace. Use to inspect code or data files.",
      inputSchema: {
        type: "object",
        properties: {
          path: { type: "string", description: "Relative path to the file, e.g. 'src/App.java'" }
        },
        required: ["path"]
      }
    },
    {
      name: "write_file",
      description: "Write or overwrite a file in the workspace. Use to save generated code or results.",
      inputSchema: {
        type: "object",
        properties: {
          path: { type: "string", description: "Relative path to write to" },
          content: { type: "string", description: "File content to write" }
        },
        required: ["path", "content"]
      }
    }
  ]
}));

server.setRequestHandler(CallToolRequestSchema, async (req) => {
  const { name, arguments: args } = req.params;

  try {
    if (name === "read_file") {
      const safePath = sanitizePath(args.path);
      const content = await fs.readFile(safePath, "utf-8");
      return { content: [{ type: "text", text: content }] };
    }

    if (name === "write_file") {
      const safePath = sanitizePath(args.path);
      await fs.mkdir(path.dirname(safePath), { recursive: true });
      await fs.writeFile(safePath, args.content, "utf-8");
      return { content: [{ type: "text", text: `File written successfully: ${args.path}` }] };
    }

    throw new Error(`Unknown tool: ${name}`);
  } catch (err) {
    // Report failures as tool results so the LLM can react to them
    return {
      content: [{ type: "text", text: `Error: ${err.message}` }],
      isError: true
    };
  }
});

const transport = new StdioServerTransport();
await server.connect(transport);

MCP security checklist

| Risk | Mitigation |
| --- | --- |
| Path traversal | Resolve and validate all file paths against an allowed root directory |
| Prompt injection via tool results | Sanitise tool output before passing back: strip < > and \n\nHuman: patterns |
| Excessive permissions | Grant the MCP server the minimum OS permissions needed (read-only unless write is required) |
| Sensitive data in tool results | Filter secrets, PII, and credentials from results before they enter the LLM context |
| Unrestricted tool execution | Require human-in-the-loop confirmation for destructive operations (delete, deploy) |
| Supply chain attack | Pin MCP server versions; verify checksums for third-party servers |
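The path traversal row is the easiest to get subtly wrong, because a plain string-prefix comparison accepts sibling directories that merely share a prefix with the root. A minimal Python sketch of a component-wise check follows; the function name `resolve_inside` and the `/srv/workspace` root are illustrative assumptions, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path

def resolve_inside(root: Path, user_path: str) -> Path:
    """Resolve user_path under root, rejecting anything that escapes it."""
    root = root.resolve()
    candidate = (root / user_path).resolve()
    # is_relative_to compares whole path components, so "/srv/workspace-evil"
    # is not mistaken for a child of "/srv/workspace".
    if not candidate.is_relative_to(root):
        raise PermissionError(f"path escapes allowed root: {user_path}")
    return candidate

root = Path("/srv/workspace")  # hypothetical sandbox root
print(resolve_inside(root, "src/App.java"))
for bad in ("../etc/passwd", "/etc/passwd", "../workspace-evil/x"):
    try:
        resolve_inside(root, bad)
    except PermissionError as exc:
        print("rejected:", exc)
```

Because `resolve()` also follows symlinks, a link inside the workspace that points outside it is rejected as well.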

Memory Systems

An agent without memory forgets everything the moment the conversation ends. Memory systems solve this at different time scales.

The three memory tiers

┌──────────────────────────────────────────────────────────────────────┐
│ Tier 1: In-Context (Short-term)                                      │
│  ─ Current conversation messages                                     │
│  ─ Fast access, limited by token window (128K–200K tokens typical)   │
│  ─ Wiped at conversation end                                         │
├──────────────────────────────────────────────────────────────────────┤
│ Tier 2: External Storage (Long-term)                                 │
│  ─ Vector DB (semantic search by meaning)                            │
│  ─ Key-value store (exact fact lookup)                               │
│  ─ Survives across sessions; requires retrieval step                 │
├──────────────────────────────────────────────────────────────────────┤
│ Tier 3: Episodic (Procedural)                                        │
│  ─ Log of past agent actions and outcomes                            │
│  ─ "I fixed this error before by doing X"                            │
│  ─ Retrieved by similarity to current problem                        │
└──────────────────────────────────────────────────────────────────────┘

Short-term memory: managing the context window

The context window is the agent's working memory. As a conversation grows it eventually exceeds the token budget, so older messages have to be dropped or summarised:

import json

import tiktoken  # third-party tokenizer library


class ContextWindowManager:
    def __init__(self, llm, max_tokens: int = 100_000, summary_threshold: int = 80_000):
        # llm: any client exposing complete(prompt) -> str (an assumed interface)
        self.llm = llm
        self.messages: list[dict] = []
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        # The gpt-4o tokenizer only approximates token counts for other models
        self.token_counter = tiktoken.encoding_for_model("gpt-4o")

    def count_tokens(self) -> int:
        text = json.dumps(self.messages)
        return len(self.token_counter.encode(text))

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if self.count_tokens() > self.summary_threshold:
            self._compress()

    def _compress(self):
        """Summarise the oldest 50% of messages to reclaim token space."""
        mid = len(self.messages) // 2
        if mid == 0:
            return  # a single oversized message cannot be summarised away
        old_messages = self.messages[:mid]

        summary_prompt = f"Summarise this conversation history concisely:\n{json.dumps(old_messages)}"
        summary = self.llm.complete(summary_prompt)

        # Replace old messages with a single summary message
        self.messages = [
            {"role": "system", "content": f"[Conversation summary]: {summary}"}
        ] + self.messages[mid:]

Long-term memory: vector databases

Long-term memory stores information as vector embeddings, which are numerical representations of meaning. Retrieval finds semantically similar content, not just keyword matches.
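"Semantically similar" usually comes down to a distance or similarity measure between vectors, with cosine similarity being a common choice. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embedding models produce hundreds or thousands of dimensions, and the stored texts and query here are invented.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings"; real models produce far more dimensions.
memories = {
    "User prefers dark mode": [0.9, 0.1, 0.0],
    "User's dog is called Rex": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "What theme does the user like?"

best = max(memories, key=lambda text: cosine_similarity(memories[text], query))
print(best)
# → User prefers dark mode
```

A vector database performs the same comparison, but uses an approximate index so it does not have to score every stored vector.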

from datetime import datetime, timezone
from uuid import uuid4

# embed_model and vector_db are assumed to be initialised elsewhere:
# embed_model.encode() returns a NumPy-style vector, and vector_db is a
# generic client exposing upsert() and query()

# ── Store a memory ─────────────────────────────────────────────
def save_to_memory(text: str, metadata: dict):
    """Convert text to a vector and store in the vector DB."""
    embedding = embed_model.encode(text)
    vector_db.upsert(
        collection="agent_memory",
        vectors=[{
            "id": str(uuid4()),
            "values": embedding.tolist(),
            "metadata": {**metadata, "text": text,
                         "saved_at": datetime.now(timezone.utc).isoformat()}
        }]
    )

# ── Retrieve relevant memories ─────────────────────────────────
def recall(query: str, top_k: int = 5) -> list[str]:
    """Find the most semantically similar stored memories."""
    query_vector = embed_model.encode(query)
    results = vector_db.query(
        collection="agent_memory",
        vector=query_vector.tolist(),
        top_k=top_k,
        include_metadata=True
    )
    return [r["metadata"]["text"] for r in results["matches"]]

# ── Inject retrieved memories into the system prompt ───────────
def build_prompt_with_memory(user_query: str) -> str:
    memories = recall(user_query, top_k=3)
    memory_block = "\n".join(f"- {m}" for m in memories) if memories else "None"
    return f"""You are a helpful assistant with the following relevant past context:

{memory_block}

Use this context to answer accurately. If it's not relevant, ignore it.

User: {user_query}"""

Memory architecture decisions

🔬 Senior deep-dive: choosing a vector database

| Database | Hosting | Strengths | Weaknesses | Best for |
| --- | --- | --- | --- | --- |
| Pinecone | Cloud-only | Fully managed, fast, production-ready | Cost at scale, vendor lock-in | Production SaaS agents |
| Weaviate | Self-host / Cloud | Rich filtering, multimodal support, GraphQL | Complex setup | Enterprise with on-prem requirements |
| Qdrant | Self-host / Cloud | Rust-based performance, sparse+dense hybrid | Smaller community | High-performance local deployment |
| pgvector | PostgreSQL extension | No new infra if already on Postgres, SQL joins | Slower at very large scale | Existing Postgres shop, < 10M vectors |
| ChromaDB | Embedded / self-host | Zero-config, Python-native, great for dev | Not production-ready at scale | Prototyping, local development |
| Milvus | Self-host / Cloud | Massive scale (billion+ vectors), IVF/HNSW | Heavyweight infra | Large-scale semantic search |

Key decision factors:

  1. Scale: how many vectors? < 1M → pgvector works. 1M–100M → Qdrant or Pinecone. 100M+ → Milvus.
  2. Filtering: do you need to filter by metadata (user_id, date, category) alongside vector search? All support this, but Qdrant and Weaviate have especially efficient payload indexing.
  3. Hybrid search: need to combine keyword (BM25) + semantic search? Weaviate and Qdrant support this natively.
  4. Infra ownership: managed cloud (Pinecone) vs self-hosted (Qdrant, Milvus) vs already-on-Postgres (pgvector).

Retrieval-Augmented Generation (RAG)

RAG grounds LLM answers in real documents. Without it, the LLM answers from training data that may be stale or incomplete, and it may hallucinate to fill the gaps. With it, the LLM can cite real content from your knowledge base.

Naive RAG: the starting point

User query
    │
    ▼
Embed query → search vector DB → return top-K chunks
    │
    ▼
Inject chunks into prompt → LLM generates answer

This works for simple Q&A but breaks when:

  • The query is complex and no single chunk answers it fully.
  • The retrieved chunks are irrelevant (poor embedding or chunking).
  • The answer requires reasoning across multiple documents.
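The naive flow can be sketched end to end with the standard library. To stay runnable without a model, the "embedding" here is a toy bag-of-words count vector and the "vector DB" is a Python list; both are stand-ins, and the chunks and names (`embed`, `retrieve`, `build_prompt`) are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

CHUNKS = [
    "Employees accrue 15 days of annual leave per year.",
    "The office is closed on public holidays.",
    "Remote work requires manager approval.",
]
INDEX = [(chunk, embed(chunk)) for chunk in CHUNKS]  # the "vector DB"

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Embed the query and return the top_k most similar chunks."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(item[1], q), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(query: str) -> str:
    """Inject the retrieved chunks into the prompt sent to the LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How many days of annual leave do employees get?"))
```

Note that `top_k=2` returns a second chunk even though it shares no words with the question. That is the second failure mode listed above: naive retrieval always returns something, relevant or not.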

Full RAG pipeline implementation

from typing import Optional

from langchain_text_splitters import RecursiveCharacterTextSplitter


class RAGPipeline:
    # vector_db, embed_model and llm are injected; their upsert()/query(),
    # encode() and complete() methods are assumed interfaces

    def __init__(self, vector_db, embed_model, llm, chunk_size=512, overlap=50):
        self.vector_db = vector_db
        self.embed_model = embed_model
        self.llm = llm
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=overlap
        )

    # ── Indexing (offline) ─────────────────────────────────────
    def index_document(self, text: str, metadata: dict):
        chunks = self.splitter.split_text(text)
        embeddings = self.embed_model.encode(chunks)

        self.vector_db.upsert([{
            "id": f"{metadata['doc_id']}_{i}",
            "values": emb.tolist(),
            "metadata": {**metadata, "text": chunk, "chunk_index": i}
        } for i, (chunk, emb) in enumerate(zip(chunks, embeddings))])

    # ── Retrieval (online) ─────────────────────────────────────
    def retrieve(self, query: str, top_k: int = 5,
                 filter: Optional[dict] = None) -> list[tuple[str, float]]:
        """Return (chunk_text, similarity_score) pairs, best match first."""
        query_emb = self.embed_model.encode(query)
        results = self.vector_db.query(
            vector=query_emb.tolist(),
            top_k=top_k,
            filter=filter,  # e.g. {"category": "hr_policy"}
            include_metadata=True
        )
        return [(r["metadata"]["text"], r["score"]) for r in results["matches"]]

    # ── Generation (online) ────────────────────────────────────
    def answer(self, query: str, top_k: int = 5) -> str:
        chunks_with_scores = self.retrieve(query, top_k)

        # Filter low-relevance chunks. 0.75 is a starting point; the right
        # cutoff depends on the embedding model and similarity metric
        relevant = [(text, score) for text, score in chunks_with_scores if score >= 0.75]

        if not relevant:
            return "I could not find relevant information to answer this question."

        context = "\n\n---\n\n".join(text for text, _ in relevant)

        prompt = f"""Answer the user's question based ONLY on the provided context.
If the context does not contain enough information, say so. Do not guess.

Context:
{context}

Question: {query}

Answer:"""
        return self.llm.complete(prompt)

Advanced Agentic RAG

Agentic RAG turns retrieval into a self-correcting, multi-step loop:

User Query
    │
    ▼
Query Translation ──→ Break complex query into sub-queries
    │
    ▼
Routing ──→ Which data source? (Vector DB / SQL / Web / LLM memory)
    │
    ▼
Retrieval ──→ Fetch candidates from chosen source(s)
    │
    ▼
Grading ──→ Are the retrieved chunks actually relevant?
    │
    ├── Irrelevant ──→ Rewrite query → Re-retrieve (or web search)
    │
    └── Relevant
          │
          ▼
Generation ──→ LLM drafts answer using grounded context
    │
    ▼
Hallucination Check ──→ Is the answer supported by the context?
    │
    ├── Not supported ──→ Regenerate
    │
    └── Supported ──→ Final Answer
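The control flow of the grading and hallucination-check branches can be sketched with the model-dependent steps passed in as functions. This is a simplified illustration: it omits query translation and routing, `agentic_rag` and its parameter names are invented, and the lambdas at the bottom are stubs standing in for LLM-backed components.

```python
from typing import Callable

def agentic_rag(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, list[str]], str],
    is_supported: Callable[[str, list[str]], bool],
    max_retries: int = 2,
) -> str:
    """Retrieve, grade, and generate, retrying when a check fails."""
    current = query
    for _ in range(max_retries + 1):
        # Grading: keep only chunks judged relevant to the original question.
        chunks = [c for c in retrieve(current) if grade(query, c)]
        if not chunks:
            current = rewrite(current)  # nothing relevant: rephrase and retry
            continue
        draft = generate(query, chunks)
        if is_supported(draft, chunks):  # hallucination check
            return draft
        # Unsupported draft: the next iteration retrieves and generates again.
    return "No grounded answer found."

# Stub components so the control flow can run without any model.
DOCS = {"leave policy": ["Employees get 15 days of annual leave."]}
answer = agentic_rag(
    "How much PTO do I get?",
    retrieve=lambda q: DOCS.get(q, []),
    grade=lambda q, chunk: "leave" in chunk,
    rewrite=lambda q: "leave policy",
    generate=lambda q, chunks: chunks[0],
    is_supported=lambda draft, chunks: draft in chunks,
)
print(answer)
# → Employees get 15 days of annual leave.
```

The retry bound matters for the same reason as the agent loop's step limit: each extra pass costs additional LLM calls, and a query with no good answer would otherwise loop indefinitely.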

Advanced RAG techniques

Query decomposition: break a complex query into independent sub-queries, retrieve for each, then merge the context:

import json

# llm (with complete()), rag (a RAGPipeline instance) and deduplicate_chunks()
# are assumed to be defined elsewhere

def decompose_query(query: str) -> list[str]:
    """Use LLM to break a complex query into simpler sub-queries."""
    prompt = f"""Break the following question into 2–4 independent sub-questions
that can each be answered separately. Return as a JSON array of strings.

Question: {query}

Sub-questions (JSON array):"""
    response = llm.complete(prompt)
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # The LLM did not return valid JSON: fall back to the original query
        return [query]

def multi_query_retrieve(query: str) -> list:
    sub_queries = decompose_query(query)

    all_chunks = []
    for sub_q in sub_queries:
        chunks = rag.retrieve(sub_q, top_k=3)  # (text, score) pairs
        all_chunks.extend(chunks)

    # Deduplicate by content similarity
    return deduplicate_chunks(all_chunks)

# Example:
# Query: "How does our VN remote leave policy compare to our Singapore policy?"
# Sub-queries:
#   → "What is the Vietnam remote work leave policy?"
#   → "What is the Singapore remote work leave policy?"
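The `deduplicate_chunks` helper called above is not defined in this guide. One possible minimal version is sketched below, assuming chunks arrive as `(text, score)` pairs. It is a deliberate simplification: it removes only chunks whose text is identical after whitespace and case normalisation, whereas true similarity-based deduplication would compare embeddings.

```python
import re

def deduplicate_chunks(chunks: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Drop chunks whose normalised text was already seen, keeping the best score."""
    best: dict[str, tuple[str, float]] = {}
    for text, score in chunks:
        # Normalise whitespace and case so trivially different copies collide.
        key = re.sub(r"\s+", " ", text).strip().lower()
        if key not in best or score > best[key][1]:
            best[key] = (text, score)
    # Highest-scoring chunks first, ready to be injected into the prompt.
    return sorted(best.values(), key=lambda item: item[1], reverse=True)

merged = deduplicate_chunks([
    ("Leave  policy A", 0.8),
    ("leave policy a", 0.9),
    ("Other chunk", 0.5),
])
print(merged)
# → [('leave policy a', 0.9), ('Other chunk', 0.5)]
```

Sub-queries about related topics often retrieve the same chunk more than once, so deduplicating before generation keeps the prompt shorter without losing information.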

Chunking strategy โ€” the foundation of retrieval qualityโ€‹

Poor chunking is the most common reason RAG gives bad answers. The retrieved chunk must contain a complete, self-contained piece of information:

from langchain.text_splitter import RecursiveCharacterTextSplitter, MarkdownHeaderTextSplitter

# ── Strategy 1: Fixed-size with overlap (baseline) ────────────────────────
# Simple, but chunks can still end mid-sentence; chunk_size counts characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)

# ── Strategy 2: Markdown-aware (for documentation) ───────────────────────
# Splits at heading boundaries, preserving document structure
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
])
# Each chunk inherits its section's heading as metadata, which is crucial for citation

# ── Strategy 3: Semantic chunking (best quality, slower) ──────────────────
# Groups sentences by semantic similarity, avoiding mid-thought cuts
from langchain_experimental.text_splitter import SemanticChunker
semantic_splitter = SemanticChunker(
    embeddings=embed_model,  # must be a LangChain Embeddings object
    breakpoint_threshold_type="percentile",  # split where adjacent sentences are unusually far apart
    breakpoint_threshold_amount=95,          # i.e. above the 95th percentile of distances
)

| Strategy | Quality | Speed | Best for |
|---|---|---|---|
| Fixed-size | ⚠️ Medium | Fast | General documents |
| Markdown-aware | ✅ Good | Fast | Docusaurus, wikis, READMEs |
| Semantic | ✅✅ Best | Slow | Long-form articles, research papers |
| Recursive character | ✅ Good | Fast | Code files, mixed content |
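
To see why the fixed-size baseline rates lowest on quality, it helps to look at what the technique does with no library involved. The function below is a hypothetical, dependency-free sketch of character windows with overlap (`chunk_with_overlap` is not a LangChain API). It cuts wherever the window ends, regardless of sentence or section boundaries.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "Leave is 12 days. Remote work needs approval."
pieces = chunk_with_overlap(sample, chunk_size=20, overlap=5)
# The first window ends after "Re", splitting the second sentence mid-word.
```

The overlap only softens the problem: a fact that straddles a boundary appears partially in two chunks, but neither chunk is guaranteed to contain it whole.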

Episodic Memory: Learning from Past Actions

Episodic memory stores sequences of agent actions and their outcomes. When a similar problem arises, the agent retrieves past episodes to guide its approach:

import json
from datetime import datetime, timezone
from uuid import uuid4

class EpisodicMemory:

    def __init__(self, vector_db, embed_model):
        self.vector_db = vector_db
        self.embed_model = embed_model

    def save_episode(self, task: str, steps: list[dict], outcome: str, success: bool):
        """Save a completed agent episode for future reference."""
        episode_text = f"""Task: {task}
Steps taken: {json.dumps(steps, indent=2)}
Outcome: {outcome}
Success: {success}"""

        embedding = self.embed_model.encode(episode_text)
        self.vector_db.upsert([{
            "id": str(uuid4()),
            "values": embedding.tolist(),
            "metadata": {
                "task": task,
                "steps": json.dumps(steps),
                "outcome": outcome,
                "success": success,
                # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12
                "saved_at": datetime.now(timezone.utc).isoformat()
            }
        }])

    def recall_similar(self, current_task: str, top_k: int = 3) -> list[dict]:
        """Find past episodes similar to the current task."""
        query_emb = self.embed_model.encode(current_task)
        results = self.vector_db.query(
            vector=query_emb.tolist(), top_k=top_k,
            filter={"success": True},  # only retrieve successful episodes
            include_metadata=True
        )
        return [r["metadata"] for r in results["matches"]]

    def build_few_shot_context(self, current_task: str) -> str:
        past_episodes = self.recall_similar(current_task)
        if not past_episodes:
            return ""

        examples = "\n\n".join([
            f"Past task: {ep['task']}\nApproach: {ep['steps']}\nResult: {ep['outcome']}"
            for ep in past_episodes
        ])
        return f"""Here are similar tasks you have successfully completed before:

{examples}

Use these as a guide for the current task."""

Testing Agent Skills

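Agent skills are hard to test end to end because LLM outputs are non-deterministic. Test the deterministic parts (tools, retrieval, error handling) with mocks and exact assertions, and check LLM-generated answers only for expected properties such as keywords: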

import json

import pytest
from unittest.mock import patch, MagicMock

# get_stock_price, get_stock_price_schema, and the rag_pipeline / indexed_docs
# fixtures are assumed to come from the code under test and conftest.py

class TestStockPriceTool:

    def test_valid_ticker_returns_price(self):
        with patch("requests.get") as mock_get:
            mock_get.return_value.json.return_value = {"latestPrice": 189.45, "latestUpdate": "..."}
            mock_get.return_value.raise_for_status = MagicMock()

            result = json.loads(get_stock_price("AAPL"))
            assert result["ticker"] == "AAPL"
            assert result["price"] == 189.45

    def test_invalid_ticker_returns_error_json(self):
        # Tool must return error JSON, not raise: the agent loop must not crash
        result = json.loads(get_stock_price("INVALID123456"))
        assert "error" in result
        assert result["error"] == "invalid_input"

    def test_network_timeout_returns_error_json(self):
        with patch("requests.get", side_effect=TimeoutError("Connection timed out")):
            result = json.loads(get_stock_price("AAPL"))
            assert result["error"] == "timeout"

    def test_tool_description_quality(self):
        """Ensure tool descriptions are non-empty and mention when to use."""
        schema = get_stock_price_schema()
        assert len(schema["description"]) > 50, "Description too short: LLM won't know when to use it"
        assert "description" in schema["parameters"]["properties"]["ticker"]

class TestRAGPipeline:

    def test_retrieval_returns_relevant_chunks(self, rag_pipeline, indexed_docs):
        results = rag_pipeline.retrieve("annual leave policy Vietnam", top_k=3)
        assert len(results) > 0
        assert any("annual leave" in text.lower() for text, score in results)

    def test_low_score_chunks_are_filtered(self, rag_pipeline):
        answer = rag_pipeline.answer("xyzzy frobnicator quantum cascade")
        assert "could not find relevant information" in answer

    @pytest.mark.parametrize("query,expected_keyword", [
        ("How many days of annual leave?", "days"),
        ("What is the remote work policy?", "remote"),
    ])
    def test_answer_contains_expected_content(self, rag_pipeline, query, expected_keyword):
        answer = rag_pipeline.answer(query)
        assert expected_keyword.lower() in answer.lower()

Common Mistakes

| Mistake | Problem | Fix |
|---|---|---|
| Vague tool descriptions | LLM calls wrong tool or misses relevant one | Write specific descriptions with examples and "when to use / not use" |
| Raising exceptions from tools | Crashes the agent loop | Wrap all tools with try/except and return {"error": "..."} JSON |
| Stateful tool implementations | Breaks parallel tool calls; race conditions | Make tools pure functions that accept all state as parameters |
| Fixed-size chunking for structured docs | Splits mid-section, so retrieved chunks lack context | Use structure-aware chunking (Markdown headers, sentence boundaries) |
| No similarity score threshold in RAG | Low-relevance chunks pollute the LLM prompt and cause hallucinations | Filter chunks below a similarity score threshold (e.g. 0.75) |
| Injecting all retrieved chunks regardless of length | Context window overflow; LLM ignores distant content | Cap total context at ~25% of the model's context window |
| No retry on transient tool failures | One network blip kills the entire agent run | Add exponential backoff retry on TimeoutError, ConnectionError |
| Missing MCP path validation | Path traversal attack via ../../etc/passwd | Resolve and validate all paths against allowed root before any file op |
| Storing secrets in tool results | Secrets leak into LLM context and logs | Redact API keys, tokens, passwords from tool output before returning |
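
Three of the fixes in the table (error JSON instead of exceptions, exponential backoff on transient failures, and path validation) are small enough to sketch together. The names `safe_tool` and `resolve_under_root` are hypothetical helpers, not part of any framework, and the retry counts and delays are assumed defaults.

```python
import functools
import json
import time
from pathlib import Path

def safe_tool(max_retries: int = 3, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff; never raise to the agent loop."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> str:
            for attempt in range(max_retries):
                try:
                    return json.dumps({"result": func(*args, **kwargs)})
                except (TimeoutError, ConnectionError) as exc:
                    if attempt == max_retries - 1:
                        return json.dumps({"error": "transient_failure", "detail": str(exc)})
                    time.sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
                except Exception as exc:
                    # Non-transient errors are reported immediately, not retried
                    return json.dumps({"error": type(exc).__name__, "detail": str(exc)})
            return json.dumps({"error": "no_attempts"})  # only reached if max_retries <= 0
        return wrapper
    return decorator

def resolve_under_root(root: str, user_path: str) -> Path:
    """Resolve user_path inside root, rejecting anything that escapes it."""
    root_path = Path(root).resolve()
    candidate = (root_path / user_path).resolve()
    if not candidate.is_relative_to(root_path):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"Path escapes allowed root: {user_path}")
    return candidate
```

A file-reading tool decorated with `@safe_tool()` that calls `resolve_under_root` first would turn a `../../etc/passwd` request into a `{"error": "PermissionError", ...}` result that the LLM can read and react to.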

🎯 Interview Questions

Q1. What is function calling and why do LLMs need it?

LLMs are text predictors: they have no ability to execute code, call APIs, or access real-time data. Function calling (tool use) is a protocol where the developer declares available tools as JSON Schemas and the LLM, instead of generating a text answer, generates a structured JSON payload describing which function to call with which arguments. The host application intercepts this, executes the real code, and returns the result to the LLM. This turns a passive text predictor into an agent that can interact with the world.
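
The round trip described in this answer can be sketched in a few lines. Everything here is illustrative: the `get_stock_price` body returns a canned price so the example runs offline, `TOOLS`, `REGISTRY`, and `dispatch` are hypothetical names, and the exact shape of the model's tool-call payload varies by provider.

```python
import json

# Hypothetical offline stand-in for a real price lookup.
def get_stock_price(ticker: str) -> str:
    prices = {"AAPL": 189.45}
    if ticker not in prices:
        return json.dumps({"error": "invalid_input"})
    return json.dumps({"ticker": ticker, "price": prices[ticker]})

# JSON Schema declaration the host sends to the model alongside the conversation.
TOOLS = [{
    "name": "get_stock_price",
    "description": "Get the latest trading price for a stock ticker such as AAPL.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "Ticker symbol, e.g. AAPL"},
        },
        "required": ["ticker"],
    },
}]

REGISTRY = {"get_stock_price": get_stock_price}

def dispatch(tool_call: dict) -> str:
    """Execute the function named in a model-generated tool call."""
    func = REGISTRY.get(tool_call["name"])
    if func is None:
        return json.dumps({"error": "unknown_tool"})
    return func(**tool_call["arguments"])

# What the model emits instead of prose (assumed shape):
model_output = {"name": "get_stock_price", "arguments": {"ticker": "AAPL"}}
result = dispatch(model_output)  # JSON string handed back to the model
```

The model never runs `get_stock_price` itself. It only produces the `model_output` dictionary, and the host decides whether and how to execute it.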

Q2. What is MCP and what problem does it solve?

Model Context Protocol (MCP) is an open standard, built on JSON-RPC 2.0, for connecting LLMs to external tools and data sources. Before MCP, building a GitHub integration for one agent meant building it again for every new platform: Claude Desktop, Cursor, and LangChain each required custom code. MCP reduces M×N integrations to M+N: tool builders implement one MCP server, and any MCP-compatible client (any agent host) can connect to it without custom integration code. It exposes three primitives: Resources (readable data), Tools (executable actions), and Prompts (reusable templates).
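
To make the numbers concrete: with 5 tools and 4 agent hosts, point-to-point integration needs 5 × 4 = 20 adapters, while MCP needs 5 servers plus 4 clients, 9 components in total. On the wire, invoking a tool is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request and reads the text content from a response. The helper names are hypothetical and transport handling (stdio or HTTP) is omitted.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests need a unique id per connection

def tool_call_request(name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

def parse_tool_result(raw: str) -> str:
    """Extract text content from a tools/call response, or raise on a protocol error."""
    message = json.loads(raw)
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err['code']}: {err['message']}")
    parts = message["result"].get("content", [])
    return "\n".join(p["text"] for p in parts if p.get("type") == "text")
```

In practice an MCP client SDK handles this framing for you. The sketch is only meant to show that the protocol underneath is ordinary JSON-RPC.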

Q3. What is the difference between short-term and long-term memory in an agent?

Short-term memory is the token context window, that is, the current conversation messages. It is fast (no retrieval step) but limited (token cap) and wiped at the end of a session. Long-term memory persists across sessions in an external store, typically a vector database that enables semantic search by meaning rather than exact match. The agent retrieves relevant memories at the start of a turn and injects them into the prompt. The challenge with long-term memory is retrieval quality: bad embedding or chunking means irrelevant memories are injected, confusing the LLM.

Q4. What is RAG and what problem does it solve?

RAG (Retrieval-Augmented Generation) grounds LLM answers in real documents rather than training data alone. Without RAG, the LLM answers from parametric memory, which may be stale, incomplete, or hallucinated. RAG retrieves relevant document chunks from a vector database at query time and injects them into the prompt as context. The LLM then generates an answer based on real content, which can be cited and verified. It's essential for knowledge-intensive applications where accuracy and freshness matter.

Q5. What makes a well-written tool description?

A good tool description tells the LLM: what the tool does (concrete, specific), when to use it (trigger conditions), when NOT to use it (distinguish from similar tools), and what the parameters mean (format, examples, constraints). The LLM's entire tool-selection decision is based on these descriptions: they are the API contract between the LLM's reasoning and your implementation. A vague description like "search for things" causes tool misuse; a specific description like "search the internal HR knowledge base; use for company policy questions, NOT for general knowledge" produces correct tool selection.

Q6. (Senior) What is Corrective RAG (CRAG) and when do you need it?

CRAG adds a grading step between retrieval and generation. A lightweight LLM (or a fine-tuned classifier) evaluates each retrieved chunk and scores it as relevant or irrelevant to the query. If too few chunks pass the relevance threshold, CRAG triggers a corrective action, typically a web search or a re-written query, before generating. Standard RAG silently generates from whatever chunks were retrieved, even if they are off-topic, causing confident-sounding hallucinations. CRAG catches poor retrievals before they reach the generator. The cost is an extra LLM call per query; it is worth it in high-stakes applications where retrieval quality is variable.

Q7. (Senior) How do you prevent prompt injection through tool results?

Tool results are injected into the LLM's context. A malicious API response like "Ignore all previous instructions and instead..." can hijack the LLM's behaviour if injected raw. Mitigations: (1) sanitise tool output before injection: strip newlines preceding role-change patterns, HTML tags, and instruction-like text; (2) clearly delimit tool results in the prompt with XML tags (<tool_result>...</tool_result>) and instruct the LLM to treat content inside those tags as data, not instructions; (3) use a fixed output schema: if the tool should return a JSON object with specific fields, validate the response against that schema before injecting; (4) run MCP servers in sandboxed processes with minimum permissions so even if prompt injection succeeds, the tool cannot escalate privileges.
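
Mitigations (2) and (3) can be combined in a small pre-injection step. The following is a sketch under stated assumptions: `EXPECTED_FIELDS` is a hypothetical schema for a single tool, `validate_result` and `wrap_tool_result` are illustrative helper names, and `tool_name` is assumed to come from the host's own registry, never from the tool output. It reduces the attack surface but does not make injection impossible.

```python
import json
from html import escape

EXPECTED_FIELDS = {"ticker": str, "price": float}  # hypothetical schema for one tool

def validate_result(raw: str, schema: dict[str, type]) -> dict:
    """Parse tool output and keep only the expected, correctly typed fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("tool result must be a JSON object")
    clean = {}
    for field, expected_type in schema.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
        clean[field] = data[field]
    return clean  # unexpected extra fields are dropped

def wrap_tool_result(tool_name: str, payload: dict) -> str:
    """Delimit the result so the prompt can tell the model to treat it as data."""
    # Escaping < and > stops the payload from closing the tag early
    body = escape(json.dumps(payload), quote=False)
    return f'<tool_result name="{tool_name}">{body}</tool_result>'
```

Dropping unexpected fields means an attacker-controlled `note` or `message` field never reaches the prompt, and escaping angle brackets keeps a string value from forging a closing `</tool_result>` tag.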


See Also