Chapter 17: Production Deployment
Taking an AI agent to production brings unique challenges that do not exist in traditional web services. This chapter covers everything you need to deploy agents reliably, scalably, and economically.
1. Deployment Considerations for Agents
Agents vs. Traditional Web Services
A typical web service responds in milliseconds. An agent can take anywhere from 30 seconds to 30 minutes. This fundamental difference changes everything: the architecture, the networking, the scaling, the error handling, and the user experience.
flowchart LR
    subgraph WebService["Traditional Web Service"]
        direction TB
        R1["Request"] --> P1["Process 50ms"] --> RE1["Response"]
    end
    subgraph Agent["AI Agent"]
        direction TB
        R2["Request"] --> L1["LLM Call 1"]
        L1 --> T1["Tool: Bash"]
        T1 --> L2["LLM Call 2"]
        L2 --> T2["Tool: Read File"]
        T2 --> L3["LLM Call 3"]
        L3 --> RE2["Response 2-30min"]
    end
Key differences:
| Aspect | Web Service | AI Agent |
|---|---|---|
| Typical duration | 10-500ms | 30s - 30min |
| External calls | 0-2 per request | 5-50+ (API + tools) |
| State | Stateless | Stateful (active session) |
| CPU usage | Low | Low (I/O bound) |
| Memory usage | Stable | Grows with context |
| Cost per request | Fixed | Variable (tokens used) |
| Interruptions | Retrying is trivial | Retrying can duplicate work |
| Client connection | Plain HTTP | SSE/WebSocket for streaming |
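The streaming connection in the last row can be served with Server-Sent Events. A minimal sketch; the agent output here is simulated, and in a real service each chunk would come from the SDK's `query(...)` loop rather than a hard-coded list:

```python
import asyncio
import json

def sse_event(data: dict, event: str = "message") -> str:
    # One SSE frame: an event name plus a JSON payload, terminated by a blank line
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

async def stream_agent(prompt: str):
    # Hypothetical stand-in for the query() loop: yields progress as it happens,
    # so the client sees activity during a multi-minute run instead of a silent socket
    for chunk in ("Analyzing repository...", "Running tests...", "Done"):
        await asyncio.sleep(0)  # placeholder for real I/O waiting
        yield sse_event({"content": chunk})
    yield sse_event({"done": True}, event="end")
```

In FastAPI this async generator would be wrapped in a `StreamingResponse(..., media_type="text/event-stream")` so the proxy and client keep the connection open for the full run.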
Stateful vs. Stateless: Agent Sessions
The SDK lets you resume sessions via the session_id. In production, this means the session state must be reachable from any instance of the service:
stateDiagram-v2
    [*] --> NoSession: First call
    NoSession --> ActiveSession: query() starts a session
    ActiveSession --> SuspendedSession: Timeout / user disconnects
    SuspendedSession --> ActiveSession: query() with session_id
    ActiveSession --> CompletedSession: Agent finishes
    CompletedSession --> [*]
    note right of SuspendedSession
        The session_id is stored
        in Redis or a DB so the
        session can be resumed
        from any server instance
    end note
Session management strategies:
# Option 1: Stateless - no resumption (simplest, fewest capabilities)
async def run_agent_stateless(prompt: str) -> str:
    results = []
    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(allowed_tools=["Read", "Bash"])
    ):
        if hasattr(message, 'content'):
            results.append(str(message.content))
    return "\n".join(results)
# Option 2: Stateful with Redis
import redis.asyncio as redis
import json

async def run_agent_stateful(
    prompt: str,
    session_key: str,
    redis_client: redis.Redis
) -> tuple[str, str]:
    # Fetch the previous session_id if one exists
    stored = await redis_client.get(f"session:{session_key}")
    session_id = json.loads(stored)["session_id"] if stored else None
    results = []
    new_session_id = None
    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(
            resume=session_id,
            allowed_tools=["Read", "Edit", "Bash"]
        )
    ):
        if hasattr(message, 'session_id'):
            new_session_id = message.session_id
        if hasattr(message, 'content'):
            results.append(str(message.content))
    # Store the new session_id with a 24-hour TTL
    if new_session_id:
        await redis_client.setex(
            f"session:{session_key}",
            86400,
            json.dumps({"session_id": new_session_id})
        )
    return "\n".join(results), new_session_id
Resources: CPU, Memory, and Cost
Agents are mostly I/O-bound, not CPU-bound. The main cost is the API:
pie title "Time breakdown of a typical agent"
    "Waiting for LLM response" : 65
    "Running tools (Bash, Read)" : 25
    "Python/TS processing" : 5
    "Networking and overhead" : 5
Resource planning per instance:
- CPU: 0.25-0.5 vCPU is enough (I/O bound)
- Memory: 256MB minimum, 512MB recommended (the context grows)
- Timeout: set to 30-60 minutes for long tasks
- API cost: estimate 0.01-0.50 USD per query depending on complexity
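That per-query estimate can be computed directly from token counts. A back-of-the-envelope sketch; the per-million-token prices used as defaults here are placeholder assumptions, not current Anthropic pricing:

```python
def estimate_query_cost(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_mtok: float = 3.00,    # assumed price, USD per million input tokens
    output_usd_per_mtok: float = 15.00,  # assumed price, USD per million output tokens
) -> float:
    """Rough USD cost of one agent query, derived from its token usage."""
    return (input_tokens / 1_000_000) * input_usd_per_mtok \
         + (output_tokens / 1_000_000) * output_usd_per_mtok

# A mid-size agent run: ~100k input tokens accumulated across turns, ~20k output tokens
cost = estimate_query_cost(100_000, 20_000)
```

Plugging real usage numbers from the SDK's cost/usage fields into a function like this is how the 0.01-0.50 USD range above is derived in practice.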
Scaling: API Rate Limiting
Anthropic's rate limiting is the real bottleneck, not server resources:
flowchart TD
    Users["100 concurrent users"] --> LoadBalancer["Load Balancer"]
    LoadBalancer --> I1["Instance 1"]
    LoadBalancer --> I2["Instance 2"]
    LoadBalancer --> I3["Instance 3"]
    I1 --> RateLimiter["Shared Rate Limiter\n(Redis Token Bucket)"]
    I2 --> RateLimiter
    I3 --> RateLimiter
    RateLimiter --> API["Anthropic API\n(1000 RPM limit)"]
    style RateLimiter fill:#f96,color:#000
    style API fill:#f66,color:#fff
Adding more instances without a shared rate limiter only produces 429 errors. The solution is a centralized token bucket in Redis.
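The token-bucket logic itself is small. The sketch below keeps the bucket in process memory so it is easy to test; in production the same refill-then-take step would run atomically in Redis (typically as a Lua script) so that every instance draws from one shared bucket:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter (in-process sketch of the shared Redis version)."""

    def __init__(self, rate_per_minute: float, capacity: int):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = capacity
        self.tokens = float(capacity)       # start full
        self.last = time.monotonic()

    def try_acquire(self, n: int = 1) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller should back off instead of hitting the API

# Mirror the upstream limit, e.g. 1000 RPM with a small burst allowance
bucket = TokenBucket(rate_per_minute=1000, capacity=50)
```

Each instance would call `try_acquire()` before every Anthropic API request and sleep-and-retry on `False`, which converts would-be 429s into brief local waits.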
2. Docker for Python Agents
Optimized Multi-Stage Dockerfile
# syntax=docker/dockerfile:1

# ============================================================
# Stage 1: Builder - installs dependencies with caching
# ============================================================
FROM python:3.12-slim AS builder
WORKDIR /app

# System packages needed to compile dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first so the dependency layer is cached
COPY requirements.txt .

# Install dependencies into a separate prefix to copy into the runtime
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ============================================================
# Stage 2: Claude CLI installer
# ============================================================
FROM node:20-slim AS claude-installer

# Install the Claude Code CLI globally
RUN npm install -g @anthropic-ai/claude-code && \
    # Verify the installation
    claude --version

# ============================================================
# Stage 3: Final runtime
# ============================================================
FROM python:3.12-slim AS runtime

# Metadata
LABEL maintainer="[email protected]"
LABEL description="Claude Code SDK Agent"
LABEL version="1.0.0"

# Install the Node.js runtime (required by the Claude CLI)
RUN apt-get update && apt-get install -y --no-install-recommends \
    nodejs \
    ca-certificates \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy the Claude CLI from the installer stage
COPY --from=claude-installer /usr/local/lib/node_modules /usr/local/lib/node_modules
COPY --from=claude-installer /usr/local/bin/claude /usr/local/bin/claude
COPY --from=claude-installer /usr/local/bin/node /usr/local/bin/node

# Copy the installed Python dependencies
COPY --from=builder /install /usr/local

# Create a non-root user for security
RUN groupadd --gid 1001 agentgroup && \
    useradd --uid 1001 --gid agentgroup --shell /bin/bash --create-home agentuser

WORKDIR /app

# Copy the application code
COPY --chown=agentuser:agentgroup . .

# Switch to the non-root user
USER agentuser

# Environment variables (no defaults for secrets)
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PORT=8000 \
    LOG_LEVEL=INFO \
    MAX_CONCURRENT_AGENTS=5

# Health check against the API's health endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:${PORT}/health || exit 1

EXPOSE ${PORT}

CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Complete requirements.txt:
# Claude Code SDK
claude-code-sdk>=0.0.14

# Web framework
fastapi>=0.115.0
uvicorn[standard]>=0.34.0

# Async and utilities
httpx>=0.28.0
pydantic>=2.10.0
pydantic-settings>=2.7.0

# Shared state
redis[hiredis]>=5.2.0

# Observability
structlog>=25.1.0
prometheus-client>=0.21.0
opentelemetry-api>=1.29.0
opentelemetry-sdk>=1.29.0
opentelemetry-instrumentation-fastapi>=0.50b0

# Resilience
tenacity>=9.0.0
FastAPI server main.py:
import asyncio
import os
from contextlib import asynccontextmanager

import structlog
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

from claude_code_sdk import query, ClaudeCodeOptions

logger = structlog.get_logger()

class AgentRequest(BaseModel):
    prompt: str
    session_key: str | None = None
    tools: list[str] = ["Read", "Bash"]
    max_tokens: int = 8192

class AgentResponse(BaseModel):
    result: str
    session_id: str | None = None
    cost_usd: float | None = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("agent_server_starting", port=os.getenv("PORT", 8000))
    yield
    logger.info("agent_server_stopping")

app = FastAPI(title="Claude Agent API", lifespan=lifespan)

@app.get("/health")
async def health():
    return {"status": "healthy", "service": "claude-agent"}

@app.post("/agent/run", response_model=AgentResponse)
async def run_agent(request: AgentRequest):
    log = logger.bind(prompt_preview=request.prompt[:50])
    results = []
    session_id = None
    total_cost = 0.0
    try:
        async for message in query(
            prompt=request.prompt,
            options=ClaudeCodeOptions(
                allowed_tools=request.tools,
                max_turns=50
            )
        ):
            if hasattr(message, 'session_id') and message.session_id:
                session_id = message.session_id
            if hasattr(message, 'cost_usd') and message.cost_usd:
                total_cost += message.cost_usd
            if hasattr(message, 'content'):
                content = str(message.content)
                if content:
                    results.append(content)
        log.info("agent_completed", cost_usd=total_cost)
        return AgentResponse(
            result="\n".join(results),
            session_id=session_id,
            cost_usd=total_cost if total_cost > 0 else None
        )
    except Exception as e:
        log.error("agent_failed", error=str(e))
        raise HTTPException(status_code=500, detail=str(e))
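The MAX_CONCURRENT_AGENTS variable baked into the image is not actually enforced by the handler above. One way to enforce it is an asyncio.Semaphore wrapper; this is a sketch, and `run_with_slot` / `AtCapacity` are illustrative names, not part of the SDK or FastAPI:

```python
import asyncio
import os

MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT_AGENTS", "5"))
agent_slots = asyncio.Semaphore(MAX_CONCURRENT)

class AtCapacity(RuntimeError):
    """Raised when all agent slots are busy; map it to HTTP 503 in the endpoint."""

async def run_with_slot(agent_coro_factory):
    # Fail fast instead of queueing: a queued agent request could wait many minutes,
    # and the client is better served by a 503 plus a retry against another replica
    if agent_slots.locked():
        raise AtCapacity("server at max concurrent agents")
    async with agent_slots:
        return await agent_coro_factory()
```

Inside the endpoint, the `query(...)` loop would be wrapped in a coroutine and passed to `run_with_slot`, with `AtCapacity` translated into `HTTPException(status_code=503)`.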
Complete docker-compose.yml
version: "3.9"

services:
  agent:
    build:
      context: .
      dockerfile: Dockerfile
      target: runtime
    image: claude-agent:latest
    container_name: claude-agent
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      # Anthropic API key (never hardcode it)
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:?ANTHROPIC_API_KEY is required}
      # Configuration
      PORT: 8000
      LOG_LEVEL: ${LOG_LEVEL:-INFO}
      MAX_CONCURRENT_AGENTS: ${MAX_CONCURRENT_AGENTS:-5}
      # Redis for shared state
      REDIS_URL: redis://redis:6379/0
    env_file:
      - .env.local  # For local development (do NOT commit)
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
        reservations:
          cpus: "0.1"
          memory: 256M
    volumes:
      # Only for logs if centralized logging is not used
      - agent_logs:/app/logs
    networks:
      - agent_network

  redis:
    image: redis:7-alpine
    container_name: claude-agent-redis
    restart: unless-stopped
    command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - agent_network

  # Optional: Nginx as a reverse proxy
  nginx:
    image: nginx:alpine
    container_name: claude-agent-nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - agent
    networks:
      - agent_network

volumes:
  redis_data:
  agent_logs:

networks:
  agent_network:
    driver: bridge
Image size optimization:
# Check the image size
docker images claude-agent

# Inspect the layers
docker history claude-agent:latest

# Typical reduction with multi-stage builds:
# Without multi-stage: ~800MB
# With multi-stage: ~250MB
# With distroless: ~150MB (advanced)
3. Docker for TypeScript Agents
Dockerfile with Node.js + Claude Code CLI
# syntax=docker/dockerfile:1

# ============================================================
# Stage 1: Dependencies
# ============================================================
FROM node:20-slim AS deps
WORKDIR /app

# Copy only the dependency manifests
COPY package.json package-lock.json* ./

# Install production dependencies
RUN npm ci --omit=dev && \
    # The Claude Code CLI is installed as a project dependency,
    # so there is no need to install it globally as well
    npm cache clean --force

# ============================================================
# Stage 2: TypeScript builder
# ============================================================
FROM node:20-slim AS builder
WORKDIR /app
COPY package.json package-lock.json* ./
COPY tsconfig.json ./

# Install ALL dependencies (including devDependencies, needed to compile)
RUN npm ci

# Copy the source code
COPY src/ ./src/

# Compile TypeScript
RUN npm run build

# ============================================================
# Stage 3: Runtime
# ============================================================
FROM node:20-slim AS runtime

LABEL description="Claude Agent TypeScript"
LABEL version="1.0.0"

# Install curl for health checks
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create a non-root user
RUN groupadd --gid 1001 agentgroup && \
    useradd --uid 1001 --gid agentgroup --shell /bin/bash --create-home agentuser

WORKDIR /app

# Copy production dependencies
COPY --from=deps --chown=agentuser:agentgroup /app/node_modules ./node_modules

# Copy the compiled JavaScript
COPY --from=builder --chown=agentuser:agentgroup /app/dist ./dist

# Copy package.json for metadata
COPY --chown=agentuser:agentgroup package.json ./

USER agentuser

ENV NODE_ENV=production \
    PORT=3000 \
    LOG_LEVEL=info

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:${PORT}/health || exit 1

EXPOSE ${PORT}

CMD ["node", "dist/server.js"]
package.json for production:
{
  "name": "claude-agent-ts",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "dev": "tsx watch src/server.ts",
    "build": "tsc --project tsconfig.json",
    "start": "node dist/server.js",
    "test": "vitest run",
    "test:coverage": "vitest run --coverage",
    "lint": "eslint src/",
    "typecheck": "tsc --noEmit"
  },
  "dependencies": {
    "@anthropic-ai/claude-code-sdk": "^0.0.14",
    "fastify": "^5.2.0",
    "ioredis": "^5.4.2",
    "pino": "^9.6.0",
    "zod": "^3.24.0"
  },
  "devDependencies": {
    "@types/node": "^22.13.0",
    "tsx": "^4.19.0",
    "typescript": "^5.8.0",
    "vitest": "^3.0.0"
  }
}
Main server.ts:
import Fastify from "fastify";
import { query, ClaudeCodeOptions } from "@anthropic-ai/claude-code-sdk";
import { z } from "zod";
import pino from "pino";

const logger = pino({ level: process.env.LOG_LEVEL ?? "info" });

const AgentRequestSchema = z.object({
  prompt: z.string().min(1).max(10000),
  tools: z.array(z.string()).default(["Read", "Bash"]),
  sessionKey: z.string().optional(),
});

type AgentRequest = z.infer<typeof AgentRequestSchema>;

const app = Fastify({
  logger: false, // we use pino directly
});

app.get("/health", async () => ({ status: "healthy", service: "claude-agent-ts" }));

app.post<{ Body: AgentRequest }>("/agent/run", {
  schema: {
    body: {
      type: "object",
      required: ["prompt"],
      properties: {
        prompt: { type: "string" },
        tools: { type: "array", items: { type: "string" } },
        sessionKey: { type: "string" },
      },
    },
  },
}, async (request, reply) => {
  const parsed = AgentRequestSchema.safeParse(request.body);
  if (!parsed.success) {
    return reply.status(400).send({ error: parsed.error.flatten() });
  }
  const { prompt, tools } = parsed.data;
  const results: string[] = [];
  let sessionId: string | undefined;
  let totalCost = 0;
  try {
    for await (const message of query(prompt, {
      allowedTools: tools,
      maxTurns: 50,
    } as ClaudeCodeOptions)) {
      if ("session_id" in message && message.session_id) {
        sessionId = message.session_id as string;
      }
      if ("cost_usd" in message && typeof message.cost_usd === "number") {
        totalCost += message.cost_usd;
      }
      if ("content" in message) {
        const content = String(message.content);
        if (content) results.push(content);
      }
    }
    logger.info({ sessionId, costUsd: totalCost }, "agent_completed");
    return {
      result: results.join("\n"),
      sessionId,
      costUsd: totalCost > 0 ? totalCost : undefined,
    };
  } catch (error) {
    const err = error as Error;
    logger.error({ error: err.message }, "agent_failed");
    return reply.status(500).send({ error: err.message });
  }
});

const start = async () => {
  const port = parseInt(process.env.PORT ?? "3000", 10);
  await app.listen({ port, host: "0.0.0.0" });
  logger.info({ port }, "server_started");
};

start().catch((err) => {
  logger.error(err, "fatal_error");
  process.exit(1);
});
docker-compose.yml for TypeScript:
version: "3.9"

services:
  agent-ts:
    build:
      context: .
      dockerfile: Dockerfile
      target: runtime
      args:
        NODE_VERSION: "20"
    image: claude-agent-ts:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:?required}
      NODE_ENV: production
      PORT: 3000
      LOG_LEVEL: ${LOG_LEVEL:-info}
      REDIS_URL: redis://redis:6379/0
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
    networks:
      - agent_network

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
    networks:
      - agent_network

networks:
  agent_network:
    driver: bridge
4. Kubernetes
K8s Deployment Architecture
flowchart TB
Internet["Internet"] --> Ingress["Ingress Controller\n(nginx/traefik)"]
Ingress --> Service["Service\n(ClusterIP)"]
Service --> Pod1["Pod 1\nAgent Container"]
Service --> Pod2["Pod 2\nAgent Container"]
Service --> Pod3["Pod 3\nAgent Container"]
HPA["HorizontalPodAutoscaler\n(2-10 replicas)"] --> Deployment["Deployment"]
Deployment --> Pod1
Deployment --> Pod2
Deployment --> Pod3
Pod1 --> Redis["Redis\n(StatefulSet)"]
Pod2 --> Redis
Pod3 --> Redis
subgraph Secrets["Secrets & Config"]
Secret["Secret\nANTHROPIC_API_KEY"]
ConfigMap["ConfigMap\nApp Config"]
end
Secret -.-> Pod1
ConfigMap -.-> Pod1
Namespace and ConfigMap
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: claude-agents
  labels:
    app.kubernetes.io/name: claude-agents
    environment: production
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-config
  namespace: claude-agents
data:
  LOG_LEVEL: "INFO"
  PORT: "8000"
  MAX_CONCURRENT_AGENTS: "5"
  REDIS_URL: "redis://redis-service:6379/0"
  MAX_TURNS: "50"
  REQUEST_TIMEOUT_SECONDS: "1800"
Secret for the API Key
# secret.yaml (do NOT commit with real values)
apiVersion: v1
kind: Secret
metadata:
  name: anthropic-credentials
  namespace: claude-agents
type: Opaque
stringData:
  # In production, use External Secrets Operator or Vault
  ANTHROPIC_API_KEY: "sk-ant-REPLACE_WITH_REAL_VALUE"
In production, create the secret with kubectl:
kubectl create secret generic anthropic-credentials \
  --from-literal=ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
  --namespace=claude-agents
Complete Deployment YAML
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: claude-agent
  namespace: claude-agents
  labels:
    app: claude-agent
    version: "1.0.0"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: claude-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # zero-downtime deployment
  template:
    metadata:
      labels:
        app: claude-agent
        version: "1.0.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8001"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: claude-agent-sa
      # Pod security
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        runAsGroup: 1001
        fsGroup: 1001
      containers:
        - name: agent
          image: registry.example.com/claude-agent:1.0.0
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8000
            - name: metrics
              containerPort: 8001
          # Environment variables from the ConfigMap and Secret
          envFrom:
            - configMapRef:
                name: agent-config
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: anthropic-credentials
                  key: ANTHROPIC_API_KEY
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          # Resource limits
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          # Liveness probe: is the container alive?
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          # Readiness probe: can the container receive traffic?
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Startup probe: allows time for the first start
          startupProbe:
            httpGet:
              path: /health
              port: http
            failureThreshold: 30
            periodSeconds: 10
          # Container security
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false  # the agent needs to write temp files
            capabilities:
              drop:
                - ALL
          # Temporary directory for the agent
          volumeMounts:
            - name: tmp-dir
              mountPath: /tmp
      volumes:
        - name: tmp-dir
          emptyDir:
            sizeLimit: 1Gi
      # Spread pods across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - claude-agent
                topologyKey: kubernetes.io/hostname
      # Time to terminate gracefully (important for long-running agents)
      terminationGracePeriodSeconds: 1800
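terminationGracePeriodSeconds only grants the pod time; the application must also use it. A sketch of the process side, assuming the server keeps its in-flight agent runs in a set of tasks (`active_runs`, `drain`, and `install_sigterm_handler` are illustrative names):

```python
import asyncio
import signal

shutting_down = asyncio.Event()          # readiness probe can report "not ready" once set
active_runs: set[asyncio.Task] = set()   # in-flight agent runs, added by the request handler

def install_sigterm_handler(loop: asyncio.AbstractEventLoop) -> None:
    # Kubernetes sends SIGTERM first; stop accepting new work but keep running
    loop.add_signal_handler(signal.SIGTERM, shutting_down.set)

async def drain(timeout: float = 1800.0) -> None:
    # Wait up to terminationGracePeriodSeconds for active agent runs to finish;
    # anything still running after that will be killed by SIGKILL
    if active_runs:
        await asyncio.wait(active_runs, timeout=timeout)
```

On SIGTERM the server would set `shutting_down`, fail the readiness probe so the Service stops routing to it, and `await drain()` before exiting.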
Service and Ingress
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: claude-agent-service
  namespace: claude-agents
  labels:
    app: claude-agent
spec:
  selector:
    app: claude-agent
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: metrics
      port: 8001
      targetPort: metrics
  type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: claude-agent-ingress
  namespace: claude-agents
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - agent.example.com
      secretName: agent-tls
  rules:
    - host: agent.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: claude-agent-service
                port:
                  name: http
HorizontalPodAutoscaler
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: claude-agent-hpa
  namespace: claude-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: claude-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Scale on CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Scale on memory
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
RBAC and NetworkPolicy
# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: claude-agent-sa
  namespace: claude-agents
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: claude-agent-role
  namespace: claude-agents
rules:
  # The agent needs no K8s permissions beyond
  # read-only access to its own ConfigMaps
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: claude-agent-rolebinding
  namespace: claude-agents
subjects:
  - kind: ServiceAccount
    name: claude-agent-sa
    namespace: claude-agents
roleRef:
  kind: Role
  name: claude-agent-role
  apiGroup: rbac.authorization.k8s.io
---
# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: claude-agent-netpol
  namespace: claude-agents
spec:
  podSelector:
    matchLabels:
      app: claude-agent
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only allow traffic from the ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - port: 8000
    # Allow Prometheus scraping
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 8001
  egress:
    # Allow internet access (for the Anthropic API)
    - ports:
        - port: 443
          protocol: TCP
    # Allow access to the internal Redis
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - port: 6379
    # DNS
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
5. Serverless
Serverless Limitations for Agents
flowchart LR
    subgraph Limitations["Serverless Limitations"]
        ColdStart["Cold start\n2-15s overhead"]
        Timeout["Timeout\nLambda: 15min\nCloud Run: 60min"]
        Filesystem["Filesystem\nonly /tmp (512MB)"]
        State["No persistent\nstate"]
    end
    subgraph Solutions["Solutions"]
        Warmup["Pre-warming\nscheduled invocations"]
        AsyncPattern["Async pattern\nwebhook + queue"]
        S3["S3/GCS for\ntemporary files"]
        Redis2["External Redis\nfor state"]
    end
    ColdStart --> Warmup
    Timeout --> AsyncPattern
    Filesystem --> S3
    State --> Redis2
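The webhook + queue pattern works around the timeout: the HTTP handler only enqueues a job and returns an id immediately, while a worker with no HTTP deadline runs the agent. A self-contained sketch; the in-memory dict stands in for a real queue (SQS, Pub/Sub) plus a status store (DynamoDB, Redis), and all names here are illustrative:

```python
import asyncio
import uuid

jobs: dict[str, dict] = {}  # stand-in for a real status store (Redis, DynamoDB)

async def submit(prompt: str) -> str:
    # The HTTP handler: respond immediately with a job id, run the agent out-of-band
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    asyncio.create_task(worker(job_id, prompt))  # in production: push to SQS instead
    return job_id

async def worker(job_id: str, prompt: str) -> None:
    # The queue consumer: no HTTP timeout applies here
    jobs[job_id]["status"] = "running"
    await asyncio.sleep(0)  # placeholder for the long-running query(...) loop
    jobs[job_id].update(status="done", result=f"processed: {prompt}")
    # On completion, call the client's webhook with the result

async def get_status(job_id: str) -> dict:
    return jobs.get(job_id, {"status": "unknown"})
```

The client then polls a status endpoint or receives a webhook, so even a 30-minute agent run fits behind a sub-second HTTP response.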
AWS Lambda with Python
# lambda_handler.py
import asyncio
import json
import os

from claude_code_sdk import query, ClaudeCodeOptions

# Initialization outside the handler (reused across warm starts)
MAX_TIMEOUT = int(os.getenv("LAMBDA_TIMEOUT_SECONDS", "600"))

def lambda_handler(event: dict, context) -> dict:
    """
    Lambda handler for an AI agent.
    IMPORTANT: Lambda has a hard 15-minute timeout.
    For longer tasks, use the async pattern with SQS.
    """
    prompt = event.get("prompt", "")
    tools = event.get("tools", ["Read"])
    if not prompt:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "prompt is required"})
        }
    # Run the agent inside an event loop
    result = asyncio.run(run_agent(prompt, tools))
    return {
        "statusCode": 200,
        "body": json.dumps(result),
        "headers": {"Content-Type": "application/json"}
    }

async def run_agent(prompt: str, tools: list[str]) -> dict:
    results = []
    total_cost = 0.0
    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(
            allowed_tools=tools,
            max_turns=20  # keep it low for Lambda
        )
    ):
        if hasattr(message, 'cost_usd') and message.cost_usd:
            total_cost += message.cost_usd
        if hasattr(message, 'content') and message.content:
            results.append(str(message.content))
    return {
        "result": "\n".join(results),
        "cost_usd": total_cost
    }
Dockerfile for Lambda:
FROM public.ecr.aws/lambda/python:3.12

# Install Node.js (for the Claude Code CLI)
RUN dnf install -y nodejs npm && dnf clean all

# Install the Claude Code CLI
RUN npm install -g @anthropic-ai/claude-code

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the handler
COPY lambda_handler.py .

CMD ["lambda_handler.lambda_handler"]
Google Cloud Run (Recommended for Agents)
Cloud Run is a better fit than Lambda for agents because:
- Timeouts up to 60 minutes
- Better streaming support
- Scales to zero (economical)
- Standard containers (no special framework)
# cloudrun-service.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: claude-agent
  annotations:
    run.googleapis.com/ingress: all
spec:
  template:
    metadata:
      annotations:
        # Maximum response time: 60 minutes
        run.googleapis.com/timeout: "3600s"
        # Minimum instances to avoid cold starts
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "10"
        # Resources
        run.googleapis.com/cpu: "1"
        run.googleapis.com/memory: "512Mi"
    spec:
      containers:
        - image: gcr.io/PROJECT_ID/claude-agent:latest
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: anthropic-api-key
                  key: latest
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"
Deploying to Cloud Run:
# Build and push
gcloud builds submit --tag gcr.io/${PROJECT_ID}/claude-agent:latest

# Deploy
gcloud run deploy claude-agent \
  --image gcr.io/${PROJECT_ID}/claude-agent:latest \
  --platform managed \
  --region us-central1 \
  --timeout 3600 \
  --memory 512Mi \
  --set-secrets "ANTHROPIC_API_KEY=anthropic-api-key:latest" \
  --no-allow-unauthenticated
Azure Container Apps
# Create the Container App
az containerapp create \
  --name claude-agent \
  --resource-group my-rg \
  --environment my-env \
  --image myregistry.azurecr.io/claude-agent:latest \
  --target-port 8000 \
  --ingress external \
  --min-replicas 1 \
  --max-replicas 10 \
  --cpu 0.5 \
  --memory 1Gi \
  --secrets anthropic-key=secretref:anthropicapikey \
  --env-vars ANTHROPIC_API_KEY=secretref:anthropic-key
6. Railway and Render
Deploying on Railway
Railway is the simplest option for small projects and MVPs:
# railway.toml
[build]
builder = "DOCKERFILE"
dockerfilePath = "Dockerfile"
[deploy]
healthcheckPath = "/health"
healthcheckTimeout = 60
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3
[deploy.resources]
memory = "512MB"
vcpus = 0.5
Environment variables on Railway:
# Configure via the CLI
railway variables set ANTHROPIC_API_KEY=sk-ant-xxx
railway variables set LOG_LEVEL=INFO
railway variables set MAX_CONCURRENT_AGENTS=3

# Deploy
railway up
Deploying on Render
# render.yaml
services:
  - type: web
    name: claude-agent
    runtime: docker
    dockerfilePath: ./Dockerfile
    plan: starter  # $7/month, enough for an MVP
    envVars:
      - key: ANTHROPIC_API_KEY
        sync: false  # set manually in the dashboard
      - key: PORT
        value: 8000
      - key: LOG_LEVEL
        value: INFO
    healthCheckPath: /health
    autoDeploy: true  # automatic deploy on push to main
Railway/Render considerations:
| Aspect | Railway | Render |
|---|---|---|
| Base price | $5/month | $7/month |
| HTTP timeout | 60s (configurable) | 30s (starter) |
| Persistent filesystem | No (ephemeral) | Disks available |
| Redis included | Yes ($5/month extra) | Yes ($7/month extra) |
| Auto-deploy | Yes | Yes |
| Best for | MVPs, small projects | MVPs needing more config |
7. CI/CD for Agents
Complete GitHub Actions Workflow
# .github/workflows/deploy.yml
name: CI/CD Claude Agent
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}/claude-agent
jobs:
# ============================================================
# Job 1: Tests
# ============================================================
test:
name: Tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
cache: pip
- name: Install dependencies
run: pip install -r requirements.txt -r requirements-dev.txt
- name: Run linter
run: ruff check src/
- name: Run type checker
run: mypy src/
- name: Run unit tests
run: pytest tests/unit/ -v --cov=src --cov-report=xml
- name: Upload coverage
uses: codecov/codecov-action@v4
with:
file: coverage.xml
# ============================================================
# Job 2: Build y push Docker image
# ============================================================
build:
name: Build Docker Image
runs-on: ubuntu-latest
needs: test
if: github.ref == 'refs/heads/main'
outputs:
image-digest: ${{ steps.build.outputs.digest }}
image-tag: ${{ steps.meta.outputs.tags }}
permissions:
contents: read
packages: write
steps:
- uses: actions/checkout@v4
- name: Log in to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha,prefix=sha-
type=ref,event=branch
type=semver,pattern={{version}}
latest
- name: Build and push
id: build
uses: docker/build-push-action@v6
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
  # ============================================================
  # Job 3: Deploy to production
  # ============================================================
  deploy:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - name: Deploy to Kubernetes
        uses: azure/k8s-deploy@v5
        with:
          namespace: claude-agents
          manifests: k8s/
          images: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }}

      - name: Verify deployment
        run: |
          kubectl rollout status deployment/claude-agent \
            --namespace=claude-agents \
            --timeout=300s

      - name: Run smoke tests
        run: |
          curl -f https://agent.example.com/health || exit 1

  # ============================================================
  # Job 4: Rollback on failure
  # ============================================================
  rollback:
    name: Rollback on Failure
    runs-on: ubuntu-latest
    needs: deploy
    if: failure()
    steps:
      - name: Rollback deployment
        run: |
          kubectl rollout undo deployment/claude-agent \
            --namespace=claude-agents

      - name: Notify failure
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "Deploy failed; rollback executed for ${{ github.sha }}"
            }
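The curl one-liner above is the minimal smoke test. A slightly richer post-deploy check could parse the detailed health payload exposed later in this chapter; the URL, port, and JSON shape here are assumptions, not a fixed contract:

```python
# smoke_test.py -- hypothetical post-deploy check against /health/detailed
import json
import sys
import urllib.request


def health_ok(payload: dict) -> bool:
    """A deploy passes if status is healthy and every sub-check is True."""
    return (
        payload.get("status") == "healthy"
        and all(payload.get("checks", {}).values())
    )


def main(url: str) -> int:
    # Fetch and parse the detailed health endpoint
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return 0 if health_ok(payload) else 1


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Run it from the pipeline as `python smoke_test.py https://agent.example.com/health/detailed`; a non-zero exit code fails the job and triggers the rollback job.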
Canary Deployment
# canary-deploy.yaml
# Strategy: send 10% of traffic to the canary first
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: claude-agent-rollout
  namespace: claude-agents
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10           # 10% to the canary
        - pause: {duration: 5m}   # wait 5 minutes
        - setWeight: 30           # 30% to the canary
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 5m}
        - setWeight: 100          # 100% to the new version
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: claude-agent-service
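The `success-rate` template referenced by the rollout is defined separately. A hypothetical sketch using the Argo Rollouts Prometheus provider follows; the Prometheus address, query, and threshold are assumptions to adapt to your cluster:

```yaml
# analysis-template.yaml (hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: claude-agents
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      # Abort the rollout if the success rate drops below 95%
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(agent_queries_total{status="success"}[5m]))
              / sum(rate(agent_queries_total[5m]))
```

With this in place, Argo Rollouts evaluates the query at each canary step and rolls back automatically on sustained failures.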
8. Production Monitoring
Agent Health Endpoint
# health.py
import os
import time
from typing import Any

from fastapi import APIRouter

router = APIRouter()
START_TIME = time.time()


@router.get("/health")
async def health() -> dict[str, Any]:
    """Basic health check for load balancers."""
    return {"status": "healthy"}


@router.get("/health/detailed")
async def health_detailed() -> dict[str, Any]:
    """Detailed health check for monitoring."""
    uptime = time.time() - START_TIME
    return {
        "status": "healthy",
        "uptime_seconds": int(uptime),
        "version": os.getenv("APP_VERSION", "unknown"),
        "checks": {
            "api_key_configured": bool(os.getenv("ANTHROPIC_API_KEY")),
            "redis_connected": await check_redis(),
        }
    }


async def check_redis() -> bool:
    """Checks Redis connectivity."""
    try:
        import redis.asyncio as redis
        r = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
        await r.ping()
        await r.close()
        return True
    except Exception:
        return False
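These endpoints map naturally onto Kubernetes probes. A sketch of the relevant fragment of the container spec follows; the port and timings are assumptions, not values from the chapter's manifests:

```yaml
# Fragment of the container spec in the Deployment (hypothetical values)
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
```

Keep probes pointed at the cheap `/health` endpoint; `/health/detailed` hits Redis on every call and is better reserved for dashboards and on-demand diagnostics.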
Prometheus Metrics
# metrics.py
from prometheus_client import (
    Counter, Histogram, Gauge,
    generate_latest, CONTENT_TYPE_LATEST
)
from fastapi import APIRouter
from fastapi.responses import Response

router = APIRouter()

# Counters
QUERIES_TOTAL = Counter(
    "agent_queries_total",
    "Total agent queries",
    ["status", "agent_name"]
)
TOOL_CALLS_TOTAL = Counter(
    "agent_tool_calls_total",
    "Total tool calls",
    ["tool_name", "status"]
)
COST_USD_TOTAL = Counter(
    "agent_cost_usd_total",
    "Total cost in USD",
    ["agent_name"]
)

# Histograms
QUERY_DURATION = Histogram(
    "agent_query_duration_seconds",
    "Agent query duration",
    ["agent_name"],
    buckets=[5, 15, 30, 60, 120, 300, 600, 1800]
)

# Gauges
ACTIVE_AGENTS = Gauge(
    "agent_active_queries",
    "Number of queries currently running"
)


@router.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
Alerts with Prometheus Alertmanager
# prometheus-alerts.yaml
groups:
  - name: claude-agent-alerts
    interval: 30s
    rules:
      - alert: AgentHighErrorRate
        expr: |
          sum by (agent_name) (rate(agent_queries_total{status="error"}[5m]))
            / sum by (agent_name) (rate(agent_queries_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent error rate > 5%"
          description: "Agent {{ $labels.agent_name }} has {{ $value | humanizePercentage }} errors"
      - alert: AgentHighLatency
        expr: |
          histogram_quantile(0.99, sum by (le, agent_name) (rate(agent_query_duration_seconds_bucket[10m]))) > 1800
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent p99 latency > 30 minutes"
      - alert: AgentDailyCostExceeded
        expr: |
          increase(agent_cost_usd_total[24h]) > 50
        labels:
          severity: warning
        annotations:
          summary: "Agent daily cost exceeded $50"
          description: "Accumulated cost: ${{ $value }}"
9. Scaling and Rate Limiting
Token Bucket with Redis
# rate_limiter.py
import asyncio
import time

import redis.asyncio as redis


class AnthropicRateLimiter:
    """
    Token bucket to respect Anthropic rate limits.
    Centralized via Redis so multiple instances share one budget.
    """
    def __init__(
        self,
        redis_client: redis.Redis,
        requests_per_minute: int = 1000,
        key: str = "anthropic_rate_limiter"
    ):
        self.redis = redis_client
        self.rpm = requests_per_minute
        self.key = key
        self.refill_interval = 60.0 / requests_per_minute

    async def acquire(self, timeout: float = 60.0) -> bool:
        """Waits until a token is obtained from the bucket."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            # Atomic Lua script implementing the token bucket
            result = await self.redis.eval(
                """
                local key = KEYS[1]
                local capacity = tonumber(ARGV[1])
                local now = tonumber(ARGV[2])
                local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
                local tokens = tonumber(bucket[1]) or capacity
                local last_refill = tonumber(bucket[2]) or now
                -- Refill based on elapsed time (capacity tokens per 60s)
                local elapsed = now - last_refill
                local new_tokens = math.min(capacity, tokens + (elapsed * capacity / 60))
                if new_tokens >= 1 then
                    redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
                    redis.call('EXPIRE', key, 120)
                    return 1
                else
                    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
                    redis.call('EXPIRE', key, 120)
                    return 0
                end
                """,
                1,
                self.key,
                self.rpm,
                time.time()
            )
            if result == 1:
                return True
            # Wait before retrying
            await asyncio.sleep(self.refill_interval)
        return False


# Usage with the SDK
from claude_code_sdk import query, ClaudeCodeOptions


async def run_agent_with_rate_limit(
    prompt: str,
    rate_limiter: AnthropicRateLimiter
) -> str:
    # Wait for a token before executing
    acquired = await rate_limiter.acquire(timeout=120.0)
    if not acquired:
        raise TimeoutError("Rate limit timeout: could not obtain a token")
    results = []
    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(allowed_tools=["Read", "Bash"])
    ):
        if hasattr(message, 'content') and message.content:
            results.append(str(message.content))
    return "\n".join(results)
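For local development or a single-instance deployment, the same refill arithmetic works without Redis. The following in-process variant is a simplified sketch (the `LocalTokenBucket` class is illustrative and not part of the SDK); it is not safe across processes, which is exactly why the Redis version above exists:

```python
# local_bucket.py -- simplified single-process token bucket (illustrative)
import time


class LocalTokenBucket:
    def __init__(self, requests_per_minute: int):
        self.capacity = float(requests_per_minute)
        self.tokens = self.capacity          # bucket starts full
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        """Non-blocking: returns True if a token was consumed."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill proportionally: capacity tokens per 60 seconds
        self.tokens = min(self.capacity, self.tokens + elapsed * self.capacity / 60)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = LocalTokenBucket(requests_per_minute=2)
print(bucket.try_acquire())  # True: bucket starts full
print(bucket.try_acquire())  # True: second token
print(bucket.try_acquire())  # False: empty until a token refills (~30s)
```

Note the logic mirrors the Lua script line by line; moving it into Redis only adds atomicity across instances.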
Redis Queue for Traffic Spikes
# queue_worker.py
import asyncio
import json
import uuid
from dataclasses import dataclass, asdict
from typing import Optional

import redis.asyncio as redis
from claude_code_sdk import query, ClaudeCodeOptions


@dataclass
class AgentJob:
    job_id: str
    prompt: str
    tools: list[str]
    status: str = "pending"
    result: Optional[str] = None
    error: Optional[str] = None


class AgentQueue:
    """Redis-backed job queue for the agent."""

    QUEUE_KEY = "agent:jobs:pending"
    RESULTS_PREFIX = "agent:result:"

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def enqueue(self, prompt: str, tools: list[str]) -> str:
        """Adds a job to the queue. Returns the job_id."""
        job = AgentJob(
            job_id=str(uuid.uuid4()),
            prompt=prompt,
            tools=tools
        )
        await self.redis.lpush(self.QUEUE_KEY, json.dumps(asdict(job)))
        return job.job_id

    async def get_result(self, job_id: str, timeout: float = 300.0) -> Optional[AgentJob]:
        """Waits for a job's result."""
        key = f"{self.RESULTS_PREFIX}{job_id}"
        loop = asyncio.get_running_loop()
        deadline = loop.time() + timeout
        while loop.time() < deadline:
            result = await self.redis.get(key)
            if result:
                return AgentJob(**json.loads(result))
            await asyncio.sleep(1.0)
        return None

    async def worker(self, worker_id: int = 0) -> None:
        """Processes jobs from the queue indefinitely."""
        print(f"Worker {worker_id} started")
        while True:
            # BRPOP blocks until a job is available (5s timeout)
            item = await self.redis.brpop(self.QUEUE_KEY, timeout=5)
            if not item:
                continue
            _, job_data = item
            job = AgentJob(**json.loads(job_data))
            try:
                results = []
                async for message in query(
                    prompt=job.prompt,
                    options=ClaudeCodeOptions(
                        allowed_tools=job.tools,
                        max_turns=50
                    )
                ):
                    if hasattr(message, 'content') and message.content:
                        results.append(str(message.content))
                job.status = "completed"
                job.result = "\n".join(results)
            except Exception as e:
                job.status = "failed"
                job.error = str(e)
            # Store the result with a 1-hour TTL
            await self.redis.setex(
                f"{self.RESULTS_PREFIX}{job.job_id}",
                3600,
                json.dumps(asdict(job))
            )
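Even with a queue and a rate limiter, individual API calls can still fail transiently (e.g. HTTP 429 or overload errors). A worker usually wraps the agent call in exponential backoff with jitter; a minimal sketch follows, where `with_backoff` and the `flaky` demo are illustrative helpers, not SDK functions:

```python
# retry.py -- illustrative exponential backoff with full jitter
import asyncio
import random


async def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retries an async callable, doubling the delay cap on each failure."""
    for attempt in range(max_attempts):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random fraction of the exponential cap
            delay = random.uniform(0, base_delay * 2 ** attempt)
            await asyncio.sleep(delay)


# Usage sketch: succeeds on the third attempt
async def demo():
    calls = {"n": 0}

    async def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("transient")
        return "ok"

    return await with_backoff(flaky, base_delay=0.01)


print(asyncio.run(demo()))  # ok
```

In production you would catch only transient error types and let permanent failures (bad prompts, auth errors) fail the job immediately so they land in the `failed` status stored by the worker.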
Chapter Summary
In this chapter we covered the full deployment cycle for agents in production:
flowchart LR
    Dev["Local\nDevelopment"] --> Docker["Docker\nContainer"]
    Docker --> CI["CI/CD\nGitHub Actions"]
    CI --> |Test + Build| Registry["Container\nRegistry"]
    Registry --> |Deploy| Target["Target\n(K8s/Cloud Run/Railway)"]
    Target --> Monitor["Monitoring\n(Prometheus + Grafana)"]
    Monitor --> |Alerts| Team["Team"]
Key takeaways:
- Agents are I/O-bound, not CPU-bound: compute is cheap, but the API's rate limiting is the real bottleneck
- Multi-stage Docker builds cut image size by up to 70%
- Kubernetes needs `terminationGracePeriodSeconds: 1800` so long-running agents are not interrupted
- Cloud Run suits agents better than Lambda thanks to its 60-minute timeout
- Centralized rate limiting with Redis is essential across multiple instances
- Redis-backed queues absorb traffic spikes without saturating the API
Next chapter: Chapter 18: Monitoring and Observability — structured logs, Prometheus metrics, OpenTelemetry traces, and Grafana dashboards.