Chapter 17: Production Deployment

By: Artiko

Taking an AI agent to production involves unique challenges that don't exist in traditional web services. This chapter covers everything you need to deploy agents reliably, scalably, and economically.


1. Deployment Considerations for Agents

Agents vs. Traditional Web Services

A typical web service responds in milliseconds. An agent can take anywhere from 30 seconds to 30 minutes. This fundamental difference changes everything: architecture, networking, scaling, error handling, and user experience.

flowchart LR
    subgraph WebService["Traditional Web Service"]
        direction TB
        R1["Request"] --> P1["Process 50ms"] --> RE1["Response"]
    end

    subgraph Agent["AI Agent"]
        direction TB
        R2["Request"] --> L1["LLM Call 1"]
        L1 --> T1["Tool: Bash"]
        T1 --> L2["LLM Call 2"]
        L2 --> T2["Tool: Read File"]
        T2 --> L3["LLM Call 3"]
        L3 --> RE2["Response 2-30min"]
    end

Key differences:

| Aspect | Web Service | AI Agent |
|---|---|---|
| Typical duration | 10-500ms | 30s - 30min |
| External calls | 0-2 per request | 5-50+ (API + tools) |
| State | Stateless | Stateful (active session) |
| CPU usage | Low | Low (I/O bound) |
| Memory usage | Stable | Grows with context |
| Cost per request | Fixed | Variable (tokens used) |
| Interruptions | Retrying is trivial | Retrying can duplicate work |
| Client connection | Plain HTTP | SSE/WebSocket for streaming |
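
The interruption row deserves special attention: because agent runs are long and have side effects, a naive retry can launch a second, duplicate run. A common mitigation is deduplicating by idempotency key. A minimal in-memory sketch (names are illustrative; a production version would keep the cache in Redis with a TTL):

```python
import hashlib

class IdempotentRunner:
    """Deduplicate agent runs by idempotency key (in-memory sketch).

    In production the cache would live in Redis with a TTL, so a retry
    of the same request reuses the finished result instead of starting
    a second agent run.
    """

    def __init__(self) -> None:
        self._results: dict[str, str] = {}
        self.runs = 0  # counts real agent executions

    def _key(self, user_id: str, prompt: str) -> str:
        return hashlib.sha256(f"{user_id}:{prompt}".encode()).hexdigest()

    def run(self, user_id: str, prompt: str, agent_fn) -> str:
        key = self._key(user_id, prompt)
        if key in self._results:
            return self._results[key]   # retry: reuse the cached result
        self.runs += 1
        result = agent_fn(prompt)       # first attempt: run the agent
        self._results[key] = result
        return result

runner = IdempotentRunner()
fake_agent = lambda p: f"answer to: {p}"
first = runner.run("user-1", "summarize the repo", fake_agent)
retry = runner.run("user-1", "summarize the repo", fake_agent)  # deduplicated
```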

Stateful vs. Stateless: Agent Sessions

The SDK lets you resume sessions via the session_id. In production, this means the session state must be reachable from any instance of the service:

stateDiagram-v2
    [*] --> NoSession: First call
    NoSession --> ActiveSession: query() starts a session
    ActiveSession --> SuspendedSession: Timeout / user disconnects
    SuspendedSession --> ActiveSession: query() with session_id
    ActiveSession --> CompletedSession: Agent finishes
    CompletedSession --> [*]

    note right of SuspendedSession
        The session_id is stored
        in Redis or a DB so the
        session can be resumed
        from any server instance
    end note

Session management strategies:

from claude_code_sdk import query, ClaudeCodeOptions

# Option 1: Stateless - no resumption (simplest, fewer capabilities)
async def run_agent_stateless(prompt: str) -> str:
    results = []
    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(allowed_tools=["Read", "Bash"])
    ):
        if hasattr(message, 'content'):
            results.append(str(message.content))
    return "\n".join(results)

# Option 2: Stateful with Redis
import redis.asyncio as redis
import json

async def run_agent_stateful(
    prompt: str,
    session_key: str,
    redis_client: redis.Redis
) -> tuple[str, str | None]:
    # Retrieve the previous session_id, if any
    stored = await redis_client.get(f"session:{session_key}")
    session_id = json.loads(stored)["session_id"] if stored else None

    results = []
    new_session_id = None

    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(
            resume=session_id,
            allowed_tools=["Read", "Edit", "Bash"]
        )
    ):
        if hasattr(message, 'session_id'):
            new_session_id = message.session_id
        if hasattr(message, 'content'):
            results.append(str(message.content))

    # Store the new session_id with a 24-hour TTL
    if new_session_id:
        await redis_client.setex(
            f"session:{session_key}",
            86400,
            json.dumps({"session_id": new_session_id})
        )

    return "\n".join(results), new_session_id

Resources: CPU, Memory, and Cost

Agents are primarily I/O-bound, not CPU-bound. The main cost is the API:

pie title "Time breakdown of a typical agent"
    "Waiting for LLM response" : 65
    "Running tools (Bash, Read)" : 25
    "Python/TS processing" : 5
    "Networking and overhead" : 5

Per-instance resource planning:
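
As a back-of-the-envelope sketch (all numbers are illustrative assumptions, not measurements): since agents are I/O-bound, memory is usually the binding per-instance constraint, because each active session keeps its growing conversation context in memory:

```python
def max_concurrent_agents(instance_memory_mb: int,
                          base_overhead_mb: int = 150,
                          per_session_mb: int = 60) -> int:
    """Rough per-instance capacity estimate. CPU rarely matters for
    I/O-bound agents; memory held per active session is what limits
    concurrency. All numbers here are illustrative assumptions."""
    usable = instance_memory_mb - base_overhead_mb
    return max(usable // per_session_mb, 0)

# Under these assumptions, a 512 MB instance handles about 6 concurrent
# agents, which leaves some margin over a MAX_CONCURRENT_AGENTS of 5.
capacity = max_concurrent_agents(512)
```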

Scaling: API Rate Limiting

Anthropic's rate limiting is the real bottleneck, not server resources:

flowchart TD
    Users["100 concurrent users"] --> LoadBalancer["Load Balancer"]
    LoadBalancer --> I1["Instance 1"]
    LoadBalancer --> I2["Instance 2"]
    LoadBalancer --> I3["Instance 3"]
    I1 --> RateLimiter["Shared Rate Limiter\n(Redis Token Bucket)"]
    I2 --> RateLimiter
    I3 --> RateLimiter
    RateLimiter --> API["Anthropic API\n(1000 RPM limit)"]

    style RateLimiter fill:#f96,color:#000
    style API fill:#f66,color:#fff

Adding more instances without shared rate limiting will only produce 429 errors. The solution is a centralized token bucket in Redis.
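
The algorithm itself can be sketched as follows. This in-memory version shows the token-bucket logic; in the shared setup described above, the state would live in Redis and the refill-and-take step would run atomically (typically as a Lua script) so that every instance draws from the same budget:

```python
import time

class TokenBucket:
    """Token bucket limiting calls per minute (single-process sketch).

    For the shared production version, `tokens` and `last` would be
    stored in Redis and updated atomically so all instances share one
    budget against the Anthropic rate limit.
    """

    def __init__(self, rate_per_minute: int, now=time.monotonic):
        self.capacity = rate_per_minute
        self.tokens = float(rate_per_minute)
        self.refill_per_sec = rate_per_minute / 60.0
        self.now = now            # injectable clock, handy for testing
        self.last = now()

    def try_acquire(self) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_per_sec)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller backs off instead of triggering a 429

# With a frozen clock, a 60 RPM bucket allows exactly 60 calls
bucket = TokenBucket(60, now=lambda: 0.0)
allowed = sum(bucket.try_acquire() for _ in range(100))
```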


2. Docker for Python Agents

Optimized Multi-Stage Dockerfile

# syntax=docker/dockerfile:1
# ============================================================
# Stage 1: Builder - installs dependencies with caching
# ============================================================
FROM python:3.12-slim AS builder

WORKDIR /app

# Install system dependencies needed to compile packages
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first so the dependency layer is cached
COPY requirements.txt .

# Install dependencies into a separate prefix to copy into the runtime
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ============================================================
# Stage 2: Claude CLI installer
# ============================================================
FROM node:20-slim AS claude-installer

# Install the Claude Code CLI globally
RUN npm install -g @anthropic-ai/claude-code && \
    # Verify the installation
    claude --version

# ============================================================
# Stage 3: Final runtime
# ============================================================
FROM python:3.12-slim AS runtime

# Metadata
LABEL maintainer="[email protected]"
LABEL description="Claude Code SDK Agent"
LABEL version="1.0.0"

# Install the Node.js runtime (required by the Claude CLI)
RUN apt-get update && apt-get install -y --no-install-recommends \
    nodejs \
    ca-certificates \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy the Claude CLI from the installer stage
COPY --from=claude-installer /usr/local/lib/node_modules /usr/local/lib/node_modules
COPY --from=claude-installer /usr/local/bin/claude /usr/local/bin/claude
COPY --from=claude-installer /usr/local/bin/node /usr/local/bin/node

# Copy the installed Python dependencies
COPY --from=builder /install /usr/local

# Create a non-root user for security
RUN groupadd --gid 1001 agentgroup && \
    useradd --uid 1001 --gid agentgroup --shell /bin/bash --create-home agentuser

WORKDIR /app

# Copy the application code
COPY --chown=agentuser:agentgroup . .

# Switch to the non-root user
USER agentuser

# Environment variables (no default values for secrets)
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PORT=8000 \
    LOG_LEVEL=INFO \
    MAX_CONCURRENT_AGENTS=5

# Health check against the API's endpoint
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:${PORT}/health || exit 1

EXPOSE ${PORT}

CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Complete requirements.txt:

# Claude Code SDK
claude-code-sdk>=0.0.14

# Web framework
fastapi>=0.115.0
uvicorn[standard]>=0.34.0

# Async and utilities
httpx>=0.28.0
pydantic>=2.10.0
pydantic-settings>=2.7.0

# Shared state
redis[hiredis]>=5.2.0

# Observabilidad
structlog>=25.1.0
prometheus-client>=0.21.0
opentelemetry-api>=1.29.0
opentelemetry-sdk>=1.29.0
opentelemetry-instrumentation-fastapi>=0.50b0

# Resilience
tenacity>=9.0.0

The FastAPI server's main.py:

import asyncio
import os
from contextlib import asynccontextmanager

import structlog
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from claude_code_sdk import query, ClaudeCodeOptions

logger = structlog.get_logger()

class AgentRequest(BaseModel):
    prompt: str
    session_key: str | None = None
    tools: list[str] = ["Read", "Bash"]
    max_tokens: int = 8096

class AgentResponse(BaseModel):
    result: str
    session_id: str | None = None
    cost_usd: float | None = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("agent_server_starting", port=os.getenv("PORT", 8000))
    yield
    logger.info("agent_server_stopping")

app = FastAPI(title="Claude Agent API", lifespan=lifespan)

@app.get("/health")
async def health():
    return {"status": "healthy", "service": "claude-agent"}

@app.post("/agent/run", response_model=AgentResponse)
async def run_agent(request: AgentRequest):
    log = logger.bind(prompt_preview=request.prompt[:50])

    results = []
    session_id = None
    total_cost = 0.0

    try:
        async for message in query(
            prompt=request.prompt,
            options=ClaudeCodeOptions(
                allowed_tools=request.tools,
                max_turns=50
            )
        ):
            if hasattr(message, 'session_id') and message.session_id:
                session_id = message.session_id
            if hasattr(message, 'cost_usd') and message.cost_usd:
                total_cost += message.cost_usd
            if hasattr(message, 'content'):
                content = str(message.content)
                if content:
                    results.append(content)

        log.info("agent_completed", cost_usd=total_cost)
        return AgentResponse(
            result="\n".join(results),
            session_id=session_id,
            cost_usd=total_cost if total_cost > 0 else None
        )

    except Exception as e:
        log.error("agent_failed", error=str(e))
        raise HTTPException(status_code=500, detail=str(e)) from e

Complete docker-compose.yml

version: "3.9"

services:
  agent:
    build:
      context: .
      dockerfile: Dockerfile
      target: runtime
    image: claude-agent:latest
    container_name: claude-agent
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      # Anthropic API key (never hardcode it)
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:?ANTHROPIC_API_KEY is required}
      # Configuration
      PORT: 8000
      LOG_LEVEL: ${LOG_LEVEL:-INFO}
      MAX_CONCURRENT_AGENTS: ${MAX_CONCURRENT_AGENTS:-5}
      # Redis for shared state
      REDIS_URL: redis://redis:6379/0
    env_file:
      - .env.local  # For local development (do NOT commit)
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
        reservations:
          cpus: "0.1"
          memory: 256M
    volumes:
      # Only for logs when centralized logging isn't used
      - agent_logs:/app/logs
    networks:
      - agent_network

  redis:
    image: redis:7-alpine
    container_name: claude-agent-redis
    restart: unless-stopped
    command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - agent_network

  # Optional: Nginx as a reverse proxy
  nginx:
    image: nginx:alpine
    container_name: claude-agent-nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - agent
    networks:
      - agent_network

volumes:
  redis_data:
  agent_logs:

networks:
  agent_network:
    driver: bridge

Image size optimization:

# Check the image size
docker images claude-agent

# Inspect the layers
docker history claude-agent:latest

# Typical reduction with multi-stage:
# Without multi-stage: ~800MB
# With multi-stage:    ~250MB
# With distroless:     ~150MB (advanced)

3. Docker for TypeScript Agents

Dockerfile with Node.js + Claude Code CLI

# syntax=docker/dockerfile:1
# ============================================================
# Stage 1: Dependencies
# ============================================================
FROM node:20-slim AS deps

WORKDIR /app

# Copy only the dependency manifests
COPY package.json package-lock.json* ./

# Install production dependencies
RUN npm ci --omit=dev && \
    # The Claude Code CLI is installed as a project dependency,
    # so there is no need to install it globally as well
    npm cache clean --force

# ============================================================
# Stage 2: TypeScript builder
# ============================================================
FROM node:20-slim AS builder

WORKDIR /app

COPY package.json package-lock.json* ./
COPY tsconfig.json ./

# Install ALL dependencies (including devDependencies, needed to compile)
RUN npm ci

# Copy the source code
COPY src/ ./src/

# Compile TypeScript
RUN npm run build

# ============================================================
# Stage 3: Runtime
# ============================================================
FROM node:20-slim AS runtime

LABEL description="Claude Agent TypeScript"
LABEL version="1.0.0"

# Install curl for health checks
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create a non-root user
RUN groupadd --gid 1001 agentgroup && \
    useradd --uid 1001 --gid agentgroup --shell /bin/bash --create-home agentuser

WORKDIR /app

# Copy the production dependencies
COPY --from=deps --chown=agentuser:agentgroup /app/node_modules ./node_modules

# Copy the compiled JavaScript
COPY --from=builder --chown=agentuser:agentgroup /app/dist ./dist

# Copy package.json for metadata
COPY --chown=agentuser:agentgroup package.json ./

USER agentuser

ENV NODE_ENV=production \
    PORT=3000 \
    LOG_LEVEL=info

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:${PORT}/health || exit 1

EXPOSE ${PORT}

CMD ["node", "dist/server.js"]

package.json for production:

{
  "name": "claude-agent-ts",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "dev": "tsx watch src/server.ts",
    "build": "tsc --project tsconfig.json",
    "start": "node dist/server.js",
    "test": "vitest run",
    "test:coverage": "vitest run --coverage",
    "lint": "eslint src/",
    "typecheck": "tsc --noEmit"
  },
  "dependencies": {
    "@anthropic-ai/claude-code-sdk": "^0.0.14",
    "fastify": "^5.2.0",
    "ioredis": "^5.4.2",
    "pino": "^9.6.0",
    "zod": "^3.24.0"
  },
  "devDependencies": {
    "@types/node": "^22.13.0",
    "tsx": "^4.19.0",
    "typescript": "^5.8.0",
    "vitest": "^3.0.0"
  }
}

The main server.ts:

import Fastify from "fastify";
import { query, type Options } from "@anthropic-ai/claude-code";
import { z } from "zod";
import pino from "pino";

const logger = pino({ level: process.env.LOG_LEVEL ?? "info" });

const AgentRequestSchema = z.object({
  prompt: z.string().min(1).max(10000),
  tools: z.array(z.string()).default(["Read", "Bash"]),
  sessionKey: z.string().optional(),
});

type AgentRequest = z.infer<typeof AgentRequestSchema>;

const app = Fastify({
  logger: false, // We use pino directly
});

app.get("/health", async () => ({ status: "healthy", service: "claude-agent-ts" }));

app.post<{ Body: AgentRequest }>("/agent/run", {
  schema: {
    body: {
      type: "object",
      required: ["prompt"],
      properties: {
        prompt: { type: "string" },
        tools: { type: "array", items: { type: "string" } },
        sessionKey: { type: "string" },
      },
    },
  },
}, async (request, reply) => {
  const parsed = AgentRequestSchema.safeParse(request.body);

  if (!parsed.success) {
    return reply.status(400).send({ error: parsed.error.flatten() });
  }

  const { prompt, tools } = parsed.data;
  const results: string[] = [];
  let sessionId: string | undefined;
  let totalCost = 0;

  const options: Options = {
    allowedTools: tools,
    maxTurns: 50,
  };

  try {
    for await (const message of query({ prompt, options })) {
      if ("session_id" in message && message.session_id) {
        sessionId = message.session_id as string;
      }
      if ("cost_usd" in message && typeof message.cost_usd === "number") {
        totalCost += message.cost_usd;
      }
      if ("content" in message) {
        const content = String(message.content);
        if (content) results.push(content);
      }
    }

    logger.info({ sessionId, costUsd: totalCost }, "agent_completed");

    return {
      result: results.join("\n"),
      sessionId,
      costUsd: totalCost > 0 ? totalCost : undefined,
    };
  } catch (error) {
    const err = error as Error;
    logger.error({ error: err.message }, "agent_failed");
    return reply.status(500).send({ error: err.message });
  }
});

const start = async () => {
  const port = parseInt(process.env.PORT ?? "3000", 10);
  await app.listen({ port, host: "0.0.0.0" });
  logger.info({ port }, "server_started");
};

start().catch((err) => {
  logger.error(err, "fatal_error");
  process.exit(1);
});

docker-compose.yml for TypeScript:

version: "3.9"

services:
  agent-ts:
    build:
      context: .
      dockerfile: Dockerfile
      target: runtime
      args:
        NODE_VERSION: "20"
    image: claude-agent-ts:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:?required}
      NODE_ENV: production
      PORT: 3000
      LOG_LEVEL: ${LOG_LEVEL:-info}
      REDIS_URL: redis://redis:6379/0
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
    networks:
      - agent_network

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
    networks:
      - agent_network

networks:
  agent_network:
    driver: bridge

4. Kubernetes

K8s Deployment Architecture

flowchart TB
    Internet["Internet"] --> Ingress["Ingress Controller\n(nginx/traefik)"]
    Ingress --> Service["Service\n(ClusterIP)"]
    Service --> Pod1["Pod 1\nAgent Container"]
    Service --> Pod2["Pod 2\nAgent Container"]
    Service --> Pod3["Pod 3\nAgent Container"]

    HPA["HorizontalPodAutoscaler\n(2-10 replicas)"] --> Deployment["Deployment"]
    Deployment --> Pod1
    Deployment --> Pod2
    Deployment --> Pod3

    Pod1 --> Redis["Redis\n(StatefulSet)"]
    Pod2 --> Redis
    Pod3 --> Redis

    subgraph Secrets["Secrets & Config"]
        Secret["Secret\nANTHROPIC_API_KEY"]
        ConfigMap["ConfigMap\nApp Config"]
    end

    Secret -.-> Pod1
    ConfigMap -.-> Pod1

Namespace and ConfigMap

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: claude-agents
  labels:
    app.kubernetes.io/name: claude-agents
    environment: production
---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-config
  namespace: claude-agents
data:
  LOG_LEVEL: "INFO"
  PORT: "8000"
  MAX_CONCURRENT_AGENTS: "5"
  REDIS_URL: "redis://redis-service:6379/0"
  MAX_TURNS: "50"
  REQUEST_TIMEOUT_SECONDS: "1800"
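
Inside the pod, these keys arrive as plain environment variables. A minimal stdlib sketch of reading them with type validation (field names mirror the ConfigMap keys; in the stack used in this chapter, pydantic-settings would do the same with less code):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    log_level: str
    port: int
    max_concurrent_agents: int
    redis_url: str
    max_turns: int
    request_timeout_seconds: int

    @classmethod
    def from_env(cls, env=os.environ) -> "AgentConfig":
        # Keys mirror the ConfigMap; int() fails fast on bad values
        return cls(
            log_level=env.get("LOG_LEVEL", "INFO"),
            port=int(env.get("PORT", "8000")),
            max_concurrent_agents=int(env.get("MAX_CONCURRENT_AGENTS", "5")),
            redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
            max_turns=int(env.get("MAX_TURNS", "50")),
            request_timeout_seconds=int(env.get("REQUEST_TIMEOUT_SECONDS", "1800")),
        )

cfg = AgentConfig.from_env({"PORT": "8000", "MAX_TURNS": "50"})
```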

Secret for the API Key

# secret.yaml (do NOT commit with real values)
apiVersion: v1
kind: Secret
metadata:
  name: anthropic-credentials
  namespace: claude-agents
type: Opaque
stringData:
  # In production, use External Secrets Operator or Vault
  ANTHROPIC_API_KEY: "sk-ant-REPLACE_WITH_REAL_VALUE"

In production, create the secret with kubectl:

kubectl create secret generic anthropic-credentials \
  --from-literal=ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
  --namespace=claude-agents

Complete Deployment YAML

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: claude-agent
  namespace: claude-agents
  labels:
    app: claude-agent
    version: "1.0.0"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: claude-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Zero-downtime deployment
  template:
    metadata:
      labels:
        app: claude-agent
        version: "1.0.0"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8001"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: claude-agent-sa

      # Pod security
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        runAsGroup: 1001
        fsGroup: 1001

      containers:
        - name: agent
          image: registry.example.com/claude-agent:1.0.0
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 8000
            - name: metrics
              containerPort: 8001

          # Environment variables from the ConfigMap and Secret
          envFrom:
            - configMapRef:
                name: agent-config
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: anthropic-credentials
                  key: ANTHROPIC_API_KEY
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

          # Resource limits
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

          # Liveness probe: is the container alive?
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3

          # Readiness probe: can the container receive traffic?
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3

          # Startup probe: time allowed for the first start
          startupProbe:
            httpGet:
              path: /health
              port: http
            failureThreshold: 30
            periodSeconds: 10

          # Container security
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false  # The agent needs to write temp files
            capabilities:
              drop:
                - ALL

          # Temporary directory for the agent
          volumeMounts:
            - name: tmp-dir
              mountPath: /tmp

      volumes:
        - name: tmp-dir
          emptyDir:
            sizeLimit: 1Gi

      # Spread pods across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - claude-agent
                topologyKey: kubernetes.io/hostname

      # Time to shut down gracefully (important for long-running agents)
      terminationGracePeriodSeconds: 1800
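
The long terminationGracePeriodSeconds only helps if the application cooperates: on SIGTERM it should stop accepting new runs, fail its readiness probe, and let in-flight runs drain. A minimal sketch of that bookkeeping (illustrative; a real FastAPI app would wire this into its lifespan and health handlers):

```python
import signal
import threading

class GracefulShutdown:
    """Drain in-flight agent runs during a rolling update.

    Kubernetes sends SIGTERM, then waits up to
    terminationGracePeriodSeconds before SIGKILL. While draining,
    the readiness probe should fail so no new traffic arrives.
    """

    def __init__(self) -> None:
        self.draining = threading.Event()
        self.in_flight = 0
        self._lock = threading.Lock()
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        self.draining.set()  # readiness probe flips to failing here

    def try_start_run(self) -> bool:
        if self.draining.is_set():
            return False  # reject new work while draining
        with self._lock:
            self.in_flight += 1
        return True

    def finish_run(self) -> None:
        with self._lock:
            self.in_flight -= 1

    def can_exit(self) -> bool:
        # Safe to exit once draining and no runs remain in flight
        return self.draining.is_set() and self.in_flight == 0

shutdown = GracefulShutdown()
accepted = shutdown.try_start_run()          # one run in flight
shutdown._on_sigterm(signal.SIGTERM, None)   # simulate the kubelet's SIGTERM
```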

Service and Ingress

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: claude-agent-service
  namespace: claude-agents
  labels:
    app: claude-agent
spec:
  selector:
    app: claude-agent
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: metrics
      port: 8001
      targetPort: metrics
  type: ClusterIP
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: claude-agent-ingress
  namespace: claude-agents
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - agent.example.com
      secretName: agent-tls
  rules:
    - host: agent.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: claude-agent-service
                port:
                  name: http

HorizontalPodAutoscaler

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: claude-agent-hpa
  namespace: claude-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: claude-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Scale on CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Scale on memory
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

RBAC and NetworkPolicy

# rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: claude-agent-sa
  namespace: claude-agents
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: claude-agent-role
  namespace: claude-agents
rules:
  # The agent needs no K8s permissions beyond
  # read-only access to its own ConfigMaps
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: claude-agent-rolebinding
  namespace: claude-agents
subjects:
  - kind: ServiceAccount
    name: claude-agent-sa
    namespace: claude-agents
roleRef:
  kind: Role
  name: claude-agent-role
  apiGroup: rbac.authorization.k8s.io
---
# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: claude-agent-netpol
  namespace: claude-agents
spec:
  podSelector:
    matchLabels:
      app: claude-agent
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only allow traffic from the ingress controller
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - port: 8000
    # Allow Prometheus scraping
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 8001
  egress:
    # Allow internet access (for the Anthropic API)
    - ports:
        - port: 443
          protocol: TCP
    # Allow access to the internal Redis
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - port: 6379
    # DNS
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

5. Serverless

Serverless Limitations for Agents

flowchart LR
    subgraph Limitations["Serverless Limitations"]
        ColdStart["Cold start\n2-15s overhead"]
        Timeout["Timeout\nLambda: 15min\nCloud Run: 60min"]
        Filesystem["Filesystem\nOnly /tmp (512MB)"]
        State["No persistent\nstate"]
    end

    subgraph Solutions["Solutions"]
        Warmup["Pre-warming\nscheduled invocations"]
        AsyncPattern["Async pattern\nwebhook + queue"]
        S3["S3/GCS for\ntemporary files"]
        Redis2["External Redis\nfor state"]
    end

    ColdStart --> Warmup
    Timeout --> AsyncPattern
    Filesystem --> S3
    State --> Redis2
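
The webhook + queue pattern can be sketched like this: the HTTP handler only enqueues and returns a job id immediately, while a separate worker (in practice consuming SQS or Pub/Sub) runs the agent outside the request path, so per-request timeouts no longer bound the run. An in-memory sketch with illustrative names:

```python
import uuid
from collections import deque

class AsyncJobBroker:
    """Decouple the HTTP request from the long agent run.

    In production: `queue` would be SQS/PubSub, `status` would be
    DynamoDB or Redis, and completion could also fire a webhook back
    to the client instead of requiring polling.
    """

    def __init__(self) -> None:
        self.queue: deque = deque()
        self.status: dict[str, dict] = {}

    def submit(self, prompt: str) -> str:
        # Called by the HTTP handler: returns in milliseconds
        job_id = str(uuid.uuid4())
        self.status[job_id] = {"state": "queued"}
        self.queue.append((job_id, prompt))
        return job_id

    def work_one(self, agent_fn) -> None:
        # Called by the worker: runs outside the request path, so the
        # serverless response timeout no longer applies to the agent run
        job_id, prompt = self.queue.popleft()
        self.status[job_id] = {"state": "running"}
        result = agent_fn(prompt)
        self.status[job_id] = {"state": "done", "result": result}

broker = AsyncJobBroker()
job_id = broker.submit("audit the repo")
broker.work_one(lambda p: f"report for: {p}")
```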

AWS Lambda with Python

# lambda_handler.py
import asyncio
import json
import os
from claude_code_sdk import query, ClaudeCodeOptions

# Initialization outside the handler (reused on warm starts)
MAX_TIMEOUT = int(os.getenv("LAMBDA_TIMEOUT_SECONDS", "600"))

def lambda_handler(event: dict, context) -> dict:
    """
    Lambda handler for an AI agent.

    IMPORTANT: Lambda has a maximum timeout of 15 minutes.
    For long tasks, use the async pattern with SQS.
    """
    prompt = event.get("prompt", "")
    tools = event.get("tools", ["Read"])

    if not prompt:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "prompt is required"})
        }

    # Run the agent in an event loop
    result = asyncio.run(run_agent(prompt, tools))

    return {
        "statusCode": 200,
        "body": json.dumps(result),
        "headers": {"Content-Type": "application/json"}
    }

async def run_agent(prompt: str, tools: list[str]) -> dict:
    results = []
    total_cost = 0.0

    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(
            allowed_tools=tools,
            max_turns=20  # Keep this low for Lambda
        )
    ):
        if hasattr(message, 'cost_usd') and message.cost_usd:
            total_cost += message.cost_usd
        if hasattr(message, 'content') and message.content:
            results.append(str(message.content))

    return {
        "result": "\n".join(results),
        "cost_usd": total_cost
    }

Dockerfile for Lambda:

FROM public.ecr.aws/lambda/python:3.12

# Install Node.js (for the Claude Code CLI)
RUN dnf install -y nodejs npm && dnf clean all

# Install the Claude Code CLI
RUN npm install -g @anthropic-ai/claude-code

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the handler
COPY lambda_handler.py .

CMD ["lambda_handler.lambda_handler"]

Google Cloud Run (Recommended for Agents)

Cloud Run is a better fit than Lambda for agents: request timeouts go up to 60 minutes (versus Lambda's 15), and minimum instances keep containers warm to avoid cold starts:

# cloudrun-service.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: claude-agent
  annotations:
    run.googleapis.com/ingress: all
spec:
  template:
    metadata:
      annotations:
        # Minimum instances to avoid cold starts
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "10"
    spec:
      # Maximum request duration: 60 minutes
      timeoutSeconds: 3600
      containers:
        - image: gcr.io/PROJECT_ID/claude-agent:latest
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: anthropic-api-key
                  key: latest
          resources:
            limits:
              cpu: "1"
              memory: "512Mi"

Deploy to Cloud Run:

# Build and push
gcloud builds submit --tag gcr.io/${PROJECT_ID}/claude-agent:latest

# Deploy
gcloud run deploy claude-agent \
  --image gcr.io/${PROJECT_ID}/claude-agent:latest \
  --platform managed \
  --region us-central1 \
  --timeout 3600 \
  --memory 512Mi \
  --set-secrets "ANTHROPIC_API_KEY=anthropic-api-key:latest" \
  --no-allow-unauthenticated

Azure Container Apps

# Create the Container App
az containerapp create \
  --name claude-agent \
  --resource-group my-rg \
  --environment my-env \
  --image myregistry.azurecr.io/claude-agent:latest \
  --target-port 8000 \
  --ingress external \
  --min-replicas 1 \
  --max-replicas 10 \
  --cpu 0.5 \
  --memory 1Gi \
  --secrets "anthropic-key=$ANTHROPIC_API_KEY" \
  --env-vars ANTHROPIC_API_KEY=secretref:anthropic-key

6. Railway y Render

Deploy en Railway

Railway es la opción más simple para proyectos pequeños y MVPs:

# railway.toml
[build]
builder = "DOCKERFILE"
dockerfilePath = "Dockerfile"

[deploy]
healthcheckPath = "/health"
healthcheckTimeout = 60
restartPolicyType = "ON_FAILURE"
restartPolicyMaxRetries = 3

# Nota: los límites de CPU/memoria se configuran desde el dashboard
# de Railway, no desde railway.toml.

Variables de entorno en Railway:

# Configurar via CLI
railway variables set ANTHROPIC_API_KEY=sk-ant-xxx
railway variables set LOG_LEVEL=INFO
railway variables set MAX_CONCURRENT_AGENTS=3

# Deploy
railway up

Deploy en Render

# render.yaml
services:
  - type: web
    name: claude-agent
    runtime: docker
    dockerfilePath: ./Dockerfile
    plan: starter  # $7/mes, suficiente para MVP
    envVars:
      - key: ANTHROPIC_API_KEY
        sync: false  # Se configura manualmente en el dashboard
      - key: PORT
        value: 8000
      - key: LOG_LEVEL
        value: INFO
    healthCheckPath: /health
    autoDeploy: true  # Deploy automático en push a main

Consideraciones para Railway/Render:

| Aspecto | Railway | Render |
|---------|---------|--------|
| Precio base | $5/mes | $7/mes |
| Timeout HTTP | 60s (configurable) | 30s (starter) |
| Filesystem persistente | No (ephemeral) | Disks disponibles |
| Redis incluido | Sí ($5/mes extra) | Sí ($7/mes extra) |
| Auto-deploy | Sí | Sí |
| Ideal para | MVPs, proyectos pequeños | MVPs con más config |

7. CI/CD para Agentes

GitHub Actions Workflow Completo

# .github/workflows/deploy.yml
name: CI/CD Claude Agent

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/claude-agent

jobs:
  # ============================================================
  # Job 1: Tests
  # ============================================================
  test:
    name: Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-dev.txt

      - name: Run linter
        run: ruff check src/

      - name: Run type checker
        run: mypy src/

      - name: Run unit tests
        run: pytest tests/unit/ -v --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          files: coverage.xml

  # ============================================================
  # Job 2: Build y push Docker image
  # ============================================================
  build:
    name: Build Docker Image
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      image-tag: ${{ steps.meta.outputs.tags }}

    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=sha-
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=raw,value=latest

      - name: Build and push
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  # ============================================================
  # Job 3: Deploy a producción
  # ============================================================
  deploy:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'
    environment: production

    steps:
      # Asume el contexto de kubectl ya configurado
      # (p. ej. azure/k8s-set-context en un paso previo)
      - name: Deploy to Kubernetes
        uses: azure/k8s-deploy@v5
        with:
          namespace: claude-agents
          manifests: k8s/
          images: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }}

      - name: Verify deployment
        run: |
          kubectl rollout status deployment/claude-agent \
            --namespace=claude-agents \
            --timeout=300s

      - name: Run smoke tests
        run: |
          curl -f https://agent.example.com/health || exit 1

  # ============================================================
  # Job 4: Rollback en caso de fallo
  # ============================================================
  rollback:
    name: Rollback on Failure
    runs-on: ubuntu-latest
    needs: deploy
    if: failure()

    steps:
      - name: Rollback deployment
        run: |
          kubectl rollout undo deployment/claude-agent \
            --namespace=claude-agents

      - name: Notify failure
        uses: slackapi/slack-github-action@v2
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "Deploy fallido y rollback ejecutado para ${{ github.sha }}"
            }

Canary Deployment

# canary-deploy.yaml
# Estrategia: Enviar 10% del tráfico al canary primero
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: claude-agent-rollout
  namespace: claude-agents
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10   # 10% al canary
        - pause: {duration: 5m}  # Esperar 5 minutos
        - setWeight: 30   # 30% al canary
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 5m}
        - setWeight: 100  # 100% al nuevo
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: claude-agent-service

8. Monitoreo en Producción

Health Endpoint del Agente

# health.py
import os
import time
from typing import Any
from fastapi import APIRouter

router = APIRouter()
START_TIME = time.time()

@router.get("/health")
async def health() -> dict[str, Any]:
    """Health check básico para load balancers."""
    return {"status": "healthy"}

@router.get("/health/detailed")
async def health_detailed() -> dict[str, Any]:
    """Health check detallado para monitoreo."""
    uptime = time.time() - START_TIME

    return {
        "status": "healthy",
        "uptime_seconds": int(uptime),
        "version": os.getenv("APP_VERSION", "unknown"),
        "checks": {
            "api_key_configured": bool(os.getenv("ANTHROPIC_API_KEY")),
            "redis_connected": await check_redis(),
        }
    }

async def check_redis() -> bool:
    """Verifica conectividad con Redis."""
    try:
        import redis.asyncio as redis
        r = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379"))
        await r.ping()
        await r.close()
        return True
    except Exception:
        return False
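Nótese que `health_detailed` siempre reporta `status: "healthy"`, incluso si una comprobación falla. Un helper mínimo (el nombre `overall_status` es ilustrativo, no parte del SDK) permite degradar el estado según los checks:

```python
def overall_status(checks: dict[str, bool]) -> str:
    """'healthy' solo si todas las comprobaciones pasan; si no, 'degraded'."""
    return "healthy" if all(checks.values()) else "degraded"

# En health_detailed() bastaría con usar: "status": overall_status(checks)
print(overall_status({"api_key_configured": True, "redis_connected": False}))  # → degraded
```

Los load balancers suelen tratar cualquier respuesta 200 como sana, así que para checks críticos conviene además devolver un código 503 cuando el estado es "degraded".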

Prometheus Metrics

# metrics.py
from prometheus_client import (
    Counter, Histogram, Gauge,
    generate_latest, CONTENT_TYPE_LATEST
)
from fastapi import APIRouter
from fastapi.responses import Response

router = APIRouter()

# Contadores
QUERIES_TOTAL = Counter(
    "agent_queries_total",
    "Total de queries al agente",
    ["status", "agent_name"]
)

TOOL_CALLS_TOTAL = Counter(
    "agent_tool_calls_total",
    "Total de llamadas a herramientas",
    ["tool_name", "status"]
)

COST_USD_TOTAL = Counter(
    "agent_cost_usd_total",
    "Costo total en USD",
    ["agent_name"]
)

# Histogramas
QUERY_DURATION = Histogram(
    "agent_query_duration_seconds",
    "Duración de queries del agente",
    ["agent_name"],
    buckets=[5, 15, 30, 60, 120, 300, 600, 1800]
)

# Gauges
ACTIVE_AGENTS = Gauge(
    "agent_active_queries",
    "Número de queries en ejecución actualmente"
)

@router.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
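El patrón de instrumentación de estas métricas — incrementar el gauge al inicio, contar éxito o error, observar la duración al final — se puede ver aislado en un sketch sin `prometheus_client` (los contadores en dicts son solo ilustrativos; en producción serían `QUERIES_TOTAL`, `QUERY_DURATION` y `ACTIVE_AGENTS`):

```python
import time
from contextlib import contextmanager

# Métricas simuladas para ilustrar el patrón
counters: dict[str, float] = {"success": 0, "error": 0, "active": 0}
durations: list[float] = []

@contextmanager
def track_query():
    counters["active"] += 1          # ACTIVE_AGENTS.inc()
    start = time.monotonic()
    try:
        yield
        counters["success"] += 1     # QUERIES_TOTAL.labels(status="success", ...).inc()
    except Exception:
        counters["error"] += 1       # QUERIES_TOTAL.labels(status="error", ...).inc()
        raise
    finally:
        # QUERY_DURATION.labels(...).observe(...)
        durations.append(time.monotonic() - start)
        counters["active"] -= 1      # ACTIVE_AGENTS.dec()

with track_query():
    pass  # aquí iría la ejecución del agente con query()
```

El `finally` garantiza que la duración y el decremento del gauge se registran aunque el agente lance una excepción.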

Alertas con Prometheus AlertManager

# prometheus-alerts.yaml
groups:
  - name: claude-agent-alerts
    interval: 30s
    rules:
      - alert: AgentHighErrorRate
        expr: |
          rate(agent_queries_total{status="error"}[5m]) /
          rate(agent_queries_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate del agente > 5%"
          description: "El agente {{ $labels.agent_name }} tiene {{ $value | humanizePercentage }} de errores"

      - alert: AgentHighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(agent_query_duration_seconds_bucket[10m]))) > 1800
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latencia p99 del agente > 30 minutos"

      - alert: AgentDailyCostExceeded
        expr: |
          increase(agent_cost_usd_total[24h]) > 50
        labels:
          severity: warning
        annotations:
          summary: "Costo diario del agente excedió $50"
          description: "Costo acumulado: ${{ $value }}"
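La expresión de `AgentHighErrorRate` es una razón entre dos tasas. Reproducir esa lógica en Python (función ilustrativa, no parte de AlertManager) ayuda a validar umbrales antes de desplegar las reglas:

```python
def fires_high_error_rate(errors_per_s: float, total_per_s: float,
                          threshold: float = 0.05) -> bool:
    """Equivalente local de: rate(errors[5m]) / rate(total[5m]) > 0.05."""
    if total_per_s == 0:
        return False  # sin tráfico no hay alerta
    return errors_per_s / total_per_s > threshold

print(fires_high_error_rate(0.6, 10.0))  # 6% de errores → True
```

El guard de `total_per_s == 0` refleja un detalle real de PromQL: una división entre cero produce `NaN` y la alerta no dispara sin tráfico.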

9. Scaling y Rate Limiting

Token Bucket con Redis

# rate_limiter.py
import asyncio
import time
import redis.asyncio as redis


class AnthropicRateLimiter:
    """
    Token bucket para respetar rate limits de Anthropic.
    Implementación centralizada via Redis para múltiples instancias.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        requests_per_minute: int = 1000,
        key: str = "anthropic_rate_limiter"
    ):
        self.redis = redis_client
        self.rpm = requests_per_minute
        self.key = key
        self.refill_interval = 60.0 / requests_per_minute

    async def acquire(self, timeout: float = 60.0) -> bool:
        """Espera hasta obtener un token del bucket."""
        deadline = time.time() + timeout

        while time.time() < deadline:
            # Script Lua atómico para token bucket
            result = await self.redis.eval(
                """
                local key = KEYS[1]
                local capacity = tonumber(ARGV[1])
                local now = tonumber(ARGV[2])

                local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
                local tokens = tonumber(bucket[1]) or capacity
                local last_refill = tonumber(bucket[2]) or now

                -- Refill tokens basado en tiempo transcurrido
                local elapsed = now - last_refill
                local new_tokens = math.min(capacity, tokens + (elapsed * capacity / 60))

                if new_tokens >= 1 then
                    redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
                    redis.call('EXPIRE', key, 120)
                    return 1
                else
                    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
                    redis.call('EXPIRE', key, 120)
                    return 0
                end
                """,
                1,
                self.key,
                self.rpm,
                time.time()
            )

            if result == 1:
                return True

            # Esperar antes de reintentar
            await asyncio.sleep(self.refill_interval)

        return False


# Uso con el SDK
from claude_code_sdk import query, ClaudeCodeOptions

async def run_agent_with_rate_limit(
    prompt: str,
    rate_limiter: AnthropicRateLimiter
) -> str:
    # Esperar token antes de ejecutar
    acquired = await rate_limiter.acquire(timeout=120.0)
    if not acquired:
        raise TimeoutError("Rate limit timeout: no se pudo obtener token")

    results = []
    async for message in query(
        prompt=prompt,
        options=ClaudeCodeOptions(allowed_tools=["Read", "Bash"])
    ):
        if hasattr(message, 'content') and message.content:
            results.append(str(message.content))

    return "\n".join(results)
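Si el servicio corre en una sola instancia, Redis es innecesario: un token bucket en memoria con la misma lógica de refill basta. Un sketch (la clase `LocalTokenBucket` es hipotética, con la misma semántica que el script Lua):

```python
import time

class LocalTokenBucket:
    """Token bucket en memoria: misma lógica de refill que el script Lua,
    pero válido solo para una instancia del servicio."""

    def __init__(self, requests_per_minute: int = 60):
        self.capacity = requests_per_minute
        self.tokens = float(requests_per_minute)
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill proporcional al tiempo transcurrido: capacity tokens por minuto
        self.tokens = min(self.capacity, self.tokens + elapsed * self.capacity / 60)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Migrar después a la versión con Redis no cambia la interfaz: solo el lugar donde vive el estado del bucket.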

Queue con Redis para Picos de Tráfico

# queue_worker.py
import asyncio
import json
import uuid
from dataclasses import dataclass, asdict
from typing import Optional
import redis.asyncio as redis
from claude_code_sdk import query, ClaudeCodeOptions


@dataclass
class AgentJob:
    job_id: str
    prompt: str
    tools: list[str]
    status: str = "pending"
    result: Optional[str] = None
    error: Optional[str] = None


class AgentQueue:
    """Queue de trabajos para el agente con Redis."""

    QUEUE_KEY = "agent:jobs:pending"
    RESULTS_PREFIX = "agent:result:"

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    async def enqueue(self, prompt: str, tools: list[str]) -> str:
        """Agrega un trabajo a la queue. Retorna el job_id."""
        job = AgentJob(
            job_id=str(uuid.uuid4()),
            prompt=prompt,
            tools=tools
        )
        await self.redis.lpush(self.QUEUE_KEY, json.dumps(asdict(job)))
        return job.job_id

    async def get_result(self, job_id: str, timeout: float = 300.0) -> Optional[AgentJob]:
        """Espera el resultado de un trabajo."""
        key = f"{self.RESULTS_PREFIX}{job_id}"
        deadline = asyncio.get_running_loop().time() + timeout

        while asyncio.get_running_loop().time() < deadline:
            result = await self.redis.get(key)
            if result:
                return AgentJob(**json.loads(result))
            await asyncio.sleep(1.0)

        return None

    async def worker(self, worker_id: int = 0) -> None:
        """Procesa trabajos de la queue indefinidamente."""
        print(f"Worker {worker_id} iniciado")

        while True:
            # BRPOP bloquea hasta que haya un trabajo (timeout 5s)
            item = await self.redis.brpop(self.QUEUE_KEY, timeout=5)

            if not item:
                continue

            _, job_data = item
            job = AgentJob(**json.loads(job_data))

            try:
                results = []
                async for message in query(
                    prompt=job.prompt,
                    options=ClaudeCodeOptions(
                        allowed_tools=job.tools,
                        max_turns=50
                    )
                ):
                    if hasattr(message, 'content') and message.content:
                        results.append(str(message.content))

                job.status = "completed"
                job.result = "\n".join(results)

            except Exception as e:
                job.status = "failed"
                job.error = str(e)

            # Guardar resultado con TTL de 1 hora
            await self.redis.setex(
                f"{self.RESULTS_PREFIX}{job.job_id}",
                3600,
                json.dumps(asdict(job))
            )
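Los trabajos viajan por Redis serializados como JSON, así que el round-trip `asdict` → `json` → `AgentJob(**...)` debe ser sin pérdidas. Un sketch autocontenido (redeclara `AgentJob` para poder ejecutarse solo):

```python
import json
import uuid
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class AgentJob:
    job_id: str
    prompt: str
    tools: list[str]
    status: str = "pending"
    result: Optional[str] = None
    error: Optional[str] = None

# Serializar para LPUSH y restaurar tras BRPOP
original = AgentJob(job_id=str(uuid.uuid4()), prompt="Lista los archivos", tools=["Bash"])
payload = json.dumps(asdict(original))
restored = AgentJob(**json.loads(payload))
assert restored == original  # el round-trip preserva todos los campos
```

Si más adelante se agregan campos al dataclass, los trabajos ya encolados con el esquema viejo seguirán deserializando mientras los campos nuevos tengan valor por defecto.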

Resumen del Capítulo

En este capítulo cubrimos el ciclo completo de deployment de agentes en producción:

flowchart LR
    Dev["Desarrollo\nLocal"] --> Docker["Docker\nContainer"]
    Docker --> CI["CI/CD\nGitHub Actions"]
    CI --> |Test + Build| Registry["Container\nRegistry"]
    Registry --> |Deploy| Target["Target\n(K8s/Cloud Run/Railway)"]
    Target --> Monitor["Monitoreo\n(Prometheus + Grafana)"]
    Monitor --> |Alertas| Team["Equipo"]

Puntos clave:

- Los agentes son procesos largos y con estado: el deployment debe contemplar timeouts extensos, streaming y sesiones recuperables desde cualquier instancia.
- Cloud Run y Azure Container Apps se adaptan mejor que Lambda gracias a timeouts de hasta 60 minutos; Railway y Render bastan para MVPs.
- Un pipeline de CI/CD con tests, build de imagen, deploy verificado y rollback automático reduce el riesgo de cada release; el canary limita el impacto de una regresión.
- Monitorea salud, latencia, tasa de errores y, sobre todo, el costo en USD: el costo variable por request es lo que distingue a los agentes de un servicio web.
- Un token bucket centralizado en Redis y una queue de trabajos absorben picos de tráfico sin exceder los rate limits de Anthropic.

Próximo capítulo: Capítulo 18: Monitoreo y Observabilidad — Logs estructurados, métricas Prometheus, trazas OpenTelemetry y dashboards Grafana.