Compression Pipeline
Every request from your AI CLI passes through Squeezr on localhost:8080. The proxy applies three compression layers before forwarding to the upstream API.
Pipeline overview
```
Request from coding tool
          |
          v
+------------------------+
| Layer 1: System Prompt |  Compress once, cache, reuse every turn
+------------------------+
          |
          v
+------------------------+
| Layer 2: Deterministic |  Zero-latency rule-based transforms
|     Preprocessing      |  (ANSI, dedup, JSON, noise removal)
+------------------------+
          |
          v
+------------------------+
| Layer 3: Tool-Specific |  30+ patterns for git, tests, builds,
|        Patterns        |  infra, package managers, and more
+------------------------+
          |
          v
Forward to upstream API (streaming, unmodified response)
```

Layer 1: System prompt compression
Claude Code's system prompt is ~13KB and is sent with every single request. Squeezr compresses it once using a cheap AI model (Haiku) and caches the result. Every subsequent request reuses the cached version automatically.
Savings: ~3,000 tokens per request after the first.
Layer 2: Deterministic preprocessing
Rule-based transforms applied to every tool result, with no API calls and no added latency:
- Noise removal — ANSI escape codes, progress bars, timestamps, spinner output stripped
- Deduplication — repeated stack frames, duplicate lines, redundant git hunks removed
- Minification — JSON whitespace collapsed, blank lines consolidated
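The transforms above can be sketched with plain regexes and stdlib JSON handling. This is an illustrative subset, assuming the real implementation covers many more patterns (progress bars, timestamps, spinners):

```python
import json
import re

# Matches CSI-style ANSI escape sequences such as "\x1b[31m".
ANSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def preprocess(text: str) -> str:
    """Zero-latency rule-based cleanup (sketch of Layer 2)."""
    text = ANSI.sub("", text)            # strip ANSI escape codes
    lines: list[str] = []
    for line in text.splitlines():
        if lines and line == lines[-1]:
            continue                     # drop consecutive duplicate lines
        lines.append(line)
    out = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", out)  # consolidate runs of blank lines

def minify_json(text: str) -> str:
    """Collapse JSON whitespace if the text parses as JSON; else pass through."""
    try:
        return json.dumps(json.loads(text), separators=(",", ":"))
    except ValueError:
        return text
```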
Layer 3: Tool-specific patterns
Each tool result is matched against 30+ specialized compression rules. Errors, warnings, and actionable information are always preserved.
| Category | Tools | What it does |
|---|---|---|
| Git | diff, log, status, branch | 1-line diff context, capped log, compact status |
| JS/TS | vitest, jest, playwright, tsc, eslint, biome, prettier | Failures/errors only, grouped by file |
| Package managers | pnpm, npm | Install summary, list capped at 30, outdated only |
| Build | next build, cargo build | Errors only |
| Test | cargo test, pytest, go test | FAIL blocks + tracebacks only |
| Infra | terraform, docker, kubectl | Resource changes, compact tables, last 50 log lines |
| Other | prisma, gh CLI, curl/wget | Strip ASCII art, cap output, remove verbose headers |
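A matcher for these categories can be sketched as a prefix-keyed rule table. The rule bodies below are hypothetical stand-ins, not Squeezr's actual patterns:

```python
from typing import Callable

RULES: dict[str, Callable[[str], str]] = {
    # pytest: keep failure/error lines only (sketch of "FAIL blocks + tracebacks")
    "pytest": lambda out: "\n".join(
        line for line in out.splitlines()
        if "FAILED" in line or "Error" in line or line.startswith("E ")
    ),
    # git log: cap at the 20 most recent lines (sketch of "capped log")
    "git log": lambda out: "\n".join(out.splitlines()[:20]),
}

def compress_tool_result(command: str, output: str) -> str:
    for prefix, rule in RULES.items():
        if command.startswith(prefix):
            return rule(output)
    return output  # no matching pattern: pass through unchanged
```

The pass-through default matters: a result that matches no rule is forwarded unmodified, which is why errors and actionable information survive even for unrecognized tools.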
Exclusive patterns
Applied to specific content types regardless of which tool produced them:
- Lockfiles (package-lock.json, Cargo.lock, etc.) → dependency count summary
- Large code files (>500 lines) → imports + function/class signatures only
- Long output (>200 lines) → head + tail + omission note
- Grep results → grouped by file, matches capped
- Glob results (>30 files) → directory tree summary
- Noisy output (>50% non-essential) → auto-extract errors/warnings
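The long-output pattern (head + tail + omission note) is simple enough to sketch directly; the function name and the exact note format are assumptions for illustration:

```python
def head_tail(text: str, limit: int = 200, keep: int = 20) -> str:
    """Long output (> limit lines) -> first/last `keep` lines + omission note."""
    lines = text.splitlines()
    if len(lines) <= limit:
        return text  # short enough: leave untouched
    omitted = len(lines) - 2 * keep
    return "\n".join(
        lines[:keep] + [f"... [{omitted} lines omitted] ..."] + lines[-keep:]
    )
```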
Adaptive pressure
Compression aggressiveness scales automatically with context window usage:
| Context usage | Threshold | Behavior |
|---|---|---|
| < 50% | 1,500 chars | Light — only compress large results |
| 50–75% | 800 chars | Normal — standard compression |
| 75–90% | 400 chars | Aggressive — compress most results |
| > 90% | 150 chars | Critical — compress everything, 0 git diff context |
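The table above maps directly to a small threshold function. This sketch mirrors the documented tiers; the function name is hypothetical:

```python
def compression_threshold(context_usage: float) -> int:
    """Map context-window usage (0.0-1.0) to a char threshold per the table:
    results longer than the returned size get compressed."""
    if context_usage < 0.50:
        return 1500  # light: only compress large results
    if context_usage < 0.75:
        return 800   # normal: standard compression
    if context_usage < 0.90:
        return 400   # aggressive: compress most results
    return 150       # critical: compress everything
```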
Session optimizations
- Session cache — after ~50 tool results, older results are batch-summarized into a single compact block
- KV cache warming — deterministic MD5-based IDs keep compressed content prefix-stable across requests
- Cross-turn dedup — if the same file is read multiple times, earlier reads are replaced with reference pointers
- Expand on demand — compressed blocks include a `squeezr_expand(id)` callback to retrieve full content
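The KV cache warming idea relies on IDs being pure functions of content: the same compressed block must serialize byte-for-byte identically across requests so the provider's prefix cache keeps hitting. A minimal sketch of such an ID scheme (the `sq_` prefix and 12-char length are assumptions):

```python
import hashlib

def block_id(content: str) -> str:
    """Deterministic MD5-based ID: identical content always yields the same ID,
    keeping compressed blocks prefix-stable across requests."""
    return "sq_" + hashlib.md5(content.encode()).hexdigest()[:12]
```

A random or per-request ID would break prefix stability: even one changed byte early in the context invalidates the provider-side cache for everything after it.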
Compression backends
| Backend | Model | Used for | Cost |
|---|---|---|---|
| Anthropic | Haiku | System prompt, session cache | ~$0.0001/call |
| OpenAI | GPT-4o-mini | Fallback compression | ~$0.0001/call |
| Gemini | Flash-8B | Fallback compression | Free |
| Local | qwen2.5-coder:1.5b | Compression when using Ollama | Free |