Compression Pipeline
Every request from your AI CLI passes through Squeezr on localhost:8080. The proxy applies three compression layers before forwarding to the upstream API.
Pipeline overview
Request from coding tool
|
v
+------------------------+
| Layer 1: System Prompt | Compress once, cache, reuse every turn
+------------------------+
|
v
+------------------------+
| Layer 2: Deterministic | Zero-latency rule-based transforms
| Preprocessing | (ANSI, dedup, JSON, noise removal)
+------------------------+
|
v
+------------------------+
| Layer 3: Tool-Specific | 30+ patterns for git, tests, builds,
| Patterns | infra, package managers, and more
+------------------------+
|
v
Forward to upstream API (streaming, unmodified response)Layer 1: System prompt compression
Claude Code's system prompt is typically 13–20 KB and is sent with every single request. Squeezr runs two passes before forwarding it:
- Skill/plugin block dedup — Identical blocks (e.g. duplicated plugin skill registrations, a known Claude Code issue) are collapsed byte-a-byte using MD5 matching. Free, zero-latency, typically saves 5–20% of the system prompt.
- AI compression (Haiku) — The deduped prompt is compressed once using Haiku and the result is cached. Every subsequent request reuses the cached version.
Savings: ~3,000–6,000 tokens per request after the first.
Layer 2: Deterministic preprocessing
Zero-latency rule-based transforms applied to every tool result. No API calls, no latency:
- Noise removal — ANSI escape codes, progress bars, timestamps, spinner output stripped
- Deduplication — repeated stack frames, duplicate lines, redundant git hunks removed
- Minification — JSON whitespace collapsed, blank lines consolidated
Layer 3: Tool-specific patterns
Each tool result is matched against 30+ specialized compression rules. Errors, warnings, and actionable information are always preserved.
| Category | Tools | What it does |
|---|---|---|
| Git | diff, log, status, branch | 1-line diff context, capped log, compact status |
| JS/TS | vitest, jest, playwright, tsc, eslint, biome, prettier | Failures/errors only, grouped by file |
| Package managers | pnpm, npm | Install summary, list capped at 30, outdated only |
| Build | next build, cargo build | Errors only |
| Test | cargo test, pytest, go test | FAIL blocks + tracebacks only |
| Infra | terraform, docker, kubectl | Resource changes, compact tables, last 50 log lines |
| Other | prisma, gh CLI, curl/wget | Strip ASCII art, cap output, remove verbose headers |
Exclusive patterns
Applied to specific content types regardless of which tool produced them:
- Lockfiles (package-lock.json, Cargo.lock, etc.) → dependency count summary
- Large code files (>500 lines) → imports + function/class signatures only
- Long output (>200 lines) → head + tail + omission note
- Grep results → grouped by file, matches capped
- Glob results (>30 files) → directory tree summary
- Noisy output (>50% non-essential) → auto-extract errors/warnings
Adaptive pressure
Compression aggressiveness scales automatically with context window usage:
| Context usage | Threshold | Behavior |
|---|---|---|
| < 50% | 1,500 chars | Light — only compress large results |
| 50–75% | 800 chars | Normal — standard compression |
| 75–90% | 400 chars | Aggressive — compress most results |
| > 90% | 150 chars | Critical — compress everything, 0 git diff context |
Session optimizations
- Diff-based repeated Read — if the same file is read multiple times, earlier reads are replaced with a Myers unified diff vs the latest version. Typical savings: 60–85% on files that change slightly between reads.
- Image dedup — repeated image blocks (screenshots) are hashed and deduplicated; only the most recent occurrence is kept at full fidelity. Typical savings: 80–95% on repeated screenshots.
- Attachment/artifact dedup — large repeated text blocks (≥500 chars) are hashed and collapsed. Catches Desktop file uploads and generated artifacts that get re-sent every turn.
- Stale turns summarization — when a session exceeds the configured threshold, old assistant/user turns are replaced with a compact placeholder. The last N turns are always kept at full fidelity.
- Cache barrier — operations that could invalidate Anthropic's prompt cache (dedup, diff, AI compression) are restricted to messages afterthe last cache_control marker. The cached prefix is never mutated.
- KV cache warming — deterministic MD5-based IDs keep compressed content prefix-stable across requests, maximising Anthropic cache hit rate.
- Expand on demand — compressed blocks include a
squeezr_expand(id)callback to retrieve full content.
Compression backends
| Backend | Model | Used for | Cost |
|---|---|---|---|
| Anthropic | Haiku | System prompt, session cache | ~$0.0001/call |
| OpenAI | GPT-4o-mini | Fallback compression | ~$0.0001/call |
| Gemini | Flash-8B | Fallback compression | Free |
| Local | qwen2.5-coder:1.5b | Compression when using Ollama | Free |