Compression Pipeline

Every request from your AI CLI passes through Squeezr on localhost:8080. The proxy applies three compression layers before forwarding to the upstream API.

Pipeline overview

Request from coding tool
         |
         v
  +------------------------+
  | Layer 1: System Prompt |  Compress once, cache, reuse every turn
  +------------------------+
         |
         v
  +------------------------+
  | Layer 2: Deterministic |  Zero-latency rule-based transforms
  |    Preprocessing       |  (ANSI, dedup, JSON, noise removal)
  +------------------------+
         |
         v
  +------------------------+
  | Layer 3: Tool-Specific |  30+ patterns for git, tests, builds,
  |    Patterns            |  infra, package managers, and more
  +------------------------+
         |
         v
  Forward to upstream API (streaming, unmodified response)

Layer 1: System prompt compression

Claude Code's system prompt is typically 13–20 KB and is sent with every single request. Squeezr runs two passes before forwarding it:

  1. Skill/plugin block dedup — Byte-for-byte identical blocks (e.g. duplicated plugin skill registrations, a known Claude Code issue) are detected by MD5 hash and collapsed to a single copy. This needs no API call and adds negligible latency; it typically saves 5–20% of the system prompt.
  2. AI compression (Haiku) — The deduped prompt is compressed once using Haiku and the result is cached. Every subsequent request reuses the cached version.

Savings: ~3,000–6,000 tokens per request after the first.
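
The two passes above can be sketched as follows. This is a minimal illustration, not Squeezr's actual code: the function names, the blank-line block separator, and the in-memory cache are all assumptions, and `compress` stands in for whatever call sends the prompt to Haiku.

```python
import hashlib

def dedup_blocks(prompt: str, separator: str = "\n\n") -> str:
    """Drop byte-identical blocks, keeping the first occurrence of each.

    Splitting on blank lines is an assumption made for this sketch.
    """
    seen = set()
    kept = []
    for block in prompt.split(separator):
        digest = hashlib.md5(block.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical block already kept
        seen.add(digest)
        kept.append(block)
    return separator.join(kept)

# Hypothetical in-memory cache: MD5 of the deduped prompt -> compressed prompt.
_prompt_cache = {}

def compress_system_prompt(prompt: str, compress) -> str:
    """Dedup, then compress once per distinct prompt and reuse the result."""
    deduped = dedup_blocks(prompt)
    key = hashlib.md5(deduped.encode("utf-8")).hexdigest()
    if key not in _prompt_cache:
        _prompt_cache[key] = compress(deduped)  # the only expensive step
    return _prompt_cache[key]
```

Keying the cache on the deduped text means the expensive compression call runs only when the system prompt actually changes.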

Layer 2: Deterministic preprocessing

Rule-based transforms applied to every tool result. They run locally with no API calls, so they add no meaningful latency:

  • Noise removal — ANSI escape codes, progress bars, timestamps, spinner output stripped
  • Deduplication — repeated stack frames, duplicate lines, redundant git hunks removed
  • Minification — JSON whitespace collapsed, blank lines consolidated
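
As a rough illustration of what such transforms can look like, the sketch below strips ANSI CSI sequences, minifies JSON, and collapses consecutive duplicate lines. The function names and the order of the steps are assumptions for this example; the real rule set is broader (progress bars, timestamps, stack frames, git hunks).

```python
import json
import re

# Matches ANSI CSI sequences such as "\x1b[31m" (colour) or "\x1b[2K" (erase line).
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

def strip_ansi(text: str) -> str:
    return ANSI_RE.sub("", text)

def collapse_repeats(text: str) -> str:
    """Collapse runs of identical consecutive lines, including blank lines."""
    out = []
    for line in text.splitlines():
        line = line.rstrip()
        if out and line == out[-1]:
            continue
        out.append(line)
    return "\n".join(out)

def preprocess(text: str) -> str:
    text = strip_ansi(text)
    try:
        # Valid JSON is minified and returned as-is; line-level dedup is
        # skipped here because it could drop repeated array elements.
        return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)
    except ValueError:
        return collapse_repeats(text)
```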

Layer 3: Tool-specific patterns

Each tool result is matched against 30+ specialized compression rules. Errors, warnings, and actionable information are always preserved.

| Category | Tools | What it does |
|---|---|---|
| Git | diff, log, status, branch | 1-line diff context, capped log, compact status |
| JS/TS | vitest, jest, playwright, tsc, eslint, biome, prettier | Failures/errors only, grouped by file |
| Package managers | pnpm, npm | Install summary, list capped at 30, outdated only |
| Build | next build, cargo build | Errors only |
| Test | cargo test, pytest, go test | FAIL blocks + tracebacks only |
| Infra | terraform, docker, kubectl | Resource changes, compact tables, last 50 log lines |
| Other | prisma, gh CLI, curl/wget | Strip ASCII art, cap output, remove verbose headers |
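
One way to picture the matching step is a list of (command pattern, compressor) rules. Everything in the sketch below is hypothetical, including the rule list, the 20-line log cap, and the failure regex; it only shows the dispatch idea, with two simplified rules standing in for the 30+ real ones.

```python
import re
from typing import Callable, List, Tuple

def cap_lines(limit: int) -> Callable[[str], str]:
    """Build a compressor that keeps the first `limit` lines plus a note."""
    def apply(output: str) -> str:
        lines = output.splitlines()
        if len(lines) <= limit:
            return output
        omitted = len(lines) - limit
        return "\n".join(lines[:limit] + [f"... ({omitted} more lines omitted)"])
    return apply

def failures_only(output: str) -> str:
    """Keep only lines that mention a failure or error."""
    keep = [l for l in output.splitlines()
            if re.search(r"\b(FAIL|FAILED|ERROR|error)\b", l)]
    # If nothing matched, return the output unchanged rather than guess.
    return "\n".join(keep) if keep else output

# Hypothetical rules; the cap of 20 is illustrative, not a documented value.
RULES: List[Tuple[re.Pattern, Callable[[str], str]]] = [
    (re.compile(r"^git log\b"), cap_lines(20)),
    (re.compile(r"^(pytest|go test|cargo test)\b"), failures_only),
]

def compress_tool_result(command: str, output: str) -> str:
    for pattern, compressor in RULES:
        if pattern.search(command):
            return compressor(output)
    return output  # no rule matched: pass through untouched
```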

Exclusive patterns

Applied to specific content types regardless of which tool produced them:

  • Lockfiles (package-lock.json, Cargo.lock, etc.) → dependency count summary
  • Large code files (>500 lines) → imports + function/class signatures only
  • Long output (>200 lines) → head + tail + omission note
  • Grep results → grouped by file, matches capped
  • Glob results (>30 files) → directory tree summary
  • Noisy output (>50% non-essential) → auto-extract errors/warnings
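
The long-output rule, for instance, could be sketched like this. The 200-line trigger comes from the list above; the head and tail sizes and the wording of the omission note are assumptions made for the example.

```python
def truncate_long_output(output: str, threshold: int = 200,
                         head: int = 50, tail: int = 50) -> str:
    """Keep the first `head` and last `tail` lines of long output.

    Assumes head + tail <= threshold so the two slices never overlap.
    """
    lines = output.splitlines()
    if len(lines) <= threshold:
        return output  # short enough: leave untouched
    omitted = len(lines) - head - tail
    note = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + [note] + lines[len(lines) - tail:])
```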

Adaptive pressure

Compression aggressiveness scales automatically with context window usage:

| Context usage | Threshold | Behavior |
|---|---|---|
| < 50% | 1,500 chars | Light — only compress large results |
| 50–75% | 800 chars | Normal — standard compression |
| 75–90% | 400 chars | Aggressive — compress most results |
| > 90% | 150 chars | Critical — compress everything, 0 git diff context |
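
Read as a lookup, the table maps context usage to a size threshold. The sketch below assumes the threshold is the minimum tool-result length (in characters) at which compression kicks in, and that each band includes its lower bound; neither detail is stated in the table, and the function names are hypothetical.

```python
def compression_threshold(context_usage: float) -> int:
    """Return the size threshold in chars for a usage fraction in [0, 1]."""
    if context_usage < 0.50:
        return 1500  # Light
    if context_usage < 0.75:
        return 800   # Normal
    if context_usage <= 0.90:
        return 400   # Aggressive
    return 150       # Critical

def should_compress(result: str, context_usage: float) -> bool:
    # Assumption: only results longer than the threshold are compressed.
    return len(result) > compression_threshold(context_usage)
```

Under these assumptions, a 500-character tool result is left alone at 60% usage but compressed at 80%.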

Session optimizations

  • Diff-based repeated Read — if the same file is read multiple times, earlier reads are replaced with a Myers unified diff vs the latest version. Typical savings: 60–85% on files that change slightly between reads.
  • Image dedup — repeated image blocks (screenshots) are hashed and deduplicated; only the most recent occurrence is kept at full fidelity. Typical savings: 80–95% on repeated screenshots.
  • Attachment/artifact dedup — large repeated text blocks (≥500 chars) are hashed and collapsed. Catches Desktop file uploads and generated artifacts that get re-sent every turn.
  • Stale turns summarization — when a session exceeds the configured threshold, old assistant/user turns are replaced with a compact placeholder. The last N turns are always kept at full fidelity.
  • Cache barrier — operations that could invalidate Anthropic's prompt cache (dedup, diff, AI compression) are restricted to messages after the last cache_control marker. The cached prefix is never mutated.
  • KV cache warming — deterministic MD5-based IDs keep compressed content prefix-stable across requests, maximising Anthropic cache hit rate.
  • Expand on demand — compressed blocks include a squeezr_expand(id) callback to retrieve full content.
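
To show how the cache barrier, hashing, and deterministic IDs fit together, here is a simplified sketch of text-block dedup. It assumes messages are dicts in the Anthropic Messages API shape (string content, or a list of blocks that may carry `cache_control`), handles only string content, and keeps the first occurrence of a repeated block. The function names and the placeholder format are hypothetical.

```python
import hashlib

def last_cache_marker(messages: list) -> int:
    """Index of the last message with a cache_control marker, or -1."""
    last = -1
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(b, dict) and "cache_control" in b for b in content
        ):
            last = i
    return last

def block_id(text: str) -> str:
    """Deterministic ID: the same text yields the same ID on every request."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:12]

def dedup_after_barrier(messages: list, min_chars: int = 500) -> list:
    barrier = last_cache_marker(messages)
    seen = set()
    result = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, str) and len(content) >= min_chars:
            digest = block_id(content)
            # Messages at or before the barrier are recorded but never changed.
            if digest in seen and i > barrier:
                msg = {**msg, "content": f"[duplicate of block {digest}]"}
            seen.add(digest)
        result.append(msg)
    return result
```

Because the placeholder is derived only from the content hash, repeated requests produce byte-identical replacements, which keeps the message prefix stable for caching.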

Compression backends

| Backend | Model | Used for | Cost |
|---|---|---|---|
| Anthropic | Haiku | System prompt, session cache | ~$0.0001/call |
| OpenAI | GPT-4o-mini | Fallback compression | ~$0.0001/call |
| Gemini | Flash-8B | Fallback compression | Free |
| Local | qwen2.5-coder:1.5b | Compression when using Ollama | Free |