Compression Pipeline

Every request from your AI CLI passes through Squeezr on localhost:8080. The proxy applies three compression layers before forwarding to the upstream API.

Pipeline overview

Request from coding tool
         |
         v
  +------------------------+
  | Layer 1: System Prompt |  Compress once, cache, reuse every turn
  +------------------------+
         |
         v
  +------------------------+
  | Layer 2: Deterministic |  Zero-latency rule-based transforms
  |    Preprocessing       |  (ANSI, dedup, JSON, noise removal)
  +------------------------+
         |
         v
  +------------------------+
  | Layer 3: Tool-Specific |  30+ patterns for git, tests, builds,
  |    Patterns            |  infra, package managers, and more
  +------------------------+
         |
         v
  Forward to upstream API (streaming, unmodified response)

Layer 1: System prompt compression

Claude Code's system prompt is ~13KB and is sent with every request. Squeezr compresses it once using a cheap model (Claude Haiku), caches the result, and reuses the cached version automatically on every subsequent request.

Savings: ~3,000 tokens per request after the first.
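The compress-once-and-cache pattern can be sketched as follows. This is an illustrative sketch, not Squeezr's actual code: `compress_with_cheap_model` stands in for the real Haiku call, and the cache key scheme is an assumption.

```python
import hashlib

_cache: dict[str, str] = {}

def compress_system_prompt(prompt: str, compress_with_cheap_model) -> str:
    """Compress a system prompt once; later calls are a dict lookup."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        # One API call to the cheap model, on the first request only.
        _cache[key] = compress_with_cheap_model(prompt)
    return _cache[key]
```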

Layer 2: Deterministic preprocessing

Rule-based transforms applied to every tool result, with no API calls and no added latency:

  • Noise removal — ANSI escape codes, progress bars, timestamps, spinner output stripped
  • Deduplication — repeated stack frames, duplicate lines, redundant git hunks removed
  • Minification — JSON whitespace collapsed, blank lines consolidated

Layer 3: Tool-specific patterns

Each tool result is matched against 30+ specialized compression rules. Errors, warnings, and actionable information are always preserved.

Category          Tools                                        What it does
Git               diff, log, status, branch                    1-line diff context, capped log, compact status
JS/TS             vitest, jest, playwright, tsc, eslint,       Failures/errors only, grouped by file
                  biome, prettier
Package managers  pnpm, npm                                    Install summary, list capped at 30, outdated only
Build             next build, cargo build                      Errors only
Test              cargo test, pytest, go test                  FAIL blocks + tracebacks only
Infra             terraform, docker, kubectl                   Resource changes, compact tables, last 50 log lines
Other             prisma, gh CLI, curl/wget                    Strip ASCII art, cap output, remove verbose headers
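Conceptually, this layer is a registry that pairs a command matcher with a compressor and applies the first match. The sketch below is hypothetical: the rule format and the pytest filter are assumptions, not Squeezr's actual rules.

```python
import re

# Each rule: (command pattern, compressor). First match wins.
RULES = [
    # Placeholder for the real git-diff rule (1-line context, etc.).
    (re.compile(r"^git diff"), lambda out: out),
    # Keep only failure lines and errors from pytest output.
    (re.compile(r"^pytest"), lambda out: "\n".join(
        l for l in out.splitlines()
        if l.startswith("FAILED") or "Error" in l)),
]

def compress_tool_result(command: str, output: str) -> str:
    """Apply the first matching rule; pass unmatched output through."""
    for pattern, compressor in RULES:
        if pattern.search(command):
            return compressor(output)
    return output
```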

Exclusive patterns

Applied to specific content types regardless of which tool produced them:

  • Lockfiles (package-lock.json, Cargo.lock, etc.) → dependency count summary
  • Large code files (>500 lines) → imports + function/class signatures only
  • Long output (>200 lines) → head + tail + omission note
  • Grep results → grouped by file, matches capped
  • Glob results (>30 files) → directory tree summary
  • Noisy output (>50% non-essential) → auto-extract errors/warnings
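The long-output pattern (head + tail + omission note) can be sketched like this; the line limits mirror the ones above, but the exact counts kept on each side are illustrative.

```python
def head_tail(text: str, max_lines: int = 200, keep: int = 20) -> str:
    """Keep the first and last `keep` lines of over-long output,
    with an omission note in between."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - 2 * keep
    return "\n".join(lines[:keep]
                     + [f"... {omitted} lines omitted ..."]
                     + lines[-keep:])
```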

Adaptive pressure

Compression aggressiveness scales automatically with context window usage:

Context usage   Threshold     Behavior
< 50%           1,500 chars   Light — only compress large results
50–75%          800 chars     Normal — standard compression
75–90%          400 chars     Aggressive — compress most results
> 90%           150 chars     Critical — compress everything, 0 git diff context
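The schedule above amounts to a small step function over context usage. A sketch, with the tiers taken from the table and the function name being illustrative:

```python
def compression_threshold(context_usage: float) -> int:
    """Return the size (in chars) above which a tool result is compressed,
    given context window usage as a fraction in [0, 1]."""
    if context_usage < 0.50:
        return 1500   # light
    if context_usage < 0.75:
        return 800    # normal
    if context_usage < 0.90:
        return 400    # aggressive
    return 150        # critical
```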

Session optimizations

  • Session cache — after ~50 tool results, older results are batch-summarized into a single compact block
  • KV cache warming — deterministic MD5-based IDs keep compressed content prefix-stable across requests
  • Cross-turn dedup — if the same file is read multiple times, earlier reads are replaced with reference pointers
  • Expand on demand — compressed blocks include a squeezr_expand(id) callback to retrieve full content
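Deterministic MD5-based IDs matter for KV cache warming because the same tool result must produce byte-identical compressed blocks across requests, keeping the upstream prompt prefix stable. A minimal sketch; the `sq_` prefix and ID length are assumptions:

```python
import hashlib

def block_id(content: str) -> str:
    """Derive a stable ID from content: same input, same ID, every request."""
    return "sq_" + hashlib.md5(content.encode()).hexdigest()[:12]
```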

Compression backends

Backend     Model               Used for                       Cost
Anthropic   Haiku               System prompt, session cache   ~$0.0001/call
OpenAI      GPT-4o-mini         Fallback compression           ~$0.0001/call
Gemini      Flash-8B            Fallback compression           Free
Local       qwen2.5-coder:1.5b  Compression when using Ollama  Free