Squeezr — AI Context Compression

Every request from your AI CLI passes through Squeezr on localhost:8080. The proxy applies three compression layers before forwarding to the upstream API.

Pipeline overview

Request from coding tool
         |
         v
  +------------------------+
  | Layer 1: System Prompt |  Compress once, cache, reuse every turn
  +------------------------+
         |
         v
  +------------------------+
  | Layer 2: Deterministic |  Zero-latency rule-based transforms
  |    Preprocessing       |  (ANSI, dedup, JSON, noise removal)
  +------------------------+
         |
         v
  +------------------------+
  | Layer 3: Tool-Specific |  30+ patterns for git, tests, builds,
  |    Patterns            |  infra, package managers, and more
  +------------------------+
         |
         v
  Forward to upstream API (streaming, unmodified response)

Layer 1: System prompt compression

Claude Code's system prompt is typically 13–20 KB and is sent with every single request. Squeezr runs two passes before forwarding it:

Skill/plugin block dedup — Byte-for-byte identical blocks (e.g. duplicated plugin skill registrations, a known Claude Code issue) are detected by MD5 hash and collapsed into a single copy. Free, zero-latency, typically saves 5–20% of the system prompt.
AI compression (Haiku) — The deduped prompt is compressed once using Haiku and the result is cached. Every subsequent request reuses the cached version.

Savings: ~3,000–6,000 tokens per request after the first.
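
The two passes above can be sketched roughly as follows. This is a minimal illustration, not Squeezr's actual code: it assumes blocks are separated by blank lines, and `dedup_blocks`, `compress_system_prompt`, and the `compress` callable (standing in for the Haiku call) are hypothetical names.

```python
import hashlib

def dedup_blocks(prompt: str, separator: str = "\n\n") -> str:
    """Drop blocks whose MD5 digest was already seen, keeping the first copy.

    Assumption: blocks are delimited by blank lines.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for block in prompt.split(separator):
        digest = hashlib.md5(block.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(block)
    return separator.join(kept)

# In-memory cache keyed by the digest of the deduped prompt (hypothetical).
_cache: dict[str, str] = {}

def compress_system_prompt(prompt: str, compress) -> str:
    """Dedup first, then call `compress` once per distinct prompt and reuse the result."""
    deduped = dedup_blocks(prompt)
    key = hashlib.md5(deduped.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = compress(deduped)
    return _cache[key]
```

Keying the cache on the deduped text means the expensive model call happens only when the system prompt actually changes.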

Layer 2: Deterministic preprocessing

Rule-based transforms applied to every tool result. They run locally, with no API calls and no added latency:

Noise removal — ANSI escape codes, progress bars, timestamps, spinner output stripped
Deduplication — repeated stack frames, duplicate lines, redundant git hunks removed
Minification — JSON whitespace collapsed, blank lines consolidated
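
As a rough sketch of what such transforms might look like, the snippet below strips ANSI escape sequences, minifies JSON, and collapses consecutive duplicate lines. The function names are hypothetical and the rules are deliberately simpler than whatever Squeezr ships.

```python
import json
import re
from typing import Optional

# CSI escape sequences: ESC [ parameters, intermediates, final byte.
ANSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_ansi(text: str) -> str:
    return ANSI_RE.sub("", text)

def minify_json(text: str) -> Optional[str]:
    """Return compact JSON if `text` parses as JSON, otherwise None."""
    try:
        parsed = json.loads(text)
    except ValueError:
        return None
    return json.dumps(parsed, separators=(",", ":"), ensure_ascii=False)

def collapse_repeats(text: str) -> str:
    """Drop consecutive duplicate lines; this also folds runs of blank lines into one."""
    out: list[str] = []
    for line in text.split("\n"):
        if out and line == out[-1]:
            continue
        out.append(line)
    return "\n".join(out)

def preprocess(text: str) -> str:
    clean = strip_ansi(text)
    # JSON is minified as a whole; line-based dedup could drop repeated array items.
    minified = minify_json(clean)
    if minified is not None:
        return minified
    return collapse_repeats(clean)
```

JSON is handled before line deduplication on purpose: a pretty-printed array with repeated values would otherwise lose elements.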

Layer 3: Tool-specific patterns

Each tool result is matched against 30+ specialized compression rules. Errors, warnings, and actionable information are always preserved.

Category	Tools	What it does
Git	diff, log, status, branch	1-line diff context, capped log, compact status
JS/TS	vitest, jest, playwright, tsc, eslint, biome, prettier	Failures/errors only, grouped by file
Package managers	pnpm, npm	Install summary, list capped at 30, outdated only
Build	next build, cargo build	Errors only
Test	cargo test, pytest, go test	FAIL blocks + tracebacks only
Infra	terraform, docker, kubectl	Resource changes, compact tables, last 50 log lines
Other	prisma, gh CLI, curl/wget	Strip ASCII art, cap output, remove verbose headers
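
One way to picture the matching step is a small rule table keyed on the command that produced the output. The sketch below is illustrative only: `RULES`, `keep_failures`, `cap_lines`, and the cap of 20 lines are assumptions, and the failure filter is line-based rather than the block-level extraction described in the table.

```python
import re
from typing import Callable

def keep_failures(output: str) -> str:
    """Keep lines mentioning a failure or error, plus the final summary line."""
    lines = output.splitlines()
    kept = [line for line in lines if re.search(r"FAIL|ERROR|error", line)]
    if lines and lines[-1] not in kept:
        kept.append(lines[-1])
    return "\n".join(kept)

def cap_lines(limit: int) -> Callable[[str], str]:
    """Build a rule that keeps the first `limit` lines and notes how many were dropped."""
    def apply(output: str) -> str:
        lines = output.splitlines()
        if len(lines) <= limit:
            return output
        omitted = len(lines) - limit
        return "\n".join(lines[:limit] + [f"... {omitted} more lines omitted"])
    return apply

# Hypothetical rule table: (command pattern, compressor). First match wins.
RULES = [
    (re.compile(r"^(pytest|cargo test|go test)\b"), keep_failures),
    (re.compile(r"^git log\b"), cap_lines(20)),  # cap value is an assumption
]

def compress_tool_result(command: str, output: str) -> str:
    for pattern, compress in RULES:
        if pattern.search(command):
            return compress(output)
    return output  # no rule matched: pass through unchanged
```

Output from commands without a matching rule passes through untouched, so adding a tool only means appending one entry to the table.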

Exclusive patterns

Applied to specific content types regardless of which tool produced them:

Lockfiles (package-lock.json, Cargo.lock, etc.) → dependency count summary
Large code files (>500 lines) → imports + function/class signatures only
Long output (>200 lines) → head + tail + omission note
Grep results → grouped by file, matches capped
Glob results (>30 files) → directory tree summary
Noisy output (>50% non-essential) → auto-extract errors/warnings
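
The long-output rule, for instance, could look something like this. The 200-line trigger comes from the list above; the head and tail sizes of 40 lines and the wording of the omission note are assumptions made for the sketch.

```python
def head_tail(output: str, max_lines: int = 200, head: int = 40, tail: int = 40) -> str:
    """Keep the first `head` and last `tail` lines of long output.

    Output at or under `max_lines` is returned unchanged.
    `head` and `tail` defaults are illustrative, not Squeezr's values.
    """
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    omitted = len(lines) - head - tail
    note = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + [note] + lines[-tail:])
```

Keeping both ends matters because commands tend to print their invocation context first and their summary or error last.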

Adaptive pressure

Compression aggressiveness scales automatically with context window usage:

Context usage	Threshold	Behavior
< 50%	1,500 chars	Light — only compress large results
50–75%	800 chars	Normal — standard compression
75–90%	400 chars	Aggressive — compress most results
> 90%	150 chars	Critical — compress everything, 0 git diff context
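
Read as code, the table amounts to a lookup from context usage to a size threshold. The sketch below assumes the threshold is the minimum result length (in characters) that triggers compression, and that boundary values fall into the higher-pressure band except at exactly 90%; both are interpretations, not documented behavior.

```python
def compression_threshold(context_usage: float) -> int:
    """Map context window usage (0.0 to 1.0) to a threshold in characters."""
    if context_usage < 0.50:
        return 1500  # light
    if context_usage < 0.75:
        return 800   # normal
    if context_usage <= 0.90:
        return 400   # aggressive
    return 150       # critical

def should_compress(result: str, context_usage: float) -> bool:
    """Assumption: results longer than the current threshold get compressed."""
    return len(result) > compression_threshold(context_usage)
```

Under this reading, a 1,000-character tool result is left alone while the window is mostly empty and starts being compressed once usage passes 50%.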

Session optimizations

Diff-based repeated Read — if the same file is read multiple times, earlier reads are replaced with a Myers unified diff vs the latest version. Typical savings: 60–85% on files that change slightly between reads.
Image dedup — repeated image blocks (screenshots) are hashed and deduplicated; only the most recent occurrence is kept at full fidelity. Typical savings: 80–95% on repeated screenshots.
Attachment/artifact dedup — large repeated text blocks (≥500 chars) are hashed and collapsed. Catches Desktop file uploads and generated artifacts that get re-sent every turn.
Stale turns summarization — when a session exceeds the configured threshold, old assistant/user turns are replaced with a compact placeholder. The last N turns are always kept at full fidelity.
Cache barrier — operations that could invalidate Anthropic's prompt cache (dedup, diff, AI compression) are restricted to messages after the last cache_control marker. The cached prefix is never mutated.
KV cache warming — deterministic MD5-based IDs keep compressed content prefix-stable across requests, maximising Anthropic cache hit rate.
Expand on demand — compressed blocks include a squeezr_expand(id) callback to retrieve full content.
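
The cache barrier, deterministic IDs, and expand-on-demand fit together roughly as sketched below. Everything here is an illustration under assumptions: messages are dicts shaped like Anthropic API messages, only plain-string content is compressed, the `[squeezr:<id>]` tag format and the in-memory `_store` are invented for the example, and this `squeezr_expand` is a stand-in for the real callback.

```python
import hashlib

def block_id(content: str) -> str:
    """Deterministic ID: identical content always maps to the same ID."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()[:12]

def last_cache_marker(messages: list[dict]) -> int:
    """Index of the last message with a cache_control marker, or -1 if none."""
    last = -1
    for i, message in enumerate(messages):
        content = message.get("content")
        blocks = content if isinstance(content, list) else []
        if any(isinstance(b, dict) and "cache_control" in b for b in blocks):
            last = i
    return last

# Hypothetical store of originals, keyed by block ID.
_store: dict[str, str] = {}

def compress_after_barrier(messages: list[dict], compress) -> list[dict]:
    """Compress only messages after the last cache marker; leave the prefix untouched."""
    barrier = last_cache_marker(messages)
    result = []
    for i, message in enumerate(messages):
        if i <= barrier or not isinstance(message.get("content"), str):
            result.append(message)  # cached prefix or non-string content: pass through
            continue
        original = message["content"]
        bid = block_id(original)
        _store[bid] = original
        result.append({**message, "content": f"[squeezr:{bid}] {compress(original)}"})
    return result

def squeezr_expand(bid: str) -> str:
    """Return the full original content for a compressed block."""
    return _store[bid]
```

Because the ID depends only on the content, the same tool result compresses to the same bytes on every request, which keeps the message prefix stable for caching.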

Compression backends

Backend	Model	Used for	Cost
Anthropic	Haiku	System prompt, session cache	~$0.0001/call
OpenAI	GPT-4o-mini	Fallback compression	~$0.0001/call
Gemini	Flash-8B	Fallback compression	Free
Local	qwen2.5-coder:1.5b	Compression when using Ollama	Free
