Ollama & LM Studio

Ollama lets you run LLMs locally. While local models are free, they still benefit from Squeezr's compression — smaller prompts mean faster inference, lower memory usage, and better results within the model's context window.

How Squeezr detects Ollama

Squeezr detects Ollama automatically through its transparent proxy: when it sees a dummy API key (e.g. ollama as the value of the Authorization: Bearer header) or a request targeting a local upstream, it routes the request to the local model server without requiring a real API key.
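A minimal sketch of this detection heuristic, assuming two signals (a known dummy bearer token, or a local upstream host) — the function name and dummy-key set are illustrative, not Squeezr's actual internals:

```python
from typing import Optional
from urllib.parse import urlparse

DUMMY_KEYS = {"ollama"}                 # placeholder tokens local clients commonly send
LOCAL_HOSTS = {"localhost", "127.0.0.1"}

def is_local_request(auth_header: Optional[str], upstream_url: str) -> bool:
    """Route to the local model server when the bearer token is a known
    dummy key or the upstream host is local."""
    if auth_header and auth_header.startswith("Bearer "):
        token = auth_header.removeprefix("Bearer ").strip()
        if token.lower() in DUMMY_KEYS:
            return True
    return urlparse(upstream_url).hostname in LOCAL_HOSTS
```

Either signal alone is enough: a dummy key routes locally even when the client was configured with a cloud-style base URL, and a localhost upstream routes locally regardless of the key.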

Setup

Configure the local upstream in squeezr.toml (next to the binary):

[local]
enabled = true
upstream_url = "http://localhost:11434"
compression_model = "qwen2.5-coder:1.5b"

Then start both services:

# Start Ollama
ollama serve

# Start Squeezr
squeezr start

Using Ollama as the compression backend

With compression_model set, Squeezr uses the local Ollama model to compress large content blocks instead of calling an external API. When Ollama also serves your main coding tool, this makes compression completely free.

Pull the compression model first:

ollama pull qwen2.5-coder:1.5b
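Under the hood, a compression call to Ollama goes through its standard /api/generate endpoint. A sketch of what such a request body might look like — the prompt template and function name here are illustrative assumptions, not Squeezr's actual implementation:

```python
import json

def build_compression_request(model: str, content: str) -> dict:
    """Build a non-streaming Ollama /api/generate payload that asks the
    local model to rewrite a large content block more compactly."""
    return {
        "model": model,
        "prompt": (
            "Rewrite the following text as concisely as possible, "
            "preserving all technical detail:\n\n" + content
        ),
        "stream": False,
    }

payload = build_compression_request("qwen2.5-coder:1.5b", "<large content block>")
body = json.dumps(payload)  # POST this to http://localhost:11434/api/generate
```

With "stream": False, Ollama returns a single JSON object whose response field holds the compressed text.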

Why compress local model requests?

Even though local models are free, compression helps in several ways:

  • Faster inference — Fewer input tokens mean the model processes the prompt faster. This is especially noticeable with larger models.
  • Context window — Local models often have smaller context windows (4K–32K). Compression lets you fit more conversation history within the limit.
  • Memory usage — KV cache memory scales with input length. Shorter prompts reduce VRAM pressure.
  • Quality — Removing noise and duplication helps the model focus on what matters, improving response quality.
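The KV-cache point above is easy to quantify with a back-of-envelope estimate. The dimensions below (32 layers, 32 KV heads, head dimension 128, fp16) are assumptions for a generic 7B-class model, not any specific one:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size: one K and one V tensor per layer,
    each of shape (n_kv_heads, seq_len, head_dim) at dtype_bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

full = kv_cache_bytes(8192)  # uncompressed 8K-token prompt
half = kv_cache_bytes(4096)  # same prompt after ~50% compression
print(full / 2**30, "GiB ->", half / 2**30, "GiB")
```

Under these assumptions, each token costs 0.5 MiB of cache, so halving an 8K-token prompt frees roughly 2 GiB of VRAM.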