AI Agency Deep Dive: Building Low-Latency LLM APIs with Quantization and Caching

Written by Technical Team · Last updated 03.11.2025 · 16 minute read


Understanding low-latency LLM inference and why milliseconds matter

For language model products, speed is not a nicety; it’s the feeling of intelligence. Users equate response time with competence, and they abandon tools that “think” for too long. Human–computer interaction research gives us a few anchor points: under 100 ms feels instantaneous, under a second keeps the user’s attention, and anything beyond that invites context switching. LLMs complicate the picture because they stream tokens incrementally rather than returning a single blob. That means you can ship the first token fast to sell the perception of speed, while maintaining high overall throughput behind the scenes. In a practical sense, you should design for three latency numbers: time-to-first-token (TTFT), median latency per token, and tail latency (p95/p99) for the whole completion. Optimising only the average will leave you with a product that feels “fine most of the time” and “broken when it matters”.

The critical levers are architectural as much as they are algorithmic. A single GPU with a stock model might deliver respectable throughput for one user, only to fall apart under concurrency because of context length, KV cache blow-ups, or blocking decode loops. Low latency is therefore a systems problem: shaping requests to fit hardware, choosing quantisation to trade precision for speed, caching aggressively to avoid recomputation, and orchestrating scheduling so that small, interactive jobs aren’t swallowed by long-running batch prompts. You need a mental model that connects an input prompt’s token length, the model’s hidden size and number of attention heads, and the amount of memory used for the KV cache, because that’s what silently determines how many concurrent sequences your hardware can serve without paging or thrashing.

There’s also a product constraint most teams miss: latency is part of your brand. If you promise “instant answers”, you must preserve that expectation even as prompts fluctuate, context grows, and downstream tools get invoked. A resilient design uses adaptive strategies—like pairing a small “first responder” drafter model with the main model (speculative decoding), or degrading context length gracefully—so you meet a latency budget even under unpredictable load. The question is not only “how fast can we go” but “how consistently can we stay fast”.

Quantisation strategies for production LLMs that keep quality while slashing latency

Quantisation is the most effective way to reduce latency and cost without retraining. By representing weights (and occasionally activations and KV caches) with fewer bits, you shrink memory bandwidth requirements and fit bigger batch sizes on the same hardware. The trade-off is familiar: fewer bits can mean more quantisation error, which can degrade accuracy or increase hallucination rates if applied naively. In practice, modern techniques have become remarkably robust, especially for decoder-only Transformer architectures used in chat.

A basic taxonomy helps. Post-training quantisation (PTQ) converts a trained model’s weights to lower precision—commonly 8-bit (INT8) or 4-bit (INT4)—using calibration data to minimise error. You’ll see variants like GPTQ (one-shot PTQ that uses approximate second-order information with per-group quantisers) and AWQ (activation-aware weight quantisation, which protects the weights that matter most to activations), both of which retain performance well at aggressive bit-widths. Quantisation-aware training (QAT) bakes quantisation effects into training, giving the best quality at the expense of extra training cost and complexity. For many agency use-cases, PTQ with modern algorithms hits the sweet spot: it’s fast to apply and quality loss is minimal for general chat, rewriting, and code assistance.
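To make the mechanics concrete, here is a minimal NumPy sketch of symmetric per-group weight quantisation. This is the basic scheme, not GPTQ or AWQ themselves (those add error compensation and activation awareness on top); the group size of 128 is a common but arbitrary choice.

```python
import numpy as np

def quantize_per_group(w, bits=4, group_size=128):
    """Symmetric per-group quantisation: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    groups = w.reshape(-1, group_size)              # assumes size % group_size == 0
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # guard all-zero groups
    codes = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales.astype(np.float16)

def dequantize_per_group(codes, scales):
    return (codes.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)    # a fake weight row
codes, scales = quantize_per_group(w)
w_hat = dequantize_per_group(codes, scales)
max_err = float(np.abs(w - w_hat).max())            # bounded by half a quantisation step
```

Smaller groups mean more scales (more storage) but less quantisation error per group; production recipes calibrate this trade-off against real activations rather than random weights.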

One nuance the marketing gloss tends to skip is where quantisation is applied. Weight-only quantisation saves a lot of memory and improves throughput because loading weight matrices dominates memory traffic in the decode phase, but it doesn’t reduce activation size or KV cache growth during decoding. Activation-aware schemes, or mixed-precision approaches (e.g., INT8 weights with FP16 activations), reduce quality regressions while retaining many of the memory benefits. For long-context chatbots, quantising the KV cache can be as impactful as quantising the weights: the cache grows linearly with sequence length, and on popular 7–13B models it can consume multiple gigabytes per active sequence. Lightweight KV-cache quantisation (e.g., 8-bit for keys and values with per-channel scaling) often has negligible impact on output quality but dramatically increases the number of concurrent requests a GPU can serve.
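A minimal sketch of the per-channel 8-bit KV idea, using NumPy and assuming a (sequence, heads, head_dim) tensor layout (real engines differ in layout and keep scales per paged block):

```python
import numpy as np

def quantize_kv_8bit(kv):
    """Per-channel symmetric 8-bit quantisation of a K or V tensor with
    shape (seq_len, num_heads, head_dim): one scale per (head, channel)."""
    scale = np.abs(kv).max(axis=0) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)
    codes = np.clip(np.round(kv / scale), -128, 127).astype(np.int8)
    return codes, scale

def dequantize_kv(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(1)
k = rng.standard_normal((4096, 32, 128)).astype(np.float32)   # 4K tokens, 32 heads
codes, scale = quantize_kv_8bit(k)
k_hat = dequantize_kv(codes, scale)

fp16_bytes = k.size * 2                         # what an FP16 cache would use
int8_bytes = codes.size + scale.size * 4        # codes plus per-channel scales
```

The scales are a rounding error next to the codes, so the cache shrinks by roughly half versus FP16, which translates directly into more concurrent sequences per GPU.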

Quantisation’s benefits compound with other inference optimisations. Kernel fusion reduces memory round-trips, tensor parallelism spreads large layers across multiple GPUs, and paged attention lets you store KV blocks sparsely so long prompts don’t monopolise memory. When combined, a 4-bit weight-only quantised 7B model on a single modern GPU can push tens of thousands of tokens per second in batched scenarios, while delivering sub-200 ms TTFT for interactive prompts. But success depends on careful calibration: different layers have different sensitivity. Attention output projections and embedding layers can be more fragile; mixed precision lets you keep these at higher bit-depth while compressing the bulk elsewhere.

Finally, quality assurance is non-negotiable. If your agency delivers domain-specific answers—legal drafting, medical triage, or financial analysis—you’ll want to run a battery of task-relevant evals before settling on a quantisation recipe. Think beyond simple perplexity; use prompt-completion pairs drawn from your real traffic to detect quality drifts, and track semantic similarity and factual consistency. Only by testing on “your” prompts will you notice, for example, that INT4 on a particular model makes it forget to include structured fields in a JSON output, even though benchmark scores look unchanged.

Practical tips for quantising production LLMs

  • Start with weight-only INT8 or 4-bit group-wise quantisation and preserve higher precision for embeddings and layer norms.
  • Calibrate with a few thousand representative prompts that reflect your true input distribution, not synthetic benchmark snippets.
  • Measure TTFT and tokens/sec under concurrent loads; a single-request micro-benchmark can hide contention and kernel scheduling issues.
  • Combine quantised weights with an 8-bit KV cache to increase concurrency without crippling generation quality for long contexts.
  • Keep a floating-point “golden path” model available for A/B fallback when outputs must be pristine.

Caching architectures that actually move the latency needle in real workloads

If quantisation cuts the cost of each token, caching eliminates work altogether. The right caches turn your API into a memory of past computations that can be reused at multiple levels. Start with the simplest: prompt-response caching keyed by a hash of the full input. For deterministic decoding parameters (top-p, temperature, penalties), this lets you serve repeat requests instantly from a store such as Redis or an embedded LRU cache. But exact-match caches won’t help when prompts vary by a few words, and they do nothing for long chains where only early steps are repeated.
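A minimal in-process sketch of such an exact-match cache, with an LRU dictionary standing in for Redis, and tenant plus decoding parameters folded into the key so entries never leak across customers or sampling settings:

```python
import hashlib
import json
from collections import OrderedDict

class ResponseCache:
    """Exact-match LRU response cache. The key covers tenant, prompt, and
    decoding parameters so a temperature change is never served a stale hit."""

    def __init__(self, max_entries=10_000):
        self._store = OrderedDict()
        self._max = max_entries

    @staticmethod
    def key(tenant, prompt, params):
        # sort_keys makes logically identical parameter dicts hash identically
        blob = json.dumps({"t": tenant, "p": prompt, "d": params}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, k):
        if k not in self._store:
            return None
        self._store.move_to_end(k)           # refresh recency
        return self._store[k]

    def put(self, k, value):
        self._store[k] = value
        self._store.move_to_end(k)
        if len(self._store) > self._max:
            self._store.popitem(last=False)  # evict least recently used

cache = ResponseCache()
k_cold = ResponseCache.key("acme", "What is RAG?", {"temperature": 0.0, "top_p": 0.9})
cache.put(k_cold, "Retrieval-augmented generation grounds answers in retrieved text.")
# Same prompt, different temperature: a distinct key, so it misses by design.
k_hot = ResponseCache.key("acme", "What is RAG?", {"temperature": 0.8, "top_p": 0.9})
```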

The next level is KV-cache reuse, sometimes called “prefix caching” or “prompt caching”. During the prefill phase, the model encodes the entire context and writes key/value tensors for each attention head to memory. If a new request starts with a prefix that has been seen before—perhaps a system prompt or a shared knowledge preamble—you can skip recomputing that prefix by reusing the stored KV blocks and jumping straight to decoding. This is why many high-performance inference engines organise KV memory in paged blocks: it allows fast lookups and partial sharing across requests without copying huge contiguous tensors around. The impact on TTFT can be dramatic when you have long system prompts or large retrieved contexts; you’re literally cutting out hundreds of milliseconds of prefill time.
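The lookup side of prefix caching can be sketched with per-block rolling hashes. The 16-token block size and the helper names below are illustrative, not any specific engine’s API:

```python
import hashlib

BLOCK = 16  # tokens per KV page; small fixed page sizes are typical

def block_hashes(tokens):
    """One rolling hash per full block; each hash commits to the entire
    prefix up to that block, so equal hashes mean equal prefixes."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode("utf-8"))
        hashes.append(h.copy().hexdigest())
    return hashes

def reusable_blocks(tokens, cached_hashes):
    """How many leading KV blocks can be reused instead of re-prefilled."""
    n = 0
    for hs in block_hashes(tokens):
        if hs not in cached_hashes:
            break
        n += 1
    return n

system_prompt = list(range(100))                 # stand-in for 100 shared token ids
old_request = system_prompt + [7, 8, 9] * 20     # previously served, KV retained
new_request = system_prompt + [1, 2, 3] * 20     # same system prompt, new user turn
cached = set(block_hashes(old_request))
shared = reusable_blocks(new_request, cached)    # full blocks inside the shared prefix
```

Only whole blocks are reusable, which is exactly why engines page KV memory: partial sharing falls out of the block structure without copying tensors.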

Semantic caching is the most nuanced but often the most rewarding. Here you store not only the input hash but an embedding of the input and output. When a similar prompt arrives, you can either return the previous answer (if your product tolerates that) or prime the model by inserting the cached answer as an example, thereby reducing the number of tokens needed to reach a high-quality completion. For content generation where slight paraphrasing is acceptable, semantic caches are a goldmine. You do need to consider staleness: embeddings and outputs may age as underlying knowledge changes, so your cache entries should carry TTLs or be invalidated by domain events (new product release, pricing update, regulation change).
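A sketch of the lookup logic follows. The embed function here is a toy character-bigram stand-in for a real embedding model, and the 0.75 threshold is illustrative; in production you would tune the threshold on labelled near-duplicate pairs:

```python
import math

DIMS = 256

def embed(text):
    """Toy character-bigram hashing embedding; a stand-in for a real model."""
    vec = [0.0] * DIMS
    t = text.lower()
    for a, b in zip(t, t[1:]):
        vec[(ord(a) * 31 + ord(b)) % DIMS] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))   # vectors are already unit-normalised

class SemanticCache:
    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.entries = []                     # list of (embedding, answer) pairs

    def store(self, prompt, answer):
        self.entries.append((embed(prompt), answer))

    def lookup(self, prompt):
        if not self.entries:
            return None
        e = embed(prompt)
        emb, answer = max(self.entries, key=lambda pair: cosine(e, pair[0]))
        return answer if cosine(e, emb) >= self.threshold else None

cache = SemanticCache()
cache.store("How do I reset my password?", "Settings, then Security, then Reset.")
hit = cache.lookup("How do I reset my password please?")
miss = cache.lookup("What is the capital of France?")
```

A real deployment would replace the linear scan with a vector index and attach TTLs and invalidation events to each entry, as described above.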

Finally, look beyond the LLM. Tool-augmented agents often fetch web pages, call internal APIs, or run SQL queries. Caching those upstream results—HTML fetches, vector searches, database aggregates—can shave seconds off end-to-end latency. Make cache boundaries explicit in your orchestration layer and attach tracing so you can see hits and misses per stage. This holistic view prevents the common anti-pattern of over-optimising the model path while neglecting a slow retriever, an unindexed database, or a cold serverless function in the toolchain.

Cache design choices that pay off quickly

  • Use an exact-match response cache for deterministic parameters, with an LRU policy and per-tenant namespacing to avoid data leaks.
  • Implement prefix/KV caching for long system prompts and RAG headers; store KV blocks in fixed-size pages to support partial reuse.
  • Add a semantic cache powered by embeddings to serve “near-duplicate” prompts or to seed few-shot exemplars for faster convergence.
  • Set cache TTLs by content type; e.g., product pricing might be minutes, documentation hours, and boilerplate system prompts effectively infinite.
  • Trace cache hits/misses end-to-end, including retrievers and tools, so you can attribute latency reductions to the right layer.

KV cache memory warrants a brief back-of-the-envelope. For a decoder-only Transformer with L layers, hidden size H (equal to A × D for A attention heads of head dimension D), sequence length T, and precision b bytes per value, the KV cache roughly costs 2 × L × T × H × b bytes per sequence (two because of keys and values, and every layer stores its own), ignoring per-head overhead and padding. At FP16, a 7B-class model with L = 32, H ≈ 4096 and a 4K-token context will consume on the order of 2 × 32 × 4096 × 4096 × 2 ≈ 2.1 GB per sequence just for KV (a rule of thumb; real engines use paged blocks and slightly different layouts). Multiply by dozens of concurrent users and you see why quantising KV to 8-bit or using paged attention can double or triple effective concurrency.
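The arithmetic is easy to script. The 32-layer, hidden-size-4096 figures below are assumptions for a typical 7B-class model:

```python
def kv_cache_bytes(layers, seq_len, hidden, bytes_per_value):
    """Per-sequence KV cost: 2 (K and V) x layers x tokens x hidden x precision."""
    return 2 * layers * seq_len * hidden * bytes_per_value

fp16 = kv_cache_bytes(layers=32, seq_len=4096, hidden=4096, bytes_per_value=2)
int8 = kv_cache_bytes(layers=32, seq_len=4096, hidden=4096, bytes_per_value=1)
gb = fp16 / 2**30        # roughly 2 GB per sequence at FP16 for a 4K context
```

Halving the bytes per value halves the per-sequence footprint, which is where the “double the concurrency” claim for 8-bit KV comes from.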

A thornier consideration is correctness. Returning a cached answer that was generated at temperature 0.8 for a new request at temperature 0.2 might violate expectations even if the prompt text matches. To preserve invariants, include decoding parameters in the cache key, or normalise them by running a fast deterministic “rewrite pass” (e.g., at temperature 0) on the cached text before serving. Conversely, when KV caching across requests, ensure that safety filters, system prompts, and per-tenant policies are included in the derivation of the cache key. KV blocks created under one system prompt should never be reused under another that loosens guardrails.

Caching also interacts with batching. Prefill-heavy workloads benefit from grouping requests with similar prompt lengths to keep GPUs busy and limit padding waste. But batching can reduce prefix cache hit-rates if your scheduler is oblivious to reuse opportunities. A practical scheduler groups by “prefix fingerprint” first, then secondarily by length, and only then by arrival time. That way you exploit KV reuse and keep utilisation high. Real-world traces show that a well-tuned prefix-aware scheduler can cut TTFT variability (the dreaded “sometimes it’s instant, sometimes it’s 2 seconds” effect) by aligning requests to cached blocks opportunistically.
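The scheduling policy described above reduces to a three-part sort key. The Request fields here are hypothetical stand-ins for whatever your queue actually carries:

```python
import itertools
from dataclasses import dataclass, field

_arrival = itertools.count()

@dataclass
class Request:
    prefix_fp: str                  # fingerprint of the shared/system-prompt prefix
    prompt_len: int                 # tokens on the prefill path
    order: int = field(default_factory=lambda: next(_arrival))

def schedule(requests):
    """Sort by prefix fingerprint (KV reuse), then length (padding), then arrival."""
    return sorted(requests, key=lambda r: (r.prefix_fp, r.prompt_len, r.order))

queue = [
    Request("sysB", 900),
    Request("sysA", 120),
    Request("sysA", 800),
    Request("sysB", 150),
    Request("sysA", 130),
]
ordered = schedule(queue)
plan = [(r.prefix_fp, r.prompt_len) for r in ordered]
```

A production scheduler would bound how long any request can be deferred by this grouping, so fairness is not sacrificed to cache hit-rate.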

Designing a fast API surface: request shaping, batching, streaming, and backpressure

Your public API design has a direct, mechanical impact on latency. Begin with request shaping: cap context length sensibly, reject pathological prompts early (e.g., megabyte-sized JSON pasted into a single message), and tokenise on the server so you can estimate cost before committing resources. Provide parameters that bias towards speed—like low temperature, top-p < 0.9, and a modest max_tokens—when callers opt into “fast mode”. For RAG systems, cap the number of retrieved chunks and collapse duplicates; the majority of “mysteriously slow” prompts are simply too long on the prefill path. If you’re hosting multiple models, route short prompts to smaller checkpoints with aggressive quantisation and reserve larger models for tasks proven to need them.

Batching is the main server-side lever for throughput, but naive batching harms interactive latency. The trick is micro-batching with a tight timeout (e.g., 2–10 ms) that collects a handful of concurrent prefill requests and pushes them through fused kernels. During decoding, techniques like continuous batching and dynamic batching windows let you stitch together streams of tokens from many users while preserving each stream’s cadence. To avoid head-of-line blocking, pre-empt long sequences and use token-level scheduling so short jobs can “overtake” slow ones without starving them. It’s perfectly acceptable to send the first token from a small request sooner while a large one remains in the queue for its next step.
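A minimal asyncio sketch of the collect-then-flush pattern, with an echo function standing in for the fused prefill kernel and a generous window so the example is deterministic:

```python
import asyncio

class MicroBatcher:
    """Collects concurrent requests for up to `window` seconds (or until
    `max_batch`), then processes them together in one pass."""

    def __init__(self, window=0.05, max_batch=8):
        self.window, self.max_batch = window, max_batch
        self.queue = asyncio.Queue()
        self.batches = []                      # recorded batches, for inspection

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut                       # resolved when the batch is processed

    async def run_once(self):
        batch = [await self.queue.get()]       # block until the first request
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self.window
        while len(batch) < self.max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(self.queue.get(), timeout))
            except asyncio.TimeoutError:
                break                          # window closed; flush what we have
        self.batches.append([p for p, _ in batch])
        for prompt, fut in batch:              # stand-in for one fused prefill pass
            fut.set_result(f"echo:{prompt}")

async def main():
    mb = MicroBatcher()
    worker = asyncio.create_task(mb.run_once())
    results = await asyncio.gather(mb.submit("a"), mb.submit("b"), mb.submit("c"))
    await worker
    return mb.batches, results

batches, results = asyncio.run(main())
```

Three concurrent submissions land in a single batch; in a real server the worker loops forever and window values sit in the 2–10 ms range quoted above.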

Streaming is the UX cheat code. Even when the total time to completion is unchanged, sending the first token in ~100–300 ms converts the experience from “waiting” to “listening”. Implement server-sent events (SSE) or gRPC streaming and structure your clients to render incrementally. Combine streaming with speculative decoding: a small “drafter” model generates likely next tokens while the large “verifier” confirms them. When the verifier agrees, you skip work; when it doesn’t, you correct the stream—this still improves TTFT dramatically in the common case. The elegance of speculative decoding is that it layers onto quantisation and caching: a quantised drafter is cheap, and its outputs can populate caches for the verifier path.
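A toy greedy sketch of the draft-then-verify loop, over characters rather than real tokens. Note a production engine verifies all k draft tokens in a single batched forward pass instead of one call per token:

```python
def speculative_decode(draft_next, verify_next, prompt, max_tokens, k=4):
    """Drafter proposes k tokens; agreed tokens are accepted for free, the first
    disagreement is replaced with the verifier's token and drafting restarts."""
    out = list(prompt)
    accepted_free = 0
    while len(out) - len(prompt) < max_tokens:
        ctx, proposal = list(out), []
        for _ in range(k):                     # cheap drafter runs ahead
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:
            if len(out) - len(prompt) >= max_tokens:
                break
            v = verify_next(out)               # a real engine checks all k in one pass
            if v == t:
                out.append(t)
                accepted_free += 1             # drafted token accepted for free
            else:
                out.append(v)                  # correct the stream, restart drafting
                break
    return out[len(prompt):], accepted_free

# Toy character-level models: the verifier spells TARGET; the drafter is right
# except it greedily repeats "l" after an "l", so it trips once per word.
TARGET = "hello world"

def verify_next(ctx):
    return TARGET[len(ctx)]

def draft_next(ctx):
    if len(ctx) >= len(TARGET):
        return "?"                             # drafter has run past the end
    if ctx and ctx[-1] == "l":
        return "l"
    return TARGET[len(ctx)]

tokens, free = speculative_decode(draft_next, verify_next, [], max_tokens=len(TARGET))
```

Here 9 of the 11 output characters are accepted straight from the drafter, which is the mechanism behind the TTFT and throughput gains: verifier work is only spent where the cheap model is wrong.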

Backpressure protects both user experience and infrastructure. Rate limit by tenant and by token-per-second, not just by requests-per-minute. If queues grow beyond a threshold, degrade gracefully: switch to smaller models, cut max_tokens, or move to an extractive answer with citations rather than a generative essay. Expose retry-after hints in error responses and return partial completions rather than timeouts. Many “latency incidents” are not faults but predictable overloads that could have been smoothed if the service advertised its capacity and adjusted the shape of work under duress.
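A token-bucket limiter metered in LLM tokens rather than requests makes the first point concrete. Timestamps are passed in explicitly here to keep the sketch deterministic:

```python
class TokenBudget:
    """Token bucket metered in tokens per second, with a burst allowance.
    `now` is a monotonic timestamp in seconds supplied by the caller."""

    def __init__(self, tokens_per_sec, burst):
        self.rate = tokens_per_sec
        self.capacity = burst
        self.level = burst                # start with a full bucket
        self.last = 0.0

    def try_spend(self, n_tokens, now):
        # refill proportionally to elapsed time, capped at burst capacity
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if n_tokens <= self.level:
            self.level -= n_tokens
            return True
        return False  # caller should degrade: smaller model, lower max_tokens, retry-after

bucket = TokenBudget(tokens_per_sec=100, burst=500)
ok_burst = bucket.try_spend(400, now=0.0)     # within the burst allowance
ok_again = bucket.try_spend(400, now=0.0)     # only 100 tokens left: rejected
ok_later = bucket.try_spend(400, now=4.0)     # 4 s x 100 tok/s refills the bucket
```

On rejection, the graceful-degradation paths above (smaller model, shorter max_tokens, retry-after hints) beat a bare 429.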

Observability, testing, and cost governance for sustainable speed

The end-state of a high-performance LLM API is a control room, not a shrine. You need instrumentation that reports TTFT, tokens/sec, queue time, GPU utilisation, VRAM headroom, cache hit-rates, and tail latencies by endpoint and tenant. Break down latencies into prefill, decode, and post-processing (e.g., JSON validation or tool calls) so you know where to optimise. Track p50, p95, and p99 separately; the median tells you nothing about the horror stories your users remember. Overlay these with business metrics—conversion, retention, cost per 1k tokens—because sometimes the right move is to spend more to be faster on a high-value path and save elsewhere.
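A small sketch of the per-stage percentile breakdown, using nearest-rank percentiles and made-up timings; the stage names mirror the prefill/decode/post split above:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; adequate for dashboards."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

STAGES = ("queue", "prefill", "decode", "post")
# Per-request stage timings in milliseconds: (queue, prefill, decode, post)
requests = [
    (2, 80, 400, 10),
    (3, 90, 420, 12),
    (150, 300, 1900, 15),   # the tail horror story the median never shows
    (1, 70, 380, 9),
    (2, 85, 410, 11),
]
report = {
    stage: {p: percentile([r[i] for r in requests], p) for p in (50, 95, 99)}
    for i, stage in enumerate(STAGES)
}
```

The p50 for decode looks healthy while the p95 reveals the outlier, which is exactly why the two must be tracked separately per stage.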

Testing starts in synthetic land but must graduate to replaying real traffic. Build a harness that can “shadow” production requests into a new model or quantisation variant and compare outputs with strict diffing for structured formats and semantic similarity for free-form text. For safety, include red-team prompts and jailbreak attempts, because aggressive quantisation can occasionally weaken safety layers if they were fragile to begin with. When you trial caching changes, run them in record-only mode first, then enable read-through for a subset of tenants to measure hit-rates and false positives. The goal is to make performance tweaks routine rather than nerve-wracking.

Cost governance is the quiet partner of latency. Faster models are often cheaper per unit of work because they push more tokens per second through the same hardware, but speed can entice you into wasteful defaults—like allowing unbounded context windows or always calling the largest model. A disciplined approach classifies requests by complexity and routes them to the cheapest engine that meets quality SLAs. Maintain per-endpoint and per-tenant budgets; if someone starts pasting novels into your chat box, your service should still feel snappy for everyone else. Consider spot instances or pre-emptible GPUs for non-interactive batch jobs, and keep interactive serving on stable nodes with aggressive autoscaling and warm pools to avoid cold starts.

There’s also a human dimension to running a fast API. Incident response should include “UX triage”: when latency spikes, can you immediately switch on a “fast mode” for the affected tenants that trims context, disables slow tools, and routes to a quantised small model while you fix the root cause? Document these playbooks. Teams that practise these drills recover in minutes; teams that don’t spend hours debating changes while users churn. Low latency is a habit, not a feature.

Putting it all together: a blueprint for agencies shipping fast LLM APIs

A credible agency playbook has a pragmatic arc. Start by setting a latency budget—for example, TTFT under 300 ms and p95 completion under 2 seconds for typical prompts under 1,000 input tokens and 200 output tokens. Pick a baseline model that comfortably meets this on your target GPU with moderate concurrency. Apply weight-only INT8 or 4-bit quantisation, preserve FP16 where sensitivity shows, and verify on your domain tasks. Introduce an 8-bit KV cache and paged attention to expand concurrency. Add exact-match and prefix caches first; they are low risk and high reward. Then layer in semantic caching where product tolerance allows reuse and paraphrase.

Design your API with streaming from day one. Offer a “fast” profile that enables conservative decoding, short max_tokens, and the smaller model for quick answers. Implement micro-batching with a tiny queueing window and a scheduler aware of prefix fingerprints. Use speculative decoding to hack TTFT further, and tune the drafter/verifier pair with your own workload. Wrap the whole system in observability that separates prefill, decode, and post-processing, and that reports cache hit-rates with tenant scope. Capture spend per request and enforce budgets automatically.

You’ll discover that speed is cumulative rather than singular. Quantisation by itself looks good on paper; quantisation plus KV-cache quantisation, plus prefix caching, plus streaming, plus speculative decoding, plus micro-batching is what shifts the experience from “acceptable” to “it feels instant”. The good news is that each ingredient is independently valuable. If you’re modernising a client’s creaky chatbot, you can deliver visible improvements in days by inserting a prefix cache and enabling streaming, then come back with a larger optimisation pass that adds quantisation and a better scheduler.

The final mindset shift is to treat latency as a product metric. Put it on dashboards that business stakeholders see. Tie it to customer outcomes and SLAs. Celebrate shaving 100 ms off TTFT the same way you’d celebrate a new feature going live, because in the user’s mind it is a feature. Agencies that internalise this lesson don’t just build faster APIs; they win more pitches, retain clients longer, and create the kind of “it just feels better” experiences that are hard to copy. Low latency is table stakes for intelligence at the edge. With quantisation and caching in your toolkit—and a thoughtful approach to scheduling and API design—you can deliver it reliably, at scale, and within budget.
