From the team

Blog

Product updates, engineering deep dives, and thought leadership from the Parasail team.

Product

How to choose the right managed inference architecture: Serverless, dedicated, dedicated serverless, or batch

Use this decision framework to choose the right managed inference mode based on latency requirements, GPU breakeven utilization, and whether your workload needs a dedicated endpoint.

Gabriel Perácio · Jun 09, 2026

Product

Serverless vs. Dedicated Inference: Why We Built Dedicated Serverless

Most AI teams choose between shared serverless and dedicated GPUs. With dedicated serverless, you get dedicated hardware on per-token pricing, with no idle-hour charges or long-term GPU commitment.

Gabriel Perácio · May 29, 2026

Product

Parasail and Wafer AI: Faster models, lower costs

Parasail and Wafer AI are partnering to make frontier AI cheaper and more accessible. Wafer optimizes models to do more with less compute. Parasail serves them reliably at scale. Developers get the most efficient versions of the best open models, instantly accessible via API. Kimi K2.6 NVFP4 is the first release.

Team Parasail · Apr 30, 2026

Engineering

Making an EAGLE fly: How We Got 2.6x Faster LLM Inference (Without Cheating)

We trained a custom EAGLE-3 speculative decoding head for OLMo-3.1-32B-Think and got 2.6x faster inference. Same model, same weights, same outputs, just faster. A single B200 running our setup outperformed 2xH200 without it. This post walks through the full pipeline: dataset prep, 40 TiB of captured hidden states, hyperparameter sweeps that mostly didn't matter, and the inference-time tuning that turned promising training curves into real production throughput.

Gabriel Perácio · Apr 28, 2026

Engineering

Making Cold Start Latencies go Brrrr: A Multi-pronged Approach (Part 1)

Cold-start latency is often orders of magnitude higher than steady-state latency on an inference platform serving hundreds of models. In Part 1 of this series, we walk through how we combined fastsafetensors, O_DIRECT, and io_uring to get fast cold starts and fast warm starts on the same stack.

Meghana Madhyastha · Apr 20, 2026