May 30, 2026

infra_api

Kog Hits 3,000 Tokens Per Second on Standard Datacenter GPUs

Kog Labs claims 3,000 output tokens per second per request on 8x AMD MI300X GPUs, which are standard datacenter hardware, with no speculative decoding. The implication: the bottleneck on fast single-request inference has been software, not silicon.

Kog Labs just launched a tech preview of the Kog Inference Engine (KIE), hitting 3,000 output tokens per second per request on 8x AMD MI300X GPUs and 2,100 on 8x NVIDIA H200, running FP16 with no speculative decoding. That is not an aggregate throughput number. That is a single-request decode speed, which is the number that actually matters for agents.

Most inference benchmarks blend together three different things: total server throughput, time to first token, and per-request decode speed. For batch workloads, total throughput is the right metric. For autonomous agents running sequential loops, it is the per-request decode speed that sets the clock rate of the entire workflow.
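To make the distinction concrete, here is a minimal sketch of how the three metrics could be computed from per-request timing traces. The `RequestTrace` fields and the sample numbers are hypothetical, chosen only for illustration; they are not taken from Kog's benchmarks or from any particular serving stack.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical timing record for one request (all times in seconds)."""
    start: float        # request submitted
    first_token: float  # first output token received
    end: float          # last output token received
    output_tokens: int  # total output tokens generated


def time_to_first_token(trace: RequestTrace) -> float:
    """Latency before any output appears (queueing plus prefill)."""
    return trace.first_token - trace.start


def decode_speed(trace: RequestTrace) -> float:
    """Per-request decode speed: tokens after the first, per second of decoding."""
    return (trace.output_tokens - 1) / (trace.end - trace.first_token)


def aggregate_throughput(traces: List[RequestTrace]) -> float:
    """Total output tokens across all requests, per second of wall-clock time."""
    window = max(t.end for t in traces) - min(t.start for t in traces)
    return sum(t.output_tokens for t in traces) / window


# Four made-up concurrent requests with identical timing.
traces = [RequestTrace(start=0.0, first_token=0.5, end=10.5, output_tokens=1001)
          for _ in range(4)]

print(f"TTFT:                 {time_to_first_token(traces[0]):.2f} s")
print(f"Per-request decode:   {decode_speed(traces[0]):.0f} tok/s")
print(f"Aggregate throughput: {aggregate_throughput(traces):.0f} tok/s")
```

In this made-up example the server reports roughly 381 tokens per second in aggregate, while each individual request only advances at 100 tokens per second. An agent waiting on one of those requests experiences the second number, not the first.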

Kog spells out the math plainly. Agentic software engineering is a sequential loop: inspect, plan, edit, test, revise. Each step depends on the previous one. The generation-heavy steps (planning, code writing, trace analysis, debugging) set the loop rate, and reasoning tokens compound on top. If an agent needs to generate 50,000 tokens in a workflow, how long that takes is a direct function of single-request decode speed, not server utilization.
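The arithmetic behind that 50,000-token example is simple enough to sketch. The two claimed speeds come from Kog's announcement; the 100 tokens-per-second baseline is an assumed comparison point, not a measured figure for any specific system, and the calculation ignores prefill, tool calls, and test execution time.

```python
def generation_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely on sequential token generation at a given decode speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second


WORKFLOW_TOKENS = 50_000  # the workflow size used in the example above

DECODE_SPEEDS = {
    "8x MI300X (claimed)": 3000.0,
    "8x H200 (claimed)": 2100.0,
    "assumed baseline": 100.0,  # hypothetical comparison point
}

for label, speed in DECODE_SPEEDS.items():
    seconds = generation_seconds(WORKFLOW_TOKENS, speed)
    print(f"{label:>22}: {seconds:6.1f} s")
```

At the claimed speeds the generation portion of the workflow takes about 17 and 24 seconds respectively, versus more than 8 minutes at the assumed baseline.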

The current preview runs a 2B coding model. Kog is upfront that it is small and not a frontier model. The team has focused on speed rather than scale, though they describe the model as capable when fine-tuned for specific software engineering tasks. Kog says support for large third-party MoE models is coming next, at similar speeds.

The core argument Kog is making is that the hardware ceiling is much higher than existing inference stacks expose. Standard datacenter GPUs already have the memory bandwidth to support this speed regime. The bottleneck has been software: existing inference stacks are not optimized for single-request low-latency decoding. KIE's approach is co-designing the model architecture, runtime, and low-level GPU code as a single latency-optimized pipeline.
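A rough back-of-envelope check of the bandwidth argument, as a sketch under strong assumptions: single-request decoding is treated as purely memory-bandwidth-bound, the 2B model is assumed to be dense (every weight read once per token), weights are assumed to be sharded evenly across all eight GPUs, and KV-cache reads, inter-GPU communication, and kernel launch overhead are all ignored. The bandwidth figures are the vendors' published peak numbers (about 5.3 TB/s for MI300X and 4.8 TB/s for H200), not measured values, and none of this describes how KIE actually works.

```python
def decode_ceiling_tokens_per_second(
    params: float,
    bytes_per_param: float,
    num_gpus: int,
    bandwidth_bytes_per_s_per_gpu: float,
) -> float:
    """Idealized upper bound: aggregate memory bandwidth / bytes of weights read per token."""
    weight_bytes = params * bytes_per_param
    aggregate_bandwidth = num_gpus * bandwidth_bytes_per_s_per_gpu
    return aggregate_bandwidth / weight_bytes


PARAMS = 2e9         # 2B-parameter model (assumed dense)
FP16_BYTES = 2       # bytes per parameter in FP16

# Vendor peak HBM bandwidth per GPU, in bytes per second (assumed, not measured).
PEAK_BANDWIDTH = {
    "MI300X": 5.3e12,
    "H200": 4.8e12,
}

for gpu, bandwidth in PEAK_BANDWIDTH.items():
    ceiling = decode_ceiling_tokens_per_second(PARAMS, FP16_BYTES, 8, bandwidth)
    print(f"8x {gpu}: idealized ceiling of about {ceiling:,.0f} tok/s")
```

Under these assumptions the idealized ceilings land well above the claimed 3,000 and 2,100 tokens per second, which is consistent with the claim that raw memory bandwidth is not the limiting factor. Real systems will fall short of this bound because of the overheads the sketch leaves out.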

This matters for product engineers because it changes the cost and lock-in calculus. Fast single-request inference has typically required proprietary inference hardware. Kog is arguing that enterprises can get comparable speeds on GPUs they already own, including setups relevant to AI labs and sovereign-AI buyers, without vendor lock-in.

What should you do with this today? If you are building an agentic coding workflow and per-step latency is your bottleneck, test the live coding playground at playground.kog.ai. The 2B model is limited in scope, but speed is the thing being demonstrated. Watch closely for when MoE model support ships: that is when the performance claims become relevant to production-grade agents.