
GitHub
Built inference infrastructure serving 1M+ developers at sub-100ms latency. Kubernetes and GPU optimization.

Built GPU inference platform that cut costs 54% while quadrupling throughput
Tabnine was spending $2.8M/month on GPU inference with A100 instances. Average GPU utilization was only 34% due to bursty traffic patterns, and cold-start latency for new model versions was 45 seconds.
Built a multi-tenant GPU serving layer in Rust with continuous batching and dynamic model routing. Implemented predictive autoscaling based on developer timezone patterns and a warm model cache using shared memory across pods. Moved from dedicated A100s to a mix of A10G and L4 GPUs with intelligent routing.
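The continuous-batching core described above can be sketched as follows. This is a minimal illustration, not the production design: the `ContinuousBatcher` type, its slot counts, and the scheduler-tick interface are all assumptions made for the example.

```rust
use std::collections::VecDeque;

/// A single inference request (hypothetical shape; illustration only).
struct Request {
    id: u64,
    prompt_tokens: usize,
}

/// Continuous batcher: admits new requests into the running batch as
/// soon as a slot frees up, instead of waiting for the whole batch to
/// finish as static batching does.
struct ContinuousBatcher {
    queue: VecDeque<Request>,
    active: Vec<Request>,
    max_batch: usize,
}

impl ContinuousBatcher {
    fn new(max_batch: usize) -> Self {
        Self { queue: VecDeque::new(), active: Vec::new(), max_batch }
    }

    fn submit(&mut self, req: Request) {
        self.queue.push_back(req);
    }

    /// Fill free slots from the queue; called every scheduler tick.
    fn admit(&mut self) {
        while self.active.len() < self.max_batch {
            match self.queue.pop_front() {
                Some(req) => self.active.push(req),
                None => break,
            }
        }
    }

    /// Simulate one decode step: requests finish independently,
    /// immediately freeing their slot for the next queued request.
    fn step(&mut self, finished: &[u64]) -> Vec<u64> {
        let done: Vec<u64> = self
            .active
            .iter()
            .filter(|r| finished.contains(&r.id))
            .map(|r| r.id)
            .collect();
        self.active.retain(|r| !finished.contains(&r.id));
        self.admit();
        done
    }
}

fn main() {
    let mut b = ContinuousBatcher::new(2);
    for id in 0..4 {
        b.submit(Request { id, prompt_tokens: 16 });
    }
    b.admit();
    assert_eq!(b.active.len(), 2); // batch full: requests 0 and 1
    let done = b.step(&[0]); // request 0 finishes...
    assert_eq!(done, vec![0]);
    assert_eq!(b.active.len(), 2); // ...and request 2 is admitted immediately
    println!("ok");
}
```

The key property for bursty traffic is in `step`: a finished request's slot is refilled on the same tick, so GPU utilization does not drop while long-running requests in the batch are still decoding.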
Throughput increased 4.1x per GPU dollar. Monthly GPU spend dropped from $2.8M to $1.3M. Cold-start latency went from 45s to 3.2s. P99 inference latency held steady at 89ms even during peak hours.
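The timezone-based predictive autoscaling mentioned above can be sketched roughly as below. The region weights, working-hours model, and replica formula are invented for the example and are not the production heuristics.

```rust
/// Fraction of a region's developers assumed active at a given UTC hour
/// (assumption: local working hours 9:00-18:00, small off-hours residual).
fn active_fraction(utc_hour: u32, tz_offset: i32) -> f64 {
    let local = (utc_hour as i32 + tz_offset).rem_euclid(24);
    if (9..18).contains(&local) { 1.0 } else { 0.1 }
}

/// Desired GPU replicas: peak capacity scaled by predicted activity
/// across regions, with a floor to absorb unexpected bursts.
fn desired_replicas(utc_hour: u32, regions: &[(i32, f64)], peak: f64, floor: u32) -> u32 {
    let demand: f64 = regions
        .iter()
        .map(|(tz, weight)| weight * active_fraction(utc_hour, *tz))
        .sum();
    ((peak * demand).ceil() as u32).max(floor)
}

fn main() {
    // Regions as (UTC offset, share of traffic) — hypothetical values.
    let regions = [(-8, 0.4), (1, 0.35), (5, 0.25)]; // US West, EU, India
    let us_morning = desired_replicas(18, &regions, 100.0, 8); // 10:00 Pacific
    let global_night = desired_replicas(3, &regions, 100.0, 8); // lull everywhere
    assert!(us_morning > global_night);
    println!("{} vs {}", us_morning, global_night);
}
```

Because developer traffic is strongly periodic by timezone, scaling ahead of the predicted curve rather than reacting to current load is what lets capacity shrink during regional lulls without paying cold-start penalties when a region comes online.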

Cut edge AI cold starts from 1.2s to 80ms, driving 12x request growth
Cloudflare Workers AI cold starts averaged 1.2s for model loading, making real-time inference impractical for latency-sensitive applications.
Designed a model pre-warming system that used traffic prediction to keep popular models hot across edge nodes, and built a tiered cache spanning GPU VRAM, local NVMe, and network storage.
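The tiered-cache behavior can be sketched as below. This is a minimal model under stated assumptions: the `TieredModelCache` type, the single-slot VRAM capacity, and the promote-on-access policy are invented for illustration (eviction and demotion between tiers are omitted).

```rust
use std::collections::HashMap;

/// Cache tiers in decreasing speed order (illustrative cost ordering:
/// VRAM ~ instant, NVMe ~ tens of ms, network ~ hundreds of ms).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tier {
    Vram,    // model already resident on the GPU
    Nvme,    // weights on local NVMe
    Network, // pull from network storage
}

struct TieredModelCache {
    location: HashMap<String, Tier>,
    vram_slots: usize,
}

impl TieredModelCache {
    fn new(vram_slots: usize) -> Self {
        Self { location: HashMap::new(), vram_slots }
    }

    fn vram_count(&self) -> usize {
        self.location.values().filter(|t| **t == Tier::Vram).count()
    }

    /// Load a model, promoting it toward VRAM; returns the tier it was
    /// served from, i.e. the latency cost paid on this request.
    fn load(&mut self, model: &str) -> Tier {
        let served_from = *self.location.get(model).unwrap_or(&Tier::Network);
        // Promote to VRAM if a slot is free; otherwise park on NVMe.
        let new_tier = if served_from == Tier::Vram || self.vram_count() < self.vram_slots {
            Tier::Vram
        } else {
            Tier::Nvme
        };
        self.location.insert(model.to_string(), new_tier);
        served_from
    }
}

fn main() {
    let mut cache = TieredModelCache::new(1); // one VRAM slot for the demo
    assert_eq!(cache.load("llama-7b"), Tier::Network); // first touch: cold
    assert_eq!(cache.load("llama-7b"), Tier::Vram);    // pre-warmed: hot
    assert_eq!(cache.load("mistral-7b"), Tier::Network);
    assert_eq!(cache.load("mistral-7b"), Tier::Nvme);  // VRAM full, mid tier
    println!("ok");
}
```

Pre-warming amounts to calling `load` for predicted-popular models before traffic arrives, so the first real request hits the VRAM or NVMe tier instead of network storage.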
Cold starts dropped from 1.2s to 80ms for top-100 models. Edge AI request volume grew 12x in 6 months as developers adopted the lower-latency endpoints.