
GitHub
Built inference infrastructure serving 1M+ developers at sub-100ms latency. Kubernetes and GPU optimization.

Built GPU inference platform that cut costs 54% while quadrupling throughput
Tabnine was spending $2.8M/month on GPU inference with A100 instances. Average GPU utilization was only 34% due to bursty traffic patterns, and cold-start latency for new model versions was 45 seconds.
Built a multi-tenant GPU serving layer in Rust with continuous batching and dynamic model routing. Implemented predictive autoscaling based on developer timezone patterns and a warm model cache using shared memory across pods. Moved from dedicated A100s to a mix of A10G and L4 GPUs with intelligent routing.
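The continuous-batching core described above can be sketched as follows. This is a minimal illustration, not the production design: the `ContinuousBatcher` type, its slot counts, and the scheduler-tick interface are all assumptions made for the example.

```rust
use std::collections::VecDeque;

/// A single inference request (hypothetical shape; illustration only).
struct Request {
    id: u64,
    prompt_tokens: usize,
}

/// Continuous batcher: admits new requests into the running batch as
/// soon as a slot frees up, instead of waiting for the whole batch to
/// finish as static batching does.
struct ContinuousBatcher {
    queue: VecDeque<Request>,
    active: Vec<Request>,
    max_batch: usize,
}

impl ContinuousBatcher {
    fn new(max_batch: usize) -> Self {
        Self { queue: VecDeque::new(), active: Vec::new(), max_batch }
    }

    fn submit(&mut self, req: Request) {
        self.queue.push_back(req);
    }

    /// Fill free slots from the queue; called every scheduler tick.
    fn admit(&mut self) {
        while self.active.len() < self.max_batch {
            match self.queue.pop_front() {
                Some(req) => self.active.push(req),
                None => break,
            }
        }
    }

    /// Simulate one decode step: requests finish independently,
    /// immediately freeing their slot for the next queued request.
    fn step(&mut self, finished: &[u64]) -> Vec<u64> {
        let done: Vec<u64> = self
            .active
            .iter()
            .filter(|r| finished.contains(&r.id))
            .map(|r| r.id)
            .collect();
        self.active.retain(|r| !finished.contains(&r.id));
        self.admit();
        done
    }
}

fn main() {
    let mut b = ContinuousBatcher::new(2);
    for id in 0..4 {
        b.submit(Request { id, prompt_tokens: 16 });
    }
    b.admit();
    assert_eq!(b.active.len(), 2); // batch full: requests 0 and 1
    let done = b.step(&[0]); // request 0 finishes...
    assert_eq!(done, vec![0]);
    assert_eq!(b.active.len(), 2); // ...and request 2 is admitted immediately
    println!("ok");
}
```

The key property for bursty traffic is in `step`: a finished request's slot is refilled on the same tick, so GPU utilization does not drop while long-running requests in the batch are still decoding.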
Throughput increased 4.1x per GPU dollar. Monthly GPU spend dropped from $2.8M to $1.3M. Cold-start latency went from 45s to 3.2s. P99 inference latency held steady at 89ms even during peak hours.
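The timezone-based predictive autoscaling mentioned above can be sketched roughly as below. The region weights, working-hours model, and replica formula are invented for the example and are not the production heuristics.

```rust
/// Fraction of a region's developers assumed active at a given UTC hour
/// (assumption: local working hours 9:00-18:00, small off-hours residual).
fn active_fraction(utc_hour: u32, tz_offset: i32) -> f64 {
    let local = (utc_hour as i32 + tz_offset).rem_euclid(24);
    if (9..18).contains(&local) { 1.0 } else { 0.1 }
}

/// Desired GPU replicas: peak capacity scaled by predicted activity
/// across regions, with a floor to absorb unexpected bursts.
fn desired_replicas(utc_hour: u32, regions: &[(i32, f64)], peak: f64, floor: u32) -> u32 {
    let demand: f64 = regions
        .iter()
        .map(|(tz, weight)| weight * active_fraction(utc_hour, *tz))
        .sum();
    ((peak * demand).ceil() as u32).max(floor)
}

fn main() {
    // Regions as (UTC offset, share of traffic) — hypothetical values.
    let regions = [(-8, 0.4), (1, 0.35), (5, 0.25)]; // US West, EU, India
    let us_morning = desired_replicas(18, &regions, 100.0, 8); // 10:00 Pacific
    let global_night = desired_replicas(3, &regions, 100.0, 8); // lull everywhere
    assert!(us_morning > global_night);
    println!("{} vs {}", us_morning, global_night);
}
```

Because developer traffic is strongly periodic by timezone, scaling ahead of the predicted curve rather than reacting to current load is what lets capacity shrink during regional lulls without paying cold-start penalties when a region comes online.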

Cut edge AI cold starts from 1.2s to 80ms, driving 12x request growth
Cloudflare Workers AI cold starts averaged 1.2s for model loading, making real-time inference impractical for latency-sensitive applications.
Designed a model pre-warming system that used traffic prediction to keep popular models hot across edge nodes, and built a tiered cache spanning GPU VRAM, local NVMe, and network storage.
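The tiered-cache behavior can be sketched as below. This is a minimal model under stated assumptions: the `TieredModelCache` type, the single-slot VRAM capacity, and the promote-on-access policy are invented for illustration (eviction and demotion between tiers are omitted).

```rust
use std::collections::HashMap;

/// Cache tiers in decreasing speed order (illustrative cost ordering:
/// VRAM ~ instant, NVMe ~ tens of ms, network ~ hundreds of ms).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tier {
    Vram,    // model already resident on the GPU
    Nvme,    // weights on local NVMe
    Network, // pull from network storage
}

struct TieredModelCache {
    location: HashMap<String, Tier>,
    vram_slots: usize,
}

impl TieredModelCache {
    fn new(vram_slots: usize) -> Self {
        Self { location: HashMap::new(), vram_slots }
    }

    fn vram_count(&self) -> usize {
        self.location.values().filter(|t| **t == Tier::Vram).count()
    }

    /// Load a model, promoting it toward VRAM; returns the tier it was
    /// served from, i.e. the latency cost paid on this request.
    fn load(&mut self, model: &str) -> Tier {
        let served_from = *self.location.get(model).unwrap_or(&Tier::Network);
        // Promote to VRAM if a slot is free; otherwise park on NVMe.
        let new_tier = if served_from == Tier::Vram || self.vram_count() < self.vram_slots {
            Tier::Vram
        } else {
            Tier::Nvme
        };
        self.location.insert(model.to_string(), new_tier);
        served_from
    }
}

fn main() {
    let mut cache = TieredModelCache::new(1); // one VRAM slot for the demo
    assert_eq!(cache.load("llama-7b"), Tier::Network); // first touch: cold
    assert_eq!(cache.load("llama-7b"), Tier::Vram);    // pre-warmed: hot
    assert_eq!(cache.load("mistral-7b"), Tier::Network);
    assert_eq!(cache.load("mistral-7b"), Tier::Nvme);  // VRAM full, mid tier
    println!("ok");
}
```

Pre-warming amounts to calling `load` for predicted-popular models before traffic arrives, so the first real request hits the VRAM or NVMe tier instead of network storage.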
Cold starts dropped from 1.2s to 80ms for top-100 models. Edge AI request volume grew 12x in 6 months as developers adopted the lower-latency endpoints.