
GPU Optimization Engineer | Solving AI Inference Cost Problems
San Francisco, CA · MS Computer Science (Systems) · Stanford University
3 plays · 1 domain
Tools: CUDA · TensorRT · Triton · LLVM · PyTorch · Claude · Python · C++
Substack
GitHub
I solve AI inference cost problems at scale: reduced GPU spend 40% while improving throughput 6x, and specialize in fixing multi-GPU scaling bottlenecks for production LLM workloads. I write about CUDA optimization on Substack.

Unblocked product roadmap by implementing Flash Attention, reducing memory 4x and enabling 16K context.
The product team wanted 16K context windows, but standard attention's intermediate matrices ate 32GB of GPU memory - it wouldn't fit on a single A100. The feature was blocked for 4 months, and sales was losing deals to competitors with longer context.
Reduce the memory footprint enough to support 16K context on a single GPU without sacrificing performance
Implemented the Flash Attention algorithm in pure CUDA, using a tiling strategy to keep intermediate attention matrices in SRAM instead of HBM. Optimized register usage and shared-memory layout. Open-sourced the implementation on GitHub.
Memory: 8GB → 2GB for 8K context (4x reduction). Enabled 16K context windows in production on single A100. Zero performance regression. Product team shipped feature, closed 3 enterprise deals. GitHub repo got 800+ stars.
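The tiling idea behind that memory reduction can be sketched in plain Python - not the CUDA kernel itself, but the numerically equivalent "online softmax" over key/value blocks that lets the kernel avoid ever materializing the full attention row (function names here are illustrative, not from the repo):

```python
import math

def naive_attention(q, K, V):
    # Reference: softmax(q . K^T) @ V for a single query vector.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, V)) / z
            for d in range(len(V[0]))]

def tiled_attention(q, K, V, block=2):
    # Flash-Attention-style pass: walk over key/value blocks, keeping
    # only a running max (m), running normalizer (z), and a running
    # output accumulator (acc) - never the full row of scores.
    dim = len(V[0])
    m, z, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        # Rescale previous partial sums to the new running max.
        scale = math.exp(m - m_new)
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, Vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

In the real kernel each block's scores live in SRAM/registers and only the final row of the output ever touches HBM, which is where the 4x memory reduction comes from; the two functions above produce identical results up to floating-point error.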

Fixed multi-GPU scaling bottleneck with compiler optimizations, improving efficiency from 78% to 94%.
The team had spent 3 months trying to scale training to 8 GPUs but was stuck at 78% efficiency. Each training run took 12.4 hours, so research velocity was crawling - only 2 experiments per day. Engineers were frustrated and considering switching to competitor infrastructure.
Fix the multi-GPU scaling bottleneck to hit >90% efficiency and enable faster iteration cycles
Traced the issue to gradient-sync overhead. Built LLVM compiler passes to fuse gradient accumulation with the forward pass, and overlapped NCCL all-reduce with backward computation. Wrote custom collective kernels optimized for the NVLink ring topology. Contributed patches back to the Triton compiler.
Scaling efficiency: 78% → 94% on 8-GPU clusters. Training time: 12.4hr → 7.5hr (40% faster). Team can now run 5 experiments/day instead of 2. Contributed optimization to open-source Triton (merged PR). Published technical deep-dive on Substack.
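The overlap trick is the core of that efficiency win: instead of finishing the whole backward pass and then synchronizing gradients, each layer's all-reduce is launched as soon as its gradient exists, hiding communication behind the remaining compute. A minimal scheduling sketch (plain Python threads standing in for CUDA streams and NCCL; all names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def backward_with_overlap(layers, compute_grad, all_reduce):
    # Backward runs back to front. As soon as a layer's gradient is
    # produced on the "compute stream" (this thread), it is handed to a
    # background worker (the "comm stream") so the stand-in all_reduce
    # overlaps the next layer's backward compute.
    pending = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for layer in reversed(layers):
            grad = compute_grad(layer)                     # compute
            pending.append(comm.submit(all_reduce, grad))  # async comm
    # Executor shutdown waits for outstanding all-reduces - analogous
    # to a stream synchronize before the optimizer step.
    return [f.result() for f in reversed(pending)]
```

With per-layer gradients that take comparable time to compute and to reduce, this schedule hides nearly all of the communication, which is what moves scaling efficiency from the 78% range toward the 94% measured above.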

Saved $192K/month by building custom CUDA kernels that improved GPU utilization from 42% to 89%.
The company was burning $480K/month on GPU compute for LLM inference at only 42% GPU utilization - massive waste. Standard TensorRT kernels couldn't saturate multi-GPU bandwidth, and the CFO was threatening to shut down the product over unsustainable unit economics.
Reduce inference costs by 40% while maintaining throughput, making the product economically viable
Profiled the production workload with Nsight and found memory bottlenecks in the attention layer. Built custom fused attention kernels with shared-memory optimization, rewrote hot paths to use tensor cores, and tuned for the A100 architecture (NVLink, HBM bandwidth). Wrote a Substack post on the optimization process.
6.3x kernel throughput improvement (1.2K → 7.6K tokens/sec/GPU). GPU utilization: 42% → 89%. Monthly inference cost: $480K → $288K (40% reduction, $192K saved). The product became profitable, and latency dropped 73% as a bonus.
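Why fusion moves the needle this much can be shown with a back-of-envelope roofline check: unfused attention round-trips the N x N score matrix through HBM, so its arithmetic intensity sits well below the GPU's ridge point (memory-bound, low utilization), while a fused kernel only streams Q, K, V, and the output. A sketch under round-number A100 assumptions (not exact spec values):

```python
# Assumed round numbers for an A100: ~312 TFLOP/s FP16 tensor-core
# peak, ~2.0 TB/s HBM bandwidth.
PEAK_FLOPS = 312e12
HBM_BW = 2.0e12
RIDGE = PEAK_FLOPS / HBM_BW  # ~156 FLOP/byte; below this = memory-bound

def attention_intensity(n, d, fused, dtype_bytes=2):
    # FLOPs for softmax(Q K^T) V: two (n x n x d) matmuls.
    flops = 2 * (2 * n * n * d)
    if fused:
        # Fused kernel only streams Q, K, V, and the output O.
        hbm_bytes = 4 * n * d * dtype_bytes
    else:
        # Unfused: the n x n score matrix crosses HBM ~4 times
        # (write scores, read+write softmax, read for the PV matmul),
        # plus the Q/K/V/O traffic.
        hbm_bytes = 4 * n * n * dtype_bytes + 4 * n * d * dtype_bytes
    return flops / hbm_bytes
```

For example, at 8K context with head dimension 128, the unfused version lands around 60 FLOP/byte (memory-bound, consistent with the 42% utilization observed in Nsight) while the fused version is in the thousands, far past the ridge point - which is why fusing the attention path unlocks the tensor cores.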