
GPU Optimization Engineer | Solving AI Inference Cost Problems
San Francisco, CA · MS Computer Science (Systems) · Stanford University
3 plays · 1 domain
Tools: CUDA · TensorRT · Triton · LLVM · PyTorch · Claude · Python · C++
Substack
GitHub
I solve AI inference cost problems at scale: reduced GPU spend 40% while improving throughput 6x, and specialize in fixing multi-GPU scaling bottlenecks for production LLM workloads. I write about CUDA optimization on Substack.

Unblocked product roadmap by implementing Flash Attention, reducing memory 4x and enabling 16K context.
The product team wanted 16K context windows, but standard attention's intermediate matrices ate 32GB of GPU memory - it wouldn't fit on a single A100. The feature was blocked for 4 months, and sales was losing deals to competitors with longer context.
Reduce the memory footprint enough to support 16K context on a single GPU without sacrificing performance
Implemented the Flash Attention algorithm in pure CUDA, using a tiling strategy to keep intermediate attention matrices in SRAM instead of HBM. Optimized register usage and shared-memory layout. Open-sourced the implementation on GitHub.
Memory: 8GB → 2GB for 8K context (4x reduction). Enabled 16K context windows in production on single A100. Zero performance regression. Product team shipped feature, closed 3 enterprise deals. GitHub repo got 800+ stars.
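The tiling idea behind that memory reduction can be sketched in plain Python - not the CUDA kernel itself, but the numerically equivalent "online softmax" over key/value blocks that lets the kernel avoid ever materializing the full attention row (function names here are illustrative, not from the repo):

```python
import math

def naive_attention(q, K, V):
    # Reference: softmax(q . K^T) @ V for a single query vector.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, V)) / z
            for d in range(len(V[0]))]

def tiled_attention(q, K, V, block=2):
    # Flash-Attention-style pass: walk over key/value blocks, keeping
    # only a running max (m), running normalizer (z), and a running
    # output accumulator (acc) - never the full row of scores.
    dim = len(V[0])
    m, z, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        # Rescale previous partial sums to the new running max.
        scale = math.exp(m - m_new)
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, Vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

In the real kernel each block's scores live in SRAM/registers and only the final row of the output ever touches HBM, which is where the 4x memory reduction comes from; the two functions above produce identical results up to floating-point error.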

Fixed multi-GPU scaling bottleneck with compiler optimizations, improving efficiency from 78% to 94%.
The team had spent 3 months trying to scale training to 8 GPUs but was stuck at 78% efficiency. Each training run took 12.4 hours, so research velocity was crawling - only 2 experiments per day. Engineers were frustrated and considering switching to competitor infrastructure.
Fix the multi-GPU scaling bottleneck to hit >90% efficiency and enable faster iteration cycles
Traced the issue to gradient-sync overhead. Built LLVM compiler passes to fuse gradient accumulation with the forward pass, and overlapped NCCL all-reduce with backward computation. Wrote custom collective kernels optimized for the NVLink ring topology. Contributed patches back to the Triton compiler.
Scaling efficiency: 78% → 94% on 8-GPU clusters. Training time: 12.4hr → 7.5hr (40% faster). Team can now run 5 experiments/day instead of 2. Contributed optimization to open-source Triton (merged PR). Published technical deep-dive on Substack.
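The overlap trick is the core of that efficiency win: instead of finishing the whole backward pass and then synchronizing gradients, each layer's all-reduce is launched as soon as its gradient exists, hiding communication behind the remaining compute. A minimal scheduling sketch (plain Python threads standing in for CUDA streams and NCCL; all names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def backward_with_overlap(layers, compute_grad, all_reduce):
    # Backward runs back to front. As soon as a layer's gradient is
    # produced on the "compute stream" (this thread), it is handed to a
    # background worker (the "comm stream") so the stand-in all_reduce
    # overlaps the next layer's backward compute.
    pending = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for layer in reversed(layers):
            grad = compute_grad(layer)                     # compute
            pending.append(comm.submit(all_reduce, grad))  # async comm
    # Executor shutdown waits for outstanding all-reduces - analogous
    # to a stream synchronize before the optimizer step.
    return [f.result() for f in reversed(pending)]
```

With per-layer gradients that take comparable time to compute and to reduce, this schedule hides nearly all of the communication, which is what moves scaling efficiency from the 78% range toward the 94% measured above.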

Saved $192K/month by building custom CUDA kernels that improved GPU utilization from 42% to 89%.
The company was burning $480K/month on GPU compute for LLM inference at only 42% GPU utilization - massive waste. Standard TensorRT kernels couldn't saturate multi-GPU bandwidth, and the CFO was threatening to shut down the product over unsustainable unit economics.
Reduce inference costs by 40% while maintaining throughput, making the product economically viable
Profiled the production workload with Nsight and found memory bottlenecks in the attention layer. Built custom fused attention kernels with shared-memory optimization, rewrote hot paths to use tensor cores, and tuned for the A100 architecture (NVLink, HBM bandwidth). Wrote a Substack post on the optimization process.
6.3x kernel throughput improvement (1.2K → 7.6K tokens/sec/GPU). GPU utilization: 42% → 89%. Monthly inference cost: $480K → $288K (40% reduction, $192K saved). The product became profitable, and latency dropped 73% as a bonus.
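Why fusion moves the needle this much can be shown with a back-of-envelope roofline check: unfused attention round-trips the N x N score matrix through HBM, so its arithmetic intensity sits well below the GPU's ridge point (memory-bound, low utilization), while a fused kernel only streams Q, K, V, and the output. A sketch under round-number A100 assumptions (not exact spec values):

```python
# Assumed round numbers for an A100: ~312 TFLOP/s FP16 tensor-core
# peak, ~2.0 TB/s HBM bandwidth.
PEAK_FLOPS = 312e12
HBM_BW = 2.0e12
RIDGE = PEAK_FLOPS / HBM_BW  # ~156 FLOP/byte; below this = memory-bound

def attention_intensity(n, d, fused, dtype_bytes=2):
    # FLOPs for softmax(Q K^T) V: two (n x n x d) matmuls.
    flops = 2 * (2 * n * n * d)
    if fused:
        # Fused kernel only streams Q, K, V, and the output O.
        hbm_bytes = 4 * n * d * dtype_bytes
    else:
        # Unfused: the n x n score matrix crosses HBM ~4 times
        # (write scores, read+write softmax, read for the PV matmul),
        # plus the Q/K/V/O traffic.
        hbm_bytes = 4 * n * n * dtype_bytes + 4 * n * d * dtype_bytes
    return flops / hbm_bytes
```

For example, at 8K context with head dimension 128, the unfused version lands around 60 FLOP/byte (memory-bound, consistent with the 42% utilization observed in Nsight) while the fused version is in the thousands, far past the ridge point - which is why fusing the attention path unlocks the tensor cores.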