
GitHub
Fine-tuned code LLMs, cutting latency by 60% while improving completion accuracy. Ex-Google Brain.

Cut code completion latency 60% with speculative decoding and INT8 quantization
The code completion model's p95 latency sat at 320ms, driving developer frustration and autocomplete abandonment rates above 40%. Users on slower machines saw delays of 500ms or more.
Implemented speculative decoding with a small draft model, quantized the main model to INT8 using GPTQ, and built a custom KV-cache warming strategy for repository-level context. Trained on 2M curated code completions with rejection sampling.
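The core of speculative decoding is a draft-then-verify loop: a cheap draft model proposes several tokens, and the large target model checks them, keeping the longest matching prefix. A minimal greedy sketch, assuming deterministic (argmax) next-token functions for both models — all names here are illustrative, not the production code:

```python
def speculative_decode(target_next, draft_next, prompt, max_new=8, k=4):
    """target_next/draft_next: fn(tokens) -> next token id (greedy).

    Sketch only: a real implementation verifies all k draft positions
    in a single batched forward pass of the target model, which is
    where the latency win comes from.
    """
    tokens = list(prompt)
    end = len(prompt) + max_new
    while len(tokens) < end:
        # 1) draft model cheaply proposes k tokens
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_next(draft))
        # 2) target verifies left to right: keep the matching prefix,
        #    replace the first mismatch with the target's own token
        for i in range(k):
            t = target_next(draft[: len(tokens) + i])
            if t != draft[len(tokens) + i]:
                tokens = draft[: len(tokens) + i] + [t]
                break
        else:
            tokens = draft  # all k proposals accepted
    return tokens[:end]
```

Even on a total mismatch the loop still emits one verified token per round, so output quality matches the target model; agreement between the models only changes speed.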
P95 latency dropped to 128ms. Completion acceptance rate jumped from 22% to 37%. Autocomplete abandonment fell to 12%, and daily active completions per user increased by 2.4x.
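The INT8 step above can be illustrated with a generic per-output-channel symmetric quantizer — a sketch of the underlying idea only, not GPTQ itself (GPTQ additionally compensates quantization error using second-order weight statistics):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-output-channel symmetric quantization of an [out, in] weight matrix."""
    # one scale per output channel, mapping the max |weight| to 127
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Per-channel scales keep the rounding error bounded by half a quantization step per channel, which is why INT8 weights usually cost little accuracy while halving memory traffic versus FP16.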

Made completions project-aware with tree-sitter retrieval and relevance scoring
Completions lacked project-specific context: the model suggested imports from packages not in the project and regularly missed local type definitions.
Built a retrieval-augmented generation pipeline that indexes the active repository using tree-sitter ASTs. Created a relevance scorer that weights recently edited files, import chains, and type signatures.
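The relevance scorer can be sketched as a weighted blend of the three signals named above. The weights, decay constant, and field names here are hypothetical, chosen only to show the shape of the scoring function:

```python
from dataclasses import dataclass
from typing import Optional
import math
import time

@dataclass
class FileContext:
    path: str
    last_edited: float     # unix timestamp of the most recent edit
    import_distance: int   # hops from the active file in the import graph
    shared_type_refs: int  # type signatures shared with the cursor scope

def relevance(f: FileContext, now: Optional[float] = None) -> float:
    """Blend recency, import proximity, and type overlap (illustrative weights)."""
    now = time.time() if now is None else now
    recency = math.exp(-max(0.0, now - f.last_edited) / 3600.0)  # ~1h decay
    imports = 1.0 / (1.0 + f.import_distance)  # closer in the import chain scores higher
    types = math.log1p(f.shared_type_refs)     # diminishing returns on shared types
    return 2.0 * recency + 1.5 * imports + 1.0 * types
```

Files are then ranked by this score and the top chunks are packed into the completion prompt until the context budget is exhausted.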
Type-correct completions improved by 28%. Import suggestion accuracy went from 61% to 89%. Reduced "hallucinated import" complaints by 73%.