
GitHub
Fine-tuned code LLMs, cutting latency by 60% while improving completion accuracy. Ex-Google Brain.

Cut code completion latency 60% with speculative decoding and INT8 quantization
The code completion model's p95 latency sat at 320ms, driving developer frustration and autocomplete abandonment rates above 40%. Users on slower machines saw delays of 500ms or more.
Implemented speculative decoding with a small draft model, quantized the main model to INT8 using GPTQ, and built a custom KV-cache warming strategy for repository-level context. Trained on 2M curated code completions with rejection sampling.
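The core of speculative decoding is a draft-then-verify loop: a cheap draft model proposes several tokens, and the large target model checks them, keeping the longest matching prefix. A minimal greedy sketch, assuming deterministic (argmax) next-token functions for both models — all names here are illustrative, not the production code:

```python
def speculative_decode(target_next, draft_next, prompt, max_new=8, k=4):
    """target_next/draft_next: fn(tokens) -> next token id (greedy).

    Sketch only: a real implementation verifies all k draft positions
    in a single batched forward pass of the target model, which is
    where the latency win comes from.
    """
    tokens = list(prompt)
    end = len(prompt) + max_new
    while len(tokens) < end:
        # 1) draft model cheaply proposes k tokens
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_next(draft))
        # 2) target verifies left to right: keep the matching prefix,
        #    replace the first mismatch with the target's own token
        for i in range(k):
            t = target_next(draft[: len(tokens) + i])
            if t != draft[len(tokens) + i]:
                tokens = draft[: len(tokens) + i] + [t]
                break
        else:
            tokens = draft  # all k proposals accepted
    return tokens[:end]
```

Even on a total mismatch the loop still emits one verified token per round, so output quality matches the target model; agreement between the models only changes speed.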
P95 latency dropped to 128ms. Completion acceptance rate jumped from 22% to 37%. Autocomplete abandonment fell to 12%, and daily active completions per user increased by 2.4x.
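The INT8 step above can be illustrated with a generic per-output-channel symmetric quantizer — a sketch of the underlying idea only, not GPTQ itself (GPTQ additionally compensates quantization error using second-order weight statistics):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-output-channel symmetric quantization of an [out, in] weight matrix."""
    # one scale per output channel, mapping the max |weight| to 127
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Per-channel scales keep the rounding error bounded by half a quantization step per channel, which is why INT8 weights usually cost little accuracy while halving memory traffic versus FP16.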

Made completions project-aware with tree-sitter retrieval and relevance scoring
Completions lacked project-specific context: the model suggested imports from packages not in the project and regularly missed local type definitions.
Built a retrieval-augmented generation pipeline that indexes the active repository using tree-sitter ASTs. Created a relevance scorer that weights recently edited files, import chains, and type signatures.
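The relevance scorer can be sketched as a weighted blend of the three signals named above. The weights, decay constant, and field names here are hypothetical, chosen only to show the shape of the scoring function:

```python
from dataclasses import dataclass
from typing import Optional
import math
import time

@dataclass
class FileContext:
    path: str
    last_edited: float     # unix timestamp of the most recent edit
    import_distance: int   # hops from the active file in the import graph
    shared_type_refs: int  # type signatures shared with the cursor scope

def relevance(f: FileContext, now: Optional[float] = None) -> float:
    """Blend recency, import proximity, and type overlap (illustrative weights)."""
    now = time.time() if now is None else now
    recency = math.exp(-max(0.0, now - f.last_edited) / 3600.0)  # ~1h decay
    imports = 1.0 / (1.0 + f.import_distance)  # closer in the import chain scores higher
    types = math.log1p(f.shared_type_refs)     # diminishing returns on shared types
    return 2.0 * recency + 1.5 * imports + 1.0 * types
```

Files are then ranked by this score and the top chunks are packed into the completion prompt until the context budget is exhausted.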
Type-correct completions improved by 28%. Import suggestion accuracy went from 61% to 89%. Reduced "hallucinated import" complaints by 73%.