
GitHub
ML engineer working on model evaluation and fine-tuning at Anthropic. Spelman + Georgia Tech. Passionate about responsible AI and making ML systems reliable.

One-click fine-tuning pipeline cutting model training from 2 weeks to 4 hours
Internal teams needed custom fine-tuned models, but the process was manual and error-prone: each fine-tune took an ML engineer two weeks of babysitting.
Built an end-to-end fine-tuning pipeline using Modal for compute, W&B for experiment tracking, and Claude for data-quality checks. One-click training with automatic hyperparameter tuning.
Fine-tuning time: 2 weeks → 4 hours. Internal teams can now self-serve model training. Shipped 15 custom models in Q1 vs. 3 the previous quarter. Compute costs down 40%.
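The one-click flow above can be sketched as a single entry point that chains the stages together. All names here (`FineTuneJob`, `run_pipeline`, the stage functions) are illustrative stand-ins, not the actual internal API:

```python
from dataclasses import dataclass, field

@dataclass
class FineTuneJob:
    base_model: str
    dataset_path: str
    stages_done: list = field(default_factory=list)

def check_data_quality(job: FineTuneJob) -> None:
    # Stand-in for the LLM-based review of training samples
    # (formatting and label errors) before any compute is spent.
    job.stages_done.append("data_quality")

def tune_hyperparameters(job: FineTuneJob) -> dict:
    # Stand-in for the automatic hyperparameter search, e.g. a small
    # sweep over learning rate and batch size tracked per run.
    job.stages_done.append("hparam_tuning")
    return {"learning_rate": 2e-5, "batch_size": 32}

def launch_training(job: FineTuneJob, hparams: dict) -> None:
    # Stand-in for dispatching the fine-tune to remote GPU compute.
    job.stages_done.append("training")

def run_pipeline(base_model: str, dataset_path: str) -> FineTuneJob:
    """Single entry point: the 'one click'."""
    job = FineTuneJob(base_model, dataset_path)
    check_data_quality(job)
    hparams = tune_hyperparameters(job)
    launch_training(job, hparams)
    return job

job = run_pipeline("base-v1", "s3://bucket/train.jsonl")
print(job.stages_done)  # ['data_quality', 'hparam_tuning', 'training']
```

Keeping every stage behind one entry point is what lets non-ML teams self-serve: there is no intermediate state to babysit.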

Inference optimization cutting latency 70% and serving costs 64%
Model inference was slow and expensive: P50 latency was 800ms, GPU utilization was only 40%, and serving costs ran $50k/month for a single model.
Profiled the inference pipeline, then applied quantization and batching optimizations. Used Modal for autoscaling and built a caching layer for common queries.
P50 latency: 800ms → 240ms. GPU utilization: 40% → 85%. Serving costs: $50k → $18k/month. Can now serve 3x the traffic on the same infrastructure.
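The caching layer for common queries can be sketched with an in-process LRU cache in front of the model call, so identical prompts only pay for inference once. `run_model` is a hypothetical placeholder for the actual (expensive) GPU call, and the cache size is illustrative:

```python
import functools

CALLS = {"model": 0}

def run_model(prompt: str) -> str:
    # Placeholder for the real inference call; the counter lets us
    # observe how many requests actually reach the model.
    CALLS["model"] += 1
    return prompt.upper()

@functools.lru_cache(maxsize=4096)
def cached_infer(prompt: str) -> str:
    # Repeated identical prompts are served from the cache,
    # never reaching run_model a second time.
    return run_model(prompt)

for p in ["hi", "hi", "hello", "hi"]:
    cached_infer(p)
print(CALLS["model"])  # 2
```

In production the cache would typically be shared across replicas (e.g. keyed by a prompt hash in an external store), but the effect is the same: cache hits cost no GPU time, which raises effective utilization and cuts cost per request.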

Evaluation framework raising coverage of critical behaviors from 30% to 90%
Model evaluation was ad hoc and incomplete: critical capability regressions were being caught only in production, and the eval suite covered just 30% of important behaviors.
Designed a comprehensive evaluation framework: used Claude to generate diverse test cases, built automated regression detection, and created dashboards for eval tracking.
Eval coverage: 30% → 90% of critical behaviors. False-positive rate cut 50%. Caught 12 regressions pre-release that would otherwise have shipped. Framework adopted org-wide.
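The core of automated regression detection can be sketched as a comparison of each eval's score against a stored baseline, flagging drops beyond a tolerance. The function name, score format, and the 2-point threshold here are all illustrative assumptions:

```python
THRESHOLD = 2.0  # max allowed score drop, in eval points (illustrative)

def detect_regressions(baseline: dict, candidate: dict,
                       threshold: float = THRESHOLD) -> list:
    """Return names of evals whose score dropped more than `threshold`.

    Evals missing from the candidate run count as a score of 0, so a
    silently dropped eval is surfaced as a regression too.
    """
    return sorted(
        name for name, base_score in baseline.items()
        if base_score - candidate.get(name, 0.0) > threshold
    )

baseline = {"math": 81.0, "coding": 74.5, "safety": 92.0}
candidate = {"math": 80.2, "coding": 69.0, "safety": 92.4}
print(detect_regressions(baseline, candidate))  # ['coding']
```

Tuning the threshold is what controls the false-positive rate: too tight and run-to-run noise trips the alarm, too loose and real regressions ship.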