Founding GPU Kernel Engineer

SF Tensor (F25)

San Francisco | Full-time | Full stack

About this role

ABOUT SF TENSOR

At The San Francisco Tensor Company, we believe the future of AI and high-performance computing depends on rethinking the entire software and infrastructure stack. Today's developers face bottlenecks across hardware, cloud, and code optimization that slow progress before ideas can reach their full potential. Our mission is to remove those barriers and make compute faster, cheaper, and universally portable.

We are building a Kernel Optimizer that automatically transforms code into its most efficient form, combined with Tensor Cloud for adaptive, cross-cloud compute and Emma Lang, a new programming language for high-performance, hardware-aware computation. Together, these technologies reinvent the foundations of AI and HPC.

SF Tensor is proudly backed by Susa Ventures and Y Combinator, along with a group of angels that includes Max Mullen and Paul Graham as well as founders and executives of NeuraLink, Notion and AMD. We are partnering with researchers, engineers, and organizations who share our belief that the next breakthroughs in AI require breakthroughs in compute.

ABOUT THE ROLE

We're looking for a Founding GPU Kernel Engineer who lives right at the boundary between hardware and software. Someone who thinks in warps, occupancy, and memory hierarchies, and can squeeze every last FLOP out of a GPU.

Your job is to go deeper than anyone else. You'll hand-tune kernels to figure out what's actually possible on the hardware, and then turn that knowledge into compiler optimization passes that help every model we compile.

WHAT YOU'LL DO

- Write and hand-optimize GPU kernels for ML workloads (matmuls, attention, normalization, etc.) to set the performance ceilings
- Profile at the microarchitectural level: look into SM utilization, warp stalls, memory bank conflicts, register pressure, instruction throughput
- Debug performance issues by digging deep into things like clock speeds, thermal throttling, driver behavior, hardware errata
- Turn you...

WHAT WE'RE LOOKING FOR

- Deep expertise in GPU architecture
- Proven track record of hand-writing kernels that match or beat vendor libraries (cuBLAS, cuDNN, CUTLASS)
- Strong skills with low-level profiling tools: Nsight Compute, Nsight Systems, rocprof, or equivalents
- Experience reading and reasoning about PTX/SASS or GPU assembly
- Solid systems programming in C++ and CUDA (or ROCm/HIP)
- Good understanding of how high-level ML operations map to hardware execution
- Experience with distributed training systems: collective ops like all-reduce and all-gather, NCCL/RCCL, multi-node communication patterns