This role focuses on making model inference fast and reliable at scale: optimizing serving systems that reach billions of users, accelerating inference-heavy research, and co-designing models with next-generation hardware.
What you'll do
- Optimizing the latency and throughput of model inference.
- Building reliable and performant production serving systems that serve billions of users.
- Accelerating research on scaling test-time compute and rollouts in reinforcement learning training.
- Co-designing models and hardware for next-generation architectures.
What you need
- Experience with system optimizations for model serving, such as batching, caching, load balancing, and parallelism.
- Experience with low-level optimizations for inference, such as GPU kernels and code generation.
- Experience with algorithmic optimizations for inference, such as quantization, distillation, speculative decoding, and low-precision numerics.