You will work closely with Scale's ML teams and researchers to build the foundational platform that supports all of our ML research and development work. You will build and optimise the platform to enable our next-generation LLM training, inference, and data curation.
Key responsibilities include:
- Building, profiling and optimising our training and inference framework.
- Collaborating with ML and research teams to accelerate their research and development, and enable them to develop the next generation of models and data curation.
- Researching and integrating state-of-the-art technologies to optimise our ML system.
Ideal candidates will have experience with multi-node LLM training and inference, developing large-scale distributed ML systems, and post-training methods such as RLHF and RLVR, along with related algorithms such as PPO and GRPO.
Strong software engineering skills and proficiency in frameworks and tools such as CUDA, PyTorch, transformers, and flash attention are required. Strong written and verbal communication skills for working in a cross-functional team environment are also essential.
This role may be eligible for additional benefits such as a commuter stipend.