Meta's AI Training and Inference Infrastructure is growing exponentially to support ever-increasing AI use cases. This creates a dramatic scaling challenge that our engineers tackle daily: building and evolving the network infrastructure that connects vast numbers of training accelerators, such as GPUs, together.
What you'll do
Collaborate with hardware and software teams to optimize end-to-end communication pathways for large-scale distributed training workloads, ensuring seamless integration between compute, storage, and networking components.
What you need
- Experience using communication libraries such as MPI, NCCL, and UCX
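To give a flavor of the domain, collective-communication libraries like NCCL commonly implement allreduce with a ring algorithm: a reduce-scatter phase followed by an allgather phase. The sketch below is an illustrative pure-Python simulation of that algorithm over in-memory "ranks" (it is not the API of any of the libraries listed above), showing how every rank ends up with the elementwise sum of all ranks' buffers.

```python
def ring_allreduce(buffers):
    """Simulate ring allreduce over a list of equal-length per-rank buffers.

    After the call, every rank's buffer holds the elementwise sum of all
    the original buffers, mirroring what an allreduce collective produces.
    """
    n = len(buffers)                        # number of ranks in the ring
    assert n > 1 and len(buffers[0]) % n == 0, "length must divide by ranks"
    chunk = len(buffers[0]) // n            # each rank "owns" one chunk

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to its neighbor, which accumulates it. After n-1 steps, rank r holds
    # the fully reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            c = (r - step) % n
            for i in span(c):
                buffers[dst][i] += buffers[r][i]

    # Phase 2: allgather. The fully reduced chunks circulate around the
    # ring so that every rank ends up with the complete result.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            c = (r + 1 - step) % n
            for i in span(c):
                buffers[dst][i] = buffers[r][i]
    return buffers
```

The ring layout is attractive at scale because each rank sends and receives the same fixed amount of data per step regardless of the number of ranks, which keeps link bandwidth fully utilized.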