We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.
What you'll do
- Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
- Manage and optimize Slurm-based HPC environments for the distributed training of large language models
What you need
- Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
- Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization