Join a team that analyzes large-scale datacenter workloads on GPU-accelerated clusters. You will turn telemetry and workload data into clear findings and visuals, partnering with OS, container, GPU, and systems engineers. When useful, you will apply machine learning and deep learning techniques for categorization and forecasting, and integrate them into tools the team actually uses.
Responsibilities:
- Analyze large-scale workloads and infrastructure signals to find application and platform improvement opportunities.
- Work with high-dimensional data: spot trends, tie changes to known events, summarize conclusions, and communicate results to engineers and leadership.
- Partner with the team to clarify questions, scope analyses, and document methods so others can extend your work.
- Build and maintain practical visualizations and lightweight implementations (e.g., ML/DL models for classification or prediction) inside existing software workflows.
Requirements:
- 5+ years analyzing complex datasets, debugging data issues, and communicating trends clearly.
- BS or MS in Engineering, Mathematics, Physics, Computer Science, or equivalent experience.
- Strong Python and JavaScript skills.
- Comfortable being responsible for an analysis end-to-end.
- Hands-on use of telemetry / observability stacks (e.g., Grafana, Elasticsearch, Splunk).
- Demonstrated grasp of core ML concepts; quick learner; strong analytical and problem-solving skills.
- Strong collaboration and communication skills.
Nice to Have:
- Experience with TensorFlow or PyTorch
- Familiarity with Linux and HPC, large-scale, or performance-sensitive environments
- Experience visualizing high-dimensional problems
- Diligent, action-biased analysis style