AWS Utility Computing (UC) provides product innovations, from foundational services such as Amazon Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2) to a steady stream of new services and features that continue to set AWS apart in the industry. As a member of the UC organization, you'll support the development and management of Compute, Database, Storage, Internet of Things (IoT), Platform, and Productivity Apps services in AWS.
We seek a DevOps Engineer for the Machine Learning (ML) Infrastructure team to build the tools that ensure top performance of the AWS ML and High Performance Computing (HPC) technologies developed by our organization. You will:
- Be the lead engineer on a team that builds and maintains the infrastructure that monitors and reports on functionality and performance of massive testing workloads run at scale.
- Use internal Amazon CI/CD tools, Linux, and public AWS products to automate the delivery of our software to customers.
- Write Python code that spins up large clusters and runs benchmarks and applications for ML and HPC workloads.
- Use Amazon Managed Grafana, QuickSight, OpenSearch, and Athena to analyze performance data and create dashboards.
- Invent automatic mechanisms to alert developers to functional and performance regressions.
- Manage complex infrastructure covering many instance types, software stacks, and Linux operating systems.
- Ensure all infrastructure setup is defined as code (IaC), reviewed, and committed to automated pipelines.
- Find innovative ways to schedule work using Jenkins, supporting the development team while keeping cluster costs down.
- Review dashboard and automation results, triage failures, and introduce new tests and platforms.
- Create reports and status updates on the CI/CD system for stakeholders.
Join us as we expand the AWS offerings for AI, including Trainium, Neuron, and the Elastic Fabric Adapter (EFA).