
MLOps Site Reliability Engineer

KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem, inventing systems and solutions for wafers, integrated circuits, packaging, and displays.
Site Reliability
Mid-Level Software Engineer
In-Person
5,000+ Employees
2+ years of experience
AI · Enterprise SaaS

Description For MLOps Site Reliability Engineer

KLA, a global leader in semiconductor process control technology, is seeking an MLOps Site Reliability Engineer to join its team. This role sits at the intersection of machine learning operations and infrastructure reliability, focusing on building and maintaining robust systems for ML workflows. The position offers an opportunity to work with cutting-edge technologies in semiconductor manufacturing, where KLA invests heavily in R&D (15% of sales).

The role involves collaborating with data scientists and ML engineers to ensure the reliable deployment and operation of machine learning systems. You'll be responsible for designing and implementing scalable infrastructure, managing CI/CD pipelines, and ensuring the performance and security of ML systems. The position requires expertise in modern DevOps practices, cloud platforms, and containerization technologies.

KLA's Global Products Group (GPG) and Central Engineering organization, with its 9 Centers-of-Excellence, provides a rich environment for innovation and technical growth. The company's products are crucial in the manufacturing of virtually every electronic device, from smartphones to smart cars.

The ideal candidate will have a strong background in Site Reliability Engineering, combined with knowledge of machine learning concepts and workflows. This role offers the opportunity to make a significant impact on KLA's ML infrastructure while working with a global team of experts in various technical disciplines.

Benefits include competitive compensation and a family-friendly total rewards package, though specific details aren't provided. KLA is an equal opportunity employer committed to providing reasonable accommodations and maintaining an inclusive environment.

Last updated 2 days ago

Responsibilities For MLOps Site Reliability Engineer

  • Design, implement, and maintain scalable and reliable machine learning infrastructure
  • Collaborate with data scientists and ML engineers to deploy and manage ML models in production
  • Develop and maintain CI/CD pipelines for machine learning workflows
  • Monitor and optimize the performance of machine learning systems and infrastructure
  • Implement and manage automated testing and validation processes for ML models
  • Ensure the security and compliance of machine learning systems and data
  • Troubleshoot and resolve issues related to ML infrastructure and workflows
  • Document processes, procedures, and best practices for machine learning operations

Requirements For MLOps Site Reliability Engineer

Python · Java · Go · Kubernetes
  • Bachelor's degree in Computer Science, Engineering, or related field
  • Proven experience as a Site Reliability Engineer (SRE)
  • Strong knowledge of machine learning concepts and workflows
  • Proficiency in programming languages such as Python, Java, or Go
  • Experience with cloud platforms such as AWS, Azure, or Google Cloud
  • Familiarity with containerization technologies like Docker and Kubernetes
  • Experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI
  • Strong problem-solving skills and ability to troubleshoot complex issues
  • Excellent communication and collaboration skills

Benefits For MLOps Site Reliability Engineer

  • Medical insurance
  • Competitive compensation
  • Family-friendly total rewards package

