Software Engineer, Machine Learning Supercomputer Reliability

A global technology company that develops internet-related services and products.
$197,000 - $291,000
Machine Learning
Staff Software Engineer
In-Person
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS
This job posting is no longer active.

Job Description

Google is seeking a Software Engineer to join its ML, Systems, & Cloud AI (MSCA) organization, focusing on machine learning supercomputer reliability. The role is central to developing and maintaining software for reliable scale-out and scale-up of accelerators, specifically for massive-scale machine learning applications.

The position requires deep expertise in distributed systems, machine learning, and networking technologies. You'll be working on various layers of the software stack, from network routing rules for Tensor Processing Units (TPUs) to distributed software running on Google's internal and cloud infrastructure. The role combines technical leadership with hands-on development, requiring both strategic thinking and practical implementation skills.

As part of Google's MSCA organization, you'll be contributing to the infrastructure that powers all Google services (Search, YouTube, etc.) and Google Cloud. The team prioritizes security, efficiency, and reliability while pushing the boundaries of hyperscale computing. Your work will directly impact Google Cloud's Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

The position offers competitive compensation ($197,000-$291,000 base salary plus bonus, equity, and benefits) and the opportunity to work with cutting-edge technology. You'll be part of a team that shapes the future of machine learning infrastructure, working on projects that affect billions of users worldwide.

The ideal candidate has at least 8 years of experience with programming languages such as Java, C/C++, or Python, a strong understanding of distributed systems, and knowledge of ML algorithms. This role offers the chance to work on challenging technical problems while providing leadership and driving software development initiatives that are crucial to Google's machine learning infrastructure.

Last updated 19 days ago

Responsibilities For Software Engineer, Machine Learning Supercomputer Reliability

  • Design and maintain supercomputer software across different layers of the software stack
  • Provide technical leadership to help formulate and drive software development plans
  • Identify commonalities between different supercomputer generations and accelerator types, and create well-abstracted, flexible software
  • Help identify dependencies across cross-functional teams and drive common execution with a focus on development velocity and quality

Requirements For Software Engineer, Machine Learning Supercomputer Reliability

  • Bachelor's degree or equivalent practical experience
  • 8 years of experience with one or more general purpose programming languages (e.g., Java, C/C++ or Python)
  • Experience with data structures, algorithms, and software design
  • Understanding of distributed systems concepts
  • Knowledge of common ML algorithms and how they map to software and hardware operations
  • Passion for building back-end software for high-performance computing and machine learning applications

Benefits For Software Engineer, Machine Learning Supercomputer Reliability

  • Medical Insurance
  • Equity
  • 401k