Google is seeking a Software Engineer to join its ML, Systems, & Cloud AI (MSCA) organization, focusing on machine learning supercomputer reliability. The role centers on developing and maintaining the software that enables reliable scale-out and scale-up of accelerators for massive-scale machine learning applications.
The position requires deep expertise in distributed systems, machine learning, and networking technologies. You'll work across multiple layers of the software stack, from network routing rules for Tensor Processing Units (TPUs) to distributed software running on Google's internal and cloud infrastructure. Because the role combines technical leadership with hands-on development, it calls for both strategic thinking and practical implementation skills.
As part of Google's MSCA organization, you'll contribute to the infrastructure that powers all Google services (such as Search and YouTube) as well as Google Cloud. The team prioritizes security, efficiency, and reliability while pushing the boundaries of hyperscale computing. Your work will directly affect Google Cloud's Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.
The position offers competitive compensation ($197,000-$291,000 base salary plus bonus, equity, and benefits) and the opportunity to work with cutting-edge technology. You'll be part of a team that shapes the future of machine learning infrastructure, working on projects that affect billions of users worldwide.
The ideal candidate has at least 8 years of experience with programming languages such as Java, C/C++, or Python, a strong understanding of distributed systems, and knowledge of ML algorithms. Beyond tackling challenging technical problems, the engineer in this role will provide technical leadership and drive software development initiatives that are crucial to Google's machine learning infrastructure.