Google is seeking a Staff Software Engineer to join its Machine Learning (ML) Performance team, focusing on optimizing performance and efficiency for machine learning and AI workloads. The role sits at the forefront of ML technology and involves working with Google's latest accelerators and fleet-scale systems.
The position involves working with cutting-edge ML infrastructure, particularly in large language model (LLM) optimization, performance analysis, and benchmarking. You'll be responsible for identifying and maintaining the training and serving benchmarks used by the industry and the ML community, while driving TensorFlow/JAX performance improvements on TPUs.
As a Staff Software Engineer, you'll collaborate with product teams to solve complex LLM performance challenges, including onboarding new models on Google's TPU hardware and enabling efficient large-scale training. The role requires deep expertise in performance analysis, system architecture, and machine learning infrastructure.
The ideal candidate will have significant experience in software development, particularly with languages like Python and C++, and a strong background in performance analysis and machine learning systems. You'll need to demonstrate leadership qualities in a matrixed organization and have the ability to set technical direction for project teams.
This is an opportunity to work at Google's scale, impacting billions of users while pushing the boundaries of ML performance optimization. You'll be part of a team that demonstrates ML performance at the largest scale and on the latest accelerators in the MLPerf competition, and that pushes efficiency on trillion-parameter multipod ML models.
The position offers competitive compensation including base salary, bonus, equity, and comprehensive benefits. You'll work in a collaborative environment with opportunities to influence the direction of Google's ML infrastructure and contribute to the advancement of AI technology at a global scale.
This role is perfect for someone who combines technical depth in ML systems with the ability to drive large-scale performance improvements and collaborate across teams to deliver impactful solutions in the rapidly evolving field of machine learning.