Principal Engineer, Cloud ML Compute Services

Google Cloud accelerates organizations' digital transformation by leveraging cutting-edge technology and tools for developers.
$278,000 - $399,000
Machine Learning
Principal Software Engineer
In-Person
5,000+ Employees
15+ years of experience
AI · Cloud · Enterprise SaaS

Description For Principal Engineer, Cloud ML Compute Services

Cloud ML Compute Services (CMCS) at Google is seeking a Principal Engineer to drive the technical strategy for ML Frameworks and Models. This role focuses on enabling massive-scale ML services powered by GPUs and TPUs. The successful candidate will lead the development of cloud-based ML solutions for training and serving large models (e.g., LLMs, MoE, Diffusion, Ranking/Recommendation) using cutting-edge AI hardware such as Google TPUs and NVIDIA GPUs.

Key responsibilities include:

  1. Designing and deploying solutions leveraging GPU/TPU infrastructure for ML workloads
  2. Building strategic alignment across Google's ML landscape
  3. Collaborating with cross-functional teams on hardware and software development
  4. Providing leadership for cloud developer technology
  5. Optimizing emerging ML model types and frameworks on Google Cloud Platform

The ideal candidate will have extensive experience in software development, distributed systems, machine learning algorithms, and cloud infrastructure. They should be able to work cross-functionally, possess excellent problem-solving skills, and have outstanding communication abilities.

This role offers the opportunity to shape the future of Google's ML compute services and drive innovation in the rapidly evolving field of AI and machine learning. The position comes with a competitive salary range, bonuses, equity, and benefits, reflecting the high-level expertise required for this principal engineering role.


Responsibilities For Principal Engineer, Cloud ML Compute Services

  • Design, build, and deploy solutions for GPU/TPU ML workloads on highly scalable hardware and software infrastructure
  • Build strategic alignment with major organizations across Google contributing to the ML landscape to create mutually beneficial joint goals and execute on them
  • Work across Engineering teams that design, build, and implement both hardware and software, spanning infrastructure including platforms, chip development, compute, storage, networking, and data analytics
  • Provide leadership for cloud developer technology inside Google and manage collaboration with cross-functional Engineering teams to streamline and improve adoption of Google Cloud Platform capabilities, both within Google and across the cloud industry at large
  • Optimize emerging ML model types and benchmarks, as well as common ML frameworks such as PyTorch, TensorFlow, and JAX, on GCP

Requirements For Principal Engineer, Cloud ML Compute Services

  • Skills: Python, Kubernetes
  • Bachelor's degree in Computer Science, Electrical Engineering, or equivalent practical experience
  • 15 years of experience building software and distributed systems
  • 10 years of experience with machine learning algorithms and tools (e.g., PyTorch, TensorFlow, JAX), artificial intelligence, and deep learning models (e.g., LLMs and NLP models)
  • 10 years of experience with hardware and software design, data structures and algorithms, machine learning, and with customer-facing products
  • 10 years of experience with private and public cloud design considerations and limitations in the areas of virtualization, global infrastructure, distributed ML and HPC systems, load balancing, networking, massive data storage, and security

