Lead Cluster Operations Support Engineer

A leading technology consultancy delivering extraordinary impact with clients through technology solutions for 30+ years.
$125,330 - $208,880
DevOps
Staff Software Engineer
Hybrid
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS

Description For Lead Cluster Operations Support Engineer

Thoughtworks is seeking a Lead Cluster Operations Support Engineer to join their team in a crucial role managing large-scale GPU infrastructure. This position combines deep technical expertise in cloud infrastructure and Kubernetes with a focus on supporting machine learning operations.

The role involves providing 24x7 white-glove support for clients utilizing massive GPU clusters (6,000+ GPUs) for Managed Post Training operations. You'll be working across multiple time zones (US, Europe, India, and Australia) to ensure continuous support and optimal infrastructure utilization. This is a unique opportunity to work at the intersection of infrastructure operations and machine learning, requiring both technical excellence and strong client-facing skills.

The ideal candidate will bring extensive experience in Kubernetes administration, cloud platforms (GCP, AWS, Azure), and infrastructure automation tools. You'll be working with cutting-edge technologies including the NVIDIA NeMo Framework and modern DevOps tools. The role requires both technical depth and leadership skills, as you'll be mentoring team members and driving technical excellence.

This position offers an exciting chance to shape a new service offering, working with state-of-the-art GPU infrastructure and machine learning technologies. You'll be part of a collaborative team environment where continuous learning and professional development are encouraged. The role combines technical challenges with client interaction, making it ideal for someone who enjoys both deep technical work and stakeholder management.

Thoughtworks offers comprehensive benefits, professional development opportunities, and a culture that values diversity and inclusion. The hybrid working model provides flexibility while maintaining collaborative team dynamics. This is an excellent opportunity for a senior technical professional looking to make an impact at the intersection of infrastructure operations and machine learning.

Last updated 6 days ago

Responsibilities For Lead Cluster Operations Support Engineer

  • Shape and iterate on a new white-glove model training support service on large GPU clusters
  • Work collaboratively with Machine Learning Engineers and Infrastructure Engineers
  • Contribute to accelerator development and automation
  • Assess model training readiness and data preparation
  • Provide model training support during rotating daytime weekend shifts
  • Facilitate collaborative problem-solving within the team
  • Proactively identify and address challenges related to the white-glove service

Requirements For Lead Cluster Operations Support Engineer

  • Kubernetes
  • Python
  • Linux
  • Deep expertise in Kubernetes administration and debugging at scale
  • Extensive experience managing large clusters with thousands of nodes
  • Knowledge of running training workloads on thousands of GPUs
  • Experience with NVIDIA NeMo Framework
  • Proficiency with cloud platforms (GCP, AWS, Azure)
  • Experience with Terraform/Pulumi, Helm Charts, and Infrastructure-as-Code tools
  • Strong stakeholder management skills
  • Ability to work in ambiguous situations
  • Experience with coaching and mentoring

Benefits For Lead Cluster Operations Support Engineer

  • Medical Insurance
  • Dental Insurance
  • Vision Insurance
  • Learning & Development programs
  • Equal opportunity employer
  • Comprehensive benefits package
