Lead Cluster Operations Support Engineer

Thoughtworks

A leading technology consultancy delivering extraordinary impact with clients through technology solutions for 30+ years.

Chicago, IL, USA

$125,330 - $208,880

DevOps

Staff Software Engineer

Hybrid

5,000+ Employees

8+ years of experience

AI · Enterprise SaaS

Description For Lead Cluster Operations Support Engineer

Thoughtworks is seeking a Lead Cluster Operations Support Engineer to join their team in a crucial role managing large-scale GPU infrastructure. This position combines deep technical expertise in cloud infrastructure and Kubernetes with a focus on supporting machine learning operations.

The role involves providing 24x7 white-glove support for clients utilizing massive GPU clusters (6,000+ GPUs) for Managed Post Training operations. You'll be working across multiple time zones (US, Europe, India, and Australia) to ensure continuous support and optimal infrastructure utilization. This is a unique opportunity to work at the intersection of infrastructure operations and machine learning, requiring both technical excellence and strong client-facing skills.

The ideal candidate will bring extensive experience in Kubernetes administration, cloud platforms (GCP, AWS, Azure), and infrastructure automation tools. You'll work with the NVIDIA NeMo Framework and modern DevOps tools. The role requires both technical depth and leadership skills, as you'll be mentoring team members and driving technical excellence.

This position offers an exciting chance to shape a new service offering, working with state-of-the-art GPU infrastructure and machine learning technologies. You'll be part of a collaborative team environment where continuous learning and professional development are encouraged. The role combines technical challenges with client interaction, making it ideal for someone who enjoys both deep technical work and stakeholder management.

Thoughtworks offers comprehensive benefits, professional development opportunities, and a culture that values diversity and inclusion. The hybrid working model provides flexibility while maintaining collaborative team dynamics. This is an excellent opportunity for a senior technical professional looking to make an impact at the intersection of infrastructure operations and machine learning.

Last updated 6 days ago

Responsibilities For Lead Cluster Operations Support Engineer

Shape and iterate on a new white-glove model training support service on large GPU clusters
Work collaboratively with Machine Learning Engineers and Infrastructure Engineers
Contribute to accelerator development and automation
Assess model training readiness and data preparation
Provide model training support during rotating daytime weekend shifts
Facilitate collaborative problem-solving within the team
Proactively identify and address challenges related to the white-glove service

Requirements For Lead Cluster Operations Support Engineer

Kubernetes

Python

Linux

Deep expertise in Kubernetes administration and debugging at scale
Extensive experience managing large clusters with thousands of nodes
Knowledge of running training workloads on thousands of GPUs
Experience with NVIDIA NeMo Framework
Proficiency with cloud platforms (GCP, AWS, Azure)
Experience with Terraform/Pulumi, Helm Charts, and Infrastructure-as-Code tools
Strong stakeholder management skills
Ability to work in ambiguous situations
Experience with coaching and mentoring

Benefits For Lead Cluster Operations Support Engineer

Medical Insurance

Dental Insurance

Vision Insurance

Learning & Development programs
Equal opportunity employer
Comprehensive benefits package
