Thoughtworks is seeking a Lead Cluster Operations Support Engineer to join their team in a crucial role managing large-scale GPU infrastructure. This position combines deep technical expertise in cloud infrastructure and Kubernetes with a focus on supporting machine learning operations.
The role involves providing 24x7 white-glove support for clients utilizing massive GPU clusters (6,000+ GPUs) for Managed Post Training operations. You'll be working across multiple time zones (US, Europe, India, and Australia) to ensure continuous support and optimal infrastructure utilization. This is a unique opportunity to work at the intersection of infrastructure operations and machine learning, requiring both technical excellence and strong client-facing skills.
The ideal candidate will bring extensive experience in Kubernetes administration, cloud platforms (GCP, AWS, Azure), and infrastructure automation tools. You'll be working with cutting-edge technologies, including the NVIDIA NeMo Framework and modern DevOps tools. The role requires both technical depth and leadership skills, as you'll be mentoring team members and driving technical excellence.
This position offers an exciting chance to shape a new service offering, working with state-of-the-art GPU infrastructure and machine learning technologies. You'll be part of a collaborative team environment where continuous learning and professional development are encouraged. The role combines technical challenges with client interaction, making it ideal for someone who enjoys both deep technical work and stakeholder management.
Thoughtworks offers comprehensive benefits, professional development opportunities, and a culture that values diversity and inclusion. The hybrid working model provides flexibility while maintaining collaborative team dynamics. This is an excellent opportunity for a senior technical professional looking to make an impact at the intersection of infrastructure operations and machine learning.