
Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and AI solutions.
$144,000 - $270,250
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

NVIDIA, a global leader in accelerated computing and AI technology, is seeking a Senior Software Engineer for its DGX Cloud team. This role focuses on building and maintaining large-scale GPU infrastructure for AI workloads, combining expertise in distributed systems with cutting-edge AI technology. The position offers an opportunity to work with NVIDIA's industry-leading GPU technology and Kubernetes-based infrastructure.

The role involves developing and maintaining production systems that enable scalable GPU clusters for AI workloads, implementing monitoring and health management capabilities, and ensuring optimal performance of AI infrastructure. You'll build software against Kubernetes APIs and frameworks rather than only operating clusters, and you'll be responsible for improving system reliability and performance.

As part of NVIDIA's team, you'll be at the forefront of AI computing innovation, working with state-of-the-art technology and contributing to solutions that power AI applications across various industries. The company offers competitive compensation, including a base salary range of $144,000 to $270,250, plus equity and comprehensive benefits.

The ideal candidate brings 5+ years of experience in similar roles, strong expertise in Kubernetes and distributed systems, and a solid foundation in computer science or a related field. You should be comfortable with systems programming languages like Go and Python, and have a proven track record of working with large-scale production systems.

This position offers a unique opportunity to work with one of the most respected companies in the technology sector, known for its innovation in GPU computing and AI. You'll be part of a team that's pushing the boundaries of what's possible in AI infrastructure, while working in a collaborative environment that values creativity and autonomous thinking.

Last updated 5 hours ago

Responsibilities For Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

  • Work on the DGX Cloud team, managing production systems for large, scalable GPU clusters
  • Implement monitoring and health management capabilities for GPU assets
  • Work on custom software for scheduling GPU resources on Kubernetes
  • Ensure production AI clusters run reliably with maximum performance
  • Evaluate system failures and improve services through the incident management process

Requirements For Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

Kubernetes · Python · Go
  • 5+ years of experience in a similar role with large-scale production systems
  • Direct software engineering experience with Kubernetes APIs and frameworks
  • BS in Computer Science, Engineering, Physics, Mathematics or equivalent
  • Strong communication skills and ability to work with cross-functional teams
  • Experience with systems programming languages (Go, Python)
  • Solid understanding of data structures and algorithms

Benefits For Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

Equity
  • Equity grants
  • Comprehensive benefits package
