Senior DGX Cloud Performance Engineer

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$224,000 - $425,500
Cloud
Staff Software Engineer
In-Person
5,000+ Employees
12+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior DGX Cloud Performance Engineer

NVIDIA is seeking a Senior DGX Cloud Performance Engineer to drive the performance analysis, optimization, and modeling of NVIDIA's DGX Cloud clusters. This role sits at the intersection of cloud computing and artificial intelligence, working with NVIDIA's DGX™ Cloud platform, an end-to-end, scalable AI solution built on the latest NVIDIA architecture and co-engineered with leading cloud service providers.

The position requires deep expertise in parallel and distributed systems, with a focus on optimizing large-scale AI workloads. You'll be responsible for conducting comprehensive performance analysis of critical AI applications, developing benchmarks, and driving architectural decisions that shape the future of NVIDIA's cloud infrastructure.

Working closely with cross-functional teams, you'll help define DGX Cloud cluster architecture across different cloud service providers, optimize workloads, and develop methodologies that advance hardware-software co-design. The role involves hands-on work with various LLM workloads across industries like healthcare, climate modeling, and financial services.

This is an exceptional opportunity for an experienced engineer to impact the future of AI infrastructure at scale. You'll be working at NVIDIA, a leader in accelerated computing, with competitive compensation including a base salary range of $224,000 - $425,500 (depending on level), equity, and comprehensive benefits. The position offers the chance to work with cutting-edge technology while collaborating with some of the industry's brightest minds.

Last updated 5 hours ago

Responsibilities For Senior DGX Cloud Performance Engineer

  • Develop benchmarks and end-to-end customer applications running at scale
  • Construct experiments to analyze performance bottlenecks
  • Develop ideas to improve end-to-end system performance and usability
  • Collaborate with external CSPs during cluster deployment and workload optimization
  • Collaborate with AI researchers, developers, and application service providers
  • Work with a diverse set of LLM workloads and their applications
  • Develop a vital modeling framework and TCO analysis
  • Develop methodology to drive engineering analysis for DGX Cloud architecture

Requirements For Senior DGX Cloud Performance Engineer

Python
Kubernetes
Linux
  • 12+ years of proven experience
  • Ability to work with large-scale parallel and distributed accelerator-based systems
  • Expertise optimizing performance and AI workloads on large-scale systems
  • Experience with performance modeling and benchmarking at scale
  • Strong background in Computer Architecture, Networking, Storage systems, Accelerators
  • Familiarity with popular AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM)
  • Experience with AI/ML models and workloads, particularly LLMs
  • Understanding of DNNs and their use in emerging AI/ML applications and services
  • Bachelor's or Master's degree in Engineering
  • Proficiency in Python, C/C++
  • Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI)
