
Senior Deep Learning Performance Engineer - Training at Scale

NVIDIA is the world leader in accelerated computing, pioneering solutions to tackle challenges no one else can solve.
Machine Learning
Senior Software Engineer
Remote
5+ years of experience
AI

Description For Senior Deep Learning Performance Engineer - Training at Scale

We are seeking senior engineers with a focus on performance analysis and optimization to help maximize the efficiency of Deep Learning training, inference, and NVIDIA AI Services. Our work spans all layers of the hardware/software stack, from GPU architecture to Deep Learning frameworks, with the aim of achieving peak performance. This role offers a unique opportunity to directly impact the hardware and software roadmap in a rapidly growing company at the forefront of the AI revolution.

Join our team building software used globally, working alongside world-class engineers to implement blazingly fast, state-of-the-art deep learning models. You'll contribute to understanding the end-to-end performance of NVIDIA's DL software and hardware stack, working on the most powerful, enterprise-grade GPU clusters capable of hundreds of petaFLOPS, and on unreleased hardware before anyone else in the world.

Key Responsibilities:

  • Implement deep learning models across multiple data domains (CV, NLP/LLMs, ASR, TTS, RecSys, etc.) using various DL frameworks (PyTorch, JAX, TensorFlow 2, DGL, etc.)
  • Develop and test new software features (e.g., Graph Compilation, reduced precision training) leveraging the latest hardware functionalities
  • Analyze, profile, and optimize deep learning workloads on cutting-edge hardware and software platforms
  • Collaborate with researchers and engineers across NVIDIA, providing guidance on improving workload design, usability, and performance
  • Lead best practices for building, testing, and releasing DL software

Requirements:

  • 5+ years of experience in DL model implementation and software development
  • BSc, MS, or PhD in Computer Science, Computer Architecture, Mathematics, Physics, or related technical field (or equivalent experience)
  • Excellent Python programming skills and extensive knowledge of at least one DL Framework
  • Strong problem-solving and analytical skills
  • Solid understanding of algorithms and DL fundamentals

Preferred Qualifications:

  • Experience in performance measurements and profiling
  • Experience running large-scale workloads in HPC clusters
  • Knowledge of DevOps/MLOps practices for Deep Learning-based product development
  • Solid understanding of Linux environments and containerization technologies (e.g., Docker)
  • GPU programming experience (CUDA or OpenCL) is a plus but not required

NVIDIA is an equal opportunity employer that values diversity and does not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Last updated 8 months ago

Responsibilities For Senior Deep Learning Performance Engineer - Training at Scale

  • Implement deep learning models from multiple data domains in multiple DL frameworks
  • Implement and test new software features that use the most recent hardware functionalities
  • Analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software platforms
  • Collaborate with researchers and engineers across NVIDIA, providing guidance on improving the design, usability, and performance of workloads
  • Lead best practices for building, testing, and releasing DL software

Requirements For Senior Deep Learning Performance Engineer - Training at Scale

  • 5+ years of experience in DL model implementation and software development
  • BSc, MS, or PhD degree in Computer Science, Computer Architecture, Mathematics, Physics, or a related technical field, or equivalent experience
  • Excellent Python programming skills and extensive knowledge of at least one DL framework
  • Strong problem-solving and analytical skills
  • Solid understanding of algorithms and DL fundamentals
