NVIDIA is seeking a Senior AI Cluster Tools Developer to join its multifaceted software team. The role involves developing tools for GPU cluster users and administrators and working with teams such as Architecture and Software. The successful candidate will build internal performance and power profiling and analysis tools for AI workloads at cluster scale, create debugging tools for common GPU cluster problems, and collaborate with users to build and calibrate performance and power models for next-generation hardware or systems.
Key responsibilities include:
- Developing internal perf/power profiling and analysis tools for AI workloads at cluster scale
- Creating debugging tools for common GPU cluster issues
- Collaborating with users to build and calibrate perf/power models
- Partnering with architects to propose new hardware features or improve existing ones
Requirements:
- BS+ in Computer Science or related field (or equivalent experience)
- 5+ years of software development experience
- Strong software design and implementation skills with Python/Go/C++
- Good understanding of Deep Learning and AI frameworks (PyTorch, TensorFlow, etc.)
- Knowledge of AI cluster job scheduling, storage management, and network management
- Linux kernel knowledge
- Excellent problem-solving and project management skills
Preferred qualifications:
- Experience with GPU cluster-scale continuous profiling and analysis tools or platforms
- Solid experience in large AI job troubleshooting and failure detection/recovery
- Skill in Deep Learning application performance analysis and optimization
- Knowledge of GPU/CPU architecture and application performance or power efficiency analysis
NVIDIA offers competitive salaries, comprehensive benefits, and the opportunity to work with some of the most brilliant and talented people in the world. The company is committed to fostering a diverse work environment and is an equal opportunity employer.