Tesla's Supercomputing/AI infrastructure team is seeking an HPC Engineer to support and improve its AI/ML cluster infrastructure. The role is central to maintaining and enhancing the platform that keeps Tesla's Full Self-Driving (FSD), Tesla Bot, and Dojo engineering teams productive.
As an HPC Engineer, you will be responsible for:
- Managing and operating AI infrastructure
- Monitoring compute/GPU/network metrics
- Troubleshooting Linux systems and tuning their performance
- Collaborating with the Data Center team to coordinate server operations
- Facilitating neural network training at scale
- Streamlining FSD development
- Enabling Dojo to become the most powerful supercomputer
Key responsibilities include:
- Supporting AI/ML cluster infrastructure on GPU and Dojo platforms
- Improving monitoring & self-healing pipelines and security posture
- Optimizing server, storage, and network performance
- Managing HPC clusters, workloads, and applications
- Automating operational tasks and contributing to systems engineering
- Participating in a 24x7 on-call rotation
The ideal candidate will have:
- Proficiency in Python or Bash scripting
- Strong Linux & network fundamentals
- Experience with configuration management software and systems monitoring
- Knowledge of high-throughput low-latency networks and GPU-based computing systems
- Familiarity with Slurm, LSF, and parallel file systems
- A bachelor's degree in a relevant field, or exceptional skills in place of one
- 3+ years of related experience
This role offers an annual salary range of $120,000 to $300,000, plus cash and stock awards and a comprehensive benefits package. Join Tesla in pushing the boundaries of AI and autonomous technology!