Taro Logo

Distributed Training Engineer, Sora

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
$295,000 - $440,000
Machine Learning
Staff Software Engineer
Hybrid
1,000 - 5,000 Employees
7+ years of experience
This job posting is no longer active. Check out these related jobs instead:
Staff Machine Learning Engineer - Infrastructure

Staff Machine Learning Engineer position at Marqeta focusing on ML infrastructure development and implementation using AWS services.

Staff Software Engineer - AI Platforms

Staff Software Engineer position at Datadog focusing on building AI infrastructure and evaluation systems for machine learning platforms.

Machine Learning Systems Engineer

Machine Learning Systems Engineer role at Substack focused on building and maintaining ML infrastructure and pipelines.

AI/ML System Performance Simulation Architect - Datacenter

Lead performance architecture for AI/ML systems at Apple, focusing on datacenter architecture optimization and simulation for machine learning applications.

AI/ML System Performance Simulation Architect - Datacenter

AI/ML System Performance Simulation Architect role at Apple focusing on designing and optimizing computer architectures for Machine Learning applications.

Job Description

The Sora team at OpenAI is working on making video a key capability of OpenAI's foundation models. As a Distributed Training Engineer for Sora, you will work on improving the training throughput for our internal training framework and enable researchers to experiment with new ideas. This role requires strong engineering skills, the ability to write bug-free machine learning code, and deep knowledge of supercomputer performance.

Key responsibilities include:

  • Collaborating with researchers to develop systems-efficient video models and architectures
  • Applying the latest techniques to achieve impressive hardware efficiency for training runs
  • Profiling and optimizing the training framework

The ideal candidate should have experience with multi-modal ML pipelines, strong software engineering skills (particularly in Python), experience with understanding and optimizing training kernels, and a passion for understanding stable training dynamics.

OpenAI offers a competitive compensation package, including a salary range of $295K – $440K, generous equity, and comprehensive benefits such as medical insurance, mental health support, 401(k) matching, unlimited time off, and paid parental leave.

This role is based in San Francisco, CA, with a hybrid work model of 3 days in the office per week. OpenAI is committed to diversity, equality, and creating an inclusive environment for all employees.

Last updated a year ago

Responsibilities For Distributed Training Engineer, Sora

  • Collaborate with researchers to enable them to develop systems-efficient video models and architectures
  • Apply the latest techniques to our internal training framework to achieve impressive hardware efficiency for our training runs
  • Profile and optimize our training framework

Requirements For Distributed Training Engineer, Sora

Python
  • Experience working with multi-modal ML pipelines
  • Strong software engineering skills and proficiency in Python
  • Experience understanding and optimizing training kernels
  • Passion for understanding stable training dynamics
  • Ability to dive deep into systems implementations to improve performance and maintainability

Benefits For Distributed Training Engineer, Sora

Equity
Medical Insurance
Dental Insurance
Vision Insurance
401k
Education Budget
Parental Leave
Mental Health Assistance
  • Medical, dental, and vision insurance for you and your family
  • Mental health and wellness support
  • 401(k) plan with 50% matching
  • Unlimited time off and 13 company holidays per year
  • Paid parental leave (20 weeks) and family-planning support
  • Annual learning & development stipend ($1,500 per year)
  • Equity