Member of Technical Staff, Training Infra Engineer

AI company training and deploying frontier models for developers and enterprises to power AI systems for content generation, semantic search, RAG, and agents.
Machine Learning
Senior Software Engineer
Remote
501 - 1,000 Employees
5+ years of experience
AI

Description For Member of Technical Staff, Training Infra Engineer

Cohere is at the forefront of AI development, focusing on training and deploying frontier models for developers and enterprises. The Member of Technical Staff, Training Infra Engineer role offers an opportunity to work on cutting-edge AI infrastructure. You'll be responsible for designing and implementing high-performance training pipelines, optimizing infrastructure, and bridging the gap between research and production. The company has one of the highest compute-to-engineer ratios globally and maintains a flexible approach to engineering and research work.

The position offers the chance to work with state-of-the-art technology and some of the best researchers in the field. You'll be contributing to critical infrastructure that enables large-scale model training, working with technologies like Python, JAX, PyTorch, and Kubernetes. The role requires strong software engineering skills and experience with distributed training systems.

Cohere values diversity and maintains an inclusive work environment, with offices in major tech hubs like London, Toronto, San Francisco, and New York, while also supporting remote work. The company offers comprehensive benefits including health and dental coverage, mental health support, parental leave, and generous vacation time. This is an excellent opportunity for someone passionate about AI infrastructure who wants to make a significant impact in the field of machine learning and artificial intelligence.

The company's mission to "scale intelligence to serve humanity" drives its work in developing AI systems for content generation, semantic search, RAG, and agents. The company emphasizes customer value and maintains a fast-paced, innovation-focused environment where each team member contributes to advancing model capabilities.

Responsibilities For Member of Technical Staff, Training Infra Engineer

  • Design and write high-performance, scalable software for training
  • Improve training setup from an infrastructure and codebase performance standpoint
  • Craft and implement tools to speed up training cycles
  • Research, implement, and experiment with ideas on supercompute and data infrastructure
  • Work with researchers in the field

Requirements For Member of Technical Staff, Training Infra Engineer

  • Extremely strong software engineering skills
  • Proficiency in Python and ML frameworks (JAX, PyTorch, and XLA/MLIR)
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands-on experience training large models at scale

Benefits For Member of Technical Staff, Training Infra Engineer

  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits
  • Mental health budget
  • 100% Parental Leave top-up for 6 months (Canada, US, and UK)
  • Personal enrichment benefits
  • Remote-flexible work
  • Co-working stipend
  • 6 weeks of vacation
