
ML Engineer, Large-scale AI Infrastructure

A Silicon Valley startup combining Generative AI with biology and medicine, pioneering pan-modal Large Biological Models (LBM) for healthcare transformation.
Machine Learning · Mid-Level Software Engineer · In-Person · 2+ years of experience · AI · Healthcare · Biotech

Description for ML Engineer, Large-scale AI Infrastructure

GenBio is a pioneering startup at the intersection of Generative AI and biomedicine, headquartered in Silicon Valley with a presence in Paris. We're revolutionizing healthcare through Large Biological Models (LBM). Our team of visionary scientists, engineers, and entrepreneurs is dedicated to decoding biology holistically and enabling next-generation, life-transforming solutions.

As our ML Engineer for Large-scale AI Infrastructure, you'll be at the forefront of building and maintaining the computational backbone that powers our breakthrough research. You'll work with cutting-edge GPU clusters, implement distributed training systems, and optimize performance for our large-scale AI models. The role combines machine learning infrastructure with high-performance computing, and calls for both technical depth and strong collaboration skills.

The ideal candidate will bring strong experience in GPU cluster management, distributed systems, and deep learning frameworks. You'll work alongside leading minds in AI and biological science, contributing to a mission that could fundamentally transform healthcare and biological research. This is an opportunity to join an exceptionally strong R&D team leading the application of LLMs and generative AI to biomedicine.

We offer a unique environment where innovation meets impact, and your work will directly contribute to advancing the future of biology and medicine through AI. Join us in our mission to pioneer new paradigms in healthcare, working with state-of-the-art technology and alongside world-class experts in both AI and biological sciences.


Responsibilities for ML Engineer, Large-scale AI Infrastructure

  • Design, deploy, and maintain high-performance GPU clusters
  • Implement distributed computing techniques for parallel training (see the sketch after this list)
  • Fine-tune GPU clusters and deep learning frameworks for optimal performance
  • Collaborate with data scientists and machine learning engineers
  • Ensure GPU clusters can scale effectively
  • Troubleshoot and resolve issues related to GPU clusters
  • Create and maintain documentation for GPU cluster configuration
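
To give a concrete sense of the parallel-training work described above, here is a minimal, illustrative sketch of a data-parallel training loop using PyTorch DistributedDataParallel. The toy model, batch size, and loop are placeholders chosen for brevity, not details of GenBio's actual stack.

```python
# Minimal data-parallel training sketch (illustrative only).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # placeholder training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across workers here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this is typically launched with torchrun (for example `torchrun --nproc_per_node=8 train.py` on each node), which provides the rank and world-size environment variables each worker reads at startup.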

Requirements for ML Engineer, Large-scale AI Infrastructure

Python · Kubernetes
  • Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning
  • 2+ years of proven experience managing GPU clusters
  • Strong expertise in distributed deep learning and parallel training techniques
  • Proficiency with PyTorch, Megatron-LM, and DeepSpeed (see the configuration sketch after this list)
  • Programming skills in Python and experience with GPU-accelerated libraries
  • Knowledge of performance profiling and optimization tools for HPC and deep learning
  • Familiarity with resource management and scheduling systems
  • Strong background in distributed systems, cloud computing, and containerization
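
As one hedged illustration of the PyTorch and DeepSpeed proficiency requested above, the sketch below shows a minimal ZeRO stage-2 configuration and engine initialization. All values and the toy model are placeholders, not GenBio's actual settings.

```python
# Illustrative DeepSpeed ZeRO stage-2 setup (placeholder values throughout).
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# In the training loop, engine.backward(loss) and engine.step() replace the
# usual loss.backward() and optimizer.step() calls.
```

A job like this is usually started with the DeepSpeed launcher (for example `deepspeed --num_gpus=8 train.py`), which handles per-GPU process creation and distributed initialization.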
