
ML Engineer, Large-scale AI Infrastructure

A Silicon Valley startup combining Generative AI with biology and medicine, pioneering pan-modal Large Biological Models (LBM) for healthcare transformation.
Machine Learning · Mid-Level Software Engineer · In-Person · 2+ years of experience · AI · Healthcare · Biotech

Description for ML Engineer, Large-scale AI Infrastructure

GenBio is a pioneering startup at the intersection of Generative AI and biomedicine, headquartered in Silicon Valley with a presence in Paris. We're revolutionizing healthcare through Large Biological Models (LBM). Our team of visionary scientists, engineers, and entrepreneurs is dedicated to decoding biology holistically and enabling next-generation, life-transforming solutions.

As our ML Engineer for Large-scale AI Infrastructure, you'll be at the forefront of building and maintaining the computational backbone that powers our breakthrough research. You'll work with cutting-edge GPU clusters, implement distributed training systems, and optimize performance for our large-scale AI models. The role combines machine learning infrastructure with high-performance computing, and calls for both technical depth and strong collaboration skills.

The ideal candidate will bring strong experience in GPU cluster management, distributed systems, and deep learning frameworks. You'll work alongside leading minds in AI and biological science, contributing to a mission that could fundamentally transform healthcare and biological research. This is an opportunity to join an exceptionally strong R&D team leading the application of LLMs and generative AI to biomedicine.

We offer a unique environment where innovation meets impact, and your work will directly contribute to advancing the future of biology and medicine through AI. Join us in our mission to pioneer new paradigms in healthcare, working with state-of-the-art technology and alongside world-class experts in both AI and biological sciences.


Responsibilities for ML Engineer, Large-scale AI Infrastructure

  • Design, deploy, and maintain high-performance GPU clusters
  • Implement distributed computing techniques for parallel training (see the sketch after this list)
  • Fine-tune GPU clusters and deep learning frameworks for optimal performance
  • Collaborate with data scientists and machine learning engineers
  • Ensure GPU clusters can scale effectively
  • Troubleshoot and resolve issues related to GPU clusters
  • Create and maintain documentation for GPU cluster configuration
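
To give a concrete sense of the parallel-training work described above, here is a minimal, illustrative sketch of a data-parallel training loop using PyTorch DistributedDataParallel. The toy model, batch size, and loop are placeholders chosen for brevity, not details of GenBio's actual stack.

```python
# Minimal data-parallel training sketch (illustrative only).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):  # placeholder training loop
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across workers here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this is typically launched with torchrun (for example `torchrun --nproc_per_node=8 train.py` on each node), which provides the rank and world-size environment variables each worker reads at startup.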

Requirements for ML Engineer, Large-scale AI Infrastructure

Python · Kubernetes
  • Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning
  • 2+ years of proven experience managing GPU clusters
  • Strong expertise in distributed deep learning and parallel training techniques
  • Proficiency with PyTorch, Megatron-LM, and DeepSpeed (see the configuration sketch after this list)
  • Programming skills in Python and experience with GPU-accelerated libraries
  • Knowledge of performance profiling and optimization tools for HPC and deep learning
  • Familiarity with resource management and scheduling systems
  • Strong background in distributed systems, cloud computing, and containerization
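
As one hedged illustration of the PyTorch and DeepSpeed proficiency requested above, the sketch below shows a minimal ZeRO stage-2 configuration and engine initialization. All values and the toy model are placeholders, not GenBio's actual settings.

```python
# Illustrative DeepSpeed ZeRO stage-2 setup (placeholder values throughout).
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer states and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)  # placeholder model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# In the training loop, engine.backward(loss) and engine.step() replace the
# usual loss.backward() and optimizer.step() calls.
```

A job like this is usually started with the DeepSpeed launcher (for example `deepspeed --num_gpus=8 train.py`), which handles per-GPU process creation and distributed initialization.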
