Member of Technical Staff, Pre-Training Data Engineer

AI company training and deploying frontier models for developers and enterprises building AI systems for content generation, semantic search, RAG, and agents.
Toronto, ON, Canada; Ottawa, ON, Canada; San Francisco, CA, USA
Data
Mid-Level Software Engineer
Remote
AI

Description For Member of Technical Staff, Pre-Training Data Engineer

Cohere is at the forefront of AI development, training and deploying frontier models for developers and enterprises. As a Pre-Training Data Engineer, you'll be instrumental in developing the data infrastructure that powers Cohere's advanced language models. The role combines technical expertise with research innovation, focusing on end-to-end management of training data including ingestion, cleaning, filtering, and optimization.

You'll work with diverse data sources including web data, code data, and multilingual corpora, ensuring their quality and reliability. The position requires strong software engineering skills, particularly in Python, and experience with data processing frameworks like Apache Spark or Apache Beam. You'll be designing scalable pipelines, conducting data ablations, and experimenting with data mixtures to enhance model performance.

The company offers an inclusive work environment with offices in major tech hubs like Toronto, San Francisco, New York, London, and Paris, while embracing remote work flexibility. Benefits include comprehensive health coverage, mental health support, generous parental leave, and 6 weeks of vacation. You'll be joining a team of world-class researchers and engineers who are passionate about their craft and committed to scaling intelligence to serve humanity.

This role bridges the gap between raw data and cutting-edge AI models, and contributes directly to improvements in critical training metrics. If you're passionate about transforming data into the foundation of AI systems and want to work on challenging, high-impact problems, this position combines technical challenge with meaningful contribution to the future of AI.

Responsibilities For Member of Technical Staff, Pre-Training Data Engineer

  • Design and build scalable data pipelines to ingest, clean, filter, and optimize diverse datasets
  • Conduct data ablations to assess data quality and experiment with data mixtures
  • Develop robust data modeling techniques for optimal training efficiency
  • Research and implement innovative data curation methods
  • Collaborate with cross-functional teams to ensure data pipelines meet requirements

Requirements For Member of Technical Staff, Pre-Training Data Engineer

  • Strong software engineering skills, with proficiency in Python and experience building data pipelines
  • Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
  • Experience working with large-scale datasets, including web data, code data, and multilingual corpora
  • Knowledge of data quality assessment techniques and experimentation with data mixtures
  • A passion for bridging research and engineering to solve complex data-related challenges in AI model training

Benefits For Member of Technical Staff, Pre-Training Data Engineer

  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits
  • Mental health budget
  • 100% Parental Leave top-up for 6 months (Canada, US, and UK)
  • Personal enrichment benefits for arts, culture, fitness, and workspace improvement
  • Remote-flexible work environment
  • Co-working stipend
  • 6 weeks of vacation
