Cohere is at the forefront of AI development, training and deploying frontier models for developers and enterprises. As a Pre-Training Data Engineer, you'll help build the data infrastructure that powers Cohere's advanced language models. The role combines technical expertise with research innovation and covers end-to-end management of training data, including ingestion, cleaning, filtering, and optimization.
You'll work with diverse data sources, including web data, code data, and multilingual corpora, and ensure their quality and reliability. The position requires strong software engineering skills, particularly in Python, and experience with data processing frameworks like Apache Spark or Apache Beam. You'll design scalable pipelines, conduct data ablations, and experiment with data mixtures to improve model performance.
The company offers an inclusive work environment with offices in major tech hubs like Toronto, San Francisco, New York, London, and Paris, while embracing remote work flexibility. Benefits include comprehensive health coverage, mental health support, generous parental leave, and 6 weeks of vacation. You'll be joining a team of world-class researchers and engineers who are passionate about their craft and committed to scaling intelligence to serve humanity.
This role bridges the gap between raw data and cutting-edge AI models, and your work will contribute directly to improvements in critical training metrics. If you're passionate about transforming data into the foundation of AI systems and want to work on challenging problems with significant impact, this position offers both technical challenge and a meaningful contribution to the future of AI technology.