Web Crawling & Indexing Engineer

Mistral AI

Mistral AI is a tight-knit, nimble team dedicated to bringing cutting-edge AI technology to the world, with a mission to make AI ubiquitous and open.

London, UK

Backend

Mid-Level Software Engineer

In-Person

3+ years of experience


Description For Web Crawling & Indexing Engineer

Mistral AI is seeking a skilled and motivated Web Crawling and Data Indexing Engineer to join their dynamic engineering team. The ideal candidate will have a strong background in web scraping, data extraction, and indexing, with a focus on leveraging advanced tools and technologies to gather and process large-scale data from various web sources.

Key Responsibilities:

Develop and maintain web crawlers using Python libraries such as Beautiful Soup to extract data from target websites.
Utilize headless browsing techniques, such as headless Chrome controlled through the Chrome DevTools Protocol, to automate and optimize data collection processes.
Collaborate with cross-functional teams to identify, scrape, and integrate data from APIs to support business objectives.
Create and implement efficient parsing patterns using regular expressions, XPaths, and CSS selectors to ensure accurate data extraction.
Design and manage distributed job queues using technologies such as Redis, Kubernetes, and Postgres to handle large-scale data processing tasks.
Develop strategies to monitor and ensure data quality, accuracy, and integrity throughout the crawling and indexing process.
Continuously improve and optimize existing web crawling infrastructure to maximize efficiency and adapt to new challenges.
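
To give a sense of the parsing work described above, here is a minimal sketch that combines an XPath-style query with a regular expression. It is only an illustration, not Mistral AI's tooling: the sample markup, the `extract_products` helper, and the `PRICE_RE` pattern are all hypothetical, and it uses only the Python standard library, whose `xml.etree.ElementTree` module supports a limited XPath subset and requires well-formed markup. Real pages would more likely call for a lenient parser such as Beautiful Soup, which also offers CSS selectors.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <div class="product">
      <a href="/items/1">Widget</a>
      <span class="price">Price: 19.99 EUR</span>
    </div>
    <div class="product">
      <a href="/items/2">Gadget</a>
      <span class="price">Price: 5.00 EUR</span>
    </div>
  </body>
</html>
"""

# Regular expression (an assumed format): an amount with two decimals
# followed by a three-letter currency code.
PRICE_RE = re.compile(r"(\d+\.\d{2})\s*([A-Z]{3})")


def extract_products(markup):
    """Return one dict per product block found in the markup."""
    root = ET.fromstring(markup.strip())
    products = []
    # XPath-style query: every <div class="product"> anywhere in the tree.
    for node in root.findall(".//div[@class='product']"):
        link = node.find("a")
        price_text = node.findtext("span[@class='price']", default="")
        match = PRICE_RE.search(price_text)
        products.append({
            "name": link.text.strip(),
            "url": link.get("href"),
            "price": float(match.group(1)) if match else None,
            "currency": match.group(2) if match else None,
        })
    return products


if __name__ == "__main__":
    for product in extract_products(SAMPLE_HTML):
        print(product)
```

Keeping the structural query (XPath) separate from the text pattern (regular expression) means a change in page layout and a change in price formatting can each be fixed in one place.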

Qualifications & Profile:

Bachelor's or master's degree in computer science, information systems, or information technology.
Strong understanding of web technologies, data structures, and algorithms.
Knowledge of database management systems and data warehousing.
Proficiency in programming languages such as Python, Java, or C++.
Mastery of web technologies: understanding of HTML, CSS, and JavaScript.
Knowledge of HTTP and HTTPS protocols.
Good understanding of data structures and algorithms.
Knowledge of databases (SQL or NoSQL).
Understanding of distributed systems and technologies like Hadoop or Spark.
Experience using web scraping libraries and frameworks such as Scrapy, Beautiful Soup, Selenium, or MechanicalSoup.
Understanding of search engine optimization and web crawling.
Experience in machine learning to improve the efficiency and accuracy of web crawling.
Familiarity with tools such as Pandas, NumPy, and Matplotlib for data analysis and visualization.

Mistral AI offers a competitive benefits package and a dynamic work environment. Join a creative, low-ego, team-spirited group that is passionate about AI and fosters a competitive yet fun work atmosphere. The company hires passionate individuals from all over the world, with teams distributed between France, the UK, and the USA.

Last updated a year ago

Responsibilities For Web Crawling & Indexing Engineer

Develop and maintain web crawlers using Python libraries
Utilize headless browsing techniques for data collection
Collaborate with cross-functional teams on data integration
Create efficient parsing patterns for accurate data extraction
Design and manage distributed job queues for large-scale data processing
Develop strategies to ensure data quality and integrity
Continuously improve and optimize web crawling infrastructure

Requirements For Web Crawling & Indexing Engineer

Python

Java

Redis

Kubernetes

PostgreSQL

Bachelor's or master's degree in computer science, information systems, or information technology
Strong understanding of web technologies, data structures, and algorithms
Knowledge of database management systems and data warehousing
Proficiency in programming languages such as Python, Java, or C++
Understanding of HTML, CSS, and JavaScript
Knowledge of HTTP and HTTPS protocols
Understanding of data structures and algorithms
Knowledge of databases (SQL or NoSQL)
Understanding of distributed systems and technologies like Hadoop or Spark
Experience with web scraping libraries and frameworks
Understanding of search engine optimization and web crawling
Experience in machine learning for web crawling optimization
Familiarity with data analysis and visualization tools

Benefits For Web Crawling & Indexing Engineer

Medical Insurance

Parental Leave

Daily lunch vouchers
Contribution to a Gympass subscription
Monthly contribution to a mobility pass
Full health insurance for you and your family
Generous parental leave policy