Mistral AI is seeking a skilled and motivated Web Crawling and Data Indexing Engineer to join its engineering team. The ideal candidate will have a strong background in web scraping, data extraction, and indexing, with a focus on using advanced tools and technologies to gather and process large-scale data from a variety of web sources.
Key Responsibilities:
- Develop and maintain web crawlers using Python libraries such as Beautiful Soup to extract data from target websites.
- Use headless browsers, driven through tools such as the Chrome DevTools Protocol, to automate and optimize data collection processes.
- Collaborate with cross-functional teams to identify, collect, and integrate data from APIs to support business objectives.
- Create and implement efficient parsing patterns using regular expressions, XPaths, and CSS selectors to ensure accurate data extraction.
- Design and manage distributed job queues using technologies such as Redis, Kubernetes, and Postgres to handle large-scale data processing tasks.
- Develop strategies to monitor and ensure data quality, accuracy, and integrity throughout the crawling and indexing process.
- Continuously improve and optimize existing web crawling infrastructure to maximize efficiency and adapt to new challenges.
Qualifications & Profile:
- Bachelor's or master's degree in computer science, information systems, or information technology.
- Strong understanding of web technologies, data structures, and algorithms.
- Knowledge of database management systems and data warehousing.
- Proficiency in programming languages such as Python, Java, or C++.
- Mastery of core web technologies: HTML, CSS, and JavaScript.
- Knowledge of the HTTP and HTTPS protocols.
- Good understanding of data structures and algorithms.
- Knowledge of databases (SQL or NoSQL).
- Understanding of distributed systems and technologies such as Hadoop or Spark.
- Experience with web scraping libraries and frameworks such as Scrapy, Beautiful Soup, Selenium, or MechanicalSoup.
- Understanding of search engine optimization and web crawling.
- Experience applying machine learning to improve the efficiency and accuracy of web crawling.
- Familiarity with tools such as Pandas, NumPy, and Matplotlib for data analysis and visualization.
Mistral AI offers a competitive benefits package and a dynamic work environment. Join a creative, low-ego, team-spirited group that is passionate about AI and fosters a competitive yet fun work atmosphere. The company hires passionate individuals from all over the world, with teams distributed between France, the UK, and the USA.