Web Crawling & Indexing Engineer

Mistral AI is a tight-knit, nimble team dedicated to bringing cutting-edge AI technology to the world, with a mission to make AI ubiquitous and open.
Backend
Mid-Level Software Engineer
In-Person
3+ years of experience

Description For Web Crawling & Indexing Engineer

Mistral AI is seeking a skilled and motivated Web Crawling and Data Indexing Engineer to join their dynamic engineering team. The ideal candidate will have a strong background in web scraping, data extraction, and indexing, with a focus on leveraging advanced tools and technologies to gather and process large-scale data from various web sources.

Key Responsibilities:

  • Develop and maintain web crawlers using Python libraries such as Beautiful Soup to extract data from target websites.
  • Utilize headless browsing techniques, such as driving headless Chrome through the Chrome DevTools Protocol, to automate and optimize data collection processes.
  • Collaborate with cross-functional teams to identify, scrape, and integrate data from APIs to support business objectives.
  • Create and implement efficient parsing patterns using regular expressions, XPaths, and CSS selectors to ensure accurate data extraction.
  • Design and manage distributed job queues using technologies such as Redis, Kubernetes, and Postgres to handle large-scale data processing tasks.
  • Develop strategies to monitor and ensure data quality, accuracy, and integrity throughout the crawling and indexing process.
  • Continuously improve and optimize existing web crawling infrastructure to maximize efficiency and adapt to new challenges.

Qualifications & Profile:

  • Bachelor's or master's degree in computer science, information systems, or information technology.
  • Strong understanding of web technologies, data structures, and algorithms.
  • Knowledge of database management systems and data warehousing.
  • Proficiency in programming languages such as Python, Java, or C++.
  • Mastery of web technologies: understanding of HTML, CSS, and JavaScript.
  • Knowledge of HTTP and HTTPS protocols.
  • Good understanding of data structures and algorithms.
  • Knowledge of databases (SQL or NoSQL).
  • Understanding of distributed systems and technologies like Hadoop or Spark.
  • Experience using web scraping libraries and frameworks like Scrapy, Beautiful Soup, Selenium, or MechanicalSoup.
  • Understanding of search engine optimization and web crawling.
  • Experience in machine learning to improve the efficiency and accuracy of web crawling.
  • Familiarity with tools such as Pandas, NumPy, and Matplotlib for data analysis and visualization.

Mistral AI offers a competitive benefits package and a dynamic work environment. Join a creative, low-ego, team-spirited group passionate about AI and fostering a competitive yet fun work atmosphere. The company hires passionate individuals from all over the world, with teams distributed between France, the UK, and the USA.

Last updated a year ago

Responsibilities For Web Crawling & Indexing Engineer

  • Develop and maintain web crawlers using Python libraries
  • Utilize headless browsing techniques for data collection
  • Collaborate with cross-functional teams on data integration
  • Create efficient parsing patterns for accurate data extraction
  • Design and manage distributed job queues for large-scale data processing
  • Develop strategies to ensure data quality and integrity
  • Continuously improve and optimize web crawling infrastructure

Requirements For Web Crawling & Indexing Engineer

Python
Java
Redis
Kubernetes
PostgreSQL
  • Bachelor's or master's degree in computer science, information systems, or information technology
  • Strong understanding of web technologies, data structures, and algorithms
  • Knowledge of database management systems and data warehousing
  • Proficiency in programming languages such as Python, Java, or C++
  • Understanding of HTML, CSS, and JavaScript
  • Knowledge of HTTP and HTTPS protocols
  • Understanding of data structures and algorithms
  • Knowledge of databases (SQL or NoSQL)
  • Understanding of distributed systems and technologies like Hadoop or Spark
  • Experience with web scraping libraries and frameworks
  • Understanding of search engine optimization and web crawling
  • Experience in machine learning for web crawling optimization
  • Familiarity with data analysis and visualization tools

Benefits For Web Crawling & Indexing Engineer

Medical Insurance
Parental Leave
  • Daily lunch vouchers
  • Contribution to a Gympass subscription
  • Monthly contribution to a mobility pass
  • Full health insurance for you and your family
  • Generous parental leave policy