Design a multithreaded web crawler. Given a starting URL and a number of threads, implement a web crawler that explores the website and extracts all unique URLs.
Details:
Example: Let's say we want to crawl https://www.example.com with 4 threads. The crawler should start from https://www.example.com and extract every unique URL it can reach on that site.
Clarifications/Assumptions: the crawl is limited to a single domain, and the crawler should behave politely (respect robots.txt, apply rate limiting). How would you implement this, considering thread safety, efficiency, and avoiding cycles?
When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask clarifying questions in that case, for example about the crawl scope (one domain or the whole web), politeness requirements (robots.txt, rate limiting), and any limits on pages or depth.
Think of a web crawler as a machine that explores the internet link by link. A brute force approach means the crawler visits every single page it can find, without trying to be smart or efficient. It's like a tourist determined to see every shop in town, regardless of how interesting they are.
Here's how the algorithm would work step-by-step:
1. Start with a queue holding only the seed URL and an empty set of visited pages.
2. Take the next URL off the front of the queue; skip it if it was already visited.
3. Fetch the page, mark the URL as visited, and parse the HTML for links.
4. Normalize each link to an absolute URL, keep only links on the seed's domain, and add any not yet seen to the queue.
5. Repeat until the queue is empty or the page limit is reached, then return the set of visited URLs.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def web_crawler_brute_force(seed_url, max_pages_to_visit=100):
    pages_visited = set()
    pages_to_visit = [seed_url]
    while pages_to_visit and len(pages_visited) < max_pages_to_visit:
        current_page_url = pages_to_visit.pop(0)
        # Prevents visiting the same URL multiple times
        if current_page_url in pages_visited:
            continue
        try:
            response = requests.get(current_page_url)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {current_page_url}: {e}")
            continue
        pages_visited.add(current_page_url)
        soup = BeautifulSoup(response.text, 'html.parser')
        all_links = soup.find_all('a', href=True)
        # Extract and normalize the URLs
        for link in all_links:
            absolute_url = urljoin(current_page_url, link['href'])
            parsed_url = urlparse(absolute_url)
            normalized_url = parsed_url.scheme + "://" + parsed_url.netloc + parsed_url.path
            # Ensure the URL is in the same domain
            if urlparse(seed_url).netloc != urlparse(normalized_url).netloc:
                continue
            if normalized_url not in pages_visited and normalized_url not in pages_to_visit:
                # Add only new and valid URLs
                pages_to_visit.append(normalized_url)
        print(f"Visited: {current_page_url}")
    return pages_visited
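For reference, a minimal way to run this function could look like the sketch below; the 50-page limit is an illustrative value, not part of the original statement.

if __name__ == '__main__':
    # Crawl the example site, visiting at most 50 pages
    found_urls = web_crawler_brute_force("https://www.example.com", max_pages_to_visit=50)
    print(f"Discovered {len(found_urls)} unique URLs")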
To efficiently crawl a website with multiple threads, we distribute the work by assigning URLs to different workers. We need to make sure no two workers duplicate effort, that shared state is protected from race conditions, and that the crawl respects website rules such as robots.txt and crawl delays.
Here's how the algorithm would work step-by-step:
1. Create a thread-safe queue seeded with the root URL, a shared set of crawled URLs, and a lock to guard it.
2. Read robots.txt once so workers can check whether a URL may be fetched.
3. Start the requested number of worker threads; each repeatedly takes a URL from the queue.
4. Under the lock, skip URLs that were already crawled and mark the current one as crawled.
5. Fetch the page, extract and normalize same-domain links, and enqueue any that have not been crawled yet.
6. Sleep for the configured delay between requests, and let each worker exit once the queue stays empty.
import threading
import queue
import time
import requests
import urllib.robotparser
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

class WebCrawler:
    def __init__(self, root_url, number_of_threads=4, delay=1):
        self.root_url = root_url.rstrip('/')
        self.number_of_threads = number_of_threads
        self.delay = delay
        self.url_queue = queue.Queue()
        self.url_queue.put(self.root_url)
        self.crawled_urls = set()
        self.lock = threading.Lock()
        # robots.txt always lives at the domain root
        self.robot_parser = urllib.robotparser.RobotFileParser()
        self.robot_parser.set_url(urljoin(self.root_url, '/robots.txt'))
        try:
            self.robot_parser.read()
        except Exception:
            print("robots.txt not found or unable to parse.")

    def can_fetch(self, url):
        try:
            return self.robot_parser.can_fetch('*', url)
        except Exception:
            return True

    def crawl(self):
        threads = []
        for thread_id in range(self.number_of_threads):
            thread = threading.Thread(target=self.worker, args=(thread_id,))
            threads.append(thread)
            thread.start()
        for thread in threads:
            thread.join()

    def worker(self, thread_id):
        while True:
            try:
                current_url = self.url_queue.get(timeout=1)
            except queue.Empty:
                # If the queue stays empty for a second, assume the crawl is finished
                break
            if not self.can_fetch(current_url):
                print(f"Thread {thread_id}: Cannot fetch {current_url} due to robots.txt")
                self.url_queue.task_done()
                continue
            with self.lock:
                if current_url in self.crawled_urls:
                    self.url_queue.task_done()
                    continue
                # Ensure only one thread marks this URL as crawled
                self.crawled_urls.add(current_url)
            try:
                response = requests.get(current_url, timeout=5)
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f"Thread {thread_id}: Error fetching {current_url}: {e}")
                self.url_queue.task_done()
                continue
            new_urls = self.extract_links(current_url, response.text)
            with self.lock:
                for new_url in new_urls:
                    if new_url not in self.crawled_urls:
                        self.url_queue.put(new_url)
            print(f"Thread {thread_id}: Crawled {current_url}, Queue size: {self.url_queue.qsize()}, Crawled: {len(self.crawled_urls)}")
            self.url_queue.task_done()
            time.sleep(self.delay)

    def extract_links(self, base_url, html):
        urls = []
        try:
            soup = BeautifulSoup(html, 'html.parser')
            for link in soup.find_all('a', href=True):
                url = self.normalize_url(base_url, link['href'])
                if url and url.startswith(self.root_url):
                    urls.append(url)
        except Exception as e:
            print(f"Error extracting links from {base_url}: {e}")
        return urls

    def normalize_url(self, base_url, url):
        # Resolve relative links and strip query strings/fragments so duplicates collapse to one URL
        url = urljoin(base_url, url)
        parsed_url = urlparse(url)
        url = parsed_url.scheme + '://' + parsed_url.netloc + parsed_url.path
        return url.rstrip('/')

if __name__ == '__main__':
    # Example usage:
    root_website_url = "https://www.example.com"
    # Instantiate WebCrawler with 4 threads and a 1-second delay between requests
    crawler = WebCrawler(root_website_url, number_of_threads=4, delay=1)
    # Start the crawling process
    crawler.crawl()
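As a design note, the same worker pattern can also be expressed with the standard library's concurrent.futures.ThreadPoolExecutor, which manages thread creation and shutdown for you. The sketch below is a hedged alternative, not the implementation above: crawl_with_pool and fetch_and_extract are hypothetical names, and pages are processed in batches so that only the main thread touches the shared sets (no lock needed).

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def fetch_and_extract(url):
    # Fetch one page and return the normalized links found on it
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    links = set()
    for tag in soup.find_all('a', href=True):
        absolute = urljoin(url, tag['href'])
        parsed = urlparse(absolute)
        links.add(parsed.scheme + '://' + parsed.netloc + parsed.path.rstrip('/'))
    return links

def crawl_with_pool(seed_url, number_of_threads=4, max_pages=100):
    domain = urlparse(seed_url).netloc
    visited = set()
    frontier = {seed_url.rstrip('/')}
    with ThreadPoolExecutor(max_workers=number_of_threads) as pool:
        while frontier and len(visited) < max_pages:
            # Hand one batch of URLs to the pool and wait for the results
            batch = list(frontier)[:number_of_threads]
            frontier.difference_update(batch)
            visited.update(batch)
            futures = {pool.submit(fetch_and_extract, url): url for url in batch}
            for future in as_completed(futures):
                try:
                    for link in future.result():
                        if urlparse(link).netloc == domain and link not in visited:
                            frontier.add(link)
                except requests.exceptions.RequestException as e:
                    print(f"Error fetching {futures[future]}: {e}")
    return visited

The batch-oriented loop trades some parallelism (a slow page can hold back its batch) for simpler state management, since visited and frontier are never mutated concurrently.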
| Case | How to Handle |
| --- | --- |
| Null or empty seed URL | Return an empty result or raise a ValueError to signal invalid input. |
| Seed URL is malformed or invalid | Catch the URL-parsing or request error and log it; proceed with other seeds if possible, or return an empty result if it is the only seed. |
| robots.txt prevents access to a large portion of the website | Respect the robots.txt rules and skip disallowed URLs, even if that significantly reduces the crawl scope. |
| Circular links (cycles) on the website | Track visited URLs in a set to prevent infinite loops and redundant crawling. |
| Extremely large website (millions of pages) | Crawl breadth-first with a depth limit and memory-efficient data structures for the queue and visited set (see the depth-limit sketch after this table); consider a distributed design for very large crawls. |
| Website returns HTTP error codes (404, 500, etc.) | Log the error, retry a limited number of times, or skip the URL. |
| Website has rate limiting or anti-crawling measures | Crawl politely: respect crawl-delay directives, send a descriptive User-Agent string, and handle 429 (Too Many Requests) responses with exponential backoff (see the backoff sketch after this table). |
| Integer overflow when calculating queue size or thread pool size | Use appropriately sized integer types for queue and thread-pool calculations and validate the values before use. |
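For the depth-limit row above, one common option is to store (url, depth) pairs in the crawl queue instead of bare URLs. This is a minimal single-threaded sketch under that assumption; crawl_with_depth_limit and its max_depth parameter are illustrative names, not part of the original statement.

from collections import deque
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup

def crawl_with_depth_limit(seed_url, max_depth=2, max_pages=100):
    domain = urlparse(seed_url).netloc
    visited = set()
    # Each queue entry carries the URL and how many links were followed to reach it
    url_queue = deque([(seed_url, 0)])
    while url_queue and len(visited) < max_pages:
        url, depth = url_queue.popleft()
        if url in visited or depth > max_depth:
            continue
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
        except requests.exceptions.RequestException:
            continue
        visited.add(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for tag in soup.find_all('a', href=True):
            absolute = urljoin(url, tag['href'])
            parsed = urlparse(absolute)
            normalized = parsed.scheme + '://' + parsed.netloc + parsed.path
            # Enqueue same-domain links one level deeper
            if parsed.netloc == domain and normalized not in visited:
                url_queue.append((normalized, depth + 1))
    return visited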
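For the rate-limiting row, a small retry helper could wrap each request and back off exponentially on 429 responses. This is a hedged sketch: fetch_with_backoff, the retry count, the base delay, and the User-Agent string are illustrative choices, and the threaded crawler's worker() could call it in place of requests.get.

import time
import requests

def fetch_with_backoff(url, max_retries=3, base_delay=1.0):
    # Identify the crawler politely; the value here is an illustrative placeholder
    headers = {"User-Agent": "ExampleCrawler/1.0"}
    for attempt in range(max_retries + 1):
        response = requests.get(url, headers=headers, timeout=5)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor Retry-After when it is given in seconds; otherwise double the wait each attempt
        retry_after = response.headers.get("Retry-After")
        wait = float(retry_after) if retry_after and retry_after.isdigit() else base_delay * (2 ** attempt)
        time.sleep(wait)
    raise requests.exceptions.RetryError(f"Giving up on {url} after {max_retries} retries")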