Web Crawler Multithreaded

Medium
Google
Topics: Graphs, Recursion, Dynamic Programming

Design a multithreaded web crawler. You are given a startUrl and need to crawl all reachable pages from it, staying within the same domain (hostname). Your crawler should:

  1. Extract all URLs from a given webpage.
  2. Visit each URL, ensuring you don't revisit the same URL.
  3. Operate using multiple threads to improve crawling speed.
  4. Stay within the same hostname as the startUrl.

Consider the following constraints and requirements:

  • Input: startUrl (string).
  • Output: A list of all unique URLs crawled, in any order.
  • Threads: Use multiple threads to crawl concurrently. The number of threads should be configurable.
  • Same Hostname: Only crawl URLs that have the same hostname as the startUrl.
  • Avoid Cycles: Do not get stuck in infinite loops due to cyclic links.
  • Synchronization: Ensure thread safety when accessing shared data structures (e.g., visited URLs set, queue of URLs to visit).
  • Error Handling: Handle network errors gracefully.

For example, if startUrl is "http://example.com/", your crawler should only visit pages under the example.com domain. If the initial page contains links to "http://example.com/page1" and "http://example.com/page2", the crawler should visit these as well, and so on, until all reachable pages within the domain have been crawled. A link to "http://differentdomain.com/" should be ignored.

How would you implement this efficiently, considering thread safety and performance?

Solution


Clarifying Questions

When you're asked this question in a real interview, the prompt will often be ambiguous (especially at FAANG). In that case, make sure to ask questions like these:

  1. What is the expected scale of the web to be crawled (number of pages, depth of links), and how many threads should I be using to maximize throughput without overwhelming the system?
  2. Are there any specific domains or URL patterns that should be excluded or prioritized during the crawl?
  3. How should I handle circular dependencies (i.e., cycles of links) to prevent infinite loops?
  4. What is the expected format of the web pages (e.g., HTML, XML), and what information should be extracted from each page?
  5. What constitutes a 'visited' URL? Should I only store the exact URL, or should I normalize URLs (e.g., removing trailing slashes, handling URL encoding) to avoid treating equivalent URLs as distinct?
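
To make the last question concrete, here is a minimal sketch of URL normalization using Python's urllib.parse (the exact rules, such as dropping query strings and fragments, are assumptions to confirm with the interviewer):

from urllib.parse import urlparse, urljoin

def normalize_url(base_url, href):
    # Resolve relative links against the page they were found on.
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    # Keep scheme, host, and path; drop query strings and fragments (assumed policy).
    normalized = parsed.scheme + "://" + parsed.netloc + parsed.path
    # Treat ".../page/" and ".../page" as the same URL.
    return normalized.rstrip('/')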

Brute Force Solution

Approach

Think of a web crawler as a machine that explores the internet link by link. A brute force approach means the crawler visits every single page it can find, without trying to be smart or efficient. It's like a tourist determined to see every shop in town, regardless of how interesting they are.

Here's how the algorithm would work step-by-step:

  1. Start with a single web page, the 'seed' URL.
  2. Visit this page and grab all the links it contains.
  3. Add all the links you just found to a list of pages to visit later.
  4. Pick the next link from your list of pages to visit.
  5. Visit that page and again, grab all the links it contains.
  6. Add those newly found links to the list of pages to visit later.
  7. Keep doing this, visiting each page on the list, grabbing its links, and adding those links to the list.
  8. Keep track of the pages you've already visited so you never visit the same page twice; otherwise cyclic links will trap the crawler in an infinite loop.
  9. Repeat until you have visited all the pages you can find, or you hit some limit.

Code Implementation

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def web_crawler_brute_force(seed_url, max_pages_to_visit=100):
    pages_visited = set()
    pages_to_visit = [seed_url]

    while pages_to_visit and len(pages_visited) < max_pages_to_visit:
        current_page_url = pages_to_visit.pop(0)

        # Prevents visiting the same URL multiple times
        if current_page_url in pages_visited:
            continue

        try:
            response = requests.get(current_page_url)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {current_page_url}: {e}")
            continue

        pages_visited.add(current_page_url)

        soup = BeautifulSoup(response.text, 'html.parser')
        all_links = soup.find_all('a', href=True)

        # Extract and normalize the URLs
        for link in all_links:
            absolute_url = urljoin(current_page_url, link['href'])
            parsed_url = urlparse(absolute_url)
            normalized_url = parsed_url.scheme + "://" + parsed_url.netloc + parsed_url.path

            # Ensure the url is in the same domain
            if urlparse(seed_url).netloc != urlparse(normalized_url).netloc:
                continue

            if normalized_url not in pages_visited and normalized_url not in pages_to_visit:
                # Add only new and valid URLs
                pages_to_visit.append(normalized_url)

        print(f"Visited: {current_page_url}")

    return pages_visited
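
A quick usage sketch (the URL is a placeholder; max_pages_to_visit keeps the example bounded):

if __name__ == '__main__':
    visited = web_crawler_brute_force("http://example.com/", max_pages_to_visit=50)
    print(f"Crawled {len(visited)} pages")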

Big O Analysis

Time Complexity
O(b^d). Let 'b' be the average branching factor (the average number of in-domain links found on each page) and 'd' the maximum depth of the crawl (how many layers of links we follow from the seed URL). In the worst case the crawler explores every reachable page: depth 1 contributes b pages, depth 2 contributes b * b = b^2 pages, and so on, up to b^d pages at depth 'd'. The total across all depths is dominated by the deepest level, so the time complexity is O(b^d).
Space Complexity
O(N). The algorithm uses a queue (or list) of URLs to visit and a set of visited URLs. In the worst case, both can hold close to every unique URL on the website, so the auxiliary space is proportional to the number of unique URLs, denoted N. This gives a space complexity of O(N).
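
To see why the earlier depths do not change the time bound, sum the pages reachable at each level (a sketch assuming a uniform branching factor b):

$$\sum_{i=0}^{d} b^i \;=\; \frac{b^{d+1} - 1}{b - 1} \;=\; O(b^d), \qquad \text{e.g. } b = 10,\ d = 3:\ 1 + 10 + 100 + 1000 = 1111.$$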

Optimal Solution

Approach

To efficiently crawl a website with multiple threads, we distribute the work by assigning URLs to different workers. We need to make sure no worker duplicates effort and the whole crawling process respects website rules and stays organized.

Here's how the algorithm would work step-by-step:

  1. Start with a list of URLs you want to visit, beginning with the website's homepage.
  2. Create several worker threads, each responsible for fetching and processing web pages.
  3. Give each worker a unique ID so they can be tracked, as well as a way to access the shared task queue.
  4. Use a central queue to hold URLs that need to be visited. Workers fetch URLs from this queue.
  5. Implement a mechanism to avoid visiting the same URL multiple times by keeping track of already visited URLs.
  6. Each worker gets a URL from the queue, downloads the corresponding web page, extracts all links from it, and adds those new links to the queue, unless they are already crawled.
  7. Respect the website's 'robots.txt' file to avoid crawling pages the site owner doesn't want you to access.
  8. Put checks in place to prevent your crawler from overloading the website with requests and being blocked.
  9. Use a synchronization mechanism (such as a lock) to maintain the queue and the visited set, so that only one worker thread accesses them at a time and the same URL is never fetched twice.
  10. Continue this process until all URLs in the queue have been processed or some defined criteria has been met (e.g., a certain number of pages crawled or a certain time limit reached).

Code Implementation

import threading
import queue
import requests
import urllib.robotparser
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time

class WebCrawler:
    def __init__(self, root_url, number_of_threads=4, delay=1):
        self.root_url = root_url
        self.number_of_threads = number_of_threads
        self.delay = delay
        self.url_queue = queue.Queue()
        self.url_queue.put(root_url)
        self.crawled_urls = set()
        self.lock = threading.Lock()
        self.robot_parser = urllib.robotparser.RobotFileParser()
        self.robot_parser.set_url(self.root_url.rstrip('/') + '/robots.txt')
        try:
            self.robot_parser.read()
        except Exception:
            print("robots.txt not found or unable to parse.")
        
    def can_fetch(self, url):
        try:
            return self.robot_parser.can_fetch('*', url)
        except Exception:
            return True

    def crawl(self):
        threads = []
        for thread_id in range(self.number_of_threads):
            thread = threading.Thread(target=self.worker, args=(thread_id,))
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

    def worker(self, thread_id):
        while True:
            try:
                current_url = self.url_queue.get(timeout=1)
            except queue.Empty:
                break

            if not self.can_fetch(current_url):
                print(f"Thread {thread_id}: Cannot fetch {current_url} due to robots.txt")
                self.url_queue.task_done()
                continue

            with self.lock:
                if current_url in self.crawled_urls:
                    self.url_queue.task_done()
                    continue
                #Ensuring only one thread accesses and marks the url crawled
                self.crawled_urls.add(current_url)

            try:
                response = requests.get(current_url, timeout=5)
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f"Thread {thread_id}: Error fetching {current_url}: {e}")
                self.url_queue.task_done()
                continue

            new_urls = self.extract_links(current_url, response.text)

            with self.lock:
                for new_url in new_urls:
                    if new_url not in self.crawled_urls:
                        self.url_queue.put(new_url)

            print(f"Thread {thread_id}: Crawled {current_url}, Queue size: {self.url_queue.qsize()}, Crawled: {len(self.crawled_urls)}")
            self.url_queue.task_done()
            time.sleep(self.delay)

    def extract_links(self, base_url, html):
        urls = []
        try:
            soup = BeautifulSoup(html, 'html.parser')
            for link in soup.find_all('a', href=True):
                url = link['href']
                url = self.normalize_url(base_url, url)
                if url and url.startswith(self.root_url):
                    urls.append(url)
        except Exception as e:
            print(f"Error extracting links from {base_url}: {e}")
        return urls

    def normalize_url(self, base_url, url):
        # Resolve relative links, then drop query strings, fragments, and any trailing slash.
        url = urljoin(base_url, url)
        parsed_url = urlparse(url)
        url = parsed_url.scheme + '://' + parsed_url.netloc + parsed_url.path
        url = url.rstrip('/')
        return url

if __name__ == '__main__':
    # Example usage:
    root_website_url = "https://www.example.com"
    #Instantiate WebCrawler with 4 threads and 1 second delay
    crawler = WebCrawler(root_website_url, number_of_threads=4, delay=1)
    # Start the crawling process
    crawler.crawl()
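
The class above manages threads and locking by hand. As an alternative design (a sketch, not part of the original solution), the same level-by-level crawl can be expressed with concurrent.futures.ThreadPoolExecutor; the pool handles thread lifecycle, and because deduplication happens in the main thread between levels, no lock is needed:

from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_with_pool(start_url, max_workers=4):
    hostname = urlparse(start_url).netloc
    visited = {start_url}

    def fetch_links(url):
        # Fetch one page and return the in-domain links it contains.
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
        except requests.exceptions.RequestException:
            return []
        soup = BeautifulSoup(response.text, 'html.parser')
        links = []
        for anchor in soup.find_all('a', href=True):
            # Resolve relative links and drop fragments so "#section" links don't count as new pages.
            absolute = urljoin(url, anchor['href']).split('#', 1)[0]
            if urlparse(absolute).netloc == hostname:
                links.append(absolute)
        return links

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frontier = [start_url]
        while frontier:
            # Fetch the whole frontier in parallel, one task per URL.
            results = pool.map(fetch_links, frontier)
            # Deduplicate in the main thread between levels, so no lock is required.
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited

The trade-off is that each level waits for its slowest page before the next level starts, whereas the queue-based version above keeps every thread busy continuously.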

Big O Analysis

Time Complexity
O(N * M). Let N be the total number of unique URLs crawled and M the average number of links extracted per page. Each of the N URLs is fetched and processed once, and on average M links are extracted from it. Each extracted link is checked against the set of already visited URLs; with a hash set this check is O(1) on average, so the work per page is dominated by fetching it and handling its M links. The total runtime complexity is therefore O(N * M).
Space Complexity
O(N). The primary auxiliary space is the central queue of URLs to visit and the set of already visited URLs. In the worst case, where a large portion of the website is explored, both can hold on the order of N entries, where N is the number of unique URLs crawled, while each worker thread uses only a constant amount of extra memory. The overall space complexity is therefore O(N).

Edge Cases

Each case and how to handle it:

  • Null or empty seed URL: Return an empty list or raise an exception (e.g., ValueError) to indicate invalid input.
  • Seed URL is malformed or invalid: Catch the parsing or request error and log it; proceed with other URLs if possible, or return an empty list if it was the only seed.
  • robots.txt disallows a large portion of the website: Respect the robots.txt rules and skip disallowed URLs, even if that significantly reduces the crawl scope.
  • Circular links (cycles) on the website: Use a set of visited URLs to prevent infinite loops and redundant crawling.
  • Extremely large website (millions of pages): Use breadth-first search with a depth or page limit and efficient data structures for the queue and visited set to bound memory usage; consider a distributed crawler for very large sites.
  • Website returns HTTP error codes (404, 500, etc.): Log the error, retry a limited number of times, or skip the URL.
  • Website has rate limiting or anti-crawling measures: Crawl politely by respecting crawl-delay directives, sending a descriptive user-agent string, and handling 429 (Too Many Requests) responses with exponential backoff (see the sketch after this list).
  • Overflow when calculating queue size or thread pool size: Use sufficiently wide integer types and validate sizes before any narrowing casts (Python integers are arbitrary precision, so this mainly applies to other languages).
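
For the rate-limiting case, a minimal retry-with-backoff sketch (the function name and retry limits are illustrative, not part of the original solution):

import time
import requests

def fetch_with_backoff(url, max_retries=3, base_delay=1.0):
    # Retry on 429 (Too Many Requests), doubling the wait between attempts.
    for attempt in range(max_retries + 1):
        response = requests.get(url, timeout=5)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Gave up on {url} after {max_retries} retries")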