
Web Crawler

Medium
Anthropic
Topics:
Graphs

Given a starting url, implement a web crawler to crawl all websites that are reachable from the starting url. You should only crawl websites under the same domain as the starting url.

Implement the Crawler class:

  • Crawler(string startUrl) Initializes the system with the start url.
  • List<string> crawl(string url) Returns a list of URLs obtained by crawling the given url.

Assume you have an implementation of an HtmlParser class that provides:

  • HtmlParser.getUrls(url) Returns a list of URLs that are all on the same domain as the url.

You should return all URLs in lexicographical order.

Example 1:

Input:
startUrl = "http://news.yahoo.com/news/topics/",
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
Output: ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.yahoo.com/us"]

Constraints:

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • startUrl will be in the same domain as urls.
  • 1 <= startUrl.length <= 300
  • All URLs will be valid URLs that start with either http:// or https://.
  • When you are testing your solution, the filtering mechanism will be implemented after your code runs.

Solution


Clarifying Questions

When you are asked this question in a real interview, it will often be ambiguous (especially at FAANG companies). Make sure to ask questions like these:

  1. What is the maximum depth of the website I should crawl, and how should I handle circular references or links back to already visited pages?
  2. What format should I return the crawled data in? Should it be a list of URLs, a structured object representing the website's content, or something else?
  3. How should I handle different types of files, such as images or PDFs? Should I only crawl HTML pages, or should I attempt to extract text from other file types?
  4. Are there any robots.txt rules or other restrictions on crawling certain parts of the website that I should respect?
  5. How should I handle errors, such as timeouts or broken links, during the crawling process? Should I retry, log the error, or stop crawling the website?

Brute Force Solution

Approach

The brute force web crawler visits every single webpage reachable from a starting point. It methodically explores the web by following every link it finds. This ensures that no possible page is missed, though it might be very inefficient.

Here's how the algorithm would work step-by-step:

  1. Begin with a starting webpage.
  2. Visit this webpage and identify all the links to other webpages that are present on it.
  3. Add all those newly found webpage links to a list of webpages that need to be visited.
  4. Pick a webpage from the list of webpages that need to be visited and visit it.
  5. Repeat the process of identifying all the links to other webpages on this newly visited page and add them to the list.
  6. Continue this process of visiting webpages, identifying links, and adding them to the list until the list of webpages that need to be visited is empty.
  7. Keep track of every webpage you have visited so that you don't visit the same webpage multiple times.
  8. Optionally, you can apply some filtering to the discovered links (e.g., to stay within a specific domain) before adding them to the list of webpages to be visited.
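The domain filter mentioned in step 8 can be sketched with the standard library. This is a minimal sketch; the helper name `same_domain` is my own, and it compares only the network location (host and optional port) of the two URLs:

```python
from urllib.parse import urlparse

def same_domain(url, start_url):
    # Two URLs are on the same domain when their network
    # locations (host[:port]) match exactly. A stricter crawler
    # might also strip a leading "www." before comparing.
    return urlparse(url).netloc == urlparse(start_url).netloc
```

A crawler would call this on each discovered link before appending it to the list of pages to visit.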

Code Implementation

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def web_crawler(start_url, max_pages=100):
    pages_to_visit = [start_url]
    visited_pages = set()
    page_count = 0

    while pages_to_visit and page_count < max_pages:
        current_url = pages_to_visit.pop(0)

        # Avoid re-visiting already processed URLs
        if current_url in visited_pages:
            continue

        try:
            response = requests.get(current_url, timeout=10)
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
            page_content = response.text

            # Mark the current URL as visited to avoid loops
            visited_pages.add(current_url)
            page_count += 1

            # Extract all the links from the current webpage
            new_urls = extract_links(page_content, current_url)

            # Add discovered links for future exploration
            for new_url in new_urls:
                if new_url not in visited_pages and new_url not in pages_to_visit:
                    pages_to_visit.append(new_url)

        except requests.exceptions.RequestException as error:
            print(f"Error crawling {current_url}: {error}")

    return visited_pages

def extract_links(page_content, base_url):
    soup = BeautifulSoup(page_content, 'html.parser')
    links = []
    for anchor in soup.find_all('a', href=True):
        absolute_url = get_absolute_url(base_url, anchor['href'])
        if absolute_url:
            links.append(absolute_url)
    return links

def get_absolute_url(base_url, url):
    # Resolve relative links against the current page and
    # keep only http(s) URLs (dropping mailto:, javascript:, etc.)
    absolute_url = urljoin(base_url, url)
    if urlparse(absolute_url).scheme not in ('http', 'https'):
        return None
    return absolute_url

Big(O) Analysis

Time Complexity
O(V + E). The web crawler visits each webpage (vertex) and follows the links (edges) between them. Let V be the number of webpages visited and E the number of links followed. The runtime is proportional to visiting every vertex plus exploring every edge emanating from those vertices, giving O(V + E).
Space Complexity
O(N). The algorithm uses a queue for webpages still to be visited and a set for webpages already visited. In the worst case, where the crawler discovers almost all reachable webpages before finishing, both structures hold a number of entries proportional to N, the total number of reachable webpages, so the auxiliary space is O(N).

Optimal Solution

Approach

The most efficient way to crawl a website is to explore it layer by layer, like a breadth-first traversal of a tree. By remembering which pages we have already seen, we avoid repeatedly visiting the same pages while still reaching every page.

Here's how the algorithm would work step-by-step:

  1. Start with a list containing just the initial website address (the 'seed' URL).
  2. Take the first address from the list.
  3. Visit the webpage at that address and collect all the other website addresses (URLs) found on that page.
  4. Before adding a newly found website address to our list, check if we've already visited it or if it's already waiting to be visited. If we have, skip it.
  5. If it's a new address, add it to the end of our list.
  6. Repeat steps 2 through 5 until the list of website addresses to visit is empty.
  7. By processing webpages layer by layer and remembering the addresses we've already seen, we efficiently explore the website without getting stuck in loops or wasting time.

Code Implementation

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def web_crawler(seed_url, max_urls=100):
    visited_urls = set()
    urls_to_visit = [seed_url]
    crawled_urls_count = 0

    while urls_to_visit and crawled_urls_count < max_urls:
        current_url = urls_to_visit.pop(0)

        # Avoid re-crawling already processed URLs.
        if current_url in visited_urls:
            continue

        # Mark the URL as visited even if the request below fails,
        # so a broken link is never retried in an infinite loop.
        visited_urls.add(current_url)
        crawled_urls_count += 1

        try:
            response = requests.get(current_url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # This extracts all the links from the current webpage.
            for anchor in soup.find_all('a', href=True):
                absolute_url = urljoin(current_url, anchor['href'])

                # Normalize URLs (dropping query strings and fragments)
                # to prevent duplicates.
                parsed_url = urlparse(absolute_url)
                normalized_url = parsed_url.scheme + "://" + parsed_url.netloc + parsed_url.path

                # Queue only URLs that are neither visited nor already queued.
                if normalized_url not in visited_urls and normalized_url not in urls_to_visit:
                    urls_to_visit.append(normalized_url)

        except requests.exceptions.RequestException as error:
            print(f"Error crawling {current_url}: {error}")

    return visited_urls
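The code above fetches live pages with requests, but the judged version of this problem hands you an HtmlParser instead. A minimal sketch of the same BFS against the problem's interface might look like the following; passing the parser into the constructor is my own assumption, since the stated constructor takes only startUrl:

```python
from urllib.parse import urlparse

class Crawler:
    def __init__(self, start_url, html_parser):
        self.start_url = start_url
        # Only pages on the start URL's domain should be crawled.
        self.domain = urlparse(start_url).netloc
        self.parser = html_parser

    def crawl(self, url):
        # Breadth-first search from the given URL, restricted to
        # the start URL's domain, with a visited set to break cycles.
        visited = {url}
        queue = [url]
        while queue:
            current = queue.pop(0)
            for next_url in self.parser.getUrls(current):
                on_domain = urlparse(next_url).netloc == self.domain
                if on_domain and next_url not in visited:
                    visited.add(next_url)
                    queue.append(next_url)
        # The problem asks for lexicographical order.
        return sorted(visited)
```

Running this against the link graph from Example 1 yields exactly the four yahoo.com URLs, sorted, with the google.com link filtered out by the domain check.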

Big(O) Analysis

Time Complexity
O(n*m). Let n be the number of unique URLs on the website and m the average number of links on each page. The algorithm visits each of the n URLs once, and extracting the links from a page takes O(m) time on average, so the total time complexity is O(n*m).
Space Complexity
O(N). The algorithm uses a queue (list) of URLs still to visit and a set of visited URLs. In the worst case, both structures hold all N unique URLs on the website, so the auxiliary space is proportional to the number of URLs, giving O(N).

Edge Cases

  • Null or empty start URL: Return an empty list or throw an IllegalArgumentException, as there is no starting point for crawling.
  • Invalid URL format in the start URL or discovered links: Use a try-catch block when creating URL objects to gracefully handle MalformedURLException, log the error, and skip the invalid URL.
  • Circular links / cycles in the website graph: Maintain a 'visited' set of URLs to prevent infinite loops and redundant crawling of the same pages.
  • Website returning HTTP error codes (404, 500, etc.): Check the HTTP status code before parsing the page content; skip processing and log the error if it indicates failure.
  • Maximum crawl depth reached: Stop crawling when the maximum allowed depth is reached to prevent excessive resource consumption.
  • robots.txt disallows crawling certain paths: Implement Robots Exclusion Protocol parsing to respect the website's crawling restrictions before visiting any URL.
  • Large number of outgoing links on a single page causing memory issues: Limit the number of links extracted from a single page to prevent memory exhaustion.
  • Network issues causing connection timeouts: Retry failed requests with exponential backoff to improve resilience.
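The robots.txt case can be handled with Python's standard library. A minimal sketch, parsing a hypothetical rule set inline (a real crawler would instead call rp.set_url(...) and rp.read() to fetch the site's actual robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; normally fetched from <site>/robots.txt.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed_to_crawl(url, user_agent="*"):
    # Consult the parsed Robots Exclusion Protocol rules
    # before visiting a URL.
    return rp.can_fetch(user_agent, url)
```

A crawler would call `allowed_to_crawl` on each URL before fetching it, skipping any path the site has disallowed.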