Given a starting URL `startUrl`, implement a web crawler that crawls all pages reachable from the starting URL. You should only crawl pages under the same domain as `startUrl`.

Implement the `Crawler` class:

- `Crawler(string startUrl)` Initializes the system with the start URL.
- `List<string> crawl(string url)` Returns a list of URLs obtained by crawling the given URL.

Assume you have an implementation of an `HtmlParser` class that provides:

- `HtmlParser.getUrls(url)` Returns a list of all URLs found on the webpage at the given URL.

You should return all URLs in lexicographical order.
Example 1:

Input: startUrl = "http://news.yahoo.com/news/topics/", urls = ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com", "http://news.yahoo.com/us"]
Output: ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.yahoo.com/us"]
Constraints:

- 1 <= urls.length <= 1000
- 1 <= urls[i].length <= 300
- 1 <= startUrl.length <= 300
- startUrl will be in the same domain as urls.
- Each URL begins with `http://` or `https://`.
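Before looking at the full crawler implementations below, here is a minimal sketch of the `Crawler` interface itself. The problem leaves the wiring unspecified, so this sketch assumes the parser is injected into the constructor; `FakeHtmlParser` and `link_graph` are hypothetical stand-ins built from Example 1's data, and "same domain" is interpreted as "same hostname".

```python
from urllib.parse import urlparse


class FakeHtmlParser:
    """Hypothetical stand-in for the HtmlParser the problem assumes:
    getUrls(url) returns the links found on that page."""

    def __init__(self, link_graph):
        self.link_graph = link_graph

    def getUrls(self, url):
        return self.link_graph.get(url, [])


class Crawler:
    def __init__(self, start_url, html_parser):
        self.html_parser = html_parser
        # "Same domain" here means the same hostname.
        self.host = urlparse(start_url).netloc

    def crawl(self, url):
        visited = {url}
        stack = [url]
        while stack:
            current = stack.pop()
            for next_url in self.html_parser.getUrls(current):
                # Follow only links on the start URL's hostname.
                if urlparse(next_url).netloc == self.host and next_url not in visited:
                    visited.add(next_url)
                    stack.append(next_url)
        return sorted(visited)  # lexicographical order, as required


# Hypothetical link graph reproducing Example 1.
link_graph = {
    "http://news.yahoo.com/news/topics/": ["http://news.yahoo.com/news",
                                           "http://news.google.com"],
    "http://news.yahoo.com/news": ["http://news.yahoo.com",
                                   "http://news.yahoo.com/us"],
}
crawler = Crawler("http://news.yahoo.com/news/topics/", FakeHtmlParser(link_graph))
result = crawler.crawl("http://news.yahoo.com/news/topics/")
```

Running this on the Example 1 graph yields the expected output: `http://news.google.com` is discovered but dropped because its hostname differs from the start URL's.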
When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask clarifying questions in that case.
The brute force web crawler visits every single webpage reachable from a starting point, methodically following every link it finds. This guarantees that no reachable page is missed, though it can be very inefficient.

Here's how the algorithm looks in code:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse


def web_crawler(start_url, max_pages=100):
    pages_to_visit = [start_url]
    visited_pages = set()
    page_count = 0

    while pages_to_visit and page_count < max_pages:
        current_url = pages_to_visit.pop(0)

        # Avoid re-visiting already processed URLs
        if current_url in visited_pages:
            continue

        try:
            response = requests.get(current_url)
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
            page_content = response.text

            # Mark the current URL as visited to avoid loops
            visited_pages.add(current_url)
            page_count += 1

            # Extract all the links from the current webpage
            new_urls = extract_links(page_content, current_url)

            # Add discovered links for future exploration
            for new_url in new_urls:
                if new_url not in visited_pages and new_url not in pages_to_visit:
                    pages_to_visit.append(new_url)
        except requests.exceptions.RequestException as e:
            print(f"Error crawling {current_url}: {e}")

    return visited_pages


def extract_links(page_content, base_url):
    soup = BeautifulSoup(page_content, 'html.parser')
    links = []
    for anchor in soup.find_all('a', href=True):
        link = anchor['href']
        absolute_url = get_absolute_url(base_url, link)
        if absolute_url:
            links.append(absolute_url)
    return links


def get_absolute_url(base_url, url):
    # Resolve relative links against the page they appeared on
    absolute_url = urljoin(base_url, url)
    parsed_url = urlparse(absolute_url)
    # Keep only http(s) links; skip mailto:, javascript:, etc.
    if parsed_url.scheme not in ('http', 'https'):
        return None
    return absolute_url
```
The more efficient way to crawl a website is to explore it breadth-first, layer by layer, like traversing a tree level by level. A visited set prevents downloading the same page twice, and the queue guarantees every reachable page is eventually processed.

Here's how the algorithm looks in code:
```python
import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin, urlparse


def web_crawler(seed_url, max_urls=100):
    visited_urls = set()
    urls_to_visit = deque([seed_url])
    seed_host = urlparse(seed_url).netloc
    crawled_urls_count = 0

    while urls_to_visit and crawled_urls_count < max_urls:
        current_url = urls_to_visit.popleft()

        # Avoid re-crawling already processed URLs.
        if current_url in visited_urls:
            continue
        visited_urls.add(current_url)
        crawled_urls_count += 1

        try:
            response = requests.get(current_url)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract all the links from the current webpage.
            for anchor in soup.find_all('a', href=True):
                absolute_url = urljoin(current_url, anchor['href'])

                # Normalize URLs to prevent duplicates.
                parsed_url = urlparse(absolute_url)
                normalized_url = parsed_url.scheme + "://" + parsed_url.netloc + parsed_url.path

                # Stay on the seed's domain and skip already-visited URLs.
                if parsed_url.netloc == seed_host and normalized_url not in visited_urls:
                    urls_to_visit.append(normalized_url)
        except requests.exceptions.RequestException as e:
            print(f"Error crawling {current_url}: {e}")

    return visited_urls
```
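The same-domain and normalization checks used above can be isolated into small, independently testable helpers. A minimal sketch with `urllib.parse` (the names `same_domain` and `normalize` are hypothetical, not part of the problem's API):

```python
from urllib.parse import urlparse


def same_domain(url_a, url_b):
    """True when both URLs share a hostname."""
    return urlparse(url_a).netloc == urlparse(url_b).netloc


def normalize(url):
    """Drop query strings and fragments so duplicate pages dedupe."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"
```

For instance, `same_domain("http://news.yahoo.com/news", "http://news.google.com")` is `False`, and `normalize("http://news.yahoo.com/us?ref=home#top")` collapses to `"http://news.yahoo.com/us"`, so the same article reached via different tracking parameters is only crawled once.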
| Case | How to Handle |
|---|---|
| Null or empty start URL | Return an empty list or throw an IllegalArgumentException, as there is no starting point for crawling. |
| Invalid URL format in start URL or discovered links | Use a try-catch block when creating URL objects to gracefully handle MalformedURLException and log the error, skipping the invalid URL. |
| Circular links/cycles in the website graph | Maintain a 'visited' set of URLs to prevent infinite loops and redundant crawling of the same pages. |
| Website returning HTTP error codes (404, 500, etc.) | Check the HTTP status code before parsing the page content and skip processing if it's an error code, logging the error. |
| Maximum crawl depth reached | Stop crawling when the maximum allowed depth is reached to prevent excessive resource consumption. |
| Robots.txt disallows crawling certain paths | Implement Robots Exclusion Protocol parsing to respect the website's crawling restrictions before visiting any URL. |
| Large number of outgoing links on a single page causing memory issues | Limit the number of links extracted from a single page to prevent memory exhaustion. |
| Network issues causing connection timeouts | Implement retry mechanisms with exponential backoff for failed network requests to improve resilience. |
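For the robots.txt case in the table above, Python's standard library already ships a Robots Exclusion Protocol parser. A small sketch, assuming a hypothetical robots.txt that disallows `/private/` (in a real crawler you would fetch the file with `set_url(...)` and `read()` instead of parsing a string):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check each candidate URL before enqueueing it for crawling.
allowed = parser.can_fetch("*", "http://example.com/public/page")    # True
blocked = parser.can_fetch("*", "http://example.com/private/page")   # False
```

Calling `can_fetch(user_agent, url)` before appending a URL to the crawl queue keeps the crawler polite without any hand-rolled rule matching.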