
Design Search Autocomplete System

Hard
Uber
Topics:
Trees, Strings

Design a search autocomplete system for a search engine.

  1. Initialization: The system is initialized with a list of sentences and their corresponding frequencies. For example:

    sentences = ["i love you", "island", "ironman", "i love leetcode"]
    times = [5, 3, 2, 2]
    
  2. Input: The system receives a character as input. It should return the top 3 sentences that start with the current input. The ranking criteria are:

    • Higher frequency sentences are ranked higher.
    • If frequencies are the same, sentences are ranked lexicographically (alphabetical order).
  3. Termination: When the input character is '#', it signifies the end of a sentence. The current sentence (formed by all inputs since the last '#') should be added to the system with a frequency of 1 (or incremented if it already exists).

  4. Example:

    AutocompleteSystem(sentences, times)
    input('i') // Returns ["i love you", "island", "i love leetcode"]
    input(' ') // Returns ["i love you", "i love leetcode"]
    input('a') // Returns []
    input('#') // Returns []
    input('i') // Returns ["i love you", "island", "i love leetcode"]
    
  5. Constraints:

    • The input will consist of lowercase letters, spaces, and the '#' character.
    • Sentences are unique.
    • The frequency of a sentence is a non-negative integer.

Explain how you would design and implement such a system efficiently, considering both time and space complexity.
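The ranking rule (higher frequency first, ties broken alphabetically) maps directly onto a tuple sort key. A quick sketch using the sample data from the prompt:

```python
# Sketch of the ranking rule: sort by frequency descending,
# then by sentence lexicographically. Sample data from the prompt.
candidates = [("i love you", 5), ("island", 3),
              ("ironman", 2), ("i love leetcode", 2)]

# Negate the frequency so an ascending sort puts frequent sentences first
top_three = sorted(candidates, key=lambda pair: (-pair[1], pair[0]))[:3]
print([sentence for sentence, _ in top_three])
# → ['i love you', 'island', 'i love leetcode']
```

Note how the tie between "ironman" and "i love leetcode" (both frequency 2) is resolved in favor of "i love leetcode", which is lexicographically smaller.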

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the scale of the dataset (number of search terms and their frequency)?
  2. How frequently will the search terms and their frequencies be updated?
  3. What are the criteria for determining the relevance of a search term (e.g., frequency, recency, popularity)?
  4. How many autocomplete suggestions should the system return?
  5. Should the search be prefix-based, or should it also handle fuzzy matching or typo correction?

Brute Force Solution

Approach

The brute force approach checks every term in the dataset against the current prefix: scan the list, test whether each term starts with the prefix, and return every term that does.

Here's how the algorithm would work step-by-step:

  1. Take the input prefix (the beginning part of what the user is typing).
  2. Go through each term in our entire list of possible search terms, one by one.
  3. For each term, check if it starts with the input prefix.
  4. If the term does start with the input prefix, then include this term in a list of suggestions.
  5. Once you've gone through every term in our list, return the list of suggestions.
  6. These are all the search terms that begin with the prefix entered by the user.

Code Implementation

def autocomplete_brute_force(search_terms, prefix):
    suggestions = []

    # Iterate through each term to find matches
    for search_term in search_terms:

        # Check if the current term starts with the prefix
        if search_term.startswith(prefix):

            # Add matching term to the suggestions
            suggestions.append(search_term)

    return suggestions
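As a sanity check, the linear scan above behaves like a one-line filter (sample data from the prompt; note it returns matches in input order, with ranking layered on separately):

```python
search_terms = ["i love you", "island", "ironman", "i love leetcode"]

# Same behavior as autocomplete_brute_force: keep every term
# that begins with the typed prefix, preserving input order
suggestions = [term for term in search_terms if term.startswith("i love")]
print(suggestions)  # → ['i love you', 'i love leetcode']
```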

Big(O) Analysis

Time Complexity
O(nk) — We iterate through each of the n terms in the dataset and perform a prefix check on each one. The check compares at most k characters, where k is the length of the prefix, so the worst case costs k time per term. The overall time complexity is therefore O(nk), where n is the number of terms in the dataset and k is the prefix length.
Space Complexity
O(N) — The algorithm iterates through each term and adds those that start with the prefix to a list of suggestions. In the worst case, every term in the dataset starts with the input prefix, so all N terms are added. This suggestions list is the only auxiliary space, hence the space complexity is O(N), where N is the total number of possible search terms.

Optimal Solution

Approach

The best way to build an autocomplete system is to create a structure to quickly find popular search terms that start with what the user has typed. We use a tree-like structure that helps us efficiently retrieve suggestions and prioritize them based on their frequency.

Here's how the algorithm would work step-by-step:

  1. First, organize all the possible search terms into a special tree where each node represents a letter. Words that share a prefix follow the same path down the tree.
  2. As the user types, travel down the tree based on the letters they've entered. This quickly narrows down the potential search terms.
  3. At each node, keep track of the most popular search terms that can be reached from that point. This way, you don't have to search through everything every time.
  4. When the user pauses typing, get the most popular search terms from the current node in the tree. These are your suggestions.
  5. Show these suggestions to the user, ordered by how popular they are. More frequent searches appear first.
  6. To keep the system up-to-date, whenever someone selects a suggested search term, increase its popularity score in the tree. This helps the system learn and improve over time.

Code Implementation

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False
        self.popularity = 0

class AutocompleteSystem:
    def __init__(self, sentences, times):
        self.root = TrieNode()
        self.current_node = self.root
        self.current_prefix = ""
        for i in range(len(sentences)):
            self.add_sentence(sentences[i], times[i])

    def add_sentence(self, sentence, popularity_count):
        node = self.root
        for char in sentence:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True
        # Increment rather than overwrite, so a re-entered sentence
        # accumulates frequency instead of being reset to 1
        node.popularity += popularity_count

    def get_top_suggestions(self, node):
        suggestions = []
        self.dfs(node, self.current_prefix, suggestions)
        suggestions.sort(key=lambda item: (-item[0], item[1]))
        return [item[1] for item in suggestions[:3]]

    def dfs(self, node, current_string, suggestions):
        if node.is_end_of_word:
            suggestions.append((node.popularity, current_string))

        for char, child in node.children.items():
            self.dfs(child, current_string + char, suggestions)

    def input(self, character):
        if character == '#':
            self.add_sentence(self.current_prefix, 1) # Add the sentence with popularity 1
            self.current_node = self.root
            self.current_prefix = ""
            return []

        self.current_prefix += character

        if character not in self.current_node.children:
            # No stored sentence has this prefix; create a placeholder
            # node so the rest of the sentence can still be recorded
            self.current_node.children[character] = TrieNode()
            self.current_node = self.current_node.children[character]
            return []

        self.current_node = self.current_node.children[character]

        # Prioritize suggestions with high frequency
        return self.get_top_suggestions(self.current_node)

Big(O) Analysis

Time Complexity
O(L + m + s log s) — Building the trie costs O(L), where L is the total length of all sentences, and each typed character advances one trie node in O(1). Retrieving suggestions, however, requires a DFS over the subtree beneath the current node (m nodes) plus a sort of the s matching sentences, giving O(m + s log s) per query. Caching the top suggestions at each node during insertion, as step 3 of the approach describes, reduces each query to O(1) after the prefix walk.
Space Complexity
O(N*L) — The trie dominates the space consumption: N search terms of average length L are stored as paths, so in the worst case (no significant prefix sharing) the node count is proportional to the total length of all terms. If each node additionally caches its most popular sentences, that cache contributes space proportional to N*L in the worst case as well. The overall auxiliary space complexity is therefore O(N*L).
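Step 3 of the approach describes caching the most popular sentences at every node so a query never needs a DFS, but the implementation above recomputes suggestions by DFS on each query. A minimal sketch of the caching variant (names are illustrative and the cap of 3 is hard-coded to match the problem):

```python
class CachedNode:
    def __init__(self):
        self.children = {}
        # Top sentences reachable through this node, stored as
        # (frequency, sentence) pairs sorted by (-frequency, sentence)
        self.top = []

def insert_with_cache(root, sentence, frequency, cap=3):
    node = root
    for char in sentence:
        node = node.children.setdefault(char, CachedNode())
        # Drop any stale entry for this sentence, then re-rank
        entries = [e for e in node.top if e[1] != sentence]
        entries.append((frequency, sentence))
        entries.sort(key=lambda e: (-e[0], e[1]))
        node.top = entries[:cap]

root = CachedNode()
sentences = ["i love you", "island", "ironman", "i love leetcode"]
times = [5, 3, 2, 2]
for sentence, frequency in zip(sentences, times):
    insert_with_cache(root, sentence, frequency)

# A query is now just a prefix walk -- no DFS, no per-query sort
node = root
for char in "i ":
    node = node.children[char]
print([s for _, s in node.top])  # → ['i love you', 'i love leetcode']
```

The trade-off is extra work on every insert (each node along the path re-ranks its cache) in exchange for constant-time suggestion lookup, which usually pays off since reads vastly outnumber writes in autocomplete workloads.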

Edge Cases

Case → How to Handle

  • Empty search query: Return an empty list or the top k most frequent results, depending on design preference.
  • Null or invalid characters in the search query: Filter out invalid characters or handle them according to the problem's requirements (e.g., replace with a space).
  • Database of suggestions is empty: Return an empty list if there are no suggestions in the database.
  • Maximum query length exceeds system limitations: Truncate the query or return an error if the query exceeds the allowed maximum length.
  • Large number of suggestions matching the prefix: Implement pagination or limit the number of results returned to maintain performance.
  • Suggestions with identical prefixes but different relevance scores: Sort suggestions with identical prefixes by their relevance scores to prioritize more relevant results.
  • User types very fast, generating multiple queries in quick succession: Debounce or throttle the requests to the autocomplete system to prevent overloading the server.
  • Unicode characters in search query or suggestions: Ensure proper encoding and handling of Unicode characters to prevent display or indexing issues.
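The debounce idea mentioned above can be sketched with a timer that cancels earlier pending calls. This is a client-side illustration (the 50 ms idle window is an arbitrary choice):

```python
import threading
import time

def debounce(wait_seconds):
    # Illustrative decorator: run fn only after the caller has been
    # idle for wait_seconds; each new call cancels the pending one
    def decorator(fn):
        timer = None
        def wrapper(*args, **kwargs):
            nonlocal timer
            if timer is not None:
                timer.cancel()
            timer = threading.Timer(wait_seconds, fn, args, kwargs)
            timer.start()
        return wrapper
    return decorator

fired = []

@debounce(0.05)
def run_autocomplete(prefix):
    fired.append(prefix)

# Simulate fast typing: only the final prefix triggers a query
for prefix in ["i", "i ", "i l"]:
    run_autocomplete(prefix)
time.sleep(0.2)
print(fired)  # → ['i l']
```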