Taro Logo

Design Search Autocomplete System

Hard
Apple logo
Apple
1 view
Topics:
TreesStringsDatabase Problems

Design Search Autocomplete System

Design a search autocomplete system that provides top 3 most frequent relevant sentences based on user input.

  1. Initialization: The system is initialized with a list of sentences and their corresponding frequencies (hotness). For example:

    sentences = ["i love you", "island", "ironman", "i love coding"]

    times = [5, 3, 2, 2]

    This means "i love you" has a frequency of 5, "island" has a frequency of 3, and so on.

  2. Input: The system receives user input character by character. After each character, it should return the top 3 most frequent sentences starting with the current input.

    • If the input exactly matches any sentence, return it

    • If user inputs a special character #, it means the user finished typing a new sentence. In this case, add the sentence to data structure and set the frequency as 1 or increment the frequency if the sentence already exists.

    • Sentences are ranked by frequency (hotness) first and then lexicographical order.

    • If less than 3 sentences match the current input, return all matching sentences.

  3. Example:

    AutocompleteSystem(["i love you", "island", "ironman", "i love coding"], [5, 3, 2, 2])
    input("i") -> ["i love you", "island", "i love coding"]
    input(" ") -> ["i love you", "i love coding"]
    input("a") -> []
    input("#") -> []
    input("i love it") -> []
    input("#") -> []
    
    • Initially the system has sentences: ["i love you", "island", "ironman", "i love coding"] with frequencies [5, 3, 2, 2].
    • When the user types "i", the system suggests "i love you" (5), "island" (3), and "i love coding" (2).
    • When the user types a space after "i", the system suggests "i love you" and "i love coding".
    • When the user types "a" after "i ", there are no matches.
    • Then the user types "#", the new sentence "i love it" is added to the system with a frequency of 1.

What data structures and algorithms would you use to implement this system efficiently? Consider edge cases and analyze the time and space complexity of your solution.

Solution


Design Search Autocomplete System

Let's discuss how to design a search autocomplete system. This system should provide search suggestions as the user types a query.

1. Naive Solution

A simple approach would be to store all sentences in a list. For each character the user types, iterate through the entire list of sentences and find the sentences that start with the current input. Sort these matching sentences by hotness (frequency), and return the top 3.

  • Data Structure: List of strings (sentences) and their frequencies.
  • Algorithm:
    1. When a character is typed, filter the list of sentences to find those starting with the current input.
    2. Sort the filtered sentences by frequency (hotness) in descending order and then lexicographically.
    3. Return the top 3.

Example:

sentences = ["i love you", "island", "ironman", "i love coding"]

If the user types "i ", the algorithm would:

  1. Filter the sentences to ["i love you", "i love coding"]
  2. Assume frequencies are {"i love you": 5, "i love coding": 2}. Sort by frequency and then lexicographically: ["i love you", "i love coding"].
  3. Return ["i love you", "i love coding"] (since there are only two matches).

Big O Analysis:

  • Time Complexity: O(N * K * log(K)), where N is the number of sentences and K is the average length of a sentence. The N comes from the filtering, the K from comparing the strings and log(K) from the sort. String comparison can be considered O(K).
  • Space Complexity: O(N), where N is the number of sentences to store the initial data.

2. Optimal Solution: Trie Data Structure

A more efficient solution involves using a Trie data structure. A Trie (also known as a prefix tree) is a tree-like data structure that stores strings such that common prefixes share the same node. This allows for fast prefix-based search.

  • Data Structure: Trie, where each node stores:
    • A map of characters to child nodes.
    • A boolean indicating whether the node represents the end of a sentence.
    • A frequency count for the sentence.
  • Algorithm:
    1. Initialization: Build a Trie from the input sentences and their frequencies.
    2. Input: For each character typed:
      • Traverse the Trie based on the input character. If the character doesn't exist in the current node's children, return an empty list.
      • Once the traversal is complete (or blocked), perform a Depth-First Search (DFS) or Breadth-First Search (BFS) from the current node to find all sentences originating from that node.
      • Sort the found sentences by frequency and then lexicographically.
      • Return the top 3.

Example:

Using the same sentences as before:

sentences = ["i love you", "island", "ironman", "i love coding"]

  1. Build the trie. Each node will contain the hotness of the string if that node represents the end of a string.
  2. If the user types "i ", the algorithm would:
    • Traverse to the node representing "i ".
    • Perform a DFS to find all sentences starting with "i ": ["i love you", "i love coding"]
    • Sort by frequency and lexicographically (as before).
    • Return top 3.

Code (Python):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_sentence = False
        self.frequency = 0

class AutocompleteSystem:
    def __init__(self, sentences, times):
        self.root = TrieNode()
        self.current_prefix = ""
        
        for i, sentence in enumerate(sentences):
            self.add_sentence(sentence, times[i])

    def add_sentence(self, sentence, time):
        node = self.root
        for char in sentence:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_sentence = True
        node.frequency = time

    def input(self, c):
        if c == '#':
            self.add_sentence(self.current_prefix, 1)
            self.current_prefix = ""
            return []

        self.current_prefix += c
        node = self.root
        for char in self.current_prefix:
            if char not in node.children:
                return []
            node = node.children[char]

        results = []
        self.dfs(node, self.current_prefix, results)
        results.sort(key=lambda x: (-x[0], x[1]))
        return [sentence for _, sentence in results[:3]]

    def dfs(self, node, sentence, results):
        if node.is_sentence:
            results.append((node.frequency, sentence))

        for char, child in node.children.items():
            self.dfs(child, sentence + char, results)

Big O Analysis:

  • Initialization (Building Trie): O(N * K), where N is the number of sentences and K is the average length of a sentence.
  • Input:
    • Trie Traversal: O(P), where P is the length of the input prefix.
    • DFS (or BFS): O(M), where M is the number of nodes in the subtree rooted at the last traversed node. In the worst case, this could be O(N*K), but on average, it is much smaller.
    • Sorting: O(L * log(L)), where L is the number of sentences found during the DFS. Since we only keep top 3, L will be small in practice, at most the number of sentences with a common prefix.
  • Space Complexity: O(N * K) to store the Trie, where N is the number of sentences and K is the average sentence length.

3. Edge Cases

  • Empty Input: If the input is empty, return an empty list.
  • No Matching Sentences: If no sentences start with the input, return an empty list.
  • Input Character Not in Trie: If, during Trie traversal, an input character is not found, it means there are no matching sentences, so return an empty list.
  • Adding New Sentences: When the user types '#' to submit a new sentence:
    • Add the new sentence to the Trie with a frequency of 1 (or increment if it already exists).
    • Reset the current_prefix to an empty string.
  • Case Sensitivity: The problem statement does not specify case sensitivity. The provided code is case-sensitive. If case-insensitivity is desired, convert all sentences to lowercase during initialization and input processing.