Design Search Autocomplete System

Hard

Google

1 view

Topics:

StringsTrees

Let's design an autocomplete system. Imagine you are building the backend for a search bar that provides suggestions as the user types. Here's how it should work:

Data Storage: You have a dataset of sentences and their corresponding "hotness" scores (number of times the sentence has been searched). For example:
- "i love you": 5
- "island": 3
- "ironman": 2
- "i love leetcode": 2
Input: The user enters characters one at a time. After each character, the system returns the top 3 "hottest" sentences starting with the current input. If two sentences have the same "hotness", sort them lexicographically (alphabetical order).
Output: A list of the top 3 matching sentences.
New Sentences: If the user types a sentence not already in the system and presses '#', the system adds the new sentence with a hotness of 1.

Example:

Assume the system is initialized with the above dataset.

User types "i": System returns ["i love you", "island", "ironman"]
User types " " (space): System returns ["i love you", "i love leetcode"]
User types "a": System returns [] (no matches)
User types "i love new#": System adds "i love new" with hotness 1.
User types "i": System now returns ["i love you", "island", "i love new"]

Specific Requirements:

Implement the AutocompleteSystem class with the following methods:
- AutocompleteSystem(String[] sentences, int[] times): Initializes the system with the given sentences and their hotness scores.
- List<String> input(char c): Processes the next character c typed by the user and returns the top 3 matching sentences.
The sentences consist of lowercase English letters, spaces, and the special character '#'.
The hotness scores are non-negative integers.

Discuss the data structures you would use, the algorithms for efficient searching, and how you would handle edge cases such as empty input or no matching sentences.

Design Search Autocomplete System

Hard

Google

1 view

Topics:

StringsTrees

Let's design an autocomplete system. Imagine you are building the backend for a search bar that provides suggestions as the user types. Here's how it should work:

Data Storage: You have a dataset of sentences and their corresponding "hotness" scores (number of times the sentence has been searched). For example:
- "i love you": 5
- "island": 3
- "ironman": 2
- "i love leetcode": 2
Input: The user enters characters one at a time. After each character, the system returns the top 3 "hottest" sentences starting with the current input. If two sentences have the same "hotness", sort them lexicographically (alphabetical order).
Output: A list of the top 3 matching sentences.
New Sentences: If the user types a sentence not already in the system and presses '#', the system adds the new sentence with a hotness of 1.

Example:

Assume the system is initialized with the above dataset.

User types "i": System returns ["i love you", "island", "ironman"]
User types " " (space): System returns ["i love you", "i love leetcode"]
User types "a": System returns [] (no matches)
User types "i love new#": System adds "i love new" with hotness 1.
User types "i": System now returns ["i love you", "island", "i love new"]

Specific Requirements:

Implement the AutocompleteSystem class with the following methods:
- AutocompleteSystem(String[] sentences, int[] times): Initializes the system with the given sentences and their hotness scores.
- List<String> input(char c): Processes the next character c typed by the user and returns the top 3 matching sentences.
The sentences consist of lowercase English letters, spaces, and the special character '#'.
The hotness scores are non-negative integers.

Discuss the data structures you would use, the algorithms for efficient searching, and how you would handle edge cases such as empty input or no matching sentences.

Solution

Design Search Autocomplete System

Let's discuss how to design a search autocomplete system. This involves storing historical search data and providing suggestions as the user types. We'll cover a naive approach and then an optimized solution using a Trie data structure.

1. Naive Approach

A straightforward approach would be to store all previously searched sentences and their frequencies. When the user types an input, we iterate through all stored sentences, check if the sentence starts with the input, and if it does, add it to the suggestions list. Sort the suggestions by frequency and return the top 3.

Algorithm:

Store Data: Maintain a list of sentences and their frequencies (e.g., using a hash map).
Input: When the user inputs a character, retrieve the current input string.
Search: Iterate through the stored sentences. If a sentence starts with the input string, add it to a list of candidates.
Sort: Sort the candidates by frequency in descending order. If frequencies are the same, sort lexicographically.
Return: Return the top 3 candidates.

Example:

Stored data:

"i love you": 5
"island": 3
"ironman": 2
"i love leetcode": 2

Input: "i "

Sentences starting with "i ": "i love you", "i love leetcode"
Sorted: "i love you" (5), "i love leetcode" (2)
Return: ["i love you", "i love leetcode"]

Complexity Analysis:

Time Complexity: O(N * M * log K), where N is the number of stored sentences, M is the average length of sentences, and K is the number of matching sentences (K <= N).
Searching for matching sentences takes O(N * M).
Sorting the matching sentences takes O(K * log K).
Space Complexity: O(N * M) to store the sentences and their frequencies.

2. Optimal Solution: Trie Data Structure

The Trie (also known as a prefix tree) is a tree-like data structure that is very efficient for prefix-based search. Each node in the Trie represents a character, and the path from the root to a node represents a prefix. We can augment the Trie nodes to store frequency information.

Algorithm:

Trie Construction:
- Create a Trie node with children (a map of characters to Trie nodes) and a frequency counter.
- Insert each sentence into the Trie. For each character in the sentence, traverse or create a new node. Update the frequency at the terminal node.
Search:
- Start from the root of the Trie and traverse based on the input characters. If a character is not found, return an empty list.
- Once the prefix node is found, perform a Depth-First Search (DFS) or Breadth-First Search (BFS) to find all sentences starting with the prefix.
- Collect the sentences and their frequencies. Sort them by frequency and lexicographically, and return the top 3.

Code:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.sentences = [] # List of (frequency, sentence) tuples

class AutocompleteSystem:
    def __init__(self, sentences, times):
        self.root = TrieNode()
        self.keyword = ""
        for i, sentence in enumerate(sentences):
            self.add(sentence, times[i])

    def add(self, sentence, time):
        node = self.root
        for char in sentence:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            node.sentences.append((-time, sentence)) # Store negative time for easier sorting
            node.sentences.sort()
            if len(node.sentences) > 3:
                node.sentences.pop()

    def search(self, c):
        if c == '#':
            self.add(self.keyword, 1)
            self.keyword = ""
            return []

        self.keyword += c
        node = self.root
        for char in self.keyword:
            if char not in node.children:
                return []
            node = node.children[char]

        return [sentence for _, sentence in node.sentences]

Complexity Analysis:

Trie Construction: O(N * M), where N is the number of sentences, and M is the average length of a sentence.
Search: O(P + K * log K), where P is the length of the prefix and K is the number of sentences with that prefix stored in Trie node (max 3 in our case). The sorting is limited to top 3.
Space Complexity: O(N * M), where N is the number of sentences, and M is the average length of a sentence.

Edge Cases:

Empty Input: If the input is empty, return an empty list or the most frequent sentences overall (depending on the requirements).
No Matching Sentences: If there are no sentences starting with the input, return an empty list.
Input Contains Special Characters: Handle special characters appropriately. The provided code handles # as the end of sentence character.
Frequency Tie: If multiple sentences have the same frequency, sort them lexicographically.
Large Dataset: For a very large dataset, consider using a distributed Trie or other distributed data structures to handle the load and improve performance.

Solution

Design Search Autocomplete System

1. Naive Approach

Algorithm:

Store Data: Maintain a list of sentences and their frequencies (e.g., using a hash map).
Input: When the user inputs a character, retrieve the current input string.
Search: Iterate through the stored sentences. If a sentence starts with the input string, add it to a list of candidates.
Sort: Sort the candidates by frequency in descending order. If frequencies are the same, sort lexicographically.
Return: Return the top 3 candidates.

Example:

Stored data:

"i love you": 5
"island": 3
"ironman": 2
"i love leetcode": 2

Input: "i "

Sentences starting with "i ": "i love you", "i love leetcode"
Sorted: "i love you" (5), "i love leetcode" (2)
Return: ["i love you", "i love leetcode"]

Complexity Analysis:

Time Complexity: O(N * M * log K), where N is the number of stored sentences, M is the average length of sentences, and K is the number of matching sentences (K <= N).
Searching for matching sentences takes O(N * M).
Sorting the matching sentences takes O(K * log K).
Space Complexity: O(N * M) to store the sentences and their frequencies.

2. Optimal Solution: Trie Data Structure

Algorithm:

Trie Construction:
- Create a Trie node with children (a map of characters to Trie nodes) and a frequency counter.
- Insert each sentence into the Trie. For each character in the sentence, traverse or create a new node. Update the frequency at the terminal node.
Search:
- Start from the root of the Trie and traverse based on the input characters. If a character is not found, return an empty list.
- Once the prefix node is found, perform a Depth-First Search (DFS) or Breadth-First Search (BFS) to find all sentences starting with the prefix.
- Collect the sentences and their frequencies. Sort them by frequency and lexicographically, and return the top 3.

Code:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.sentences = [] # List of (frequency, sentence) tuples

class AutocompleteSystem:
    def __init__(self, sentences, times):
        self.root = TrieNode()
        self.keyword = ""
        for i, sentence in enumerate(sentences):
            self.add(sentence, times[i])

    def add(self, sentence, time):
        node = self.root
        for char in sentence:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
            node.sentences.append((-time, sentence)) # Store negative time for easier sorting
            node.sentences.sort()
            if len(node.sentences) > 3:
                node.sentences.pop()

    def search(self, c):
        if c == '#':
            self.add(self.keyword, 1)
            self.keyword = ""
            return []

        self.keyword += c
        node = self.root
        for char in self.keyword:
            if char not in node.children:
                return []
            node = node.children[char]

        return [sentence for _, sentence in node.sentences]

Complexity Analysis:

Trie Construction: O(N * M), where N is the number of sentences, and M is the average length of a sentence.
Search: O(P + K * log K), where P is the length of the prefix and K is the number of sentences with that prefix stored in Trie node (max 3 in our case). The sorting is limited to top 3.
Space Complexity: O(N * M), where N is the number of sentences, and M is the average length of a sentence.

Edge Cases:

Empty Input: If the input is empty, return an empty list or the most frequent sentences overall (depending on the requirements).
No Matching Sentences: If there are no sentences starting with the input, return an empty list.
Input Contains Special Characters: Handle special characters appropriately. The provided code handles # as the end of sentence character.
Frequency Tie: If multiple sentences have the same frequency, sort them lexicographically.
Large Dataset: For a very large dataset, consider using a distributed Trie or other distributed data structures to handle the load and improve performance.