Taro Logo

Implement Trie (Prefix Tree)

Medium
Google logo
Google
2 views
Topics:
StringsTrees

Let's design a Trie (pronounced "try") or prefix tree data structure, commonly used for efficient string key storage and retrieval. Tries are especially valuable in applications like autocomplete and spell checkers.

Your task is to implement the Trie class with the following methods:

  1. Trie(): Initializes an empty Trie.
  2. void insert(String word): Inserts a given word into the Trie.
  3. boolean search(String word): Returns true if the word is present in the Trie (i.e., has been inserted previously); otherwise, returns false.
  4. boolean startsWith(String prefix): Returns true if there exists any word in the Trie that starts with the given prefix; otherwise, returns false.

For instance, consider this sequence of operations:

Trie trie = new Trie();
trie.insert("apple");
trie.search("apple");   // Returns true
trie.search("app");     // Returns false
trie.startsWith("app"); // Returns true
trie.insert("app");
trie.search("app");     // Returns true

What data structures would you use, and how would you implement the insert, search, and startsWith methods efficiently? Consider time and space complexity. How would your implementation handle edge cases, such as inserting or searching for an empty string? What is the Big O runtime of your solution?

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What character set will the words and prefixes consist of, and will they be case-sensitive?
  2. Are the words and prefixes guaranteed to be valid strings, or should I handle potential null or empty string inputs?
  3. What is the maximum length of a word that can be inserted into the Trie?
  4. How many insert, search, and startsWith operations should I expect to be performed in total?
  5. If `startsWith(prefix)` finds multiple words that start with the given prefix, should I return all of them or just indicate that at least one exists?

Brute Force Solution

Approach

We want to build a system to quickly find words that start with a given set of letters. The brute force approach to this is to simply check every word we know against the given letters each time we want to find something.

Here's how the algorithm would work step-by-step:

  1. When we first start, we have a big list of all the words we know.
  2. Someone gives us a set of letters and asks us to find all the words that start with those letters.
  3. We go through each word in our big list, one by one.
  4. For each word, we check if it begins with the set of letters we were given.
  5. If a word starts with the given letters, we add it to a new list of words that match.
  6. Once we have gone through every word in our original list, we give back the list of matching words.

Code Implementation

def find_words_with_prefix_brute_force(list_of_words, prefix):
    matching_words = []

    # Iterate through each word in the given list of words
    for current_word in list_of_words:

        # Check if the current word starts with the given prefix
        if current_word.startswith(prefix):

            # If it does, add the word to the list of matching words
            matching_words.append(current_word)

    # Return the list of words that start with the given prefix
    return matching_words

def main():
    word_list = ["apple", "apricot", "banana", "avocado", "kiwi"]
    prefix = "ap"

    # Find words in the word list that start with the given prefix
    words_with_prefix = find_words_with_prefix_brute_force(word_list, prefix)

    print(f"Words with prefix '{prefix}': {words_with_prefix}")

    prefix = "ban"

    # Find words in the word list that start with the given prefix
    words_with_prefix = find_words_with_prefix_brute_force(word_list, prefix)

    print(f"Words with prefix '{prefix}': {words_with_prefix}")

if __name__ == "__main__":
    main()

Big(O) Analysis

Time Complexity
O(n*m)The described solution iterates through a list of words, where 'n' represents the number of words in the list. For each word, it checks if the word starts with the given set of letters. 'm' represents the length of the prefix we are checking against (the set of letters). In the worst-case scenario, the prefix comparison takes m time. Therefore, the overall time complexity is O(n*m), where we do m work n times.
Space Complexity
O(M)The described brute force approach creates a new list to store the matching words. In the worst-case scenario, every word in the original list starts with the given prefix, and thus every word is added to the new list. If there are M words in the original list, the new list could grow to a size of M. Therefore, the auxiliary space required is proportional to the number of words in the original list, which is O(M).

Optimal Solution

Approach

A Trie is like a special kind of tree designed to store words in a way that makes searching for them, or parts of them, very fast. Instead of storing whole words at each spot, it cleverly breaks them down into individual letters and shares common prefixes to save space and speed things up.

Here's how the algorithm would work step-by-step:

  1. Imagine each spot in the Trie as a node, representing a single letter or the end of a word.
  2. The starting point of the Trie is an empty node, often called the root.
  3. To add a word, start at the root and follow a path, creating new nodes for each letter of the word if they don't already exist.
  4. Each node has connections to other nodes, one for each letter of the alphabet that could come next.
  5. When you reach the last letter of a word, mark that node as the end of a valid word.
  6. To search for a word, start at the root and follow the path corresponding to the letters in the word. If you reach a dead end, the word isn't in the Trie.
  7. If you reach the end of the word and the node is marked as the end of a valid word, then the word exists in the Trie.
  8. To check if a word is a prefix of any word in the Trie, follow the path of the prefix. If the path exists, the prefix is valid, even if the node isn't marked as the end of a valid word.

Code Implementation

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        current_node = self.root
        for char in word:
            # Create node if path doesn't exist
            if char not in current_node.children:
                current_node.children[char] = TrieNode()

            current_node = current_node.children[char]

        # Mark last node as end of word
        current_node.is_end_of_word = True

    def search(self, word):
        current_node = self.root
        for char in word:
            if char not in current_node.children:
                return False
            current_node = current_node.children[char]

        # Verify word exists and is marked as such
        return current_node.is_end_of_word

    def startsWith(self, prefix):
        current_node = self.root
        for char in prefix:
            # Path doesn't exist, so prefix can't exist
            if char not in current_node.children:
                return False

            current_node = current_node.children[char]

        return True

Big(O) Analysis

Time Complexity
O(m)The primary operations within the Trie (insert, search, startsWith) involve traversing the Trie based on the length of the input word or prefix, denoted as 'm'. Inserting involves creating nodes along a path of length 'm' in the worst case. Searching and checking prefixes also require traversing a path of length 'm' at most. Therefore, the time complexity for each of these operations is directly proportional to the length of the word or prefix being processed.
Space Complexity
O(M * K)The space complexity of a Trie depends on the number and length of the words stored. In the worst-case scenario, where no words share a common prefix, each word will have its own dedicated path from the root. Let M be the average length of the words, and K be the total number of words inserted into the Trie. Each node stores references to up to 26 child nodes (for each letter of the alphabet), and the total number of nodes in the Trie can be up to M * K in the worst case. Thus, the auxiliary space required to store the Trie data structure is proportional to O(M * K).

Edge Cases

CaseHow to Handle
Null or empty word/prefix input to insert, search, or startsWithHandle null/empty input gracefully, possibly by returning false for search/startsWith or doing nothing for insert.
Word/prefix containing non-alphabetic characters (e.g., numbers, symbols)Define the allowed character set explicitly and either reject invalid characters or handle them according to the problem specification.
Very long word insertions, approaching memory limitsConsider potential memory exhaustion with extremely long words and consider if the data structure can be optimized for space or if the input needs to be chunked or limited.
Inserting the same word multiple timesThe insert operation should simply overwrite or re-insert the word, so subsequent searches return true.
Searching for a prefix that is also a complete word in the TrieThe startsWith method should return true since a word exists that starts with the given prefix.
Searching for a word that is a prefix of another word in the TrieThe search method should return true only if the exact word exists and is marked as the end of a word.
Trie is empty and search or startsWith is calledBoth search and startsWith should return false if the Trie is initially empty.
Case sensitivity of input words and prefixesDefine whether the Trie should be case-sensitive or case-insensitive and normalize the input accordingly (e.g., convert all input to lowercase).