Number of Distinct Substrings in a String

Medium
Uber
Topics: Strings, Trees

Let's explore a string-related problem. You are given a string s. Your task is to find the number of distinct substrings present in the given string.

For example:

  • If s = "abcabc", the distinct substrings are {"a", "ab", "abc", "abca", "abcab", "abcabc", "b", "bc", "bca", "bcab", "bcabc", "c", "ca", "cab", "cabc"}. Therefore, the answer would be 15 (after removing duplicates).
  • If s = "aaaa", the distinct substrings are {"a", "aa", "aaa", "aaaa"}. The answer would be 4.
  • If s = "", the distinct substrings are {}. The answer would be 0.

Could you design an algorithm to solve this problem efficiently? Consider approaches and their time/space complexities, and cover edge cases.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the maximum length of the input string?
  2. Does the string contain only ASCII characters, or can it contain Unicode characters?
  3. Is a substring case-sensitive (i.e., is 'a' different from 'A')?
  4. Should I return the count of distinct substrings, or the actual set/list of distinct substrings?
  5. Is an empty string a valid input, and if so, how should I handle it?
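
The answers to questions 2 and 3 change the result, not just the implementation. The short sketch below illustrates this with a hypothetical helper, `count_distinct`, which is not part of the original solution; it assumes Python's `str` semantics, where a string is a sequence of Unicode code points.

```python
import unicodedata

def count_distinct(text):
    # Hypothetical helper: brute-force count of distinct substrings via a set.
    n = len(text)
    return len({text[i:j] for i in range(n) for j in range(i + 1, n + 1)})

# Case sensitivity: "a" and "A" are different characters unless folded.
print(count_distinct("aA"))             # 3 -> "a", "A", "aA"
print(count_distinct("aA".casefold()))  # 2 -> "a", "aa"

# Unicode: "e" followed by a combining acute accent is two code points,
# but NFC normalization composes the pair into the single code point U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(count_distinct(decomposed), count_distinct(composed))  # 3 1
```

Whether to fold case or normalize before counting is a requirement to confirm with the interviewer rather than something the algorithm can decide.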

Brute Force Solution

Approach

To find the number of different pieces within a string using brute force, we will look at every possible piece that can be cut out from the string. This involves considering pieces of all possible lengths and starting positions.

Here's how the algorithm would work step-by-step:

  1. First, grab all pieces that are one character long.
  2. Next, grab all pieces that are two characters long, starting at the beginning, then shifting one character at a time, all the way to the end.
  3. Continue this process for pieces of length three, then four, and so on, until you've considered pieces as long as the entire original string.
  4. As you grab each piece, keep track of whether you've seen that exact piece before.
  5. At the end, count up the number of unique pieces you found, ignoring any duplicates.

Code Implementation

def number_of_distinct_substrings_brute_force(input_string):
    # Treat None and "" alike: there are no substrings to count.
    if not input_string:
        return 0

    unique_substrings = set()
    string_length = len(input_string)

    # Try every substring length from 1 up to the full string.
    for substring_length in range(1, string_length + 1):
        # Slide a window of that length across every valid starting position.
        for starting_index in range(string_length - substring_length + 1):
            ending_index = starting_index + substring_length
            substring = input_string[starting_index:ending_index]

            # A set ignores duplicates, so no membership check is needed.
            unique_substrings.add(substring)

    return len(unique_substrings)

Big(O) Analysis

Time Complexity
O(n^3). The algorithm iterates over all substring lengths from 1 to n, where n is the length of the input string, and for each length over all O(n) starting positions, giving O(n^2) substrings. For each one, slicing it out of the string and hashing it for the set lookup takes time proportional to its length, which is O(n) in the worst case. The total cost is therefore n * n * n, or O(n^3).
Space Complexity
O(n^3). The set stores a separate copy of every distinct substring. In the worst case, where all substrings are distinct, it holds n(n+1)/2 strings: n of length 1, n - 1 of length 2, and so on down to one of length n. The number of stored strings is O(n^2), but the space they occupy is the sum of their lengths, n(n+1)(n+2)/6, so the auxiliary space is O(n^3) in the worst case. Inputs with many repeats need far less; "aaaa" stores only 4 strings.
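
To check the worst-case storage concretely, the sketch below builds a string of n distinct characters and measures both how many substrings the set holds and how many characters those substrings contain in total. The helper name `worst_case_storage` is an assumption made for illustration. The count grows as n(n+1)/2, while the total number of stored characters grows as n(n+1)(n+2)/6, which is what matters when every substring is kept as its own copy.

```python
def worst_case_storage(n):
    # Hypothetical helper: n distinct characters make every substring distinct.
    text = "".join(chr(ord("a") + i) for i in range(n))
    substrings = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    count = len(substrings)
    total_chars = sum(len(s) for s in substrings)
    return count, total_chars

for n in (1, 5, 10, 20):
    count, total_chars = worst_case_storage(n)
    # Closed forms for the number of substrings and their combined length.
    assert count == n * (n + 1) // 2
    assert total_chars == n * (n + 1) * (n + 2) // 6
    print(n, count, total_chars)
```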

Optimal Solution

Approach

Storing every substring as its own string is slow and wasteful. Instead, we build a trie (a prefix tree) over the suffixes of the string, so that each distinct substring is represented exactly once. The key insight is that every new node created while building the trie corresponds to one new distinct substring, so counting nodes counts substrings.

Here's how the algorithm would work step-by-step:

  1. Imagine a tree where each path from the root spells out a substring. The root represents the empty string.
  2. For each starting position in the original string, begin at the root and read characters one at a time toward the end of the string.
  3. For each character, check whether the current node already has a child for it. If it does, that substring has been seen before, so simply move to that child (this prevents duplicates).
  4. If it does not, create a new child node and move to it. The new node represents a substring that has not appeared before, so increase the count by one.
  5. Repeat for every starting position, so that every suffix of the string, and therefore every substring, has been walked through the tree.
  6. At the end, the number of nodes in the tree, excluding the root, is the number of distinct substrings.
  7. Because each distinct substring corresponds to exactly one node, no duplicate is ever stored or counted, and each character step is a constant-time lookup on average rather than a copy of a whole substring.

Code Implementation

class Node:
    def __init__(self):
        # Maps a character to the child node reached by that character.
        self.children = {}

def number_of_distinct_substrings(input_string):
    # Treat None and "" alike: there are no substrings to count.
    if not input_string:
        return 0

    root = Node()
    distinct_substring_count = 0

    # Insert every suffix input_string[i:] into the trie.
    for i in range(len(input_string)):
        current_node = root
        for j in range(i, len(input_string)):
            character = input_string[j]

            # A missing child means input_string[i:j + 1] is a new substring.
            if character not in current_node.children:
                current_node.children[character] = Node()
                distinct_substring_count += 1

            current_node = current_node.children[character]

    return distinct_substring_count
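
One way to gain confidence in a trie-based count is to compare it against the brute-force count on many small random strings. The sketch below does that with compact re-implementations of both ideas; the names `count_with_set`, `count_with_trie`, and `cross_check`, along with the trial count, alphabet, and seed, are assumptions chosen for illustration. The trie here uses nested dictionaries instead of a node class, which is equivalent for counting purposes.

```python
import random

def count_with_set(text):
    # Brute force: collect every slice text[i:j] in a set.
    n = len(text)
    return len({text[i:j] for i in range(n) for j in range(i + 1, n + 1)})

def count_with_trie(text):
    # Trie of suffixes as nested dicts; each new node is a new distinct substring.
    root = {}
    count = 0
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count

def cross_check(trials=200, max_length=12, alphabet="ab", seed=0):
    # A small alphabet forces many repeated substrings, which is where bugs hide.
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, max_length)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        assert count_with_set(text) == count_with_trie(text), text
    return True

print(cross_check())              # True
print(count_with_trie("abcabc"))  # 15
print(count_with_trie("aaaa"))    # 4
```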

Big(O) Analysis

Time Complexity
O(n^2). The outer loop runs once for each of the n starting positions, and the inner loop walks from that position to the end of the string, at most n characters. Each step is a single dictionary lookup and, at most, one node insertion, both O(1) on average; no substring is ever copied or compared. The total work is therefore proportional to n * n, or O(n^2).
Space Complexity
O(n^2). The trie holds one node per distinct substring. In the worst case, where all substrings are distinct, that is n(n+1)/2 nodes, which is proportional to n^2. Therefore, the auxiliary space used by the tree is O(n^2). Inputs with many repeats need far fewer nodes; "aaaa" needs only 4.

Edge Cases

| Case | How to Handle |
| --- | --- |
| Null or empty input string | Return 0, as there are no substrings in an empty string. |
| String with only one character | Return 1, as the only substring is the string itself. |
| String with all identical characters (e.g., 'aaaa') | The number of distinct substrings will be n, where n is the length of the string. |
| Very long string (approaching memory limits) | Ensure the chosen algorithm is memory-efficient. A plain trie can need O(n^2) nodes, so consider a rolling hash, or a suffix array or suffix automaton, to avoid exceeding memory limits. |
| String containing Unicode characters | Ensure the chosen language's string handling and substring extraction correctly handle Unicode characters and their varying lengths. |
| String with maximum allowed characters (e.g., all printable ASCII characters) | The chosen algorithm should efficiently handle the large number of substrings that result from a diverse character set. |
| String containing only whitespace characters | The solution should still enumerate and count these whitespace substrings as distinct. |
| Integer overflow when calculating hash values for substrings (if using hashing) | Use a sufficiently large integer type (e.g., long in Java) and consider using modulo arithmetic to prevent overflow. |
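
The rolling-hash option mentioned above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name, the base 131, and the modulus 2^61 - 1 are all assumptions chosen for the example. Python integers do not overflow, so the modulus is used here only to keep hash values bounded, mirroring what a fixed-width language would require. Because two different substrings could in principle share a hash, the result is probabilistic and may undercount on a collision.

```python
def count_distinct_substrings_hashed(text, base=131, modulus=(1 << 61) - 1):
    n = len(text)
    # prefix_hash[i] is the polynomial hash of text[:i];
    # power[i] is base**i reduced modulo `modulus`.
    prefix_hash = [0] * (n + 1)
    power = [1] * (n + 1)
    for i, ch in enumerate(text):
        prefix_hash[i + 1] = (prefix_hash[i] * base + ord(ch) + 1) % modulus
        power[i + 1] = (power[i] * base) % modulus

    total = 0
    for length in range(1, n + 1):
        # Only hashes of one length are held at a time, so this set has
        # at most n integers instead of n substring copies.
        seen = set()
        for start in range(n - length + 1):
            end = start + length
            # Hash of text[start:end] in O(1) from the prefix hashes.
            h = (prefix_hash[end] - prefix_hash[start] * power[length]) % modulus
            seen.add(h)
        total += len(seen)
    return total

print(count_distinct_substrings_hashed("abcabc"))  # 15
print(count_distinct_substrings_hashed("aaaa"))    # 4
print(count_distinct_substrings_hashed(""))        # 0
```

Under these assumptions the running time stays O(n^2), but the extra memory drops to O(n), at the cost of a small chance of hash collisions.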