Taro Logo

Distinct Subsequences

Hard
Coupang logo
Coupang
0 views
Topics:
StringsDynamic Programming

Given two strings s and t, return the number of distinct subsequences of s which equals t.

For example:

Input: s = "rabbbit", t = "rabbit" Output: 3 Explanation: As shown below, there are 3 ways you can generate "rabbit" from s. rabbbit rabbbit rabbbit

Input: s = "babgbag", t = "bag" Output: 5 Explanation: As shown below, there are 5 ways you can generate "bag" from s. babgbag babgba babgb babgb babgbag

Write a function that takes two strings as input and returns an integer representing the number of distinct subsequences. Consider edge cases such as empty strings and different lengths of input strings. Explain the time and space complexity of your solution.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the maximum lengths of the input strings `s` and `t`? Are they of comparable length?
  2. Can `s` or `t` be null or empty strings? What should I return in those cases?
  3. Are the characters in the strings ASCII, or can they be Unicode?
  4. If `t` is not a subsequence of `s`, what should I return?
  5. Are we concerned about integer overflow if the number of distinct subsequences is very large?

Brute Force Solution

Approach

The goal is to find how many times one string appears as a subsequence within another. The brute force approach involves checking every possible way to select characters from the longer string to see if they match the shorter string.

Here's how the algorithm would work step-by-step:

  1. Consider every possible combination of characters you could pick from the longer string.
  2. For each combination, check if the characters you picked, in the order you picked them, exactly match the shorter string.
  3. If they match, count it as one successful way to form the subsequence.
  4. After checking every single possible combination of characters from the longer string, add up all the successful matches you found.
  5. The final sum represents the total number of distinct subsequences.

Code Implementation

def distinct_subsequences_brute_force(longer_string, shorter_string):
    number_of_subsequences = 0

    def generate_subsequence_and_check(index, current_subsequence):
        nonlocal number_of_subsequences

        # Base case: If the subsequence matches target.
        if len(current_subsequence) == len(shorter_string):
            if current_subsequence == shorter_string:
                number_of_subsequences += 1
            return

        # Base case: If we've exhausted the longer string.
        if index == len(longer_string):
            return

        # Explore including the current character.
        generate_subsequence_and_check(
            index + 1,
            current_subsequence + longer_string[index],
        )

        # Explore excluding the current character.
        # We explore both with and without the current char.
        generate_subsequence_and_check(index + 1, current_subsequence)

    generate_subsequence_and_check(0, "")
    return number_of_subsequences

Big(O) Analysis

Time Complexity
O(2^n)The brute force approach considers every possible subsequence of the longer string. For a string of length n, there are 2^n possible subsequences (each character can either be included or excluded). For each of these subsequences, we need to compare it with the shorter string to see if they match. The comparison takes at most O(m) time where m is the length of the shorter string, but since the number of subsequences dominates, we can approximate the time complexity to be O(2^n * m). Since the number of subsequences grows exponentially, the O(m) comparison can be dropped. Therefore the final time complexity is O(2^n).
Space Complexity
O(1)The brute force approach, as described, iterates through all possible subsequences by generating combinations. This involves tracking the indices of selected characters, which can be done without creating any auxiliary data structures that scale with the input size. It only requires a fixed number of variables for indices and counting matches, regardless of the lengths of the input strings. Therefore, the auxiliary space complexity is constant.

Optimal Solution

Approach

The most efficient way to solve this problem involves figuring out how many times the smaller string appears as a subsequence within the larger string. Instead of brute-force searching every possible subsequence, we use a clever method to reuse already computed results. The key is to break the problem down into smaller, overlapping subproblems and store their solutions to avoid redundant calculations.

Here's how the algorithm would work step-by-step:

  1. Imagine a table where each cell represents the number of distinct subsequences of a part of the larger string that match a part of the smaller string.
  2. Start filling the table from the beginning. If the first characters of the two strings we're currently looking at match, the number of distinct subsequences is the sum of: a) the number of distinct subsequences we found without using this character from the larger string and b) the number of distinct subsequences we found using this character from both the larger and smaller string.
  3. If the characters don't match, it means that we can't use the current character from the larger string to form the subsequence of the smaller string, so we take the number of distinct subsequences without using this character from the larger string.
  4. Continue filling the table row by row, using previously calculated values to compute the current value. This avoids recomputing the same thing multiple times.
  5. The final answer will be in the bottom-right cell of the table, representing the number of distinct subsequences of the entire larger string that match the entire smaller string.

Code Implementation

def distinct_subsequences(larger_string, smaller_string):
    larger_string_length = len(larger_string)
    smaller_string_length = len(smaller_string)

    # dp_table[i][j] is the number of distinct subsequences
    # of larger_string[:i] that equals smaller_string[:j]
    dp_table = [[0] * (smaller_string_length + 1) for _ in range(larger_string_length + 1)]

    # Initialization: Empty smaller_string is a subsequence of any larger_string
    for i in range(larger_string_length + 1):
        dp_table[i][0] = 1

    for i in range(1, larger_string_length + 1):
        for j in range(1, smaller_string_length + 1):

            # If chars match, we add two possibilities.
            if larger_string[i - 1] == smaller_string[j - 1]:
                dp_table[i][j] = dp_table[i - 1][j - 1] + dp_table[i - 1][j]

            # No match, inherit from the previous larger_string state.
            else:

                dp_table[i][j] = dp_table[i - 1][j]

    return dp_table[larger_string_length][smaller_string_length]

Big(O) Analysis

Time Complexity
O(m*n)The algorithm constructs a 2D table (or matrix) to store intermediate results. The dimensions of this table are determined by the lengths of the two input strings, s (length n) and t (length m). The algorithm iterates through each cell of this m x n table. For each cell, it performs a constant-time operation (comparison and addition or assignment), based on whether the corresponding characters in the two strings match or not. Therefore, the time complexity is directly proportional to the size of the table, resulting in O(m*n).
Space Complexity
O(m*n)The provided algorithm utilizes a table (implicitly described as a 2D array) to store intermediate results, where each cell represents the number of distinct subsequences. The dimensions of this table are determined by the lengths of the larger string (s) and the smaller string (t). Specifically, if the length of 's' is 'n' and the length of 't' is 'm', the table will have 'm' rows and 'n' columns. Thus, the auxiliary space used is proportional to the product of the lengths of the two input strings, resulting in a space complexity of O(m*n).

Edge Cases

CaseHow to Handle
Empty string S or TIf T is empty, there's one subsequence (empty string); if S is empty but T isn't, there are zero subsequences.
S is shorter than TReturn 0 immediately, as T cannot be a subsequence of S.
S and T are identicalReturn 1, as S is trivially a subsequence of itself.
T contains characters not in SReturn 0, as T cannot be formed as a subsequence of S.
S and T are very long strings (close to the maximum allowed length)Ensure that the dynamic programming table's size does not exceed available memory and that integer overflow is handled correctly in the calculation of subsequence counts.
S and T contain only a single repeating character (e.g., S = 'aaaa', T = 'aa')Handle the combinatorial calculation correctly, where the number of subsequences will involve binomial coefficients.
Integer overflow when calculating the number of subsequencesUse a larger integer data type (e.g., long long in C++, long in Java) or apply modulo operation to prevent overflow.
Large number of duplicate characters in S and TDynamic programming approach naturally handles duplicates, counting each distinct subsequence instance.