Taro Logo

Shortest Common Supersequence

Hard
Meta logo
Meta
6 views
Topics:
StringsDynamic Programming

Given two strings str1 and str2, return the shortest string that has both str1 and str2 as subsequences. If there are multiple valid strings, return any of them.

A string s is a subsequence of string t if deleting some number of characters from t (possibly 0) results in the string s.

For example:

str1 = "abac", str2 = "cab" Output: "cabac"

Explanation: str1 = "abac" is a subsequence of "cabac" because we can delete the first "c". str2 = "cab" is a subsequence of "cabac" because we can delete the last "ac". The answer provided is the shortest such string that satisfies these properties.

str1 = "aaaaaaaa", str2 = "aaaaaaaa" Output: "aaaaaaaa"

Write a function to solve the shortest common supersequence problem. Your function should take two strings as input and return the shortest common supersequence.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Can the input strings contain special characters, or are they limited to alphanumeric characters?
  2. What is the maximum length of the input strings?
  3. If multiple shortest common supersequences exist, is any one of them acceptable, or is there a specific one I should return based on some criteria (e.g., lexicographical order)?
  4. Are the input strings case-sensitive?
  5. What should I return if either of the input strings is null or empty?

Brute Force Solution

Approach

The goal is to find the shortest string that contains both of the original strings as subsequences. The brute force approach is to explore every possible combination of characters from the two strings, checking if that combination is a valid supersequence. We keep looking at every possible combination until we find the absolute shortest.

Here's how the algorithm would work step-by-step:

  1. Start by generating all possible merged strings using characters from both input strings.
  2. For each generated string, check if the first input string appears as a subsequence.
  3. Then, check if the second input string also appears as a subsequence in the generated string.
  4. If both input strings are subsequences, the generated string is a common supersequence.
  5. Compare the length of this common supersequence with the shortest one found so far.
  6. If the new supersequence is shorter, remember it as the shortest.
  7. Repeat this process for every possible merged string combination until all have been explored.
  8. Finally, report the shortest common supersequence that was found.

Code Implementation

def shortest_common_supersequence_brute_force(string1, string2):
    shortest_supersequence = None

    def generate_merged_strings(index1, index2, current_string):
        nonlocal shortest_supersequence

        # If we've reached the end of both strings, check the supersequence
        if index1 == len(string1) and index2 == len(string2):
            if is_subsequence(string1, current_string) and is_subsequence(string2, current_string):
                if shortest_supersequence is None or len(current_string) < len(shortest_supersequence):
                    shortest_supersequence = current_string
            return

        # If we still have characters in string1, explore including them
        if index1 < len(string1):
            generate_merged_strings(index1 + 1, index2, current_string + string1[index1])

        # If we still have characters in string2, explore including them
        if index2 < len(string2):
            generate_merged_strings(index1, index2 + 1, current_string + string2[index2])

    def is_subsequence(subsequence, supersequence):
        subsequence_index = 0
        supersequence_index = 0

        while subsequence_index < len(subsequence) and supersequence_index < len(supersequence):
            if subsequence[subsequence_index] == supersequence[supersequence_index]:
                subsequence_index += 1

            supersequence_index += 1

        # Ensure all characters in subsequence are found
        return subsequence_index == len(subsequence)

    generate_merged_strings(0, 0, "")

    # Return the shortest supersequence found
    return shortest_supersequence

Big(O) Analysis

Time Complexity
O((2^(m+n)) * (m+n))The algorithm generates all possible merged strings from two input strings of length m and n. In the worst case, this involves exploring combinations of length up to m+n, leading to approximately 2^(m+n) possible strings. For each generated string, it checks if both input strings are subsequences, requiring a traversal of the generated string (length m+n) for each check. Therefore, the overall time complexity is approximately O((2^(m+n)) * (m+n)), representing exponential growth based on the combined length of the input strings.
Space Complexity
O(2^(m+n) * (m+n))The algorithm generates all possible merged strings, where m and n are the lengths of the two input strings. In the worst case, the number of such merged strings grows exponentially, specifically O(2^(m+n)). For each generated string, which can have a maximum length of m+n, the algorithm checks if the original strings are subsequences. Storing each generated string thus requires O(m+n) space. Therefore, the overall space complexity is O(2^(m+n) * (m+n)) due to the storage of these generated merged strings.

Optimal Solution

Approach

The best way to find the shortest combined string containing two given strings is by recognizing the overlap between them. We can find the longest shared part, and cleverly merge the strings around that.

Here's how the algorithm would work step-by-step:

  1. First, find the longest common sequence that exists within both input strings. This shared sequence is the key to minimizing the length of the supersequence.
  2. Next, construct the shortest common supersequence by including the characters that are unique to each string, while using the longest common sequence as the overlapping part.
  3. Imagine the two strings being merged together with the common sequence as the bridge connecting them. Any remaining parts are simply attached to the ends of the 'bridge'.
  4. The resulting string will be the shortest possible string containing both of the originals.

Code Implementation

def shortest_common_supersequence(string1, string2):
    length_string1 = len(string1)
    length_string2 = len(string2)

    # Initialize a table to store lengths of common subsequences.
    longest_common_subsequence_lengths = [([0] * (length_string2 + 1)) for _ in range(length_string1 + 1)]

    for i in range(1, length_string1 + 1):
        for j in range(1, length_string2 + 1):
            if string1[i - 1] == string2[j - 1]:
                # If characters match, increment length of common subsequence.
                longest_common_subsequence_lengths[i][j] = longest_common_subsequence_lengths[i - 1][j - 1] + 1

            else:
                # Pick the longest amongst the possible options.
                longest_common_subsequence_lengths[i][j] = max(longest_common_subsequence_lengths[i - 1][j], longest_common_subsequence_lengths[i][j - 1])

    # Backtrack to construct the shortest common supersequence.
    i = length_string1
    j = length_string2
    shortest_common_supersequence_string = ""

    while i > 0 and j > 0:
        if string1[i - 1] == string2[j - 1]:
            # If characters match, include the char and move diagonally.
            shortest_common_supersequence_string = string1[i - 1] + shortest_common_supersequence_string
            i -= 1
            j -= 1

        else:
            # Move to the cell with the longer common subsequence.
            if longest_common_subsequence_lengths[i - 1][j] > longest_common_subsequence_lengths[i][j - 1]:
                # Add the character from string1.
                shortest_common_supersequence_string = string1[i - 1] + shortest_common_supersequence_string
                i -= 1

            else:
                # Add the character from string2.
                shortest_common_supersequence_string = string2[j - 1] + shortest_common_supersequence_string
                j -= 1

    # Add any remaining characters from string1.
    while i > 0:
        shortest_common_supersequence_string = string1[i - 1] + shortest_common_supersequence_string
        i -= 1

    # Add any remaining characters from string2.
    while j > 0:
        shortest_common_supersequence_string = string2[j - 1] + shortest_common_supersequence_string
        j -= 1

    return shortest_common_supersequence_string

Big(O) Analysis

Time Complexity
O(m*n)Finding the longest common subsequence (LCS) between the two input strings of length m and n requires building a table of size (m+1)x(n+1). Each cell in the table is filled by comparing characters from the two strings and potentially taking the maximum of two previously computed values. Filling each cell takes constant time, but since there are m*n cells, the computation of the LCS takes O(m*n) time. Constructing the shortest common supersequence from the LCS takes O(m+n) time which is dominated by O(m*n). Thus the overall time complexity is O(m*n).
Space Complexity
O(N*M)The algorithm primarily uses dynamic programming to find the longest common subsequence (LCS). This typically involves creating a 2D table to store the lengths of LCS for different prefixes of the two input strings, where N and M are the lengths of the two input strings respectively. The size of this table will be (N+1) * (M+1), which is the dominant factor in the auxiliary space complexity. Therefore, the space complexity is O(N*M), where N and M are the lengths of the input strings.

Edge Cases

CaseHow to Handle
Both input strings are emptyReturn an empty string since the shortest common supersequence of two empty strings is an empty string.
One input string is empty, the other is notReturn the non-empty string as it is the shortest common supersequence.
Input strings are identicalReturn either of the input strings as the shortest common supersequence.
Very long input strings (potential memory issues with DP table)Consider using a more space-efficient algorithm or technique such as rolling arrays or divide and conquer if the strings are extremely long.
Input strings contain non-ASCII charactersEnsure the character comparison and string concatenation handles Unicode/UTF-8 characters correctly.
Strings with overlapping characters but no common subsequenceThe LCS-based approach correctly identifies the overlapping characters to construct the SCS.
Multiple valid shortest common supersequences existReturn any one of the valid shortest common supersequences since the problem statement doesn't require a specific one.
Null input stringsThrow an IllegalArgumentException or return null/empty string depending on the API contract, after checking for null inputs.