Taro Logo

Longest Common Subsequence

Medium
Google logo
Google
2 views
Topics:
StringsDynamic Programming

Given two strings text1 and text2, return the length of their longest common subsequence. If there is no common subsequence, return 0. A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters.

  • For example, "ace" is a subsequence of "abcde".

A common subsequence of two strings is a subsequence that is common to both strings.

Here are a few examples:

  1. text1 = "abcde", text2 = "ace" should return 3 because the longest common subsequence is "ace" and its length is 3.
  2. text1 = "abc", text2 = "abc" should return 3 because the longest common subsequence is "abc" and its length is 3.
  3. text1 = "abc", text2 = "def" should return 0 because there is no common subsequence.

How would you approach this problem efficiently? Can you discuss the time and space complexity of your solution?

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the maximum lengths of the input strings, and what character set do they contain (e.g., ASCII, Unicode)?
  2. Are the input strings case-sensitive, or should I treat them as case-insensitive?
  3. If multiple longest common subsequences exist, is any one acceptable, or is there a specific preference?
  4. If either input string is null or empty, what should I return (e.g., null, empty string, throw an exception)?
  5. Should the returned subsequence be a new string object, or can I return a substring of one of the inputs?

Brute Force Solution

Approach

The longest common subsequence problem asks us to find the longest sequence of elements that appear in the same order within two different sequences. A brute force approach means we'll explore absolutely every possible subsequence of one sequence and check if it is a subsequence of the other sequence.

Here's how the algorithm would work step-by-step:

  1. Consider the first sequence. Generate every single possible subsequence. This means trying every possible combination of elements from the first sequence, keeping their original order.
  2. For each subsequence we create, check if that exact subsequence exists in the second sequence, maintaining the order.
  3. If a subsequence from the first sequence is found in the second sequence, we call it a common subsequence.
  4. Keep track of all the common subsequences that you find.
  5. After checking all possible subsequences from the first sequence, find the longest common subsequence among the ones you have recorded. That is your answer.

Code Implementation

def longest_common_subsequence_brute_force(first_sequence, second_sequence):
    all_subsequences = generate_all_subsequences(first_sequence)
    common_subsequences = []

    for subsequence in all_subsequences:
        if is_subsequence(subsequence, second_sequence):
            common_subsequences.append(subsequence)

    if not common_subsequences:
        return ""
    
    longest_subsequence = ""
    for subsequence in common_subsequences:
        if len(subsequence) > len(longest_subsequence):
            longest_subsequence = subsequence

    return longest_subsequence

def generate_all_subsequences(sequence):
    subsequences = []
    number_of_elements = len(sequence)

    for i in range(1 << number_of_elements):
        subsequence = ""
        for j in range(number_of_elements):
            # If the j-th bit is set in i, include the element
            if (i >> j) & 1:
                subsequence += sequence[j]
        subsequences.append(subsequence)

    return subsequences

def is_subsequence(subsequence, sequence):
    subsequence_index = 0
    sequence_index = 0

    # Iterate through both sequences to check
    while subsequence_index < len(subsequence) and sequence_index < len(sequence):
        if subsequence[subsequence_index] == sequence[sequence_index]:
            subsequence_index += 1

        sequence_index += 1

    # If we've reached the end of the subsequence, it is a subsequence
    return subsequence_index == len(subsequence)

if __name__ == '__main__':
    sequence_one = "ABCDGH"
    sequence_two = "AEDFHR"
    # Finding the LCS will validate impl
    longest_common = longest_common_subsequence_brute_force(sequence_one, sequence_two)
    print(f"Longest Common Subsequence: {longest_common}")

    sequence_one = "AGGTAB"
    sequence_two = "GXTXAYB"
    longest_common = longest_common_subsequence_brute_force(sequence_one, sequence_two)
    # Finding the LCS will validate impl
    print(f"Longest Common Subsequence: {longest_common}")

    sequence_one = "ABCDAF"
    sequence_two = "ACBCF"
    longest_common = longest_common_subsequence_brute_force(sequence_one, sequence_two)
    # Finding the LCS will validate impl
    print(f"Longest Common Subsequence: {longest_common}")

Big(O) Analysis

Time Complexity
O(2^n * m)Generating all possible subsequences of the first sequence, which has length n, takes O(2^n) time because each element can either be in the subsequence or not. For each of these 2^n subsequences, we need to check if it exists as a subsequence in the second sequence of length m. Checking if a subsequence exists in the second sequence takes O(m) time in the worst case. Therefore, the overall time complexity is O(2^n * m).
Space Complexity
O(2^N)The algorithm generates all possible subsequences of the first sequence. Generating all subsequences of a sequence of length N requires storing up to 2^N subsequences. Each subsequence needs to be stored temporarily for comparison. Thus, the auxiliary space needed to store these subsequences grows exponentially with the input size N.

Optimal Solution

Approach

The best way to find the longest common subsequence is to build up a table that shows the lengths of common subsequences for different parts of the input sequences. This avoids recomputing the same information over and over again, leading to a much faster solution.

Here's how the algorithm would work step-by-step:

  1. Imagine you have a table. The rows represent one sequence, and the columns represent the other sequence.
  2. Each cell in the table will store the length of the longest common subsequence of the parts of the sequences up to that row and column.
  3. Start filling the table from the top left. If the characters at the current row and column match, then the longest common subsequence length is one plus the length in the diagonal cell above and to the left.
  4. If the characters don't match, then the longest common subsequence length is the maximum of the lengths in the cell above and the cell to the left.
  5. Keep filling in the table until you reach the bottom right cell. The value in this cell is the length of the longest common subsequence of the entire sequences.
  6. To find the actual subsequence, start at the bottom right cell and work backwards. If the characters at the current row and column match, then that character is part of the longest common subsequence. Move diagonally up and to the left.
  7. If the characters don't match, move to the cell that has the larger value (either above or to the left). This will lead you back to the beginning of the longest common subsequence.

Code Implementation

def longest_common_subsequence(string1, string2):
    length_of_string1 = len(string1)
    length_of_string2 = len(string2)

    # Initialize a table to store lengths of common subsequences
    dp_table = [([0] * (length_of_string2 + 1)) for _ in range(length_of_string1 + 1)]

    for i in range(1, length_of_string1 + 1):
        for j in range(1, length_of_string2 + 1):
            # If characters match, increment length from diagonal
            if string1[i - 1] == string2[j - 1]:

                dp_table[i][j] = dp_table[i - 1][j - 1] + 1

            # If no match, take the max from top or left
            else:

                dp_table[i][j] = max(dp_table[i - 1][j], dp_table[i][j - 1])

    # Start building the subsequence backwards
    index = dp_table[length_of_string1][length_of_string2]
    longest_subsequence = [''] * index
    i = length_of_string1
    j = length_of_string2

    # Traceback to construct the LCS
    while i > 0 and j > 0:

        if string1[i - 1] == string2[j - 1]:
            # If matched, add character to LCS
            longest_subsequence[index - 1] = string1[i - 1]
            i -= 1
            j -= 1
            index -= 1
        # If not matched, move to the larger value
        elif dp_table[i - 1][j] > dp_table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    
    return "".join(longest_subsequence)

Big(O) Analysis

Time Complexity
O(mn)The algorithm constructs an (m+1) x (n+1) table where m and n are the lengths of the input sequences. Filling each cell in the table involves a constant time operation: either a comparison and addition, or taking the maximum of two adjacent cells. Since we must fill every cell in the table, the algorithm performs a constant amount of work for each of the m*n cells. Therefore, the time complexity is O(mn).
Space Complexity
O(mn)The algorithm's space complexity is dominated by the table used to store lengths of common subsequences. This table has dimensions m x n, where m is the length of the first sequence and n is the length of the second sequence. Therefore, the auxiliary space required grows proportionally to the product of the lengths of the two input sequences. This results in a space complexity of O(mn).

Edge Cases

CaseHow to Handle
One or both input strings are null or emptyReturn an empty string or null depending on requirements when either input is invalid.
One or both input strings are very long (e.g., exceeding reasonable memory limits)The dynamic programming table can consume significant memory; consider iterative approaches or check for potential out-of-memory errors.
Input strings are identicalThe longest common subsequence should be the entire string in this case.
Input strings have no common charactersThe longest common subsequence should be an empty string.
Input strings contain non-ASCII charactersEnsure the character comparison and storage are Unicode-aware to handle diverse character sets correctly.
The same character appears multiple times in both strings, but not in the same order.The dynamic programming approach correctly finds the longest subsequence, even if characters are interleaved.
Integer overflow in the DP table (if lengths are close to the maximum integer value)Use appropriate data types (e.g., long) or check for potential overflows during the DP table updates if very long strings are expected.
One string is a substring of the other.The solution will correctly find the shorter string as the LCS.