Taro Logo

Longest Common Subsequence

Medium
Asked by:
Profile picture
Profile picture
Profile picture
Profile picture
+17
More companies
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
221 views
Topics:
StringsDynamic Programming

Given two strings text1 and text2, return the length of their longest common subsequence. If there is no common subsequence, return 0.

A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters.

  • For example, "ace" is a subsequence of "abcde".

A common subsequence of two strings is a subsequence that is common to both strings.

Example 1:

Input: text1 = "abcde", text2 = "ace" 
Output: 3  
Explanation: The longest common subsequence is "ace" and its length is 3.

Example 2:

Input: text1 = "abc", text2 = "abc"
Output: 3
Explanation: The longest common subsequence is "abc" and its length is 3.

Example 3:

Input: text1 = "abc", text2 = "def"
Output: 0
Explanation: There is no such common subsequence, so the result is 0.

Constraints:

  • 1 <= text1.length, text2.length <= 1000
  • text1 and text2 consist of only lowercase English characters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the expected character set for the input strings? For example, are they limited to lowercase English letters, or can they include uppercase, numbers, or special symbols?
  2. What are the constraints on the lengths of `text1` and `text2`? Are they expected to be of similar length, or can one be significantly longer than the other?
  3. Can either of the input strings be empty? If so, what is the expected output when one or both strings are empty?
  4. Is the comparison case-sensitive? For example, should 'A' be considered a common subsequence of 'ABC' and 'a'?
  5. The problem asks for the length of the longest common subsequence. If there are multiple common subsequences with the same maximum length, is just returning that length sufficient?

Brute Force Solution

Approach

To find the longest shared sequence of letters between two words, the brute force strategy is to look at every possible sequence of letters from the first word. Then, for each of those sequences, we check if it can also be found in the second word, in the same order.

Here's how the algorithm would work step-by-step:

  1. First, generate every single possible sequence of letters from the first word, no matter how short or long. A sequence is just a collection of letters from the word, kept in their original order, but not necessarily next to each other.
  2. For example, if the word is 'CAT', the possible sequences are 'C', 'A', 'T', 'CA', 'CT', 'AT', and 'CAT'.
  3. Take each of these generated sequences one by one.
  4. For each sequence, check if you can find its letters in the second word, in the exact same order.
  5. To do this check, scan through the second word. When you find the first letter of the sequence, continue scanning from that point to find the second letter, and so on, until you've found all the letters of the sequence.
  6. If you successfully find all the letters of a sequence in the second word, that sequence is a 'common subsequence'.
  7. Keep a running count of the longest common subsequence you've found so far.
  8. After checking all possible sequences from the first word, the longest one you recorded is your answer.

Code Implementation

def longest_common_subsequence_brute_force(text1, text2):
    longest_length_so_far = 0
    
    def generate_all_subsequences(index, current_subsequence, all_subsequences):
        if index == len(text1):
            if current_subsequence:
                all_subsequences.add(current_subsequence)
            return
        
        # This recursive call generates subsequences that include the character at the current index.
        generate_all_subsequences(index + 1, current_subsequence + text1[index], all_subsequences)
        
        # This recursive call generates subsequences that do NOT include the character at the current index.
        generate_all_subsequences(index + 1, current_subsequence, all_subsequences)
        
    all_subsequences_of_text1 = set()
    generate_all_subsequences(0, "", all_subsequences_of_text1)
    
    for candidate_subsequence in all_subsequences_of_text1:
        text2_pointer = 0
        subsequence_pointer = 0
        
        while text2_pointer < len(text2) and subsequence_pointer < len(candidate_subsequence):
            if text2[text2_pointer] == candidate_subsequence[subsequence_pointer]:
                subsequence_pointer += 1
            text2_pointer += 1
        
        # We must check if the entire candidate subsequence was found within text2.
        if subsequence_pointer == len(candidate_subsequence):
            if len(candidate_subsequence) > longest_length_so_far:
                longest_length_so_far = len(candidate_subsequence)

    return longest_length_so_far

Big(O) Analysis

Time Complexity
Space Complexity
O(M * 2^M)The primary driver of space complexity is the explicit generation and storage of all subsequences from the first word. For a word of length M, there are 2^M possible subsequences, and the longest of these subsequences has length M. Storing all these subsequences simultaneously requires space proportional to the number of subsequences multiplied by their average length, leading to exponential space usage relative to the length of the first word.

Optimal Solution

Approach

Instead of checking every possible combination of shared parts, we can build a solution by making a series of smaller, simpler decisions. We'll create a table to keep track of the best answers for smaller pieces of the original problem and use those results to figure out the final answer.

Here's how the algorithm would work step-by-step:

  1. Imagine a grid where the letters of the first word form the columns and the letters of the second word form the rows.
  2. The goal is to fill in each cell of this grid with a number representing the length of the longest shared sequence up to that point.
  3. Start by looking at the first letter of each word. If they match, the longest shared sequence so far is one letter long. If not, it's zero.
  4. Now, to fill in any other cell in the grid, look at the two corresponding letters from the words.
  5. If the letters match, you've found another shared letter! The value for this cell is one more than the value in the cell diagonally above and to the left.
  6. If the letters do not match, the longest shared sequence at this point is simply the best result you've found so far. So, you take the larger number from either the cell directly above or the cell directly to the left.
  7. Continue filling out the entire grid this way, building upon your previous answers.
  8. Once the whole grid is filled, the number in the very bottom-right corner is your final answer – the length of the longest shared sequence for the two complete words.

Code Implementation

def longest_common_subsequence(text1, text2):
    length_of_text1 = len(text1)
    length_of_text2 = len(text2)

    # This grid stores the LCS length for all subproblems.
    dp_grid = [[0] * (length_of_text2 + 1) for _ in range(length_of_text1 + 1)]

    for row_index in range(1, length_of_text1 + 1):
        for column_index in range(1, length_of_text2 + 1):
            # If characters match, the LCS is one longer than the LCS of the preceding substrings.
            if text1[row_index - 1] == text2[column_index - 1]:
                dp_grid[row_index][column_index] = 1 + dp_grid[row_index - 1][column_index - 1]
            else:
                # If they don't match, we must take the best result from the subproblems.
                dp_grid[row_index][column_index] = max(dp_grid[row_index - 1][column_index], dp_grid[row_index][column_index - 1])

    # The bottom-right cell contains the solution for the full strings.
    return dp_grid[length_of_text1][length_of_text2]

Big(O) Analysis

Time Complexity
O(n * m)The solution involves creating a grid where the dimensions are determined by the lengths of the two input strings, let's call them n and m. To find the final answer, we must compute a value for every single cell within this grid. The process of filling the grid requires iterating through each row (corresponding to a character in one string) and for each row, iterating through every column (corresponding to a character in the other string). This nested iteration over the lengths of the two strings results in a total of n * m operations, which simplifies to a time complexity of O(n * m).
Space Complexity
O(M*N)The plain English explanation describes creating a grid to store intermediate results for smaller pieces of the problem. This grid has its rows corresponding to one word and columns to the other, requiring us to build an auxiliary 2D data structure. If the first word has a length of M and the second has a length of N, this grid will have M * N cells. Therefore, the extra space required is directly proportional to the product of the lengths of the two input strings.

Edge Cases

One or both input strings are empty
How to Handle:
The longest common subsequence with an empty string is always of length 0, which the base cases of the dynamic programming solution correctly establish.
Input strings are identical
How to Handle:
The algorithm will correctly return the length of the string itself, as the entire string is its own longest common subsequence.
Input strings have no characters in common
How to Handle:
The algorithm will correctly return 0 as no matches will be found during the comparison process.
One string is a subsequence of the other
How to Handle:
The solution correctly identifies the shorter string as the longest common subsequence and returns its length.
Strings with a single character
How to Handle:
The solution will correctly return 1 if the characters are the same and 0 otherwise.
Strings containing special characters, numbers, or different cases
How to Handle:
The character comparisons are based on their underlying values, so the algorithm handles any valid character set correctly.
Very long input strings (e.g., up to 1000 characters each)
How to Handle:
A dynamic programming approach with O(m*n) time and space complexity is efficient enough to pass within typical time limits for these constraints.
Strings with highly repetitive characters (e.g., 'aaaaa' and 'aa')
How to Handle:
The dynamic programming solution correctly handles repeated characters by considering their positions and relative order.