Taro Logo

Regular Expression Matching

Hard
Amazon logo
Amazon
3 views
Topics:
StringsDynamic ProgrammingRecursion

Given an input string s and a pattern p, implement regular expression matching with support for . and * where:

  • . Matches any single character.
  • * Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

For example:

  1. s = "aa", p = "a" Output: false Explanation: "a" does not match the entire string "aa".
  2. s = "aa", p = "a*" Output: true Explanation: '*' means zero or more of the preceding element, 'a'. Therefore, by repeating 'a' once, it becomes "aa".
  3. s = "ab", p = "." Output: true Explanation: "." means "zero or more (*) of any character (.)".

Explain your approach and provide the code with time and space complexity analysis.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Can the pattern `p` contain characters other than lowercase letters, '.', and '*'?
  2. If the pattern `p` starts with '*', can I assume the input string `s` is empty or that the pattern is invalid?
  3. Is an empty string `s` a valid input, and if so, what should the output be for different patterns `p`?
  4. How should I handle consecutive '*' characters in the pattern `p`? Should they be treated as a single '*'?
  5. Is the matching expected to be an exact match of the entire string `s`, or can it be a substring match?

Brute Force Solution

Approach

The goal is to see if a given text matches a specific pattern. The brute force method involves trying all possible ways the pattern could match the text by considering every potential alignment and interpretation of the pattern's special characters.

Here's how the algorithm would work step-by-step:

  1. Start by checking if the beginning of the text matches the beginning of the pattern.
  2. If the pattern has a '*' character, try matching zero, one, or multiple instances of the preceding character in the pattern to parts of the text.
  3. If the pattern has a '.' character, try matching it to any single character in the text.
  4. For each possible match found at the beginning, see if the rest of the pattern matches the rest of the text in the same way.
  5. Keep exploring different matching possibilities by shifting where the pattern starts matching in the text and repeating the above steps.
  6. If any of these possibilities leads to the entire pattern matching the entire text, then we have a successful match.

Code Implementation

def is_regular_expression_match(text, pattern):
    text_length = len(text)
    pattern_length = len(pattern)

    def backtrack(text_index, pattern_index):
        # If we've reached the end of the pattern, check if we're also at the end of the text.
        if pattern_index == pattern_length:
            return text_index == text_length

        # Check if the next character in the pattern is '*'.
        if pattern_index + 1 < pattern_length and pattern[pattern_index + 1] == '*':
            # Try matching zero occurrences of the character before '*'.
            if backtrack(text_index, pattern_index + 2):
                return True

            # Try matching one or more occurrences.
            while text_index < text_length and (pattern[pattern_index] == '.' or text[text_index] == pattern[pattern_index]):
                if backtrack(text_index + 1, pattern_index + 2):
                    return True
                text_index += 1

            return False

        # If the next pattern character is not '*' then need to check the current characters.
        else:
            if text_index < text_length and (pattern[pattern_index] == '.' or text[text_index] == pattern[pattern_index]):
                return backtrack(text_index + 1, pattern_index + 1)
            else:
                return False

    return backtrack(0, 0)

Big(O) Analysis

Time Complexity
O(2^(m+n))The provided regular expression matching algorithm, in the worst case, explores all possible alignments and interpretations of the pattern within the text due to the '*' and '.' characters. 'n' represents the length of the text and 'm' represents the length of the pattern. The algorithm recursively tries different matching possibilities which can lead to exponential branching, specifically O(2^(m+n)). This exponential behavior arises because, for each character in both the text and pattern, the algorithm might explore multiple paths depending on whether the characters match directly or if wildcard characters are involved.
Space Complexity
O(N*M)The described approach utilizes recursion to explore all possible matching combinations between the text and the pattern. The maximum depth of the recursion depends on the lengths of the text and the pattern, where N is the length of the text and M is the length of the pattern. In the worst-case scenario, each recursive call stores its own copies of variables and possibly substrings within stack frames. Therefore, the space complexity is proportional to the product of the lengths of text and pattern which translates to O(N*M) due to the overlapping subproblems that are inherently explored by brute-force recursion.

Optimal Solution

Approach

The key to solving this efficiently is to avoid recomputing results we've already found. We use a technique called dynamic programming to break the problem down into smaller, overlapping subproblems, solving each only once and storing the results to reuse later.

Here's how the algorithm would work step-by-step:

  1. Imagine a table where each cell represents whether a part of the text matches a part of the pattern.
  2. We'll fill this table in a specific order, starting from the beginning of both the text and the pattern.
  3. For each cell, we check if the current character in the text matches the current character (or a wildcard) in the pattern.
  4. If it matches, the answer for that cell depends on whether the previous parts of the text and pattern matched. We look at the cell diagonally up and to the left.
  5. If the pattern has a '*' character, it means the previous character in the pattern can appear zero or more times. This is the tricky part. We need to consider both possibilities: either we use the character (and stay in the same row), or we don't use the character (and move up one row).
  6. The final cell in the table tells us whether the entire text matches the entire pattern. Because we filled the table in a smart way, we only needed to do each calculation once, making the whole process much faster.

Code Implementation

def is_match(text, pattern): 
    text_length = len(text)
    pattern_length = len(pattern)

    # dp_table[i][j] is true if text[0..i-1] matches pattern[0..j-1]
    dp_table = [[False] * (pattern_length + 1) for _ in range(text_length + 1)]

    # Empty text and empty pattern match
    dp_table[0][0] = True

    # Deals with patterns like a* or a*b* or a*b*c*
    for pattern_index in range(1, pattern_length + 1): 
        if pattern[pattern_index - 1] == '*':
            dp_table[0][pattern_index] = dp_table[0][pattern_index - 2]

    for text_index in range(1, text_length + 1): 
        for pattern_index in range(1, pattern_length + 1):
            if pattern[pattern_index - 1] == text[text_index - 1] or pattern[pattern_index - 1] == '.':
                # Characters match, so result depends on previous subproblem
                dp_table[text_index][pattern_index] = dp_table[text_index - 1][pattern_index - 1]
            elif pattern[pattern_index - 1] == '*':
                # '*' matches zero occurrences of the previous element
                dp_table[text_index][pattern_index] = dp_table[text_index][pattern_index - 2]

                # '*' matches one or more occurrences of the previous element
                if pattern[pattern_index - 2] == text[text_index - 1] or pattern[pattern_index - 2] == '.':
                    # If character before '*' matches text char or is a wildcard.
                    dp_table[text_index][pattern_index] = dp_table[text_index][pattern_index] or dp_table[text_index - 1][pattern_index]
            else:
                # No match
                dp_table[text_index][pattern_index] = False

    # Result is stored in the bottom right cell
    return dp_table[text_length][pattern_length]

Big(O) Analysis

Time Complexity
O(m*n)The dynamic programming solution constructs a table of size (m+1) x (n+1), where m is the length of the text and n is the length of the pattern. Filling each cell in the table involves a constant amount of work, primarily comparing characters and potentially considering multiple possibilities if a '*' is present in the pattern. Since we must iterate through all m*n cells, the dominant factor determining runtime is the size of the table itself. Therefore, the overall time complexity is O(m*n).
Space Complexity
O(M*N)The algorithm utilizes a table (a 2D array) to store intermediate results using dynamic programming. The dimensions of this table are determined by the lengths of the text and the pattern. If the text has length N and the pattern has length M, the table will have M rows and N columns. Therefore, the auxiliary space required is proportional to the product of M and N, where N is the length of text and M is the length of the pattern.

Edge Cases

CaseHow to Handle
Empty text stringIf the text string is empty, the regular expression must also be empty or consist only of '*' characters to match.
Empty pattern stringAn empty pattern only matches an empty text string.
Pattern starts with '*'A pattern starting with '*' is invalid and should be handled as no match unless the text is empty and the pattern is only '*'.
Consecutive '*'' characters in patternTreat consecutive '*' characters as a single '*' as multiple consecutive '*' chars have the same matching effect.
Text string contains special regex charactersEscaping special regex characters in the text or using character class matching is necessary for correct matching.
Pattern contains only '*'A pattern containing only '*' should match any string, including the empty string.
Text string is very longThe algorithm's time complexity (especially for backtracking solutions) may become an issue with very long text strings; consider dynamic programming.
Pattern contains many '*'Excessive '*' characters can lead to exponential backtracking in recursive solutions, thus DP is advised.