Taro Logo

Shortest Matching Substring

Hard
Asked by:
Profile picture
4 views
Topics:
Strings

You are given a string s and a pattern string p, where p contains exactly two '*' characters.

The '*' in p matches any sequence of zero or more characters.

Return the length of the shortest substring in s that matches p. If there is no such substring, return -1.

Note: The empty substring is considered valid.

Example 1:

Input: s = "abaacbaecebce", p = "ba*c*ce"

Output: 8

Explanation:

The shortest matching substring of p in s is "baecebce".

Example 2:

Input: s = "baccbaadbc", p = "cc*baa*adb"

Output: -1

Explanation:

There is no matching substring in s.

Example 3:

Input: s = "a", p = "**"

Output: 0

Explanation:

The empty substring is the shortest matching substring.

Example 4:

Input: s = "madlogic", p = "*adlogi*"

Output: 6

Explanation:

The shortest matching substring of p in s is "adlogi".

Constraints:

  • 1 <= s.length <= 105
  • 2 <= p.length <= 105
  • s contains only lowercase English letters.
  • p contains only lowercase English letters and exactly two '*'.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Can you clarify the character set for both the input string and the pattern string? Should I assume ASCII, or are Unicode characters possible?
  2. What should I return if the pattern string is not found as a substring within the input string? Should I return null, an empty string, or raise an exception?
  3. Can either the input string or the pattern string be empty or null? If so, what is the expected behavior?
  4. If multiple shortest matching substrings exist, should I return any one of them, or is there a specific criteria for which one to return (e.g., the first occurrence)?
  5. Is the pattern considered to be case-sensitive? For example, should "abc" match "Abc"?

Brute Force Solution

Approach

The brute force strategy for finding the shortest matching substring involves checking every possible substring within the larger string to see if it contains the target string. We systematically explore all options, starting with the smallest possible substrings and increasing their size until a match is found. Then, we compare all the matching substrings to find the shortest one.

Here's how the algorithm would work step-by-step:

  1. First, consider all substrings of length 1 within the larger string.
  2. Check if any of those single-character substrings contain the target string. If so, note it.
  3. Next, consider all substrings of length 2 within the larger string.
  4. Again, check if any of these two-character substrings contain the target string, and note any matches.
  5. Continue this process, increasing the length of the substrings you are checking each time.
  6. If you find multiple substrings that contain the target string, keep track of the shortest one you've seen so far.
  7. Once you have checked all possible substrings, the shortest substring you noted will be the result.

Code Implementation

def shortest_matching_substring_brute_force(larger_string, target_string):
    shortest_substring = None
    # Iterate through all possible substring lengths
    for substring_length in range(1, len(larger_string) + 1):
        # Iterate through all possible start positions for the current length
        for start_index in range(len(larger_string) - substring_length + 1):
            substring = larger_string[start_index:start_index + substring_length]

            # Check if the current substring contains the target string
            if target_string in substring:

                #If this is the first match, or shorter than prev
                if shortest_substring is None or len(substring) < len(shortest_substring):
                    shortest_substring = substring

    return shortest_substring

Big(O) Analysis

Time Complexity
O(n³)The algorithm iterates through all possible substring lengths, from 1 up to n, where n is the length of the larger string. For each length, it iterates through all possible starting positions for that substring, which is also proportional to n. Within each substring, it checks if the target string is contained, which takes O(n) time in the worst case (using naive contains). Therefore, the overall time complexity is O(n * n * n), or O(n³).
Space Complexity
O(1)The provided brute force strategy primarily uses variables to store the current substring being checked and the shortest matching substring found so far. The number of these variables remains constant regardless of the input string's length, N. No auxiliary data structures like arrays or hash maps are created that scale with N. Therefore, the space complexity is constant.

Optimal Solution

Approach

The goal is to find the smallest section of a big text that contains all the words from a smaller list of words. We achieve this efficiently by expanding a window and shrinking it whenever possible, avoiding unnecessary checks.

Here's how the algorithm would work step-by-step:

  1. First, create a window that starts at the beginning of the big text.
  2. Expand the window to the right until it contains all the words from the smaller list at least once.
  3. Once all the words are found, try shrinking the window from the left. Check if removing the leftmost word still leaves all required words present in the window.
  4. Keep shrinking the window from the left until removing the leftmost word would cause the window to no longer contain all the required words.
  5. If the current window is smaller than the smallest one found so far, remember it.
  6. Move the right edge of the window one step further and repeat the process of expanding and shrinking until the right edge reaches the end of the big text.
  7. The smallest window recorded during this process is the shortest matching substring.

Code Implementation

def find_shortest_matching_substring(text, words):
    if not text or not words:
        return ""

    word_frequencies = {}
    for word in words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

    required_words_count = len(word_frequencies)
    formed_words_count = 0

    window_start = 0
    window_end = 0

    word_index_map = {}

    min_length = float('inf')
    min_start = 0

    while window_end < len(text):
        current_word = text[window_end]
        if current_word in word_frequencies:
            word_index_map[current_word] = word_index_map.get(current_word, 0) + 1
            if word_index_map[current_word] == word_frequencies[current_word]:
                formed_words_count += 1

        # Shrink the window when all words are found
        while formed_words_count == required_words_count:
            if window_end - window_start + 1 < min_length:
                min_length = window_end - window_start + 1
                min_start = window_start

            left_word = text[window_start]

            #Only adjust if the leftmost word is in the target
            if left_word in word_frequencies:

                word_index_map[left_word] -= 1

                #If no longer satisfied, decrement the count
                if word_index_map[left_word] < word_frequencies[left_word]:
                    formed_words_count -= 1

            window_start += 1

        window_end += 1

    if min_length == float('inf'):
        return ""

    # Return the shortest substring
    return text[min_start:min_start + min_length]

Big(O) Analysis

Time Complexity
O(n)The algorithm iterates through the text using a sliding window approach. The outer loop expands the window (right pointer), which iterates through the entire text of length n in the worst case. The inner loop shrinks the window (left pointer). Although there's a nested loop structure, each index in the text is visited at most twice, once by the right pointer and once by the left pointer. Therefore, the total number of operations is proportional to n, resulting in a time complexity of O(n).
Space Complexity
O(K)The space complexity is determined by the auxiliary space required to store the frequency of words to find. This is represented by the smaller list of words, denoted as K. The algorithm keeps track of the required words and their counts using a hash map or similar data structure, hence the space used is proportional to the number of words in the smaller list. Therefore, the auxiliary space is O(K) where K is the number of unique words we are searching for.

Edge Cases

Null or empty source string
How to Handle:
Return an empty string or null, indicating no matching substring found, and handle potential NullPointerException.
Null or empty target string
How to Handle:
Return the original source string if the target is empty as every source string contains an empty string or throw an IllegalArgumentException if nulls are not allowed.
Target string longer than source string
How to Handle:
Return an empty string or null, indicating no match is possible, since the target can't be a substring of the source.
Source string contains only target string characters
How to Handle:
Return the shortest substring containing all target characters, even if the target appears elsewhere within it.
Target string contains duplicate characters
How to Handle:
Ensure the solution correctly accounts for multiple occurrences of characters in the target when searching for a match.
Source string with multiple occurrences of target string
How to Handle:
The solution should find the shortest substring, not necessarily the first occurrence.
No matching substring exists
How to Handle:
Return an empty string or null if the source string does not contain all characters in the target string.
Very large strings (performance concerns)
How to Handle:
Ensure the algorithm scales efficiently (ideally O(n) or O(n log n)) and avoids excessive memory allocation for large inputs by using sliding window and character frequency map.