Taro Logo

Number of Matching Subsequences

Medium
2 months ago

Given a string s and an array of strings words, return the number of words[i] that is a subsequence of s. A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters. For example, "ace" is a subsequence of "abcde".

Example 1: Input: s = "abcde", words = ["a","bb","acd","ace"] Output: 3 Explanation: There are three strings in words that are a subsequence of s: "a", "acd", "ace".

Example 2: Input: s = "dsahjpjauf", words = ["ahjpjau","ja","ahbwzgqnuk","tnmlanowax"] Output: 2

Constraints:

  • 1 <= s.length <= 5 * 10^4
  • 1 <= words.length <= 5000
  • 1 <= words[i].length <= 50
  • s and words[i] consist of only lowercase English letters.
Sample Answer

Naive Solution

The naive solution is to iterate through each word in the words array and check if it is a subsequence of the string s. For each word, we can use two pointers, one for the string s and one for the current word. We iterate through s, and if we find a character that matches the current character in the word, we advance the word pointer. If the word pointer reaches the end of the word, it means the word is a subsequence of s.

def is_subsequence(s, word):
    i = 0
    j = 0
    while i < len(s) and j < len(word):
        if s[i] == word[j]:
            j += 1
        i += 1
    return j == len(word)

def num_matching_subsequence_naive(s, words):
    count = 0
    for word in words:
        if is_subsequence(s, word):
            count += 1
    return count

# Example usage
s = "abcde"
words = ["a", "bb", "acd", "ace"]
result = num_matching_subsequence_naive(s, words)
print(f"Number of matching subsequences: {result}") # Output: 3

Optimal Solution

An optimized solution is to use a hash map to store the indices of each character in the string s. Then, for each word in the words array, we can iterate through the characters of the word and use binary search to find the next index in s that is greater than the previous index we found. This avoids iterating through the entire string s for each character in the word.

from collections import defaultdict

def num_matching_subsequence_optimal(s, words):
    char_indices = defaultdict(list)
    for i, char in enumerate(s):
        char_indices[char].append(i)

    count = 0
    for word in words:
        prev_index = -1
        is_subseq = True
        for char in word:
            indices = char_indices[char]
            
            # Binary search for the next valid index
            l, r = 0, len(indices) - 1
            next_index = -1
            while l <= r:
                mid = (l + r) // 2
                if indices[mid] > prev_index:
                    next_index = indices[mid]
                    r = mid - 1
                else:
                    l = mid + 1
            
            if next_index == -1:
                is_subseq = False
                break
            prev_index = next_index
        if is_subseq:
            count += 1
    return count

# Example usage
s = "abcde"
words = ["a", "bb", "acd", "ace"]
result = num_matching_subsequence_optimal(s, words)
print(f"Number of matching subsequences: {result}") # Output: 3

s = "dsahjpjauf"
words = ["ahjpjau", "ja", "ahbwzgqnuk", "tnmlanowax"]
result = num_matching_subsequence_optimal(s, words)
print(f"Number of matching subsequences: {result}") # Output: 2

Big(O) Run-time Analysis

Naive Solution

  • The is_subsequence function takes O(n) time, where n is the length of the string s.
  • The num_matching_subsequence_naive function iterates through each word in the words array, and for each word, it calls the is_subsequence function. Therefore, the overall time complexity is O(m * n * k), where m is the number of words in the words array, n is the average length of the strings in words, and k is the length of string s.

Optimal Solution

  • Creating the char_indices hash map takes O(k) time, where k is the length of the string s.
  • For each word in the words array, we iterate through the characters of the word and use binary search to find the next index in s. The binary search takes O(log p) time, where p is the number of occurrences of the character in s. In the worst case, p can be k.
  • Therefore, the overall time complexity is O(k + m * n * log k), where m is the number of words in the words array, n is the average length of the strings in words, and k is the length of the string s.

Big(O) Space Usage Analysis

Naive Solution

  • The is_subsequence function takes O(1) space.
  • The num_matching_subsequence_naive function takes O(1) additional space. Thus, the overall space complexity is O(1).

Optimal Solution

  • The char_indices hash map takes O(k) space, where k is the length of the string s (in the worst case, all characters in s are distinct, and we store an index for each character).
  • The num_matching_subsequence_optimal function takes O(1) additional space. Thus, the overall space complexity is O(k).

Edge Cases

  1. Empty string s: If the string s is empty, then no word can be a subsequence of s, unless the word is also empty. Handle this explicitly in the beginning.
  2. Empty word in words: An empty word is always a subsequence of any string s. Count it towards result.
  3. Large number of words: If the number of words is very large, the optimal solution is more efficient since it precomputes character indices for s which amortizes the cost of searching within s.
  4. Very long words: If some words are exceptionally long, the binary search portion may become significant, but it's still more efficient than comparing against the entire s string in a nested manner for each word.
  5. Words containing characters not in s: If a word contains a character not present in s, it's automatically not a subsequence. The optimal solution efficiently handles this due to the char_indices hashmap; a missing key immediately flags the word as not being a subsequence.