Taro Logo

Number of Matching Subsequences

Medium
Asked by:
Profile picture
Profile picture
Profile picture
Profile picture
+4
More companies
Profile picture
Profile picture
Profile picture
Profile picture
51 views
Topics:
StringsTwo PointersBinary Search

Given a string s and an array of strings words, return the number of words[i] that is a subsequence of s.

A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters.

  • For example, "ace" is a subsequence of "abcde".

Example 1:

Input: s = "abcde", words = ["a","bb","acd","ace"]
Output: 3
Explanation: There are three strings in words that are a subsequence of s: "a", "acd", "ace".

Example 2:

Input: s = "dsahjpjauf", words = ["ahjpjau","ja","ahbwzgqnuk","tnmlanowax"]
Output: 2

Constraints:

  • 1 <= s.length <= 5 * 104
  • 1 <= words.length <= 5000
  • 1 <= words[i].length <= 50
  • s and words[i] consist of only lowercase English letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the maximum lengths of the string `s` and the words in the `words` array? Are there any specific constraints on these lengths?
  2. Can the string `s` or the words in the `words` array be empty or null?
  3. Are the strings in the `words` array guaranteed to contain only lowercase English letters, like the main string `s`?
  4. If a word in `words` appears multiple times as a subsequence in `s`, should it be counted multiple times in the final count?
  5. Is case sensitivity important? Should I assume all strings are lowercase, or do I need to handle uppercase letters as well?

Brute Force Solution

Approach

The brute force method to find matching subsequences involves checking each word in a list to see if it can be formed by picking characters from another bigger word in the correct order. We'll try every possible way to match the characters.

Here's how the algorithm would work step-by-step:

  1. Take the first word from the list of smaller words.
  2. Compare the first letter of the small word with the letters of the big word, one by one, from left to right.
  3. If you find a matching letter in the big word, move to the next letter in the small word and continue comparing from where you left off in the big word.
  4. If you reach the end of the small word by finding matching letters in the big word in the correct order, it means this small word is a matching subsequence. Keep a count of it.
  5. If you reach the end of the big word before finishing the small word, it means this small word is not a matching subsequence.
  6. Repeat this process for every word in the list of smaller words.
  7. Finally, return the total count of small words that were matching subsequences of the big word.

Code Implementation

def number_of_matching_subsequences_brute_force(big_word, list_of_smaller_words):
    matching_subsequence_count = 0

    for small_word in list_of_smaller_words:
        small_word_index = 0
        big_word_index = 0

        while small_word_index < len(small_word) and big_word_index < len(big_word):
            # We are checking if the characters match.
            if small_word[small_word_index] == big_word[big_word_index]:
                small_word_index += 1
            big_word_index += 1

        # Increment count if small_word is subsequence.
        if small_word_index == len(small_word):
            matching_subsequence_count += 1

    return matching_subsequence_count

Big(O) Analysis

Time Complexity
O(S * L)The algorithm iterates through each of the S words in the input array. For each word, the algorithm iterates through the longer string L to find matching characters. Therefore, in the worst-case scenario where each word is checked against the entirety of L, the time complexity is proportional to the product of the number of words (S) and the length of the longer string (L). This results in a time complexity of O(S * L).
Space Complexity
O(1)The described brute force method iterates through the words and compares characters using a few index variables to track progress in both the big word and the small word being checked. No auxiliary data structures like arrays, hashmaps, or significant recursion are created or used. The space required for these index variables is constant regardless of the size of the input words or the number of smaller words, thus the auxiliary space complexity is O(1).

Optimal Solution

Approach

Instead of checking if each word in the word list is a subsequence of the big string individually, we process the big string only once. We keep track of where we are in each word of the word list as we go through the big string. This way we only process each character of the big string once.

Here's how the algorithm would work step-by-step:

  1. For each word in the word list, create a pointer that starts at the beginning of the word.
  2. Go through the big string, character by character.
  3. For each character in the big string, check each word's pointer. If the character in the big string matches the character that the word's pointer is pointing to, advance that pointer to the next character in the word.
  4. If a word's pointer reaches the end of the word, that means that word is a subsequence of the big string. Count it.
  5. After you have gone through the entire big string, the number of words with pointers at the end is the answer.

Code Implementation

def number_of_matching_subsequences(big_string, words):
    word_pointers = [0] * len(words)
    matching_subsequence_count = 0

    # Iterate through the big string
    for big_string_character in big_string:

        for word_index, word in enumerate(words):
            # Check if the current character matches the character pointed to by the pointer
            if word_pointers[word_index] < len(word) and big_string_character == word[word_pointers[word_index]]:
                # Advance the pointer if there is a match
                word_pointers[word_index] += 1

            # Check if the word has been fully matched
            if word_pointers[word_index] == len(word):
                matching_subsequence_count += 1
                # Reset the pointer so the word is only counted once
                word_pointers[word_index] = float('inf')

    return matching_subsequence_count

Big(O) Analysis

Time Complexity
O(S + L*W)The solution iterates through the big string S of length S once. Inside this loop, for each character in S, it potentially iterates through all the words in the word list L (of length L). For each word in the word list, it checks its pointer against the current character in S. The maximum length of all words in the word list is represented as W. Therefore, the time complexity for checking the pointers is L * W within the loop iterating through S. Thus, the overall time complexity is O(S + L*W).
Space Complexity
O(L)The primary auxiliary space usage comes from storing the pointers for each word in the word list. Since we have to keep track of the current index/pointer for each of the L words in the word list, where L is the number of words in the word list, the space required will be proportional to the number of words. Therefore, the space complexity is O(L).

Edge Cases

Empty string S or empty words array
How to Handle:
If S is empty, return 0. If words is empty, return 0.
String S is very long
How to Handle:
Pre-compute indices for each character of S to improve efficiency for repeated subsequence checks.
A word in words is longer than S
How to Handle:
A word longer than S cannot be a subsequence, so skip it.
The words array contains duplicate words
How to Handle:
Count each unique word and its frequency to avoid redundant subsequence checks, incrementing result by frequency.
All words are subsequences of S
How to Handle:
The solution should correctly count all words as subsequences if they are indeed subsequences.
No words are subsequences of S
How to Handle:
The solution should return 0 if no words are subsequences of S.
Word contains characters not in S
How to Handle:
The solution should identify that the word cannot be a subsequence and skip counting it.
Extremely long words array with many distinct words
How to Handle:
Efficient data structures are needed to quickly check and count subsequences to avoid exceeding time limit.