Taro Logo

Substring with Concatenation of All Words

Medium
1 views
a month ago

You are given a string s and an array of strings words. All the strings of words are of the same length.

A concatenated string is a string that exactly contains all the strings of any permutation of words concatenated.

  • For example, if words = ["ab","cd","ef"], then "abcdef", "abefcd", "cdabef", "cdefab", "efabcd", and "efcdab" are all concatenated strings. "acdbef" is not a concatenated string because it is not the concatenation of any permutation of words.

Return an array of the starting indices of all the concatenated substrings in s. You can return the answer in any order.

Example 1:

Input: s = "barfoothefoobarman", words = ["foo","bar"] Output: [0,9]

Explanation: The substring starting at 0 is "barfoo". It is the concatenation of ["bar","foo"] which is a permutation of words. The substring starting at 9 is "foobar". It is the concatenation of ["foo","bar"] which is a permutation of words.

Example 2:

Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"] Output: []

Explanation: There is no concatenated substring.

Example 3:

Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"] Output: [6,9,12]

Explanation: The substring starting at 6 is "foobarthe". It is the concatenation of ["foo","bar","the"]. The substring starting at 9 is "barthefoo". It is the concatenation of ["bar","the","foo"]. The substring starting at 12 is "thefoobar". It is the concatenation of ["the","foo","bar"].

Constraints:

  • 1 <= s.length <= 10^4
  • 1 <= words.length <= 5000
  • 1 <= words[i].length <= 30
  • s and words[i] consist of lowercase English letters.
Sample Answer
from collections import Counter


def find_substring(s: str, words: list[str]) -> list[int]:
    """Finds the starting indices of concatenated substrings in s."""
    if not s or not words:
        return []

    word_length = len(words[0])
    num_words = len(words)
    substring_length = word_length * num_words
    result = []

    # Count the frequency of each word in the words array.
    word_count = Counter(words)

    # Iterate through the string s.
    for i in range(len(s) - substring_length + 1):
        # Create a substring of length substring_length.
        substring = s[i : i + substring_length]
        
        # Create a counter for the words in the substring.
        substring_word_count = Counter()
        
        # Iterate through the substring and count the words.
        for j in range(0, substring_length, word_length):
            word = substring[j : j + word_length]
            substring_word_count[word] += 1

        # If the word counts are the same, add the index to the result.
        if substring_word_count == word_count:
            result.append(i)

    return result

# Example Usage
s = "barfoothefoobarman"
words = ["foo", "bar"]
print(find_substring(s, words))  # Output: [0, 9]

s = "wordgoodgoodgoodbestword"
words = ["word", "good", "best", "word"]
print(find_substring(s, words))  # Output: []

s = "barfoofoobarthefoobarman"
words = ["bar", "foo", "the"]
print(find_substring(s, words))  # Output: [6, 9, 12]

Explanation:

  1. Initialization: The function first handles edge cases where the input string s or the words array is empty. It calculates the length of each word, the total number of words, and the total length of the concatenated substring we're looking for.
  2. Word Frequency: A Counter object, word_count, stores the frequency of each word in the words array. This will be used to compare against the frequency of words in the substrings of s.
  3. Iteration: The function iterates through the string s using a sliding window approach. The window size is equal to the total length of the concatenated substring.
  4. Substring Analysis: For each window, a new Counter object, substring_word_count, is created to store the frequency of each word in the current substring. The substring is divided into chunks of size word_length, and each chunk is treated as a word.
  5. Comparison: The substring_word_count is compared with the word_count. If they are identical, it means the current substring is a valid concatenation of the words in words, and the starting index i is added to the result list.
  6. Return: Finally, the function returns the result list containing the starting indices of all valid concatenated substrings.

Brute Force Solution:

A brute-force solution would involve generating all possible permutations of the words array and then searching for each permutation within the string s. This approach is highly inefficient because generating permutations has a time complexity of O(n!), where n is the number of words. Moreover, searching for each permutation in s would further increase the time complexity.

Optimal Solution:

The provided solution is an optimized approach that avoids generating permutations. Instead, it uses a sliding window and frequency counting to efficiently identify concatenated substrings. By using Counter objects, the solution achieves a time complexity of O(N * M), where N is the length of the string s and M is the number of words.

Big(O) Runtime Analysis:

  • Outer Loop: The outer loop iterates through the string s up to len(s) - substring_length + 1 times. This loop runs O(N - M * L) times, where N is the length of s, M is the number of words, and L is the length of each word.

  • Inner Loop: The inner loop iterates through the substring in steps of word_length. This loop runs M times, where M is the number of words.

  • Counter Comparison: Comparing the substring_word_count with word_count takes O(M) time in the worst case.

    Thus, the overall time complexity is O((N - M * L) * M), which simplifies to O(N * M) when M * L is relatively small compared to N. If we consider L a constant, then the runtime is O(N*M).

Big(O) Space Usage Analysis:

  • word_count: The word_count Counter stores the frequency of each word in words. The space required is O(M), where M is the number of words.

  • substring_word_count: The substring_word_count Counter stores the frequency of each word in the current substring. The space required is O(M).

  • result: The result list stores the starting indices of the concatenated substrings. In the worst case, it could store up to N - M * L indices, where N is the length of s, M is the number of words, and L is the length of each word. However, in many practical cases, the number of concatenated substrings is much smaller than N - M * L.

    Therefore, the overall space complexity is O(M).

Edge Cases and How to Handle Them:

  1. Empty Input: If the input string s or the words array is empty, the function should return an empty list.
  2. Unequal Word Lengths: If the words in the words array have different lengths, the function should return an empty list because the problem statement specifies that all words have the same length.
  3. Words Not Found: If the substring contains words that are not in the words array, the substring_word_count will not match the word_count, and the substring will be skipped.
  4. Duplicate Words: The Counter objects handle duplicate words correctly by keeping track of their frequencies.
  5. Overlapping Substrings: If there are overlapping concatenated substrings, the function will identify all of them and add their starting indices to the result list.
  6. Word Length of Zero: If the word length is zero the algorithm will divide by zero. Return an empty list.