Taro Logo

Substring with Concatenation of All Words

Medium
2 views
2 months ago

You are given a string s and an array of strings words. All the strings of words are of the same length.

A concatenated string is a string that exactly contains all the strings of any permutation of words concatenated.

  • For example, if words = ["ab","cd","ef"], then "abcdef", "abefcd", "cdabef", "cdefab", "efabcd", and "efcdab" are all concatenated strings. "acdbef" is not a concatenated string because it is not the concatenation of any permutation of words.

Return an array of the starting indices of all the concatenated substrings in s. You can return the answer in any order.

Example 1:

Input: s = "barfoothefoobarman", words = ["foo","bar"]

Output: [0,9]

Explanation:

The substring starting at 0 is "barfoo". It is the concatenation of ["bar","foo"] which is a permutation of words. The substring starting at 9 is "foobar". It is the concatenation of ["foo","bar"] which is a permutation of words.

Example 2:

Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"]

Output: []

Explanation:

There is no concatenated substring.

Example 3:

Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"]

Output: [6,9,12]

Explanation:

The substring starting at 6 is "foobarthe". It is the concatenation of ["foo","bar","the"]. The substring starting at 9 is "barthefoo". It is the concatenation of ["bar","the","foo"]. The substring starting at 12 is "thefoobar". It is the concatenation of ["the","foo","bar"].

Constraints:

  • 1 <= s.length <= 10^4
  • 1 <= words.length <= 5000
  • 1 <= words[i].length <= 30
  • s and words[i] consist of lowercase English letters.
Sample Answer
from collections import Counter

def find_substring(s: str, words: list[str]) -> list[int]:
    """Finds the starting indices of concatenated substrings in s."""
    if not s or not words:
        return []

    word_len = len(words[0])
    num_words = len(words)
    total_len = word_len * num_words
    result = []

    # Count the frequency of each word in the 'words' array
    word_count = Counter(words)

    # Iterate through the string 's'
    for i in range(len(s) - total_len + 1):
        # Create a counter for the words found in the current substring of 's'
        seen_words = Counter()

        # Check if the current substring is a valid concatenation of words
        for j in range(num_words):
            # Extract a word from the current substring
            word = s[i + j * word_len:i + (j + 1) * word_len]

            # If the word is not in the original 'words' array, break the loop
            if word not in word_count:
                break

            # Update the count of the seen word
            seen_words[word] += 1

            # If the count of the seen word exceeds the count in the original 'words' array, break the loop
            if seen_words[word] > word_count[word]:
                break
        else:
            # If the inner loop completes without breaking, it means a valid concatenation is found
            result.append(i)

    return result

# Example Usage
s = "barfoothefoobarman"
words = ["foo", "bar"]
print(find_substring(s, words))

s = "wordgoodgoodgoodbestword"
words = ["word", "good", "best", "word"]
print(find_substring(s, words))

s = "barfoofoobarthefoobarman"
words = ["bar", "foo", "the"]
print(find_substring(s, words))

Explanation:

  1. Initialization:

    • word_len: Length of each word in the words array.
    • num_words: Number of words in the words array.
    • total_len: Total length of the concatenated string (i.e., word_len * num_words).
    • result: List to store the starting indices of the concatenated substrings.
    • word_count: A Counter object that stores the frequency of each word in the words array.
  2. Outer Loop:

    • The outer loop iterates from i = 0 to len(s) - total_len + 1. This ensures that we consider all possible starting positions for a concatenated substring of length total_len within the string s.
  3. Inner Loop and Validation:

    • For each starting index i, the inner loop iterates num_words times.
    • In each iteration, it extracts a word of length word_len from the string s starting at position i + j * word_len.
    • It checks if the extracted word is present in the word_count.
      • If the word is not in word_count, it means this substring cannot be a valid concatenation, so it breaks the inner loop.
    • It maintains a seen_words counter to keep track of the frequency of each word seen in the current substring.
    • If the count of a word in seen_words exceeds its count in word_count, it means this substring cannot be a valid concatenation, so it breaks the inner loop.
  4. Result:

    • If the inner loop completes without breaking (i.e., the else block of the for loop is executed), it means the substring starting at index i is a valid concatenation of the words in words.
    • In this case, the index i is added to the result list.
  5. Return Value:

    • The function returns the result list containing the starting indices of all concatenated substrings found in s.

Time and Space Complexity:

  • Time Complexity: O(N * M * K), where N is the length of the string s, M is the number of words in the words array, and K is the length of each word. The outer loop iterates N - M*K times, and the inner loop iterates M times, extracting a substring of length K in each iteration.
  • Space Complexity: O(M), where M is the number of words in the words array. This is due to the space used by the word_count and seen_words counters, which store at most M words.

Edge Cases:

  • Empty String: If the input string s is empty, the function should return an empty list because there can be no concatenated substrings.
  • Empty Word List: If the input word list words is empty, the function should return an empty list for the same reason.
  • Varying Word Lengths: The problem statement specifies that all words in the words list have the same length. If this condition is violated, the algorithm may not function correctly. You might want to add a check to ensure all words have the same length before proceeding.
  • Words Not Found: If none of the concatenated substrings are found in the input string, the function should return an empty list. The provided code handles this case correctly.
  • Duplicate Words: The code handles duplicate words in the words list correctly by using Counter to keep track of word frequencies. If the frequency of a word in the substring exceeds its frequency in the word list, the substring is considered invalid.
  • Substring Overlap: If there are overlapping concatenated substrings in the input string, the algorithm will identify all of them. The test case s = "barfoofoobarthefoobarman", words = ["bar", "foo", "the"] demonstrates this case.
  • Large Input: For very large input strings or word lists, the algorithm's time complexity of O(N * M * K) may become a concern. In such cases, consider optimizations like using a more efficient data structure or algorithm, or parallelizing the search process.