Taro Logo

Substring with Concatenation of All Words

Hard
Apple logo
Apple
5 views
Topics:
ArraysStringsSliding Windows

You are given a string s and an array of strings words. All the strings of words are of the same length.

A concatenated string is a string that exactly contains all the strings of any permutation of words concatenated.

  • For example, if words = ["ab","cd","ef"], then "abcdef", "abefcd", "cdabef", "cdefab", "efabcd", and "efcdab" are all concatenated strings. "acdbef" is not a concatenated string because it is not the concatenation of any permutation of words.

Return an array of the starting indices of all the concatenated substrings in s. You can return the answer in any order.

Example 1:

Input: s = "barfoothefoobarman", words = ["foo","bar"] Output: [0,9] Explanation: The substring starting at 0 is "barfoo". It is the concatenation of ["bar","foo"] which is a permutation of words. The substring starting at 9 is "foobar". It is the concatenation of ["foo","bar"] which is a permutation of words.

Example 2:

Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"] Output: [] Explanation: There is no concatenated substring.

Example 3:

Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"] Output: [6,9,12] Explanation: The substring starting at 6 is "foobarthe". It is the concatenation of ["foo","bar","the"]. The substring starting at 9 is "barthefoo". It is the concatenation of ["bar","the","foo"]. The substring starting at 12 is "thefoobar". It is the concatenation of ["the","foo","bar"].

Constraints:

  • 1 <= s.length <= 10^4
  • 1 <= words.length <= 5000
  • 1 <= words[i].length <= 30
  • s and words[i] consist of lowercase English letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Can you provide the constraints for the length of the string `s` and the number of words in the `words` array? Also, what is the maximum length of each word in the `words` array?
  2. Are there any duplicate words within the `words` array, and if so, how should that affect the matching of the concatenated substring?
  3. If no substring is found that satisfies the concatenation requirement, what should the function return (e.g., an empty array, null)?
  4. Is the order of words in the `words` array significant when forming the concatenated substring, or can the words appear in any order?
  5. Is the input case-sensitive? For example, should "Word" match "word"?

Brute Force Solution

Approach

We want to find where a specific combination of words appears inside a larger piece of text. The brute force method checks every possible starting point in the larger text to see if our word combination is there. It's like trying to fit a puzzle piece starting at every possible location to see if it fits.

Here's how the algorithm would work step-by-step:

  1. Start at the beginning of the main text.
  2. See if the first word of our combination appears at that spot.
  3. If it does, check if the second word appears right after it, then the third, and so on, until we've checked all the words in the combination.
  4. If all the words in our combination appear in the correct order and right next to each other, we've found a match. Remember where we started.
  5. If the first word of our combination doesn't appear at that starting spot, or if any of the subsequent words don't match, move to the next spot in the main text.
  6. Repeat this process, checking every possible starting spot in the main text until we reach the end.
  7. At the end, we will have a list of all the starting locations where the combination of words was found.

Code Implementation

def find_substring_brute_force(text, words):
    if not text or not words:
        return []

    word_length = len(words[0])
    number_of_words = len(words)
    substring_length = word_length * number_of_words
    result_indices = []

    # Iterate through all possible start indices
    for index in range(len(text) - substring_length + 1):
        words_seen = []
        words_found = 0

        # Check if the concatenation of words exists starting at current index
        for word_index in range(number_of_words):
            current_word_start = index + word_index * word_length
            current_word = text[current_word_start:current_word_start + word_length]

            # Ensure the words are in order and present
            if current_word in words:
                words_seen.append(current_word)
                words_found += 1

        # Verify all the words are present
        if words_found == number_of_words:
            # Check if the counts of each word matches
            temp_words = words[:] # make a copy
            is_concat = True
            for seen_word in words_seen:
                if seen_word in temp_words:
                    temp_words.remove(seen_word)
                else:
                    is_concat = False
                    break

            if is_concat:
                result_indices.append(index)

    return result_indices

Big(O) Analysis

Time Complexity
O(n*m*k)Let n be the length of the string s, m be the number of words in the words array, and k be the length of each word (assuming all words have the same length). The brute force algorithm iterates through the string s, starting at each possible index (up to n - m*k). For each starting index, it checks if the concatenation of all words in the words array exists, which takes O(m*k) time. Thus the overall time complexity is O(n*m*k), as the outer loop runs approximately n times and the inner matching process takes m*k steps each time.
Space Complexity
O(L)The described brute force approach requires storing the list of starting locations where the word combination is found. In the worst-case scenario, every possible starting position could be a match, resulting in a list of length L, where L is the length of the main text. Therefore, the auxiliary space needed to store the starting indices is proportional to L. Thus, the space complexity is O(L).

Optimal Solution

Approach

The key is to avoid checking every single possible starting position. Instead, we use a sliding window approach, focusing on potential starting points that align with the length of the words we are looking for. This significantly reduces redundant checks and computation.

Here's how the algorithm would work step-by-step:

  1. First, figure out the length of each individual word and how many words we need to find in a row.
  2. Instead of starting from every letter in the big string, only start at positions that are multiples of the length of a single word. This is because if the matching sequence exists, it must start at such a position.
  3. From each possible starting position, check if the next 'n' words (where 'n' is the number of words we're looking for) match our word list exactly. To do this, build a way to count how many times each word appears in our target list.
  4. As you move through the bigger string, use a 'sliding window'. That is, check the sequence of words beginning at the current starting point. If those words all exist in our target word list, and each word appears the right number of times, we found a match.
  5. If you find a complete sequence of matching words, remember the starting position.
  6. After you have checked every possible starting position that is a multiple of single word length, report all of the valid starting positions.

Code Implementation

def find_substring(string, words):
    if not string or not words:
        return []

    word_length = len(words[0])
    number_of_words = len(words)
    substring_length = word_length * number_of_words
    result_indices = []

    # Store word counts for quick reference
    word_count = {}
    for word in words:
        word_count[word] = word_count.get(word, 0) + 1

    # Only check starting positions that are multiples of word_length
    for i in range(len(string) - substring_length + 1):
        words_seen = {}
        words_found = 0

        # Iterate through the substring to check each word
        for j in range(number_of_words):
            word_start = i + j * word_length
            current_word = string[word_start:word_start + word_length]

            # If the current word is not in the list, break
            if current_word not in word_count:
                break

            words_seen[current_word] = words_seen.get(current_word, 0) + 1

            # Break if we see a word more times than in target list
            if words_seen[current_word] > word_count[current_word]:
                break

            words_found += 1

        # If all words are found, add the starting index to results
        if words_found == number_of_words:
            result_indices.append(i)

    return result_indices

Big(O) Analysis

Time Complexity
O(n * m)Let n be the length of the string s and m be the total number of words in the words array. We iterate through the string s starting at indices that are multiples of the word length. For each such starting index, we potentially iterate up to m times to check if we can find a concatenation of all words. Within each of these m iterations, we perform constant-time operations like substring extraction and hashmap lookups. Therefore, the time complexity is O(n * m).
Space Complexity
O(L)The algorithm uses a hash map (or dictionary) to count the frequency of each word in the input `words` list. The size of this hash map is proportional to the number of unique words in the `words` list, which we'll denote as L. Additionally, a second hash map is used within the sliding window to track the words encountered in the current window, which in the worst case, could also contain all L unique words. Thus, the auxiliary space used is determined by the number of unique words, L, not the size of the main string.

Edge Cases

CaseHow to Handle
s or words is null or emptyReturn an empty list immediately as there can be no concatenation.
words array is emptyIf words is empty, return an empty list, indicating no valid concatenations can be found.
s is shorter than the total length of the concatenated wordsReturn an empty list since s cannot contain the concatenation.
words contains an empty stringIf a word is empty the length calculation must account for zero length strings and the main algorithm should handle comparisons against empty strings.
words contains duplicate wordsUse a frequency map to count the occurrences of each word and compare against that during the search.
No valid starting indices existReturn an empty list when no valid starting indices are found after searching the entire string s.
All words in 'words' are the same.The frequency map ensures correct counting and the sliding window logic accurately checks for consecutive repetitions.
Very long string s and large word array, potential for timeoutOptimize the algorithm using a sliding window approach to avoid redundant calculations and improve time complexity.