Word Subsets

Medium

Asked by:

49 views

Topics:

ArraysStrings

You are given two string arrays words1 and words2.

A string b is a subset of string a if every letter in b occurs in a including multiplicity.

For example, "wrr" is a subset of "warrior" but is not a subset of "world".

A string a from words1 is universal if for every string b in words2, b is a subset of a.

Return an array of all the universal strings in words1. You may return the answer in any order.

Example 1:

Input: words1 = ["amazon","apple","facebook","google","leetcode"], words2 = ["e","o"]

Output: ["facebook","google","leetcode"]

Example 2:

Input: words1 = ["amazon","apple","facebook","google","leetcode"], words2 = ["lc","eo"]

Output: ["leetcode"]

Example 3:

Input: words1 = ["acaac","cccbb","aacbb","caacc","bcbbb"], words2 = ["c","cc","b"]

Output: ["cccbb"]

Constraints:

1 <= words1.length, words2.length <= 10⁴
1 <= words1[i].length, words2[i].length <= 10
words1[i] and words2[i] consist only of lowercase English letters.
All the strings of words1 are unique.

Try coding it in LeetCode

Solution

Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

What is the maximum length of the words in `words1` and `words2`, and how many words can be in each array?
Are the input words guaranteed to contain only lowercase English letters?
If a word in `words1` can be formed, with repetition, from the letters of multiple words in `words2`, should I include it in the output only once?
If `words2` is empty, should I return `words1`?
Is the order of the output words in `words1` significant, or can I return them in any order?

Brute Force Solution

Approach

The brute-force strategy for the 'Word Subsets' problem involves checking every word in the first list to see if it contains all the letters and counts found in every word from the second list. We essentially go through each word in the first list and compare it to every word in the second list.

Here's how the algorithm would work step-by-step:

For each word in the first list of words, examine it carefully.
For each word in the second list of words, check if all the letters present in that word also appear in the word from the first list, and in sufficient quantities.
If even one word from the second list has a letter that is missing or doesn't appear enough times in the word from the first list, then the word from the first list is not a 'subset' of all the words in the second list.
If a word from the first list successfully contains all the letters from every single word in the second list (in the required counts), then we mark it as a valid subset.
Repeat this process for every word in the first list, building a collection of words that meet all the requirements.

Code Implementation

def word_subsets_brute_force(words1, words2):
    result = []

    for word1 in words1:
        is_subset = True

        # Iterate through each word in the second list
        for word2 in words2:
            word1_counts = {}
            word2_counts = {}

            for char in word1:
                word1_counts[char] = word1_counts.get(char, 0) + 1
            for char in word2:
                word2_counts[char] = word2_counts.get(char, 0) + 1

            # Check if word1 contains all letters of word2 with sufficient counts
            for char, count in word2_counts.items():
                if char not in word1_counts or word1_counts[char] < count:
                    is_subset = False
                    break
        
        # Only add to result if word1 is a subset of all words2
        if is_subset:
            result.append(word1)

    return result

Big(O) Analysis

Time Complexity

O(A * B * L) – Let A be the number of words in the first list (words1), and B be the number of words in the second list (words2). For each of the A words in words1, we iterate through all B words in words2. Inside the inner loop, we compare character counts which takes time proportional to L, the maximum length of a word. Thus, the overall time complexity is O(A * B * L).

Space Complexity

O(1) – The brute-force solution described analyzes each word in the first list against each word in the second list. It doesn't explicitly mention creating auxiliary data structures that scale with the input size, such as storing intermediate results in lists or dictionaries. The comparison process seems to rely on direct access and comparison of characters within the input strings themselves. Thus, the auxiliary space used is constant, independent of the input size, which makes it O(1).

Optimal Solution

Approach

The goal is to find words that contain all the letters from each word in a given list. Instead of checking every word combination, we create a profile of required letters and then filter our words based on this profile. This avoids redundant checks and focuses on essential letter requirements.

Here's how the algorithm would work step-by-step:

First, analyze the list of short words to figure out the maximum number of times each letter appears across all of them. This creates a master list of letter requirements.
Then, for each longer word, check if it contains at least the number of each letter as specified in the master list of requirements.
If a longer word satisfies these minimum letter counts for all letters, then it is considered a 'word subset' and included in the result.
By pre-calculating the letter requirements, we avoid unnecessary checks and quickly identify the valid subset words.

Code Implementation

def word_subsets(words1, words2):
    master_letter_requirements = [0] * 26

    # Find maximum letter frequencies required across all words2
    for word2 in words2:
        word_letter_frequency = [0] * 26
        for letter in word2:
            word_letter_frequency[ord(letter) - ord('a')] += 1
        for i in range(26):
            master_letter_requirements[i] = max(master_letter_requirements[i], word_letter_frequency[i])

    result = []

    # Iterate through words1 to check for subset property
    for word1 in words1:
        word1_letter_frequency = [0] * 26

        # Count frequencies for each letter
        for letter in word1:
            word1_letter_frequency[ord(letter) - ord('a')] += 1

        is_subset = True
        # Check if all letters meet min. requirements
        for i in range(26):
            if word1_letter_frequency[i] < master_letter_requirements[i]:
                is_subset = False
                break

        # Add to result if the word meets frequency requirements.
        if is_subset:
            result.append(word1)

    return result

Big(O) Analysis

Time Complexity

O(A + B) – Let A be the total number of characters across all words in the array A, and let B be the total number of characters across all words in the array B. First, we iterate through array B to compute the maximum frequency of each character. This takes O(B) time. Then, we iterate through array A, and for each word, we check if it is a subset. Checking if a word is a subset requires iterating through the frequency map of array B and comparing it with the frequency of characters in the word from array A. Thus, the complexity is dominated by O(A + B).

Space Complexity

O(1) – The algorithm uses a fixed-size array (or hash map) of size 26 to store the maximum letter counts derived from the list of short words, irrespective of the number of short words. While there's a list to hold the result, its space is considered part of the output and not auxiliary space. Therefore, the auxiliary space used is constant and independent of the input size (number of words or their lengths), resulting in O(1) space complexity.

Edge Cases

Case	How to Handle
words1 or words2 is null or empty	Return an empty list if either input array is null or empty as no subsets can be formed.
words1 or words2 contains empty strings	Treat empty strings as strings with no characters; frequency counts will be all zeros, and handle accordingly based on the algorithm used.
words1 or words2 contains very long strings exceeding reasonable memory limits	Consider handling large strings in chunks or using a more memory-efficient data structure for character counting to avoid out-of-memory errors.
words1 and words2 contain same words	The algorithm should correctly identify these words as subsets if the frequency criteria are met.
words2 contains words that are permutations of each other, but not subsets	Character frequency analysis must ensure that the subsets are based on minimum occurrences of characters, not just presence of characters.
words2 has some very large frequencies of certain characters such that no words1 satisfy subset requirements	The solution should return an empty result, as no words in words1 meet the subset condition defined by the high character frequencies of words2.
Unicode characters present in words1 or words2	Ensure character frequency counting handles extended ASCII or Unicode characters correctly, possibly using a wider character representation.
Words contain characters with mixed cases (upper and lower)	Consider converting all words to lowercase or uppercase uniformly to ensure case-insensitive comparison if that is desired.

words1 or words2 is null or empty

How to Handle:

Return an empty list if either input array is null or empty as no subsets can be formed.

words1 or words2 contains empty strings

How to Handle:

Treat empty strings as strings with no characters; frequency counts will be all zeros, and handle accordingly based on the algorithm used.

words1 or words2 contains very long strings exceeding reasonable memory limits

How to Handle:

Consider handling large strings in chunks or using a more memory-efficient data structure for character counting to avoid out-of-memory errors.

words1 and words2 contain same words

How to Handle:

The algorithm should correctly identify these words as subsets if the frequency criteria are met.

words2 contains words that are permutations of each other, but not subsets

How to Handle:

Character frequency analysis must ensure that the subsets are based on minimum occurrences of characters, not just presence of characters.

words2 has some very large frequencies of certain characters such that no words1 satisfy subset requirements

How to Handle:

The solution should return an empty result, as no words in words1 meet the subset condition defined by the high character frequencies of words2.

Unicode characters present in words1 or words2

How to Handle:

Ensure character frequency counting handles extended ASCII or Unicode characters correctly, possibly using a wider character representation.

Words contain characters with mixed cases (upper and lower)

How to Handle:

Consider converting all words to lowercase or uppercase uniformly to ensure case-insensitive comparison if that is desired.