Taro Logo

Shortest Uncommon Substring in an Array

Medium
Amazon logo
Amazon
3 views
Topics:
ArraysStringsSliding Windows

You are given an array arr of size n consisting of non-empty strings.

Find a string array answer of size n such that:

  • answer[i] is the shortest substring of arr[i] that does not occur as a substring in any other string in arr. If multiple such substrings exist, answer[i] should be the lexicographically smallest. And if no such substring exists, answer[i] should be an empty string.

Return the array answer.

Example 1:

Input: arr = ["cab","ad","bad","c"]
Output: ["ab","","ba",""]
Explanation:
- For the string "cab", the shortest substring that does not occur in any other string is either "ca" or "ab", we choose the lexicographically smaller substring, which is "ab".
- For the string "ad", there is no substring that does not occur in any other string.
- For the string "bad", the shortest substring that does not occur in any other string is "ba".
- For the string "c", there is no substring that does not occur in any other string.

Example 2:

Input: arr = ["abc","bcd","abcd"]
Output: ["","","abcd"]
Explanation:
- For the string "abc", there is no substring that does not occur in any other string.
- For the string "bcd", there is no substring that does not occur in any other string.
- For the string "abcd", the shortest substring that does not occur in any other string is "abcd".

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the maximum length of each string in the input array, and what characters can they contain (e.g., ASCII, Unicode)?
  2. If no uncommon substring exists, what should the function return (e.g., null, empty string, throw an exception)?
  3. Does 'substring' refer to a contiguous sequence of characters, and should the uncommon substring be the shortest one, or *a* shortest one?
  4. If multiple shortest uncommon substrings exist, can I return any of them?
  5. Can the input array be empty or contain null/empty strings? How should I handle those edge cases?

Brute Force Solution

Approach

We need to find the shortest string that doesn't appear as part of any of the other strings in our collection. The brute force approach involves checking every possible string, starting with the shortest, until we find one that works.

Here's how the algorithm would work step-by-step:

  1. First, we'll consider strings of length 1, then length 2, then length 3, and so on, until we find our answer.
  2. For each length, we'll generate all possible strings of that length using all possible combinations of characters.
  3. For each of these generated strings, we'll check if it appears within any of the strings we were originally given.
  4. If the generated string does NOT appear as a part of any of the original strings, we've found our shortest uncommon substring. We can stop and return that string.
  5. If the generated string DOES appear as a part of one or more of the original strings, we discard it and move on to the next generated string of the same length.
  6. If we've checked all possible strings of a certain length and haven't found an uncommon substring, we move on to checking strings of the next larger length.

Code Implementation

def find_shortest_uncommon_substring(strings):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    max_length = max(len(string) for string in strings)

    for substring_length in range(1, max_length + 2):
        # Iterate through possible substring lengths.
        for substring in generate_all_strings(substring_length, alphabet):
            is_substring_common = False

            for string_in_list in strings:
                # Check if the generated substring
                # exists inside the current string.
                if substring in string_in_list:
                    is_substring_common = True
                    break
            
            if not is_substring_common:
                # If substring isn't in
                # any of the strings, return it.
                return substring

    return None

def generate_all_strings(string_length, alphabet):
    if string_length == 0:
        yield ''
        return
    
    for character in alphabet:
        for suffix in generate_all_strings(string_length - 1, alphabet):
            yield character + suffix

Big(O) Analysis

Time Complexity
O(K * 4^L * N * L)Let N be the number of strings in the input array, and K be the size of the alphabet (assumed to be 4: A, C, G, T). The algorithm iterates through lengths L, starting from 1, until an uncommon substring is found. For each length L, it generates all possible strings of length L, which is K^L. For each of these K^L strings, it checks if it's a substring of any of the N input strings. Checking if a string of length L is a substring of another string takes O(L) time using a simple contains check. The worst case arises when the algorithm searches for every substring until length L is reached where an uncommon string is found. Multiplying these components gives the total time complexity of approximately O(K * 4^L * N * L), where K is a constant representing the characters A,C,T,G
Space Complexity
O(L*A)The algorithm generates all possible strings of length L, where L starts at 1 and increments until an uncommon substring is found. The number of such strings for a given length L with alphabet size A is A^L. While these are not *all* stored at once, the `appears within` check for each generated string requires allocating memory to hold the generated string of length L. Therefore, we need O(L) space to temporarily store each generated substring to check if it appears in the input strings. In the worst-case, the length of uncommon substring L can depend on the size and nature of input strings, and if we assume the alphabet size is A, the temporary strings drive space complexity to O(L*A) in practical scenario.

Optimal Solution

Approach

The trick is to find the shortest sequence of characters that's not present in any of the given words. We'll start by checking short sequences and gradually increase the length until we find one that works, avoiding unnecessary work.

Here's how the algorithm would work step-by-step:

  1. Start by checking if a single character (a sequence of length 1) is not present in any word. If found, we are done.
  2. If all single characters are present, move to sequences of length 2 (pairs of characters).
  3. For each sequence length, go through every possible combination of characters of that length.
  4. For each combination, check if it's a substring of any of the words in the input.
  5. If a combination is not a substring of any word, you have found the shortest uncommon substring and you can stop.
  6. If all combinations of a given length are substrings of at least one word, increase the length and repeat the process.

Code Implementation

def find_shortest_uncommon_substring(words):
    character_set = 'abcdefghijklmnopqrstuvwxyz'
    max_word_length = max(len(word) for word in words)

    for substring_length in range(1, max_word_length + 2):
        # Start checking substrings of increasing length.
        for start_index in range(len(character_set) ** substring_length):
            substring = ""
            temp_index = start_index
            for _ in range(substring_length):
                substring = character_set[temp_index % len(character_set)] + substring
                temp_index //= len(character_set)

            is_substring_present = False

            # Check if the current substring is present in any of the words.
            for word in words:
                if substring in word:
                    is_substring_present = True
                    break

            # If the substring is not present in any word, we found our answer.
            if not is_substring_present:
                return substring

    return None

Big(O) Analysis

Time Complexity
O(K * A^L * N * M)Let A be the size of the alphabet, L be the length of the shortest uncommon substring, N be the number of words in the input, M be the maximum length of a word, and K be the maximum length of L we are going to check. We iterate through possible lengths L (from 1 up to a maximum of K). For each length L, we generate all possible substrings of length L, which is A^L. For each substring, we iterate through each word in the input (N words) and check if the substring is present. Checking if a substring is present in a word takes O(M) time. Thus, the total time complexity is O(K * A^L * N * M).
Space Complexity
O(1)The algorithm's space complexity is dominated by the storage of the generated substrings of varying lengths and the boolean checks for substring existence. The plain English explanation primarily focuses on iterating through possible substrings and checking their presence in the input words. These checks do not create data structures whose size depends on the input size N (number of words or total characters in all words). The generation of each candidate substring is done on-the-fly, with each one individually tested against the input words; the number of words or the length of any individual word doesn't significantly impact additional space. Therefore, only a small amount of constant auxiliary space is used for variables regardless of the input size N.

Edge Cases

CaseHow to Handle
Input array is null or emptyReturn an empty string as there are no strings to compare, or throw an IllegalArgumentException based on requirements.
Array contains an empty stringTreat the empty string as a valid string for comparison purposes; shortest uncommon substring will never be ''.
All strings in the array are identicalThe shortest uncommon substring will be the first character of the string, or an empty string if the string is empty.
No uncommon substring exists (all substrings are common)Return an empty string or a specific indicator (e.g., null, -1) to signal the absence of a solution based on the spec.
Array contains very long strings (approaching memory limits)Be mindful of memory usage when generating substrings and consider a streaming approach if memory is constrained.
Array contains strings with special characters or UnicodeEnsure that the substring comparison and generation logic correctly handles these characters.
Multiple shortest uncommon substrings existReturn any one of the shortest uncommon substrings or all of them depending on the specification.
Integer overflow when calculating substring lengthsUse appropriate data types (e.g., long) to prevent potential overflow when comparing lengths, or ensure strings are limited in length.