Taro Logo

Smallest Subsequence of Distinct Characters

Medium
Amazon logo
Amazon
1 view
Topics:
StringsStacksGreedy Algorithms

Given a string s, return the lexicographically smallest subsequence of s that contains all the distinct characters of s exactly once.

For example:

  1. If s = "bcabc", the output should be "abc".
  2. If s = "cbacdcbc", the output should be "acdb".

Here are some constraints:

  • 1 <= s.length <= 1000
  • s consists of lowercase English letters.

Could you provide an algorithm to solve this problem efficiently, considering both time and space complexity? Please explain the reasoning behind your approach and walk through the examples to demonstrate how it works.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Is the input string guaranteed to contain only lowercase English letters, or can it contain other characters (uppercase, numbers, symbols)?
  2. If multiple smallest subsequences exist, which one should I return? For example, the lexicographically smallest one?
  3. If the input string is empty, what should I return?
  4. Can the input string be null?
  5. Does the order of distinct characters in the subsequence have to respect the original string's order?

Brute Force Solution

Approach

The brute force way to find the smallest subsequence with distinct characters means we'll try every possible combination. We build each possible subsequence and then check if it has all distinct characters. Finally, we pick the shortest one that meets the criteria.

Here's how the algorithm would work step-by-step:

  1. First, generate every possible subsequence from the original sequence. A subsequence is formed by deleting some characters (or none) without changing the order of the remaining characters.
  2. For each subsequence you created, check if it contains only distinct characters. That means no character should appear more than once in that subsequence.
  3. If a subsequence has all distinct characters, save it. If not, discard it.
  4. Once you've checked every possible subsequence, compare all the saved subsequences (the ones with distinct characters).
  5. Find the shortest saved subsequence. If there's a tie for the shortest, pick the one that comes earliest alphabetically.

Code Implementation

def smallest_subsequence_of_distinct_characters_brute_force(sequence):
    all_subsequences = []

    # Generate all possible subsequences
    for i in range(1 << len(sequence)):
        subsequence = ""
        for j in range(len(sequence)):
            if (i >> j) & 1:
                subsequence += sequence[j]
        all_subsequences.append(subsequence)

    distinct_subsequences = []
    for subsequence in all_subsequences:
        if len(set(subsequence)) == len(subsequence):

            # Keep only the subsequences with distinct characters
            distinct_subsequences.append(subsequence)

    if not distinct_subsequences:
        return ""

    smallest_subsequence = distinct_subsequences[0]

    # Find the smallest subsequence that contains distinct chars
    for subsequence in distinct_subsequences:
        if len(subsequence) < len(smallest_subsequence):
            smallest_subsequence = subsequence
        elif len(subsequence) == len(smallest_subsequence) and subsequence < smallest_subsequence:
            smallest_subsequence = subsequence

    return smallest_subsequence

Big(O) Analysis

Time Complexity
O(2^n * n)Generating all possible subsequences of a string of length n takes O(2^n) time because each character can either be included or excluded in a subsequence. For each of these subsequences, we need to check if all characters are distinct, which takes O(n) time in the worst case (where n is the length of the subsequence and also the maximum number of unique characters). Therefore, the overall time complexity is O(2^n * n).
Space Complexity
O(2^N * N)The brute force approach generates all possible subsequences. There are 2^N possible subsequences for a string of length N. Each subsequence can have a length of at most N. Therefore, we need space to store all generated subsequences in memory and each of them has a maximum length of N. In addition, a constant number of variables might be used, but their size is insignificant compared to storing all the subsequences. Thus, the auxiliary space is O(2^N * N).

Optimal Solution

Approach

To find the smallest unique character sequence, we'll carefully build it one character at a time. The core idea is to prioritize characters that appear later in the input and avoid adding redundant characters that can be dropped without losing any unique characters.

Here's how the algorithm would work step-by-step:

  1. First, figure out the last spot where each unique character shows up in the original input.
  2. Start building your sequence. Go through the input character by character.
  3. For each character, check if it's already in your sequence. If it is, skip it.
  4. If it's a new character, consider removing the last character you added to your sequence. You can only do this if the last character appears again later in the input.
  5. Keep removing characters from the end of your sequence as long as they appear later in the input and are larger than the current character you are looking at.
  6. After potentially removing characters, add the current character to the end of your sequence.
  7. By repeating this process, you ensure that your sequence contains all the unique characters in the original input, in the smallest possible order.

Code Implementation

def smallest_subsequence(input_string):
    last_occurrence = {}
    for index, character in enumerate(input_string):
        last_occurrence[character] = index

    stack = []
    seen = set()

    for index, character in enumerate(input_string):
        if character in seen:
            continue

        # Maintain the stack to get the smallest sequence
        while stack and character < stack[-1] and index < last_occurrence[stack[-1]]:
            removed_character = stack.pop()
            seen.remove(removed_character)

        stack.append(character)
        seen.add(character)

    return ''.join(stack)

Big(O) Analysis

Time Complexity
O(n)The algorithm iterates through the input string of length n once to determine the last occurrence of each character. Then, it iterates through the input string again, potentially performing operations within a while loop that removes characters from the stack (sequence). However, each character is added to and removed from the stack at most once. Therefore, the dominant operations are linearly proportional to the input size n, resulting in O(n) time complexity. The while loop has a net cost of O(n) over the course of the entire execution, not O(n) for each element.
Space Complexity
O(1)The algorithm utilizes a hash map to store the last occurrence of each unique character in the input string, and a stack (or similar data structure representing the sequence being built) to maintain the smallest subsequence found so far. The size of the hash map is determined by the number of unique characters, which is limited by the character set (e.g., 26 for lowercase English letters). The stack's size is also bounded by the number of unique characters. Therefore, the auxiliary space used remains constant and does not depend on the length of the input string (N), resulting in O(1) space complexity.

Edge Cases

CaseHow to Handle
Empty input stringReturn an empty string immediately as there are no characters to process.
Input string with only one distinct characterReturn the input string directly as it already satisfies the condition.
Input string with all characters being the sameReturn a string containing only that one character.
Input string with characters in reverse alphabetical orderThe algorithm should correctly build the subsequence by iteratively adding and potentially removing characters from the stack.
Input string of maximum allowed lengthEnsure the solution's time and space complexity remains within acceptable limits for large inputs.
Input string with only lowercase English lettersThe solution must be designed to handle only these letters, as specified in the problem.
Input string where a character appears multiple times, but the last occurrence is early in the stringThe algorithm needs to correctly prioritize later occurrences to minimize the subsequence.
Input string with some characters appearing only onceThose characters must be included in the final subsequence, and the algorithm should handle them appropriately.