Taro Logo

String Compression III

Medium
Qualcomm logo
Qualcomm
1 view
Topics:
Strings

Given a string word, compress it using the following algorithm:

  • Begin with an empty string comp. While word is not empty, use the following operation:
    • Remove a maximum length prefix of word made of a single character c repeating at most 9 times.
    • Append the length of the prefix followed by c to comp.

Return the string comp.

Example 1:

Input: word = "abcde"

Output: "1a1b1c1d1e"

Explanation:

Initially, comp = "". Apply the operation 5 times, choosing "a", "b", "c", "d", and "e" as the prefix in each operation.

For each prefix, append "1" followed by the character to comp.

Example 2:

Input: word = "aaaaaaaaaaaaaabb"

Output: "9a5a2b"

Explanation:

Initially, comp = "". Apply the operation 3 times, choosing "aaaaaaaaa", "aaaaa", and "bb" as the prefix in each operation.

  • For prefix "aaaaaaaaa", append "9" followed by "a" to comp.
  • For prefix "aaaaa", append "5" followed by "a" to comp.
  • For prefix "bb", append "2" followed by "b" to comp.

Constraints:

  • 1 <= word.length <= 2 * 105
  • word consists only of lowercase English letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What characters are allowed in the input string? Can I assume only lowercase English letters, or could there be uppercase letters, numbers, or special characters?
  2. How large can the input string be? What's the maximum length I should consider?
  3. What defines the 'best' compression? Is it solely based on the length of the compressed string, or are there other factors to consider?
  4. If the input string is already compressed (e.g., 'a1b1c1'), should I return it as is, or attempt to further compress it?
  5. Are there any specific rules or constraints on the repetition counts in the compressed string? For example, can a character repeat more than 9 times before needing to be represented as 'x10'?

Brute Force Solution

Approach

The brute force way to compress a string with a limited number of deletions is to try every possible combination of character deletions. We systematically explore all options, keeping track of the best compression we find. It is a guaranteed way to find the answer, but potentially slow.

Here's how the algorithm would work step-by-step:

  1. Consider all possible ways to remove zero characters from the string.
  2. Next, consider all possible ways to remove one character from the string.
  3. Then, consider all possible ways to remove two characters, and so on, up to the maximum allowed number of deletions.
  4. For each string you get after deleting some characters, calculate the length of its compressed form.
  5. The compressed form counts consecutive repeated characters and replaces them with the character followed by the count of how many times it repeats.
  6. Compare the compressed lengths of all the strings you got from deleting characters.
  7. Choose the string with the shortest compressed length as the best result.

Code Implementation

def string_compression_brute_force(input_string, max_deletions):

    def get_compressed_length(string_to_compress):
        if not string_to_compress:
            return 0

        compressed_string = ""
        count = 1
        for i in range(len(string_to_compress)):
            if i + 1 < len(string_to_compress) and string_to_compress[i] == string_to_compress[i + 1]:
                count += 1
            else:
                compressed_string += string_to_compress[i]
                if count > 1:
                    compressed_string += str(count)
                count = 1
        return len(compressed_string)

    def generate_strings_with_deletions(current_string, deletions_remaining, current_index, current_result, all_results):
        if deletions_remaining == 0:
            all_results.append(current_result + current_string[current_index:])
            return

        if current_index == len(current_string):
            return

        # Option 1: Delete the character at current_index
        generate_strings_with_deletions(
            current_string, deletions_remaining - 1, current_index + 1, current_result, all_results
        )

        # Option 2: Keep the character at current_index
        generate_strings_with_deletions(
            current_string, deletions_remaining, current_index + 1, current_result + current_string[current_index], all_results
        )

    all_possible_strings = []
    generate_strings_with_deletions(input_string, max_deletions, 0, "", all_possible_strings)

    # Initialize with a large value to ensure the first comparison finds a smaller value
    minimum_compressed_length = float('inf')
    best_compressed_string = ""

    # Iterate through all possible strings after deletions
    for possible_string in all_possible_strings:

        # Get the length of the compressed version of the current string
        compressed_length = get_compressed_length(possible_string)

        # Update the minimum compressed length and the corresponding string
        if compressed_length < minimum_compressed_length:

            minimum_compressed_length = compressed_length
            best_compressed_string = possible_string

    return minimum_compressed_length

Big(O) Analysis

Time Complexity
O(n^k) – The algorithm explores all possible combinations of deleting up to k characters from a string of length n. For each possible combination of deletions, the algorithm calculates the length of the compressed string. The number of ways to delete k characters from a string of length n is given by the binomial coefficient 'n choose k', which is O(n^k) when k is significantly smaller than n or fixed. Since the algorithm iterates through all of these combinations and compresses each one in O(n) time at worst (due to counting adjacent characters), and calculates the length in O(n), the overall time complexity is dominated by the combination generation, making it O(n^k).
Space Complexity
O(2^N * N) – The brute force approach explores all possible combinations of character deletions. In the worst-case scenario, where we consider deleting 0 to K characters from a string of length N, the number of generated strings can be up to 2^N (each character can either be deleted or kept). For each of these strings, we need to store the modified string (length at most N), which consumes N space. Therefore, the auxiliary space complexity is O(2^N * N) due to the storage of these potentially numerous modified strings during the deletion and compression process. There are at most 2^N combinations of the string and we need to store these strings.

Optimal Solution

Approach

This string compression problem aims to find the shortest compressed version of a given string after making at most 'k' changes. The optimal approach uses a technique that remembers the best compressed string we can get to each part of the original string, avoiding recalculations and ensuring efficiency.

Here's how the algorithm would work step-by-step:

  1. Start by considering the initial part of the string, figuring out the best way to compress it if we make zero changes.
  2. Then, figure out the best compression achievable if we are allowed to make one change within that initial part of the string.
  3. Keep repeating this process, increasing the number of allowed changes, up to the maximum allowed 'k' changes, and storing the results.
  4. Move forward, extending the part of the string being considered by one character at a time.
  5. For each new part of the string and each possible number of allowed changes, look back at the previously calculated best compressions.
  6. Determine if it's better to compress the new character as is, or whether to change it, considering both the length of the resulting compressed string and the remaining number of allowed changes.
  7. Remember which option gave the shortest compressed string.
  8. Continue this process until the entire original string is processed, and we've calculated the best possible compression for the whole string with at most 'k' changes.

Code Implementation

def string_compression_with_changes(input_string, allowed_changes):
    string_length = len(input_string)
    
    # dp[i][j] stores length of best compression of s[:i] with j changes.
    dp = [[float('inf')] * (allowed_changes + 1) for _ in range(string_length + 1)]
    dp[0][0] = 0

    def get_length_of_compressed_group(count):
        if count == 1:
            return 1
        elif count < 10:
            return 2
        elif count < 100:
            return 3
        else:
            return 4

    for index in range(1, string_length + 1):
        for changes_used in range(allowed_changes + 1):
            for previous_index in range(index):
                mismatches = 0
                
                # Count mismatches between s[previous_index:index] and the character at previous_index
                for k in range(previous_index, index):
                    if input_string[k] != input_string[previous_index]:
                        mismatches += 1

                if mismatches <= changes_used:
                    # Calculate length of compressed string for s[previous_index:index]
                    compressed_group_length = get_length_of_compressed_group(index - previous_index)

                    # Update dp table with the minimum length found so far
                    dp[index][changes_used] = min(dp[index][changes_used], \
                                                     dp[previous_index][changes_used - mismatches] + 1 + (compressed_group_length - 1))

    # The result is the shortest compression length using at most k changes.
    return min(dp[string_length])

Big(O) Analysis

Time Complexity
O(n^2 * k) – The algorithm iterates through the string of length 'n' using a nested loop structure to consider all substrings. The outer loop iterates 'n' times, and the inner loop also iterates up to 'n' times, resulting in n*n combinations of substrings. For each substring, we consider up to 'k' possible changes. Therefore, the overall time complexity is O(n * n * k), which simplifies to O(n^2 * k).
Space Complexity
O(N*N*K) – The algorithm employs dynamic programming, storing intermediate results in a 3D table (or equivalent data structure) to avoid recalculations. This table's dimensions correspond to the length of the input string (N) for the starting and ending indices of substrings and the maximum allowed changes (K). Thus, the auxiliary space required scales with the product of these dimensions, resulting in O(N*N*K) space complexity. No other significant data structures are used beyond this DP table.

Edge Cases

CaseHow to Handle
Null or empty input stringReturn an empty string or handle as an error based on problem requirements.
String with length 1 or 2Handle these short strings according to the base cases outlined in the string compression algorithm.
String contains only one distinct character repeated many times (e.g., 'aaaaaa')The compression should efficiently encode this, potentially resulting in a very short compressed string.
String contains no repeating characters (e.g., 'abcdefg')The compressed string might be longer than the original, requiring logic to return the shorter string.
String where run lengths require more digits than the character count (e.g., 'a' * 10 followed by 'b')The solution should correctly handle multi-digit run lengths in the compressed string.
Extremely long input string (approaching memory limits)Ensure the algorithm uses memory efficiently to avoid out-of-memory errors for large inputs, possibly by considering iterative approach to limit recursion.
String with alternating characters, preventing significant compression (e.g., 'abababab')The algorithm should still return a valid (though possibly only slightly) compressed string or the original, depending on which is shorter.
String containing digits (e.g. 'a12b')The compression must correctly distinguish between a character that is a digit and a run length representation; either encode directly as digits, or define escape sequence for the digit runs.