Taro Logo

Better Compression of String

Medium
Asked by:
Profile picture
Profile picture
7 views
Topics:
Strings

You are given a string s where each character is a digit. You are asked to compress the string such that each group of consecutive identical characters c is replaced with the character c followed by the number of times it occurs. For example, the string "111223" is compressed to "132213".

You are also given an integer k. You want to find the length of the shortest compressed string of s after deleting at most k characters.

Example 1:

Input: s = "1111222", k = 2
Output: 4
Explanation: We can delete two '1's to have the string "11222" which compresses to "2132".
Alternatively, we can delete two '2's to have the string "11112" which compresses to "4112".
The length of the shortest compressed string is 4.

Example 2:

Input: s = "aaabcccd", k = 2
Output: 4
Explanation: Delete two 'c's so that the string becomes "aaabd" which compresses to "3a1b1d".

Example 3:

Input: s = "aaaaaaaaaa", k = 0
Output: 3
Explanation: The string compresses to "10a".

Constraints:

  • 1 <= s.length <= 100
  • 0 <= k <= s.length
  • s consists of only digits.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the character set for the letters in the string? Are they only lowercase English letters, or can they include uppercase letters, numbers, or other symbols?
  2. What are the constraints on the length of the input string and the size of the counts? Could a count be very large, potentially exceeding the capacity of a standard 32-bit integer?
  3. Can the input string be empty, or is it guaranteed to contain at least one character-count pair?
  4. Is the input string format strictly followed, meaning a letter is always followed by at least one digit, or should I handle malformed inputs like 'a' without a number or '12' without a preceding letter?
  5. If a character's total count is 1, should the output include the number 1 (e.g., 'a1') or just the character ('a')?

Brute Force Solution

Approach

The goal is to take a compressed string, like 'a3b2c1', and create a new, more organized version. The straightforward approach is to first figure out the full, uncompressed list of all characters and their counts, and then rebuild the string in a sorted, combined format.

Here's how the algorithm would work step-by-step:

  1. First, we need to understand what the original string represents. Let's create a simple tally sheet for each letter of the alphabet, from 'a' to 'z'.
  2. Go through the compressed string, one character at a time. When you see a letter, remember it.
  3. Then, look at the number that comes right after that letter. This number tells you how many times that letter appears.
  4. Find that letter on your tally sheet and add that number to its current count.
  5. Repeat this process for every letter-number pair until you've gone through the entire input string.
  6. Once you're done, your tally sheet will have the total count for every character that was in the string.
  7. Now, create the new, better compressed string. Start with the letter 'a'.
  8. Check your tally sheet. If 'a' has a count greater than zero, write down 'a' followed by its total count in the new string.
  9. Move on to 'b'. Do the same thing: if its count is not zero, write down 'b' and its count.
  10. Continue this for every letter all the way to 'z', building up the final string piece by piece in alphabetical order.

Code Implementation

def better_compression_brute_force(compressed_string):
    # A tally sheet is needed to store the total counts for each character from 'a' to 'z'.

    character_counts = [0] * 26
    string_index = 0
    string_length = len(compressed_string)

    while string_index < string_length:
        current_character = compressed_string[string_index]
        string_index += 1

        number_string = ''
        while string_index < string_length and compressed_string[string_index].isdigit():
            number_string += compressed_string[string_index]
            string_index += 1

        count = int(number_string)
        
        # We update the tally for the specific character by converting it to an index (a=0, b=1, etc).

        character_index = ord(current_character) - ord('a')
        character_counts[character_index] += count

    better_compressed_string_parts = []
    # Iterate through all possible characters in alphabetical order to build the new string.

    for alphabet_index in range(26):
        if character_counts[alphabet_index] > 0:
            # Construct the character-count pair for the final sorted string if its count is non-zero.

            character = chr(ord('a') + alphabet_index)
            count = character_counts[alphabet_index]
            better_compressed_string_parts.append(f"{character}{count}")

    return "".join(better_compressed_string_parts)

Big(O) Analysis

Time Complexity
O(n)The time complexity is determined by two main sequential operations. First, we iterate through the input string of length n once to parse the characters and their counts, updating a frequency map; this takes O(n) time. Second, we iterate through the 26 letters of the alphabet, which is a constant-time operation, O(1), to build the final compressed string. The dominant part of the algorithm is parsing the input string, resulting in a total time complexity that is linear with respect to its length, which simplifies to O(n).
Space Complexity
O(1)The algorithm's space complexity is constant because it uses a fixed-size 'tally sheet' to store the counts of characters from 'a' to 'z'. This data structure, which can be implemented as an array of size 26, does not grow with the length of the input string N. Apart from this fixed-size array, only a few variables are needed for iteration. Therefore, the auxiliary space required remains constant regardless of the input string's size.

Optimal Solution

Approach

The goal is to combine the counts for each character that appears multiple times in the compressed string. The most effective way is to first count up all the occurrences for each unique character and then build a new, final string with the characters in alphabetical order, each followed by its total count.

Here's how the algorithm would work step-by-step:

  1. First, go through the compressed string and treat it as a collection of character-and-number pairs.
  2. For each character you see, find its associated number that tells you how many times it appeared in that spot.
  3. Keep a running total for every unique character. For example, if you see 'a' with the number '3' and later see 'a' again with the number '5', you know you have a total of eight 'a's.
  4. After you have counted up all the occurrences for every character, you'll have a final tally for each one.
  5. Now, create a new, clean result. Go through the alphabet from 'a' to 'z'.
  6. For each letter of the alphabet, if you have a count for it, add that letter to your result followed by its total count.
  7. By following this process, you will build the final compressed string with all characters in sorted order, each appearing only once with its total combined count.

Code Implementation

def better_compression_of_string(compressed_string):
    # An array is used to efficiently store counts for each character from 'a' to 'z'.
    character_counts = [0] * 26
    string_index = 0
    string_length = len(compressed_string)

    # This loop parses the string to aggregate counts for each character.
    while string_index < string_length:
        current_character = compressed_string[string_index]
        string_index += 1
        
        number_string = ""
        while string_index < string_length and compressed_string[string_index].isdigit():
            number_string += compressed_string[string_index]
            string_index += 1
        
        count = int(number_string)
        character_index = ord(current_character) - ord('a')
        character_counts[character_index] += count

    final_compressed_string = ""
    # Iterate through all possible characters to build the final string in alphabetical order.
    for alphabet_index in range(26):
        if character_counts[alphabet_index] > 0:
            character = chr(ord('a') + alphabet_index)
            count_to_append = character_counts[alphabet_index]
            final_compressed_string += character + str(count_to_append)
            
    return final_compressed_string

Big(O) Analysis

Time Complexity
O(n)The time complexity is determined by two main sequential passes. First, we iterate through the input string of length n once to parse character-count pairs and aggregate the totals into a fixed-size array (of 26 for lowercase English letters), which takes O(n) time. The second pass involves iterating through this fixed-size array of 26 characters to build the final output string, which is a constant time operation, O(1). Therefore, the overall runtime is dominated by the initial pass through the input string, resulting in a total time complexity of O(n).
Space Complexity
O(1)The space complexity is constant because the algorithm keeps a running total for every unique character. Since the problem involves only lowercase English letters from 'a' to 'z', the number of unique characters is fixed at 26. Therefore, the data structure used to store these counts, such as an array or a hash map, will have a maximum size of 26, which does not depend on the length of the input string N. The space used for building the result string is also limited by this constant factor, as it will contain at most 26 character-count pairs.

Edge Cases

Input string is empty
How to Handle:
The solution should return an empty string as there are no characters to process.
Input string is null
How to Handle:
The code should handle this gracefully, typically by returning an empty string or throwing an exception.
String contains only one character-count pair, like 'a10'
How to Handle:
The logic should correctly parse and return the single pair, resulting in 'a10'.
Character counts consist of multiple digits, like 'a123b45'
How to Handle:
The parsing logic must correctly read all consecutive digits to form the complete number for each character.
String contains characters with different cases, like 'a1A2'
How to Handle:
The problem implies lowercase letters, but if uppercase is possible, the solution should treat 'a' and 'A' as distinct characters.
A character appears many times, leading to a large total count, like 'a9a9a9...'
How to Handle:
The solution should use a data type for counts that can handle large sums to prevent integer overflow.
The input string is already perfectly compressed and sorted, like 'a1b2c3'
How to Handle:
The algorithm should process this correctly and return the identical string 'a1b2c3'.
The input string contains all 26 lowercase letters, like 'a1b1c1...z1'
How to Handle:
The data structure for storing counts, such as an array of size 26 or a map, must handle this without issue.