Taro Logo

Count Number of Homogenous Substrings

Medium
Asked by:
Profile picture
Profile picture
7 views
Topics:
StringsDynamic Programming

Given a string s, return the number of homogenous substrings of s. Since the answer may be too large, return it modulo 109 + 7.

A string is homogenous if all the characters of the string are the same.

A substring is a contiguous sequence of characters within a string.

Example 1:

Input: s = "abbcccaa"
Output: 13
Explanation: The homogenous substrings are listed as below:
"a"   appears 3 times.
"aa"  appears 1 time.
"b"   appears 2 times.
"bb"  appears 1 time.
"c"   appears 3 times.
"cc"  appears 2 times.
"ccc" appears 1 time.
3 + 1 + 2 + 1 + 3 + 2 + 1 = 13.

Example 2:

Input: s = "xy"
Output: 2
Explanation: The homogenous substrings are "x" and "y".

Example 3:

Input: s = "zzzzz"
Output: 15

Constraints:

  • 1 <= s.length <= 105
  • s consists of lowercase letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Can the input string be empty or null?
  2. What is the maximum length of the input string?
  3. Is the input string guaranteed to contain only ASCII characters, or should I consider Unicode?
  4. By 'substring', do you mean a contiguous sequence of characters within the string?
  5. If the string contains no homogenous substrings, what should I return (e.g., 0)?

Brute Force Solution

Approach

The brute force way to count homogenous substrings is very straightforward. We look at every possible piece of the whole string. Then, for each of those pieces, we check if it's made up of only one character.

Here's how the algorithm would work step-by-step:

  1. First, consider every single character in the string as a substring. Check if each of those is homogenous (made of the same character).
  2. Next, consider every pair of adjacent characters as a substring. Check if each of those pairs is homogenous.
  3. Then, consider every group of three consecutive characters as a substring. Check if each of those groups is homogenous.
  4. Continue this process, considering groups of four, five, six, and so on, up to the length of the entire string.
  5. Each time you find a substring that is homogenous, add it to the total count.
  6. Finally, after checking all possible substrings, report the total count of homogenous substrings you found.

Code Implementation

def count_homogenous_substrings_brute_force(input_string):
    string_length = len(input_string)
    homogenous_substring_count = 0

    # Iterate through all possible substring lengths
    for substring_length in range(1, string_length + 1):
        # Iterate through all possible starting positions for the substring
        for starting_index in range(string_length - substring_length + 1):
            # Extract the substring
            substring = input_string[starting_index:starting_index + substring_length]

            # Check if the substring is homogenous
            is_homogenous = True

            #Checking if the substring has zero length. Added for safety.
            if len(substring) == 0:
                is_homogenous = False

            # Iterate through the substring characters
            for character_index in range(1, len(substring)):
                #Check the current character against the previous
                if substring[character_index] != substring[0]:
                    is_homogenous = False
                    break

            # Increment the count if the substring is homogenous
            if is_homogenous:
                homogenous_substring_count += 1

    return homogenous_substring_count

Big(O) Analysis

Time Complexity
O(n²)The outer loop iterates from substring length 1 up to n (the length of the string). For each substring length, the inner loop iterates through all possible starting positions for that substring. Checking if a substring of length k is homogenous takes O(k) time in the worst case. Therefore, the total time complexity is approximately 1*n + 2*(n-1) + 3*(n-2) + ... + n*1 which is the sum of k*(n-k+1). This sum expands to a quadratic expression, hence the time complexity is O(n²).
Space Complexity
O(1)The described brute force approach iterates through all possible substrings but doesn't require any auxiliary data structures to store intermediate substrings or counts. It only uses a constant number of variables like loop counters to define the start and end of the substrings and a variable to accumulate the count. Therefore, the amount of extra memory needed does not depend on the input string's size (N), and the space complexity is constant.

Optimal Solution

Approach

The efficient way to count homogenous substrings involves identifying contiguous blocks of identical characters. Instead of checking every possible substring, we focus on calculating the number of homogenous substrings within each of these blocks and then summing them up. This avoids redundant calculations and drastically improves performance.

Here's how the algorithm would work step-by-step:

  1. Go through the input text character by character.
  2. Identify sections of consecutive identical characters. For example, in 'abbbcc', 'a', 'bbb', and 'cc' are these sections.
  3. For each section of identical characters, calculate the number of homogenous substrings it contains. A section of length 'n' contains 'n' substrings of length 1, 'n-1' substrings of length 2, and so on, down to 1 substring of length 'n'. The total number of homogenous substrings is the sum of numbers from 1 to 'n', which can be efficiently calculated as n * (n + 1) / 2.
  4. Add up the number of homogenous substrings calculated for each section.
  5. The final sum is the total number of homogenous substrings in the input text.

Code Implementation

def count_homogenous_substrings(input_string):
    string_length = len(input_string)
    total_homogenous_substrings = 0
    start_index = 0

    while start_index < string_length:
        end_index = start_index

        # Find the end of the current homogenous section
        while (end_index < string_length and
               input_string[start_index] == input_string[end_index]):
            end_index += 1

        # Calculate the length of the homogenous section
        homogenous_length = end_index - start_index

        # Calculate number of substrings
        total_homogenous_substrings +=
            homogenous_length * (homogenous_length + 1) // 2

        # Move to the next section
        start_index = end_index

    return total_homogenous_substrings

Big(O) Analysis

Time Complexity
O(n)The algorithm iterates through the input string of length n once. Inside the loop, it identifies contiguous blocks of identical characters. The calculation of homogenous substrings within each block takes constant time, specifically n * (n + 1) / 2 is still proportional to n, where n is the length of the block, but since the outer loop already processes the whole string once, this is included in the complexity of O(n). Therefore, the overall time complexity is O(n) because we are performing a single pass through the input string.
Space Complexity
O(1)The algorithm iterates through the input string and calculates the number of homogenous substrings based on contiguous blocks of identical characters. It only uses a few constant space variables to keep track of the current character, the length of the current homogenous block, and the total count of substrings. No auxiliary data structures that scale with the input size are used. Therefore, the space complexity is constant, O(1).

Edge Cases

CaseHow to Handle
Null or empty input stringReturn 0 as there are no substrings.
String with a single characterReturn 1 since a single character is a homogenous substring of length 1.
String with all identical characters (e.g., 'aaaaa')The formula n*(n+1)/2 efficiently calculates the count of substrings in this case.
String with alternating characters (e.g., 'ababab')The solution should correctly count each single character substring as a homogenous substring.
Very long string (approaching maximum string length supported by the language)Ensure the solution uses an efficient algorithm (O(n)) to avoid exceeding time limits.
String containing non-ASCII charactersThe character comparison should handle Unicode characters correctly, which it will natively in most languages.
Integer overflow when calculating the number of substringsUse a 64-bit integer type (long) to store the count of homogenous substrings if necessary.
String containing mixed case characters, but we only care about homogenity of same charactersEnsure the count resets to zero if casing is different ( 'aAaa' would have substrings 'a', 'A', 'aa')