Taro Logo

Find Maximum Number of Non Intersecting Substrings

Medium
Asked by:
Profile picture
4 views
Topics:
ArraysGreedy Algorithms

You are given a string word.

Return the maximum number of non-intersecting substrings of word that are at least four characters long and start and end with the same letter.

Example 1:

Input: word = "abcdeafdef"

Output: 2

Explanation:

The two substrings are "abcdea" and "fdef".

Example 2:

Input: word = "bcdaaaab"

Output: 1

Explanation:

The only substring is "aaaa". Note that we cannot also choose "bcdaaaab" since it intersects with the other substring.

Constraints:

  • 1 <= word.length <= 2 * 105
  • word consists only of lowercase English letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. To clarify the definition of 'non-intersecting', if one valid substring ends at index `j` and another begins at index `j+1`, are they considered non-intersecting since they don't share any common indices?
  2. If the input string has a length less than 4, it's impossible to form any valid substrings. In this case, is returning 0 the correct behavior?
  3. Consider a string like 'abccba', which contains two valid but overlapping substrings: 'abccba' and the shorter 'bccb'. To maximize the final count, is it a valid strategy to prioritize choosing the shorter substring because it finishes earlier?
  4. If one large valid substring, like 'axxxayyyb', were to entirely contain two other non-intersecting valid substrings, say 'axxxa' and 'byyyb', my goal is to maximize the count. So I should choose the two smaller substrings for a count of 2, correct?
  5. The problem states the input consists of lowercase English letters. Is the small, fixed alphabet size of 26 an intentional hint that my approach should perhaps involve pre-processing the string based on characters, for instance, by mapping each character to the indices where it appears?

Brute Force Solution

Approach

The basic idea is to be extremely thorough by checking every single possibility. We will first identify all the valid 'special' pieces of text, and then try every conceivable combination of these pieces to find the group that contains the most pieces without any of them overlapping.

Here's how the algorithm would work step-by-step:

  1. First, let's list every single possible continuous piece of the original text, from the shortest possible (just one letter) to the longest (the entire text).
  2. Next, for each of these pieces, we'll check if it's 'special'. A piece is special only if for every letter it contains, all instances of that letter from the entire original text are also contained within that same piece.
  3. We'll create a new, filtered collection that contains only these special pieces of text.
  4. From this collection of special pieces, we'll generate every possible team or grouping of them. We'll consider teams of one, then all possible teams of two, all possible teams of three, and so on.
  5. For each team we form, we'll check a crucial rule: do any of the pieces in the team step on each other's toes or occupy the same spot? If they overlap, we disqualify that entire team.
  6. After checking all teams, we're left with only the valid ones, where no pieces overlap. We'll look through them to find which team has the most pieces in it.
  7. If there's a tie and multiple teams have the same, highest number of pieces, we'll use a tie-breaker: we choose the team whose pieces have the smallest combined length.

Code Implementation

def find_maximum_number_of_non_intersecting_substrings(input_string):
    string_length = len(input_string)
    all_substrings_with_indices = []

    for start_index in range(string_length):
        for end_index in range(start_index, string_length):
            substring = input_string[start_index : end_index + 1]
            all_substrings_with_indices.append((substring, start_index, end_index))

    # Pre-calculating character bounds is crucial for efficiently checking the 'valid segment' rule.

    character_first_occurrence = {}
    character_last_occurrence = {}
    for index, character in enumerate(input_string):
        if character not in character_first_occurrence:
            character_first_occurrence[character] = index
        character_last_occurrence[character] = index
    
    valid_segments = []
    for substring, start_index, end_index in all_substrings_with_indices:
        is_segment_valid = True
        unique_chars_in_substring = set(substring)
        for character in unique_chars_in_substring:
            first_pos = character_first_occurrence[character]
            last_pos = character_last_occurrence[character]
            if not (first_pos >= start_index and last_pos <= end_index):
                is_segment_valid = False
                break
        if is_segment_valid:
            valid_segments.append((substring, start_index, end_index))

    # To exhaustively check all groupings, we generate the powerset of all valid segments.

    all_possible_combinations = []
    number_of_valid_segments = len(valid_segments)
    for combination_mask in range(1 << number_of_valid_segments):
        current_combination = []
        for item_index in range(number_of_valid_segments):
            if (combination_mask >> item_index) & 1:
                current_combination.append(valid_segments[item_index])
        all_possible_combinations.append(current_combination)
        
    best_combination_result = []
    max_substring_count = 0
    min_total_length_for_max_count = float('inf')

    for combination in all_possible_combinations:
        if not combination:
            continue

        # Any valid solution requires non-intersecting substrings, so we filter out all overlapping sets.

        is_combination_overlapping = False
        sorted_combination = sorted(combination, key=lambda segment: segment[1])
        for segment_index in range(len(sorted_combination) - 1):
            end_of_first_segment = sorted_combination[segment_index][2]
            start_of_second_segment = sorted_combination[segment_index+1][1]
            if end_of_first_segment >= start_of_second_segment:
                is_combination_overlapping = True
                break
        
        if not is_combination_overlapping:
            current_substring_count = len(combination)
            
            # If two combinations have the same number of substrings, the tie is broken by total length.

            if current_substring_count > max_substring_count:
                max_substring_count = current_substring_count
                current_total_length = sum(len(segment[0]) for segment in combination)
                min_total_length_for_max_count = current_total_length
                best_combination_result = [segment[0] for segment in combination]
            elif current_substring_count == max_substring_count:
                current_total_length = sum(len(segment[0]) for segment in combination)
                if current_total_length < min_total_length_for_max_count:
                    min_total_length_for_max_count = current_total_length
                    best_combination_result = [segment[0] for segment in combination]
    
    return best_combination_result

Big(O) Analysis

Time Complexity
O(2^(n²))The primary driver of the complexity is generating every conceivable combination of 'special' substrings. First, up to O(n²) potential 'special' substrings are identified from the input string of length n. In the worst case, the number of these valid special substrings, let's call it m, can be as large as O(n²). The core of the algorithm then generates all 2^m possible teams (subsets) from this collection, and this exponential explosion is the computational bottleneck, leading to a time complexity that simplifies to O(2^(n²)).
Space Complexity
O(2^(N^2))The dominant factor in space complexity is the generation and storage of 'every conceivable combination' of special substrings. The algorithm first identifies up to O(N^2) special substrings from the input string of length N, storing them in a collection. It then generates the power set of this collection, which means creating and holding up to 2^(O(N^2)) different teams in memory for processing. This temporary storage of an exponential number of teams before they are filtered is the most memory-intensive step, leading to O(2^(N^2)) auxiliary space.

Optimal Solution

Approach

The optimal strategy involves two main phases. First, we identify all possible 'self-contained' substrings, which are pieces of the string where every character within them appears only in that piece. Then, we cleverly select from these candidates by greedily picking the ones that finish the earliest, which ensures we can fit the maximum number of non-overlapping pieces.

Here's how the algorithm would work step-by-step:

  1. First, determine the 'home range' for every unique character in the string by finding its very first and very last appearance.
  2. Using these home ranges, identify all potential 'self-contained' substrings. A piece is self-contained if the complete home range of every character inside it is also fully contained within that piece.
  3. For each possible starting position, we only need to consider the shortest self-contained substring that can begin there. This gives us a refined list of candidate pieces to choose from.
  4. Arrange all these candidate substrings in order based on their ending position, so the ones that finish earliest are at the front of the line.
  5. Begin building the result by picking the very first candidate in the sorted list — the one that ends the soonest.
  6. Once a piece is chosen, discard all other candidates from the list that have any overlap with the one you just selected.
  7. Repeat the process with the remaining candidates: pick the one that finishes earliest, add it to the result, and discard any new overlaps. Continue until no more candidates are left.

Code Implementation

def maxNumOfSubstrings(input_string: str) -> list[str]:
    first_occurrence_map = {}
    last_occurrence_map = {}
    for index, character in enumerate(input_string):
        if character not in first_occurrence_map:
            first_occurrence_map[character] = index
        last_occurrence_map[character] = index

    # Pre-calculate the first and last appearance of each character to define substring boundaries.

    candidate_intervals = []
    string_length = len(input_string)
    for start_index in range(string_length):
        current_end_index = last_occurrence_map[input_string[start_index]]
        
        is_valid_candidate = True
        scan_index = start_index
        while scan_index <= current_end_index:
            if first_occurrence_map[input_string[scan_index]] < start_index:
                is_valid_candidate = False
                break
            
            current_end_index = max(current_end_index, last_occurrence_map[input_string[scan_index]])
            scan_index += 1
        
        if is_valid_candidate:
            # The loop verified this is a minimal 'self-contained' substring starting at this index.

            candidate_intervals.append((start_index, current_end_index))

    # Sorting by end position is the core of the greedy strategy to maximize remaining space.

    candidate_intervals.sort(key=lambda interval: (interval[1], interval[1] - interval[0]))

    result_substrings = []
    last_chosen_end_index = -1

    for start_of_interval, end_of_interval in candidate_intervals:
        # Greedily select the earliest-ending interval that doesn't conflict with our last choice.

        if start_of_interval > last_chosen_end_index:
            result_substrings.append(input_string[start_of_interval : end_of_interval + 1])
            last_chosen_end_index = end_of_interval
            
    return result_substrings

Big(O) Analysis

Time Complexity
O(n²)The primary computational cost comes from identifying all 'self-contained' substrings for every possible starting position in the string of length n. This process involves a nested loop structure where an outer loop iterates through each of the n starting characters, and for each, an inner loop expands to find the valid endpoint. In the worst case, for many of the n starting positions, the inner loop may scan a large fraction of the string, leading to a total number of operations that approximates n * n/2. This quadratic work dominates other steps, so the overall time complexity simplifies to O(n²).
Space Complexity
O(N)The primary space cost comes from storing the candidate substrings and the final result. Initially, a constant-space array is used to track the 'home range' for each unique character, as the alphabet size is fixed. The crucial step is generating a list of 'candidate pieces,' which can contain up to N substrings, one for each starting position, requiring O(N) space. Subsequently, the list holding the final selection of non-intersecting substrings can also grow to a size proportional to N in the worst-case scenario. Thus, the auxiliary space is dominated by these lists, resulting in O(N) complexity where N is the input string's length.

Edge Cases

Input string is shorter than 4 characters
How to Handle:
The solution returns 0 as no substring can satisfy the minimum length requirement of four.
Input string with all unique characters
How to Handle:
The algorithm correctly returns 0 because no substring can be formed that starts and ends with the same character.
Input string consists of a single character repeated many times, e.g., 'aaaaaaaa'
How to Handle:
The solution correctly identifies the maximum number of non-overlapping blocks of four, calculating a result of floor(length / 4).
Input containing nested valid substrings, such as 'abccba' where 'bccb' is inside 'abccba'
How to Handle:
The dynamic programming state correctly ensures that intersecting substrings are not counted together, resulting in an output of 1.
Input with character repetitions too close to form a valid substring, e.g., 'abacaba'
How to Handle:
The algorithm's length check correctly filters out these invalid short substrings, leading to a count of 0 if no valid ones exist.
Maximum length input of 2 * 10^5 characters
How to Handle:
The linear time O(N) dynamic programming solution scales efficiently and avoids a timeout on large inputs.
Input where smaller substrings yield a better result than one large one, e.g., 'aaaabaaa'
How to Handle:
The DP correctly maximizes the count by choosing two non-intersecting 'aaaa' substrings over the single longer intersecting one.
Input where no valid substrings can be formed despite meeting length and repetition criteria, e.g., 'aaabbb'
How to Handle:
The solution correctly returns 0 as no pair of identical characters are at least 3 indices apart.