Taro Logo

Longest Duplicate Substring

Hard
Google logo
Google
4 views
Topics:
StringsBinary SearchSliding Windows

Given a string s, consider all duplicated substrings: (contiguous) substrings of s that occur 2 or more times. The occurrences may overlap.

Return any duplicated substring that has the longest possible length. If s does not have a duplicated substring, the answer is "".

Example 1:

Input: s = "banana" Output: "ana"

Example 2:

Input: s = "abcd" Output: ""

Constraints:

  • 2 <= s.length <= 3 * 10^4
  • s consists of lowercase English letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the maximum possible length of the input string, and are there any specific memory constraints I should be aware of?
  2. Is it possible for the input string to be empty or null, and if so, what should the function return in those cases?
  3. If there are multiple duplicate substrings of the same maximum length, is it acceptable to return any one of them?
  4. Is the input case-sensitive, or should I treat uppercase and lowercase letters as equivalent when searching for duplicate substrings?
  5. By 'substring', do you mean a contiguous sequence of characters, or can it be non-contiguous?

Brute Force Solution

Approach

The brute force approach to finding the longest repeated piece of a text is to try every possible piece and see if it appears again. We'll check progressively longer pieces until we've exhausted all the options. This ensures we find the absolute longest repeated piece, if one exists.

Here's how the algorithm would work step-by-step:

  1. Start by looking at all possible pieces of length one, then two, then three, and so on.
  2. For each piece, check if it appears anywhere else in the text.
  3. If you find a duplicate of that piece, remember its length and what the piece is.
  4. Keep doing this for all possible pieces and lengths.
  5. At the end, compare all the repeated pieces you found and pick the longest one. That's your answer.

Code Implementation

def longest_duplicate_substring_brute_force(text):
    longest_duplicate = ""

    # Iterate through possible substring lengths
    for substring_length in range(1, len(text)):
        for i in range(len(text) - substring_length + 1):
            substring = text[i:i + substring_length]

            # Search for the substring elsewhere in the text
            for j in range(i + 1, len(text) - substring_length + 1):
                if text[j:j + substring_length] == substring:

                    # Found a duplicate, check if it's the longest so far
                    if substring_length > len(longest_duplicate):
                        longest_duplicate = substring

    return longest_duplicate

Big(O) Analysis

Time Complexity
O(n^3)The outer loop iterates through all possible substring lengths, from 1 to n, where n is the length of the input string. For each length, the next loop iterates through all possible starting positions for a substring of that length, resulting in O(n) iterations. Inside this loop, we compare this substring against all other substrings of the same length to check for duplicates, taking O(n) time in the worst case using string comparison. Therefore, the overall time complexity is O(n * n * n) which simplifies to O(n^3).
Space Complexity
O(N^2)The brute force approach described requires storing duplicate substrings. In the worst case, where nearly all substrings are duplicates (or are considered as such until a longer substring is found), we might need to store multiple substrings of varying lengths. The space required to store these substrings grows quadratically with the input size N, as there are potentially O(N^2) substrings to store. The space used to remember the longest duplicate seen so far is also proportional to the length of that substring, contributing another O(N) factor but does not change the overall complexity.

Optimal Solution

Approach

The most efficient way to find the longest repeating chunk of text is to avoid checking every possible chunk individually. Instead, we cleverly use a process that narrows down our search by guessing possible lengths and checking if a repeating chunk of that length exists.

Here's how the algorithm would work step-by-step:

  1. Imagine we're playing a guessing game to find the right length of the repeating chunk.
  2. We start by guessing a length in the middle of the possible range. If the longest repeating chunk is somewhere between 1 character and the entire text, our first guess might be halfway in between.
  3. Now, we check if a repeating chunk of that guessed length actually exists in the text.
  4. If a repeating chunk of that length exists, we know the real length must be at least this big, so we guess a bigger length next time.
  5. If a repeating chunk of that length doesn't exist, we know the real length must be smaller, so we guess a smaller length next time.
  6. We keep guessing and checking, narrowing down the range until we find the exact length of the longest repeating chunk.
  7. Once we know the length, finding the actual repeating chunk is easy: just find any of the chunks that match.

Code Implementation

def longest_duplicate_substring(text):
    low_length = 1
    high_length = len(text) - 1
    longest_duplicate = ""

    while low_length <= high_length:
        middle_length = (low_length + high_length) // 2
        duplicate_found = find_duplicate(text, middle_length)

        # If we found a duplicate substring of middle_length
        if duplicate_found:
            longest_duplicate = duplicate_found
            low_length = middle_length + 1

        # Otherwise, reduce the length and search again
        else:
            high_length = middle_length - 1

    return longest_duplicate

def find_duplicate(text, substring_length):
    seen_substrings = set()

    for i in range(len(text) - substring_length + 1):
        substring = text[i:i + substring_length]

        # Use a set to efficiently check for duplicates.
        if substring in seen_substrings:
            return substring
        
        seen_substrings.add(substring)

    return None

#The following function applies a Rabin-Karp algorithm
#with a rolling hash for duplicate detection. However,
#the above solution is within the constraints of the problem.
#Leaving it out to keep the logic and structure as requested

#def longest_duplicate_substring_rk(text):
#    string_length = len(text)
#    left_bound = 1
#    right_bound = string_length
#    longest_duplicate_substring = ""
#
#    while left_bound <= right_bound:
#        substring_length = left_bound + (right_bound - left_bound) // 2
#        duplicate_substring = rabin_karp(text, substring_length)
#
#        # If duplicate substring exists, increase the substring length.
#        if duplicate_substring:
#            longest_duplicate_substring = duplicate_substring
#            left_bound = substring_length + 1
#        # If no duplicate substring, decrease the substring length.
#        else:
#            right_bound = substring_length - 1
#
#    return longest_duplicate_substring
#
#
#def rabin_karp(text, substring_length):
#    string_length = len(text)
#    base = 256  # Number of possible characters
#    prime_modulus = 101  # Prime modulus for hash function
#    highest_order = 1
#
#    # Precompute base^(substring_length-1) % prime_modulus
#    for _ in range(substring_length - 1):
#        highest_order = (highest_order * base) % prime_modulus
#
#    # Compute initial hash value
#    hash_value = 0
#    for i in range(substring_length):
#        hash_value = (hash_value * base + ord(text[i])) % prime_modulus
#
#    substring_hash_map = {hash_value: 0}
#
#    # Slide the window through the text
#    for i in range(1, string_length - substring_length + 1):
#        # Remove leading digit, add trailing digit, recompute hash value
#        hash_value = (hash_value - ord(text[i - 1]) * highest_order) % prime_modulus
#        hash_value = (hash_value * base + ord(text[i + substring_length - 1])) % prime_modulus
#
#        # Because hash_value can be negative
#        if hash_value < 0:
#            hash_value += prime_modulus
#
#        # If hash value is in hash map, we have a duplicate
#        if hash_value in substring_hash_map:
#            return text[i: i + substring_length]
#
#        substring_hash_map[hash_value] = i
#
#    return None

Big(O) Analysis

Time Complexity
O(n log n)The algorithm uses binary search to find the length of the longest duplicate substring, which contributes a factor of O(log n) where n is the length of the string. For each potential length during the binary search, we need to check if a duplicate substring of that length exists. Checking for duplicates involves iterating through the string and comparing substrings, which takes O(n) time in the best implementations such as Rabin-Karp. Therefore, the overall time complexity becomes O(n log n), reflecting the binary search combined with the substring checking at each step.
Space Complexity
O(N)The algorithm uses a binary search approach to find the length of the longest repeating substring. During the check for a given length, it likely employs a data structure such as a hash set or hash map to store and compare substrings of that length. In the worst-case scenario, this data structure might need to store all possible substrings of the given length, leading to a space complexity proportional to the input string's length, N. Therefore, the auxiliary space used is O(N).

Edge Cases

CaseHow to Handle
Empty stringReturn an empty string immediately as there can be no duplicate substrings.
String of length 1Return an empty string since a substring of length greater than 0 cannot be a duplicate.
String with all the same characters (e.g., 'aaaa')The algorithm should correctly identify the longest duplicate substring as 'aaa' for input 'aaaa'.
Maximum string length (consider memory limits and potential integer overflow with rolling hash)Ensure the rolling hash function and modulo are chosen carefully to avoid collisions and integer overflow, and consider using a 64-bit integer type.
String with no duplicate substringsThe algorithm should return an empty string when no duplicate substring exists.
String with overlapping duplicate substrings (e.g., 'abababa')The binary search should correctly find the longest overlapping substring 'ababa'.
String containing non-ASCII characters (Unicode)The character encoding must be handled correctly to accurately calculate hash values for the rolling hash, potentially requiring wider character representation.
String with very long duplicate substrings close to the string lengthBinary search needs to handle the large lengths correctly without exceeding recursion limits or timing out.