Taro Logo

Repeated DNA Sequences

Medium
Apple logo
Apple
2 views
Topics:
StringsSliding Windows

You are given a DNA sequence represented as a string s. The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'. For example, "ACGAATTCCG" is a DNA sequence. In studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

For example:

  • Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"]
  • Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"]

Constraints:

  • 1 <= s.length <= 10^5
  • s[i] is either 'A', 'C', 'G', or 'T'.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What characters can the DNA sequence `s` contain, and is the input case-sensitive?
  2. What should I return if there are no repeated 10-letter subsequences?
  3. Should the returned list of repeated subsequences contain duplicates, or should each repeated subsequence appear only once in the output?
  4. Is the length of the input string `s` guaranteed to be at least 10, or can it be shorter?
  5. Do the repeated 10-letter subsequences need to be non-overlapping?

Brute Force Solution

Approach

The brute force method for finding repeated DNA sequences involves examining every possible sequence of a specified length within the DNA string. We look at each segment, compare it with all other segments of the same length, and note the ones that appear more than once. This is an exhaustive comparison of all possible substrings.

Here's how the algorithm would work step-by-step:

  1. Consider all possible short sequences of a fixed length within the longer DNA string.
  2. For each short sequence, go through the entire DNA string again to see how many times that particular short sequence appears.
  3. If a short sequence is found to appear more than once, remember it.
  4. Do this for every possible short sequence you can create from the original DNA string.
  5. At the end, you will have a list of all the short sequences that appear more than once.

Code Implementation

def find_repeated_dna_sequences_brute_force(dna_string, sequence_length):
    repeated_sequences = set()
    all_sequences = set()

    # Iterate through all possible start positions of the sequence
    for i in range(len(dna_string) - sequence_length + 1):
        current_sequence = dna_string[i:i + sequence_length]

        # If we've already seen the sequence
        if current_sequence in all_sequences:
            # It's a repeat, add to repeats
            repeated_sequences.add(current_sequence)

        # Add sequence to overall set of sequences we've seen
        all_sequences.add(current_sequence)

    return list(repeated_sequences)

Big(O) Analysis

Time Complexity
O(n²)The provided approach iterates through all possible substrings of length k (where k=10 in the original problem), which involves an outer loop that effectively goes up to n-k+1 times, where n is the length of the DNA string. For each of these substrings, the algorithm compares it against all other substrings of length k in the DNA string using an inner loop, also running up to n-k+1 times. Therefore, we have a nested loop structure where the outer loop runs approximately n times, and the inner loop also runs approximately n times, leading to approximately n*n operations. Thus, the overall time complexity is O(n²).
Space Complexity
O(N)The brute force algorithm, as described, remembers repeated sequences. In the worst-case scenario, it might need to store a collection of unique repeated DNA sequences. The number of repeated sequences could grow linearly with the size of the DNA string N. Therefore, the auxiliary space required to store these repeated sequences could be proportional to N, resulting in O(N) space complexity.

Optimal Solution

Approach

The problem asks us to find DNA sequences that appear more than once. The efficient way to solve this is to condense each DNA sequence into a shorter representation and then check if we have seen that compressed version before.

Here's how the algorithm would work step-by-step:

  1. First, go through the long DNA string and break it down into smaller, overlapping pieces of a specific length (10 characters in this case).
  2. Instead of comparing the actual sequences, turn each DNA piece into a simpler number code using a special mapping where each DNA character corresponds to a number (for example, A=0, C=1, G=2, T=3).
  3. Think of each number code as a key. Store each key in a 'seen' tracker. This tracker helps in identifying previously processed codes.
  4. If a key is seen for the first time, store it in the tracker. However, if the same key is found again, add the corresponding DNA sequence to our list of repeated sequences.
  5. Finally, return the list of repeated DNA sequences.

Code Implementation

def find_repeated_dna_sequences(dna_string):
    sequence_length = 10
    seen_sequences = set()
    repeated_sequences = set()
    result = []

    if len(dna_string) < sequence_length:
        return result

    dna_map = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    for i in range(len(dna_string) - sequence_length + 1):
        subsequence = dna_string[i:i + sequence_length]

        # Convert DNA sequence to a numerical representation.
        numerical_sequence = 0
        for char in subsequence:
            numerical_sequence = (numerical_sequence * 4) + dna_map[char]

        # Check if we've seen this sequence before.
        if numerical_sequence in seen_sequences:
            # If the sequence is already in seen, it's a repeat.
            if subsequence not in repeated_sequences:
                result.append(subsequence)
                repeated_sequences.add(subsequence)
        else:
            # Add the sequence to seen for future checks.
            seen_sequences.add(numerical_sequence)

    return result

Big(O) Analysis

Time Complexity
O(n)The algorithm iterates through the DNA string of length n to extract overlapping sequences of length 10. For each sequence, a constant-time hash calculation (sequence to number code) and hash table lookup (seen tracker) are performed. The hash table operations (insertion and lookup) take expected O(1) time on average. Therefore, the overall time complexity is dominated by the single loop through the DNA string, resulting in O(n).
Space Complexity
O(N)The algorithm uses a 'seen' tracker to store the compressed number codes and a list to hold the repeated DNA sequences. In the worst-case scenario, where almost all DNA sequences of length 10 are unique, the 'seen' tracker (likely implemented as a hash set or dictionary) could potentially store close to N-9 number codes, where N is the length of the DNA string. The list of repeated sequences could also grow proportionally to N in a contrived case. Therefore, the auxiliary space complexity is O(N).

Edge Cases

CaseHow to Handle
Input string s is null or emptyReturn an empty list as there are no subsequences to find.
Input string s length is less than 10Return an empty list as no 10-letter subsequences can be formed.
Input string s contains characters other than 'A', 'C', 'G', 'T'Consider invalid characters, either filter them out or throw an error.
Very long input string s to test performanceEnsure the solution scales well using efficient data structures like hash sets or dictionaries to avoid quadratic time complexity.
Input string s with many repeated 10-letter subsequencesThe solution should correctly identify and return all unique repeated subsequences without duplicates in the result.
A subsequence appears more than twice.The solution should only add the subsequence to the result once, even if its count is greater than or equal to 2.
Input string consists of the same repeating character, e.g. 'AAAAAAAAAA'The sliding window approach should correctly identify the repeating subsequence in this scenario.
The last few characters can create a valid 10-letter sequence.The code needs to ensure it iterates to the correct end position when generating 10 letter sequences.