You are given a DNA sequence represented as a string s
. The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'. For example, "ACGAATTCCG" is a DNA sequence. In studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s
that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
For example:
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 10^5
s[i]
is either 'A'
, 'C'
, 'G'
, or 'T'
.When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:
The brute force method for finding repeated DNA sequences involves examining every possible sequence of a specified length within the DNA string. We look at each segment, compare it with all other segments of the same length, and note the ones that appear more than once. This is an exhaustive comparison of all possible substrings.
Here's how the algorithm would work step-by-step:
def find_repeated_dna_sequences_brute_force(dna_string, sequence_length):
repeated_sequences = set()
all_sequences = set()
# Iterate through all possible start positions of the sequence
for i in range(len(dna_string) - sequence_length + 1):
current_sequence = dna_string[i:i + sequence_length]
# If we've already seen the sequence
if current_sequence in all_sequences:
# It's a repeat, add to repeats
repeated_sequences.add(current_sequence)
# Add sequence to overall set of sequences we've seen
all_sequences.add(current_sequence)
return list(repeated_sequences)
The problem asks us to find DNA sequences that appear more than once. The efficient way to solve this is to condense each DNA sequence into a shorter representation and then check if we have seen that compressed version before.
Here's how the algorithm would work step-by-step:
def find_repeated_dna_sequences(dna_string):
sequence_length = 10
seen_sequences = set()
repeated_sequences = set()
result = []
if len(dna_string) < sequence_length:
return result
dna_map = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
for i in range(len(dna_string) - sequence_length + 1):
subsequence = dna_string[i:i + sequence_length]
# Convert DNA sequence to a numerical representation.
numerical_sequence = 0
for char in subsequence:
numerical_sequence = (numerical_sequence * 4) + dna_map[char]
# Check if we've seen this sequence before.
if numerical_sequence in seen_sequences:
# If the sequence is already in seen, it's a repeat.
if subsequence not in repeated_sequences:
result.append(subsequence)
repeated_sequences.add(subsequence)
else:
# Add the sequence to seen for future checks.
seen_sequences.add(numerical_sequence)
return result
Case | How to Handle |
---|---|
Input string s is null or empty | Return an empty list as there are no subsequences to find. |
Input string s length is less than 10 | Return an empty list as no 10-letter subsequences can be formed. |
Input string s contains characters other than 'A', 'C', 'G', 'T' | Consider invalid characters, either filter them out or throw an error. |
Very long input string s to test performance | Ensure the solution scales well using efficient data structures like hash sets or dictionaries to avoid quadratic time complexity. |
Input string s with many repeated 10-letter subsequences | The solution should correctly identify and return all unique repeated subsequences without duplicates in the result. |
A subsequence appears more than twice. | The solution should only add the subsequence to the result once, even if its count is greater than or equal to 2. |
Input string consists of the same repeating character, e.g. 'AAAAAAAAAA' | The sliding window approach should correctly identify the repeating subsequence in this scenario. |
The last few characters can create a valid 10-letter sequence. | The code needs to ensure it iterates to the correct end position when generating 10 letter sequences. |