The DNA sequence is composed of a series of nucleotides abbreviated as 'A'
, 'C'
, 'G'
, and 'T'
. For example, "ACGAATTCCG"
is a DNA sequence. When studying DNA, it is useful to identify repeated sequences within the DNA. Given a string s
that represents a DNA sequence, return all the 10
-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order. Example 1: Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"] Example 2: Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"] Constraints: 1 <= s.length <= 10^5, s[i] is either 'A'
, 'C'
, 'G'
, or 'T'
.
## Finding Repeated DNA Sequences
This problem asks us to identify 10-letter-long sequences (substrings) that appear more than once within a given DNA sequence. We need to return all such repeated sequences, and the order doesn't matter.
### 1. Naive Solution (Brute Force)
The most straightforward approach is to generate all possible 10-letter substrings and then count the occurrences of each substring. If a substring appears more than once, we add it to our result. This approach is easy to understand but not very efficient.
```python
def findRepeatedDnaSequences_naive(s: str) -> list[str]:
if len(s) < 10:
return []
counts = {}
for i in range(len(s) - 9):
sub = s[i:i+10]
counts[sub] = counts.get(sub, 0) + 1
result = [sub for sub, count in counts.items() if count > 1]
return result
We can improve efficiency by using a hash table (dictionary in Python) to store the substrings and their counts. This avoids repeatedly generating the same substrings. We can iterate through the DNA string once, extracting each 10-letter substring, and updating its count in the hash table. This reduces the time complexity significantly.
def findRepeatedDnaSequences(s: str) -> list[str]:
if len(s) < 10:
return []
seen = {}
repeated = set()
for i in range(len(s) - 9):
sub = s[i:i + 10]
if sub in seen:
repeated.add(sub)
else:
seen[sub] = 1
return list(repeated)
s
and M is the length of the substring (which is 10 in this case). This is because we iterate through the string to extract each substring and then potentially iterate through a list of substrings to count occurrences.s
. We iterate through the string once. Hash table (dictionary) operations (insertion, lookup) take O(1) on average. Building substring is O(M) where M is substring length which is constant, and can be thought of as O(1).counts
dictionary can grow up to a size proportional to N.s
. In the worst case, seen
and repeated
can store up to N unique substrings. In practical scenarios, the space usage will likely be less than O(N), as the number of repeating DNA sequences will be limited. Also the set prevents duplicates so it is only storing unique repeating sequences.s
has a length less than 10, there are no 10-letter substrings, so we should return an empty list.