Taro Logo

Repeated DNA Sequences

Medium
1 views
2 months ago

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'. For example, "ACGAATTCCG" is a DNA sequence. When studying DNA, it is useful to identify repeated sequences within the DNA. Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order. Example 1: Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"] Example 2: Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"] Constraints: 1 <= s.length <= 10^5, s[i] is either 'A', 'C', 'G', or 'T'.

Sample Answer
## Finding Repeated DNA Sequences

This problem asks us to identify 10-letter-long sequences (substrings) that appear more than once within a given DNA sequence.  We need to return all such repeated sequences, and the order doesn't matter.

### 1. Naive Solution (Brute Force)

The most straightforward approach is to generate all possible 10-letter substrings and then count the occurrences of each substring. If a substring appears more than once, we add it to our result.  This approach is easy to understand but not very efficient.

```python
def findRepeatedDnaSequences_naive(s: str) -> list[str]:
    if len(s) < 10:
        return []

    counts = {}
    for i in range(len(s) - 9):
        sub = s[i:i+10]
        counts[sub] = counts.get(sub, 0) + 1

    result = [sub for sub, count in counts.items() if count > 1]
    return result

2. Optimal Solution (Hashing)

We can improve efficiency by using a hash table (dictionary in Python) to store the substrings and their counts. This avoids repeatedly generating the same substrings. We can iterate through the DNA string once, extracting each 10-letter substring, and updating its count in the hash table. This reduces the time complexity significantly.

def findRepeatedDnaSequences(s: str) -> list[str]:
    if len(s) < 10:
        return []

    seen = {}
    repeated = set()

    for i in range(len(s) - 9):
        sub = s[i:i + 10]
        if sub in seen:
            repeated.add(sub)
        else:
            seen[sub] = 1

    return list(repeated)

3. Big(O) Run-time Analysis

  • Naive Solution: O(N*M), where N is the length of the string s and M is the length of the substring (which is 10 in this case). This is because we iterate through the string to extract each substring and then potentially iterate through a list of substrings to count occurrences.
  • Optimal Solution: O(N), where N is the length of the string s. We iterate through the string once. Hash table (dictionary) operations (insertion, lookup) take O(1) on average. Building substring is O(M) where M is substring length which is constant, and can be thought of as O(1).

4. Big(O) Space Usage Analysis

  • Naive Solution: O(N), where N is the number of unique 10-letter substrings that appear in the given DNA string. In the worst-case scenario, where almost every 10-letter substring is unique, the counts dictionary can grow up to a size proportional to N.
  • Optimal Solution: O(N), where N is the length of the string s. In the worst case, seen and repeated can store up to N unique substrings. In practical scenarios, the space usage will likely be less than O(N), as the number of repeating DNA sequences will be limited. Also the set prevents duplicates so it is only storing unique repeating sequences.

5. Edge Cases

  • String Length Less Than 10: If the input string s has a length less than 10, there are no 10-letter substrings, so we should return an empty list.
  • Empty String: An empty string should also return an empty list, as it trivially satisfies the condition of having no repeating sequences.
  • String with Only One Unique Sequence: If the string only contains one unique 10-letter sequence (e.g., "AAAAAAAAAA"), the algorithm should identify it as a repeating sequence.
  • String with Overlapping Repeating Sequences: The solution correctly handles overlapping repeating sequences, as demonstrated in the given examples.