Minimum Genetic Mutation Interview Question for Amazon

A gene string can be represented by an 8-character long string, with choices from 'A', 'C', 'G', and 'T'. Suppose we need to investigate a mutation from a gene string startGene to a gene string endGene where one mutation is defined as one single character changed in the gene string.

For example, "AACCGGTT" --> "AACCGGTA" is one mutation.

There is also a gene bank bank that records all the valid gene mutations. A gene must be in bank to make it a valid gene string.

Given the two gene strings startGene and endGene and the gene bank bank, return the minimum number of mutations needed to mutate from startGene to endGene. If there is no such a mutation, return -1.

Note that the starting point is assumed to be valid, so it might not be included in the bank.

Example 1:

Input: startGene = "AACCGGTT", endGene = "AACCGGTA", bank = ["AACCGGTA"]
Output: 1

Example 2:

Input: startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]
Output: 2

Here's how to solve this problem:

Problem Description

We are given two gene strings, startGene and endGene, both of length 8, consisting of characters 'A', 'C', 'G', and 'T'. We are also given a gene bank bank containing valid gene strings. A mutation is defined as changing a single character in a gene string. The goal is to find the minimum number of mutations needed to transform startGene to endGene, using only gene strings present in the bank. The starting gene might not be in the bank itself.

Example:

startGene = "AACCGGTT", endGene = "AACCGGTA", bank = ["AACCGGTA"] Output: 1

startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"] Output: 2

Approach

We can use Breadth-First Search (BFS) to find the shortest path (minimum number of mutations) from startGene to endGene. Each gene string is a node in the graph, and an edge exists between two gene strings if they differ by only one character and are in the bank (or one of them is the start gene).

Initialization:
- Create a queue and add the startGene to it.
- Create a set to store visited gene strings to avoid cycles.
- Initialize the mutation count to 0.
BFS:
- While the queue is not empty:
  - Get the size of the queue (number of nodes at the current level).
  - Iterate size times:
    - Dequeue a gene string from the queue.
    - If it is equal to the endGene, return the current mutation count.
    - For each character in the gene string:
      - Try changing it to 'A', 'C', 'G', and 'T'.
      - If the new gene string is in the bank and has not been visited:
        
        Add it to the queue.
        
        Mark it as visited.
  - Increment the mutation count.
No Path:
- If the queue becomes empty and we haven't found the endGene, it means there is no valid mutation path. Return -1.

Code

from collections import deque

class Solution:
    def minMutation(self, startGene: str, endGene: str, bank: list[str]) -> int:
        bank_set = set(bank)
        if endGene not in bank_set:
            return -1

        queue = deque([(startGene, 0)])
        visited = {startGene}
        
        while queue:
            gene, mutations = queue.popleft()
            
            if gene == endGene:
                return mutations
            
            for i in range(8):
                for char in 'ACGT':
                    new_gene = gene[:i] + char + gene[i+1:]
                    if new_gene in bank_set and new_gene not in visited:
                        queue.append((new_gene, mutations + 1))
                        visited.add(new_gene)
        
        return -1

Explanation:

bank_set = set(bank): Converts the bank list to a set for faster lookups.
queue = deque([(startGene, 0)]): Initializes a queue with the starting gene and mutation count 0.
visited = {startGene}: Keeps track of visited genes to avoid cycles.
The while queue loop performs BFS.
The inner loops iterate through each character of the gene and each possible character ('A', 'C', 'G', 'T') to create new candidate genes.
new_gene in bank_set and new_gene not in visited: Checks if the new gene is valid (in the bank) and hasn't been visited before.
If a valid mutation is found, it's added to the queue with an incremented mutation count.
If the endGene is never reached, the function returns -1.

Example Usage:

startGene = "AACCGGTT"
endGene = "AACCGGTA"
bank = ["AACCGGTA"]

solution = Solution()
result = solution.minMutation(startGene, endGene, bank)
print(result)  # Output: 1

startGene = "AACCGGTT"
endGene = "AAACGGTA"
bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]

result = solution.minMutation(startGene, endGene, bank)
print(result) # Output: 2

Big O Runtime Analysis

Let n be the number of genes in the bank. In the worst case, we might visit all the genes in the bank.
For each gene, we iterate through 8 characters, and for each character, we try 4 possible replacements ('A', 'C', 'G', 'T'). This gives us a constant factor of 8 * 4 = 32.
Therefore, the time complexity is O(n * 32), which simplifies to O(n).

Big O Space Analysis

The queue queue can hold, in the worst case, all the genes in the bank. So, the space used by the queue is O(n).
The visited set can also hold all the genes in the bank. So, the space used by the set is O(n).
Therefore, the overall space complexity is O(n) + O(n) = O(n).