Taro Logo

Minimum Genetic Mutation

Medium
Apple logo
Apple
2 views
Topics:
GraphsStrings

A gene string can be represented by an 8-character long string, with choices from 'A', 'C', 'G', and 'T'.

Suppose we need to investigate a mutation from a gene string startGene to a gene string endGene where one mutation is defined as one single character changed in the gene string.

  • For example, "AACCGGTT" --> "AACCGGTA" is one mutation.

There is also a gene bank bank that records all the valid gene mutations. A gene must be in bank to make it a valid gene string.

Given the two gene strings startGene and endGene and the gene bank bank, return the minimum number of mutations needed to mutate from startGene to endGene. If there is no such a mutation, return -1.

Note that the starting point is assumed to be valid, so it might not be included in the bank.

Example 1:

Input: startGene = "AACCGGTT", endGene = "AACCGGTA", bank = ["AACCGGTA"]
Output: 1

Example 2:

Input: startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]
Output: 2

Constraints:

  • 0 <= bank.length <= 10
  • startGene.length == endGene.length == bank[i].length == 8
  • startGene, endGene, and bank[i] consist of only the characters ['A', 'C', 'G', 'T'].

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the allowed characters in the gene strings, and is the input guaranteed to contain only these allowed characters?
  2. If no valid mutation sequence exists to reach the endGene, what should the function return?
  3. Are duplicate gene strings allowed within the bank?
  4. Is the startGene guaranteed to be different from the endGene?
  5. Are all gene strings in the bank guaranteed to be the same length as the startGene and endGene?

Brute Force Solution

Approach

The brute force approach for finding the minimum genetic mutation involves exploring every possible sequence of single-character mutations from the start gene to the end gene. We generate all possible mutations at each step and check if they lead to the target, ultimately finding the shortest path. This is like trying every possible path through a maze to find the exit, regardless of how long or winding it is.

Here's how the algorithm would work step-by-step:

  1. Begin with the starting gene.
  2. Consider all possible single-character changes you can make to the starting gene.
  3. Check if any of those changed genes match the ending gene. If yes, you have found a possible mutation path.
  4. If the changed gene is not the ending gene, check if it is present in the provided bank of valid genes.
  5. If it's in the bank, remember this gene and consider it as the new current gene.
  6. From each of these new current genes, again consider all possible single-character mutations.
  7. Repeat the process of checking if the new mutated gene is the ending gene or a valid gene in the bank.
  8. Continue exploring every possible mutation from each gene in the bank until you find the ending gene, or you have exhausted all the possible mutations.
  9. Keep track of the number of mutations it takes to reach the ending gene for each possible path you find.
  10. Compare the number of mutations for all the paths that lead to the ending gene, and select the path with the fewest mutations as your answer.

Code Implementation

def minimum_genetic_mutation_brute_force(start_gene, end_gene, gene_bank):
    min_mutations = float('inf')

    def find_mutation_paths(current_gene, mutations, visited):
        nonlocal min_mutations

        if current_gene == end_gene:
            min_mutations = min(min_mutations, mutations)
            return

        # Need to avoid cycles in our search
        if current_gene in visited:
            return

        visited.add(current_gene)

        for gene in gene_bank:
            difference = 0
            for i in range(len(current_gene)):
                if current_gene[i] != gene[i]:
                    difference += 1

            # Explore possible single mutations from current_gene
            if difference == 1:
                find_mutation_paths(gene, mutations + 1, visited.copy())

    # Begin recursive search from the starting gene
    find_mutation_paths(start_gene, 0, set())

    # If no path was found, return -1
    if min_mutations == float('inf'):
        return -1
    else:
        return min_mutations

Big(O) Analysis

Time Complexity
O(4^L * N)Let L be the length of a gene and N be the number of genes in the bank. For each gene, there are 3 * L possible single-character mutations (since each character can be changed to 3 other bases 'A', 'C', 'G', 'T'). In the worst-case scenario, we explore all possible mutations from each valid gene. The branching factor is roughly 4 (since we have 3 possible single nucleotide changes plus the original). Considering we are exploring all possible paths of length L, we can approximate each path length as 4^L. We then need to check if these mutations are valid, which takes O(N) time as we would need to check against all genes in the bank. Therefore the overall time complexity is approximately O(4^L * N).
Space Complexity
O(B)The algorithm uses a queue (implicitly described as 'remembering genes') to store genes to explore, and a set (implicitly described as 'keeping track of visited locations') to avoid revisiting genes. In the worst-case scenario, the queue and the set could potentially store all genes in the bank, where B is the number of genes in the gene bank. Therefore, the auxiliary space used is proportional to the number of genes in the bank, leading to a space complexity of O(B).

Optimal Solution

Approach

The problem asks for the fewest changes to a starting word to reach a target word, using only words from a provided list. We find the shortest path by exploring possible mutations in a breadth-first manner, checking words one change away until we find the target.

Here's how the algorithm would work step-by-step:

  1. Think of the problem as finding the shortest path between two points, where each point is a word and each connection represents a single mutation.
  2. Start with the beginning word and consider all the words from the list that are only one mutation away. These are your immediate next steps.
  3. From each of those 'one mutation away' words, repeat the process: find other words that are one mutation away from *them*, but make sure you don't go back to a word you've already visited to avoid getting stuck in a loop.
  4. Keep track of how many mutations it took to get to each new word. The first time you reach the target word, you've found the shortest path, and the number of mutations is your answer.
  5. If you exhaust all possible mutations without reaching the target word, then there's no valid path, and you return a special value indicating that (like -1).

Code Implementation

def minMutation(start_gene, end_gene, gene_bank):
    queue = [(start_gene, 0)]
    visited_genes = {start_gene}

    while queue:
        current_gene, mutations_count = queue.pop(0)

        if current_gene == end_gene:
            return mutations_count

        # Iterate through all possible genes in the bank
        for next_gene in gene_bank:
            difference_count = 0
            for i in range(len(current_gene)):
                if current_gene[i] != next_gene[i]:
                    difference_count += 1

            # Check if the gene is one mutation away
            if difference_count == 1:
                # Avoid cycles by checking visited genes
                if next_gene not in visited_genes:
                    queue.append((next_gene, mutations_count + 1))
                    visited_genes.add(next_gene)

    # Return -1 if no mutation path is found
    return -1

Big(O) Analysis

Time Complexity
O(b*m*n)The algorithm uses a breadth-first search (BFS). In the worst case, we might visit all words in the bank (n words), where n is the size of the word bank. For each word visited, we generate all possible one-character mutations. Since each word has length m and can have 4 possible characters at each position, generating mutations takes O(m*4) which can be approximated as O(m), and for each mutation, we check if it's in the bank (n words worst case lookup time, can be better if stored as a set which makes it closer to O(1)). The BFS queue can hold all words in the bank in the worst case (b number of branches at each step). Therefore, the overall time complexity is O(b * m * n), where b is the branching factor and n is the number of words and m is the word length.
Space Complexity
O(N)The breadth-first search algorithm, described in the plain English explanation, implicitly uses a queue to store the words to explore. In the worst-case scenario, where nearly all the words in the provided list (of length N) are valid mutations that need to be checked, the queue could potentially hold all N words. Additionally, a set or similar data structure would be used to keep track of visited words to avoid cycles, which could also store up to N words. Therefore, the auxiliary space used is proportional to N, resulting in a space complexity of O(N).

Edge Cases

CaseHow to Handle
Null start or end geneThrow IllegalArgumentException or return -1 immediately, as the input is invalid.
Null bank arrayThrow IllegalArgumentException or return -1, as the bank is necessary for the search.
Empty bank arrayReturn -1 if the end gene is not equal to the start gene, otherwise return 0.
start gene is equal to end geneReturn 0 immediately as no mutation is required.
End gene is not reachable from start geneThe BFS will terminate without finding the end gene, so return -1.
Gene strings in bank are not all the same lengthThrow IllegalArgumentException because this violates input constraints and comparison is invalid.
Gene strings contain characters other than A, C, G, TThrow IllegalArgumentException as this violates the input constraints and invalidates the gene comparisons.
Large bank array leading to memory constraintsEnsure your BFS queue does not unnecessarily store duplicates or hold onto references longer than needed to prevent excessive memory usage.