You are given a 2D array paths, where paths[i] is an array representing an absolute path to the i-th folder in the file system. Due to a bug, there are many duplicate folders in the file system. Your task is to identify and remove these duplicate folders. Two folders are considered identical if they contain the same non-empty set of identical subfolders and underlying subfolder structure. The folders do not need to be at the root level to be identical. If two or more folders are identical, then mark the folders as well as all their subfolders for deletion. The file system only runs the deletion once, so any folders that become identical after the initial deletion are not deleted.

Return the 2D array ans containing the paths of the remaining folders after deleting all the marked folders. The paths may be returned in any order.
Example:
Input: paths = [["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]]
Output: [["d"],["d","a"]]
In this example, folders "/a" and "/c" (and their subfolders) are marked for deletion because they both contain an empty folder named "b".
Constraints:
1 <= paths.length <= 2 * 10^4
1 <= paths[i].length <= 500
1 <= paths[i][j].length <= 10
1 <= sum(paths[i][j].length) <= 2 * 10^5
paths[i][j] consists of lowercase English letters.

Could you provide an efficient algorithm to solve this problem?
When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask clarifying questions before committing to a solution.
We need to find and remove duplicate folders from the file system. The brute-force way is to compare every path to every other path, one pair at a time, and drop any path that is an exact copy of one we have already kept, which takes quadratic time in the number of paths.

Here's how that looks in code:
def delete_duplicate_folders_brute_force(file_system):
    number_of_folders = len(file_system)
    folders_to_delete = [False] * number_of_folders
    for current_folder_index in range(number_of_folders):
        # Skip folders already marked for deletion
        if folders_to_delete[current_folder_index]:
            continue
        for comparison_folder_index in range(current_folder_index + 1, number_of_folders):
            # Avoid comparing to deleted folders
            if folders_to_delete[comparison_folder_index]:
                continue
            if file_system[current_folder_index] == file_system[comparison_folder_index]:
                # Mark the duplicate folder for deletion
                folders_to_delete[comparison_folder_index] = True
    new_file_system = []
    for folder_index in range(number_of_folders):
        # Rebuild the file system without duplicates
        if not folders_to_delete[folder_index]:
            new_file_system.append(file_system[folder_index])
    return new_file_system
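As a quick, hedged sanity check, the function can be exercised like this; the input below is a toy example made up for illustration (it is not drawn from the problem statement) and only shows the pairwise comparison in action:

    if __name__ == "__main__":
        # Toy input containing one exact duplicate path
        toy_paths = [["x"], ["x"], ["x", "y"]]
        print(delete_duplicate_folders_brute_force(toy_paths))
        # Prints [['x'], ['x', 'y']] -- the second exact copy of ['x'] is dropped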
The most efficient way to find and remove duplicate folders is to build a trie of the paths and serialize each folder's subtree into a canonical string, visiting children in sorted order. A dictionary keyed by these strings groups folders with identical structure; every group containing more than one non-empty folder is marked for deletion, and the answer is collected with a traversal that skips marked subtrees, which also removes all of their subfolders.

Here's how that looks in code:
class Node:
    def __init__(self, folder_name):
        self.folder_name = folder_name
        self.children = {}

def delete_duplicate_folders(folders):
    root = Node("root")

    def build_trie(path):
        current = root
        for folder in path:
            if folder not in current.children:
                current.children[folder] = Node(folder)
            current = current.children[folder]

    for folder in folders:
        build_trie(folder)

    # Maps each subtree serialization to the list of nodes that share it
    folder_map = {}

    def encode_trie(node):
        # Serialize the subtree rooted at this node; children are visited in
        # sorted order so identical structures always produce the same string
        if not node.children:
            # Empty folders are never duplicates, so they are not recorded
            return "#"
        parts = ["("]
        for child_name in sorted(node.children.keys()):
            parts.append(child_name + encode_trie(node.children[child_name]))
        parts.append(")")
        encoded_string = "".join(parts)
        folder_map.setdefault(encoded_string, []).append(node)
        return encoded_string

    for top_level_folder in root.children.values():
        encode_trie(top_level_folder)

    # Every folder whose structure appears more than once is marked, together
    # with all the other folders in its group
    marked_nodes = set()
    for nodes_with_same_structure in folder_map.values():
        if len(nodes_with_same_structure) > 1:
            marked_nodes.update(nodes_with_same_structure)

    # Collect the paths of the surviving folders; skipping a marked node also
    # drops its entire subtree, which deletes all of its subfolders
    remaining_paths = []

    def collect_remaining(node, current_path):
        for child_name, child_node in node.children.items():
            if child_node in marked_nodes:
                continue
            child_path = current_path + [child_name]
            remaining_paths.append(child_path)
            collect_remaining(child_node, child_path)

    collect_remaining(root, [])
    return remaining_paths
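As a quick check, the function can be run on the example from the problem statement (assuming the code above is saved in the same file):

    if __name__ == "__main__":
        example_paths = [["a"], ["c"], ["d"], ["a", "b"], ["c", "b"], ["d", "a"]]
        print(sorted(delete_duplicate_folders(example_paths)))
        # Prints [['d'], ['d', 'a']], matching the expected output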
Case | How to Handle |
---|---|
Null or empty input list of folder paths | Return an empty list or null to indicate no folders to process. |
All folder paths are identical | After deduplication, only one copy of the folder should remain in the final list. |
Very long folder paths (string length close to maximum allowed) | Ensure string operations and hashing are efficient enough to avoid performance bottlenecks or errors due to string size limitations. |
Deeply nested folder structures (many levels of subdirectories) | Consider the potential for stack overflow in recursive implementations; an iterative traversal (see the sketch after this table) may be preferred. |
Large number of folders with very similar directory structures | The hashing algorithm should be chosen carefully to minimize collisions and maintain good performance. |
Cycles in the folder structure (due to symbolic links, etc.) | Implement cycle detection to prevent infinite loops in tree traversal or hashing algorithms. |
Memory constraints with a very large number of folders | Consider streaming or out-of-core algorithms if the entire folder structure cannot fit into memory at once. |
Folder names containing special characters or Unicode | Ensure the hashing function and string comparison operations handle these characters correctly to avoid incorrect deduplication. |
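To illustrate the stack-overflow concern above: the recursive encode_trie from the solution can be replaced with an explicit-stack post-order traversal so that the call depth no longer depends on how deep the folder tree is. This is a minimal sketch under the assumption that the Node class and the folder_map grouping from the solution above are available; encode_trie_iterative is a hypothetical helper name, not part of the original solution.

    def encode_trie_iterative(start_node, folder_map):
        # Post-order traversal with an explicit stack: every child is encoded
        # before its parent, without relying on Python's recursion limit
        encodings = {}
        stack = [(start_node, False)]
        while stack:
            node, children_done = stack.pop()
            if not node.children:
                # Empty folders encode to "#" and are never recorded as duplicates
                encodings[node] = "#"
                continue
            if not children_done:
                # Revisit this node after all of its children have been encoded
                stack.append((node, True))
                for child in node.children.values():
                    stack.append((child, False))
            else:
                parts = ["("]
                for child_name in sorted(node.children.keys()):
                    parts.append(child_name + encodings[node.children[child_name]])
                parts.append(")")
                encoded_string = "".join(parts)
                encodings[node] = encoded_string
                # Group nodes by structure, exactly like the recursive version
                folder_map.setdefault(encoded_string, []).append(node)
        return encodings

    # Usage mirroring the recursive version: encode each top-level folder
    #   folder_map = {}
    #   for child in root.children.values():
    #       encode_trie_iterative(child, folder_map)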