
Delete Duplicate Folders in System

Hard
Google
Topics:
Trees, Recursion, Arrays, Strings

You are given a 2D array paths, where paths[i] is an array representing an absolute path to the i-th folder in the file system. Due to a bug, there are many duplicate folders in the file system. Your task is to identify and remove these duplicate folders. Two folders are considered identical if they contain the same non-empty set of identical subfolders and the same underlying subfolder structure. The folders do not need to be at the root level to be considered identical. If two or more folders are identical, mark the folders as well as all of their subfolders for deletion. The file system only runs the deletion once, so any folders that become identical after the initial deletion are not deleted.

Return the 2D array ans containing the paths of the remaining folders after deleting all the marked folders. The paths may be returned in any order.

Example:

Input: paths = [["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]]
Output: [["d"],["d","a"]]

In this example, folders "/a" and "/c" (and their subfolders) are marked for deletion because they both contain an empty folder named "b".

Constraints:

  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 500
  • 1 <= paths[i][j].length <= 10
  • 1 <= sum(paths[i][j].length) <= 2 * 10^5
  • paths[i][j] consists of lowercase English letters.
  • No two paths lead to the same folder.
  • For any folder not at the root level, its parent folder will also be in the input.

Could you provide an efficient algorithm to solve this problem?

Solution


Clarifying Questions

When you are asked this question in a real interview, the prompt will often be ambiguous (especially at FAANG companies). Make sure to ask clarifying questions like these in that case:

  1. What data structure represents the folder structure? Is it a tree, a list of paths, or something else, and what are the data types of the folder names?
  2. How is a 'duplicate' folder defined? Does it mean folders with the same exact structure and files, or just the same names, regardless of the content?
  3. If multiple sets of duplicate folders exist, which ones should be deleted, and in what order should the deletions occur if the deletion of one impacts another?
  4. What is the expected output format? Should I return a list of deleted folder paths, the modified folder structure, or something else entirely?
  5. Are there any constraints on memory usage given a very large folder structure?

Brute Force Solution

Approach

We need to find and remove duplicate folders from a file system. The brute force way is to compare each folder against every other folder, one by one, to see whether their subfolder structures are identical.

Here's how the algorithm would work step-by-step:

  1. For each folder, write down its structure: the set of all paths that sit underneath it, expressed relative to that folder.
  2. Take the first folder in the system and compare its structure to the structure of every other folder.
  3. If you find another folder with exactly the same non-empty structure, mark both folders for deletion; their subfolders will be removed along with them.
  4. Move to the next folder and repeat the process, comparing it to all the remaining folders.
  5. Continue doing this for every folder in the system.
  6. Once every pair has been compared, go back and delete all the marked folders together with everything inside them.

Code Implementation

def delete_duplicate_folders_brute_force(paths):
    number_of_folders = len(paths)

    # Represent each folder by the set of relative paths of everything beneath it.
    # Two folders are identical exactly when these sets are equal and non-empty.
    def subtree_structure(folder_path):
        prefix_length = len(folder_path)
        return frozenset(
            tuple(path[prefix_length:])
            for path in paths
            if len(path) > prefix_length and path[:prefix_length] == folder_path
        )

    structures = [subtree_structure(path) for path in paths]
    marked_for_deletion = [False] * number_of_folders

    # Compare every pair of folders and mark both copies when their structures match.
    for first_index in range(number_of_folders):
        if not structures[first_index]:
            continue  # Folders with no subfolders are never considered identical.
        for second_index in range(first_index + 1, number_of_folders):
            if structures[first_index] == structures[second_index]:
                marked_for_deletion[first_index] = True
                marked_for_deletion[second_index] = True

    # A folder survives only if neither it nor any of its ancestors was marked.
    remaining_folders = []
    for path in paths:
        deleted = any(
            marked_for_deletion[other_index]
            and path[:len(paths[other_index])] == paths[other_index]
            for other_index in range(number_of_folders)
        )
        if not deleted:
            remaining_folders.append(path)

    return remaining_folders
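
A quick sanity check against the example from the problem statement, as a minimal sketch that assumes the function defined above:

example_paths = [["a"], ["c"], ["d"], ["a", "b"], ["c", "b"], ["d", "a"]]
print(delete_duplicate_folders_brute_force(example_paths))
# Expected output, in any order: [['d'], ['d', 'a']]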

Big(O) Analysis

Time Complexity
O(n² * s): Building the subtree structure for a single folder scans every path, so preparing all n structures costs O(n * s), where s is the total size of the input paths. The dominant cost is the pairwise comparison: the algorithm performs roughly n * (n - 1) / 2 comparisons, and comparing two subtree structures can, in the worst case, examine a large fraction of the input, costing up to O(s) per comparison. The overall worst case is therefore about O(n² * s), which is far too slow for the given constraints.
Space Complexity
O(s * m): Each path appears, as a relative path, in the structure set of every one of its ancestors, and a folder has at most m ancestors, where m is the maximum folder depth. Storing all of the precomputed structure sets therefore takes O(s * m) auxiliary space, where s is the total size of the input paths, plus O(n) for the boolean marking array.

Optimal Solution

Approach

The most efficient way to find and remove duplicate folders is to represent each folder's entire subtree as a canonical string. We can then use a dictionary keyed by these strings to find every group of identical folders, mark all of them in a single pass, and finally collect the paths of the folders that survive.

Here's how the algorithm would work step-by-step:

  1. First, build a tree (a trie) of the folder structure from the input paths, with one node per folder.
  2. Next, give every folder a code that captures its entire subtree: an empty folder gets an empty code, and a non-empty folder's code is built from its children's names and codes, sorted by name, so identical folder trees always produce the same code. (A concrete illustration of this encoding follows this list.)
  3. Compute these codes bottom-up and, for every non-empty code, record the list of folders that produced it.
  4. Any code produced by two or more folders means all of those folders are duplicates of one another, so mark every one of them for deletion.
  5. Because a code describes a whole subtree, marking a folder implicitly removes everything inside it; there is no need to mark its subfolders individually.
  6. Finally, walk the tree once more and collect the path of every folder that is not inside a marked subtree. This single pass performs the deletion exactly once, as the problem requires.
  7. This approach avoids comparing every folder to every other folder, making it much faster, especially when dealing with many folders.
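
To make the encoding concrete, here is a minimal standalone sketch (the helper below is illustrative only and not part of the implementation that follows). It encodes a nested-dictionary view of the example input and shows that /a and /c produce the same code:

def encode(children):
    # children maps each folder name to that folder's own children dictionary
    return "".join(
        name + "(" + encode(children[name]) + ")"
        for name in sorted(children)
    )

example = {"a": {"b": {}}, "c": {"b": {}}, "d": {"a": {}}}
print(encode(example["a"]))  # "b()"
print(encode(example["c"]))  # "b()"  -> same code as /a, so both are duplicates
print(encode(example["d"]))  # "a()"  -> unique, so /d survives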

Code Implementation

from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}
        self.marked_for_deletion = False

def delete_duplicate_folders(paths):
    root = Node()

    # Build a trie of the folder structure, one node per folder.
    for path in paths:
        current = root
        for folder_name in path:
            if folder_name not in current.children:
                current.children[folder_name] = Node()
            current = current.children[folder_name]

    # Group folders by the encoding of their non-empty subtree structure.
    nodes_by_encoding = defaultdict(list)

    def encode_subtree(node):
        # Empty folders encode to "" and are never treated as duplicates.
        if not node.children:
            return ""

        encoding = "".join(
            child_name + "(" + encode_subtree(node.children[child_name]) + ")"
            for child_name in sorted(node.children)
        )
        nodes_by_encoding[encoding].append(node)
        return encoding

    encode_subtree(root)

    # Every non-empty structure that occurs more than once marks all of its copies.
    for nodes_with_same_structure in nodes_by_encoding.values():
        if len(nodes_with_same_structure) > 1:
            for node in nodes_with_same_structure:
                node.marked_for_deletion = True

    # Collect the paths of every folder that is not inside a deleted subtree.
    remaining_paths = []

    def collect_remaining(node, current_path):
        for child_name, child_node in node.children.items():
            if child_node.marked_for_deletion:
                continue
            current_path.append(child_name)
            remaining_paths.append(list(current_path))
            collect_remaining(child_node, current_path)
            current_path.pop()

    collect_remaining(root, [])
    return remaining_paths
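
As before, here is a quick check against the example from the problem statement, assuming the function defined above:

example_paths = [["a"], ["c"], ["d"], ["a", "b"], ["c", "b"], ["d", "a"]]
print(delete_duplicate_folders(example_paths))
# Expected output, in any order: [['d'], ['d', 'a']]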

Big(O) Analysis

Time Complexity
O(n * m): Building the trie touches every folder name once, which is proportional to the total input size. The bottom-up encoding pass visits each trie node once, but a folder's name becomes part of the encoding of every one of its ancestors, so the combined length of all encodings is roughly bounded by n * m folder names, where n is the number of folders and m is the maximum folder depth. Assuming hashing an encoding costs time proportional to its length, and ignoring the cost of sorting sibling names, the encoding step dominates at O(n * m); the final collection pass is linear in the size of the output.
Space Complexity
O(n * m): The trie stores one node per folder, which is O(n). The dictionary of subtree encodings is the larger cost: as noted above, the combined length of the stored encodings can grow to roughly n * m characters in the worst case, where n is the number of folders and m is the maximum folder depth. The recursion stack adds another O(m).

Edge Cases

  • Null or empty input list of folder paths: Return an empty list to indicate there are no folders to process.
  • Many folders share identical non-empty subtrees: Every copy of the duplicated structure, together with all of its subfolders, must be marked and removed, not just the later occurrences.
  • Very long folder paths (lengths close to the maximum allowed): Ensure string operations and hashing are efficient enough to avoid performance bottlenecks.
  • Deeply nested folder structures (many levels of subdirectories): Consider the potential for stack overflow in recursive implementations; iterative solutions or a higher recursion limit may be preferred (see the note after this list).
  • Large number of folders with very similar directory structures: Choose the encoding and hashing scheme carefully to minimize collisions and maintain good performance.
  • Cycles in the folder structure (due to symbolic links, etc.): Implement cycle detection to prevent infinite loops during traversal or encoding.
  • Memory constraints with a very large number of folders: Consider streaming or out-of-core algorithms if the entire folder structure cannot fit into memory at once.
  • Folder names containing special characters or Unicode: Ensure the encoding and string comparison operations handle these characters correctly to avoid incorrect deduplication.
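
For the deep-nesting concern above, a common Python mitigation is sketched below. It is not required for the stated constraints, since the maximum depth of 500 fits comfortably within Python's default recursion limit of roughly 1000, but it is worth knowing about for deeper inputs:

import sys

# Raise the recursion limit before running a deeply recursive traversal.
# Only needed if the folder depth could approach Python's default limit.
sys.setrecursionlimit(10_000)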