
Find Duplicate File in System

Medium

Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order. A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

For example, if the input is

paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]

then the output should be

[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Sample Answer
from collections import defaultdict

def findDuplicate(paths):
    """
    Finds duplicate files in a file system given a list of directory info.

    Args:
        paths (list[str]): A list of strings, where each string represents a directory and its files.

    Returns:
        list[list[str]]: A list of lists, where each inner list contains file paths of duplicate files.
    """

    content_map = defaultdict(list)
    for path_info in paths:
        parts = path_info.split()
        directory = parts[0]
        for i in range(1, len(parts)):
            file_info = parts[i]
            file_name, content = file_info.split('(')
            content = content[:-1]  # Remove the closing parenthesis
            full_path = directory + '/' + file_name
            content_map[content].append(full_path)

    result = []
    for content, file_paths in content_map.items():
        if len(file_paths) > 1:
            result.append(file_paths)

    return result


# Example Usage:
paths1 = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(findDuplicate(paths1))
# Output: [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]
# (the groups may appear in any order)

paths2 = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)"]
print(findDuplicate(paths2))
# Output: [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt']]
# (the groups may appear in any order)

Brute Force Solution:

A brute force approach would involve comparing every file with every other file. First, extract the file paths and contents from the input strings. Then, for each file, compare its content with the content of every other file; if the contents match, group the paths together. With M files this requires O(M^2) content comparisons, each of which can cost up to the length of the content, so it is far slower than the hash-map approach below. A sketch of this idea follows.
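
Below is a minimal sketch of that brute-force idea (the helper name findDuplicateBruteForce is just for illustration; it reuses the same parsing as the sample answer):

def findDuplicateBruteForce(paths):
    # Flatten the input into (full_path, content) pairs first.
    files = []
    for path_info in paths:
        directory, *file_infos = path_info.split()
        for file_info in file_infos:
            name, content = file_info.split('(')
            files.append((directory + '/' + name, content[:-1]))

    # Compare every file against every other file: O(M^2) content comparisons.
    result = []
    grouped = set()
    for i in range(len(files)):
        if i in grouped:
            continue
        group = [files[i][0]]
        for j in range(i + 1, len(files)):
            if j not in grouped and files[j][1] == files[i][1]:
                group.append(files[j][0])
                grouped.add(j)
        if len(group) > 1:
            result.append(group)
    return result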

Optimal Solution:

The optimal solution uses a hash map (dictionary) to store the file content as keys and a list of corresponding file paths as values. This significantly reduces the number of comparisons needed.

  1. Parse Input: Iterate through the input paths list. For each directory info string, split it into the directory path and file info.
  2. Extract File Info: For each file info, extract the file name and content.
  3. Populate Hash Map: Use the content as the key in the hash map. If the content already exists as a key, append the file path to the list of file paths for that content. If the content is new, create a new key-value pair in the hash map with the content as the key and a list containing the file path as the value.
  4. Identify Duplicates: Iterate through the hash map. For each key-value pair, if the list of file paths has more than one entry, it means there are duplicate files with the same content. Add the list of file paths to the result.
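
For paths1 from the example above, the hash map built in step 3 ends up looking like this (a sketch; keys follow insertion order):

content_map = {
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
}

Step 4 then keeps only the value lists with more than one path, which in this case is both of them.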

Big(O) Run-time Analysis:

  • Parsing Input: Iterating through the paths list takes O(N) time, where N is the number of directory info strings in the input. Splitting each string and extracting file names and content takes O(K) time on average, where K is the average length of the strings. Therefore, the overall time complexity for parsing the input is O(N * K).
  • Populating Hash Map: Inserting and retrieving values from a hash map takes O(1) time on average. Since we are doing this for each file, the time complexity is O(M), where M is the total number of files.
  • Identifying Duplicates: Iterating through the hash map takes O(P) time, where P is the number of unique file contents. Creating the result list takes O(M) time in the worst case.
  • Overall: The total work is O(N * K + M). Every file name and its content appear somewhere in the input, so M is bounded by the total input length and N * K dominates, making the overall runtime O(N * K), where N is the number of directory info strings and K is the average length of each string. In simpler terms, the runtime is linear in the total size of the input.

Big(O) Space Usage Analysis:

  • Hash Map: The hash map stores each distinct file content as a key and every full file path in one of its value lists. Counted in entries this is O(M), where M is the total number of files; counted in characters it is O(N * K), since each content string and each path is stored at most once.
  • Result List: The result list stores the groups of duplicate file paths. In the worst case every file belongs to some duplicate group, so it references all M file paths.
  • Overall: The overall space complexity is O(M) stored paths, or O(N * K) measured in characters, because the file contents and paths live in the hash map and the result list only references paths that are already counted there.

Edge Cases and Handling:

  1. Empty Input: If the input paths list is empty, return an empty list. This is handled correctly as the loops won't execute.
  2. No Duplicate Files: If there are no duplicate files, the result list will be empty. This is handled correctly by the logic.
  3. Empty File Content: If a file has empty content (e.g., "file.txt()"), the code still works correctly: the content becomes an empty string, and all files with empty content are grouped together as duplicates of one another.
  4. Different Directories with Same File Names: The code correctly handles files with the same names in different directories because it stores the full path (directory + file name).
  5. Large Number of Files: The hash map approach is efficient for a large number of files, but memory usage could become a concern if there are a very large number of extremely long file paths. In this case, consider external storage or processing in batches.
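
The quick checks below exercise edge cases 1-3 against the sample implementation above:

print(findDuplicate([]))                      # [] -> empty input, nothing to group
print(findDuplicate(["root/a 1.txt(abcd)"]))  # [] -> a single file has no duplicates
print(findDuplicate(["root a.txt() b.txt()"]))
# [['root/a.txt', 'root/b.txt']] -> files with empty content are grouped together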

Follow Up Questions:

  1. Searching Files in a Real File System (DFS vs. BFS):

    • DFS (Depth-First Search): DFS is usually the natural fit because a file system is organized as a hierarchy (a tree). It explores each branch as deeply as possible, which matches how files live inside nested subdirectories, and its memory use is proportional to the maximum directory depth: it only needs the current path plus the siblings still waiting on the stack.
    • BFS (Breadth-First Search): BFS explores all directories at one level before going deeper. It visits the same directories in total, but its queue must hold every directory at the current level, which on wide file systems can be far larger than the depth that DFS keeps on its stack. BFS is still attractive when the target is likely to be close to the root.
  2. Large File Content (GB Level):

    • Instead of reading the entire file content into memory, compute a hash (e.g., MD5, SHA-256) of the file. Compare the hashes instead of the entire content. This significantly reduces memory usage.
    • If two hashes collide (unlikely but possible), compare the full files, or a portion of them, to confirm the match. Stronger hashes such as SHA-256 make collisions vanishingly unlikely.
  3. Reading the File in 1 KB Chunks:

    • Read the file in 1 KB chunks and update a hash incrementally as each chunk arrives; cryptographic hashes such as SHA-256 support this streaming usage (a rolling hash like Rabin-Karp is another option). This lets you fingerprint and compare files without loading the entire content into memory; see the sketch after this list.
    • As before, if two hashes match, fall back to a full (or partial) file comparison to rule out a collision.
  4. Time and Memory Complexity of Modified Solution:

    • Time Complexity: Hashing still has to touch every byte, so the total time remains O(N * K), where N is the number of files and K is the average file size; reading in chunks of size B (1 KB here) means roughly K/B read calls per file but does not change the asymptotic cost.
    • Memory Complexity: Per file, memory is dominated by one chunk plus the hash state and digest, which is a small fixed size (16 bytes for an MD5 digest, 32 bytes for SHA-256). That gives O(B + H) working memory per file, where H is the digest size, plus O(N) map entries to group files by their hashes.
    • Optimization: Use more efficient hashing algorithms. Use multi-threading to read and hash files concurrently.
  5. Ensuring Duplicates are Not False Positives:

    • Cryptographic Hashes: Use strong cryptographic hashes (SHA-256 or better) to minimize the risk of collisions.
    • Full File Comparison: After identifying potential duplicates by hash, perform a full byte-by-byte comparison to confirm that the files really are identical (the sketch below does this as its final check).
    • Metadata Comparison: File size is a useful cheap pre-filter: files with different sizes can never be duplicates, so compare sizes before hashing. Timestamps, on the other hand, neither prove nor disprove identical content, so they should not be relied on to confirm a match.
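
Putting follow-ups 1, 3, and 5 together, here is a minimal sketch of how the chunked-hashing approach could look against a real file system. It is illustrative rather than production code: the function names (file_digest, find_duplicates_on_disk) are invented for this example, it assumes local files reachable with os.walk, and it relies only on Python's standard hashlib, filecmp, and os modules.

import filecmp
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 1024  # read in 1 KB chunks, per follow-up 3

def file_digest(path):
    """Hash a file incrementally so the whole content never sits in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def find_duplicates_on_disk(root):
    """Group files under root by (size, SHA-256 digest), then confirm byte-by-byte."""
    by_key = defaultdict(list)
    # os.walk performs a depth-first, top-down traversal of the directory tree.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            by_key[(os.path.getsize(full), file_digest(full))].append(full)

    groups = []
    for paths in by_key.values():
        if len(paths) > 1:
            # Guard against hash collisions with a full comparison against the first file.
            first = paths[0]
            confirmed = [first] + [p for p in paths[1:]
                                   if filecmp.cmp(first, p, shallow=False)]
            if len(confirmed) > 1:
                groups.append(confirmed)
    return groups

Grouping by (size, digest) keeps files of different sizes out of the same bucket, and the final filecmp.cmp pass removes any remaining false positives. A further refinement (not shown) would be to group by size first and only hash files whose size occurs more than once.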