
Find Duplicate File in System

Medium
Amazon
Topics: Arrays, Strings

Given a list paths of directory info, including the directory path and all the files with contents in that directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order. A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format: "root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)". It means there are n files (f1.txt, f2.txt, ..., fn.txt) with content (f1_content, f2_content, ..., fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, the directory is just the root directory.

The output is a list of groups of duplicate file paths. Each group contains all the file paths of the files that have the same content. A file path is a string with the following format: "directory_path/file_name.txt".

Example 1: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]. Output: [["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"], ["root/a/1.txt", "root/c/3.txt"]]

Example 2: paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)"]. Output: [["root/a/2.txt", "root/c/d/4.txt"], ["root/a/1.txt", "root/c/3.txt"]]

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the maximum depth of the file system directory structure?
  2. Can a file name or content be empty?
  3. What should the output be if no duplicate files are found?
  4. If multiple files have the exact same content, should I include all their paths in the corresponding list in the output?
  5. Are the file paths absolute or relative?

Brute Force Solution

Approach

The goal is to find files with the same content in a file system. The brute force method involves comparing every file to every other file to see if they are identical.

Here's how the algorithm would work step-by-step:

  1. Take the first file.
  2. Compare its content with the content of every other file in the system.
  3. If you find a file with the exact same content, mark them as duplicates.
  4. Repeat this process for the second file, then the third, and so on, making sure not to compare a file to itself or re-check already identified duplicate pairs.
  5. Continue until every file has been compared to all other files.
  6. Finally, gather all the groups of files that were marked as having the same content. These are your duplicate file groups.

Code Implementation

def find_duplicate_files_brute_force(paths):
    # Parse every "directory file(content)" string into (full_path, content) pairs.
    all_files = []
    for path_string in paths:
        parts = path_string.split()
        root_directory = parts[0]
        for file_info in parts[1:]:
            file_name, content_with_parenthesis = file_info.split('(')
            file_content = content_with_parenthesis[:-1]  # Drop the trailing ')'
            all_files.append((root_directory + '/' + file_name, file_content))

    duplicate_files = []
    already_grouped = [False] * len(all_files)

    # Compare each file's content against every later file's content.
    for i in range(len(all_files)):
        if already_grouped[i]:
            # This file was already placed in an earlier duplicate group.
            continue
        duplicates_found = [all_files[i][0]]
        for j in range(i + 1, len(all_files)):
            # Compare contents, not names, to find duplicates.
            if all_files[i][1] == all_files[j][1]:
                duplicates_found.append(all_files[j][0])
                already_grouped[j] = True

        # Keep the group only if at least two files share this content.
        if len(duplicates_found) > 1:
            duplicate_files.append(duplicates_found)

    return duplicate_files
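
To sanity check the brute force version, here is a small usage sketch (not part of the original solution) that runs it on Example 2 from the prompt; the groups may come back in any order, which the problem allows:

example_paths_brute = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
]
# Expected groups (in any order):
# [["root/a/1.txt", "root/c/3.txt"], ["root/a/2.txt", "root/c/d/4.txt"]]
print(find_duplicate_files_brute_force(example_paths_brute))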

Big(O) Analysis

Time Complexity
O(n² * m). The input size n is the number of files in the system. The algorithm iterates through each file (outer loop) and compares it with every later file (inner loop), which is roughly n * (n - 1) / 2 comparisons. Each comparison reads the contents of the two files, which takes O(m) time, where m is the average content length. Therefore, the overall time complexity is O(n² * m).
Space Complexity
O(n * m). The implementation first parses every directory string and stores each file's full path together with its content in the all_files list, so auxiliary space grows with the number of files n and the average content length m. The output list of duplicate groups holds at most n paths, which is dominated by the O(n * m) parsing storage.

Optimal Solution

Approach

The problem requires finding files with identical content across different directories. The key idea is to build a lookup keyed by file content, and then group together the file paths whose content matches.

Here's how the algorithm would work step-by-step:

  1. Go through each directory string provided, one at a time.
  2. For each file found, extract its content.
  3. Use the content as a key in a hash map (dictionary), appending the file's complete path to the list stored under that key.
  4. After processing all the directories, identify the keys that have more than one file path associated with them. These are the duplicate files.
  5. Finally, return those lists of file paths as the groups in the answer.

Code Implementation

def find_duplicate_file(paths):
    content_to_paths = {}
    for path_string in paths:
        parts = path_string.split()
        root_directory = parts[0]
        for file_info in parts[1:]:
            file_name, content_with_parentheses = file_info.split('(')
            content = content_with_parentheses[:-1]
            full_file_path = root_directory + '/' + file_name

            # Use content as key, accumulating file paths.
            if content in content_to_paths:
                content_to_paths[content].append(full_file_path)
            else:
                content_to_paths[content] = [full_file_path]

    result = []
    # Identify duplicates based on content having multiple paths.
    for content, file_paths in content_to_paths.items():
        if len(file_paths) > 1:
            result.append(file_paths)

    return result
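
As a quick sanity check, the hash-map version can be exercised on Example 1 from the prompt (a minimal usage sketch; again, any ordering of the groups is acceptable):

example_paths = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)",
]
# Expected groups (in any order):
# [["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"], ["root/a/1.txt", "root/c/3.txt"]]
print(find_duplicate_file(example_paths))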

Big(O) Analysis

Time Complexity
O(N * M). Let N be the total number of files across all directories and M be the average length of a file's content. Iterating through each file takes O(N) time, and extracting a file's content reads up to M characters, so building the content-keyed hash map takes O(N * M) time overall. Identifying and grouping the duplicate file paths then takes O(N) in the worst case, since the map contains at most N file paths. Because O(N * M) dominates O(N), the overall time complexity is O(N * M).
Space Complexity
O(N * M). The hash map stores every distinct content string as a key (up to N keys of average length M) along with all N file paths as values. In the worst case, where every file's content is unique, the auxiliary space is proportional to the total size of the parsed input, so the space complexity is O(N * M).

Edge Cases

  1. Empty input array of file paths: Return an empty list, since there are no files to compare.
  2. File content is an empty string: Treat the empty string as valid content and include it in the content hash map.
  3. A file path contains leading or trailing spaces: Trim the path so comparisons and matching stay accurate.
  4. Very large number of files, potentially exceeding memory: Consider an external sorting approach or a disk-backed hash map if memory becomes a bottleneck.
  5. Files with the same content but different timestamps/metadata: The content-based comparison inherently ignores timestamps and metadata, focusing solely on content.
  6. File names or paths contain special characters: Parse carefully (or use a robust parsing routine) so unusual file name formats are handled correctly.
  7. Paths contain relative components like './' or '../': Normalize paths by resolving these components before processing, so equivalent paths compare equal (see the sketch below).
  8. File content is an extremely long string: Key the map on a strong hash of the content (e.g., SHA-256) to keep memory bounded, and verify matches byte-for-byte if hash collisions are a concern (see the sketch below).
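
Two of the cases above lend themselves to small helpers. Below is a minimal sketch (the helper names normalize_path and content_key are hypothetical, using only Python's standard library) of how relative path components could be resolved and how very long content could be hashed before being used as a dictionary key:

import hashlib
import posixpath

def normalize_path(raw_path):
    # Collapse './' and '../' segments so equivalent paths compare equal.
    return posixpath.normpath(raw_path)

def content_key(file_content):
    # Key the map on a SHA-256 digest so extremely long contents do not bloat
    # the dictionary; collisions are astronomically unlikely, but a cautious
    # version could still compare matching files byte-for-byte.
    return hashlib.sha256(file_content.encode('utf-8')).hexdigest()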