Given a list paths of directory info, including the directory path and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order. A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string with the following format:

"directory_path/file_name.txt"

For example, if the input is paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"], then the output should be [["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"], ["root/a/1.txt", "root/c/3.txt"]].
from collections import defaultdict

def findDuplicate(paths):
    """
    Finds duplicate files in a file system given a list of directory info.

    Args:
        paths (list[str]): A list of strings, where each string represents a directory and its files.

    Returns:
        list[list[str]]: A list of lists, where each inner list contains file paths of duplicate files.
    """
    content_map = defaultdict(list)

    for path_info in paths:
        parts = path_info.split()
        directory = parts[0]
        for i in range(1, len(parts)):
            file_info = parts[i]
            file_name, content = file_info.split('(')
            content = content[:-1]  # Remove the closing parenthesis
            full_path = directory + '/' + file_name
            content_map[content].append(full_path)

    result = []
    for content, file_paths in content_map.items():
        if len(file_paths) > 1:
            result.append(file_paths)
    return result
# Example Usage:
paths1 = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(findDuplicate(paths1))
# Prints [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]; any group order is accepted.
paths2 = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)"]
print(findDuplicate(paths2))
# Prints [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt']]; any group order is accepted.
A brute force approach would compare every file with every other file. First, extract the file paths and contents from the input strings. Then, for each file, compare its content with the content of every other file, and group matching files together. With F total files this requires O(F^2) content comparisons, which is highly inefficient.
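As a point of reference, here is a minimal sketch of that brute-force idea; the helper name find_duplicates_brute_force is illustrative and not part of the solution above:

def find_duplicates_brute_force(paths):
    # Parse every file once into (full_path, content) pairs.
    files = []
    for path_info in paths:
        parts = path_info.split()
        directory = parts[0]
        for file_info in parts[1:]:
            file_name, content = file_info.split('(')
            files.append((directory + '/' + file_name, content[:-1]))

    # Compare each file's content against every later file's content.
    groups = []
    used = set()
    for i in range(len(files)):
        if i in used:
            continue
        group = [files[i][0]]
        for j in range(i + 1, len(files)):
            if j not in used and files[j][1] == files[i][1]:
                group.append(files[j][0])
                used.add(j)
        if len(group) > 1:
            groups.append(group)
    return groups

Every file's content is compared against every other file's, which is exactly the quadratic work the hash-map approach avoids.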
The optimal solution uses a hash map (dictionary) with file content as keys and lists of corresponding file paths as values. Each file is placed into its group with a single dictionary lookup, eliminating the pairwise comparisons entirely.
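For the sample input paths1 above, the intermediate map built by findDuplicate looks like this (shown as a plain dict for illustration; the function itself uses a defaultdict):

content_map_for_paths1 = {
    'abcd': ['root/a/1.txt', 'root/c/3.txt'],
    'efgh': ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt'],
}
# Only groups containing more than one path (both groups here) are copied
# into the final result.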
Algorithm:
1. Iterate through the paths list. For each directory info string, split it on whitespace into the directory path and the file info tokens.
2. For each file token, split on '(' to separate the file name from its content, and drop the trailing ')'.
3. Build the full path as directory + '/' + file_name and append it to content_map[content].
4. After processing every entry, collect each content group that contains more than one path into the result.

Complexity: Iterating through the paths list takes O(N) time, where N is the number of directory info strings in the input. Splitting each string and extracting file names and content takes O(K) time on average, where K is the average length of the strings. Therefore, the overall time complexity for parsing the input is O(N * K). Collecting the result is linear in the number of files, and the hash map stores every content string and full path, so the space complexity is also O(N * K).

Edge cases: If the paths list is empty, return an empty list. This is handled correctly as the loops won't execute.

Follow-up topics (the chunked-hashing sketch after this list addresses the last three):
- Searching Files in a Real File System (DFS vs. BFS)
- Large File Content (GB Level)
- Reading File by 1 KB Chunks
- Time and Memory Complexity of Modified Solution
- Ensuring Duplicates are Not False Positives
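For the file-reading follow-ups, here is a minimal sketch of the chunked-hashing idea, assuming real files on disk; the helper name file_fingerprint, the 1 KB chunk size, and the choice of hashlib.sha256 are illustrative assumptions, not part of the problem statement:

import hashlib

def file_fingerprint(path, chunk_size=1024):
    """Hash a file in 1 KB chunks so a multi-GB file never has to fit in memory.

    Files with different fingerprints are definitely different; files with
    equal fingerprints should still be compared byte-by-byte to rule out
    false positives from hash collisions.
    """
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

Grouping real files by (size, fingerprint) instead of raw content means each file is read once, so time is proportional to the total bytes on disk, while memory stays at O(number of files) for the map of fingerprints to paths.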