Taro Logo

Design a File Sharing System

Medium
Twitch logo
Twitch
0 views
Topics:
ArraysGreedy Algorithms

We will use a file-sharing system to share a very large file consisting of m small chunks with IDs from 1 to m. When you join the system, the system should assign you a unique ID.

You are asked to implement the FileSharing class:

  • FileSharing(int m) Initializes the object with a file of m chunks.
  • int join(int[] ownedChunks): Adds a user to the system with an initial list of chunks ownedChunks. The system should assign a unique ID to the user and returns it.
  • void leave(int userID): Removes the user with ID userID from the system and makes the ID available again.
  • int[] request(int userID, int chunkID): Some user with ID userID is requesting a chunk with ID chunkID. Return a list of user IDs that own the given chunk.

Example:

Input:
["FileSharing", "join", "join", "request", "request", "leave", "request", "request", "join", "request", "leave"]
[[4], [[1,2]], [[2,3]], [1, 2], [2, 1], [1], [2, 2], [1, 2], [[1,2]], [2, 1], [2]]
Output:
[null, 1, 2, [1,2], [2], null, [2], [1,2], 3, [2,3], null]
Explanation:
FileSharing fileSharing = new FileSharing(4);
fileSharing.join([1, 2]);  // return 1
fileSharing.join([2, 3]);  // return 2
fileSharing.request(1, 2); // return [1, 2]
fileSharing.request(2, 1); // return [2]
fileSharing.leave(1);        // return null
fileSharing.request(2, 2); // return [2]
fileSharing.request(1, 2); // return [1, 2]
fileSharing.join([1, 2]);  // return 3
fileSharing.request(2, 1); // return [2, 3]
fileSharing.leave(2);        // return null

Constraints:

  • 1 <= m, chunkID <= m
  • 1 <= userID <= 105
  • 0 <= ownedChunks.length <= m
  • 1 <= ownedChunks[i] <= m
  • All the values of ownedChunks are unique.
  • Each call to leave will have a matching call to join.
  • The number of calls to all functions will not exceed 104.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the constraints on the number of users (n), the number of parts in a file, and the user IDs and file IDs?
  2. If a user tries to download a file they already possess all parts of, should I redownload, ignore the request, or return an error?
  3. If the `whoHas` method is called for a file ID that doesn't exist, what should I return (e.g., an empty list, null, or throw an exception)?
  4. Is it possible for a user to upload a file with overlapping parts, and if so, how should I handle that?
  5. How should I handle concurrent requests from multiple users (e.g., two users trying to upload the same file at the same time)?

Brute Force Solution

Approach

For designing a basic file sharing system, we can think of a brute force approach as trying every possible combination of servers and storage locations. We'll check if each combination meets our requirements, like speed or cost. If something isn't working well, we just try the next possibility until we find a fit.

Here's how the algorithm would work step-by-step:

  1. Start by listing all the available servers we can use to store files.
  2. Then, for each server, consider every possible location where the files could be stored.
  3. For each server and storage location combination, pretend that's where our files are stored.
  4. Check if this arrangement meets our needs. Does it provide fast access to the files? Is it affordable?
  5. If the arrangement doesn't work well enough, discard it and move on to the next combination of server and location.
  6. Keep doing this until we have looked at every possible server and location.
  7. Finally, from all the combinations that meet our requirements, pick the one that's the best overall based on what matters most to us (like speed, cost, or reliability).

Code Implementation

def find_best_file_storage_brute_force(servers, storage_locations, file_access_speed_requirement, cost_requirement, reliability_requirement):
    best_server = None
    best_storage_location = None
    best_overall_score = float('-inf')

    for server in servers:
        for storage_location in storage_locations:

            # We have a specific configuration of server and location
            current_file_access_speed = evaluate_file_access_speed(server, storage_location)
            current_cost = evaluate_cost(server, storage_location)
            current_reliability = evaluate_reliability(server, storage_location)

            # See if the configuration meets file access speed criteria
            if current_file_access_speed >= file_access_speed_requirement:

                # We check if cost is acceptable
                if current_cost <= cost_requirement:

                    # Is the system reliable enough for the requirements?
                    if current_reliability >= reliability_requirement:

                        # Calculating the score to choose the best config
                        overall_score = calculate_overall_score(current_file_access_speed, current_cost, current_reliability)

                        if overall_score > best_overall_score:
                            best_overall_score = overall_score
                            best_server = server
                            best_storage_location = storage_location

    return best_server, best_storage_location

def evaluate_file_access_speed(server, storage_location):
    return server["speed"] + storage_location["speed"]

def evaluate_cost(server, storage_location):
    return server["cost"] + storage_location["cost"]

def evaluate_reliability(server, storage_location):
    return server["reliability"] * storage_location["reliability"]

def calculate_overall_score(file_access_speed, cost, reliability):
    return file_access_speed * reliability / cost

Big(O) Analysis

Time Complexity
O(s * l * c)The algorithm iterates through all possible server (s) and storage location (l) combinations. For each server and location combination, it checks (c) if the arrangement meets the requirements. Therefore, the time complexity is proportional to the product of the number of servers, the number of storage locations, and the complexity of the checking process. Thus, the time complexity is O(s * l * c).
Space Complexity
O(1)The algorithm iterates through server and storage location combinations without persistently storing these combinations or generating large auxiliary data structures. It checks each combination and discards it if unsuitable, retaining at most a single 'best' option based on criteria like speed and cost. Therefore, the memory footprint remains constant, independent of the number of servers or storage locations (N). The algorithm's space complexity is O(1).

Optimal Solution

Approach

The file sharing system needs to handle uploading, downloading, and finding files efficiently. To do this, we'll use a combination of a central server to manage files and a way to split large files into smaller pieces for faster transfers.

Here's how the algorithm would work step-by-step:

  1. First, we need a central hub or server to keep track of all the files, who owns them, and where they are stored.
  2. When someone uploads a file, the server stores the file and keeps track of who uploaded it.
  3. For large files, we will split the file into smaller, more manageable chunks. The server will then keep track of all the pieces that make up the whole file.
  4. When someone wants to download a file, the server locates all the chunks of the file. It then sends the chunks to the user.
  5. The user's computer puts all the pieces back together to reconstruct the original file.
  6. To make things faster, we can have multiple copies of the chunks stored in different locations. This way, if one server is busy, we can get the chunks from another server.
  7. We also need a way for users to search for files. The server should keep a list of files and allow users to search by name, description, or other relevant information.
  8. Finally, to ensure that only authorized users can access certain files, we need to implement a system to verify who is trying to download or share a file.

Code Implementation

import hashlib

class FileSharingSystem:
    def __init__(self):
        self.file_metadata = {}
        self.file_chunks = {}
        self.chunk_locations = {}
        self.users = {}

    def register_user(self, username, password):
        hashed_password = hashlib.sha256(password.encode()).hexdigest()
        self.users[username] = hashed_password

    def login_user(self, username, password):
        hashed_password = hashlib.sha256(password.encode()).hexdigest()
        return username in self.users and self.users[username] == hashed_password

    def upload_file(self, username, file_name, file_data, chunk_size=1024):
        # Split file into chunks for efficient transfer
        file_hash = hashlib.sha256(file_data.encode()).hexdigest()
        self.file_metadata[file_hash] = {
            'file_name': file_name,
            'owner': username,
            'file_size': len(file_data),
        }

        chunks = [file_data[i:i + chunk_size] for i in range(0, len(file_data), chunk_size)]
        self.file_chunks[file_hash] = chunks
        self.chunk_locations[file_hash] = {i: ['server1', 'server2'] for i in range(len(chunks))}

    def download_file(self, file_hash, username):
        if file_hash not in self.file_metadata:
            return None, "File not found"

        if self.file_metadata[file_hash]['owner'] != username:
            return None, "Unauthorized"

        if file_hash not in self.file_chunks:
            return None, "File chunks not found"

        # Assemble chunks to reconstruct the original file
        file_data = ''.join(self.file_chunks[file_hash])
        return file_data, None

    def search_files(self, query):
        results = {}
        for file_hash, metadata in self.file_metadata.items():
            if query.lower() in metadata['file_name'].lower():
                results[file_hash] = metadata
        return results

    def get_file_metadata(self, file_hash):
        if file_hash in self.file_metadata:
            return self.file_metadata[file_hash]
        else:
            return None

    def add_chunk_location(self, file_hash, chunk_index, server_name):
        # Add chunk location for redundancy and availability
        if file_hash in self.chunk_locations and chunk_index in self.chunk_locations[file_hash]:
            self.chunk_locations[file_hash][chunk_index].append(server_name)
        else:
            return None

Big(O) Analysis

Time Complexity
O(n)The file sharing system primarily involves operations such as uploading, downloading, and searching. Uploading and downloading files of size n would take O(n) time, as each chunk needs to be read/written. Searching for a file, assuming an efficient indexing mechanism like a hash table, can be done in O(1) on average, or O(n) in the worst case (scanning all filenames). Combining these, the dominant operation scales linearly with the size of the data (n) involved in file transfers, leading to an overall time complexity of O(n).
Space Complexity
O(N)The central server stores information about all files, including metadata like owner, file name, and chunk locations. For N files, this requires storing N entries of file metadata. Furthermore, large files are split into chunks, and the server must keep track of these chunks, leading to additional storage proportional to the total number of chunks, which can be considered part of the overall file data. In the worst case where each file is considered, the auxiliary space is dependent on the number of files stored, leading to a space complexity of O(N), where N represents the total number of files and associated chunk metadata.

Edge Cases

CaseHow to Handle
n (number of users) is zero or negative.Throw an IllegalArgumentException or similar error, as the system cannot function with zero or negative users.
userId in join() is already registered.Return immediately as user is already registered; alternatively, throw an exception if duplicate registration is considered an error.
userId in leave(), upload(), or download() is not registered.Return immediately as the action cannot be performed for an unregistered user, or throw an exception indicating invalid user.
Empty parts list in upload().Treat as an empty file upload; create a new file ID, but store it as having no parts or return immediately based on requirements.
fileId in download() or whoHas() does not exist.Return an empty list in whoHas() or null/default value in download() to indicate the file doesn't exist, or throw an exception.
User downloads a file and already has all parts of it.Still record the download event to track popularity or activity, but no new parts need to be added to the user's inventory.
A very large number of parts in upload(), exceeding memory constraints.Implement pagination or chunking for file parts storage to handle arbitrarily large files and avoid memory overflow.
Large number of users downloading the same file concurrently.Implement concurrency control (e.g., using locks) to ensure data consistency when updating which users have file parts, or use an atomic data structure.