
Design Log Storage System

Medium
Snap
Topics: Arrays, Strings

You are given several logs, where each log contains a unique id and a timestamp. The timestamp is a string in the format Year:Month:Day:Hour:Minute:Second, for example, 2017:01:01:23:59:59. All fields are zero-padded decimal numbers.

Design a log storage system to implement the following functions:

void put(int id, string timestamp): Stores the given log along with its timestamp.

int[] retrieve(string start, string end, string granularity): Returns the ids of all logs whose timestamps fall within the range from start to end. Both start and end have the same format as timestamp, and granularity specifies the time level at which the range is considered. For example, with start = 2017:01:01:23:59:59, end = 2017:01:02:23:59:59, and granularity = Day, we need all logs from January 1, 2017 through January 2, 2017, ignoring the hour, minute, and second fields.

There exist six values for granularity:

  • Year
  • Month
  • Day
  • Hour
  • Minute
  • Second


Example:

LogSystem logSystem = new LogSystem();
logSystem.put(1, "2017:01:01:23:59:59");
logSystem.put(2, "2017:01:01:22:59:59");
logSystem.put(3, "2016:01:01:00:00:00");
logSystem.retrieve("2016:01:01:01:01:01","2017:01:01:23:00:00","Year"); // returns [1,2,3], because all three logs fall within the years 2016 and 2017.
logSystem.retrieve("2016:01:01:01:01:01","2017:01:01:23:00:00","Hour"); // returns [1,2], because we need the logs within the range from 2016:01:01:01 to 2017:01:01:23, with both endpoints inclusive.
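
Because every field is zero-padded, the granularity rule can be implemented by simple string truncation: cut both the query bounds and each stored timestamp to a fixed prefix length, then compare lexicographically. A minimal sketch of this idea (the helper name truncate_to_granularity is just for illustration):

# Number of characters of "YYYY:MM:DD:HH:MM:SS" that matter per granularity.
GRANULARITY_PREFIX_LENGTH = {
    'Year': 4, 'Month': 7, 'Day': 10, 'Hour': 13, 'Minute': 16, 'Second': 19,
}

def truncate_to_granularity(timestamp, granularity):
    # Zero-padded fields mean truncated strings still sort chronologically.
    return timestamp[:GRANULARITY_PREFIX_LENGTH[granularity]]

assert truncate_to_granularity("2017:01:01:23:59:59", "Day") == "2017:01:01"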

Solution


Clarifying Questions

When you're asked this question in a real interview, it will often be ambiguous (especially at FAANG). In that case, make sure to ask questions like these:

  1. What is the expected scale of the log storage system, both in terms of the number of logs stored and the frequency of retrieve operations?
  2. If no logs fall within the specified timestamp range, should the `retrieve` function return an empty list or null?
  3. Are the start and end timestamps in the `retrieve` function inclusive or exclusive?
  4. Can I assume the input timestamps are always valid and in the specified "YYYY:MM:DD:HH:MM:SS" format, or do I need to handle potential formatting errors or invalid date/time values?
  5. How should I handle the edge case where the start timestamp is later than the end timestamp in the `retrieve` function?

Brute Force Solution

Approach

The most straightforward way to store and retrieve logs is to keep everything in one place and look through it whenever we need something. The brute force approach does exactly that – it avoids any fancy organization and searches all logs directly.

Here's how the algorithm would work step-by-step:

  1. Store all the log entries one after another in a simple list, like a diary.
  2. When someone asks for logs within a specific time range, go through each log entry from the very beginning.
  3. For each log entry, truncate its timestamp to the requested granularity and check whether it falls between the similarly truncated start and end values.
  4. If it does, add the log's id to the list of results we will return.
  5. Continue checking every single log entry until we reach the end of the list.
  6. Finally, return the collected list of log entries that match the requested time range.

Code Implementation

def retrieve_logs_brute_force(logs, start_time, end_time, granularity):
    # Characters of "YYYY:MM:DD:HH:MM:SS" to compare for each granularity.
    granularity_prefix_length = {
        'Year': 4, 'Month': 7, 'Day': 10,
        'Hour': 13, 'Minute': 16, 'Second': 19,
    }
    prefix_length = granularity_prefix_length[granularity]
    start_prefix = start_time[:prefix_length]
    end_prefix = end_time[:prefix_length]

    retrieved_ids = []

    # Scan every stored (id, timestamp) pair; zero-padded fields make the
    # truncated timestamps compare chronologically as plain strings.
    for log_id, timestamp in logs:
        if start_prefix <= timestamp[:prefix_length] <= end_prefix:
            retrieved_ids.append(log_id)

    return retrieved_ids
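
To sanity-check the function, here is how it behaves on the problem's example, assuming logs are stored as (id, timestamp) tuples as the function above expects:

logs = [
    (1, "2017:01:01:23:59:59"),
    (2, "2017:01:01:22:59:59"),
    (3, "2016:01:01:00:00:00"),
]
print(retrieve_logs_brute_force(
    logs, "2016:01:01:01:01:01", "2017:01:01:23:00:00", "Year"))  # [1, 2, 3]
print(retrieve_logs_brute_force(
    logs, "2016:01:01:01:01:01", "2017:01:01:23:00:00", "Hour"))  # [1, 2]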

Big(O) Analysis

Time Complexity

O(n) — The brute force approach iterates through each of the n stored log entries exactly once per retrieve call, performing a constant-time prefix comparison for each. The running time is therefore directly proportional to the number of log entries, giving O(n).

Space Complexity

O(m) — The only auxiliary space is the result list, which holds the ids of matching logs, where m is the number of logs that fall within the queried range. In the worst case every log matches and m equals n, but the space used is always bounded by the number of matches.

Optimal Solution

Approach

The best way to store and quickly search logs is to organize them by time. The idea is to break logs into time-based chunks so that a search can skip over large portions of logs that are irrelevant to the query.

Here's how the algorithm would work step-by-step:

  1. Divide the incoming logs into smaller, manageable pieces based on time. For example, each piece could contain logs from one hour, one day, or one week.
  2. For each of these time-based chunks, create a summary containing the earliest and latest timestamp within that chunk.
  3. Store these summaries in a special lookup table, like a list, making it easy to find a specific time period.
  4. When someone wants to search for logs within a particular time range, first check the lookup table to identify which chunks contain logs from that period.
  5. Only look inside those relevant chunks, skipping over all the others. This makes the search much faster.
  6. Within each relevant chunk, search for the specific logs that match the search criteria.

Code Implementation

class LogStorageSystem:

    def __init__(self, chunk_size_seconds=3600):
        self.chunk_size_seconds = chunk_size_seconds
        self.log_chunks = []
        self.log_metadata = []

    def store_log(self, timestamp, log_message):
        # Assumes integer-second timestamps arriving in non-decreasing order.
        if not self.log_metadata or timestamp > self.log_metadata[-1]['end_timestamp']:
            # Start a new chunk covering a fixed-size window from this timestamp.
            self.log_metadata.append({
                'start_timestamp': timestamp,
                'end_timestamp': timestamp + self.chunk_size_seconds - 1,
            })
            self.log_chunks.append([])

        # Append the log to the most recent chunk.
        self.log_chunks[-1].append({'timestamp': timestamp, 'message': log_message})

    def search_logs(self, start_timestamp, end_timestamp):
        relevant_chunks = self._find_relevant_chunks(start_timestamp, end_timestamp)
        results = []

        # Optimization: only scan chunks whose window overlaps the query range.
        for chunk_index in relevant_chunks:
            for log_entry in self.log_chunks[chunk_index]:
                if start_timestamp <= log_entry['timestamp'] <= end_timestamp:
                    results.append(log_entry)
        return results

    def _find_relevant_chunks(self, start_timestamp, end_timestamp):
        relevant_chunks = []

        # Linear scan for simplicity; chunks are chronological, so this
        # could be replaced with a binary search.
        for index, metadata in enumerate(self.log_metadata):
            if (start_timestamp <= metadata['end_timestamp']
                    and end_timestamp >= metadata['start_timestamp']):
                relevant_chunks.append(index)

        return relevant_chunks
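
A short usage sketch, assuming integer epoch-second timestamps that arrive in non-decreasing order (which the chunking logic above relies on):

system = LogStorageSystem(chunk_size_seconds=3600)
system.store_log(100, "service started")
system.store_log(1800, "heartbeat")
system.store_log(7200, "service stopped")  # falls into a second chunk

# Only the first chunk (covering seconds 100-3699) is scanned here.
print(system.search_logs(0, 2000))
# [{'timestamp': 100, 'message': 'service started'},
#  {'timestamp': 1800, 'message': 'heartbeat'}]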

Big(O) Analysis

Time Complexity

O(n) — A search is dominated by the scan of the lookup table of chunk summaries, where n is the number of chunks. This pass is linear (and could be reduced with binary search, since chunks are chronological), while the work inside each relevant chunk is assumed to be small, bounded by the chunk size or further optimized with indexing, and does not depend on the overall input size.

Space Complexity

O(N) — The auxiliary space is the lookup table of chunk summaries. In the worst case, highly fragmented logs could put each entry in its own chunk, so the table holds N entries, where N is the number of logs. Each summary stores only a start and end timestamp, a constant amount per chunk, so the space grows linearly with the number of log entries.

Edge Cases

  • Null or empty log id: Reject it and raise an error or exception to maintain system integrity.
  • Null or empty timestamp string: Reject it and raise an error or exception, since timestamps are essential for retrieval.
  • Invalid timestamp format (e.g., missing components, incorrect separators, non-numeric values): Validate timestamps rigorously and handle invalid formats gracefully with an error.
  • Start timestamp later than end timestamp: Return an empty list or throw an exception, since the range is invalid.
  • Unsupported granularity level (e.g., 'Millisecond'): Return an error or exception if an unsupported granularity is provided during retrieval.
  • Large number of logs causing memory issues: Consider a database or other persistent storage mechanism for scalability, to avoid in-memory limitations.
  • Timestamps with extreme values (e.g., year 0001 or 9999): Ensure the parsing and comparison logic handles timestamps near the minimum and maximum allowed dates.
  • Granularity of 'Second' with start and end differing by less than a second: Truncate timestamps to whole seconds so that the specified range remains inclusive.
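
Several of these cases reduce to validating the timestamp before storing or querying. A minimal validation sketch using Python's standard datetime parser, which rejects empty input, malformed strings, and impossible dates (months like 13, years outside 0001-9999); a stricter check could additionally enforce zero-padding:

from datetime import datetime

def is_valid_timestamp(timestamp):
    # Reject None/empty input and anything that is not a real date-time
    # in the "YYYY:MM:DD:HH:MM:SS" format.
    if not timestamp:
        return False
    try:
        datetime.strptime(timestamp, "%Y:%m:%d:%H:%M:%S")
        return True
    except ValueError:
        return False

assert is_valid_timestamp("2017:01:01:23:59:59")
assert not is_valid_timestamp("2017:13:01:00:00:00")  # month 13 is invalid
assert not is_valid_timestamp("")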