
Find Median from Data Stream

Hard
Citadel
Topics:
Arrays, Greedy Algorithms

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

  • For example, for arr = [2,3,4], the median is 3.
  • For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Implement the MedianFinder class:

  • MedianFinder() initializes the MedianFinder object.
  • void addNum(int num) adds the integer num from the data stream to the data structure.
  • double findMedian() returns the median of all elements so far. Answers within 10⁻⁵ of the actual answer will be accepted.

Example:

Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]

Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr = [1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

  • -10⁵ <= num <= 10⁵
  • There will be at least one element in the data structure before calling findMedian.
  • At most 5 * 10⁴ calls will be made to addNum and findMedian.

Follow up:

  • If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
  • If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

Solution


Clarifying Questions

When you get asked this question in a real interview, it will often be left ambiguous (especially at FAANG-level companies). Make sure to ask clarifying questions like these:

  1. What is the expected range of the numbers in the data stream (e.g., integers, floats, minimum and maximum values)?
  2. How many numbers are expected to be added to the data stream in total (i.e., what is the scale of the problem)?
  3. If the number of elements in the data stream is even, should the median be the average of the two middle numbers, and if so, should I return the result as a float or integer?
  4. Are there any memory constraints I should be aware of?
  5. How frequently will the `findMedian` method be called relative to the `addNum` method?

Brute Force Solution

Approach

The goal is to continuously find the middle value from a flowing sequence of numbers. The basic brute force strategy involves keeping track of all the numbers we've seen so far and re-calculating the median every time a new number arrives.

Here's how the algorithm would work step-by-step:

  1. Whenever a new number comes in, simply add it to our existing collection of numbers.
  2. Then, sort the entire collection of numbers from smallest to largest.
  3. If the total count of numbers is odd, the median is simply the number in the very middle of the sorted collection.
  4. If the total count of numbers is even, the median is the average of the two numbers in the middle of the sorted collection.

Code Implementation

class MedianFinder:

    def __init__(self):
        self.data_stream = []

    def addNum(self, num):
        # Simply add the new number to the data stream
        self.data_stream.append(num)

    def findMedian(self):
        # Need to sort the stream to find the median
        sorted_data_stream = sorted(self.data_stream)

        number_of_elements = len(sorted_data_stream)

        if number_of_elements % 2 != 0:
            # Odd number of elements means median is middle element

            middle_index = number_of_elements // 2
            return float(sorted_data_stream[middle_index])

        else:
            # Even number of elements means median is average of two middle elements
            middle_index_one = number_of_elements // 2 - 1
            middle_index_two = number_of_elements // 2
            
            # Need to cast to float to maintain precision
            return (sorted_data_stream[middle_index_one] + sorted_data_stream[middle_index_two]) / 2.0
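
As a quick check, the class above can be exercised with the example sequence from the problem statement (expected outputs shown as comments):

median_finder = MedianFinder()
median_finder.addNum(1)            # stream: [1]
median_finder.addNum(2)            # stream: [1, 2]
print(median_finder.findMedian())  # 1.5
median_finder.addNum(3)            # stream: [1, 2, 3]
print(median_finder.findMedian())  # 2.0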

Big-O Analysis

Time Complexity
O(n log n) per findMedian call. Adding a number is just a list append, which is O(1), but every findMedian call sorts the entire collection of n numbers seen so far at a cost of O(n log n); reading the one or two middle elements of the sorted list is then O(1). Sorting is the bottleneck, and if the median is queried after each of the n insertions, the total work grows to O(n² log n).
Space Complexity
O(N). The algorithm stores all incoming numbers in a collection. If N is the number of elements added to the data stream, then the algorithm uses a data structure (like a list or an array) to hold these N elements. Sorting this collection in place does not change the space complexity. Therefore, the space complexity is directly proportional to the number of elements in the data stream, N.

Optimal Solution

Approach

We can find the median of a constantly updating set of numbers efficiently by using two 'piles' that represent the lower and upper halves of the data. By keeping these piles balanced in size and order, the median can be quickly accessed from the top of the piles.

Here's how the algorithm would work step-by-step:

  1. Maintain two piles: one to hold the smaller half of the numbers, and another to hold the larger half.
  2. Make sure that all the numbers in the 'smaller numbers' pile are less than or equal to all the numbers in the 'larger numbers' pile.
  3. When a new number comes in, decide which pile to put it in based on whether it's larger or smaller than the current median. If it's smaller, put it in the 'smaller numbers' pile; otherwise, put it in the 'larger numbers' pile.
  4. After adding the number, check if the sizes of the two piles are unbalanced. Ideally, they should have the same number of elements, or the 'smaller numbers' pile can have one extra element if there's an odd number of total elements.
  5. If the piles are unbalanced, move the largest number from the 'smaller numbers' pile to the 'larger numbers' pile, or move the smallest number from the 'larger numbers' pile to the 'smaller numbers' pile to rebalance them.
  6. The median is then either the top element of the 'smaller numbers' pile (if the piles have an odd total size) or the average of the top elements of the two piles (if the piles have an even total size).

Code Implementation

import heapq

class MedianFinder:

    def __init__(self):
        self.smaller_numbers = [] # Max-heap for smaller half
        self.larger_numbers = [] # Min-heap for larger half

    def addNum(self, number):
        # Add to smaller or larger half based on current median
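        # heapq only provides a min-heap, so smaller_numbers stores negated values to act as a max-heap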
        if not self.smaller_numbers or number <= -self.smaller_numbers[0]:
            heapq.heappush(self.smaller_numbers, -number)
        else:
            heapq.heappush(self.larger_numbers, number)

        # Rebalance the heaps to maintain the size property
        if len(self.smaller_numbers) > len(self.larger_numbers) + 1:
            heapq.heappush(self.larger_numbers, -heapq.heappop(self.smaller_numbers))
        elif len(self.larger_numbers) > len(self.smaller_numbers):
            # Maintain the size difference by rebalancing
            heapq.heappush(self.smaller_numbers, -heapq.heappop(self.larger_numbers))

    def findMedian(self):
        # Determine the median from the heap tops
        if len(self.smaller_numbers) == len(self.larger_numbers):
            # Even number of total elements: average the two middle values
            return (-self.smaller_numbers[0] + self.larger_numbers[0]) / 2.0
        else:
            # Odd number of total elements: the extra element sits on top of the max-heap
            return float(-self.smaller_numbers[0])
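
The same example sequence can be traced through the heap-based version; the comments sketch the conceptual contents of the two halves after each call:

median_finder = MedianFinder()
median_finder.addNum(1)            # smaller half: {1}, larger half: {}
median_finder.addNum(2)            # smaller half: {1}, larger half: {2}
print(median_finder.findMedian())  # 1.5
median_finder.addNum(3)            # rebalanced to smaller half: {1, 2}, larger half: {3}
print(median_finder.findMedian())  # 2.0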

Big-O Analysis

Time Complexity
O(log n). The solution uses two heaps: a max-heap to store the smaller half of the numbers and a min-heap to store the larger half. Adding a new number involves inserting it into one of the heaps, which takes O(log n) time where n is the number of elements in the heap. Rebalancing the heaps (moving an element from one heap to the other) also takes O(log n) time because it involves heap push and pop operations. Since each addNum operation performs at most one insertion and one rebalancing, the overall time complexity for adding a single number is O(log n). Finding the median involves peeking at the top elements of the heaps, which takes O(1) time.
Space Complexity
O(N). The algorithm uses two piles (priority queues or heaps) to store the smaller and larger halves of the data stream. In the worst case, the algorithm stores all N elements from the data stream in these piles, so the auxiliary space scales linearly with the number of elements. This results in a space complexity of O(N).

Edge Cases

  • Empty data stream (no numbers added yet): findMedian() should return a defined value (e.g., NaN) or throw an exception, since there is no median to calculate.
  • Single number in the data stream: findMedian() should simply return that single number.
  • Large number of elements in the data stream: the solution must scale to millions or billions of numbers by using an efficient data structure and algorithm (e.g., heaps).
  • Data stream contains duplicate numbers: the median must still be calculated correctly when duplicates are present.
  • Data stream contains negative numbers: the solution should handle positive and negative numbers without any bias.
  • Data stream contains extreme values (very large or very small numbers): the solution should be robust against integer overflow and floating-point precision issues.
  • Data stream with a highly skewed distribution (e.g., mostly small numbers with a few very large numbers): the heaps must stay balanced so the median is still maintained efficiently.
  • Floating-point precision errors: use double precision and compare floating-point results against a tolerance value.
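
Follow-up Sketch

For the follow-up, if every number in the stream is guaranteed to lie in [0, 100], the heaps can be replaced with a fixed array of 101 counters: addNum increments a bucket in O(1), and findMedian walks at most 101 buckets, which is effectively O(1). The sketch below (the class name BoundedMedianFinder and helper kth_smallest are illustrative, not part of the original interface) assumes integer inputs strictly within [0, 100]; for the 99% variant, the rare outliers below 0 or above 100 could be kept in two small sorted side lists (or heaps) and folded into the same rank walk.

class BoundedMedianFinder:

    def __init__(self):
        # counts[value] = how many times value has been added so far
        self.counts = [0] * 101
        self.total_numbers = 0

    def addNum(self, num):
        self.counts[num] += 1
        self.total_numbers += 1

    def findMedian(self):
        # Walk the 101 buckets to find the k-th smallest value
        def kth_smallest(k):
            seen = 0
            for value, count in enumerate(self.counts):
                seen += count
                if seen >= k:
                    return value
            raise ValueError("findMedian called on an empty stream")

        if self.total_numbers % 2 == 1:
            return float(kth_smallest(self.total_numbers // 2 + 1))

        lower_middle = kth_smallest(self.total_numbers // 2)
        upper_middle = kth_smallest(self.total_numbers // 2 + 1)
        return (lower_middle + upper_middle) / 2.0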