Taro Logo

Find Median from Data Stream

Hard
Amazon logo
Amazon
Topics:
ArraysGreedy AlgorithmsDynamic Programming

Let's implement a MedianFinder class to efficiently calculate the median of a stream of numbers.

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

  • For example, for arr = [2,3,4], the median is 3.
  • For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Your MedianFinder class should support the following methods:

  • MedianFinder(): Initializes the MedianFinder object.
  • void addNum(int num): Adds the integer num from the data stream to the data structure.
  • double findMedian(): Returns the median of all elements so far. Answers within 10-5 of the actual answer will be accepted.

Example:

MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr[1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

  • -10^5 <= num <= 10^5
  • There will be at least one element in the data structure before calling findMedian.
  • At most 5 * 10^4 calls will be made to addNum and findMedian.

Follow-up questions:

  • If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
  • If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

Solution


Naive Solution

A naive solution to find the median of a data stream involves storing all the numbers in a list and, for each call to findMedian(), sorting the list and returning the median. This approach is straightforward but inefficient.

Implementation

  1. addNum(int num): Add the number to a list.
  2. findMedian(): Sort the list and compute the median.

Code

class MedianFinder:
    def __init__(self):
        self.numbers = []

    def addNum(self, num):
        self.numbers.append(num)

    def findMedian(self):
        self.numbers.sort()
        n = len(self.numbers)
        if n % 2 == 0:
            return (self.numbers[n // 2 - 1] + self.numbers[n // 2]) / 2.0
        else:
            return float(self.numbers[n // 2])

Time Complexity

  • addNum(): O(1)
  • findMedian(): O(n log n), due to sorting.

Space Complexity

  • O(n) to store the numbers.

Optimal Solution: Using Heaps

A more efficient solution utilizes two heaps: a max-heap to store the smaller half of the numbers and a min-heap to store the larger half. This ensures that the median can be found in O(1) time.

Implementation

  1. addNum(int num):
    • Add the number to the max-heap (smaller half).
    • If the max-heap size exceeds the min-heap size by more than 1, move the largest element from the max-heap to the min-heap.
    • If the new number is smaller than the smallest number in the min-heap, balance by moving the largest element from the max-heap to the min-heap and inserting the new number into the max-heap.
  2. findMedian():
    • If the sizes of the heaps are equal, the median is the average of the top elements of both heaps.
    • If the max-heap has more elements, the median is the top element of the max-heap.

Code

import heapq

class MedianFinder:
    def __init__(self):
        self.small = []  # max-heap
        self.large = []  # min-heap

    def addNum(self, num):
        heapq.heappush(self.small, -num)
        if self.large and -self.small[0] > self.large[0]:
            heapq.heappush(self.large, -heapq.heappop(self.small))
        if len(self.small) > len(self.large) + 1:
            heapq.heappush(self.large, -heapq.heappop(self.small))
        elif len(self.large) > len(self.small):
            heapq.heappush(self.small, -heapq.heappop(self.large))

    def findMedian(self):
        if len(self.small) == len(self.large):
            return (-self.small[0] + self.large[0]) / 2.0
        else:
            return float(-self.small[0])

Time Complexity

  • addNum(): O(log n), due to heap operations.
  • findMedian(): O(1).

Space Complexity

  • O(n) to store the numbers in the heaps.

Edge Cases

  • Empty stream: The problem states that there will be at least one element before calling findMedian, so an empty stream won't be an issue.
  • Large number of elements: Heaps can handle a large number of elements efficiently.
  • Duplicate numbers: The heap-based solution works correctly with duplicate numbers.

Follow-up Optimization

Numbers in the Range [0, 100]

If all numbers are in the range [0, 100], we can use a frequency array (bucket sort). addNum increments the count of the number in the array and findMedian iterates through the array to find the median.

  • Time complexity for addNum: O(1)
  • Time complexity for findMedian: O(100) = O(1)
  • Space complexity: O(101)

99% of Numbers in the Range [0, 100]

For the case where 99% of the numbers are in the range [0, 100], a hybrid approach would be best. Use the frequency array as described above for numbers in the range [0, 100]. Use the heap-based approach for numbers outside that range. This keeps the memory and time costs low as long as the outlier numbers really are rare.