Taro Logo

Find Median from Data Stream

Hard
Meta logo
Meta
2 views
Topics:
Database ProblemsArrays

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

  • For example, for arr = [2,3,4], the median is 3.
  • For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Implement the MedianFinder class:

  • MedianFinder() initializes the MedianFinder object.
  • void addNum(int num) adds the integer num from the data stream to the data structure.
  • double findMedian() returns the median of all elements so far. Answers within 10-5 of the actual answer will be accepted.

Example:

Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]

Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr[1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

  • -105 <= num <= 105
  • There will be at least one element in the data structure before calling findMedian.
  • At most 5 * 104 calls will be made to addNum and findMedian.

Follow up:

  • If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
  • If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

Solution


Naive Solution: Using a Sorted List

One straightforward approach is to maintain a sorted list of all numbers seen so far. When addNum is called, we insert the new number into the correct position in the sorted list. When findMedian is called, we simply return the middle element (or the average of the two middle elements) of the list.

Code (Python):

class MedianFinder:
    def __init__(self):
        self.data = []

    def addNum(self, num: int) -> None:
        # Insert num into self.data while maintaining sorted order
        if not self.data:
            self.data.append(num)
            return
        
        for i in range(len(self.data)):
            if num < self.data[i]:
                self.data.insert(i, num)
                return
        self.data.append(num)

    def findMedian(self) -> float:
        n = len(self.data)
        if n % 2 == 0:
            return (self.data[n // 2 - 1] + self.data[n // 2]) / 2.0
        else:
            return float(self.data[n // 2])

Time Complexity:

  • addNum: O(n), because inserting into a sorted list of size n takes O(n) time in the worst case.
  • findMedian: O(1), because accessing the middle element(s) takes constant time.

Space Complexity:

  • O(n), to store all the numbers.

Optimal Solution: Using Heaps

A better approach is to use two heaps: a max-heap and a min-heap.

  • The max-heap will store the smaller half of the numbers.
  • The min-heap will store the larger half of the numbers.

We maintain the invariant that the size of the max-heap is always equal to or one greater than the size of the min-heap. This ensures that when the total number of elements is odd, the median is simply the top of the max-heap. When the total number of elements is even, the median is the average of the tops of the two heaps.

Algorithm:

  1. addNum(num):
    • Add the number to the max-heap.
    • Move the largest element from the max-heap to the min-heap.
    • If the min-heap is larger than the max-heap, move the smallest element from the min-heap to the max-heap.
  2. findMedian():
    • If the max-heap is larger than the min-heap, the median is the top of the max-heap.
    • Otherwise, the median is the average of the tops of the two heaps.

Code (Python):

import heapq

class MedianFinder:
    def __init__(self):
        self.small = []  # max-heap (negated values)
        self.large = []  # min-heap

    def addNum(self, num: int) -> None:
        heapq.heappush(self.small, -num)
        heapq.heappush(self.large, -heapq.heappop(self.small))

        if len(self.large) > len(self.small):
            heapq.heappush(self.small, -heapq.heappop(self.large))

    def findMedian(self) -> float:
        if len(self.small) > len(self.large):
            return -self.small[0]
        else:
            return (-self.small[0] + self.large[0]) / 2.0

Time Complexity:

  • addNum: O(log n), because heap push and pop operations take O(log n) time.
  • findMedian: O(1), because accessing the top of a heap takes constant time.

Space Complexity:

  • O(n), to store all the numbers in the heaps.

Edge Cases

  • Empty stream: The problem statement specifies that findMedian will only be called after at least one number has been added. Thus, no special handling is needed for an empty stream.
  • Duplicate numbers: The heaps-based solution works correctly with duplicate numbers. The duplicates will simply be distributed between the two heaps.

Optimization for Specific Ranges

  • Numbers in the range [0, 100]: We can use a counting sort approach. Maintain an array of size 101 to store the frequency of each number. To find the median, iterate through the array to find the middle element(s). This would give O(1) time complexity for addNum and findMedian, but the space complexity is O(1) since the array size is fixed.
  • 99% of numbers in the range [0, 100]: We can combine the heaps approach with the counting sort approach. Use the counting sort for numbers in the range [0, 100] and the heaps for numbers outside that range. This will improve performance significantly if most numbers are within the range [0, 100].