Taro Logo

Find Median from Data Stream

Hard
Twitch logo
Twitch
0 views
Topics:
ArraysTwo PointersGreedy Algorithms

The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

  • For example, for arr = [2,3,4], the median is 3.
  • For example, for arr = [2,3], the median is (2 + 3) / 2 = 2.5.

Implement the MedianFinder class:

  • MedianFinder() initializes the MedianFinder object.
  • void addNum(int num) adds the integer num from the data stream to the data structure.
  • double findMedian() returns the median of all elements so far. Answers within 10-5 of the actual answer will be accepted.

Example 1:

Input
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
Output
[null, null, null, 1.5, null, 2.0]

Explanation
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1);    // arr = [1]
medianFinder.addNum(2);    // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3);    // arr[1, 2, 3]
medianFinder.findMedian(); // return 2.0

Constraints:

  • -105 <= num <= 105
  • There will be at least one element in the data structure before calling findMedian.
  • At most 5 * 104 calls will be made to addNum and findMedian.

Follow up:

  • If all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?
  • If 99% of all integer numbers from the stream are in the range [0, 100], how would you optimize your solution?

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the range of possible integer values in the stream, and can I assume the stream will be of a reasonable size or could it be very large (millions or billions of numbers)?
  2. Are duplicate numbers allowed in the data stream, and if so, how should they be handled when calculating the median?
  3. When the number of elements in the stream is even, should I return the lower median, the upper median, or the average of the two?
  4. Can I assume that the `addNum` operation will always be called before `findMedian`?
  5. What data type should I return for the median; should it be an integer or a floating-point number, and what level of precision is required for the floating-point representation?

Brute Force Solution

Approach

The core idea is to maintain a list of all numbers seen so far. To find the median, we will sort the list every time a new number is added and then pick the middle number (or the average of the two middle numbers).

Here's how the algorithm would work step-by-step:

  1. Whenever a new number comes in, add it to the end of the list of numbers we already have.
  2. After adding the new number, sort the entire list from smallest to largest.
  3. If the number of numbers in the list is odd, then the median is simply the number in the very middle of the list.
  4. If the number of numbers in the list is even, then the median is the average of the two numbers that are closest to the middle of the list.

Code Implementation

class MedianFinder:

    def __init__(self):
        self.number_list = []

    def addNum(self, number) -> None:
        self.number_list.append(number)

    def findMedian(self) -> float:
        # Sort the list so that we can easily find the median
        self.number_list.sort()

        list_length = len(self.number_list)

        # Handle odd length list
        if list_length % 2 != 0:
            middle_index = list_length // 2
            return float(self.number_list[middle_index])

        # Even length list, average two middle elements
        else:
            middle_index_right = list_length // 2

            # The median is the average of the two middle numbers
            middle_index_left = middle_index_right - 1
            sum_of_middle_elements = self.number_list[middle_index_left] + self.number_list[middle_index_right]
            return float(sum_of_middle_elements) / 2.0

Big(O) Analysis

Time Complexity
O(n log n)Adding a number to the list takes O(1) time. However, after each addition, the list of size n is sorted. Sorting using an efficient algorithm like merge sort or quicksort has a time complexity of O(n log n). Since this sorting occurs for every new number added to the stream, the overall time complexity for adding n numbers and finding the median each time becomes O(n log n).
Space Complexity
O(N)The provided solution maintains a list of all numbers seen so far. When a new number arrives, it's appended to this list. Therefore, the auxiliary space required grows linearly with the number of elements added to the data stream, where N is the number of elements in the data stream. We are storing all N numbers in memory. The sorting operation itself might use some auxiliary space depending on the sorting algorithm, but given the context, the main driver of space complexity is the storage of the N numbers.

Optimal Solution

Approach

To efficiently find the median of a constantly growing stream of numbers, we can use two special containers that keep track of the smaller half and larger half of the numbers. By ensuring these containers are always balanced, we can quickly find the middle number(s) needed to calculate the median.

Here's how the algorithm would work step-by-step:

  1. Imagine you have two containers: one for the smaller numbers you've seen so far and another for the larger numbers.
  2. Whenever you get a new number, figure out which container it belongs in (smaller or larger).
  3. To keep things balanced, if one container gets too much bigger than the other, move the biggest number from the smaller container to the larger container, or vice-versa.
  4. When you need to find the median, check if the two containers have the same number of items. If they do, the median is the average of the largest number in the smaller container and the smallest number in the larger container.
  5. If the containers have different numbers of items, the median is simply the biggest number in the smaller container (if it has more items) or the smallest number in the larger container (if it has more items).

Code Implementation

import heapq

class MedianFinder:

    def __init__(self):
        self.smaller_numbers = [] # Max heap
        self.larger_numbers = []  # Min heap

    def addNum(self, number):
        if not self.smaller_numbers or number <= -self.smaller_numbers[0]:
            heapq.heappush(self.smaller_numbers, -number)
        else:
            heapq.heappush(self.larger_numbers, number)

        # Maintain size balance between the two heaps.
        if len(self.smaller_numbers) > len(self.larger_numbers) + 1:
            heapq.heappush(self.larger_numbers, -heapq.heappop(self.smaller_numbers))

        elif len(self.larger_numbers) > len(self.smaller_numbers):
            heapq.heappush(self.smaller_numbers, -heapq.heappop(self.larger_numbers))

    def findMedian(self):
        # If heaps are of equal size, median is avg of two middle elements.
        if len(self.smaller_numbers) == len(self.larger_numbers):
            return (-self.smaller_numbers[0] + self.larger_numbers[0]) / 2.0

        # If smaller_numbers is larger, it contains the median.
        else:
            return -self.smaller_numbers[0]

Big(O) Analysis

Time Complexity
O(log n)The addNum operation involves inserting a number into either the smaller or larger half containers, which are implemented using heaps (priority queues). Heap insertion has a time complexity of O(log n), where n is the number of elements in the heap. The rebalancing step involves at most one heap removal (O(log n)) and one heap insertion (O(log n)). The findMedian operation involves peeking at the top elements of the heaps, which takes O(1) time. Therefore, the dominant operation is the heap insertion/removal, resulting in an overall time complexity of O(log n) for addNum.
Space Complexity
O(N)The solution utilizes two containers to store the smaller and larger halves of the data stream. In the worst-case scenario, where the data stream consists of N numbers, these containers could potentially hold close to N/2 elements each. Therefore, the auxiliary space used grows linearly with the number of elements in the data stream. This results in a space complexity of O(N).

Edge Cases

CaseHow to Handle
Empty data stream (no numbers added yet)Return null or throw an exception if findMedian is called before any numbers are added, as there is no median to compute.
Stream with a single numberReturn the single number itself as the median.
Stream with two numbersReturn the average of the two numbers as the median.
Large stream of numbers exceeding available memoryThe solution should ideally use external memory or streaming algorithms if the stream is too large to fit in memory; otherwise, memory limits could be reached.
Stream with all identical numbersThe solution should correctly return the identical number as the median, even if the internal data structure isn't perfectly balanced.
Stream with a highly skewed distribution (e.g., mostly small numbers with a few very large numbers)The priority queue implementation should maintain balance efficiently, preventing a large imbalance that would degrade performance.
Stream containing negative numbers, zeros, and positive numbersThe solution needs to correctly handle all number types without special treatment.
Stream with extreme boundary values (e.g., Integer.MAX_VALUE, Integer.MIN_VALUE)The priority queue comparison logic must handle the extreme values without causing integer overflow errors.