Taro Logo

Find Median Given Frequency of Numbers

Hard
Pinterest logo
Pinterest
1 view
Topics:
ArraysBinary Search

The frequency of each number is given in a table format where the first column is the number and the second column is its frequency.

Return the median of all numbers.

Example 1:

Input: 
[
  [1, 3],
  [2, 1],
  [3, 2]
]
Output: 2
Explanation:
The numbers are 1, 1, 1, 2, 3, 3.
The median is 2.

Example 2:

Input: 
[
  [1, 2],
  [2, 3]
]
Output: 2
Explanation:
The numbers are 1, 1, 2, 2, 2.
The median is 2.

Constraints:

  • The total numbers are in the range [1, 105].
  • The frequency of each number is in the range [1, 105].
  • Each number is in the range [1, 105].

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the possible ranges for the numbers in `nums` and `frequencies`? Can `frequencies` contain zero?
  2. Can either `nums` or `frequencies` be empty or null? What should I return in those cases?
  3. Are the numbers in `nums` guaranteed to be distinct, or can there be duplicate values?
  4. If the total number of elements (sum of `frequencies`) is even, should I return the average of the two middle elements as a float, or is an integer result acceptable?
  5. Is the `frequencies` array guaranteed to have the same length as the `nums` array?

Brute Force Solution

Approach

The brute force approach to finding the median involves recreating the entire list of numbers from their frequencies. Then, we sort this list and pick the middle number (or average the two middle numbers if the count is even).

Here's how the algorithm would work step-by-step:

  1. Create a list by adding each number as many times as its frequency indicates.
  2. Organize the numbers in the list from smallest to largest.
  3. Count how many numbers are in the organized list.
  4. If the count is odd, find the middle position and return the number at that position.
  5. If the count is even, find the two numbers in the middle. Calculate their average and return the result.

Code Implementation

def find_median_given_frequency(numbers_with_frequency):
    complete_number_list = []

    for number, frequency in numbers_with_frequency:
        # Expand the list based on frequencies
        for _ in range(frequency):
            complete_number_list.append(number)

    complete_number_list.sort()

    list_length = len(complete_number_list)

    # Need the middle element for an odd-length list
    if list_length % 2 != 0:
        median_index = list_length // 2
        return complete_number_list[median_index]

    # Need to average two middle elements for even length
    else:
        middle_index_2 = list_length // 2
        middle_index_1 = middle_index_2 - 1

        #Calculate the average of the two middle numbers
        median = (complete_number_list[middle_index_1] + \
                  complete_number_list[middle_index_2]) / 2
        return median

Big(O) Analysis

Time Complexity
O(N+K log K)The algorithm first recreates the list of numbers based on their frequencies. Let K be the total count of all numbers (sum of all frequencies). Constructing this list takes O(K) time. Then, the list is sorted, which takes O(K log K) time using an efficient sorting algorithm. Finally, finding the median from the sorted list takes O(1) time (accessing the middle element(s)). Let N be the number of unique numbers we have, in the input. Therefore the overall complexity is dominated by the sorting step O(K log K), assuming K >= N and that constructing the list is O(K) since K is the sum of the frequencies. So, the final complexity is O(N+K log K).
Space Complexity
O(N)The brute force approach creates a new list to hold all the numbers based on their frequencies. In the worst case, where each number has a frequency greater than 1, the size of this new list could approach the sum of all frequencies, effectively rebuilding a potentially large array. If we consider the sum of all frequencies as N, where N is the total number of elements after unfolding the frequencies, then the auxiliary space required for this list is proportional to N. Therefore, the space complexity is O(N).

Optimal Solution

Approach

The best way to find the median is to avoid creating a giant list of all the numbers. We can use the frequency information to efficiently jump to the middle value(s) without explicitly listing out every number.

Here's how the algorithm would work step-by-step:

  1. First, figure out the total count of all the numbers based on their frequencies.
  2. Then, determine what the middle position (or positions, for an even number of total elements) would be.
  3. Go through the numbers and their frequencies in order. Keep a running total of how many numbers you've seen so far.
  4. Stop when your running total reaches or exceeds the middle position(s).
  5. If you stopped exactly at the middle position, the number you're currently at is the median. If there are two middle positions, it's the average of those two numbers.
  6. If you exceeded the middle position, the median is the number you're currently at.

Code Implementation

def find_median_given_frequency(frequency_map):
    total_numbers = sum(frequency_map.values())
    middle_position = (total_numbers + 1) // 2
    count = 0

    sorted_numbers = sorted(frequency_map.keys())

    for i in range(len(sorted_numbers)):
        number = sorted_numbers[i]
        frequency = frequency_map[number]
        count += frequency

        # Check if we've reached the middle position.
        if count >= middle_position:

            # If total count is even, average current and next number
            if total_numbers % 2 == 0 and count == middle_position:
                if i + 1 < len(sorted_numbers):
                    next_number = sorted_numbers[i + 1]
                    return (number + next_number) / 2
                else:
                    return number #last element
            # If total count is odd, median is the current number
            else:
                return number

    return None # Should not reach here if input is valid

Big(O) Analysis

Time Complexity
O(n)The algorithm iterates through the numbers and their frequencies to find the median. The input size, n, represents the number of unique numbers (not the total count of numbers). The algorithm calculates the total count and then iterates through the numbers and their frequencies only once until it reaches or exceeds the middle position(s). Therefore, the time complexity is directly proportional to the number of unique numbers, n, making it O(n).
Space Complexity
O(1)The algorithm uses a few integer variables for calculating the total count, tracking the running total, and storing the middle position(s). The space used by these variables remains constant regardless of the input frequency data (N). Therefore, the auxiliary space complexity is O(1).

Edge Cases

CaseHow to Handle
nums or frequencies is null or emptyReturn null or throw an exception if either input array is null or empty, indicating invalid input.
nums and frequencies have different lengthsThrow an exception if the lengths of the input arrays are different, as this indicates invalid input.
Sum of frequencies is zeroReturn NaN or throw an exception if the sum of frequencies is zero, indicating an empty dataset.
Large frequencies causing integer overflow when calculating total countUse a 64-bit integer type (long in Java/C++) to prevent overflow when calculating the total number of elements.
All numbers in nums are the sameThe standard median calculation should still correctly identify the median value in this scenario.
Frequencies contain negative valuesThrow an exception if any frequency is negative, since frequencies should be non-negative.
Extreme values in nums leading to precision issues if using floating point arithmeticConsider using integer arithmetic when possible or using appropriate floating point precision to minimize errors.
Even number of total elements requiring average of two middle elementsCorrectly identify the two middle elements and compute their average according to the problem definition.