Taro Logo

H-Index II

Medium
Asked by:
Profile picture
Profile picture
Profile picture
13 views
Topics:
ArraysBinary Search

Given an array of integers citations where citations[i] is the number of citations a researcher received for their ith paper and citations is sorted in non-descending order, return the researcher's h-index.

According to the definition of h-index on Wikipedia: The h-index is defined as the maximum value of h such that the given researcher has published at least h papers that have each been cited at least h times.

You must write an algorithm that runs in logarithmic time.

Example 1:

Input: citations = [0,1,3,5,6]
Output: 3
Explanation: [0,1,3,5,6] means the researcher has 5 papers in total and each of them had received 0, 1, 3, 5, 6 citations respectively.
Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, their h-index is 3.

Example 2:

Input: citations = [1,2,100]
Output: 2

Constraints:

  • n == citations.length
  • 1 <= n <= 105
  • 0 <= citations[i] <= 1000
  • citations is sorted in ascending order.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Can you confirm the input array `citations` is sorted in non-decreasing order as the problem description implies?
  2. What is the expected return value if no citation meets the criteria for the h-index (i.e., the h-index is 0)?
  3. What is the maximum possible size of the `citations` array?
  4. Are the citation values guaranteed to be non-negative integers?
  5. Can the `citations` array be empty?

Brute Force Solution

Approach

To find the H-Index, we'll try out different possible H-Index values. For each guess, we will check if the array satisfies the H-Index condition, by checking each number of publications. This means trying every possibility until we find the largest possible H-Index that works.

Here's how the algorithm would work step-by-step:

  1. Let's start by guessing that the H-Index is some number. We could begin with the biggest possible value for the H-Index.
  2. Now, we check the list of publications and their number of citations. We want to see if at least that many publications have at least that many citations.
  3. If we find enough publications with enough citations, then our guess might be the H-Index, but we should still try a larger value to be sure we have found the *largest* value possible for the H-Index.
  4. If we *don't* find enough publications with enough citations, it means our guess was too high. So, we try a smaller value for the H-Index.
  5. We keep trying these different values, going down one by one, until we find the largest possible number such that there are at least that many publications with at least that many citations.
  6. When we find that number, that's our H-Index!

Code Implementation

def h_index_brute_force(citations):
    number_of_publications = len(citations)

    # Start with the largest possible h-index
    for possible_h_index in range(number_of_publications, -1, -1):
        # Check if the array satisfies the condition for this h-index
        publications_with_enough_citations = 0

        for citation_count in citations:
            if citation_count >= possible_h_index:
                publications_with_enough_citations += 1

        # If enough publications have enough citations, return h-index
        if publications_with_enough_citations >= possible_h_index:

            # We've found the largest possible h-index
            return possible_h_index

    return 0

Big(O) Analysis

Time Complexity
O(n²)The algorithm iterates downwards from n (the length of the citations array) to 1, attempting to find the largest possible H-Index. For each potential H-Index value, it iterates through the entire citations array of size n to verify if at least H publications have at least H citations. Therefore, in the worst case, this involves nested loops, where the outer loop runs a maximum of n times and the inner loop runs n times for each iteration of the outer loop. This results in approximately n*n operations which simplifies to O(n²).
Space Complexity
O(1)The provided plain English explanation outlines an iterative approach that involves guessing H-Index values and checking them against the sorted array of citations. The algorithm does not explicitly use any additional data structures like arrays, hash maps, or lists to store intermediate results or track progress. The space used by the algorithm consists of a few variables to store the guessed H-Index and potentially a counter to check the H-Index condition; these variables require constant space. Therefore, the auxiliary space complexity is O(1), irrespective of the input size N, where N is the number of publications.

Optimal Solution

Approach

This problem asks us to find a special number related to someone's research publications. Because the publications are already helpfully sorted, we can use this sorted order to quickly narrow down the possibilities, rather than checking every single number.

Here's how the algorithm would work step-by-step:

  1. Start by looking at the middle publication in the sorted list.
  2. Check if the number of publications at or beyond the middle point is greater than or equal to the value of that middle publication.
  3. If it is, that means the 'special number' might be larger, so focus your attention on the right half of the list.
  4. If it's not, the 'special number' must be smaller, so look at the left half of the list.
  5. Keep repeating this process of checking the middle and narrowing the search area until you find the 'special number'.

Code Implementation

def h_index_ii(citations): 
    start_index = 0
    end_index = len(citations) - 1

    while start_index <= end_index:
        middle_index = (start_index + end_index) // 2
        
        # Publications beyond middle may be >= middle citation
        papers_beyond_middle = len(citations) - middle_index

        if citations[middle_index] >= papers_beyond_middle:
            # H-index might be larger; check the left half
            end_index = middle_index - 1
        else:
            # H-index must be smaller; check the right half
            start_index = middle_index + 1

    # Remaining publications are beyond h-index
    return len(citations) - start_index

Big(O) Analysis

Time Complexity
O(log n)The algorithm employs a binary search approach on the sorted array of citations. In each iteration, it examines the middle element and halves the search space based on the comparison. This halving of the search space in each step results in logarithmic time complexity, specifically O(log n), where n is the number of citations.
Space Complexity
O(1)The algorithm operates using binary search. It only uses a constant number of variables to store indices and the middle point during the search. The space used doesn't increase with the size of the input array (citations), denoted as N. Therefore, the auxiliary space complexity is O(1).

Edge Cases

Empty citations array
How to Handle:
Return 0 as the h-index since there are no publications.
Citations array with a single element
How to Handle:
Return 1 if the single citation is >= 1, else return 0.
All citations are 0
How to Handle:
Return 0 as the h-index since no paper has a sufficient number of citations.
All citations are very large (e.g., close to integer max)
How to Handle:
The binary search should still function correctly as the citation values do not affect the algorithm's core logic.
Citations array is in strictly increasing order
How to Handle:
The binary search will efficiently find the h-index in this sorted scenario.
Citations array is in strictly decreasing order (or all the same value)
How to Handle:
The binary search will still correctly identify the h-index, potentially at the beginning of the array.
Very large input array (potential for integer overflow in calculations)
How to Handle:
Ensure that the middle calculation in binary search uses `mid = low + (high - low) / 2` to prevent integer overflow.
No h-index exists (e.g., citations = [0,0,0])
How to Handle:
The binary search should naturally converge to a low value (0), indicating no valid h-index.