Taro Logo

H-Index

Medium
Asked by:
Profile picture
Profile picture
Profile picture
Profile picture
+8
More companies
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
Profile picture
58 views
Topics:
ArraysBinary SearchGreedy Algorithms

Given an array of integers citations where citations[i] is the number of citations a researcher received for their ith paper, return the researcher's h-index.

According to the definition of h-index on Wikipedia: The h-index is defined as the maximum value of h such that the given researcher has published at least h papers that have each been cited at least h times.

Example 1:

Input: citations = [3,0,6,1,5]
Output: 3
Explanation: [3,0,6,1,5] means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively.
Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, their h-index is 3.

Example 2:

Input: citations = [1,3,1]
Output: 1

Constraints:

  • n == citations.length
  • 1 <= n <= 5000
  • 0 <= citations[i] <= 1000

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What are the constraints on the size of the `citations` array, and what is the range of values for each citation count?
  2. Can the citation counts be negative or zero, or are they guaranteed to be non-negative integers?
  3. Is it possible for the input array `citations` to be empty? If so, what should be returned in that case?
  4. Does the h-index definition imply that the researchers must have *exactly* h citations, or *at least* h citations? (The problem states 'at least', but it's good to confirm any nuances.)
  5. If there are multiple possible h-index values that satisfy the condition, which one should be returned? (e.g., the largest, smallest, or any valid one?)

Brute Force Solution

Approach

The H-Index problem asks us to find the largest number 'h' such that a person has at least 'h' publications with 'h' or more citations. The brute force approach involves systematically checking every possible value for 'h' to see if it satisfies the condition.

Here's how the algorithm would work step-by-step:

  1. Consider a potential H-Index value, let's call it 'candidate_h'.
  2. Now, go through all the reported publications and count how many of them have a citation count that is greater than or equal to 'candidate_h'.
  3. If the number of publications with 'candidate_h' or more citations is at least 'candidate_h' itself, then 'candidate_h' is a possible H-Index.
  4. If not, then 'candidate_h' cannot be the H-Index.
  5. Repeat this process for every possible 'candidate_h' value, starting from the largest possible citation count down to zero.
  6. The largest 'candidate_h' that satisfied the condition in the previous steps is the final H-Index.

Code Implementation

def h_index_brute_force(citation_counts):
    maximum_possible_h_index = len(citation_counts)

    # Iterate downwards from the maximum possible H-index to find the largest valid 'h'
    for candidate_h_index in range(maximum_possible_h_index, -1, -1):
        publications_meeting_criteria = 0

        # Count publications with citations >= the current candidate H-index
        for citation_count in citation_counts:
            if citation_count >= candidate_h_index:
                publications_meeting_criteria += 1

        # Check if the number of qualifying publications meets or exceeds the candidate H-index
        if publications_meeting_criteria >= candidate_h_index:
            return candidate_h_index

    # This case should theoretically not be reached if citation_counts is not empty and contains non-negative values,
    # as h=0 is always a valid H-index (0 publications with >= 0 citations).
    return 0

Big(O) Analysis

Time Complexity
O(n*m)Let n be the number of publications and m be the maximum possible citation count (or the maximum value in the citations array). The algorithm iterates through each possible H-Index value from the maximum citation count down to zero. For each potential H-Index value, it iterates through all n publications to count those meeting the citation threshold. In the worst case, if the maximum citation count m is proportional to n, this approach would resemble O(n²). However, if m is significantly larger or smaller than n, the complexity is O(n*m) where n is the number of publications and m is the range of possible h-index values considered.
Space Complexity
O(1)The provided brute force approach iterates through potential H-Index values and counts publications. This process only requires a few variables to store the current candidate H-index and the count of publications meeting the criteria. No auxiliary data structures that grow with the input size (N, the number of publications) are created. Therefore, the auxiliary space complexity is constant.

Optimal Solution

Approach

The H-Index is about finding a scientist's highest 'h' such that they have at least 'h' papers with 'h' or more citations. The most efficient way to find this is to sort the papers by their citation counts and then systematically check for this condition.

Here's how the algorithm would work step-by-step:

  1. Arrange all the research papers in order from the one with the most citations to the one with the fewest.
  2. Start imagining a potential 'h' value. Let's say you pick a number, like 3.
  3. Now, look at your sorted list of papers. Count how many papers have at least 3 citations.
  4. If you find that you have 3 or more papers with at least 3 citations, then 3 is a possible 'h' value.
  5. Try to aim for a higher 'h' value. If you thought 3 was possible, try checking if 4 is possible. You'd count papers with 4 or more citations.
  6. Continue this process, increasing your target 'h' value and checking if you have enough papers with that many citations.
  7. The moment you try to check for an 'h' value, and you don't have enough papers with that many citations, you know you've gone too high.
  8. The largest 'h' value for which you *did* have enough papers is your final answer.

Code Implementation

def calculate_h_index(citation_counts_for_papers):
    sorted_citation_counts = sorted(citation_counts_for_papers, reverse=True)
    number_of_papers = len(sorted_citation_counts)
    potential_h_value = 0

    # Iterate through possible h values, starting from 0.
    for current_h in range(number_of_papers + 1):
        papers_with_enough_citations = 0
        # Count papers that meet or exceed the current potential h value.
        for citation_count in sorted_citation_counts:
            if citation_count >= current_h:
                papers_with_enough_citations += 1
            else:
                # Since the list is sorted, no further papers will meet the criteria.
                break

        # If we have at least 'current_h' papers with 'current_h' or more citations.
        if papers_with_enough_citations >= current_h:
            potential_h_value = current_h
        else:
            # We've found an h value that is too high, so the previous one was the maximum.
            break

    return potential_h_value

Big(O) Analysis

Time Complexity
O(n log n)The dominant operation in this approach is sorting the research papers by their citation counts, which typically takes O(n log n) time, where n is the number of papers. After sorting, we iterate through the sorted list once to find the h-index. This linear scan takes O(n) time. Since sorting is the more computationally expensive step, the overall time complexity is determined by the sort, resulting in O(n log n).
Space Complexity
O(1)The provided solution primarily uses a constant amount of extra space. It operates by sorting the input array (which is often done in-place for many standard sorting algorithms, or uses logarithmic space for recursive calls if not in-place). Beyond the sorting, it uses a few scalar variables to keep track of the current 'h' value being checked and the count of papers meeting the criteria. The space used does not depend on the number of papers N, hence it's constant auxiliary space.

Edge Cases

Input array is null or empty
How to Handle:
An empty or null input array should result in an h-index of 0, as no researchers exist.
Input array contains only one researcher
How to Handle:
If there's one researcher with c citations, the h-index is 1 if c >= 1, otherwise it's 0.
All researchers have 0 citations
How to Handle:
If all citations are 0, the h-index is 0 because no researcher has at least h citations for any h > 0.
All researchers have the same number of citations
How to Handle:
If all researchers have c citations, the h-index is min(n, c) where n is the number of researchers.
Citations are very large, potentially causing integer overflow
How to Handle:
Ensure that the data type used for citations and counts can accommodate the maximum possible citation value and the total number of researchers.
Citations are unsorted
How to Handle:
Sorting the array or using a frequency map allows for efficient calculation regardless of initial order.
A very large number of researchers with few citations, and a few researchers with many citations (highly skewed distribution)
How to Handle:
The algorithm should efficiently handle such distributions, for example, by sorting or using counting sort if citation values are bounded.
Citations include negative numbers
How to Handle:
Negative citation counts are not physically possible and should be treated as 0 or ignored, as they cannot contribute to an h-index.