Taro Logo

H-Index

#201 Most AskedMedium
Topics:
ArraysBinary Search

Given an array of integers citations where citations[i] is the number of citations a researcher received for their ith paper, return the researcher's h-index.

According to the definition of h-index on Wikipedia: The h-index is defined as the maximum value of h such that the given researcher has published at least h papers that have each been cited at least h times.

Example 1:

Input: citations = [3,0,6,1,5]
Output: 3
Explanation: [3,0,6,1,5] means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively.
Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, their h-index is 3.

Example 2:

Input: citations = [1,3,1]
Output: 1

Constraints:

  • n == citations.length
  • 1 <= n <= 5000
  • 0 <= citations[i] <= 1000

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the range of values for the citations? Can citations be zero or negative?
  2. What should I return if the input array is empty?
  3. Are there any upper bounds on the number of papers (length of the citations array)?
  4. If there are multiple possible values for the h-index, should I return the largest possible h-index?
  5. Can I modify the input array in place, or should I assume I need to preserve the original array?

Brute Force Solution

Approach

To find the h-index, we're going to try every possible value for the h-index. For each potential h-index, we'll check if the list of publications meets the required criteria to actually have that h-index.

Here's how the algorithm would work step-by-step:

  1. Start by guessing that the h-index is 0.
  2. Check if there are at least 0 publications with at least 0 citations each. This will always be true.
  3. Now, guess that the h-index is 1.
  4. Check if there are at least 1 publications with at least 1 citations each.
  5. Continue guessing higher values (2, 3, and so on) for the h-index.
  6. For each guess, check if there are at least that many publications with at least that many citations.
  7. Stop when you find a guess that doesn't work, meaning there are not enough publications with enough citations.
  8. The h-index is the highest guess that *did* work.

Code Implementation

def h_index_brute_force(citations):
    # Iterate through possible h-index values
    for possible_h_index in range(len(citations) + 1):

        # Count publications with at least possible_h_index citations
        publications_with_enough_citations = 0
        for citation_count in citations:
            if citation_count >= possible_h_index:
                publications_with_enough_citations += 1

        # If we don't have enough publications for this h-index, return the previous one
        if publications_with_enough_citations < possible_h_index:

            # Previous h-index is the result
            return possible_h_index - 1

    # If the loop completes, the h-index is the number of publications
    return len(citations)

Big(O) Analysis

Time Complexity
O(n²)The algorithm iterates through potential h-index values, starting from 0 and going up to n (the number of publications). For each potential h-index value (h), it iterates through the citations array to count how many publications have at least h citations. This inner loop iterates through all n citations. Therefore, in the worst case, the outer loop iterates n times, and the inner loop iterates n times for each iteration of the outer loop. This gives a total time complexity of O(n * n), which simplifies to O(n²).
Space Complexity
O(1)The algorithm iteratively checks potential h-index values without creating any additional data structures that scale with the input size N (the number of publications). It only uses a few constant-sized variables to keep track of the current guess and to iterate through the publications. Therefore, the auxiliary space used is constant, regardless of the number of citations.

Optimal Solution

Approach

The H-Index represents the highest number 'h' such that the researcher has 'h' papers that have each been cited at least 'h' times. The most efficient approach leverages sorting to quickly identify this 'h' without exhaustively checking every possibility. By sorting, we can efficiently determine the H-Index by working our way down the sorted list of citations.

Here's how the algorithm would work step-by-step:

  1. First, arrange the citation counts of the papers in descending order, from the most cited to the least cited.
  2. Then, start at the beginning of the sorted list (the most cited paper).
  3. Count how many papers, starting from the beginning of the sorted list, have at least that many citations.
  4. For example, if the third paper in the sorted list has 3 or more citations, then the H-Index is potentially 3.
  5. Continue this process, checking each paper to see if it has a citation count equal or greater than its position in the list (starting from one).
  6. The H-Index is the largest position number that still satisfies the condition (citation count is equal to or greater than the position).

Code Implementation

def calculate_h_index(citation_counts):
    citation_counts.sort(reverse=True)

    h_index = 0
    for index, citation_count in enumerate(citation_counts):
        # The index is incremented by 1 to represent position.
        position = index + 1

        # Check if the citation count meets the h-index criterion.
        if citation_count >= position:
            h_index = position
        else:
            # Once the condition is no longer met, the h-index is found.
            break

    return h_index

Big(O) Analysis

Time Complexity
O(n log n)The dominant operation is sorting the citation counts. Sorting algorithms like merge sort or heap sort, which are commonly used in efficient implementations, have a time complexity of O(n log n), where n is the number of citations. The subsequent iteration to determine the h-index involves traversing the sorted array once, taking O(n) time. Since O(n log n) grows faster than O(n), the overall time complexity is determined by the sorting step.
Space Complexity
O(1)The provided approach primarily uses sorting. Although the plain English explanation doesn't explicitly state which sorting algorithm is used, we can assume an in-place sorting algorithm for optimal space complexity. Therefore, beyond the input array itself, no significant auxiliary space is required. The algorithm uses a constant amount of extra space for variables like the index in the sorted array and for comparing values. Consequently, the space complexity is O(1), indicating constant extra space regardless of the input size N (the number of papers).

Edge Cases

Empty or null input array
How to Handle:
Return 0 immediately as no publications exist.
Array with a single element where the citation is 0
How to Handle:
Return 0 because the single publication has insufficient citations.
Array with a single element where the citation is a large positive integer
How to Handle:
Return 1 as the researcher has 1 paper with at least 1 citation.
Array with all citations equal to zero
How to Handle:
Return 0 as no paper has any citations, making the h-index 0.
Array with all citations greater than or equal to the number of papers
How to Handle:
Return the length of the array, as all papers have citations at least equal to the number of papers.
Array with citations in descending order (e.g., [5, 4, 3, 2, 1])
How to Handle:
The h-index will be determined by the index where citations[i] < i + 1.
Array with citations in ascending order (e.g., [0, 1, 2, 3, 4])
How to Handle:
The h-index should be calculated based on the number of citations greater than or equal to that number.
Large input array with very large citation values that might cause integer overflow during counting sort.
How to Handle:
Cap the count array size by the number of citations or use a different sorting algorithm like comparison sort for efficiency.
0/270 completed