Top K Frequent Words

Medium
Amazon
Topics:
Arrays, Strings

Given an array of strings words and an integer k, return the k most frequent strings.

Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.

For example:

  1. words = ["i","love","leetcode","i","love","coding"], k = 2

    • Output: ["i","love"]
    • Explanation: "i" and "love" are the two most frequent words. Note that "i" comes before "love" due to a lower alphabetical order.
  2. words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4

    • Output: ["the","is","sunny","day"]
    • Explanation: "the", "is", "sunny" and "day" are the four most frequent words, with the number of occurrence being 4, 3, 2 and 1 respectively.

Could you solve it in O(n log(k)) time and O(n) extra space?

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the maximum size of the `words` array and the range of the length of individual words? This helps determine if memory or time complexity is a major concern.
  2. Can the input array `words` be null or empty? If so, what should I return?
  3. If `k` is greater than the number of unique words in the input, what should I return? Should I return all the unique words?
  4. Can I assume that all words in the array consist of lowercase letters only, or do I need to handle uppercase letters, numbers, or special characters?
  5. Could you provide a few example inputs and their corresponding expected outputs to ensure my understanding of the problem, especially regarding the lexicographical ordering tie-breaker?

Brute Force Solution

Approach

The brute force method to find the most frequent words involves counting each word's occurrences and comparing those counts. We essentially look at all the words one by one, tally them, and then pick out the K most frequent.

Here's how the algorithm would work step-by-step:

  1. First, go through the list of words and for each unique word, count how many times it appears in the list.
  2. Next, once we have the counts for all the words, create a list of the words and their counts.
  3. Then, sort this list from the highest counts to the lowest.
  4. Finally, select the first 'K' words from this sorted list. These are the 'K' most frequent words.

Code Implementation

def top_k_frequent_words_brute_force(words, k_most_frequent):
    word_counts = {}
    for word in words:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

    # Convert the dictionary to a list of tuples
    word_frequency_list = list(word_counts.items())

    # Sort by frequency (descending), breaking ties lexicographically (ascending).
    word_frequency_list.sort(key=lambda item: (-item[1], item[0]))

    # Extract the top k words.
    top_k_words = [word for word, count in word_frequency_list[:k_most_frequent]]

    return top_k_words
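
We can sanity-check the function above against the examples from the problem statement (a minimal usage sketch, not part of the original solution):

words = ["i", "love", "leetcode", "i", "love", "coding"]
print(top_k_frequent_words_brute_force(words, 2))
# Expected output: ['i', 'love']

words = ["the", "day", "is", "sunny", "the", "the", "the", "sunny", "is", "is"]
print(top_k_frequent_words_brute_force(words, 4))
# Expected output: ['the', 'is', 'sunny', 'day']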

Big(O) Analysis

Time Complexity
O(n log n)
Counting word frequencies takes O(n) time, where n is the number of words. We then build a list of word-count pairs and sort it, which dominates the complexity: comparison sorts such as merge sort or heap sort run in O(m log m), where m is the number of unique words. In the worst case all words are unique, so m can equal n, giving an overall time complexity of O(n log n). Selecting the first K words from the sorted list then takes only O(K) time.
Space Complexity
O(N)
The solution first counts the frequency of each unique word, which requires storing each word and its count. In the worst case all words are unique, so this hash map holds N entries, where N is the total number of words in the input. The list of word-count pairs likewise holds up to N items, and sorting it adds no significant extra space. The auxiliary space is therefore dominated by the word-count hash map and the list of pairs, both of which grow linearly with N, giving O(N) space complexity.

Optimal Solution

Approach

To find the most frequent words, we'll first count how often each word appears. Then we'll use a min-heap (priority queue) capped at size K to quickly retrieve the top K words by frequency, efficiently handling ties in frequency by comparing the words alphabetically.

Here's how the algorithm would work step-by-step:

  1. First, count how many times each word shows up in the list of words. Think of this as making a tally sheet for each word.
  2. Next, organize the words based on how often they appear, so the most frequent words are easily accessible. We'll use a min-heap capped at size K: it keeps the candidates in order automatically as we add them and evicts the weakest candidate whenever it grows too large.
  3. When organizing, if two words appear the same number of times, put the one that comes earlier alphabetically first. This ensures consistent results.
  4. Finally, pick out the top K words from our organized list. These are the words that appeared most frequently.

Code Implementation

import heapq

class WordFrequency:
    # Wrapper so the min-heap evicts the correct word on frequency ties:
    # among equal counts, the lexicographically larger word is treated as
    # "smaller", so it is the one popped (evicted) first.
    def __init__(self, count, word):
        self.count = count
        self.word = word

    def __lt__(self, other):
        if self.count != other.count:
            return self.count < other.count
        return self.word > other.word

def top_k_frequent(words, k):
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1

    # Min-heap holding at most k candidates for the top k frequent words.
    min_heap = []
    for word, count in word_counts.items():
        heapq.heappush(min_heap, WordFrequency(count, word))
        # Maintain heap size of k by evicting the weakest candidate.
        if len(min_heap) > k:
            heapq.heappop(min_heap)

    # Pop the survivors (weakest first) and reverse, giving the words ordered
    # by descending frequency and ascending lexicographical order for ties.
    top_k_words = []
    while min_heap:
        top_k_words.append(heapq.heappop(min_heap).word)
    top_k_words.reverse()
    return top_k_words
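
For reference, the same O(n log k) selection can be written more compactly with the standard library. This is an alternative sketch rather than the approach described above: collections.Counter builds the tally and heapq.nsmallest performs the bounded-heap selection, with the (-count, word) key handling both the frequency ordering and the lexicographical tie-break.

import heapq
from collections import Counter

def top_k_frequent_concise(words, k):
    word_counts = Counter(words)
    # nsmallest maintains an internal heap of size k, so selection is O(n log k).
    # The key (-count, word) ranks higher counts first and breaks ties alphabetically.
    return heapq.nsmallest(k, word_counts, key=lambda word: (-word_counts[word], word))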

Big(O) Analysis

Time Complexity
O(n log k)
Counting word frequencies iterates through the input list of n words, taking O(n) time. A min-heap of size at most k maintains the top k frequent words: each push or pop costs O(log k), and this happens at most n times, giving O(n log k). Extracting the final k words from the heap takes O(k log k) time, which is dominated by O(n log k) since k ≤ n. The overall time complexity is therefore O(n + n log k), which simplifies to O(n log k) and meets the follow-up target.
Space Complexity
O(N)
The algorithm uses a hash map (dictionary) to store the frequency of each word, where N is the total number of words in the input list. In the worst case all words are distinct, so the hash map holds N key-value pairs. The min-heap holds at most k + 1 entries at any time, which is never more than N. The auxiliary space is therefore dominated by the frequency hash map, giving O(N) space complexity.

Edge Cases

  • words is null or empty: Return an empty list immediately since there are no words to count.
  • k is zero or negative: Return an empty list, as we cannot return a negative or zero number of elements.
  • k is larger than the number of unique words in words: Return all the unique words, sorted by frequency and then lexicographically.
  • words contains only one unique word repeated multiple times: The frequency map will contain a single entry, and the ordering logic correctly returns that one word.
  • All words have the same frequency: The ordering relies entirely on lexicographical order in this case.
  • words contains very long strings which could impact memory: Memory usage increases, but the algorithm remains functionally correct assuming memory is available.
  • Large input array size for words causing potential memory issues: Use efficient data structures (hash map, priority queue) and keep memory limitations in mind.
  • Integer overflow in frequency counts: Use a larger data type (e.g., long) for storing frequencies if needed to prevent integer overflow.
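
A minimal sketch of how the first three edge cases above could be guarded before calling the heap-based solution (the wrapper name top_k_frequent_safe is a hypothetical addition, not part of the original solution):

def top_k_frequent_safe(words, k):
    # Null/empty input or non-positive k: nothing meaningful to return.
    if not words or k <= 0:
        return []
    # If k exceeds the number of unique words, return all unique words,
    # still ordered by frequency and then lexicographically.
    k = min(k, len(set(words)))
    return top_k_frequent(words, k)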