Taro Logo

Top K Frequent Words

Medium
Meta logo
Meta
1 view
Topics:
ArraysStringsGreedy Algorithms

Given an array of strings words and an integer k, return the k most frequent strings. Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.

For example:

words = ["i","love","leetcode","i","love","coding"], k = 2. The expected output is ["i","love"]. "i" and "love" are the two most frequent words. Note that "i" comes before "love" due to a lower alphabetical order.

words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4. The expected output is ["the","is","sunny","day"]. "the", "is", "sunny" and "day" are the four most frequent words, with the number of occurrence being 4, 3, 2 and 1 respectively.

Consider edge cases such as:

  • What should be returned when words is empty?
  • What if k is zero, negative, or larger than the number of unique words in words?

How would you optimize the solution to achieve O(n log k) time complexity?

Solution


Naive Approach

A straightforward approach involves counting the frequency of each word and then sorting the words based on frequency and lexicographical order. This can be achieved using a hash map to store word frequencies and then sorting the map's entries.

  1. Count Frequencies: Use a hash map (dictionary) to store each word and its frequency.
  2. Sort: Convert the map into a list of (word, frequency) pairs. Sort this list based on the frequency in descending order. If frequencies are the same, sort lexicographically in ascending order.
  3. Extract Top K: Take the first k words from the sorted list.

Code (Python)

from collections import Counter

def topKFrequent(words, k):
    word_counts = Counter(words)
    sorted_words = sorted(word_counts.items(), key=lambda x: (-x[1], x[0]))
    return [word for word, count in sorted_words[:k]]

Time Complexity

  • O(n) to count word frequencies.
  • O(n log n) to sort the words.
  • Therefore, the overall time complexity is O(n log n).

Space Complexity

  • O(n) to store the word counts in the hash map.
  • O(n) to store the sorted list of words.
  • Overall, the space complexity is O(n).

Optimal Approach: Using a Heap

An optimized approach involves using a min-heap to keep track of the k most frequent words. This approach can achieve O(n log k) time complexity.

  1. Count Frequencies: Use a hash map (dictionary) to store each word and its frequency.
  2. Heapify: Create a min-heap of size k. The heap will store (frequency, word) tuples. Since we want to prioritize higher frequency words, and break ties lexicographically, we'll store the negative frequency. Python's heapq is a min-heap, so smaller values are prioritized.
  3. Iterate and Maintain Heap: Iterate through the word counts. If the current word's frequency is greater than the smallest frequency in the heap, or if the frequency is equal but the word is lexicographically smaller, replace the smallest element in the heap with the current word.
  4. Extract Top K: Pop elements from the heap to get the top k frequent words, reversing the order to obtain the desired sorted output.

Code (Python)

import heapq
from collections import Counter

def topKFrequent(words, k):
    word_counts = Counter(words)

    heap = []
    for word, count in word_counts.items():
        heapq.heappush(heap, (-count, word))

    result = []
    for _ in range(k):
        count, word = heapq.heappop(heap)
        result.append(word)

    return result

Time Complexity

  • O(n) to count word frequencies.
  • O(n log k) to maintain the heap of size k.
  • Therefore, the overall time complexity is O(n log k).

Space Complexity

  • O(n) to store the word counts in the hash map.
  • O(k) to store the heap.
  • Overall, the space complexity is O(n).

Edge Cases

  • Empty Input: If the input array words is empty, return an empty list.
  • k > Number of Unique Words: If k is greater than the number of unique words, return all unique words sorted by frequency and lexicographical order.
  • All Words Same Frequency: When all words have the same frequency, ensure the lexicographical order is correctly maintained.