
Most Common Word

Easy
Datadog
Topics:
Arrays, Strings

Given a string paragraph and a string array banned of banned words, return the most frequent word that is not banned. It is guaranteed that at least one word is not banned and that the answer is unique.

The words in paragraph are case-insensitive and the answer should be returned in lowercase.

Note that words cannot contain punctuation symbols.

Example 1:

Input: paragraph = "Bob hit a ball, the hit BALL flew far after it was hit.", banned = ["hit"]
Output: "ball"
Explanation: 
"hit" occurs 3 times, but it is a banned word.
"ball" occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph. 
Note that words in the paragraph are not case sensitive,
that punctuation is ignored (even if adjacent to words, such as "ball,"), 
and that "hit" isn't the answer even though it occurs more because it is banned.

Example 2:

Input: paragraph = "a.", banned = []
Output: "a"

Constraints:

  • 1 <= paragraph.length <= 1000
  • paragraph consists of English letters, space ' ', or one of the symbols: "!?',;.".
  • 0 <= banned.length <= 100
  • 1 <= banned[i].length <= 10
  • banned[i] consists of only lowercase English letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Are the words in the `banned` list case-sensitive, and should I consider case when counting word frequency in the `paragraph`?
  2. What characters besides letters and spaces can appear in the `paragraph`, and how should punctuation be handled (e.g., should it be removed or treated as part of a word)?
  3. If there are multiple words with the same highest frequency and none are banned, which one should I return?
  4. Can the `paragraph` or `banned` list be empty, and if so, what should I return?
  5. What is the maximum length of the `paragraph` and the `banned` list?

Brute Force Solution

Approach

The brute force approach to finding the most common word is like meticulously counting each word one by one. We clean up the input text, then compare each word against every other word to see how many times it appears. Finally, we return the word that shows up the most, ignoring any words on a forbidden list.

Here's how the algorithm would work step-by-step:

  1. First, clean up the text by removing punctuation and making all letters lowercase. This ensures that words like 'The' and 'the' are treated as the same.
  2. Next, create a list of all the individual words in the cleaned-up text.
  3. For each word in the list, count how many times it appears in the entire list.
  4. However, before counting, check if the word is on the list of forbidden words. If it is, ignore it and move on to the next word.
  5. Keep track of the word with the highest count that is not forbidden.
  6. After checking every word, the word with the highest count is the most common word.

Code Implementation

def most_common_word_brute_force(paragraph, banned_words):
    normalized_paragraph = ''.join([character.lower() if character.isalnum() else ' ' for character in paragraph])

    words = normalized_paragraph.split()

    most_common_word = ''
    highest_frequency = 0

    for word in words:
        # We must ignore any banned words as they are invalid for the final result
        if word in banned_words:
            continue

        frequency = 0
        for other_word in words:
            if word == other_word:
                frequency += 1

        # Only update the most common word if this word has a higher frequency
        if frequency > highest_frequency:
            most_common_word = word
            highest_frequency = frequency

    return most_common_word
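As a quick sanity check, the brute-force routine can be exercised against the two examples from the problem statement. This is a self-contained sketch whose function body mirrors the implementation above; the test harness itself is our addition, not part of the original solution:

```python
def most_common_word_brute_force(paragraph, banned_words):
    # Lowercase letters/digits are kept; everything else becomes a separator.
    normalized = ''.join(c.lower() if c.isalnum() else ' ' for c in paragraph)
    words = normalized.split()

    best_word, best_count = '', 0
    for word in words:
        if word in banned_words:
            continue
        # Count this word's occurrences across the whole list (the O(n²) part).
        count = sum(1 for other in words if other == word)
        if count > best_count:
            best_word, best_count = word, count
    return best_word

print(most_common_word_brute_force(
    "Bob hit a ball, the hit BALL flew far after it was hit.", ["hit"]))  # ball
print(most_common_word_brute_force("a.", []))  # a
```

Both calls reproduce the expected outputs "ball" and "a" from Examples 1 and 2.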

Big(O) Analysis

Time Complexity
O(n²) — The dominant operation is comparing each word against every other word in the list to count its occurrences. If n is the number of words in the cleaned text, then for each of the n words we iterate through the full list (up to n comparisons) to count its occurrences. The time is therefore proportional to n * n, resulting in O(n²).
Space Complexity
O(N) — The algorithm creates a list of words from the input text, where N is the length of the input text, contributing O(N) space. Tracking the most common word and its count takes constant space, so the auxiliary space is dominated by the word list, resulting in O(N).

Optimal Solution

Approach

To find the most common word, we first need to clean up the input text and then count how often each word appears. We can efficiently keep track of these counts and ignore any words that are on a given ban list.

Here's how the algorithm would work step-by-step:

  1. First, prepare the text by converting everything to lowercase and removing any punctuation so we only have clean words.
  2. Next, create a system to count how many times each word appears. We'll increase a word's count each time we see it.
  3. As we are counting, check if a word is on the ban list. If it is, we'll completely ignore it and not count it at all.
  4. Finally, once we've processed all the words, find the word with the highest count. That's the most common word that isn't banned.

Code Implementation

import re

def most_common_word(paragraph, banned_words):
    # Replace every allowed punctuation symbol, including the apostrophe
    # from the constraint set "!?',;.", with a space before splitting.
    normalized_paragraph = re.sub(r"[!?',;.]", ' ', paragraph).lower()
    words = normalized_paragraph.split()

    # Use a set so banned-word lookups are O(1).
    banned_set = set(banned_words)

    word_counts = {}
    for word in words:
        # We need to ignore banned words from the counts.
        if word not in banned_set:
            word_counts[word] = word_counts.get(word, 0) + 1

    most_common_word_value = ''
    max_count = 0

    # Find the word with the highest count.
    for word, count in word_counts.items():
        if count > max_count:
            # We must track both the word and its count to determine the most common.
            most_common_word_value = word
            max_count = count

    return most_common_word_value
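The counting and maximum-finding steps can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch rather than the solution above, and the helper name `most_common_word_counter` is ours:

```python
import re
from collections import Counter

def most_common_word_counter(paragraph, banned):
    # Keep only runs of letters; punctuation and spaces act as separators.
    words = re.findall(r"[a-z]+", paragraph.lower())
    banned_set = set(banned)  # O(1) membership checks

    counts = Counter(word for word in words if word not in banned_set)
    # The problem guarantees at least one non-banned word, so counts is non-empty.
    word, _ = counts.most_common(1)[0]
    return word

print(most_common_word_counter(
    "Bob hit a ball, the hit BALL flew far after it was hit.", ["hit"]))  # ball
```

`Counter.most_common(1)` does the final maximum scan for us, keeping the function to a few lines while preserving the same O(n) behavior.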

Big(O) Analysis

Time Complexity
O(n) — Let n be the length of the paragraph in characters. Cleaning the text (lowercasing and replacing punctuation) touches each character once, and splitting and counting the resulting words is also bounded by n since the number of words cannot exceed the number of characters. Checking against the banned list is O(1) per word using a hash set. Finding the word with the maximum count iterates over the unique words, which is again at most proportional to n. Therefore, the overall time complexity is O(n).
Space Complexity
O(N + M) — The hash map stores one count per unique word in the cleaned text, where N is the number of unique words, and the banned words are stored in a set of size M. The auxiliary space is therefore proportional to the number of unique words plus the number of banned words, resulting in O(N + M).

Edge Cases

Case: Null or empty paragraph string
How to Handle: Return an empty string immediately, indicating no most common word exists.
Case: Null or empty banned word list
How to Handle: Treat as if no words are banned and proceed with the normal word counting process.
Case: Paragraph containing only banned words
How to Handle: Return an empty string since no non-banned word exists.
Case: Paragraph containing only punctuation or special characters
How to Handle: After cleaning, no words remain, so return an empty string.
Case: Extremely long paragraph string (memory concerns)
How to Handle: The word-count hashmap grows with the number of unique words; ensure it scales efficiently, potentially using a more memory-efficient data structure if necessary.
Case: Paragraph with mixed-case words that are otherwise identical
How to Handle: Convert all words to lowercase so counting is accurate regardless of case.
Case: Banned words containing punctuation or special characters
How to Handle: Apply the same cleaning to the banned words so comparisons match the cleaned paragraph.
Case: Integer overflow in word count
How to Handle: Python integers are arbitrary-precision, so overflow cannot occur; in languages with fixed-width integers, use a 64-bit type for the counts.
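The first four cases above translate into explicit guards. Here is a defensive sketch; the helper name and the empty-string fallback are our assumptions, since the problem statement itself guarantees a valid answer exists:

```python
import re
from collections import Counter

def most_common_word_safe(paragraph, banned):
    # Guard: null or empty paragraph -> no most common word exists.
    if not paragraph:
        return ''

    words = re.findall(r"[a-z]+", paragraph.lower())
    banned_set = set(banned or [])  # Guard: null/empty banned list
    counts = Counter(word for word in words if word not in banned_set)

    # Guard: only banned words, or only punctuation (no words at all).
    if not counts:
        return ''
    return counts.most_common(1)[0][0]

print(most_common_word_safe("hit hit.", ["hit"]))   # '' (all words banned)
print(most_common_word_safe("!?',;.", []))          # '' (no words remain)
```

Each guard returns an empty string, matching the handling described in the table above.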