Taro Logo

Report Spam Message

Medium
Asked by:
Profile picture
Profile picture
26 views
Topics:
ArraysStrings

You are given an array of strings message and an array of strings bannedWords.

An array of words is considered spam if there are at least two words in it that exactly match any word in bannedWords.

Return true if the array message is spam, and false otherwise.

Example 1:

Input: message = ["hello","world","leetcode"], bannedWords = ["world","hello"]

Output: true

Explanation:

The words "hello" and "world" from the message array both appear in the bannedWords array.

Example 2:

Input: message = ["hello","programming","fun"], bannedWords = ["world","programming","leetcode"]

Output: false

Explanation:

Only one word from the message array ("programming") appears in the bannedWords array.

Constraints:

  • 1 <= message.length, bannedWords.length <= 105
  • 1 <= message[i].length, bannedWords[i].length <= 15
  • message[i] and bannedWords[i] consist only of lowercase English letters.

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. Can a message contain multiple keywords, and should I report the index only once in that case?
  2. Are the keywords case-sensitive? Should I perform a case-insensitive comparison?
  3. What is the expected format of the output? Should the indices be sorted?
  4. Can the `messages` or `keywords` list be empty or null? If so, what should the function return?
  5. What are the maximum lengths of individual messages and keywords strings?

Brute Force Solution

Approach

The brute force approach to identifying spam messages involves checking every possible combination of words and comparing them against a list of known spam phrases. It’s like trying every key on a keyring until you find the one that opens the lock. This method guarantees finding spam but might take a very long time.

Here's how the algorithm would work step-by-step:

  1. Take the entire message and see if it exactly matches any known spam message.
  2. If not, look at every possible chunk of words in the message, starting with just two words and expanding to larger chunks.
  3. For each chunk of words, check if it exactly matches any known spam phrase.
  4. If you find a match, report the message as spam.
  5. Repeat this process for all possible chunks of words until you have exhausted all combinations.
  6. If, after checking every combination, you haven't found any matches, conclude that the message is not spam, based on the knowledge you have available.

Code Implementation

def report_spam_message(messages, spam_patterns):
   spam_report = []

   for message in messages:
       is_spam = False

       # Iterate through each spam pattern to check against the message
       for spam_pattern in spam_patterns:

           # If the spam pattern exists in the message, flag it as spam
           if spam_pattern in message:
               is_spam = True
               break

       spam_report.append(is_spam)

   return spam_report

Big(O) Analysis

Time Complexity
O(n^3 * m)Let n be the number of words in the input message and m be the number of spam phrases. The algorithm iterates through all possible sub-sequences (chunks of words) within the message. There are approximately n^2 such subsequences (the outer loops creating the starting and ending indices of a sub-sequence). For each subsequence, the algorithm compares it against all m known spam phrases. Thus, the overall time complexity is O(n^2 * m), plus the cost to compare each sub-sequence which is O(n). This comparison needs to check if the sub-sequence (of length n at worst) exactly matches a spam phrase. So we can consider this as O(n^3 * m).
Space Complexity
O(1)The algorithm's space complexity is primarily determined by the number of known spam phrases which does not depend on the input message size (N). The dominant space usage comes from storing a fixed list of known spam phrases which is independent of the input message size, N. Temporary string variables to store substrings (chunks of words) are also created, but their size is bounded by the length of the longest spam phrase, not N, and remains constant. Therefore, the auxiliary space used remains constant regardless of the input message size, N.

Optimal Solution

Approach

To report spam messages efficiently, we'll use a technique similar to how email providers filter unwanted content. The main idea is to break down the message and check it against known spam indicators.

Here's how the algorithm would work step-by-step:

  1. First, divide the message into individual parts, such as words or phrases.
  2. Then, compare each part against a list of common spam words, phrases, and patterns. This list is constantly updated with new spam trends.
  3. For each match found, increase a spam score for the message.
  4. Also, analyze the sender's information and compare it to a list of known spam senders or suspicious patterns.
  5. Adjust the spam score based on how suspicious the sender appears.
  6. If the total spam score exceeds a certain threshold, mark the message as spam and report it.
  7. Periodically review the reported messages to improve the spam detection process and adjust the spam indicator lists for better accuracy in the future.

Code Implementation

def report_spam_message(messages):
    seen_messages = set()
    spam_reports = []

    for message in messages:
        # Use a set to efficiently check if we've seen this message before.
        if message in seen_messages:

            spam_reports.append(message)
        else:
            # Add the new message to the set of seen messages.
            seen_messages.add(message)

    return spam_reports

Big(O) Analysis

Time Complexity
O(m + s + k)Splitting the message into parts takes O(m) time, where m is the length of the message. Comparing each part against a list of common spam words/phrases takes O(s) time, where s is the total length of all spam words/phrases. Analyzing the sender's information and comparing it to a list of known spam senders takes O(k) time, where k is the number of known spam senders. The spam score calculation and threshold check are constant time operations. Therefore, the overall time complexity is O(m + s + k).
Space Complexity
O(S + U)The space complexity is determined by the size of the list of common spam words, phrases, and patterns (S), and the list of known spam senders or suspicious patterns (U). We are essentially storing these two lists in memory to perform comparisons against the message parts and sender information. The space required grows linearly with the combined size of these lists, so the overall auxiliary space complexity is O(S + U), where S is the size of the spam word/phrase list and U is the size of the spam sender list. Other variables such as spam score take constant space.

Edge Cases

messages is null or empty
How to Handle:
Return an empty list if messages is null or empty to avoid NullPointerException or incorrect processing.
keywords is null or empty
How to Handle:
If keywords is null or empty, return an empty list because no spam can be detected.
messages contains empty strings
How to Handle:
Handle empty strings in messages by treating them as non-spam, i.e., skipping them during keyword search.
keywords contains empty strings
How to Handle:
An empty string in keywords will match every message, so return indices of all messages.
Case sensitivity of keywords in messages
How to Handle:
Convert both messages and keywords to lowercase for case-insensitive matching.
Large number of messages and keywords
How to Handle:
Ensure efficient string searching (e.g., using a set for keywords lookup) to maintain acceptable performance with large datasets.
Special characters in messages or keywords
How to Handle:
Sanitize or escape special characters in messages and keywords to prevent regex-related issues or incorrect matches.
Message contains multiple instances of the same keyword
How to Handle:
The problem requires only one instance of a keyword to flag as spam, so no special handling is needed, presence is the key.