
Drop Duplicate Rows

Easy
Google
Topics: Database Problems

Given a DataFrame customers with columns customer_id (int), name (object), and email (object), remove duplicate rows based on the email column, keeping only the first occurrence of each unique email address.

Example:

Input DataFrame:

+-------------+---------+--------------------+
| customer_id | name    | email              |
+-------------+---------+--------------------+
| 1           | Ella    | emily@example.com  |
| 2           | David   | michael@example.com|
| 3           | Zachary | sarah@example.com  |
| 4           | Alice   | john@example.com   |
| 5           | Finn    | john@example.com   |
| 6           | Violet  | alice@example.com  |
+-------------+---------+--------------------+

Expected Output DataFrame:

+-------------+---------+--------------------+
| customer_id | name    | email              |
+-------------+---------+--------------------+
| 1           | Ella    | emily@example.com  |
| 2           | David   | michael@example.com|
| 3           | Zachary | sarah@example.com  |
| 4           | Alice   | john@example.com   |
| 6           | Violet  | alice@example.com  |
+-------------+---------+--------------------+

Write a function that takes a pandas DataFrame as input and returns a new DataFrame with the duplicate emails removed. Explain the time and space complexity of your solution. Also, discuss any edge cases and how your solution handles them.

Solution


Clarifying Questions

When you're asked this question in a real interview, the prompt will often be ambiguous (especially at FAANG). Make sure to ask clarifying questions like these:

  1. What defines a duplicate row? Are we comparing all columns for equality, or is there a subset of columns that determines uniqueness?
  2. What are the data types of the values in each column? Are null or missing values possible?
  3. If multiple duplicate rows exist, should I return only one representative of the set of duplicates or all unique rows, preserving the original order?
  4. What is the expected output format? Should I return a new array/list with duplicates removed, or modify the existing data structure in place?
  5. What is the maximum number of rows and columns that the input data structure can contain?

Brute Force Solution

Approach

To remove duplicate rows the brute force way, we check every single row against every other row. We keep the first copy of each row we see and throw away any later duplicates.

Here's how the algorithm would work step-by-step:

  1. Start with the very first row in the dataset.
  2. Compare this row to every row that comes after it.
  3. If you find an exact match later in the dataset, mark that later row for removal; the earlier occurrence is the one we keep.
  4. If you don't find any matches, the row is unique and stays.
  5. Move on to the next row in the dataset and repeat the comparison against the rows that follow it.
  6. Continue this process until every row has been compared to the rows after it.
  7. Finally, remove all the rows that were marked as duplicates, leaving only the first occurrence of each unique row.

Code Implementation

def drop_duplicate_rows_brute_force(data):
    rows_to_remove = set()

    for current_row_index in range(len(data)):
        # Only compare against rows that come after the current one,
        # so the first occurrence of each duplicate set is kept.
        # (Comparing in both directions would mark both copies.)
        for other_row_index in range(current_row_index + 1, len(data)):
            # Check if the rows are identical
            if data[current_row_index] == data[other_row_index]:
                # Mark the later duplicate for removal; the set
                # ensures each index is only recorded once.
                rows_to_remove.add(other_row_index)

    # Build a new dataset without the duplicates
    unique_data = []
    for row_index in range(len(data)):
        # Keep the row only if it was not marked for removal
        if row_index not in rows_to_remove:
            unique_data.append(data[row_index])

    return unique_data

Big(O) Analysis

Time Complexity

O(n²). The algorithm iterates through each of the n rows in the dataset and compares it to every row that comes after it, so in the worst case every pair of rows is compared once. The number of comparisons is proportional to n * (n - 1) / 2, which simplifies to O(n²). (Each row comparison itself costs O(m) for m columns, so the strict bound is O(n² * m).)

Space Complexity

O(n). The algorithm tracks the indices of rows marked for removal in a set and builds a new list holding the unique rows. Both structures scale linearly with the number of rows n in the input dataset, so the auxiliary space is O(n).

Optimal Solution

Approach

To efficiently remove duplicate rows, we can utilize a data structure that automatically prevents duplicates. This allows us to process each row once and quickly identify and eliminate any identical entries.

Here's how the algorithm would work step-by-step:

  1. Go through each row of the table one at a time.
  2. For each row, turn its values into a single hashable key (a tuple of the row's values).
  3. Keep track of these keys in a set, a structure that never stores the same key twice.
  4. If a row's key is already in the set, you know you've seen that row before, so skip it.
  5. If the key is new, add it to the set and also keep the original row as part of the final result.
  6. In the end, the final result contains only the first occurrence of each unique row, since the set prevented any duplicates from being added.

Code Implementation

def drop_duplicate_rows(data):
    seen_rows = set()
    unique_rows = []

    for row in data:
        # Convert the row to a tuple so it is hashable; joining the
        # values into a string risks collisions (e.g. ["a,b"] vs ["a", "b"])
        row_key = tuple(row)

        # Skip the row if an identical one has already been seen
        if row_key in seen_rows:
            continue

        # Record the key and keep the first occurrence of the row
        seen_rows.add(row_key)
        unique_rows.append(row)

    return unique_rows
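
Note that these generic helpers compare entire rows, so on the problem's example data (where rows 4 and 5 share an email but have different customer_ids) they would keep all six rows. Here is a hedged sketch of a keyed variant that matches the problem's email-based expected output; the drop_duplicates_by_email name and email_index parameter are illustrative, not from the original:

def drop_duplicates_by_email(rows, email_index=2):
    seen_emails = set()
    unique_rows = []
    for row in rows:
        # Key each row by its email value alone rather than the full row
        email = row[email_index]
        if email in seen_emails:
            continue
        seen_emails.add(email)
        unique_rows.append(row)
    return unique_rows

customers = [
    [1, "Ella", "emily@example.com"],
    [2, "David", "michael@example.com"],
    [3, "Zachary", "sarah@example.com"],
    [4, "Alice", "john@example.com"],
    [5, "Finn", "john@example.com"],
    [6, "Violet", "alice@example.com"],
]

# Row 5 shares john@example.com with row 4, so only row 4 survives.
print(drop_duplicates_by_email(customers))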

Big(O) Analysis

Time Complexity

O(n * m). We iterate through each of the n rows in the table. For each row, building the tuple key and hashing it both take O(m) time, where m is the number of columns. Set insertion and membership checks take O(1) on average once the hash is computed, so the work per row is O(m). Across all n rows, the total time complexity is O(n * m).

Space Complexity

O(n). The algorithm stores one key in the set for each distinct row it encounters. In the worst case all rows are unique, so the set holds a key for every one of the n rows (strictly O(n * m) if you count the size of each key). The auxiliary space grows linearly with the number of rows, so the space complexity is O(n).
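
Finally, the problem statement asks for a pandas function, and pandas provides exactly this deduplication as DataFrame.drop_duplicates; its subset and keep parameters are the ones referenced in the edge cases below. A minimal sketch, assuming the column layout from the example:

import pandas as pd

def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # Deduplicate on the email column only, keeping the first occurrence
    # of each email; returns a new DataFrame rather than modifying in place.
    return customers.drop_duplicates(subset="email", keep="first")

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "name": ["Ella", "David", "Zachary", "Alice", "Finn", "Violet"],
    "email": [
        "emily@example.com", "michael@example.com", "sarah@example.com",
        "john@example.com", "john@example.com", "alice@example.com",
    ],
})

# Drops the row with customer_id 5, matching the expected output.
print(drop_duplicate_emails(customers))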

Edge Cases

  - Input dataframe is empty: return the empty dataframe unchanged; there are no rows to deduplicate.
  - Input dataframe contains only one row: return the original dataframe without modification, as there are no duplicates to remove.
  - All rows in the dataframe are identical: remove all but the first row, leaving a dataframe with a single unique row.
  - The dataframe contains null or missing values in some columns: decide explicitly how rows with nulls compare (treating nulls as equal or unequal), or exclude columns with nulls from the duplicate check.
  - The dataframe is very large and may exceed memory limits: consider chunking or streaming techniques to process the dataframe in smaller batches.
  - The 'subset' parameter is None or an empty list: treat this as comparing all columns, effectively dropping rows that are identical across the board.
  - The 'subset' parameter contains column names that don't exist in the dataframe: raise an error or exception indicating that the specified columns were not found.
  - The 'keep' parameter is not specified or has an invalid value: default to a sensible behavior, such as 'first', and reject values outside the allowed set.
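
For the pandas version, a short sketch of how drop_duplicates behaves on a few of these cases (based on pandas' documented behavior; worth double-checking against your pandas version):

import numpy as np
import pandas as pd

# Empty dataframe: returns an empty dataframe, no error.
empty = pd.DataFrame(columns=["customer_id", "name", "email"])
assert empty.drop_duplicates(subset="email").empty

# Null values: pandas treats NaN emails as equal to each other,
# so only the first NaN row is kept.
with_nulls = pd.DataFrame({"email": ["a@x.com", np.nan, np.nan]})
print(with_nulls.drop_duplicates(subset="email"))  # two rows remain

# Nonexistent subset column: pandas raises a KeyError.
try:
    with_nulls.drop_duplicates(subset="no_such_column")
except KeyError as error:
    print("missing column:", error)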