Given a DataFrame customers with columns customer_id (int), name (object), and email (object), remove duplicate rows based on the email column, keeping only the first occurrence of each unique email address.
Example:
Input DataFrame:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+
Expected Output DataFrame:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+
Write a function that takes a pandas DataFrame as input and returns a new DataFrame with the duplicate emails removed. Explain the time and space complexity of your solution. Also, discuss any edge cases and how your solution handles them.
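For reference, the idiomatic pandas answer to the question as stated is a thin wrapper around DataFrame.drop_duplicates. The function name remove_duplicate_emails below is our own choice, not part of the problem statement; the rest uses only documented pandas behavior.

```python
import pandas as pd

def remove_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # keep="first" retains the earliest row for each unique email address
    return customers.drop_duplicates(subset="email", keep="first")

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "name": ["Ella", "David", "Zachary", "Alice", "Finn", "Violet"],
    "email": ["emily@example.com", "michael@example.com", "sarah@example.com",
              "john@example.com", "john@example.com", "alice@example.com"],
})
print(remove_duplicate_emails(customers)["customer_id"].tolist())  # [1, 2, 3, 4, 6]
```

drop_duplicates hashes the values in the subset column, so this runs in O(n) time with O(n) extra space for n rows. The sections below build up to that behavior from first principles.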
When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask clarifying questions in that case: for example, which columns define a duplicate, whether to keep the first or last occurrence, and how null or missing emails should be treated.
To remove duplicate rows the brute force way, we check every single row against every other row. We'll keep track of the rows that are unique and throw away the ones we see more than once.
Here's how the algorithm would work step-by-step:
def drop_duplicate_rows_brute_force(data):
    rows_to_remove = []
    for current_row_index in range(len(data)):
        # Only compare against later rows so the first occurrence is kept
        for other_row_index in range(current_row_index + 1, len(data)):
            # Check if the later row is identical to the earlier one
            if data[current_row_index] == data[other_row_index]:
                # Mark the duplicate row for removal
                # Ensures a row is only added once
                if other_row_index not in rows_to_remove:
                    rows_to_remove.append(other_row_index)
    # Build a new dataset without the duplicates
    unique_data = []
    for row_index in range(len(data)):
        # Keep only rows that are not marked for removal
        if row_index not in rows_to_remove:
            unique_data.append(data[row_index])
    return unique_data
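The nested scan above can be sanity-checked on plain Python rows. This standalone sketch repeats the brute-force loop (under a hypothetical name, dedupe_rows_quadratic) so it runs on its own; the nested loops make it O(n^2) time, with O(n) extra space for the removal list and output.

```python
def dedupe_rows_quadratic(data):
    # Compare every row against every later row: O(n^2) comparisons
    rows_to_remove = set()
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if data[i] == data[j]:
                rows_to_remove.add(j)  # drop the later copy, keep the first
    return [row for i, row in enumerate(data) if i not in rows_to_remove]

rows = [[1, "a@x.com"], [2, "b@x.com"], [1, "a@x.com"]]
print(dedupe_rows_quadratic(rows))  # [[1, 'a@x.com'], [2, 'b@x.com']]
```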
To efficiently remove duplicate rows, we can utilize a data structure that automatically prevents duplicates. This allows us to process each row once and quickly identify and eliminate any identical entries.
Here's how the algorithm would work step-by-step:
def drop_duplicate_rows(data):
    seen_rows = set()
    unique_rows = []
    for row in data:
        # Convert the row to a tuple so it is hashable and can live in a set
        row_key = tuple(row)
        # Skip processing if an identical row was already seen
        if row_key in seen_rows:
            continue
        # Record the row key and keep this first occurrence
        seen_rows.add(row_key)
        unique_rows.append(row)
    return unique_rows
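For comparison, here is a standalone version of the set-based pass (under a hypothetical name, dedupe_rows_linear). Each row is visited once and each set lookup is expected O(1), giving O(n) time and O(n) space; the trade-off is that every value in a row must be hashable so the row can be converted to a tuple key.

```python
def dedupe_rows_linear(data):
    seen = set()
    unique = []
    for row in data:
        key = tuple(row)  # tuples are hashable; lists are not
        if key not in seen:
            seen.add(key)
            unique.append(row)  # first occurrence wins
    return unique

rows = [[1, "a@x.com"], [2, "b@x.com"], [1, "a@x.com"]]
print(dedupe_rows_linear(rows))  # [[1, 'a@x.com'], [2, 'b@x.com']]
```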
| Case | How to Handle |
|---|---|
| Input dataframe is empty | Return an empty dataframe (or the original empty dataframe unchanged), depending on the requirement. |
| Input dataframe contains only one row | Return the original dataframe unchanged, since there are no duplicates to remove. |
| All rows in the dataframe are identical | Remove all but the first row, leaving a dataframe with a single unique row. |
| The dataframe contains null or missing values in some columns | Decide explicitly how rows with nulls compare (treat nulls as equal or unequal), or exclude null-bearing columns from the duplicate check. |
| The dataframe is very large and may exceed memory limits | Process the dataframe in smaller batches using chunking or streaming techniques. |
| The 'subset' parameter is null or an empty list | Treat this as comparing all columns for duplicates, effectively dropping rows that are identical across the board. |
| The 'subset' parameter contains column names that don't exist in the dataframe | Raise an error or exception indicating that the specified columns were not found. |
| The 'keep' parameter is not specified, or has an invalid value | Default to a sensible behavior such as 'first' when it is not explicitly provided, and reject invalid values. |
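Several of these cases can be confirmed directly against pandas' own drop_duplicates. The snippet below is a sketch checked against recent pandas versions; in particular, pandas treats nulls as equal to each other when deduplicating, and raises KeyError for subset columns that do not exist.

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", None, None, "a@x.com"]})

# Empty input: deduplicating an empty frame returns an empty frame
assert df.iloc[0:0].drop_duplicates().empty

# Nulls: None/NaN values compare equal to each other, so only one null row survives
deduped = df.drop_duplicates(subset="email")
print(len(deduped))  # 2 -> one 'a@x.com' row and one null row remain

# Unknown subset column: pandas raises KeyError
try:
    df.drop_duplicates(subset=["no_such_column"])
except KeyError:
    print("KeyError for unknown subset column")
```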