Remove Duplicate Characters

Medium
10 years ago

Let's work on a string manipulation problem. I'd like you to write a function that takes a string as input and returns a new string containing only the unique characters from the original string, preserving the original order of appearance. In other words, remove all duplicate characters from the string.

Here are some examples to clarify the requirements:

  • Input: "abcabcbb", Output: "abc"
  • Input: "banana", Output: "ban"
  • Input: "hello", Output: "helo"
  • Input: "", Output: ""
  • Input: "aabbcc", Output: "abc"
  • Input: "abcdefg", Output: "abcdefg"
Sample Answer

Remove Duplicate Characters

Let's start by addressing the problem of removing duplicate characters from a string. I'll walk through a naive approach first, then optimize it for better performance. I've faced this kind of problem before, especially when working with data cleaning pipelines at Amazon, where I focused on ensuring data quality and uniqueness. I'm based in Seattle now.

1. Naive (Brute Force) Solution

The most straightforward way to remove duplicates is to iterate through the string and, for each character, check if it has already appeared in the result. If it hasn't, we add it to the result.

python def remove_duplicates_naive(s): result = "" for char in s: if char not in result: result += char return result

2. Optimal Solution (Using a Set)

A more efficient approach leverages a set to keep track of the characters we've already seen. Checking for membership in a set is typically O(1) on average, leading to significant performance improvements.

python def remove_duplicates_optimal(s): seen = set() result = "" for char in s: if char not in seen: result += char seen.add(char) return result

3. Big(O) Run-time Complexity

  • Naive Solution: O(n^2), where n is the length of the string. This is because, for each character, we iterate through the result string to check for its existence. String 'in' operation takes O(n) time.
  • Optimal Solution: O(n), where n is the length of the string. We iterate through the string once, and set lookups/insertions are O(1) on average.

4. Big(O) Space Usage

  • Naive Solution: O(n) in the worst case, where n is the length of the string. This occurs if all characters in the string are unique, and we store all of them in the result string.
  • Optimal Solution: O(n) in the worst case, where n is the length of the string. This occurs if all characters are unique, and we store each in the seen set.

5. Edge Cases

  • Empty String: If the input string is empty, both functions will return an empty string, which is the expected behavior.

  • String with all Duplicate Characters: If the string contains only duplicate characters (e.g., "aaaa"), the result will be a string with only one occurrence of that character (e.g., "a").

  • Null Input: In a real-world scenario, you'd likely want to add a check for None input and raise an exception or return an appropriate error code. For example:

    python def remove_duplicates_optimal(s): if s is None: raise ValueError("Input string cannot be None") seen = set() result = "" for char in s: if char not in seen: result += char seen.add(char) return result

6. Code

I've included the Python code snippets above for both solutions.