UTF-8 Validation

Medium

UTF-8 Validation

Medium

Solution

UTF-8 Encoding Validation

Problem Description

Given an integer array data representing the data, determine whether it is a valid UTF-8 encoding.

A character in UTF-8 can be 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones, the (n + 1)th bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

Number of Bytes	UTF-8 Octet Sequence (binary)
1	0xxxxxxx
2	110xxxxx 10xxxxxx
3	1110xxxx 10xxxxxx 10xxxxxx
4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.

Naive Approach

The brute-force approach would involve iterating through the data array and, for each byte, determining how many bytes the current UTF-8 character should be. Based on that, we check if the subsequent bytes follow the 10xxxxxx format. If any byte violates the UTF-8 encoding rules, we return false. If we reach the end of the array without finding any violations, we return true.

Optimal Approach

The optimal approach builds upon the naive approach but refines it for efficiency. We still iterate through the data array, checking the leading bits of each byte to determine the character length. Instead of repeated bitwise operations, we maintain a bytes_to_follow counter to keep track of the expected number of continuation bytes. This avoids redundant checks and makes the logic more concise.

Algorithm

Initialize bytes_to_follow to 0.
Iterate through the data array.
- If bytes_to_follow is 0:
  - Check the leading bits to determine the character length (1 to 4 bytes).
  - Update bytes_to_follow accordingly.
  - If the leading bits don't match a valid UTF-8 start byte, return false.
- Else:
  - Check if the current byte is a continuation byte (starts with 10).
  - If not, return false.
  - Decrement bytes_to_follow.
After the loop, if bytes_to_follow is not 0, return false (incomplete character).
Otherwise, return true.

Edge Cases

Empty input array: Should be considered valid.
Invalid start byte (e.g., 10xxxxxx at the beginning): Should return false.
Incomplete multi-byte character: Should return false.
Bytes with value greater than 255: The problem statement specifies that only the least significant 8 bits are used, so values will be in the range 0-255. No explicit check needed.

Code

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        bytes_to_follow = 0
        for byte in data:
            if bytes_to_follow == 0:
                if (byte >> 7) == 0:
                    continue  # 1-byte character
                elif (byte >> 5) == 0b110:
                    bytes_to_follow = 1  # 2-byte character
                elif (byte >> 4) == 0b1110:
                    bytes_to_follow = 2  # 3-byte character
                elif (byte >> 3) == 0b11110:
                    bytes_to_follow = 3  # 4-byte character
                else:
                    return False  # Invalid start byte
            else:
                if (byte >> 6) != 0b10:
                    return False  # Invalid continuation byte
                bytes_to_follow -= 1
        return bytes_to_follow == 0  # Check for incomplete character

Time Complexity

The time complexity is O(N), where N is the length of the data array, as we iterate through the array once.

Space Complexity

The space complexity is O(1), as we only use a constant amount of extra space to store the bytes_to_follow counter.

Solution

UTF-8 Encoding Validation

Problem Description

Given an integer array data representing the data, determine whether it is a valid UTF-8 encoding.

A character in UTF-8 can be 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones, the (n + 1)th bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

Number of Bytes	UTF-8 Octet Sequence (binary)
1	0xxxxxxx
2	110xxxxx 10xxxxxx
3	1110xxxx 10xxxxxx 10xxxxxx
4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.

Naive Approach

Optimal Approach

Algorithm

Initialize bytes_to_follow to 0.
Iterate through the data array.
- If bytes_to_follow is 0:
  - Check the leading bits to determine the character length (1 to 4 bytes).
  - Update bytes_to_follow accordingly.
  - If the leading bits don't match a valid UTF-8 start byte, return false.
- Else:
  - Check if the current byte is a continuation byte (starts with 10).
  - If not, return false.
  - Decrement bytes_to_follow.
After the loop, if bytes_to_follow is not 0, return false (incomplete character).
Otherwise, return true.

Edge Cases

Empty input array: Should be considered valid.
Invalid start byte (e.g., 10xxxxxx at the beginning): Should return false.
Incomplete multi-byte character: Should return false.
Bytes with value greater than 255: The problem statement specifies that only the least significant 8 bits are used, so values will be in the range 0-255. No explicit check needed.

Code

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        bytes_to_follow = 0
        for byte in data:
            if bytes_to_follow == 0:
                if (byte >> 7) == 0:
                    continue  # 1-byte character
                elif (byte >> 5) == 0b110:
                    bytes_to_follow = 1  # 2-byte character
                elif (byte >> 4) == 0b1110:
                    bytes_to_follow = 2  # 3-byte character
                elif (byte >> 3) == 0b11110:
                    bytes_to_follow = 3  # 4-byte character
                else:
                    return False  # Invalid start byte
            else:
                if (byte >> 6) != 0b10:
                    return False  # Invalid continuation byte
                bytes_to_follow -= 1
        return bytes_to_follow == 0  # Check for incomplete character

Time Complexity

The time complexity is O(N), where N is the length of the data array, as we iterate through the array once.

Space Complexity

The space complexity is O(1), as we only use a constant amount of extra space to store the bytes_to_follow counter.