Given an integer array data
representing byte values, determine if it's a valid UTF-8 encoding.
A UTF-8 character can be 1 to 4 bytes long, following these rules:
0
, followed by the Unicode.n
ones, a 0
, then n-1
continuation bytes. Continuation bytes start with 10
.Here's a visual:
Bytes | UTF-8 Octet Sequence
------|----------------------
1 | 0xxxxxxx
2 | 110xxxxx 10xxxxxx
3 | 1110xxxx 10xxxxxx 10xxxxxx
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
x
can be 0
or 1
.
Important: Each integer in data
represents one byte (least significant 8 bits).
Example 1:
data = [197, 130, 1]
Output: true
Explanation: 11000101 10000010 00000001. A 2-byte char + 1-byte char.
Example 2:
data = [235, 140, 4]
Output: false
Explanation: 11101011 10001100 00000100. A 3-byte char, but the 3rd byte isn't a continuation byte (doesn't start with 10).
Write a function to validate if the input data
array is a valid UTF-8 encoding.
What are some edge cases to consider?
Given an integer array data
representing the data, determine whether it is a valid UTF-8 encoding.
A character in UTF-8 can be 1 to 4 bytes long, subject to the following rules:
Number of Bytes | UTF-8 Octet Sequence (binary) |
---|---|
1 | 0xxxxxxx |
2 | 110xxxxx 10xxxxxx |
3 | 1110xxxx 10xxxxxx 10xxxxxx |
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.
The brute-force approach would involve iterating through the data
array and, for each byte, determining how many bytes the current UTF-8 character should be. Based on that, we check if the subsequent bytes follow the 10xxxxxx
format. If any byte violates the UTF-8 encoding rules, we return false
. If we reach the end of the array without finding any violations, we return true
.
The optimal approach builds upon the naive approach but refines it for efficiency. We still iterate through the data
array, checking the leading bits of each byte to determine the character length. Instead of repeated bitwise operations, we maintain a bytes_to_follow
counter to keep track of the expected number of continuation bytes. This avoids redundant checks and makes the logic more concise.
bytes_to_follow
to 0.data
array.
bytes_to_follow
is 0:
bytes_to_follow
accordingly.false
.10
).false
.bytes_to_follow
.bytes_to_follow
is not 0, return false
(incomplete character).true
.10xxxxxx
at the beginning): Should return false
.false
.class Solution:
def validUtf8(self, data: List[int]) -> bool:
bytes_to_follow = 0
for byte in data:
if bytes_to_follow == 0:
if (byte >> 7) == 0:
continue # 1-byte character
elif (byte >> 5) == 0b110:
bytes_to_follow = 1 # 2-byte character
elif (byte >> 4) == 0b1110:
bytes_to_follow = 2 # 3-byte character
elif (byte >> 3) == 0b11110:
bytes_to_follow = 3 # 4-byte character
else:
return False # Invalid start byte
else:
if (byte >> 6) != 0b10:
return False # Invalid continuation byte
bytes_to_follow -= 1
return bytes_to_follow == 0 # Check for incomplete character
The time complexity is O(N), where N is the length of the data
array, as we iterate through the array once.
The space complexity is O(1), as we only use a constant amount of extra space to store the bytes_to_follow
counter.