Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones, the (n + 1)-th bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.
For example, data = [197,130,1] should return true because it represents the octet sequence 11000101 10000010 00000001. This is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character.
However, data = [235,140,4] should return false because it represents the octet sequence 11101011 10001100 00000100. The first 3 bits are all ones and the 4th bit is 0, meaning it is a 3-byte character. The next byte is a continuation byte that starts with 10, which is correct. But the second continuation byte does not start with 10, so the sequence is invalid.
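To make the bit patterns in the two examples concrete, here is a small sketch that prints each input as octets and counts the leading 1 bits of every byte. The helper names to_octets and leading_ones are hypothetical and not part of the original solution; masking with 0xFF assumes only the low 8 bits of each integer carry data.

```python
def to_octets(data):
    """Render each integer as an 8-bit binary string (low 8 bits only, by assumption)."""
    return " ".join(format(num & 0xFF, "08b") for num in data)


def leading_ones(byte):
    """Count the consecutive 1 bits at the most significant end of a byte."""
    count = 0
    mask = 0b10000000
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


for data in ([197, 130, 1], [235, 140, 4]):
    print(to_octets(data), [leading_ones(b) for b in data])
# 11000101 10000010 00000001 [2, 1, 0]
# 11101011 10001100 00000100 [3, 1, 0]
```

A count of 0 marks a 1-byte character, a count of 1 marks a continuation byte, and counts of 2 to 4 mark the first byte of a multi-byte character. In the second example the last count is 0 where a continuation byte (count 1) was expected, which is why the input is invalid.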
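def validUtf8(data):
    # Number of bytes still expected for the current character.
    n_bytes = 0
    for num in data:
        bin_rep = bin(num)[2:].zfill(8)
        if n_bytes == 0:
            # Start of a new character: count the leading 1 bits.
            for bit in bin_rep:
                if bit == '0':
                    break
                n_bytes += 1
            if n_bytes == 0:
                # 1-byte character (leading bit is 0).
                continue
            if n_bytes > 4 or n_bytes == 1:
                # Too many leading ones, or a stray continuation byte.
                return False
        else:
            # Continuation bytes must start with 10.
            if not (bin_rep[0] == '1' and bin_rep[1] == '0'):
                return False
        n_bytes -= 1
    return n_bytes == 0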
The naive approach would be to iterate through the data array and for each integer (byte), check the leading bits to determine the number of bytes for the UTF-8 character. Then, validate the subsequent bytes according to the UTF-8 encoding rules.
The provided Python code validUtf8(data) implements an optimal solution to determine whether a given integer array data represents a valid UTF-8 encoding.
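The same checks can also be written with bit shifts instead of a string conversion. The following is a minimal sketch of that variant, not the original solution: the name valid_utf8_bitwise is hypothetical, and masking with 0xFF assumes that only the low 8 bits of each integer are meaningful.

```python
def valid_utf8_bitwise(data):
    """Validate the byte patterns using shifts rather than binary strings."""
    remaining = 0  # continuation bytes still expected for the current character
    for num in data:
        byte = num & 0xFF  # assumption: only the low 8 bits carry data
        if remaining == 0:
            if byte >> 7 == 0b0:        # 0xxxxxxx: 1-byte character
                continue
            elif byte >> 5 == 0b110:    # 110xxxxx: 2-byte character
                remaining = 1
            elif byte >> 4 == 0b1110:   # 1110xxxx: 3-byte character
                remaining = 2
            elif byte >> 3 == 0b11110:  # 11110xxx: 4-byte character
                remaining = 3
            else:                       # 10xxxxxx or five or more leading ones
                return False
        else:
            if byte >> 6 != 0b10:       # continuation bytes must be 10xxxxxx
                return False
            remaining -= 1
    return remaining == 0


print(valid_utf8_bitwise([197, 130, 1]))  # True
print(valid_utf8_bitwise([235, 140, 4]))  # False
```

Each byte costs a handful of shifts and comparisons, so the pass stays linear in the input length with constant extra space.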
The time complexity of the validUtf8 function is O(N), where N is the number of bytes in the input array data. The function iterates through each byte in the array once. Inside the loop, the number of operations performed for each byte is constant. Specifically, determining the number of leading ones in the first byte of a character takes at most 8 iterations (since a byte has 8 bits). Checking continuation bytes involves a constant number of operations. Thus, the overall time complexity is linear with respect to the input size.
The space complexity of the validUtf8 function is O(1), which means it uses a constant amount of extra space. The function uses a single variable, n_bytes, to keep track of the number of remaining bytes in a multi-byte UTF-8 character. The bin_rep variable stores the binary representation of a byte, but its size is constant because it always holds 8 bits. The space used by these variables does not depend on the input size, hence the constant space complexity.
If data is empty, the code will not enter the loop and will return n_bytes == 0, which is True. This is a valid UTF-8 encoding as there are no characters to validate.
If a leading byte starts with more than four 1s (e.g., 11111000), it is not a valid UTF-8 sequence. The code handles this case by checking if n_bytes > 4 and returning False.
If a continuation byte does not start with 10, the code returns False. This is handled by the if not (bin_rep[0] == '1' and bin_rep[1] == '0') condition.
If the input ends while continuation bytes are still expected (n_bytes > 0), it means the UTF-8 sequence is incomplete and invalid. The final return n_bytes == 0 checks if all expected continuation bytes were encountered.
A single-byte character starts with 0, so the leading-ones count is 0 and the loop continues with n_bytes still at 0, ready for the next character.
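The edge cases above can be checked directly. The sketch below restates validUtf8 so the snippet runs on its own; the inputs in edge_cases are illustrative values chosen for this example, not test data from the original.

```python
def validUtf8(data):
    n_bytes = 0
    for num in data:
        bin_rep = bin(num)[2:].zfill(8)
        if n_bytes == 0:
            for bit in bin_rep:
                if bit == '0':
                    break
                n_bytes += 1
            if n_bytes == 0:
                continue
            if n_bytes > 4 or n_bytes == 1:
                return False
        else:
            if not (bin_rep[0] == '1' and bin_rep[1] == '0'):
                return False
        n_bytes -= 1
    return n_bytes == 0


# (input, expected result) pairs, one per edge case described above.
edge_cases = [
    ([], True),         # empty input
    ([248], False),     # 11111000: five leading ones
    ([197, 1], False),  # continuation byte does not start with 10
    ([197], False),     # input ends while a continuation byte is expected
    ([1], True),        # single-byte character
    ([130], False),     # stray continuation byte where a new character should start
]

for data, expected in edge_cases:
    assert validUtf8(data) == expected, (data, expected)
print("all edge cases behave as described")
```

The last case covers the n_bytes == 1 check in the code: a byte of the form 10xxxxxx cannot begin a character.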