Given an integer array data
representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
0
, followed by its Unicode code.n
bits are all one's, the n + 1
bit is 0
, followed by n - 1
bytes with the most significant 2
bits being 10
.This is how the UTF-8 encoding would work:
Number of Bytes | UTF-8 Octet Sequence | (binary) --------------------+----------------------------------------- 1 | 0xxxxxxx 2 | 110xxxxx 10xxxxxx 3 | 1110xxxx 10xxxxxx 10xxxxxx 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
x
denotes a bit in the binary form of a byte that may be either 0
or 1
.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Example 1:
Input: data = [197,130,1] Output: true Explanation: data represents the octet sequence: 11000101 10000010 00000001. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.
Example 2:
Input: data = [235,140,4] Output: false Explanation: data represented the octet sequence: 11101011 10001100 00000100. The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that's correct. But the second continuation byte does not start with 10, so it is invalid.
Constraints:
1 <= data.length <= 2 * 104
0 <= data[i] <= 255
When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:
The brute force approach to validating UTF-8 data involves checking every possible sequence of bytes to see if it conforms to the UTF-8 encoding rules. It essentially simulates decoding each byte and verifying it against all possible scenarios. This method directly applies the UTF-8 encoding rules exhaustively.
Here's how the algorithm would work step-by-step:
def is_valid_utf8_brute_force(data):
data_length = len(data)
for start_index in range(data_length):
index = start_index
while index < data_length:
first_byte = data[index]
# Determine the number of bytes for the character
if (first_byte >> 7) == 0:
number_of_bytes = 1
elif (first_byte >> 5) == 0b110:
number_of_bytes = 2
elif (first_byte >> 4) == 0b1110:
number_of_bytes = 3
elif (first_byte >> 3) == 0b11110:
number_of_bytes = 4
else:
# Invalid starting byte
return False
# Check for sufficient remaining bytes
if index + number_of_bytes > data_length:
return False
# Check subsequent bytes for multi-byte characters
for j in range(1, number_of_bytes):
if (data[index + j] >> 6) != 0b10:
return False
index += number_of_bytes
return True
The key to efficiently validating UTF-8 encoded data is to recognize the bit patterns that define the number of bytes a character occupies. We check the first byte to determine the character's length and then validate the subsequent bytes accordingly.
Here's how the algorithm would work step-by-step:
def utf8_validation(data):
number_of_bytes_to_process = 0
for byte in data:
if number_of_bytes_to_process == 0:
# Check for single byte character
if (byte >> 7) == 0b0:
continue
# Determine number of continuation bytes
elif (byte >> 5) == 0b110:
number_of_bytes_to_process = 1
elif (byte >> 4) == 0b1110:
number_of_bytes_to_process = 2
elif (byte >> 3) == 0b11110:
number_of_bytes_to_process = 3
else:
return False
else:
# Validate continuation byte format
if (byte >> 6) != 0b10:
return False
number_of_bytes_to_process -= 1
# Ensures all expected continuation bytes were provided.
if number_of_bytes_to_process == 0:
return True
else:
return False
Case | How to Handle |
---|---|
Empty input array | Return true because an empty array contains zero characters, which is valid UTF-8. |
Null input array | Throw an IllegalArgumentException or return false, depending on API contract. |
Array containing a single byte with leading 1s (e.g., 0b11111000) | Return false since a single byte starting with more than one '1' is not a valid UTF-8 sequence. |
Array containing a byte that is a continuation byte as the first byte (0b10xxxxxx) | Return false as a continuation byte cannot start a UTF-8 sequence. |
Truncated UTF-8 sequence at the end of the array | Return false if the number of expected continuation bytes exceeds the remaining length of the array. |
Maximum 4-byte UTF-8 sequence that causes an integer overflow when calculating the number of bytes to read | Use bitwise operations or unsigned integers to correctly calculate the number of bytes required without overflow. |
Invalid continuation bytes, i.e. continuation byte does not start with '10' | Return false if a continuation byte does not match the '10xxxxxx' pattern. |
Overlong encoding, using more bytes than necessary to encode a character | Return true because the problem statement does not prohibit this and it's technically UTF-8, although discouraged. |