Taro Logo

UTF-8 Validation

Medium
Meta logo
Meta
1 view
Topics:
ArraysBit Manipulation

Given an integer array data representing byte values, determine if it's a valid UTF-8 encoding.

A UTF-8 character can be 1 to 4 bytes long, following these rules:

  1. 1-byte character: Starts with 0, followed by the Unicode.
  2. n-byte character: Starts with n ones, a 0, then n-1 continuation bytes. Continuation bytes start with 10.

Here's a visual:

Bytes | UTF-8 Octet Sequence
------|----------------------
1     | 0xxxxxxx
2     | 110xxxxx 10xxxxxx
3     | 1110xxxx 10xxxxxx 10xxxxxx
4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x can be 0 or 1.

Important: Each integer in data represents one byte (least significant 8 bits).

Example 1:

data = [197, 130, 1]
Output: true
Explanation: 11000101 10000010 00000001.  A 2-byte char + 1-byte char.

Example 2:

data = [235, 140, 4]
Output: false
Explanation: 11101011 10001100 00000100.  A 3-byte char, but the 3rd byte isn't a continuation byte (doesn't start with 10).

Write a function to validate if the input data array is a valid UTF-8 encoding.

What are some edge cases to consider?

Solution


UTF-8 Encoding Validation

Problem Description

Given an integer array data representing the data, determine whether it is a valid UTF-8 encoding.

A character in UTF-8 can be 1 to 4 bytes long, subject to the following rules:

  1. For a 1-byte character, the first bit is 0, followed by its Unicode code.
  2. For an n-byte character, the first n bits are all ones, the (n + 1)th bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.
Number of BytesUTF-8 Octet Sequence (binary)
10xxxxxxx
2110xxxxx 10xxxxxx
31110xxxx 10xxxxxx 10xxxxxx
411110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.

Naive Approach

The brute-force approach would involve iterating through the data array and, for each byte, determining how many bytes the current UTF-8 character should be. Based on that, we check if the subsequent bytes follow the 10xxxxxx format. If any byte violates the UTF-8 encoding rules, we return false. If we reach the end of the array without finding any violations, we return true.

Optimal Approach

The optimal approach builds upon the naive approach but refines it for efficiency. We still iterate through the data array, checking the leading bits of each byte to determine the character length. Instead of repeated bitwise operations, we maintain a bytes_to_follow counter to keep track of the expected number of continuation bytes. This avoids redundant checks and makes the logic more concise.

Algorithm

  1. Initialize bytes_to_follow to 0.
  2. Iterate through the data array.
    • If bytes_to_follow is 0:
      • Check the leading bits to determine the character length (1 to 4 bytes).
      • Update bytes_to_follow accordingly.
      • If the leading bits don't match a valid UTF-8 start byte, return false.
    • Else:
      • Check if the current byte is a continuation byte (starts with 10).
      • If not, return false.
      • Decrement bytes_to_follow.
  3. After the loop, if bytes_to_follow is not 0, return false (incomplete character).
  4. Otherwise, return true.

Edge Cases

  • Empty input array: Should be considered valid.
  • Invalid start byte (e.g., 10xxxxxx at the beginning): Should return false.
  • Incomplete multi-byte character: Should return false.
  • Bytes with value greater than 255: The problem statement specifies that only the least significant 8 bits are used, so values will be in the range 0-255. No explicit check needed.

Code

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        bytes_to_follow = 0
        for byte in data:
            if bytes_to_follow == 0:
                if (byte >> 7) == 0:
                    continue  # 1-byte character
                elif (byte >> 5) == 0b110:
                    bytes_to_follow = 1  # 2-byte character
                elif (byte >> 4) == 0b1110:
                    bytes_to_follow = 2  # 3-byte character
                elif (byte >> 3) == 0b11110:
                    bytes_to_follow = 3  # 4-byte character
                else:
                    return False  # Invalid start byte
            else:
                if (byte >> 6) != 0b10:
                    return False  # Invalid continuation byte
                bytes_to_follow -= 1
        return bytes_to_follow == 0  # Check for incomplete character

Time Complexity

The time complexity is O(N), where N is the length of the data array, as we iterate through the array once.

Space Complexity

The space complexity is O(1), as we only use a constant amount of extra space to store the bytes_to_follow counter.