UTF-8 Validation

Medium

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. For an n-byte character, the first n bits are all ones and the (n + 1)th bit is 0, followed by n - 1 bytes whose two most significant bits are 10.
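
The two rules can be read as bit patterns on a single byte. The following is a minimal sketch of that reading; the helper name classify_byte and its string labels are assumptions made for illustration and are not part of the problem statement.

```python
def classify_byte(value):
    """Return a label for the role one byte can play under the two rules."""
    byte = value & 0xFF  # look only at the low 8 bits
    if (byte & 0b10000000) == 0b00000000:
        return "single"        # 0xxxxxxx: complete 1-byte character
    if (byte & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx: follows a lead byte
    if (byte & 0b11100000) == 0b11000000:
        return "lead-2"        # 110xxxxx: starts a 2-byte character
    if (byte & 0b11110000) == 0b11100000:
        return "lead-3"        # 1110xxxx: starts a 3-byte character
    if (byte & 0b11111000) == 0b11110000:
        return "lead-4"        # 11110xxx: starts a 4-byte character
    return "invalid"           # five or more leading ones
```

Each mask keeps the leading bits that matter for one pattern, and the comparison checks that they match exactly.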

For example, data = [197,130,1] should return true because it represents the octet sequence: 11000101 10000010 00000001. This is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.

However, data = [235,140,4] should return false because it represents the octet sequence: 11101011 10001100 00000100. The first 3 bits are all ones and the 4th bit is 0, meaning it is a 3-byte character. The next byte is a continuation byte that starts with 10, which is correct. But the second continuation byte does not start with 10, so the sequence is invalid.
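
The octet sequences quoted in the two examples can be reproduced with a few lines of Python. This is only a convenience sketch; the helper name to_octets is an assumption.

```python
def to_octets(data):
    """Render each integer's low 8 bits as an 8-character binary string."""
    return [format(value & 0xFF, "08b") for value in data]


print(to_octets([197, 130, 1]))  # ['11000101', '10000010', '00000001']
print(to_octets([235, 140, 4]))  # ['11101011', '10001100', '00000100']
```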

Sample Answer
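def validUtf8(data):
    # Bytes still expected for the character being read (0 between characters).
    n_bytes = 0
    for num in data:
        # Keep only the low 8 bits, then render them as an 8-character bit string.
        bin_rep = bin(num & 0xFF)[2:].zfill(8)
        if n_bytes == 0:
            # Start of a character: count the leading 1 bits.
            for bit in bin_rep:
                if bit == '0':
                    break
                n_bytes += 1

            # 0xxxxxxx is a complete 1-byte character.
            if n_bytes == 0:
                continue

            # 10xxxxxx is a stray continuation byte; five or more 1s is not a valid lead byte.
            if n_bytes > 4 or n_bytes == 1:
                return False
        else:
            # Inside a character: the byte must look like 10xxxxxx.
            if not (bin_rep[0] == '1' and bin_rep[1] == '0'):
                return False
        # One byte of the current character has been consumed.
        n_bytes -= 1
    # Valid only if no character was left unfinished.
    return n_bytes == 0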

Naive Approach

The naive approach would be to iterate through the data array and for each integer (byte), check the leading bits to determine the number of bytes for the UTF-8 character. Then, validate the subsequent bytes according to the UTF-8 encoding rules.
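
One way to read this description is as an index-based scan that looks ahead from each lead byte. The sketch below is an illustration under that reading, not the sample answer; the function name valid_utf8_lookahead is an assumption.

```python
def valid_utf8_lookahead(data):
    """Validate by reading a lead byte, then checking its continuation bytes."""
    i = 0
    while i < len(data):
        lead = data[i] & 0xFF

        # Count the leading 1 bits of the lead byte.
        length = 0
        mask = 0x80
        while mask and (lead & mask):
            length += 1
            mask >>= 1

        if length == 0:
            i += 1  # 0xxxxxxx: 1-byte character
            continue
        if length == 1 or length > 4:
            return False  # stray continuation byte or over-long lead

        # Look ahead at the length - 1 bytes that must follow.
        tail = data[i + 1 : i + length]
        if len(tail) != length - 1:
            return False  # input ended in the middle of a character
        if any((b & 0xC0) != 0x80 for b in tail):
            return False  # a following byte is not 10xxxxxx
        i += length
    return True
```

The slice makes the "subsequent bytes" step explicit, at the cost of building a small temporary list for each multi-byte character.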

Optimal Solution

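The sample answer validUtf8(data) is already optimal. It makes a single pass over data and keeps one counter, n_bytes, that records how many bytes of the current character are still expected. Every byte has to be examined at least once, so no approach can do better than linear time.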
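
The same single-pass idea can also be written with bit masks instead of a binary string. This is a sketch of an alternative formulation, not the sample answer itself; the names valid_utf8_bitwise and remaining are assumptions.

```python
def valid_utf8_bitwise(data):
    """Single pass with one counter, using masks rather than string conversion."""
    remaining = 0  # continuation bytes still expected
    for num in data:
        byte = num & 0xFF  # only the low 8 bits carry data
        if remaining:
            if (byte & 0xC0) != 0x80:
                return False  # expected 10xxxxxx
            remaining -= 1
        elif (byte & 0x80) == 0x00:
            continue          # 0xxxxxxx: 1-byte character
        elif (byte & 0xE0) == 0xC0:
            remaining = 1     # 110xxxxx
        elif (byte & 0xF0) == 0xE0:
            remaining = 2     # 1110xxxx
        elif (byte & 0xF8) == 0xF0:
            remaining = 3     # 11110xxx
        else:
            return False      # 10xxxxxx as a lead byte, or 11111xxx
    return remaining == 0


print(valid_utf8_bitwise([197, 130, 1]))  # True
print(valid_utf8_bitwise([235, 140, 4]))  # False
```

Avoiding the string conversion does not change the asymptotic cost; it only reduces the constant work per byte.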

Big(O) Run-time Analysis

The time complexity of the validUtf8 function is O(N), where N is the number of bytes in the input array data. The function iterates through each byte in the array once. Inside the loop, the number of operations performed for each byte is constant. Specifically, determining the number of leading ones in the first byte of a character takes at most 8 iterations (since a byte has 8 bits). Checking continuation bytes involves a constant number of operations. Thus, the overall time complexity is linear with respect to the input size.

Big(O) Space Usage Analysis

The space complexity of the validUtf8 function is O(1): it uses a constant amount of extra space. A single variable, n_bytes, tracks the number of remaining bytes in a multi-byte UTF-8 character. The bin_rep variable stores the binary representation of one byte, which is always 8 characters for values in the range 0 to 255, so its size is constant as well. Neither variable grows with the input size, hence the constant space complexity.

Edge Cases and Handling

  1. Empty Input: If the input array data is empty, the code will not enter the loop and will return n_bytes == 0, which is True. This is a valid UTF-8 encoding as there are no characters to validate.
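  2. Invalid Start Byte: A byte with more than four leading 1s (e.g., 11111000) cannot start a UTF-8 character; the code rejects it with the n_bytes > 4 check. A byte of the form 10xxxxxx that appears where a new character is expected is a stray continuation byte, and the n_bytes == 1 check rejects it.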
  3. Missing Continuation Bytes: If the starting byte indicates a multi-byte sequence, but the subsequent bytes do not start with 10, the code returns False. This is handled by the if not (bin_rep[0] == '1' and bin_rep[1] == '0') condition.
  4. Unexpected End of Data: If the function reaches the end of the input array while still expecting continuation bytes (n_bytes > 0), it means the UTF-8 sequence is incomplete and invalid. The final return n_bytes == 0 checks if all expected continuation bytes were encountered.
  5. Single Byte Characters: Single-byte characters (ASCII) have a first bit of 0, so no leading 1s are counted, n_bytes stays at 0, and the loop moves on to the next byte.
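  6. Integer Range: The problem states that only the least significant 8 bits of each integer are used. The solution honors this only when each value is reduced to a byte (for example with num & 0xFF) before its bits are examined; for inputs already in the range 0 to 255 the mask changes nothing.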
  7. Overlong Encoding: UTF-8 encoding should use the minimum number of bytes necessary to represent a character. While the problem description doesn't explicitly require checking for overlong encoding, a robust solution might include checks to ensure that characters are not encoded using more bytes than necessary. For example, a character that can be represented in one byte should not be represented using two or more bytes.
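
As a rough illustration of the overlong case, Python's built-in UTF-8 decoder can serve as a stricter reference. Note that this is an assumption about what "robust" should mean here: the built-in decoder goes beyond the problem's two rules and also rejects encoded surrogates and code points above U+10FFFF. The function name is_strictly_valid_utf8 is hypothetical.

```python
def is_strictly_valid_utf8(data):
    """Check data with Python's strict UTF-8 decoder instead of the two rules."""
    try:
        # Mask each integer to its low 8 bits before building the byte string.
        bytes(value & 0xFF for value in data).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 11000000 10000000 follows the structural rules, but it is an overlong
# two-byte form of U+0000, so the strict decoder rejects it.
print(is_strictly_valid_utf8([192, 128]))     # False
print(is_strictly_valid_utf8([194, 128]))     # True (U+0080 needs two bytes)
print(is_strictly_valid_utf8([197, 130, 1]))  # True
```

A check like this is useful for comparison, but it would give different answers from the sample answer on inputs such as [192, 128], so it is not a drop-in replacement for the problem as stated.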