Taro Logo

UTF-8 Validation

Medium
Asked by:
Profile picture
Profile picture
8 views
Topics:
ArraysBit Manipulation

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. For an n-bytes character, the first n bits are all one's, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

     Number of Bytes   |        UTF-8 Octet Sequence
                       |              (binary)
   --------------------+-----------------------------------------
            1          |   0xxxxxxx
            2          |   110xxxxx 10xxxxxx
            3          |   1110xxxx 10xxxxxx 10xxxxxx
            4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:

Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

Input: data = [235,140,4]
Output: false
Explanation: data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Constraints:

  • 1 <= data.length <= 2 * 104
  • 0 <= data[i] <= 255

Solution


Clarifying Questions

When you get asked this question in a real-life environment, it will often be ambiguous (especially at FAANG). Make sure to ask these questions in that case:

  1. What is the maximum size of the input array of integers?
  2. Are all integers in the input array guaranteed to be valid byte representations (i.e., between 0 and 255)?
  3. If the input array contains an invalid UTF-8 sequence, should I return false immediately, or should I process the remaining bytes?
  4. If the input array is empty, what should I return?
  5. Can you provide a few examples of valid and invalid UTF-8 sequences to confirm my understanding of the encoding rules?

Brute Force Solution

Approach

The brute force approach to validating UTF-8 data involves checking every possible sequence of bytes to see if it conforms to the UTF-8 encoding rules. It essentially simulates decoding each byte and verifying it against all possible scenarios. This method directly applies the UTF-8 encoding rules exhaustively.

Here's how the algorithm would work step-by-step:

  1. Take the first byte and examine it to determine the number of bytes that this character should occupy.
  2. If the first byte indicates a single-byte character, simply check if the byte is within the valid range for a single-byte UTF-8 character.
  3. If the first byte indicates a multi-byte character (e.g., 2, 3, or 4 bytes), verify that there are enough remaining bytes in the data to complete the character.
  4. For each subsequent byte that belongs to a multi-byte character, verify that it starts with the specific bit pattern '10'.
  5. If any of these checks fail for any sequence of bytes, the data is invalid.
  6. Repeat this process for every possible starting position in the data to ensure there isn't a valid encoding after an invalid sequence.
  7. If every possible sequence is checked and no invalid sequences are found, then the data is considered valid UTF-8.

Code Implementation

def is_valid_utf8_brute_force(data):
    data_length = len(data)
    for start_index in range(data_length):
        index = start_index
        while index < data_length:
            first_byte = data[index]

            # Determine the number of bytes for the character
            if (first_byte >> 7) == 0:
                number_of_bytes = 1
            elif (first_byte >> 5) == 0b110:
                number_of_bytes = 2
            elif (first_byte >> 4) == 0b1110:
                number_of_bytes = 3
            elif (first_byte >> 3) == 0b11110:
                number_of_bytes = 4
            else:
                # Invalid starting byte
                return False

            # Check for sufficient remaining bytes
            if index + number_of_bytes > data_length:

                return False

            # Check subsequent bytes for multi-byte characters
            for j in range(1, number_of_bytes):

                if (data[index + j] >> 6) != 0b10:

                    return False
            index += number_of_bytes

    return True

Big(O) Analysis

Time Complexity
O(n²)The algorithm iterates through each byte in the input array of size n as a potential starting point for a UTF-8 character. For each starting byte, the algorithm may need to examine up to 3 subsequent bytes to validate a multi-byte sequence. In the worst case, for an invalid sequence, the algorithm restarts the process from the next byte, potentially re-examining many of the same bytes. This nested behavior, where each of the n bytes might trigger a potentially linear scan forward, leads to a time complexity proportional to n * n/2. Therefore, the time complexity is O(n²).
Space Complexity
O(1)The provided algorithm iterates through the input data, examining each byte to validate the UTF-8 encoding. It uses a constant number of variables such as an index or a counter to keep track of the progress during the validation process. No auxiliary data structures that scale with the input size N (the number of bytes in the data) are created. Thus, the space complexity remains constant regardless of the input size.

Optimal Solution

Approach

The key to efficiently validating UTF-8 encoded data is to recognize the bit patterns that define the number of bytes a character occupies. We check the first byte to determine the character's length and then validate the subsequent bytes accordingly.

Here's how the algorithm would work step-by-step:

  1. Examine the first byte of each character.
  2. If the first byte starts with a '0', it's a single-byte character, and we proceed to the next character.
  3. If the first byte starts with '110', it indicates a two-byte character. We need to check the next byte.
  4. If the first byte starts with '1110', it indicates a three-byte character. We need to check the next two bytes.
  5. If the first byte starts with '11110', it indicates a four-byte character. We need to check the next three bytes.
  6. For any continuation byte, ensure it starts with '10'. If it doesn't, the data is invalid.
  7. Keep validating subsequent bytes based on the length indicated by the first byte.
  8. If at any point a byte doesn't conform to the expected pattern (e.g., a continuation byte is missing or malformed), the UTF-8 data is invalid.
  9. If we reach the end of the data without finding any errors, the UTF-8 data is valid.

Code Implementation

def utf8_validation(data):
    number_of_bytes_to_process = 0

    for byte in data:
        if number_of_bytes_to_process == 0:
            # Check for single byte character
            if (byte >> 7) == 0b0:
                continue

            # Determine number of continuation bytes
            elif (byte >> 5) == 0b110:
                number_of_bytes_to_process = 1
            elif (byte >> 4) == 0b1110:
                number_of_bytes_to_process = 2
            elif (byte >> 3) == 0b11110:
                number_of_bytes_to_process = 3
            else:
                return False

        else:
            # Validate continuation byte format
            if (byte >> 6) != 0b10:
                return False

            number_of_bytes_to_process -= 1

    # Ensures all expected continuation bytes were provided.
    if number_of_bytes_to_process == 0:
        return True
    else:
        return False

Big(O) Analysis

Time Complexity
O(n)The algorithm iterates through each byte in the input array of size n. For each byte, it determines the character length and then validates the required number of subsequent bytes as continuation bytes. The number of continuation bytes to check is limited to a maximum of 3. Thus, the inner validation steps take constant time. Therefore, the overall time complexity is directly proportional to the number of bytes in the input array, resulting in O(n) complexity.
Space Complexity
O(1)The algorithm iterates through the input array without using any auxiliary data structures like lists, maps, or sets. It uses a fixed number of variables such as the number of bytes expected for a UTF-8 character. These variables consume a constant amount of space, irrespective of the size (N) of the input array. Therefore, the space complexity is O(1).

Edge Cases

CaseHow to Handle
Empty input arrayReturn true because an empty array contains zero characters, which is valid UTF-8.
Null input arrayThrow an IllegalArgumentException or return false, depending on API contract.
Array containing a single byte with leading 1s (e.g., 0b11111000)Return false since a single byte starting with more than one '1' is not a valid UTF-8 sequence.
Array containing a byte that is a continuation byte as the first byte (0b10xxxxxx)Return false as a continuation byte cannot start a UTF-8 sequence.
Truncated UTF-8 sequence at the end of the arrayReturn false if the number of expected continuation bytes exceeds the remaining length of the array.
Maximum 4-byte UTF-8 sequence that causes an integer overflow when calculating the number of bytes to readUse bitwise operations or unsigned integers to correctly calculate the number of bytes required without overflow.
Invalid continuation bytes, i.e. continuation byte does not start with '10'Return false if a continuation byte does not match the '10xxxxxx' pattern.
Overlong encoding, using more bytes than necessary to encode a characterReturn true because the problem statement does not prohibit this and it's technically UTF-8, although discouraged.