Utf  8 Encoding
1. You are given an array of integers.Input Format
2. You are required to test whether the array represents a valid sequence of UTF8 characters or
not.
3. A character in UTF8 can be from 1 to 4 bytes long and follows some rules 
(i) For 1byte long character, first bit will be 0 and rest represents its unicode code.
(ii) For nbytes long character, first nbits will be 1's, the n+1th bit is 0, followed by n1 bytes
with most significant 2 bits being 10.
Note > Only the least significant 8 bits of each element in array is used for data.
Note > Check out the question video for details.
A number NOutput Format
arr1
arr2..
N numbers
Check the sample output and question video.Question Video
1 <= n <= 10^9Sample Input
0 < a[i] <= 255
3Sample Output
197
130
1
true

Editorial
The problem here deals with checking whether the particular sequence of data can form a valid UTF8 encoded data or not.
As per UTF8 encoding, a character can be one to four bytes long and according to the length of the character rules are governing UTF8 encoding:
1. For a character of length 1 byte, the MSB should always begin with 0.
2. For a character of length 2 bytes, the starting bits of the upper byte should be 110 and the lower byte should start with 10.
3. For a character of length 2 bytes, the starting bits of the upper byte should be 1110 and the rest lower bytes should start with 10.
4. For a character of length 4 bytes, the starting bits of the upper byte should be 11110 and the rest lower bytes should start with 10.
So to summarize:
1 Byte > 0 _ _ _ _ _ _ _
2 Bytes > [1 1 0 _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ]
3 Bytes > [1 1 1 0 _ _ _ _ ] [1 0 _ _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ]
4 Bytes > [1 1 1 1 0 _ _ _ ] [1 0 _ _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ]
Let us take an example to understand the problem better,
Sequence > 11101111, 10111100, 10010111, 11011111, 10000000, 11110010, 10010101, 10001111, 10111111
As this sequence is of type 3 bytes > 2 bytes > 4 bytes and abides all rules of UTF8 encoding henceforth it is a valid sequence.
Let us take another example,
Sequence > 11101111, 10111100, 10010111, 1011111, 01000000, 11110010, 10010101, 10001111, 10111111
As this sequence is of type 3 bytes >{redundant} > 1 byte > 4 bytes so it fails for byte at index 3rd where we have a redundant byte starting from 10... so this is an invalid sequence.
The idea behind this problem is to traverse through the entire array and check for every value that whether it lies in 1 byte, 2 bytes, 3 byte, or 4byte criteria by checking its MSB bits with 0, 110, 1110, and 11110 and as per the result check further (n  1) bytes for the type where MSB bits starts from 10, here n represent the byte type, hence n = 1, 2, 3 or 4. If at any instant there occurs an ambiguity then the sequence is not valid.
Suppose we need to check whether the given val is of type 3 byte:
If((val>>4) == 0b1110) => true then it is of type 3 byte
Here we have shifted the bits of val 4 sides to the right which makes us consider the last 4 bits of the byte and now if these bits are 1110 then the criteria are met. We can apply this method to check for every type.
Time Complexity: O(n)
The time complexity for the function is linear as we are traversing on the entire array once.
Space Complexity: O(1)
The space complexity for the function is constant.

Asked in Companies

Related Topics