How to resolve the algorithm UTF-8 encode and decode step by step in the JavaScript programming language
Problem Statement
As described in UTF-8 and on Wikipedia, UTF-8 is a popular encoding of (multi-byte) Unicode code points into eight-bit octets. The goal of this task is to write an encoder that takes a Unicode code point (an integer representing a Unicode character) and returns a sequence of 1–4 bytes representing that character in the UTF-8 encoding. Then write the corresponding decoder, which takes a sequence of 1–4 UTF-8 encoded bytes and returns the corresponding Unicode character. Demonstrate the functionality of your encoder and decoder on the following five characters: A (U+0041), ö (U+00F6), Ж (U+0416), € (U+20AC), and 𝄞 (U+1D11E).
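The "1–4 bytes" in the task follows from the standard UTF-8 range boundaries. The sketch below illustrates them for the five test characters; `utf8Length` is a hypothetical helper introduced only for this illustration and is not part of the solution that follows.

```javascript
// Hypothetical helper (not in the original solution): number of UTF-8 bytes
// needed for a code point, based on the standard range boundaries.
const utf8Length = cp =>
  cp < 0x80 ? 1          // U+0000..U+007F     0xxxxxxx
  : cp < 0x800 ? 2       // U+0080..U+07FF     110xxxxx 10xxxxxx
  : cp < 0x10000 ? 3     // U+0800..U+FFFF     1110xxxx 10xxxxxx 10xxxxxx
  : 4;                   // U+10000..U+10FFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

// Spreading a string iterates by code point, so the surrogate pair for the
// musical symbol stays together.
const lengths = [...'AöЖ€𝄞'].map(ch => [ch, utf8Length(ch.codePointAt(0))]);
console.log(lengths);
```

The helper assumes its argument is already a valid code point; it does not reject surrogates or values above U+10FFFF.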
Let's start with the solution:
Step-by-step solution: UTF-8 encode and decode in JavaScript
This JavaScript code implements UTF-8 encoding and decoding with plain bit operations. Error handling is minimal: invalid input raises a generic error string rather than a detailed report.
Functions:
utf8encode(n):
- Input: n, a string character or UInt32 code point value.
- Output: A Uint8Array containing the UTF-8 code units representing the character.
- Description: Encodes a single character or code point into an array of one to four UTF-8 code units.
utf8decode([m, n, o, p]):
- Input: [m, n, o, p], an array of one to four uint8 values representing UTF-8 code units.
- Output: A uint32 representing the decoded code point.
- Description: Decodes an array of UTF-8 code units into a 32-bit Unicode code point.
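To make the two descriptions concrete, here is a hand-worked round trip for the euro sign (U+20AC), which needs three bytes. This is a sketch of the bit arithmetic only, with illustrative variable names; it performs no validation.

```javascript
// Worked example: U+20AC, binary 0010 0000 1010 1100.
const euro = 0x20ac;

// Encoding: split the 16 payload bits into groups of 4 + 6 + 6 and attach
// the marker bits 1110xxxx, 10xxxxxx, 10xxxxxx.
const euroBytes = [
  (euro >> 12) & 0x0f | 0xe0,   // 1110 0010 -> 0xe2
  (euro >> 6) & 0x3f | 0x80,    // 10 000010 -> 0x82
  euro & 0x3f | 0x80,           // 10 101100 -> 0xac
];

// Decoding: strip the marker bits and shift the payload groups back.
const [b0, b1, b2] = euroBytes;
const decodedEuro = (b0 & 0x0f) << 12 | (b1 & 0x3f) << 6 | (b2 & 0x3f);

console.log(euroBytes.map(b => b.toString(16)), decodedEuro.toString(16));
```

The one-, two-, and four-byte forms follow the same pattern with different marker bits and payload widths.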
Testing the Functions:
The code demonstrates the usage of these functions by creating various test cases and printing the results:
- str: A string containing Unicode characters.
- cps: An array of code points corresponding to each character in str.
- cus: An array of arrays containing the UTF-8 code units for each character in str.
The zip3 function combines the character, code point, and UTF-8 code units for each test case into a single array for easy iteration.
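The zipping idea can also be sketched without recursion. `zipThree` below is a hypothetical alternative written for illustration, and it assumes all three inputs have the same length.

```javascript
// Hypothetical non-recursive variant of the zipping step. Assumes the three
// inputs are iterables of equal length.
const zipThree = (xs, ys, zs) => {
  // Array.from iterates strings by code point and accepts typed arrays.
  const [a, b, c] = [xs, ys, zs].map(it => Array.from(it));
  return a.map((x, i) => [x, b[i], c[i]]);
};

const rows = zipThree('A€', [0x41, 0x20ac], [[0x41], [0xe2, 0x82, 0xac]]);
console.log(rows);
```

Each element of `rows` is a `[character, codePoint, codeUnits]` triple, which is the shape the test loop destructures.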
Output:
The output is a table with six columns:
- Character: The original character.
- CodePoint: The Unicode code point in hexadecimal.
- CodeUnits: The expected UTF-8 code units in hexadecimal.
- utf8encode(ch): The result of utf8encode on the character.
- utf8encode(cp): The result of utf8encode on the code point.
- utf8decode(cu): The result of utf8decode on the UTF-8 code units.
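The table cells rely on two small formatting tricks, sketched here in isolation. The helper names `formatCodePoint` and `formatCodeUnits` are illustrative and do not appear in the solution.

```javascript
// padStart fills on the left with as much of the pad string as is needed,
// so 'U+000000' supplies both the prefix and the leading zeros.
const formatCodePoint = cp => cp.toString(16).padStart(8, 'U+000000');

// Spread the typed array into a plain array first: Uint8Array.prototype.map
// would coerce the hex strings back into numbers.
const formatCodeUnits = cu => `[${[...cu].map(b => b.toString(16))}]`;

console.log(formatCodePoint(0x41));      // U+000041
console.log(formatCodePoint(0x1d11e));   // U+01d11e
console.log(formatCodeUnits(Uint8Array.from([0xe2, 0x82, 0xac])));  // [e2,82,ac]
```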
Source code in the JavaScript programming language
/*
 * Pure UTF-8 handling without detailed error reporting functionality.
 *
 * utf8encode
 *   < String character or UInt32 code point
 *   > Uint8Array encoded_character
 *   | ErrorString
 *
 *   utf8encode takes a string or uint32 representing a single code point
 *   as its argument and returns an array of length 1 up to 4 containing
 *   utf8 code units representing that character.
 *
 * utf8decode
 *   < Uint8Array [highendbyte highmidendbyte lowmidendbyte lowendbyte]
 *   > uint32 character
 *   | ErrorString
 *
 *   utf8decode takes an array of one to four uint8 representing utf8 code
 *   units and returns a uint32 representing that code point.
 */
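const
  utf8encode =
    n =>
      (m =>
        m < 0x80
          ? Uint8Array.from([m >> 0 & 0x7f | 0x00])
        : m < 0x800
          ? Uint8Array.from([m >> 6 & 0x1f | 0xc0, m >> 0 & 0x3f | 0x80])
        : m < 0x10000
          ? Uint8Array.from(
              [m >> 12 & 0x0f | 0xe0, m >> 6 & 0x3f | 0x80, m >> 0 & 0x3f | 0x80])
        : m < 0x110000
          ? Uint8Array.from(
              [m >> 18 & 0x07 | 0xf0, m >> 12 & 0x3f | 0x80,
               m >> 6 & 0x3f | 0x80, m >> 0 & 0x3f | 0x80])
        : (() => { throw 'Invalid Unicode Code Point!' })())
      // Accept either a one-character string or a numeric code point.
      (typeof n === 'string' ? n.codePointAt(0) : n & 0x1fffff),
  utf8decode =
    ([m, n, o, p]) =>
      // One byte: 0xxxxxxx.
      m < 0x80
        ? (m & 0x7f) << 0
      // Two bytes: lead 0xc2..0xdf. Continuation bytes must match 10xxxxxx;
      // the test is parenthesized because === binds tighter than &.
      : 0xc1 < m && m < 0xe0 && (n & 0xc0) === 0x80
        ? (m & 0x1f) << 6 | (n & 0x3f) << 0
      // Three bytes: the second-byte ranges exclude overlong forms and surrogates.
      : (  m === 0xe0 && 0x9f < n && n < 0xc0
        || 0xe0 < m && m < 0xed && 0x7f < n && n < 0xc0
        || m === 0xed && 0x7f < n && n < 0xa0
        || 0xed < m && m < 0xf0 && 0x7f < n && n < 0xc0)
        && (o & 0xc0) === 0x80
        ? (m & 0x0f) << 12 | (n & 0x3f) << 6 | (o & 0x3f) << 0
      // Four bytes: the second-byte ranges keep the result within U+10000..U+10FFFF.
      : (  m === 0xf0 && 0x8f < n && n < 0xc0
        || m === 0xf4 && 0x7f < n && n < 0x90
        || 0xf0 < m && m < 0xf4 && 0x7f < n && n < 0xc0)
        && (o & 0xc0) === 0x80 && (p & 0xc0) === 0x80
        ? (m & 0x07) << 18 | (n & 0x3f) << 12 | (o & 0x3f) << 6 | (p & 0x3f) << 0
      : (() => { throw 'Invalid UTF-8 encoding!' })();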
const
  str = 'AöЖ€𝄞',
  cps = Uint32Array.from(str, s => s.codePointAt(0)),
  cus =
    [ [0x41],
      [0xc3, 0xb6],
      [0xd0, 0x96],
      [0xe2, 0x82, 0xac],
      [0xf0, 0x9d, 0x84, 0x9e] ]
      .map(a => Uint8Array.from(a)),
  // Combine the i-th items of three sequences into [a, b, c] triples.
  zip3 =
    ([a, ...as], [b, ...bs], [c, ...cs]) =>
      0 < as.length + bs.length + cs.length
        ? [[a, b, c], ...zip3(as, bs, cs)]
        : [[a, b, c]],
  inputs = zip3(str, cps, cus),
  // Format a code point as U+XXXXXX and a byte sequence as [xx,xx,...].
  codePoint = cp => cp.toString(0x10).padStart(8, 'U+000000'),
  codeUnits = cu => `[${[...cu].map(n => n.toString(0x10))}]`,
  // Pad every cell of a row to 16 columns.
  row = cells => cells.map(s => s.padEnd(16)).join('');
console.log(
  row(['Character', 'CodePoint', 'CodeUnits', 'utf8encode(ch)', 'utf8encode(cp)'])
  + 'utf8decode(cu)');
for (const [ch, cp, cu] of inputs)
  console.log(
    row([ch, codePoint(cp), codeUnits(cu),
         codeUnits(utf8encode(ch)), codeUnits(utf8encode(cp))])
    + codePoint(utf8decode(cu)));
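One way to sanity-check the table is to compare it against the platform's built-in UTF-8 codec. The sketch below assumes an environment that provides the standard `TextEncoder` and `TextDecoder` globals (modern browsers and current Node.js builds); it is independent of the functions above.

```javascript
// Cross-check using the built-in codec. TextDecoder with fatal: true throws
// on malformed input instead of substituting U+FFFD.
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8', { fatal: true });

const reference = [...'AöЖ€𝄞'].map(ch => {
  const bytes = encoder.encode(ch);   // Uint8Array of UTF-8 code units
  const back = decoder.decode(bytes); // decoded string, one code point long
  return {
    ch,
    bytes: [...bytes].map(b => b.toString(16)),
    codePoint: back.codePointAt(0).toString(16),
  };
});
console.log(reference);
```

The `bytes` values here should agree with the CodeUnits column printed by the program.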