How to resolve the algorithm UTF-8 encode and decode step by step in the JavaScript programming language

Published on 12 May 2024 09:40 PM

Problem Statement

As described in UTF-8 and in Wikipedia, UTF-8 is a popular encoding of (multi-byte) Unicode code points into eight-bit octets. The goal of this task is to write an encoder that takes a Unicode code point (an integer representing a Unicode character) and returns a sequence of 1–4 bytes representing that character in the UTF-8 encoding. Then write the corresponding decoder, which takes a sequence of 1–4 UTF-8 encoded bytes and returns the corresponding Unicode character. Demonstrate the functionality of your encoder and decoder on the following five characters: A (U+0041), ö (U+00F6), Ж (U+0416), € (U+20AC), and 𝄞 (U+1D11E).
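The "1–4 bytes" in the task depends only on which range the code point falls into. The following is a minimal sketch of that rule, not part of the solution itself; the helper name utf8ByteLength is hypothetical, and the cross-check assumes an environment where TextEncoder is available as a global (modern browsers and recent Node.js versions).

```javascript
// Hypothetical helper (not part of the solution below): number of UTF-8
// bytes needed for a code point, using the standard range boundaries.
const utf8ByteLength = cp =>
  cp < 0x80 ? 1
  : cp < 0x800 ? 2
  : cp < 0x10000 ? 3
  : cp < 0x110000 ? 4
  : NaN; // outside the Unicode range

for (const ch of 'AöЖ€𝄞') {
  const cp = ch.codePointAt(0);
  // Cross-check against the built-in encoder (assumed to be available).
  const builtin = new TextEncoder().encode(ch).length;
  console.log(ch, utf8ByteLength(cp), builtin);
}
```

For the five test characters both columns should agree: 1, 2, 2, 3 and 4 bytes.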

Let's start with the solution:

Step-by-step solution: UTF-8 encode and decode in the JavaScript programming language

This JavaScript code implements UTF-8 encoding and decoding as two pure functions. Error handling is minimal: on invalid input, either function throws a plain string with no further detail.

Functions:

utf8encode(n):

  • Input:
    • n: A string character (only its first code point is used) or a UInt32 code point value.
  • Output:
    • A Uint8Array containing the UTF-8 code units representing the character.
  • Description:
    • Encodes a single character or code point into an array of one to four UTF-8 code units.
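As a worked illustration of the bit manipulation involved, the sketch below encodes a single code point, U+20AC (€), by hand using the three-byte layout. It mirrors the shifts and masks used in utf8encode, but the variable names are chosen only for this example.

```javascript
// Worked example: encode U+20AC (EURO SIGN) as a three-byte sequence.
const euro = 0x20ac;

// Three-byte layout: 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits).
const euroBytes = [
  (euro >> 12 & 0x0f) | 0xe0, // top 4 bits behind the 1110 prefix
  (euro >> 6 & 0x3f) | 0x80,  // middle 6 bits behind the 10 prefix
  (euro & 0x3f) | 0x80,       // low 6 bits behind the 10 prefix
];

console.log(euroBytes.map(b => b.toString(16))); // [ 'e2', '82', 'ac' ]
```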

utf8decode([m, n, o, p]):

  • Input:
    • [m, n, o, p]: An array of one to four uint8 values representing UTF-8 code units.
  • Output:
    • A uint32 representing the decoded code point.
  • Description:
    • Decodes an array of UTF-8 code units into a 32-bit Unicode code point.
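Decoding reverses the process: strip the prefix bits from each byte and shift the payload bits back into place. The sketch below does this by hand for the four-byte sequence of 𝄞; the names clefBytes and isContinuation are illustrative only, and the range checks that utf8decode performs on the lead and second byte are omitted here.

```javascript
// Worked example: decode f0 9d 84 9e back to U+1D11E.
const clefBytes = [0xf0, 0x9d, 0x84, 0x9e];
const [b0, b1, b2, b3] = clefBytes;

// A continuation byte has the bit pattern 10xxxxxx.
const isContinuation = b => (b & 0xc0) === 0x80;
if (![b1, b2, b3].every(isContinuation)) {
  throw new Error('Invalid continuation byte');
}

// Four-byte layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 payload bits).
const clef =
  (b0 & 0x07) << 18 | (b1 & 0x3f) << 12 | (b2 & 0x3f) << 6 | (b3 & 0x3f);

console.log(clef.toString(16));          // 1d11e
console.log(String.fromCodePoint(clef)); // 𝄞
```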

Testing the Functions:

The code demonstrates the usage of these functions by creating various test cases and printing the results:

  • str: A string containing unicode characters.
  • cps: An array of code points corresponding to each character in str.
  • cus: An array of arrays containing the UTF-8 code units for each character in str.

The zip3 function is used to combine the character, code point, and UTF-8 code units for each test case into a single array for easy iteration.
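The listing implements zip3 recursively with destructuring. For comparison, here is a non-recursive sketch of the same idea; zipThree is a hypothetical name, and this version assumes the second and third arguments are indexable and at least as long as the first.

```javascript
// Hypothetical alternative to zip3: pair up the i-th elements of three
// sequences. Array.from iterates a string by code point, so '𝄞' stays whole.
const zipThree = (a, b, c) => Array.from(a, (x, i) => [x, b[i], c[i]]);

const rows = zipThree('A€', [0x41, 0x20ac], [[0x41], [0xe2, 0x82, 0xac]]);
for (const [ch, cp, cu] of rows) {
  console.log(ch, cp.toString(16), cu.map(n => n.toString(16)).join(' '));
}
```

Each row then holds one character together with its code point and its expected UTF-8 code units, which is the shape the output loop iterates over.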

Output:

The output is a table with six columns:

  1. Character: The original character.
  2. CodePoint: The hexadecimal representation of the Unicode code point.
  3. CodeUnits: The hexadecimal representation of the UTF-8 code units.
  4. utf8encode(ch): The result of utf8encode on the character.
  5. utf8encode(cp): The result of utf8encode on the code point.
  6. utf8decode(cu): The result of utf8decode on the UTF-8 code units.
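The code point columns rely on a small padStart trick: the pad string 'U+000000' is cut off at the end so that the result is always eight characters long, which leaves the U+ prefix and as many zeros as needed. A minimal sketch, with formatCodePoint as a hypothetical helper name:

```javascript
// Hypothetical helper showing the formatting used for the code point columns.
// padStart keeps the start of the pad string and truncates its end, so the
// hex digits are always preceded by 'U+' and zero padding.
const formatCodePoint = cp => cp.toString(16).padStart(8, 'U+000000');

console.log(formatCodePoint(0x41));    // U+000041
console.log(formatCodePoint(0x20ac));  // U+0020ac
console.log(formatCodePoint(0x1d11e)); // U+01d11e
```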

Source code in the JavaScript programming language

/***************************************************************************\
|*  Pure UTF-8 handling without detailed error reporting functionality.    *|
|***************************************************************************|
|*  utf8encode                                                             *|
|*    < String character or UInt32 code point                              *|
|*    > Uint8Array encoded_character                                       *|
|*    | ErrorString                                                        *|
|*                                                                         *|
|*  utf8encode takes a string or uint32 representing a single code point   *|
|*    as its argument and returns an array of length 1 up to 4 containing  *|
|*    utf8 code units representing that character.                         *|
|***************************************************************************|
|*  utf8decode                                                             *|
|*    < Uint8Array [highendbyte highmidendbyte lowmidendbyte lowendbyte]   *|
|*    > uint32 character                                                   *|
|*    | ErrorString                                                        *|
|*                                                                         *|
|*  utf8decode takes an array of one to four uint8 representing utf8 code  *|
|*    units and returns a uint32 representing that code point.             *|
\***************************************************************************/

const
  utf8encode=
    n=>
      (m=>
        m<0x80
       ?Uint8Array.from(
          [ m>>0&0x7f|0x00])
       :m<0x800
       ?Uint8Array.from(
          [ m>>6&0x1f|0xc0,m>>0&0x3f|0x80])
       :m<0x10000
       ?Uint8Array.from(
          [ m>>12&0x0f|0xe0,m>>6&0x3f|0x80,m>>0&0x3f|0x80])
       :m<0x110000
       ?Uint8Array.from(
          [ m>>18&0x07|0xf0,m>>12&0x3f|0x80,m>>6&0x3f|0x80,m>>0&0x3f|0x80])
       :(()=>{throw'Invalid Unicode Code Point!'})())
      ( typeof n==='string'
       ?n.codePointAt(0)
       :n&0x1fffff),
  utf8decode=
    ([m,n,o,p])=>
      m<0x80
     ?( m&0x7f)<<0
     // a continuation byte must match the bit pattern 10xxxxxx
     :0xc1<m&&m<0xe0&&(n&0xc0)===0x80
     ?( m&0x1f)<<6|( n&0x3f)<<0
     :( m===0xe0&&0x9f<n&&n<0xc0
      ||0xe0<m&&m<0xed&&0x7f<n&&n<0xc0
      ||m===0xed&&0x7f<n&&n<0xa0
      ||0xed<m&&m<0xf0&&0x7f<n&&n<0xc0)
    &&(o&0xc0)===0x80
     ?( m&0x0f)<<12|( n&0x3f)<<6|( o&0x3f)<<0
     :( m===0xf0&&0x8f<n&&n<0xc0
      ||m===0xf4&&0x7f<n&&n<0x90
      ||0xf0<m&&m<0xf4&&0x7f<n&&n<0xc0)
    &&(o&0xc0)===0x80&&(p&0xc0)===0x80
     ?( m&0x07)<<18|( n&0x3f)<<12|( o&0x3f)<<6|( p&0x3f)<<0
     :(()=>{throw'Invalid UTF-8 encoding!'})()


const
  str=
    'AöЖ€𝄞'
 ,cps=
    Uint32Array.from(str,s=>s.codePointAt(0))
 ,cus=
    [ [ 0x41]
     ,[ 0xc3,0xb6]
     ,[ 0xd0,0x96]
     ,[ 0xe2,0x82,0xac]
     ,[ 0xf0,0x9d,0x84,0x9e]]
   .map(a=>Uint8Array.from(a))
 ,zip3=
    ([a,...as],[b,...bs],[c,...cs])=>
      0<as.length+bs.length+cs.length
     ?[ [ a,b,c],...zip3(as,bs,cs)]
     :[ [ a,b,c]]
 ,inputs=zip3(str,cps,cus);


console.log(`\
${'Character'.padEnd(16)}\
${'CodePoint'.padEnd(16)}\
${'CodeUnits'.padEnd(16)}\
${'utf8encode(ch)'.padEnd(16)}\
${'utf8encode(cp)'.padEnd(16)}\
utf8decode(cu)`)
for(let [ch,cp,cu] of inputs)
  console.log(`\
${ch.padEnd(16)}\
${cp.toString(0x10).padStart(8,'U+000000').padEnd(16)}\
${`[${[...cu].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${`[${[...utf8encode(ch)].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${`[${[...utf8encode(cp)].map(n=>n.toString(0x10))}]`.padEnd(16)}\
${utf8decode(cu).toString(0x10).padStart(8,'U+000000')}`)


  
