How to resolve the algorithm UTF-8 encode and decode step by step in the Common Lisp programming language

Published on 12 May 2024 09:40 PM

How to resolve the algorithm UTF-8 encode and decode step by step in the Common Lisp programming language

Table of Contents

Problem Statement

As described in UTF-8 and in Wikipedia, UTF-8 is a popular encoding of (multi-byte) Unicode code-points into eight-bit octets. The goal of this task is to write a encoder that takes a unicode code-point (an integer representing a unicode character) and returns a sequence of 1–4 bytes representing that character in the UTF-8 encoding. Then you have to write the corresponding decoder that takes a sequence of 1–4 UTF-8 encoded bytes and return the corresponding unicode character. Demonstrate the functionality of your encoder and decoder on the following five characters: Provided below is a reference implementation in Common Lisp.

Let's start with the solution:

Step by Step solution about How to resolve the algorithm UTF-8 encode and decode step by step in the Common Lisp programming language

Source code in the common programming language

(defun ascii-byte-p (octet)
  "Return t if octet is a single-byte 7-bit ASCII char.
  The most significant bit is 0, so the allowed pattern is 0xxx xxxx."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmask  #b10000000)
        (template #b00000000))
    ;; bitwise and the with the bitmask #b11000000 to extract the first two bits.
    ;; check if the first two bits are equal to the template #b10000000.
    (= (logand bitmask octet) template)))

(defun multi-byte-p (octet)
  "Return t if octet is a part of a multi-byte UTF-8 sequence.
  The multibyte pattern is 1xxx xxxx. A multi-byte can be either a lead byte or a trail byte."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmask  #b10000000)
        (template #b10000000))
    ;; bitwise and the with the bitmask #b11000000 to extract the first two bits.
    ;; check if the first two bits are equal to the template #b10000000.
    (= (logand bitmask octet) template)))

(defun lead-byte-p (octet)
  "Return t if octet is one of the leading bytes of an UTF-8 sequence, nil otherwise.
  Allowed leading byte patterns are 0xxx xxxx, 110x xxxx, 1110 xxxx and 1111 0xxx."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmasks  (list #b10000000 #b11100000 #b11110000 #b11111000))
        (templates (list #b00000000 #b11000000 #b11100000 #b11110000)))
    (some #'(lambda (a b) (= (logand a octet) b)) bitmasks templates)))

(defun n-trail-bytes (octet)
  "Take a leading utf-8 byte, return the number of continuation bytes 1-3."
  (assert (typep octet 'integer))
  (assert (<= (integer-length octet) 8))
  (let ((bitmasks  (list #b10000000 #b11100000 #b11110000 #b11111000))
        (templates (list #b00000000 #b11000000 #b11100000 #b11110000)))
    (loop for i from 0 to 3
       when (= (nth i templates) (logand (nth i bitmasks) octet))
       return i)))


(defun unicode-to-utf-8 (int)
  "Take a unicode code point, return a list of one to four UTF-8 encoded bytes (octets)."
  (assert (<= (integer-length int) 21))
  (let ((n-trail-bytes (cond ((<= #x00000 int #x00007F) 0)
                             ((<= #x00080 int #x0007FF) 1)
                             ((<= #x00800 int #x00FFFF) 2)
                             ((<= #x10000 int #x10FFFF) 3)))
        (lead-templates (list #b00000000 #b11000000 #b11100000 #b11110000))
        (trail-template #b10000000)
        ;; number of content bits in the lead byte.
        (n-lead-bits (list 7 5 4 3))
        ;; number of content bits in the trail byte.
        (n-trail-bits 6)
        ;; list to put the UTF-8 encoded bytes in.
        (byte-list nil))
    (if (= n-trail-bytes 0)
        ;; if we need 0 trail bytes, ist just an ascii single byte.
        (push int byte-list)
        (progn
          ;; if we need more than one byte, first fill the trail bytes with 6 bits each.
          (loop for i from 0 to (1- n-trail-bytes)
             do (push (+ trail-template
                         (ldb (byte n-trail-bits (* i n-trail-bits)) int))
                      byte-list))
          ;; then copy the remaining content bytes to the lead byte.
          (push (+ (nth n-trail-bytes lead-templates)
                   (ldb (byte (nth n-trail-bytes n-lead-bits) (* n-trail-bytes n-trail-bits)) int))
                byte-list)))
    ;; return the list of UTF-8 encoded bytes.
    byte-list))


(defun utf-8-to-unicode (byte-list)
  "Take a list of one to four utf-8 encoded bytes (octets), return a code point."
  (let ((b1 (car byte-list)))
    (cond ((ascii-byte-p b1) b1) ; if a single byte, just return it.
          ((multi-byte-p b1)
           (if (lead-byte-p b1)
               (let ((n (n-trail-bytes b1))
                     ;; Content bits we want to extract from each lead byte.
                     (lead-templates (list #b01111111 #b00011111 #b00001111 #b00000111))
                     ;; Content bits we want to extract from each trail byte.
                     (trail-template #b00111111))
                 (if (= n (1- (list-length byte-list)))
                     ;; add lead byte
                     (+ (ash (logand (nth 0 byte-list) (nth n lead-templates)) (* 6 n))
                        ;; and the trail bytes
                        (loop for i from 1 to n sum
                             (ash (logand (nth i byte-list) trail-template) (* 6 (- n i)))))
                     (error "calculated number of bytes doesnt match the length of the byte list")))
               (error "first byte in the list isnt a lead byte"))))))


(defun test-utf-8 ()
  "Return t if the chosen unicode points are encoded and decoded correctly."
  (let* ((unicodes-orig (list 65 246 1046 8364 119070))
         (unicodes-test (mapcar #'(lambda (x) (utf-8-to-unicode (unicode-to-utf-8 x)))
                                unicodes-orig)))
    (mapcar #'(lambda (x)
                (format t
                        "character ~A, code point: ~6x, utf-8: ~{~x ~}~%"
                        (code-char x)
                        x
                        (unicode-to-utf-8 x)))
            unicodes-orig)
    ;; return t if all are t
    (every #'= unicodes-orig unicodes-test)))


CL-USER> (test-utf-8)
character A, code point:     41, utf-8: 41 
character ö, code point:     F6, utf-8: C3 B6 
character Ж, code point:    416, utf-8: D0 96 
character €, code point:   20AC, utf-8: E2 82 AC 
character 𝄞, code point:  1D11E, utf-8: F0 9D 84 9E 
T


  

You may also check:How to resolve the algorithm Fivenum step by step in the Lua programming language
You may also check:How to resolve the algorithm Integer overflow step by step in the Phix programming language
You may also check:How to resolve the algorithm Bioinformatics/Sequence mutation step by step in the Yabasic programming language
You may also check:How to resolve the algorithm XML/XPath step by step in the R programming language
You may also check:How to resolve the algorithm Identity matrix step by step in the Ol programming language