How to resolve the algorithm Compiler/lexical analyzer step by step in the C++ programming language

Published on 7 June 2024 03:52 AM


Problem Statement

Task definition: Create a lexical analyzer for the simple programming language specified below. The program should read input from a file and/or stdin and write output to a file and/or stdout. If the language being used has a lexer module/library/class, it would be great if two versions of the solution are provided: one without the lexer module, and one with. The simple programming language to be analyzed is more or less a subset of C. It supports a fixed set of operator, symbol, and keyword tokens, plus three token types that differ from the others in that each occurrence carries a value: identifiers, integer literals, and string literals (see the TokenName enum in the source below for the full list). Two program fragments that differ only in layout and comments are equivalent and should produce the same token stream, except for the line and column positions. The program output should be a sequence of lines, each consisting of whitespace-separated fields: the line and column where the token begins, the token name, and, for valued tokens, the token value.
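
For instance, given the one-line input count = 0; followed by a newline, the lexer developed below produces token lines like the following, where each line shows the token's starting line and column, its name, and (for valued tokens) its value:

 1    1   Identifier        count
 1    7   Op_assign
 1    9   Integer           0
 1   10   Semicolon
 2    1   End_of_input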

This task is intended to be used as part of a pipeline, together with the other compiler tasks, for example:

lex < hello.t | parse | gen | vm

Or possibly:

lex hello.t lex.out
parse lex.out parse.out
gen parse.out gen.out
vm gen.out

This implies that the output of this task (the lexical analyzer) should be suitable as input to any of the Syntax Analyzer task programs. Several error conditions should be caught; in this implementation they include an empty character constant, an unknown escape sequence, a multi-character constant, end-of-line or end-of-file inside a string literal, end-of-file inside a comment, an unrecognized character, and an integer literal that exceeds the maximum representable value. Your solution should pass all the task's test cases as well as its additional tests.

The C and Python versions can be considered reference implementations.

Let's start with the solution:

Step-by-step solution: How to resolve the algorithm Compiler/lexical analyzer in the C++ programming language

This C++ program is a lexer: the component of a compiler that reads the source text of a program and produces tokens, sequences of characters that represent lexical units. The lexer's primary job is to identify and classify the individual elements of the source code, such as keywords, identifiers, literals, operators, and punctuation.
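
For example, the fragment while (count < 10) contains six lexical units, which this lexer classifies as follows:

while   →  Keyword_while
(       →  LeftParen
count   →  Identifier (value "count")
<       →  Op_less
10      →  Integer (value 10)
)       →  RightParen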

Here's a detailed explanation of the program:

  1. Auxiliary Functions:

    • file_to_string: Reads the contents of a file into a string.
    • string_to_file: Writes a string to a file.
    • with_IO: A generic helper for performing input and output around a computation. It takes the input source (a file path or "stdin"), the output destination (a file path or "stdout"), and a callable that maps the input string to the output string, handling all I/O in one consistent place. A standalone sketch of this pattern appears after this list.
  2. Tokenization:

    • Tokens: The program defines a Token struct that represents a token, consisting of a token name, a token value (which can be an integer or a string depending on the token type), and its source code location (line and column numbers).

    • Keywords: It maintains a static map<string, TokenName> called keywords to identify keywords in the source code. Keywords are reserved words with special meanings in the language (e.g., "if", "else", "while").

    • Lexer: The Lexer class is the core component that performs lexical analysis. It initializes with a character pointer to the source code.

    • next_token: This method is the main entry point for tokenization. It scans the input, identifies the next token, and constructs a Token object. It dispatches on the first character of the upcoming lexeme, using a Scanner object to track the current line and column and to advance through the source code character by character.

    • Helper Methods: The lexer uses various helper methods to handle specific situations, such as:

      • simply: Creates a token for simple one-character tokens (e.g., operators, parentheses, braces).
      • expect: Verifies that the next character matches an expected character and creates a token if so.
      • follow: Checks whether the next character matches an expected character and creates a token named ifyes if it does or ifno otherwise. This is how two-character operators such as <= and == are separated from < and =; a standalone sketch appears after this list.
      • divide_or_comment: Handles the ambiguity between division and comments.
      • char_lit: Lexes a character literal (enclosed in single quotes).
      • string_lit: Lexes a string literal (enclosed in double quotes).
      • identifier: Lexes an identifier, which can be either a keyword or a user-defined identifier.
      • integer_lit: Lexes an integer literal.
  3. Input and Output:

    • The program accepts two command-line arguments: the input source and the output destination. If no arguments are provided, it defaults to reading from "stdin" and writing to "stdout."

    • It uses the with_IO function to handle input and output operations in a generic manner.

    • The program generates a table of tokens with their names, values, and source code locations. Each token is printed on a separate line.

  4. Compilation and Usage:

    • The program can be compiled with any compiler supporting C++17 or later (a concrete command line is shown after the source listing).
    • To use the program, provide the input source and output destination as command-line arguments (e.g., ./lexer source_code destination_file). If no arguments are provided, it will read from "stdin" and write to "stdout."
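
As promised above, here is a standalone sketch of the with_IO pattern from item 1. It is a simplified illustration, not the article's actual helper: file handling is omitted, with_IO_sketch and the upper-casing lambda are hypothetical stand-ins, and input is always read from stdin.

#include <cctype>        // std::toupper
#include <functional>    // std::invoke
#include <iostream>
#include <iterator>      // std::istreambuf_iterator
#include <string>
#include <utility>       // std::forward

// Read all of stdin, apply a string-to-string transformation, print the result.
template <class F>
void with_IO_sketch (F&& f)
{
    std::string input {std::istreambuf_iterator<char>(std::cin),
                       std::istreambuf_iterator<char>()};
    std::cout << std::invoke(std::forward<F>(f), input);
}

int main ()
{
    // The lambda stands in for the lexer: any callable taking and returning a string works.
    with_IO_sketch([](std::string s)
    {
        for (char& c : s)    c = std::toupper(static_cast<unsigned char>(c));
        return s;
    });
}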
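
Likewise, a standalone sketch of the follow idea from item 2: after seeing '<', one more character of lookahead decides between Op_lessequal and Op_less. The function lex_less and its string-based token names are illustrative simplifications of the real follow method.

#include <iostream>

// Decide between "<=" and "<" by peeking at the character after '<'.
// On "<x", the 'x' is left unconsumed for the next token, as in the real lexer.
const char* lex_less (const char*& p)
{
    ++p;                                            // consume '<'
    if (*p == '=')    { ++p; return "Op_lessequal"; }
    return "Op_less";
}

int main ()
{
    const char* a = "<=";    std::cout << lex_less(a) << '\n';    // prints Op_lessequal
    const char* b = "<1";    std::cout << lex_less(b) << '\n';    // prints Op_less
}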

Source code in the C++ programming language

#include <cctype>        // isalpha, isalnum, isdigit, isspace
#include <charconv>      // std::from_chars
#include <fstream>       // file_to_string, string_to_file
#include <functional>    // std::invoke
#include <iomanip>       // std::setw
#include <ios>           // std::left
#include <iostream>
#include <iterator>      // std::istreambuf_iterator
#include <map>           // keywords
#include <sstream>
#include <string>
#include <utility>       // std::forward
#include <variant>       // TokenVal

using namespace std;

// =====================================================================================================================
// Machinery
// =====================================================================================================================
string file_to_string (const string& path)
{
    // Open file
    ifstream file {path, ios::in | ios::binary | ios::ate};
    if (!file)   throw (errno);

    // Allocate string memory
    string contents;
    contents.resize(file.tellg());

    // Read file contents into string
    file.seekg(0);
    file.read(contents.data(), contents.size());

    return contents;
}

void string_to_file (const string& path, string contents)
{
    ofstream file {path, ios::out | ios::binary};
    if (!file)    throw (errno);

    file.write(contents.data(), contents.size());
}

template <class F>
void with_IO (string source, string destination, F&& f)
{
    string input;

    // Read the entire stream, not just the first line
    if (source == "stdin")    input.assign(istreambuf_iterator<char>(cin), istreambuf_iterator<char>());
    else                      input = file_to_string(source);

    string output = invoke(forward<F>(f), input);

    if (destination == "stdout")    cout << output;
    else                            string_to_file(destination, output);
}

// Add escaped newlines and backslashes back in for printing
string sanitize (string s)
{
    for (auto i = 0u; i < s.size(); ++i)
    {
        if      (s[i] == '\n')    s.replace(i++, 1, "\\n");
        else if (s[i] == '\\')    s.replace(i++, 1, "\\\\");
    }

    return s;
}

class Scanner
{
public:
    const char* pos;
    int         line   = 1;
    int         column = 1;

    Scanner (const char* source) : pos {source} {}

    inline char peek ()    { return *pos; }

    void advance ()
    {
        if (*pos == '\n')    { ++line; column = 1; }
        else                 ++column;

        ++pos;
    }

    char next ()
    {
        advance();
        return peek();
    }

    void skip_whitespace ()
    {
        while (isspace(static_cast<unsigned char>(peek())))
            advance();
    }
}; // class Scanner


// =====================================================================================================================
// Tokens
// =====================================================================================================================
enum class TokenName
{
    OP_MULTIPLY, OP_DIVIDE, OP_MOD, OP_ADD, OP_SUBTRACT, OP_NEGATE,
    OP_LESS, OP_LESSEQUAL, OP_GREATER, OP_GREATEREQUAL, OP_EQUAL, OP_NOTEQUAL,
    OP_NOT, OP_ASSIGN, OP_AND, OP_OR,
    LEFTPAREN, RIGHTPAREN, LEFTBRACE, RIGHTBRACE, SEMICOLON, COMMA,
    KEYWORD_IF, KEYWORD_ELSE, KEYWORD_WHILE, KEYWORD_PRINT, KEYWORD_PUTC,
    IDENTIFIER, INTEGER, STRING,
    END_OF_INPUT, ERROR
};

using TokenVal = variant<int, string>;

struct Token
{
    TokenName name;
    TokenVal  value;
    int       line;
    int       column;
};


const char* to_cstring (TokenName name)
{
    static const char* s[] =
    {
        "Op_multiply", "Op_divide", "Op_mod", "Op_add", "Op_subtract", "Op_negate",
        "Op_less", "Op_lessequal", "Op_greater", "Op_greaterequal", "Op_equal", "Op_notequal",
        "Op_not", "Op_assign", "Op_and", "Op_or",
        "LeftParen", "RightParen", "LeftBrace", "RightBrace", "Semicolon", "Comma",
        "Keyword_if", "Keyword_else", "Keyword_while", "Keyword_print", "Keyword_putc",
        "Identifier", "Integer", "String",
        "End_of_input", "Error"
    };

    return s[static_cast<int>(name)];
}


string to_string (Token t)
{
    ostringstream out;
    out << setw(2) << t.line << "   " << setw(2) << t.column  << "   ";

    switch (t.name)
    {
        case (TokenName::IDENTIFIER)   : out << "Identifier        "   << get<string>(t.value);                  break;
        case (TokenName::INTEGER)      : out << "Integer           "   << left << get<int>(t.value);             break;
        case (TokenName::STRING)       : out << "String            \"" << sanitize(get<string>(t.value)) << '"'; break;
        case (TokenName::END_OF_INPUT) : out << "End_of_input";                                                  break;
        case (TokenName::ERROR)        : out << "Error             "   << get<string>(t.value);                  break;
        default                        : out << to_cstring(t.name);
    }

    out << '\n';

    return out.str();
}


// =====================================================================================================================
// Lexer
// =====================================================================================================================
class Lexer
{
public:
    Lexer (const char* source) : s {source}, pre_state {s} {}

    bool has_more ()    { return s.peek() != '\0'; }

    Token next_token ()
    {
        s.skip_whitespace();

        pre_state = s;

        switch (s.peek())
        {
            case '*'  :    return simply(TokenName::OP_MULTIPLY);
            case '%'  :    return simply(TokenName::OP_MOD);
            case '+'  :    return simply(TokenName::OP_ADD);
            case '-'  :    return simply(TokenName::OP_SUBTRACT);
            case '{'  :    return simply(TokenName::LEFTBRACE);
            case '}'  :    return simply(TokenName::RIGHTBRACE);
            case '('  :    return simply(TokenName::LEFTPAREN);
            case ')'  :    return simply(TokenName::RIGHTPAREN);
            case ';'  :    return simply(TokenName::SEMICOLON);
            case ','  :    return simply(TokenName::COMMA);
            case '&'  :    return expect('&', TokenName::OP_AND);
            case '|'  :    return expect('|', TokenName::OP_OR);
            case '<'  :    return follow('=', TokenName::OP_LESSEQUAL,    TokenName::OP_LESS);
            case '>'  :    return follow('=', TokenName::OP_GREATEREQUAL, TokenName::OP_GREATER);
            case '='  :    return follow('=', TokenName::OP_EQUAL,        TokenName::OP_ASSIGN);
            case '!'  :    return follow('=', TokenName::OP_NOTEQUAL,     TokenName::OP_NOT);
            case '/'  :    return divide_or_comment();
            case '\'' :    return char_lit();
            case '"'  :    return string_lit();

            default   :    if (is_id_start(s.peek()))    return identifier();
                           if (is_digit(s.peek()))       return integer_lit();
                           return error("Unrecognized character '", s.peek(), "'");

            case '\0' :    return make_token(TokenName::END_OF_INPUT);
        }
    }


private:
    Scanner s;
    Scanner pre_state;
    static const map<string, TokenName> keywords;


    template <class... Args>
    Token error (Args&&... ostream_args)
    {
        string code {pre_state.pos, (string::size_type)(s.pos - pre_state.pos)};

        ostringstream msg;
        (msg << ... << forward<Args>(ostream_args)) << '\n'
            << string(28, ' ') << "(" << s.line << ", " << s.column << "): " << code;

        if (s.peek() != '\0')    s.advance();

        return make_token(TokenName::ERROR, msg.str());
    }


    inline Token make_token (TokenName name, TokenVal value = 0)
    {
        return {name, value, pre_state.line, pre_state.column};
    }


    Token simply (TokenName name)
    {
        s.advance();
        return make_token(name);
    }


    Token expect (char expected, TokenName name)
    {
        if (s.next() == expected)    return simply(name);
        else                         return error("Unrecognized character '", s.peek(), "'");
    }


    Token follow (char expected, TokenName ifyes, TokenName ifno)
    {
        if (s.next() == expected)    return simply(ifyes);
        else                         return make_token(ifno);
    }


    Token divide_or_comment ()
    {
        if (s.next() != '*')    return make_token(TokenName::OP_DIVIDE);

        s.advance();    // step past the '*' that opened the comment

        while (s.peek() != '\0')
        {
            if (s.peek() != '*')    { s.advance(); continue; }

            // A '*' was seen; a following '/' closes the comment.
            // (Testing s.peek() == '*' && s.next() == '/' in one step would
            // mis-handle "**/", since the '*' consumed by s.next() could
            // itself begin the closing pair; here the character s.next()
            // returns is re-examined on the next iteration instead.)
            if (s.next() == '/')
            {
                s.advance();
                return next_token();
            }
        }

        return error("End-of-file in comment. Closing comment characters not found.");
    }


    Token char_lit ()
    {
        int n = s.next();

        if (n == '\'')    return error("Empty character constant");

        if (n == '\\')    switch (s.next())
                          {
                              case 'n'  :    n = '\n'; break;
                              case '\\' :    n = '\\'; break;
                              default   :    return error("Unknown escape sequence \\", s.peek());
                          }

        if (s.next() != '\'')    return error("Multi-character constant");

        s.advance();
        return make_token(TokenName::INTEGER, n);
    }


    Token string_lit ()
    {
        string text = "";

        while (s.next() != '"')
            switch (s.peek())
            {
                case '\\' :    switch (s.next())
                               {
                                   case 'n'  :    text += '\n'; continue;
                                   case '\\' :    text += '\\'; continue;
                                   default   :    return error("Unknown escape sequence \\", s.peek());
                               }

                case '\n' :    return error("End-of-line while scanning string literal."
                                            " Closing string character not found before end-of-line.");

                case '\0' :    return error("End-of-file while scanning string literal."
                                            " Closing string character not found.");

                default   :    text += s.peek();
            }

        s.advance();
        return make_token(TokenName::STRING, text);
    }


    static inline bool is_id_start (char c)    { return isalpha(static_cast<unsigned char>(c)) || c == '_'; }
    static inline bool is_id_end   (char c)    { return isalnum(static_cast<unsigned char>(c)) || c == '_'; }
    static inline bool is_digit    (char c)    { return isdigit(static_cast<unsigned char>(c));             }


    Token identifier ()
    {
        string text (1, s.peek());

        while (is_id_end(s.next()))    text += s.peek();

        auto i = keywords.find(text);
        if (i != keywords.end())    return make_token(i->second);

        return make_token(TokenName::IDENTIFIER, text);
    }


    Token integer_lit ()
    {
        while (is_digit(s.next()));

        if (is_id_start(s.peek()))
            return error("Invalid number. Starts like a number, but ends in non-numeric characters.");

        int n;

        auto r = from_chars(pre_state.pos, s.pos, n);
        if (r.ec == errc::result_out_of_range)    return error("Number exceeds maximum value");

        return make_token(TokenName::INTEGER, n);
    }
}; // class Lexer


const map<string, TokenName> Lexer::keywords =
{
    {"else",  TokenName::KEYWORD_ELSE},
    {"if",    TokenName::KEYWORD_IF},
    {"print", TokenName::KEYWORD_PRINT},
    {"putc",  TokenName::KEYWORD_PUTC},
    {"while", TokenName::KEYWORD_WHILE}
};


int main (int argc, char* argv[])
{
    string in  = (argc > 1) ? argv[1] : "stdin";
    string out = (argc > 2) ? argv[2] : "stdout";

    with_IO(in, out, [](string input)
    {
        Lexer lexer {input.data()};

        string s = "Location  Token name        Value\n"
                   "--------------------------------------\n";

        while (lexer.has_more())    s += to_string(lexer.next_token());
        return s;
    });
}
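
Since the code relies on C++17 features (std::variant, std::from_chars, non-const std::string::data), build with -std=c++17 or later. The file names below are illustrative:

g++ -std=c++17 -o lexer lexer.cpp

./lexer hello.t tokens.out     # read from a file, write to a file
./lexer < hello.t              # read from stdin, write to stdout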


  

You may also check: How to resolve the algorithm Empty program step by step in the Scilab programming language
You may also check: How to resolve the algorithm Fibonacci word/fractal step by step in the Phix programming language
You may also check: How to resolve the algorithm Color of a screen pixel step by step in the Groovy programming language
You may also check: How to resolve the algorithm Function composition step by step in the Python programming language
You may also check: How to resolve the algorithm Munching squares step by step in the Fōrmulæ programming language