How to resolve the algorithm Text processing/2 step by step in the Ruby programming language

Problem Statement
Step by Step Solution
Sourcecode

Problem Statement

The following task concerns data that came from a pollution monitoring station with twenty-four instruments monitoring twenty-four aspects of pollution in the air. Periodically a record is added to the file, each record being a line of 49 fields separated by white-space, which can be one or more space or tab characters. The fields (from the left) are a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with it, in which case that instrument's value should be ignored. The full data file, readings.txt, is also used in the Text processing/1 task.
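
As a rough illustration of the record layout described above, the sketch below splits one record into its datestamp and its value/flag pairs. The sample line is made up for illustration (it is not taken from readings.txt), and the helper name parse_record is hypothetical.

```ruby
# Hypothetical helper: split one record into a datestamp and [value, flag] pairs.
def parse_record(line)
  fields = line.split   # split with no argument breaks on runs of spaces and tabs
  date = fields.shift   # the first field is the datestamp
  pairs = fields.each_slice(2).map { |value, flag| [value.to_f, flag.to_i] }
  [date, pairs]
end

# Made-up sample record: 24 readings, with the third instrument flagged as faulty.
readings = Array.new(24) { "10.000\t1" }
readings[2] = "0.000  -2"
sample = (["2004-12-31"] + readings).join("\t")

date, pairs = parse_record(sample)
usable = pairs.select { |_value, flag| flag >= 1 }
puts "#{date}: #{usable.length} of #{pairs.length} readings usable"
# prints: 2004-12-31: 23 of 24 readings usable
```

A record with 49 fields yields exactly 24 pairs; any other pair count indicates a malformed line.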

Let's start with the solution:

Step by Step solution about How to resolve the algorithm Text processing/2 step by step in the Ruby programming language

The Ruby program below reads a file of readings and checks it for formatting errors, duplicate datestamps, and missing or invalid data values.

It uses the Set class to track unique values and slices the fields of each line two at a time. That slicing method was called enum_slice in Ruby 1.8; in Ruby 1.9 and later it is each_slice.
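
To make those two building blocks concrete, here is a small sketch that is independent of the program itself. It uses each_slice, the name under which pair-wise slicing is available in current Ruby, together with Set. The field values and dates are made up.

```ruby
require 'set'

fields = %w[1.5 1 2.5 0 3.5 1]        # made-up value/flag fields
pairs = fields.each_slice(2).to_a     # group consecutive fields two at a time
p pairs                               # [["1.5", "1"], ["2.5", "0"], ["3.5", "1"]]

seen = Set[]
duplicates = Set[]
%w[2004-01-01 2004-01-02 2004-01-01].each do |date|
  duplicates << date if seen.include?(date)   # already seen, so it is a repeat
  seen << date
end
p duplicates.to_a                     # ["2004-01-01"]
```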

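When its debug argument is true, the program prints a message for each problem line it finds. In all cases it prints a summary: the duplicate dates, the dates of badly formatted lines, the dates of lines with the wrong number of fields, the number and percentage of records with good readings, and the total number of records.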
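
The percentage in that summary is plain arithmetic on two counters. A minimal sketch with made-up counts, using the same format string as the program:

```ruby
total_lines = 8     # made-up count of records
bad_readings = 2    # made-up count of records with at least one flag below 1

good = total_lines - bad_readings
percent = good / total_lines.to_f * 100   # to_f avoids integer division
summary = 'Records with good readings: %i = %5.2f%%' % [good, percent]
puts summary
# prints: Records with good readings: 6 = 75.00%
```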

Here is a breakdown of the code:

-The require 'set' line loads the Set class from the Ruby standard library.

-The munge2 function takes a file object and an optional debug flag (false by default) and processes the lines in the file, checking for errors and missing or invalid data values.

-The datePat, valuPat, and statPat variables define regular expressions that are used to match datestamps, instrument values, and status flags, respectively.

-The totalLines variable keeps track of the total number of lines processed.

-The dupdate, badform, and badlen variables are sets that collect the dates of duplicate records, of lines with bad formatting, and of lines with the wrong number of fields, respectively. The badreading variable is an integer that counts the records containing at least one flag below 1.

-The datestamps set is used to track the dates seen so far.

-The for line in readings loop iterates over each line in the file.

-The fields array is populated by splitting the current line on tab characters.

-The date variable is set to the first field, which is removed from the fields array.

-The pairs array is populated with the remaining fields grouped two at a time, each pair holding a value and its flag.

-The lineFormatOk variable is truthy if the date matches the datePat regular expression and every pair matches the valuPat and statPat regular expressions.

-If lineFormatOk is false, the date is added to badform, and the line is printed with a "Bad formatting" message when debug is true.

-If the number of pairs is not 24 or any flag is less than 1, the line is printed with a "Missing values" message when debug is true. Independently of debug, the date is added to badlen when the number of pairs is not 24, and badreading is incremented when any flag is less than 1.

-If the date is already in the datestamps set, it is added to dupdate, and the line is printed with a "Duplicate datestamp" message when debug is true.

-The date is added to the datestamps set.

-After processing all of the lines in the file, the program prints the duplicate dates, the dates of badly formatted lines, the dates of lines with the wrong number of fields, the number and percentage of records with good readings, and the total number of records.

-The open('readings.txt','r') do |readings| block opens the readings.txt file for reading and passes the file object to the munge2 function.
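
The per-line checks in the list above can be sketched in isolation. The helper name check_line and the symbols it returns are hypothetical, and unlike munge2 it classifies a single line instead of accumulating totals. It also splits on any whitespace, as the problem statement allows, which is an assumption that differs from the tab-only split in the program. The sample lines are made up.

```ruby
DATE_PAT  = /^\d{4}-\d{2}-\d{2}/
VALUE_PAT = /^[-+]?\d+\.\d+/
FLAG_PAT  = /^-?\d+/

# Hypothetical helper: return the list of problems found in one record.
def check_line(line)
  fields = line.split                  # assumption: split on runs of spaces or tabs
  date = fields.shift
  pairs = fields.each_slice(2).to_a
  problems = []
  format_ok = date =~ DATE_PAT &&
              pairs.all? { |value, flag| value =~ VALUE_PAT && flag =~ FLAG_PAT }
  problems << :bad_format unless format_ok
  problems << :bad_length if pairs.length != 24
  problems << :bad_reading if pairs.any? { |_value, flag| flag.to_i < 1 }
  problems
end

# Made-up record with 24 good readings.
good_line = "2004-12-31\t" + (["10.000\t1"] * 24).join("\t")

p check_line(good_line)                                   # []
p check_line(good_line.sub("10.000\t1", "10.000\t-2"))    # [:bad_reading]
p check_line("2004-12-31\t10.000\t1")                     # [:bad_length]
```

Returning a list of symbols keeps the three checks independent, so a line can be reported for more than one problem, as in the program.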

Source code in the Ruby programming language

require 'set'

def munge2(readings, debug=false)
   datePat = /^\d{4}-\d{2}-\d{2}/
   valuPat = /^[-+]?\d+\.\d+/
   statPat = /^-?\d+/
   totalLines = 0
   dupdate, badform, badlen, badreading = Set[], Set[], Set[], 0
   datestamps = Set[]
   for line in readings
      totalLines += 1
      fields = line.split(/\t/)
      date = fields.shift
      pairs = fields.each_slice(2).to_a
 
      lineFormatOk = date =~ datePat &&
        pairs.all? { |x,y| x =~ valuPat && y =~ statPat }
      if !lineFormatOk
         puts 'Bad formatting ' + line if debug
         badform << date
      end
         
      if pairs.length != 24 ||
           pairs.any? { |x,y| y.to_i < 1 }
         puts 'Missing values ' + line if debug
      end
      if pairs.length != 24
         badlen << date
      end
      if pairs.any? { |x,y| y.to_i < 1 }
         badreading += 1
      end
 
      if datestamps.include?(date)
         puts 'Duplicate datestamp ' + line if debug
         dupdate << date
      end

      datestamps << date
   end

   puts 'Duplicate dates:', dupdate.sort.map { |x| '  ' + x }
   puts 'Bad format:', badform.sort.map { |x| '  ' + x }
   puts 'Bad number of fields:', badlen.sort.map { |x| '  ' + x }
   puts 'Records with good readings: %i = %5.2f%%' % [
      totalLines-badreading, (totalLines-badreading)/totalLines.to_f*100 ]
   puts
   puts 'Total records:  %d' % totalLines
end

open('readings.txt','r') do |readings|
   munge2(readings)
end
