How to resolve the algorithm Text processing/2 step by step in the Ruby programming language
Problem Statement
The following task concerns data from a pollution monitoring station with twenty-four instruments, each monitoring one aspect of air pollution. Periodically a record is added to the file. Each record is a line of 49 fields separated by white-space, which can be one or more space or tab characters. From the left, the fields are a datestamp followed by twenty-four repetitions of a floating-point instrument value and that instrument's associated integer flag. Flag values are >= 1 if the instrument is working and < 1 if there is some problem with it, in which case that instrument's value should be ignored. The full data file, readings.txt, is also used in the Text processing/1 task.
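To make the record layout concrete, here is a minimal sketch of splitting one record into its datestamp and twenty-four value/flag pairs. The helper names parse_record and all_instruments_good? are hypothetical, and the sample record is made up for illustration rather than taken from readings.txt.

```ruby
# Hypothetical helper: split one record into its datestamp and [value, flag] pairs.
def parse_record(line)
  fields = line.split   # no argument: split on runs of spaces and tabs
  date = fields.shift   # first field is the datestamp
  pairs = fields.each_slice(2).map { |value, flag| [value.to_f, flag.to_i] }
  [date, pairs]
end

# A record is fully usable only when it has 24 pairs and every flag is >= 1.
def all_instruments_good?(pairs)
  pairs.length == 24 && pairs.all? { |_value, flag| flag >= 1 }
end

# Made-up record (an assumption for illustration, not real station data).
SAMPLE_RECORD = (['2001-01-01'] + ['10.000', '1'] * 24).join("\t")

date, pairs = parse_record(SAMPLE_RECORD)
puts "#{date}: #{pairs.length} pairs, all good? #{all_instruments_good?(pairs)}"
```

Calling split without an argument handles both tabs and spaces, which matches the white-space rule in the problem statement.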
Let's start with the solution:
Step-by-step solution for Text processing/2 in the Ruby programming language
The Ruby program below reads a file of readings and checks each record for formatting errors, a wrong number of fields, duplicate datestamps, and readings flagged as invalid.
It uses the Set class to track unique values and groups the fields of each line two at a time (enum_slice in Ruby 1.8; the method is called each_slice in Ruby 1.9 and later).
With the debug flag set, the program prints a message for each problem line as it finds it. It always prints a summary: the duplicate dates, the dates with bad formatting or a bad field count, the number and percentage of records with good readings, and the total number of records.
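The two building blocks mentioned above, pairing fields and tracking seen values in a Set, can be sketched in isolation. The helper names pair_up and duplicate_dates are hypothetical and exist only for this illustration.

```ruby
require 'set'

# Group a flat list of fields into [value, flag] pairs.
def pair_up(fields)
  fields.each_slice(2).to_a
end

# Return the dates that occur more than once, in the order first repeated.
def duplicate_dates(dates)
  seen = Set.new
  dups = Set.new
  dates.each do |d|
    dups << d if seen.include?(d)   # already seen: record as a duplicate
    seen << d
  end
  dups.to_a
end

p pair_up(%w[1.5 1 2.5 -2])
# => [["1.5", "1"], ["2.5", "-2"]]
p duplicate_dates(%w[2001-01-01 2001-01-02 2001-01-01])
# => ["2001-01-01"]
```

A Set gives constant-time membership checks on average, which matters when a file holds thousands of datestamps.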
Here is a breakdown of the code:
- The require 'set' line loads the Set class from the Ruby standard library.
- The munge2 function takes a file object (and an optional debug flag) and processes the lines in the file, checking for formatting errors and missing or invalid data values.
- The datePat, valuPat, and statPat variables define regular expressions that match datestamps, instrument values, and status flags, respectively.
- The totalLines variable keeps track of the total number of lines processed.
- The dupdate, badform, and badlen variables are sets that collect the dates of duplicate records, lines with bad formatting, and lines with the wrong number of fields, respectively. The badreading variable is an integer counter of records that have at least one flag below 1.
- The datestamps set records every date seen so far.
- The main loop iterates over each line in the file.
- The fields array is populated by splitting the current line.
- The date variable is set to the first field, which is removed from the fields array with shift.
- The pairs array holds the remaining fields grouped two at a time as value/flag pairs.
- The lineFormatOk variable is truthy if the date matches the datePat regular expression and every pair matches the valuPat and statPat regular expressions.
- If lineFormatOk is false, the date is added to badform. With debug on, the line is also printed with a 'Bad formatting' message.
- If the number of pairs is not 24, the date is added to badlen. If any flag is less than 1, badreading is incremented. In either case, with debug on, the line is printed with a 'Missing values' message.
- If the date is already in the datestamps set, it is added to dupdate. With debug on, the line is printed with a 'Duplicate datestamp' message.
- The date is added to the datestamps set.
- After processing all of the lines, the program prints the duplicate dates, the dates with bad formatting, the dates with a bad number of fields, the number and percentage of records with good readings, and the total number of records.
- The block at the end of the program opens the readings.txt file for reading and passes the file object to the munge2 function.
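The per-line checks in the breakdown above can be sketched as a single function that returns a list of problems for one line. This is an illustrative variant, not the program's own code: the helper name line_problems is hypothetical, and the patterns here are anchored at both ends with \A and \z, which is stricter than patterns that only anchor the start of a field.

```ruby
# Assumed stricter patterns: \A and \z anchor both ends of each field.
DATE_PAT  = /\A\d{4}-\d{2}-\d{2}\z/
VALUE_PAT = /\A[-+]?\d+\.\d+\z/
FLAG_PAT  = /\A-?\d+\z/

# Hypothetical helper: classify one line, returning a list of problem symbols.
def line_problems(line)
  fields = line.split
  date = fields.shift
  pairs = fields.each_slice(2).to_a
  problems = []
  # to_s guards against nil when a field is missing.
  format_ok = DATE_PAT.match?(date.to_s) &&
              pairs.all? { |v, f| VALUE_PAT.match?(v.to_s) && FLAG_PAT.match?(f.to_s) }
  problems << :bad_format unless format_ok
  problems << :bad_length unless pairs.length == 24
  problems << :bad_reading if pairs.any? { |_v, f| f.to_i < 1 }
  problems
end

# Made-up record for illustration (not taken from readings.txt).
good = (['2001-01-01'] + ['10.000', '1'] * 24).join(' ')
p line_problems(good)                       # => []
p line_problems(good.sub('10.000', 'ten'))  # => [:bad_format]
```

Returning symbols instead of printing keeps the checks easy to test separately from the reporting.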
Source code in the Ruby programming language
require 'set'

# Scan the readings and report duplicate dates, malformed lines,
# and the number of records in which every instrument is good.
def munge2(readings, debug=false)
  datePat = /^\d{4}-\d{2}-\d{2}/
  valuPat = /^[-+]?\d+\.\d+/
  statPat = /^-?\d+/
  totalLines = 0
  # Three sets of offending dates, plus a counter of records with a bad reading.
  dupdate, badform, badlen, badreading = Set[], Set[], Set[], 0
  datestamps = Set[]
  readings.each_line do |line|
    totalLines += 1
    fields = line.split                  # split on runs of spaces or tabs
    date = fields.shift
    pairs = fields.each_slice(2).to_a    # [value, flag] pairs

    lineFormatOk = date =~ datePat &&
      pairs.all? { |x,y| x =~ valuPat && y =~ statPat }
    if !lineFormatOk
      puts 'Bad formatting ' + line if debug
      badform << date
    end

    wrongLength = pairs.length != 24
    badFlag = pairs.any? { |x,y| y.to_i < 1 }
    puts 'Missing values ' + line if debug && (wrongLength || badFlag)
    badlen << date if wrongLength
    badreading += 1 if badFlag

    if datestamps.include?(date)
      puts 'Duplicate datestamp ' + line if debug
      dupdate << date
    end
    datestamps << date
  end

  puts 'Duplicate dates:', dupdate.sort.map { |x| '  ' + x }
  puts 'Bad format:', badform.sort.map { |x| '  ' + x }
  puts 'Bad number of fields:', badlen.sort.map { |x| '  ' + x }
  puts 'Records with good readings: %i = %5.2f%%' % [
    totalLines-badreading, (totalLines-badreading)/totalLines.to_f*100 ]
  puts
  puts 'Total records: %d' % totalLines
end

File.open('readings.txt','r') do |readings|
  munge2(readings)
end
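The summary line divides the count of good records by the total, which yields NaN for an empty file. A minimal sketch of the same arithmetic with a guard for that case follows. The helper name good_readings_summary is hypothetical, and the numbers are made up for illustration.

```ruby
# Hypothetical helper: format the "good readings" summary line.
def good_readings_summary(total, bad)
  good = total - bad
  # Avoid 0 / 0.0 (NaN) when no lines were read.
  percent = total.zero? ? 0.0 : good * 100.0 / total
  format('Records with good readings: %d = %5.2f%%', good, percent)
end

puts good_readings_summary(200, 50)
# prints: Records with good readings: 150 = 75.00%
```

The doubled %% in the format string emits a literal percent sign.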