How to resolve the algorithm Web scraping step by step in the Go programming language
Problem Statement
Create a program that downloads the time from this URL: http://tycho.usno.navy.mil/cgi-bin/timer.pl and then prints the current UTC time by extracting just the UTC time from the web page's HTML. Alternatively, if the above URL is not working, grab the first date/time off this page's talk page.
If possible, only use libraries that come at no extra monetary cost with the programming language and that are widely available and popular such as CPAN for Perl or Boost for C++.
Let's start with the solution:
Step-by-step solution: Web scraping in the Go programming language
The program fetches a web page, extracts the current UTC time from it, and prints the result to the console. It is written in Go, a programming language developed by Google, and uses only packages from the standard library.
The main function starts with an HTTP GET request, made with http.Get, to fetch the content of the page. The response is stored in the variable resp. If the request fails, the program prints the error and returns. Otherwise it defers resp.Body.Close() so that the response body is closed when main exits.
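The fetch step can be sketched in isolation. This is a minimal sketch, not the program's own code: the helper names fetch and demoServer are hypothetical, the status-code check is an addition the original program does not make, and a local httptest server stands in for the remote page so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetch downloads url and returns the response body as a string.
// Unlike the original program, it also rejects non-200 responses (an assumption).
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err // connection or request failure
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err // read failure
	}
	return string(body), nil
}

// demoServer is a hypothetical local stand-in for the remote time page.
func demoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<BR>Jul. 4, 12:34:56 UTC\n")
	}))
}

func main() {
	srv := demoServer()
	defer srv.Close()

	body, err := fetch(srv.URL)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Print(body)
}
```

Reading the whole body into memory is reasonable for a small page like this one; the original program instead streams the body straight into the XML decoder.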
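Next, the program wraps the response body in an XML decoder (xml.NewDecoder) and reads it one token at a time with RawToken, which does not verify that start and end tags match. For each token that is character data (xml.CharData), it searches for the byte sequence "UTC". When a match is found, the whole run of text is stored in the string variable us, and the offset of "UTC" within it is stored in ux. If the input ends before a match is found, the program prints "UTC not found" and returns; any other read or parse error is printed and also ends the program.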
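The token scan described above can be tried on a fixed string. This is a minimal sketch under stated assumptions: findUTC is a hypothetical helper, and the sample markup only imitates what the time page might return.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// findUTC scans page token by token and returns the first run of character
// data containing "UTC", together with the byte offset of "UTC" within it.
func findUTC(page string) (string, int, error) {
	dec := xml.NewDecoder(strings.NewReader(page))
	for {
		// RawToken does not check that start and end tags match,
		// which helps with loosely structured HTML such as a bare <BR>.
		tok, err := dec.RawToken()
		if err == io.EOF {
			return "", -1, errors.New("UTC not found")
		}
		if err != nil {
			return "", -1, err // read or parse failure
		}
		if cd, ok := tok.(xml.CharData); ok {
			if i := bytes.Index(cd, []byte("UTC")); i != -1 {
				// string(cd) copies the bytes out of the decoder's buffer.
				return string(cd), i, nil
			}
		}
	}
}

func main() {
	// Hypothetical sample markup, not the real page.
	page := "<html><body><BR>Jul. 4, 12:34:56 UTC\t\tUniversal Time\n" +
		"<BR>Jul. 4, 08:34:56 AM EDT</body></html>"

	text, idx, err := findUTC(page)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(text[:idx+3]) // the text up to and including "UTC"
}
```

Real-world HTML can still trip the XML decoder, for example on entities it does not know, so a dedicated HTML parser may be more robust for other pages.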
The program then attempts to parse the part of us that ends with "UTC" (us[:ux+3]) with time.Parse, using the layout "Jan. 2, 15:04:05 UTC". If parsing succeeds, it prints the time formatted as "January 2, 15:04:05" and returns.
If the parsing fails, the program falls back to a regular expression that searches us for anything that looks like a time of the form hh:mm:ss. If a match is found, it prints the matched time and returns.
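The fallback can be sketched as follows. The pattern is the one used in the program; findTime is a hypothetical helper and the inputs are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// timeRE matches anything shaped like h:mm:ss or hh:mm:ss.
// Seconds may reach 60-69, which is loose but tolerates a leap second.
var timeRE = regexp.MustCompile(`[0-2]?[0-9]:[0-5][0-9]:[0-6][0-9]`)

// findTime returns the first time-like substring of s, or "" if there is none.
func findTime(s string) string {
	return timeRE.FindString(s)
}

func main() {
	if t := findTime("garbled 12:34:56 UTC text"); t != "" {
		fmt.Println("found UTC:", t)
	}
}
```

The pattern does not validate the hour (it would accept 29:00:00), which is acceptable for a best-effort fallback but worth tightening if the result feeds further processing.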
As a last resort, if no time-like string is found, the program prints the entire run of text containing "UTC", hoping that it contains a human-readable time somewhere.
Source code in the Go programming language
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"net/http"
	"regexp"
	"time"
)

func main() {
	resp, err := http.Get("http://tycho.usno.navy.mil/cgi-bin/timer.pl")
	if err != nil {
		fmt.Println(err) // connection or request failure
		return
	}
	defer resp.Body.Close()

	var us string // text run containing "UTC"
	var ux int    // offset of "UTC" within us
	utc := []byte("UTC")
	for p := xml.NewDecoder(resp.Body); ; {
		t, err := p.RawToken()
		switch err {
		case nil:
		case io.EOF:
			fmt.Println("UTC not found")
			return
		default:
			fmt.Println(err) // read or parse failure
			return
		}
		if ub, ok := t.(xml.CharData); ok {
			if ux = bytes.Index(ub, utc); ux != -1 {
				// success: found a line with the string "UTC"
				us = string(ub)
				break
			}
		}
	}

	// first thing to try: parsing the expected date format
	if t, err := time.Parse("Jan. 2, 15:04:05 UTC", us[:ux+3]); err == nil {
		fmt.Println("parsed UTC:", t.Format("January 2, 15:04:05"))
		return
	}

	// fallback: search for anything looking like a time and print that
	tx := regexp.MustCompile("[0-2]?[0-9]:[0-5][0-9]:[0-6][0-9]")
	if justTime := tx.FindString(us); justTime != "" {
		fmt.Println("found UTC:", justTime)
		return
	}

	// last resort: just print the whole element containing "UTC" and hope
	// there is a human readable time in there somewhere.
	fmt.Println(us)
}