How Is Golang Used for Web Scraping with Concurrency?

Concurrency is useful when you need to collect data from several pages and keep it up to date. Without it, requests form a queue: each page must finish before the next one starts. With concurrency, you can scrape numerous pages at once.

Data Flow Designs

When scraping many websites sequentially, you must wait for each page to finish before moving on to the next. The data flow diagram shows how ordinary scraping works.

On the other hand, a concurrent web scraper can fetch several pages at the same time within a single incoming request. The data flow below shows how this differs from traditional scraping.
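To make the difference concrete, here is a minimal sketch (not part of the final project; the URLs are placeholders) that fetches a list of pages concurrently with goroutines and a sync.WaitGroup. The sequential version would simply call http.Get inside a plain loop.

package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    // Placeholder URLs; replace them with the pages you want to scrape.
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    var wg sync.WaitGroup
    for _, u := range urls {
        u := u // capture the loop variable for the goroutine
        wg.Add(1)
        go func() {
            defer wg.Done()
            resp, err := http.Get(u)
            if err != nil {
                fmt.Println("request failed:", u, err)
                return
            }
            defer resp.Body.Close()
            fmt.Println(u, resp.Status)
        }()
    }

    // Block until every page has been fetched.
    wg.Wait()
}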

Use Cases

There are various reasons to employ concurrency when scraping. In this case, we will collect historical exchange rates for a given currency pair. Extracting a month of historical data requires 30 requests, and the loading time grows in proportion to the number of requests.

Here, we will look up the historical exchange rate between two currencies over a custom date range.

We will also compare the speed of the non-concurrent version against the concurrent one.

Code Highlights

The final code is fairly large, so only the important parts are shown in this blog; please check the full source on GitHub.

GITHUB Go-currency

To access a page’s content, you can use Go’s net/http package and build the request with the NewRequestWithContext function.
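As a minimal sketch (the URL is only a placeholder), creating and sending a context-aware GET request looks like this:

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "time"
)

func main() {
    // Cancel the request automatically if it takes longer than 10 seconds.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
    if err != nil {
        log.Fatal(err)
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    fmt.Println("status:", resp.Status)
}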

You can use the Go library goquery to parse HTML and select tags with CSS-style selectors. The source below has more information on how to install and use it.

GITHUB Go-query-master
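As a quick illustration (the URL and selector here are placeholders, not taken from the project), goquery loads a response body into a document and lets you iterate over matching elements:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the response body into a goquery document.
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Print the text and href of every link on the page.
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        fmt.Println(s.Text(), "->", href)
    })
}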

Code Implementation

First, use the goquery package to build a reusable function that fetches the page content and parses it into an HTML document. To call this function, supply the destination URL, the HTTP method, and any other parameters that may be required (headers, form data, cookies, etc.).

package utils

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strings"
    "time"

    "github.com/PuerkitoBio/goquery"
)

// GetPage calls the target page with an HTTP request and parses the response body into an HTML document.
func GetPage(ctx context.Context, method, siteURL string, cookies []*http.Cookie, headers, formDatas map[string]string, timeout int) (*goquery.Document, []*http.Cookie, error) {
    // This function can handle all HTTP methods.
    // Initiate the body as nil for methods that don't require a body.
    body := io.Reader(nil)

    // If the request contains form data, encode the parameters into the body.
    // Note: the caller should usually also pass a "Content-Type: application/x-www-form-urlencoded"
    // header via the headers parameter in that case.
    if len(formDatas) > 0 {
        form := url.Values{}
        for k, v := range formDatas {
            form.Add(k, v)
        }
        body = strings.NewReader(form.Encode())
    }

    // Create a new HTTP request with context.
    req, err := http.NewRequestWithContext(ctx, method, siteURL, body)
    if err != nil {
        return nil, nil, fmt.Errorf("failed to create http request context: %w", err)
    }

    // If the request contains headers, add the header parameters.
    if len(headers) > 0 {
        for k, v := range headers {
            req.Header.Add(k, v)
        }
    }

    // If the request contains cookies, add the cookie parameters.
    if len(cookies) > 0 {
        for _, c := range cookies {
            req.AddCookie(c)
        }
    }

    // Use the default timeout if the timeout parameter isn't configured.
    reqTimeout := 10 * time.Second
    if timeout != 0 {
        reqTimeout = time.Duration(timeout) * time.Second
    }

    // Use the default http Client.
    httpClient := &http.Client{
        Transport:     http.DefaultTransport,
        CheckRedirect: nil,
        Jar:           nil,
        Timeout:       reqTimeout,
    }

    // Execute the request.
    resp, err := httpClient.Do(req)
    if err != nil {
        return nil, nil, fmt.Errorf("failed to execute http request: %w", err)
    }

    // Close the response body when the function returns.
    defer func() { _ = resp.Body.Close() }()

    // Parse the response body into an HTML document reader.
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, nil, fmt.Errorf("failed to parse html: %w", err)
    }

    // Return the HTML document and the response cookies.
    return doc, resp.Cookies(), nil
}

We’ll then attempt to parse the currency value. Our service will include criteria such as “from,” “to,” and “date.”

The complete scraping code to obtain the currency value based on these parameters is shown below. Because this function only scrapes a single day, we will need another function to iterate over the date range.

// getCurrencyHistory gets the currency value on a specific date.
func getCurrencyHistory(ctx context.Context, from, to, date string) (*entities.CurrencyHistory, error) {
    urlValues := url.Values{
        "from":   {to}, // Reverse `from` and `to` to make the currency value easier to parse.
        "amount": {"1"},
        "date":   {date},
    }

    siteURL := fmt.Sprintf("https://www.x-rates.com/historical/?%s", urlValues.Encode())

    // Scrape the page.
    doc, _, err := utils.GetPage(ctx, http.MethodGet, siteURL, nil, nil, nil, 0)
    if err != nil {
        return nil, err
    }

    var currencyHistory *entities.CurrencyHistory

    // Scrape the currency value.
    doc.Find(".ratesTable tbody tr td").EachWithBreak(func(i int, s *goquery.Selection) bool {
        // Scrape the href attribute value from the `a` tag, e.g.
        // https://www.x-rates.com/graph/?from=JPY&to=IDR
        // The second return value (exists) is ignored because the href is checked on the next line.
        href, _ := s.Find("a").Attr("href")

        // Reverse `from` and `to` to make the currency value easier to parse.
        if !strings.Contains(href, "to="+from) {
            return true
        }

        // If the target currency matches, scrape the text value.
        valueString := s.Find("a").Text()
        value, err := strconv.ParseFloat(valueString, 64)
        if err != nil {
            return true
        }

        currencyHistory = &entities.CurrencyHistory{
            Date:  date,
            Value: value,
        }

        return false
    })

    return currencyHistory, nil
}

The concurrency part must then be implemented. We iterate over the date range and use an errgroup (golang.org/x/sync/errgroup) so that errors from the goroutines can be collected. The final code looks like this:

// getCurrencyHistories gets the currency values over a date range.
func getCurrencyHistories(ctx context.Context, start, end time.Time, from, to string) ([]*entities.CurrencyHistory, error) {
    // Get the number of days between the start and end dates.
    days := int(end.Sub(start).Hours()/24) + 1

    currencyHistories := make([]*entities.CurrencyHistory, days)

    eg, ctx := errgroup.WithContext(ctx)

    idx := 0
    for d := start; !d.After(end); d = d.AddDate(0, 0, 1) {
        // Define new variables to avoid capturing the wrong loop values in the goroutine.
        d := d
        i := idx

        // Concurrently get the value on a specific date.
        eg.Go(func() (err error) {
            currencyHistory, err := getCurrencyHistory(ctx, from, to, d.Format("2006-01-02"))
            currencyHistories[i] = currencyHistory
            return err
        })

        idx++
    }

    // Wait until all requests have finished and check for errors.
    if err := eg.Wait(); err != nil {
        return nil, err
    }

    return currencyHistories, nil
}
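One design note: the version above starts one goroutine per day, which is fine for 30 requests but can overwhelm the target site for larger ranges. As a hedged sketch (not part of the original project, and assuming golang.org/x/sync v0.1.0 or newer), errgroup's SetLimit can cap how many requests run at once:

// A variant of the loop above with bounded concurrency.
eg, ctx := errgroup.WithContext(ctx)
eg.SetLimit(10) // at most 10 scraping goroutines run at the same time

idx := 0
for d := start; !d.After(end); d = d.AddDate(0, 0, 1) {
    d := d
    i := idx

    // eg.Go blocks here until a slot is free, keeping the request rate bounded.
    eg.Go(func() error {
        currencyHistory, err := getCurrencyHistory(ctx, from, to, d.Format("2006-01-02"))
        currencyHistories[i] = currencyHistory
        return err
    })

    idx++
}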

Next, we compare the performance with and without concurrency. Without concurrency, the code looks like this:

// getCurrencyHistories gets the currency values over a date range.
func getCurrencyHistories(ctx context.Context, start, end time.Time, from, to string) ([]*entities.CurrencyHistory, error) {
    // Get the number of days between the start and end dates.
    days := int(end.Sub(start).Hours()/24) + 1
    currencyHistories := make([]*entities.CurrencyHistory, days)

    idx := 0
    for d := start; !d.After(end); d = d.AddDate(0, 0, 1) {
        // Get the value for each date sequentially.
        currencyHistory, err := getCurrencyHistory(ctx, from, to, d.Format("2006-01-02"))
        if err != nil {
            return nil, err
        }

        currencyHistories[idx] = currencyHistory
        idx++
    }

    return currencyHistories, nil
}

Benchmarking

Multiple requests with varying numbers of date queries (1, 2, 5, 10, 20, and 30) are tested for both the non-concurrent and concurrent versions. To reproduce the benchmark, launch the service and make a request to the currency history endpoint. The number of queries is determined by the number of days between the start and end dates.

For example, suppose you have ten queries:

v1/currency/history?from=IDR&to=JPY&start_date=2022-03-01&end_date=2022-03-10
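A simple way to measure the response time is to wrap the request in a small Go snippet. This is a minimal sketch, assuming the service runs locally on port 8080 (the host and port are placeholders):

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "time"
)

func main() {
    url := "http://localhost:8080/v1/currency/history?from=IDR&to=JPY&start_date=2022-03-01&end_date=2022-03-10"

    start := time.Now()
    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Drain the body so the full response is included in the measurement.
    if _, err := io.Copy(io.Discard, resp.Body); err != nil {
        log.Fatal(err)
    }

    fmt.Printf("status=%s elapsed=%s\n", resp.Status, time.Since(start))
}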

The test is performed on a device with the following specifications:

Mac mini (M1, 2020)
Apple M1 chip
16 GB memory
macOS Monterey version 12.3
Internet speed: 42.53 Mbps download, 15.34 Mbps upload, 29 ms ping
Internet region: Indonesia

As predicted, the response time without concurrency grows steadily as the number of requests increases. With concurrency, the response time rises only slightly.

Note that response times may vary based on the state of the website, traffic, internet speed, geography, and other factors.

For more queries or web scraping services, contact iWeb Scraping today and request a quote!

