Web Scraping with Node.js

Before you begin this tutorial, there is one thing to keep in mind: depending on the source, web scraping can in fact be illegal. So do read the terms and conditions of the website you want to scrape data from.

Node libraries we will be using:

  • Request
  • Cheerio

Set Up a Node Project

Create a directory and give it a name (this will be used as the project name), then run the following command inside it:

npm init -y && npm i request cheerio 

Basic Project Setup

The idea behind the code is simple: we use the request module to make an HTTP GET request, then hand the response body to the cheerio module, which parses it so that we can pull out the data we are looking for.

let request = require('request')
let cheerio = require('cheerio')

const uri = 'https://en.wikipedia.org/wiki/Compass_rose';

let options = {
    uri: uri,
    headers: {
        // Identify the request as coming from a regular desktop browser
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    }
}

request.get(options, function (err, response, body) {
    if (err) {
        throw err;
    }

    // body holds the raw HTML of the page
    console.log(body)
})

The only thing worth mentioning here is the headers object. It can be populated with more values to make it look like the request is coming from a browser. The one key we set, User-Agent, essentially tells the web server which operating system and browser the request claims to come from.

Why do we need this? Some web servers respond differently (or not at all) to requests that do not appear to come from a browser, so identifying ourselves as one often gets us the correct response. You may also want to pass additional options in the request object; for example, some websites serve compressed responses (e.g. gzip), in which case you can pass gzip: true in the options object.
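For example, to ask the server for a compressed response and have request decompress it transparently, the options object could look like this (a minimal sketch; gzip: true is a standard option of the request module):

let options = {
    uri: 'https://en.wikipedia.org/wiki/Compass_rose',
    gzip: true, // send Accept-Encoding: gzip and decompress the body automatically
    headers: {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    }
}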

We now have the HTML in body. The next step is to load it into cheerio, which we can do like this:

const $ = cheerio.load(body);
console.log($('h1').text())

Here we are selecting the h1 tag in the body, and text() returns the text value associated with the selector, in this case the h1 tag. Printed to the console, it should read “Compass rose” (the h1 value inside the HTML body of our URI).

You should know the structure of the HTML page before you begin to scrape data. To learn this structure, use your browser's dev tools: simply right-click an element on the web page and click Inspect Element. For example, inspecting the main heading shows us the h1 element and its position in the document.

Now if you right-click the highlighted element in dev tools, you can copy its selector. You can then pass this selector value to cheerio to get the same value as we got earlier. Always remember: if the structure of the HTML changes, your scraper may not work as intended.
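For instance, on this Wikipedia page dev tools copies the main heading's selector as #firstHeading (an assumption based on Wikipedia's current markup), and passing it to cheerio yields the same text as our earlier tag selector:

// Both lines should print "Compass rose"
console.log($('#firstHeading').text())
console.log($('h1').text())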

Using Cheerio

At this point we know that to extract or scrape data with cheerio, we need a selector. It can be a path, an id, in fact anything that uniquely identifies the element we are after.

To extract the src of the image on the page, first get the selector for the image using dev tools:

let imgSrc = $('#mw-content-text > div > div:nth-child(4) > div > a > img').attr('src')

The imgSrc variable now holds the src URL for the image. To get all the images on the page:

// Select every img tag, then keep only those that have a src attribute
$('img').each(function (i, img) {
    if (img.attribs && img.attribs.src) {
        console.log(img.attribs.src)
    }
})

This prints out all the images on the page: using img as the selector gives us every img tag, and we then keep only the ones that have a src attribute and print those to the console.

We can get the length of a list on the HTML page by first getting a unique selector for the list, then using find() to get all the list items, and finally taking the length of the result:

let lengthOfList = $('#toc').find('li').length
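We can then print the result (the exact number depends on the live page):

console.log(lengthOfList) // number of entries in the table of contents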

Using the parent(), children(), siblings(), next() and prev() Methods

Let's pick a paragraph at random, grab its selector from dev tools, and use it to try out each of these methods:

const para = $('#mw-content-text > div > p:nth-child(26)')

let main = para.parent().text()

let child = para.children().text()

let sibling = para.siblings().text()

let nextPara = para.next().text()

let prevPara = para.prev().text()

The parent() method returns the parent of the selected element; in our case, calling text() on it prints the text of the parent element.

The children() method returns all the elements nested within the selected element in the DOM.

The siblings() method returns all the elements that sit at the same level as the selected element in the DOM.

The next() method returns the immediate next sibling of the selected element.

The prev() method returns the immediately preceding sibling of the selected element.
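To see these methods in action without depending on the live page, here is a tiny self-contained example (a toy HTML string, not the Wikipedia page):

const $demo = cheerio.load('<div><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>')

console.log($demo('#b').parent().text())   // "onetwothree" — the whole div
console.log($demo('#b').children().text()) // ""            — the paragraph has no child elements
console.log($demo('#b').siblings().text()) // "onethree"    — #a and #c
console.log($demo('#b').next().text())     // "three"
console.log($demo('#b').prev().text())     // "one"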

Now that we have a basic understanding of scraping, we can use this knowledge to build a sophisticated scraper dedicated to a particular website. Additionally, we can use a headless browser like Puppeteer in conjunction with the Cheerio library to build an even more advanced web scraper with Node.js; check out this article to learn how to use Puppeteer.
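As a taste of that approach, here is a minimal sketch (assuming Puppeteer is installed with npm i puppeteer): Puppeteer renders the page, including any JavaScript-generated content, and Cheerio parses the rendered HTML.

const puppeteer = require('puppeteer')
const cheerio = require('cheerio')

async function scrape(uri) {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()
    await page.goto(uri)
    const html = await page.content() // the fully rendered HTML
    await browser.close()

    const $ = cheerio.load(html)
    console.log($('h1').text())
}

scrape('https://en.wikipedia.org/wiki/Compass_rose')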

In the end, do keep in mind that scraping can be illegal depending on the source, so read the terms and conditions of a website before scraping its data.
