Web Scraping with Node.js

Bruteforce
5 min read · Oct 13, 2020

I will teach you how to scrape content from a dynamic website and make your own API.

Environment — WINDOWS

Introduction

In this tutorial, we’ll learn how to:
1. Set up a Node.JS Server (with Express)
2. Scrape the internet for the data that we want
3. Extract the information and format it for the user
4. Expose it using a RESTful API that can be consumed by other applications
5. Deploy it on Heroku
6. Resolve the CORS issue

Requirements

Alright, so first and foremost, we’ll need the base server that runs the entire system. If you haven’t already guessed it, it’s Node.JS.

But before that, let’s sync up with the bare necessities:
1. VS Code (code editor)
2. Command Prompt

If all that seemed a bit far-fetched to you, don’t worry… it’s going to be a simple process and I’m sure you’ll learn as you code along.

If you need the completed code for this tutorial, you can find it on GitHub

Setup Node.JS

  1. Visit the Node.js website and open the Downloads section
  2. Select the installer that matches your environment
  3. Install the downloaded file

Once you have Node.js installed, let’s verify that it works. Type the commands below in Command Prompt; each one should print a version number.

>node -v
v10.16.0
>npm -v
6.14.8

If you see version numbers like the ones above (yours may differ), you’re good to go! 😃 😊

Step 1:

  1. Make a folder on your system. For example, I made a folder named “horoscopeAPI”.
  2. Copy that folder’s path.
  3. Open Command Prompt.
  4. Type the following commands and answer the prompts.
cd <folder path>
npm init
package name: (horoscopeAPI)
version: (1.0.0)
description: A RESTful API that scrapes the internet to get you today's horoscope reading.
entry point: (index.js)
test command:
git repository:
keywords: horoscope, astrology, restful, api, nodejs
author: <your name>
license: (ISC)

And you’ll end up with something like this:

{
"name": "horoscopeAPI",
"version": "1.0.0",
"description": "A RESTful API that scrapes the internet to get you today's horoscope reading.",
"main": "index.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"keywords": [
"horoscope",
"astrology",
"restful",
"api",
"nodejs"
],
"author": "Kaustubh",
"license": "ISC"
}

That’s great. Now, let’s start installing the packages we need to begin working on our API. We need the following:

  1. Express — It’s the framework that helps you take care of routing and server-side mumbo-jumbo, and is also capable of templating.
  2. Request — This package helps us make HTTP requests and calls.
  3. Cheerio — This is basically jQuery on the Server Side. We’re going to be using this package to scrape the DOM much more easily.

To install all these, we just need to run this simple command:

npm install --save express request cheerio
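
Before moving on, it can help to confirm that the three packages actually landed in node_modules. The following is a minimal sketch and not part of the original project: the file name check-deps.js and the helper missingPackages are hypothetical, and it relies only on Node's built-in require.resolve.

```javascript
// check-deps.js (hypothetical helper file, not part of the original project)

// Returns the names that Node cannot resolve from the current folder.
function missingPackages(names) {
  return names.filter(function (name) {
    try {
      require.resolve(name); // throws if the package is not installed
      return false;
    } catch (err) {
      return true;
    }
  });
}

const missing = missingPackages(['express', 'request', 'cheerio']);
if (missing.length === 0) {
  console.log('All packages are installed.');
} else {
  console.log('Missing packages: ' + missing.join(', '));
}
```

Run it from the project folder with node check-deps.js. If anything is listed as missing, run the npm install command again.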

Step 2:

  1. Open VS Code Editor
  2. Open the horoscopeAPI folder in the editor
  3. Create a file named index.js

Now we’ll build the entry point into our API. The file begins by pulling in all of the packages we installed:

const express = require('express');
const request = require('request');
const cheerio = require('cheerio');
const app = express();

Now, we just need the boilerplate code to set up the server:

app.get('/', function (req, res) {
  // Will add scraping code here
});

app.listen(8000);
console.log('API is running on http://localhost:8000');

module.exports = app;

Now, that was an easy step, wasn’t it? You can even run the server now with node index.js and visit http://localhost:8000 in your browser, but the page will just keep loading because the route doesn’t send a response yet. Let’s go on with building our API.
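
The route does nothing visible at this point because its handler never sends anything back. As a small sketch (the handler name placeholderHandler and its message are made up for illustration), you could temporarily reply with some JSON to confirm the server is reachable:

```javascript
// Hypothetical placeholder handler: replies immediately so the browser
// is not left waiting. Express passes in the req and res objects.
function placeholderHandler(req, res) {
  res.send({ status: 'ok', message: 'Scraping code goes here' });
}

// In index.js this would replace the empty callback:
// app.get('/', placeholderHandler);
```

Once the scraping code is in place, the placeholder is no longer needed.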

Step 3:

Select a website to scrape:

  1. Select the website and content to scrape. I am scraping horoscope.com.
  2. Select the content you need to get.
  3. Right-click on the selected content and hit Inspect.

What you need to do is find out where the data sits within the DOM. Here, the prediction is in a <p> element inside a <div>.

Note the class name of the div, which gives us this selector:

div.main-horoscope > p

So all we need to do is extract the text from it.

var prediction = $('div.main-horoscope > p').text();
var json = {
  id: id,
  horoscope: horoscope,
  prediction: prediction
};

This stores our information in the json variable.

Once we have our JSON, all we need to do now is send it over back to the requesting client using the res object:

res.send(json);
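
In the snippet above, id and horoscope are not defined yet; they depend on which sign is being scraped. Here is one way to sketch that part, using the same URL and sign list as the final code further down. The helper names signUrl and toEntry are hypothetical and only there for illustration.

```javascript
// Sign names indexed by id; index 0 is a filler so that ids start at 1.
const SIGNS = ['none', 'Aries', 'Taurus', 'Gemini', 'Cancer', 'Leo', 'Virgo',
  'Libra', 'Scorpio', 'Sagittarius', 'Capricorn', 'Aquarius', 'Pisces'];

const BASE_URL = 'https://www.horoscope.com/us/horoscopes/general/' +
  'horoscope-general-daily-today.aspx?sign=';

// Builds the page URL for a sign id between 1 and 12.
function signUrl(id) {
  if (!Number.isInteger(id) || id < 1 || id > 12) {
    throw new RangeError('Sign id must be an integer from 1 to 12');
  }
  return BASE_URL + id;
}

// Packs the scraped text into the JSON shape the API returns.
function toEntry(id, prediction) {
  return { id: id, horoscope: SIGNS[id], prediction: prediction };
}

console.log(signUrl(5));
console.log(toEntry(5, 'Sample prediction text.'));
```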

Now, when we try to access the test URL, we’d get this:

http://localhost:8000

Final code for the API:

index.js

const express = require('express');
const request = require('request');
const cheerio = require('cheerio');
const app = express();

// Resolve the CORS issue: allow any origin to call this API
app.use(function (req, res, next) {
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Headers', 'Origin, X-Requested-With, Content-Type, Accept');
  res.setHeader('Access-Control-Allow-Methods', 'POST, GET, PATCH, DELETE, OPTIONS');
  next();
});

app.get('/', async function (req, res) {
  // Index 0 is a filler so that the sign ids run from 1 to 12
  const horoscope = ["none","Aries","Taurus","Gemini","Cancer","Leo","Virgo","Libra","Scorpio","Sagittarius","Capricorn","Aquarius","Pisces"];
  const json = [];
  try {
    for (let id = 1; id < 13; id++) {
      const url = 'https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=' + id;
      // Wrap the callback-based request() in a Promise so we can await it
      const data = await new Promise(function (resolve, reject) {
        request(url, function (error, response, html) {
          if (error) {
            return reject(error);
          }
          const $ = cheerio.load(html);
          const prediction = $('div.main-horoscope > p').text();
          resolve({
            id: id,
            horoscope: horoscope[id],
            prediction: prediction
          });
        });
      });
      json.push(data);
    }
    res.send(json);
  } catch (error) {
    // Without this, a failed request would leave the client waiting forever
    res.status(500).send({ error: 'Could not fetch the horoscopes' });
  }
});

app.listen(process.env.PORT || 5000);
module.exports = app;
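
The await new Promise(...) part of the code is there because request works with callbacks, while await only works with Promises. The sketch below shows the same wrapping pattern in isolation. fakeRequest is a made-up stand-in for request so that the example runs without network access, and getPage and getPages are hypothetical helper names.

```javascript
// Made-up stand-in for request(url, callback): calls back asynchronously
// with (error, response, html), the same argument order request uses.
function fakeRequest(url, callback) {
  setTimeout(function () {
    if (!url) {
      callback(new Error('No URL given'));
    } else {
      callback(null, { statusCode: 200 }, '<p>Page for ' + url + '</p>');
    }
  }, 10);
}

// Wraps the callback style in a Promise so it can be awaited.
function getPage(url) {
  return new Promise(function (resolve, reject) {
    fakeRequest(url, function (error, response, html) {
      if (error) {
        reject(error);
      } else {
        resolve(html);
      }
    });
  });
}

// Awaiting inside a loop fetches the pages one after another.
async function getPages(urls) {
  const pages = [];
  for (const url of urls) {
    pages.push(await getPage(url));
  }
  return pages;
}

getPages(['a', 'b'])
  .then(function (pages) { console.log(pages); })
  .catch(function (error) { console.error(error.message); });
```

Because each await waits for the previous request to finish, the twelve pages are fetched in sequence. Promise.all would be one way to fetch them in parallel, at the cost of sending twelve requests to the site at once.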

Now, test whether the API is working on localhost.

To do that, perform these steps:

Open Command Prompt

cd <folder path>
node index.js

Now open http://localhost:5000 in your browser.
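
If you prefer to check from the terminal instead of the browser, a small script using Node's built-in http module can fetch the endpoint for you. This is only a sketch: the file name check-api.js and the function fetchJson are hypothetical, and the URL in the usage comment assumes the server is running on the default port 5000 from the code above.

```javascript
// check-api.js (hypothetical file name)
const http = require('http');

// Fetches a URL and resolves with the status code and the parsed JSON body.
function fetchJson(url) {
  return new Promise(function (resolve, reject) {
    http.get(url, function (res) {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        try {
          resolve({ status: res.statusCode, data: JSON.parse(body) });
        } catch (err) {
          reject(err); // the body was not valid JSON
        }
      });
    }).on('error', reject);
  });
}

// Run as: node check-api.js http://localhost:5000
const target = process.argv[2];
if (target) {
  fetchJson(target)
    .then(function (result) {
      console.log('Status:', result.status);
      console.log('Entries:', result.data.length);
    })
    .catch(function (error) {
      console.error('Request failed:', error.message);
    });
}
```

Keep the server running in one Command Prompt window and run this script from a second one.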

Kudos! You got your JSON response. Now it’s time to deploy. 😍

Step 4:

Now you will end up with these three files:

API is ready

  1. Create a new file named Procfile
  2. Write:
web: node index.js
  3. Save it.
  4. Push these files to your repository on GitHub.
  5. Go to your Heroku dashboard.
  6. Create a new app.
  7. Connect this repository in the Deploy section and hit Deploy.
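
Heroku tells the app which port to listen on through the PORT environment variable, which is why the final code calls app.listen(process.env.PORT || 5000) rather than hard-coding a port. The sketch below spells out that fallback. resolvePort is a hypothetical helper name, and 5000 is the local default used in this tutorial.

```javascript
// Picks the port from the environment (set by the hosting platform),
// falling back to a local default when PORT is missing or not a number.
function resolvePort(env, fallback) {
  const parsed = parseInt(env.PORT, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

console.log('Would listen on port ' + resolvePort(process.env, 5000));
```

Locally, PORT is usually not set, so the server ends up on port 5000.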

Now click on the deployed link and see the magic.

API

😎 And our API is complete 😜

For any issues, refer to my GitHub repository and follow me there.
Don’t forget to give Claps 👏 👏
