Web Scraping with Node.js

I will teach you how to scrape content from a dynamic website and make your own API.

Environment — WINDOWS

Introduction

In this tutorial, we’ll learn how to:
1. Set up a Node.js server (with Express)
2. Scrape the internet for the data that we want
3. Extract the information and format it for the user
4. Expose it using a RESTful API that can be consumed by other applications
5. Deploy it on Heroku
6. Resolve the CORS issue

Requirements

Alright, so first and foremost, we’ll need the base server that runs the entire system. If you haven’t already guessed it, it’s Node.js.

But before that, let’s sync up with the bare necessities:
1. VS Code (code editor)
2. Command Prompt

If all that seemed a bit far-fetched to you, don’t worry… it’s going to be a simple process and I’m sure you’ll learn as you code along.

If you need the completed code for this tutorial, you can find it on GitHub

Set Up Node.js

1. Go to the Downloads section of the Node.js website.

[Image: Downloads section of the Node.js website]

2. Select the installer that matches your environment (Windows, in this tutorial).
3. Run the downloaded installer.

Once you have Node.js installed, let’s verify that it works. Run the commands below in Command Prompt; each one should print a version number.

>node -v
v10.16.0
>npm -v
6.14.8

If you see the above results, that means you’re good to go! 😃 😊

Step 1:

Open Command Prompt, move to your project folder, and run npm init. Answer the prompts like this:

cd <folder path>
npm init
package name: (horoscopeAPI)
version: (1.0.0)
description: A RESTful API that scrapes the internet to get you today's horoscope reading.
entry point: (index.js)
test command:
git repository:
keywords: horoscope, astrology, restful, api, nodejs
author: <your name>
license: (ISC)

And you’ll end up with something like this:

{
  "name": "horoscopeAPI",
  "version": "1.0.0",
  "description": "A RESTful API that scrapes the internet to get you today's horoscope reading.",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [
    "horoscope",
    "astrology",
    "restful",
    "api",
    "nodejs"
  ],
  "author": "Kaustubh",
  "license": "ISC"
}

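That’s great. Now, let’s start installing the packages we need to begin working on our API. We need the following:

1. Express (the web server framework)
2. Request (fetches the web page over HTTP)
3. Cheerio (parses the fetched HTML with a jQuery-like API)

To install all these, we just need to run this simple command: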

npm install --save express request cheerio

Step 2:

Now, we’ll build the entry point of our API, index.js. The file begins by pulling in all of the packages we installed:

const express = require('express');
const request = require('request');
const cheerio = require('cheerio');
const app = express();

Now, we just need the boilerplate code to set up the server:

app.get('/', function (req, res) {
  // Will add scraping code here
});

app.listen(8000);
console.log('API is running on http://localhost:8000');

module.exports = app;

Now, that was an easy step, wasn’t it? You can already start the server with node index.js and open http://localhost:8000 in your browser, but the page will just keep loading because the route doesn’t send a response yet. Let’s go on with building our API.
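Until the scraping code exists, the route can return a stub so the browser gets an answer right away. Here is a minimal sketch of that idea. The name placeholderHandler and the message text are made up for illustration and are not part of the original code.

```javascript
// Hypothetical placeholder handler: replies at once so the request does not hang.
// `res` is the Express response object; res.send() turns a plain object into JSON.
function placeholderHandler(req, res) {
  res.send({ status: 'ok', message: 'Scraping code not added yet' });
}

// In index.js this would take the place of the empty handler:
// app.get('/', placeholderHandler);
```

Once the real scraping code is in place, the stub is simply swapped out for it.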

Step 3:

Select a website to scrape. This tutorial uses the daily horoscope pages on www.horoscope.com:

[Image: Selected content]

3. Right-click the selected content and choose Inspect.

What you need to do now is find out where the data sits within the DOM:

[Image: the <div> and <p> elements]

Note the selector of the div that wraps the text:

div.main-horoscope > p

So all we need to do is extract the text from it.

var prediction = $('div.main-horoscope > p').text();
// id is the sign number and horoscope its name; both are set in the final code
var json = {
  id: id,
  horoscope: horoscope,
  prediction: prediction
};

This stores our information in the json variable.
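This step could also be wrapped in a small reusable function, roughly like the sketch below. The names buildReading and SIGNS are illustrative and not from the original code, and `$` is assumed to be whatever cheerio.load(html) returns. The sketch also trims the whitespace that .text() can carry over from the page, and it rejects ids outside 1 to 12.

```javascript
// Index 0 is a placeholder so that the sign ids run from 1 to 12.
const SIGNS = ['none', 'Aries', 'Taurus', 'Gemini', 'Cancer', 'Leo', 'Virgo',
  'Libra', 'Scorpio', 'Sagittarius', 'Capricorn', 'Aquarius', 'Pisces'];

// Hypothetical helper. `$` is assumed to be the result of cheerio.load(html).
function buildReading($, id) {
  if (!Number.isInteger(id) || id < 1 || id > 12) {
    throw new RangeError('id must be an integer from 1 to 12');
  }
  // Same selector as in the tutorial; trim() drops surrounding whitespace.
  const prediction = $('div.main-horoscope > p').text().trim();
  return { id: id, horoscope: SIGNS[id], prediction: prediction };
}

// Usage inside the request callback (assumes cheerio is installed):
// const $ = cheerio.load(html);
// const json = buildReading($, 1);
```

Keeping the sign names in one array means the id in the URL and the name in the JSON cannot drift apart.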

Once we have our JSON, all we need to do now is send it over back to the requesting client using the res object:

res.send(json);

Now, when we try to access the test URL, we’d get this:

http://localhost:8000

Final code for the API:

index.js

const express = require('express');
const request = require('request');
const cheerio = require('cheerio');
const app = express();

// CORS: allow pages from any origin to call this API
app.use(function (req, res, next) {
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Headers', 'Origin, X-Requested-With, Content-Type, Accept');
  res.setHeader('Access-Control-Allow-Methods', 'POST, GET, PATCH, DELETE, OPTIONS');
  next();
});

app.get('/', async function (req, res) {
  // Index 0 is a placeholder so that the sign ids run from 1 to 12
  const horoscope = ["none","Aries","Taurus","Gemini","Cancer","Leo","Virgo","Libra","Scorpio","Sagittarius","Capricorn","Aquarius","Pisces"];
  const json = [];
  try {
    // Fetch the twelve sign pages one after another
    for (let id = 1; id < 13; id++) {
      const url = 'https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=' + id;
      const data = await new Promise(function (resolve, reject) {
        request(url, function (error, response, html) {
          if (!error) {
            const $ = cheerio.load(html);
            const prediction = $('div.main-horoscope > p').text();
            resolve({
              id: id,
              horoscope: horoscope[id],
              prediction: prediction,
            });
          } else {
            reject(error);
          }
        });
      });
      json.push(data);
    }
    res.send(json);
  } catch (error) {
    // Without this, a failed request would leave the client waiting forever
    res.status(500).send({ error: 'Could not fetch the horoscope data' });
  }
});

app.listen(process.env.PORT || 5000);
module.exports = app;
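The CORS block in the final code is what lets a page served from another origin call this API from the browser. One possible way to package it as a named middleware is sketched here. The name allowCors is hypothetical, and answering OPTIONS preflight requests with 204 is an added assumption, not something the original code does.

```javascript
// Hypothetical named version of the CORS middleware from the final code.
function allowCors(req, res, next) {
  // Let pages from any origin read responses from this API.
  res.setHeader('Access-Control-Allow-Origin', '*');
  res.setHeader('Access-Control-Allow-Headers', 'Origin, X-Requested-With, Content-Type, Accept');
  res.setHeader('Access-Control-Allow-Methods', 'POST, GET, PATCH, DELETE, OPTIONS');

  // Assumption: answer preflight requests right away with "204 No Content".
  if (req.method === 'OPTIONS') {
    res.sendStatus(204);
    return;
  }
  next();
}

// Usage in index.js, before the routes are defined:
// app.use(allowCors);
```

Since this API only serves GET requests, the list of allowed methods could probably be shortened, but the sketch keeps the same headers as the tutorial.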

Now, test whether the API is working on localhost.

For that, perform these steps:

Open Command Prompt

cd <folder path>
node index.js

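Now open http://localhost:5000 in your browser (5000 is the port the final code falls back to when PORT is not set).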
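Besides looking at the JSON in the browser, the response can be checked with a few lines of code. The sketch below assumes the response is the array produced by the final code (twelve objects with id, horoscope, and prediction). The helper name validateReadings is hypothetical.

```javascript
// Hypothetical helper: returns a list of problems found in the API response.
// An empty list means the response has the expected shape.
function validateReadings(readings) {
  const problems = [];
  if (!Array.isArray(readings) || readings.length !== 12) {
    problems.push('expected an array of 12 readings');
    return problems;
  }
  readings.forEach(function (reading, index) {
    if (reading.id !== index + 1) {
      problems.push('entry ' + index + ' has unexpected id ' + reading.id);
    }
    if (typeof reading.prediction !== 'string' || reading.prediction.trim() === '') {
      // An empty prediction may mean the selector no longer matches the page.
      problems.push('entry ' + index + ' has an empty prediction');
    }
  });
  return problems;
}

// Usage: paste the JSON from the browser into a variable and run
// console.log(validateReadings(response));
```

A check like this is handy later too, because a scraper can start returning empty text if the site changes its markup.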

Kudos! You got your JSON response. Now it’s time to deploy. 😍

Step 4:

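Now you will end up with these three files:

[Image: API is ready]

1. Create a file named Procfile (no extension) in the project folder.

2. Add this line to it, which tells Heroku how to start the app:

web: node index.js

3. Save it.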

4. Push these files to your GitHub repository.

5. Go to your Heroku dashboard.

6. Create a new app

7. Connect this repository in the deploy section and hit deploy.

Now click on the deployed link and see the magic.

[Image: API]

😎 And our API is complete 😜

For any issues, refer to my GitHub repository and follow me there.
Don’t forget to give Claps 👏 👏

