How to manage a background (scraping) process with node.js

I have an Express app that extracts data from many websites. To do this, I currently have to trigger the task through a route (e.g. localhost/scrapdata), which fetches the data and stores it in my pgsql db. This task runs indefinitely.
I have other routes to get the data from my db.
Is it a good strategy to start my scraping task with a route? Or are there other strategies?

This doesn't need to be an Express app; a simple Node.js script that gets fired off at a specified interval would do. What you are looking for is cron.
If you want to keep your current Express app, then I recommend keeping its current structure and using something like node-schedule. In another file, you could have something like:
// my-job.js
const schedule = require('node-schedule')
// Cron expression '42 * * * *': run at minute 42 of every hour
module.exports = schedule.scheduleJob('42 * * * *', () => {
  console.log('The answer to life, the universe, and everything!')
})
Then in your main app.js, just import the file to start the job:
const express = require('express')
...
require('./my-job')
Then in another route like /shutdown, you could do:
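const express = require('express')
const j = require('./my-job')
const router = express.Router()
// The handler needs (req, res) so it can send the response
router.get('/shutdown', (req, res) => {
  // Stop any further scheduled runs of the job
  j.cancel()
  res.json({ message: 'Canceled.' })
})
module.exports = router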
This is just an idea, the above has not been tested.
Keep in mind though, scraping websites is a gray area. If they offer an API, then use that instead.

Related

Concurrency with fs.writeFileSync using NodeJS and ExpressJS

I have the following code written with NodeJS and ExpressJS:
const express = require("express");
const fs = require("fs");
const bodyParser = require("body-parser");
const jsonParser = bodyParser.json();
const hostname = "127.0.0.1";
let port = 3001;
const app = express();
app.use(express.static(__dirname + "/answers"));
const answersPath = __dirname + "/answers/answers.json";
app.patch("/new/answer", jsonParser, function (req, res) {
  try {
    const questionId = req.body.questionId;
    const answer = req.body.answer;
    const answersJson = JSON.parse(fs.readFileSync(`${answersPath}`, "utf8"));
    if (answersJson[questionId]) {
      answersJson[questionId] = [...answersJson[questionId], answer];
    } else {
      answersJson[questionId] = [answer];
    }
    fs.writeFileSync(`${answersPath}`, JSON.stringify(answersJson));
    res.sendStatus(200);
  } catch (e) {
    console.error(e);
    res.sendStatus(500);
  }
});
app.listen(port);
console.log(`Server running at http://${hostname}:${port}/`);
What it basically does: it has an endpoint (/new/answer) that receives, as JSON, a question ID and an answer.
If the question already exists in the answers.json file, it adds the new answer to the list of answers for that question. If not, it creates a new entry for the question with a list containing that answer.
Now, I've read the following article: https://www.geeksforgeeks.org/how-to-handle-concurrency-in-node-js/
And what I understood from here, is that even though the endpoint would get called at the same time by two clients, both of the responses will be saved, one after the other - one of them will wait for the other one, i.e. the file will not get overwritten.
So my question is, is this true? NodeJS deals with concurrency on its own, or do I need to implement something to prevent this from happening?
Thank you, and sorry if this is a dumb question 😞.
Although readFileSync() and writeFileSync() might do what you want to achieve, you should avoid using synchronous functions in Node.js.
Synchronous functions block the entire Node.js process, not just a single Express route. This means your server will be unresponsive while it reads or writes the file, which becomes a bigger issue as the file grows.
Instead of using a file, you could keep the data only in memory. If you need to persist the data between server restarts, you can read it when the server starts and write it when the server stops. In this case it might be okay to use synchronous functions.
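A minimal sketch of that in-memory approach, using only Node's built-in modules. The helper names (loadAnswers, addAnswer, saveAnswers, installShutdownHook) are made up for illustration, and saving on SIGINT is an assumption about how the server gets stopped:

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Read the file once at startup; fall back to an empty store if it does not exist yet
function loadAnswers(filePath) {
  try {
    return JSON.parse(fs.readFileSync(filePath, 'utf8'))
  } catch (err) {
    if (err.code === 'ENOENT') return {}
    throw err
  }
}

// In-memory update: no disk access while handling a request
function addAnswer(store, questionId, answer) {
  if (!store[questionId]) store[questionId] = []
  store[questionId].push(answer)
  return store
}

// Write the whole store back to disk in one synchronous call
function saveAnswers(filePath, store) {
  fs.writeFileSync(filePath, JSON.stringify(store))
}

// Assumed shutdown hook: persist once when the process is interrupted (Ctrl+C)
function installShutdownHook(filePath, store) {
  process.once('SIGINT', () => {
    saveAnswers(filePath, store)
    process.exit(0)
  })
}
```

The route handler would then only call addAnswer(answers, questionId, answer) on a store loaded once with loadAnswers, and installShutdownHook would be called once at startup. Note that data added since startup is lost if the process is killed without running the hook.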

Using Firebase function as a proxy server

I built an app with Vue.js which is hosted on Firebase. I recently added dynamic rendering with rendertron to improve SEO, and I'm hosting rendertron on Heroku. The rendertron client works well.
In order to send requests coming from bots like googlebot to rendertron and receive a compiled HTML file, I used a Firebase function. It checks the user agent: if it's a bot, it sends the request to the rendertron link; if it's not, it fetches the app and sends back the result.
Here's the function code:
const functions = require('firebase-functions');
const express = require('express');
const fetch = require('node-fetch');
const url = require('url');
const app = express();
const appUrl = 'khbich.com';
const renderUrl = 'https://khbich-render.herokuapp.com/render';
function generateUrl(request){
  return url.format({
    protocol: request.protocol,
    host: appUrl,
    pathname: request.originalUrl
  });
}
// Returns true when the user agent matches a known crawler, false otherwise
function detectBot(userAgent){
  const bots = [
    "googlebot",
    "bingbot",
    "facebookexternalhit",
    "twitterbot",
    "linkedinbot",
    "facebot"
  ]
  // The User-Agent header may be missing, so default to an empty string
  const agent = (userAgent || '').toLowerCase()
  for(const bot of bots){
    if(agent.indexOf(bot)>-1){
      console.log('bot-detected',bot,agent)
      return true
    }
  }
  return false
}
app.get('*', (req,res)=>{
  const isBot = detectBot(req.headers['user-agent']);
  if(isBot){
    const botUrl = generateUrl(req);
    fetch(`${renderUrl}/${botUrl}`)
      .then(response => response.text())
      .then(body=>{
        // res.set takes one value per header, so all directives go in a single string
        res.set('Cache-Control','public, max-age=300, s-maxage=600')
        res.set('Vary','User-Agent');
        res.send(body.toString())
      })
  }
  else{
    fetch(`https://${appUrl}`)
      .then(response => response.text())
      .then(body=>{
        res.send(body.toString())
      })
  }
});
I used the function as the entry point for Firebase hosting, so it's invoked whenever someone opens the app.
I checked the Firebase dashboard to see if it was working, and noticed that it had crashed for exceeding the requests-per-100-seconds quota. I didn't have many users at the time, yet the function invocations reached 370 calls in one minute.
I don't see why there were so many calls all at once. Since I'm fetching the website when the user agent is not a bot, maybe the function is re-invoked each time, causing an infinite loop of invocations, but I don't know if that's really the reason.
If it is an infinite loop, how can I send users to the URL they entered without re-invoking the function? Will a redirect work?

How to set custom Cloud Functions Path

Let's say I want a cloud function to have a path such as:
https://[MY_DOMAIN]/login/change_password
How do I achieve the "login/" part in Node?
Or even something more complicated such as
login/admin/get_data
?
I tried using
module.exports = {
"login/change_password" = [function]
}
But I got an error when deploying and "change_password" was omitted, so it only tried to deploy a "login" function.
Another thing I tried was using express routers but that resulted in only deploying a single function, which routed to the right path (e.g. myfunction/login/change_password) which is problematic as I have to deploy in bulk every time and can't deploy a function individually.
If you want the flexibility to define routes (paths) that are more complex than just the name of the function, you should provide an Express app to Cloud Functions. The express app can define routes that add path components to the base name of the function you export from index.js. This is discussed in the documentation for HTTP functions. For example:
const functions = require('firebase-functions');
const express = require('express');
const app = express();
app.get('/some/other/path', (req, res) => { ... });
exports.foo = functions.https.onRequest(app);
In that case, all your paths will hang off of the path prefix "foo".
There is also an official sample illustrating the use of Express apps: https://github.com/firebase/functions-samples/tree/master/authorized-https-endpoint
Thanks to the discussion with Doug Stevenson I was able to better phrase my question and find that it was already answered in this question.
So this would be an example of my implementation:
const functions = require('firebase-functions');
const express = require('express');
const login = require('./login.js');
const edit_data = require('./edit-data.js');
const login_app = express();
login_app.use('/get_uuid', login.getUUID);
login_app.use('/get_credentials', login.getCredentials);
login_app.use('/authorize', login.authorize);
const edit_data_app = express();
edit_data_app.use('/set_data', edit_data.setData);
edit_data_app.use('/get_data', edit_data.getData);
edit_data_app.use('/update_data', edit_data.updateData);
edit_data_app.use('/remove_data', edit_data.removeData);
exports.login = functions.https.onRequest(login_app);
exports.edit_data = functions.https.onRequest(edit_data_app);
My takeaway from this is that there is a one-to-one Express app to HTTP function correspondence, so if I wanted to have 3 different functions I would need 3 Express apps.
A good balance is to have one app and one function per module (as shown above), which also means you can separate out your functions across several modules/javascript files for ease of maintenance.
In the above example, we can then trigger those HTTP functions using
https://[DOMAIN]/login/get_uuid/
or, from the firebase functions shell
login.get("/get_uuid")

Testing for server in Koa

I am using Koa for web development in Node.js. I have a server file which does nothing but start the server and initialise a few middlewares. The following is the sample code:
server.js
const Koa = require('koa');
var Router = require('koa-router');
var bodyParser = require('koa-bodyparser');
var app = new Koa();
var router = new Router();
app.use(bodyParser());
router.post('/abc', AbcController.abcAction);
router.post('/pqr', PqrController.pqrAction);
app.use(router.routes());
app.listen(3000);
When we run npm start, the server starts on port 3000. Now I want to write unit test cases for this file using mocha, chai and sinon.
One way is to create a test file, let's say server_test.js, and do something like the following (just an example):
var server = require('./server');
server.close();
For this we need to add the following lines to server.js
var server = app.listen(3000);
module.exports = server;
Is this a good practice to do? I think we should not expose server in this fashion?
As we don't have self created function here in this file, is testing really required?
Should we also exclude such files from sonarqube coverage?
Any other better suggestion is always welcome. Need your help guys. Thank you.
You can use chai-http for testing the endpoint. This is what I use for my project:
const chai = require('chai');
const chaiHttp = require('chai-http');
const expect = chai.expect;
const app = require('../app');
// Register the plugin so that chai.request is available
chai.use(chaiHttp);
describe('/GET roles', function () {
  it('should return bla bla bla', function (done) {
    chai.request(app)
      .get('/roles')
      .end(function (err, res) {
        expect(res.status).eql(200)
        expect(res.body).to.have.property('message').eql('Get role list success');
        expect(res.body).to.have.property('roles');
        expect(err).to.be.null;
        done();
      });
  });
});
There are primarily two ways to organise your test cases.
One is to put your test cases alongside your source code files (in your case it would be server.spec.js). I personally prefer this way as it encourages code modularity and makes your modules totally independent.
Another way is to create a separate directory, let's say test, where you put all your test cases, mirroring the directory structure of the main application. This is really useful for applications where test cases only matter during development, since you can simply leave these files out when deploying to production.
Also, I usually prefer following the concepts of functional programming, as it really helps to test each block of code independently.
Hope this helps
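On the question of exporting the listening server: one common arrangement, sketched here under assumptions, is to export a function that builds the server and to bind the port somewhere else, so that requiring the file in a test never opens port 3000. The sketch uses Node's built-in http module as a stand-in for the Koa app so that it is self-contained; createServer, start and the /abc response body are hypothetical. With Koa, app.callback() returns a request handler that can be passed to http.createServer in the same way.

```javascript
const http = require('http')

// Build the server without binding a port, so tests can require this file safely
function createServer() {
  return http.createServer((req, res) => {
    if (req.method === 'POST' && req.url === '/abc') {
      res.writeHead(200, { 'Content-Type': 'application/json' })
      res.end(JSON.stringify({ ok: true }))
      return
    }
    res.writeHead(404)
    res.end()
  })
}

// Binding the port is a separate step, called only by the real entry point
function start(port) {
  return createServer().listen(port)
}

module.exports = { createServer, start }

// A separate entry file (hypothetical index.js) would contain:
//   require('./server').start(3000)
```

A test can then create its own server instance, listen on a free port if it needs real requests, and close it afterwards, without touching the instance used in production.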

Application-scope Variables in Node.js?

Is there a variable scope in Node.js that persists longer than the request, that is available to all application users? I'd like to create a variable that can be read/written to by multiple app users, like so:
appScope.appHitCount += 1;
The session scope won't work because it is user specific. Looking for application specific. I couldn't find anything in the Node.js docs. Thanks.
If you have a variable scoped to the app module it will be available and shared by everything within the scope. I imagine in a big app you would need to be careful with this, but it's just regular javascript scoping. For example this will continue to append to the array as you hit it:
const express = require('express')
const app = express()
var input = []
app.get('/:input/', (req, res) => {
  var params = req.params
  input.push(params.input)
  res.send("last input was: " + JSON.stringify(input))
})
app.listen(8080, () => console.log('Example app listening on port 8080!'))
Now visiting 'http://localhost:8080/hi' returns:
last input was: ["hi"]
and then 'http://localhost:8080/there' returns:
last input was: ["hi","there"]
...etc.
If you want something shared by the entire app (i.e. all the modules) you could set up a module that is require()d by all the other modules and has the responsibility of getting and setting that value.
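A minimal sketch of such a shared module. The file name app-state.js and the function names are hypothetical. It relies on Node caching a module after the first require(), so every file that requires it gets the same instance and therefore the same counter:

```javascript
// app-state.js (hypothetical file name): this module owns the shared value
let appHitCount = 0

// Increase the counter and return the new value
function incrementHitCount() {
  appHitCount += 1
  return appHitCount
}

// Read the current value without changing it
function getHitCount() {
  return appHitCount
}

module.exports = { incrementHitCount, getHitCount }

// Any other module would then use it like this:
//   const appState = require('./app-state')
//   appState.incrementHitCount()
```

Keep in mind that this state lives in a single Node.js process: it is reset when the server restarts and is not shared between several processes running the same app.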

Resources