Scraping a site without the need for a client request to the server - node.js

I am building a scraper app with Node.js and I'd like it to scrape a certain site twice a day.
There's a problem, though.
What I'm used to doing is this: from the client side, someone makes a request, and the app scrapes the data and shows the result.
But what if I want the app to just do the scraping twice a day, without the client needing to make a request to the server? How does one do that?
Basically, it's a site where the user enters keywords they are searching for.
The app searches for those keywords every day and notifies the user when a keyword shows up on the page.
So how does one do that without the user having to search for the keyword every day?
It seems like we can use cron jobs for scheduling, so the scraping will happen twice a day or however often I choose, but the question is: how do I send the data from the scraping to the client side?
Or how do I notify the site user that the keyword was found, so they can come to the site and look at it?

But what if I want the app to just do the scraping twice a day, without the client needing to make a request to the server? How does one do that?
You use a task scheduler, such as Cron.
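For example, here is a minimal sketch using the node-cron package (an assumption, since the question only mentions cron jobs generally; any scheduler, including system cron, would work). scrapeSite is a hypothetical stand-in for your existing scraping logic:
const cron = require('node-cron');

// Run the scrape at 09:00 and 21:00 every day (adjust the cron expression as needed)
cron.schedule('0 9,21 * * *', function () {
  scrapeSite(); // hypothetical: whatever scraping you already do per request
});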
How do I notify the site user that the keyword was found, so they can come to the site and look at it?
There are lots of options.
Email (see the sketch after this list)
SMS
Twitter messages
Notifications + Service Workers
etc
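For instance, the email option might look roughly like this, assuming the nodemailer package and SMTP credentials in environment variables (all of these names are placeholders, not from the question):
const nodemailer = require('nodemailer');

const transporter = nodemailer.createTransport({
  host: process.env.SMTP_HOST,
  port: 587,
  auth: { user: process.env.SMTP_USER, pass: process.env.SMTP_PASS }
});

// Called by the scheduled scrape whenever a keyword shows up on the page
function notifyUser(email, keyword) {
  return transporter.sendMail({
    from: 'scraper@example.com',
    to: email,
    subject: 'Keyword found: ' + keyword,
    text: 'Your keyword "' + keyword + '" showed up on the page. Come to the site to take a look.'
  });
}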

The request npm module would allow you to do that. The following (server) app queries an external API every 10 seconds:
const request = require('request');

function doRequest() {
  request('http://www.randomtext.me/api/', function (error, response, body) {
    console.log('error:', error);
    console.log('statusCode:', response && response.statusCode);
    console.log('body:', body);
    // do whatever you need to do with your result
    // and notify the user (... not clear what channel you want to use)
    // could be done with sockets, email, ... or text messages (Twilio) ...
  });
}

setInterval(doRequest, 10000); // <-- adapt your interval here
So this is a simple example of server-to-server requests ... hope that helps.

Related

How does NodeJS process multiple GET requests from different users/browsers?

I'd like to know how NodeJS processes multiple GET requests from different users/browsers when the handler relies on emitted events to return the results. I'd like to think of it as if, each time a user executes the GET request, a new session is started for that user.
For example, if I have this GET request:
var tester = require('./tester-class');

app.get('/triggerEv', async function(req, res, next) {
  // Start the data processing
  tester.startProcessing('some-data');

  // tester has event emitters that are triggered when processing is complete (success or fail)
  tester.on('success', function(data) {
    return res.send('success');
  });

  tester.on('fail', function(data) {
    return res.send('fail');
  });
});
What I'm thinking is that if I open a browser and run this GET request, passing some-data and starting the processing, and then open another browser to execute this GET request with different data (to simulate multiple users accessing it at the same time), it will overwrite the previous startProcessing call and rerun it with the new data.
So if multiple users execute this GET request at the same time, would it handle each user separately, as if each had a different and independent session, and return the right response for each user's session? Or will it do as I mentioned above (in which case I will have to somehow manage different sessions for each user that triggers this GET request)?
I want to make it so that each user that executes this GET request doesn't interfere with other users that also execute this GET request at the same time and the correct response is returned for each user based on their own data sent to the startProcessing function.
Thanks, I hope I'm making sense. Will clarify if not.
If you're sharing the global tester object among different requests, then the 2nd request will interfere with the first request. Since all incoming requests use the same global environment in node.js, the usual model is that any request that may be "in-flight" for a while needs to create its own resources and keep them for itself. Then, if some other request arrives while the first one is still waiting for something to complete, it will also create its own resources and the two will not conflict.
The server environment does not have a concept of "sessions" in the way you're using the term. There is no separate server-session or server state that each request lives in other than the request and response objects that are created for each incoming request. This is not like PHP - there is not a whole new interpreter state for each request.
I want to make it so that each user that executes this GET request doesn't interfere with other users that also execute this GET request at the same time and the correct response is returned for each user based on their own data sent to the startProcessing function.
Then, don't share any resources between requests and don't use any objects that have global state. I don't know what your tester is, but one way to keep multiple requests separate from each other is to just make a new tester object for each request so they can each use it to their heart's content without any conflict.
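For illustration, here is a minimal sketch of that per-request approach, assuming tester-class exports a constructor (the question doesn't show its internals, so treat this as a placeholder):
var Tester = require('./tester-class');

app.get('/triggerEv', function(req, res, next) {
  // A fresh instance per request, so concurrent requests never share state
  var tester = new Tester();

  tester.on('success', function(data) {
    res.send('success');
  });
  tester.on('fail', function(data) {
    res.send('fail');
  });

  tester.startProcessing('some-data');
});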

Slow response time on DialogFlow fulfillment HTTP requests

I am developing an app for Google Assistant on DialogFlow.
On a certain intent I have a fulfillment which has to make an HTTP request.
The code is like this:
const syncrequest = require('sync-request');

console.log('Request start');
var res = syncrequest('GET', urlRequest, {
  json: {},
});
console.log('Request end');
Testing the URL that I'm using, it takes approximately 0.103 seconds to respond.
But looking at the firebase log, it is like this:
3:01:58.555 PM dialogflowFirebaseFulfillment Request end
3:01:56.585 PM dialogflowFirebaseFulfillment Request start
Even though my server responds in 0.103 seconds, the request takes 2 seconds to be processed.
Sometimes it takes more than 4 seconds, which makes my app crash.
Does anyone have any idea why it is taking so long? Is there something I can do to make the request faster?
Thanks in advance
I haven't looked too hard at the sync-request package, but I do see this big warning on the npm page for it:
You should not be using this in a production application. In a node.js
application you will find that you are completely unable to scale your
server. In a client application you will find that sync-request causes
the app to hang/freeze. Synchronous web requests are the number one
cause of browser crashes. For production apps, you should use
then-request, which is exactly the same except that it is
asynchronous.
Based on this, and some other information on the page, it sounds like this package is very poor on performance, and may handle the synchronous operations grossly inefficiently.
You may wish to switch to the then-request package, as it suggests, however the most common way to handle HTTP calls is using request-promise-native, where you'd do something like:
const rp = require('request-promise-native');

return rp.get(url)
  .then(body => {
    // Set the Dialogflow response here
    // You didn't really show this in your code.
  });
If you are doing asynchronous tasks, you must return a promise from your intent handler.
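For instance, a minimal sketch of such a handler in the dialogflow-fulfillment style (agent.add is that library's way of setting the response; urlRequest is the variable from the question, and the handler name is hypothetical):
const rp = require('request-promise-native');

function myIntentHandler(agent) {
  // Returning the promise chain keeps the fulfillment alive until the HTTP call finishes
  return rp.get(urlRequest)
    .then(body => {
      agent.add('Here is the result: ' + body);
    })
    .catch(() => {
      agent.add('Sorry, the request failed.');
    });
}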

node.js - POST requests to endpoint begin to get stuck after a while

I've developed a node.js web app with Express + Mongoose, deployed to an Amazon EC2 instance.
The app receives SNS notifications when a file is uploaded to a specific S3 bucket, stores something in MongoDB and then makes an HTTPS POST to some endpoint outside Amazon. The HTTPS POST is done this way, using the requests library:
var options = {
  url: "https://" + config.get('some.endpoint') + "/somepath",
  method: 'POST',
  body: postdata,
  json: true
};

requests.post(options, function(err, response, body) {
  if (!err && response.statusCode === 200) {
    logger.info("notified ok");
  } else {
    logger.error("1 " + err);
    logger.error("2 " + response);
    logger.error("3 " + body);
  }
});
This was done using a simple callback model (i.e. I didn't use the async library).
Files are uploaded continuously, so SNS hits my app at the same rate (~5-10 requests per second). For the first ten minutes of the app being up, I can see (by checking the logs) that the HTTP POSTs are being delivered at nearly the same speed as the incoming requests arrive.
But at some point, the requests.post callback starts falling behind until it stops showing up in the log file (despite requests continuing to come in). By checking the other endpoint (the one specified in config.get('some.endpoint')) I can tell that, effectively, the POSTs aren't being delivered. In bursts and with great delays (5 min or more), some new messages appear in the log, as if it were trying to catch up, but in the long term they stop showing up at all.
I've realized that if I do some manual flow control by stopping/restarting the incoming requests, I can make it work OK.
Am I doing something wrong? Are requests getting stacked up somewhere for some reason? How can I check this? Should I use some library to ensure execution?
Could it be that node.js prefers to process new incoming requests over old request callbacks, and somehow these callbacks are never executed?
Any help or suggestions on how I can debug this issue are welcome.
Thanks in advance!
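As an aside, here is a minimal sketch of the kind of flow control the question mentions, assuming the async package (which the question considered) and a hypothetical notifyEndpoint wrapper around the requests.post call above:
var async = require('async');

// Allow at most 5 outgoing POSTs in flight at any time; the rest wait in the queue
var postQueue = async.queue(function(postdata, done) {
  notifyEndpoint(postdata, done); // hypothetical wrapper that calls done(err) when finished
}, 5);

// Wherever an SNS notification is handled, enqueue instead of posting directly:
// postQueue.push(postdata);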

Fetching external resources in parallel in node - good practice?

I have a setup where a node server acts as a proxy server to serve images.
For example, take an image "test1.jpg"; the exact same image can be fetched from 3 external sources, let's say:
a. www.abc.com/test1.jpg
b. www.def.com/test1.jpg
c. www.ghi.com/test1.jpg
When the nodejs server gets a request for "test1.jpg" it first gets a list of external URLs from a DB. Now amongst these external resources, at least one is always behind a CDN and is "expected" to respond faster and hence is a preferred source for the image.
My question is: which of the two methods below is the correct way to achieve this (or is there another method)?
Fire HTTP requests (using mikeal's request client module) for all the URLs at the same time. Get their promise objects and, whichever source responds first, send that image back to the user (it can be any of the three sources, not necessarily the preferred source behind the CDN - but that doesn't matter since the image is exactly the same). The disadvantage I see is that for every image we hit 3 sources. Also, the promises for the HTTP requests can still get fulfilled after the response from the first successful source has been sent out.
Fire HTTP requests one at a time, starting with the most preferred image, wait for it to fail (i.e. a 404 on the image) and then proceed to the next preferred image. We have fewer HTTP requests but more wait time for the user.
Some pseudo code
Method 1
while(imagePreferences.length > 0) {
  var url = imagePreferences.splice(0, 1)[0]; // splice returns an array, so take the first element
  getImage(url).then(function() {
    sendImage();
  }, function(err) {
    console.log(err);
  });
}
Method 2
if(imageUrls.length > 0) {
  var url = imageUrls.splice(0, 1)[0];
  getImage(url).then(function(imageResp) {
    sendImageResp();
  }, function(err) {
    getNextImage(); // recurse over this
  });
}
This is just pseudo code. I am new to nodejs. Any help/views would be appreciated.
I prefer the 1st option; CDNs are designed to handle massive numbers of requests. Your code is perfectly fine for sending HTTP requests to multiple sources in parallel.
In case you want to stop the other requests after successfully receiving the first image, you can use async.detect: https://github.com/caolan/async#detect
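As a variation on that idea, here is a minimal sketch of the race-to-first-success approach using Promise.any instead of async.detect (getImage and sendImage are the hypothetical helpers from the question's pseudo code; Promise.any requires Node 15+):
function fetchFirstAvailable(imageUrls) {
  // Promise.any resolves with the first promise that fulfills and ignores rejections,
  // so whichever source responds successfully first wins
  return Promise.any(imageUrls.map(function(url) {
    return getImage(url);
  }));
}

fetchFirstAvailable([
  'http://www.abc.com/test1.jpg',
  'http://www.def.com/test1.jpg',
  'http://www.ghi.com/test1.jpg'
]).then(function(imageResp) {
  sendImage(imageResp);
}).catch(function(err) {
  console.log('all sources failed', err);
});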

How to inform a NodeJS server of something using PHP?

I'd like to add a live functionality to a PHP based forum - new posts would be automatically shown to users as soon as they are created.
What I find a bit confusing is the interaction between the PHP code and NodeJS+socket.io.
How would I go about informing the NodeJS server about new posts and have the server inform the clients that are watching the thread in which the post was posted?
Edit
I tried the following code, and it seems to work; my only question is whether this is considered a good solution, as it looks kind of messy to me.
I use socket.io to listen on port 81 for clients, and the server running on port 82 is only intended to be used by the forum - when a new post is created, a PHP script sends a POST request to localhost on port 82, along with the data.
Is this ok?
var io = require('socket.io').listen(81);

io.sockets.on('connection', function(socket) {
  socket.on('init', function(threadid) {
    socket.join(threadid);
  });
});

var forumserver = require('http').createServer(function(req, res) {
  if (res.socket.remoteAddress == '127.0.0.1' && req.method == 'POST') {
    req.on('data', function(chunk) {
      data = JSON.parse(chunk.toString());
      io.sockets.in(data.threadid).emit('new-post', data.content);
    });
  }
  res.end();
}).listen(82);
Your solution of a HTTP server running on a special port is exactly the solution I ended up with when faced with a similar problem. The PHP app simply uses curl to POST to the Node server, which then pushes a message out to socket.io.
However, your HTTP server implementation is broken. The data event is a Stream event; Streams do not emit messages, they emit chunks of data. In other words, the request entity data may be split up and emitted in two chunks.
If the data event emitted a partial chunk of data, JSON.parse would almost assuredly throw an exception, and your Node server would crash.
You either need to manually buffer data, or (my recommendation) use a more robust framework for your HTTP server like Express:
var express = require('express'),
    forumserver = express();

forumserver.use(express.bodyParser()); // handles buffering and parsing of the
                                       // request entity for you

forumserver.post('/post/:threadid', function(req, res) {
  io.sockets.in(req.params.threadid).emit('new-post', req.body.content);
  res.send(204); // HTTP 204 No Content (empty response)
});

forumserver.listen(82);
PHP simply needs to POST to http://localhost:82/post/1234 with an entity body containing content. (JSON, URL-encoded, or multipart-encoded entities are acceptable.) Make sure your firewall blocks port 82 on your public interface.
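For completeness, here is a minimal sketch of the manual-buffering alternative mentioned above, written as it would sit inside the plain createServer handler from the question:
// Accumulate the whole request entity before parsing it,
// so a body split across chunks can't crash JSON.parse
var body = '';
req.on('data', function(chunk) {
  body += chunk;
});
req.on('end', function() {
  try {
    var data = JSON.parse(body);
    io.sockets.in(data.threadid).emit('new-post', data.content);
  } catch (e) {
    // malformed JSON: ignore it rather than crash the server
  }
  res.end();
});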
Regarding the PHP code / forum's interaction with Node.JS, you probably need to create an API endpoint of sorts that can listen for changes made to the forum. Depending on your forum software, you would want to hook into the process of creating a new post and perform the API callback to Node.js at this time.
Socket.io out of the box is geared towards visitors of the site being connected on the frontend via Javascript. Upon the Node server receiving notification of a new post update, it would then notify connected clients of this new post and its details, at which point it would probably add new HTML to the DOM of the page the visitor is viewing.
You may want to arrange the Socket.io part of things so that users only subscribe to specific events by joining a specific room, such as "subforum123", so that they only receive notifications of applicable posts.
