Scale web scraping site with Node.js

I'm developing a web scraping website to find available delivery restaurants. The website searches the most popular delivery portals and shows the results aggregated on a single page.
The site is hosted on Heroku with 4 dynos.
http://deliveria.net/#05409-002
When a user makes a request on the website, it fires around 30 HTTP requests to retrieve the results.
The problem is performance: those requests aren't fast, and since each search makes 30 of them, the app is tied up while a single user's search runs.
I tried to increase Heroku dynos:
heroku scale web=10
And I didn't see any perceptible gain.
What is the best approach to scale this kind of application?
(I can't use caching, as the searches need to be in real time.)
Current stack:
Heroku
Node.js
express
request module
EJS
Pusher
Redis

The important thing here is to have workers, because you must avoid blocking the event loop in your main app.
Try to delegate the 30 HTTP requests across the available workers. Kue may help you with this (you push new jobs onto the queue and the workers execute them one by one). So, for example, if you have 10 dynos on Heroku, use 9 of them as workers that perform those 30 HTTP searches.
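For example, a rough sketch of that fan-out with Kue (the queue name, job payload, and the `portals`/`userQuery` variables are illustrative, not from the original app):

    // web dyno: enqueue one search job per delivery portal
    var kue = require('kue');
    var queue = kue.createQueue({ redis: process.env.REDIS_URL }); // assumes the Redis URL is in an env var

    portals.forEach(function (portal) {            // `portals` = the ~30 delivery sites
      queue.create('search', { url: portal, query: userQuery })
        .removeOnComplete(true)
        .save();
    });

    // worker dyno: pull search jobs off the queue, a few at a time
    var request = require('request');

    queue.process('search', 5, function (job, done) {
      request(job.data.url, function (err, res, body) {
        if (err) return done(err);
        // parse `body` and publish the partial result here, then:
        done();
      });
    });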
From the user's point of view, it's important that the application reacts quickly to the search (and doesn't give the impression of freezing), so you may want to update the user as soon as you have preliminary results (for example, when 10 pages out of 30 have been searched). You could do that via WebSockets (Socket.IO) and even show a nice graphical progress bar or something similar.
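And since Pusher is already in the stack, streaming those preliminary results to the browser could look roughly like this (a sketch; the channel and event names are made up for the example):

    // in the worker, as each portal responds
    var Pusher = require('pusher');
    var pusher = new Pusher({
      appId: process.env.PUSHER_APP_ID,   // assumes the credentials live in env vars
      key: process.env.PUSHER_KEY,
      secret: process.env.PUSHER_SECRET
    });

    function publishPartialResult(searchId, portal, restaurants, completed, total) {
      pusher.trigger('search-' + searchId, 'partial-result', {
        portal: portal,
        restaurants: restaurants,
        progress: completed / total       // lets the client draw that progress bar
      });
    }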

Related

Node.js design approach. Server polling periodically from clients

I'm trying to learn Node.js and appropriate design approaches.
I've implemented a little API server (using express) that fetches a set of data from several remote sites, according to client requests that use the API.
This process can take some time (several fetch/await calls), so I want the user to know how his request is doing. I've read about socket.io / WebSockets, but maybe that's an overkill solution for this case.
So what I did is:
For each client request, a requestID is generated and returned to the client.
With that ID, the client can query the API (via another endpoint) to know his request status at any time.
Using setTimeout() on the client page and some DOM manipulation, I can fetch and display the current request status every X seconds, like a polling approach.
Although the solution works fine, even with several clients connecting concurrently, maybe there's a better solution? Are there any caveats I'm not considering?
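For reference, a minimal sketch of this pattern (assuming Express; the endpoint paths, the in-memory Map, and the fetchRemoteSites()/showStatus() helpers are illustrative):

    const express = require('express');
    const crypto = require('crypto');
    const app = express();
    app.use(express.json());

    const requests = new Map();        // requestID -> status (in-memory store)

    app.post('/api/fetch', (req, res) => {
      const id = crypto.randomUUID();
      requests.set(id, { status: 'pending' });
      fetchRemoteSites(req.body)       // hypothetical: the several fetch/await calls
        .then(result => requests.set(id, { status: 'done', result: result }))
        .catch(() => requests.set(id, { status: 'error' }));
      res.json({ requestId: id });     // return the ID right away
    });

    app.get('/api/status/:id', (req, res) => {
      res.json(requests.get(req.params.id) || { status: 'unknown' });
    });

    // client page: poll every 2 seconds until the request settles
    function poll(id) {
      fetch('/api/status/' + id).then(r => r.json()).then(s => {
        showStatus(s);                 // hypothetical DOM update
        if (s.status === 'pending') setTimeout(() => poll(id), 2000);
      });
    }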
TL;DR The approach you're using is just fine, although it may not scale very well. Websockets are a different approach to solve the same problem, but again, may not scale very well.
You've identified what are basically the only two options for real-time (or close to it) updates on a web site:
polling the server - the client requests information periodically
using Websockets - the server can push updates to the client when something happens
There are a couple of things to consider.
How important are "real time" updates? If the user can wait several seconds (or longer), then go with polling.
What sort of load can the server handle? If load is a concern, then Websockets might be the way to go.
That last question is really the crux of the issue. If you're expecting a few or a few dozen clients to use this functionality, then either solution will work just fine.
If you're expecting thousands or more to be connecting, then polling starts to become a concern, because now we're talking about many repeated requests to the server. Of course, if the interval is longer, the load will be lower.
It is my understanding that the overhead for Websockets is lower, but it can still be a concern when you're talking about large numbers of clients; a lot of clients means the server is managing a lot of open connections.
The way large services handle this is to design their applications so that they can be distributed over many identical servers, with a load balancer deciding which server you connect to. This is true for either polling or Websockets.
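For what it's worth, the Websocket variant of such a status-update scheme is also small. A sketch with Socket.IO (`httpServer` and the event names are assumptions, not from the question):

    const { Server } = require('socket.io');
    const io = new Server(httpServer);           // attach to your existing HTTP server

    io.on('connection', (socket) => {
      socket.on('watch-request', (requestId) => {
        socket.join(requestId);                  // one room per in-flight request
      });
    });

    // wherever the background work makes progress:
    function reportProgress(requestId, status) {
      io.to(requestId).emit('status', status);   // push to the client instead of being polled
    }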

How can a web app work offline based on a Service Worker?

I know there are many documents about Service Workers, and many questions have already been asked.
But today has been a long day for me, so I'm too tired to read through many, many docs right now.
I just want to explain my understanding of Service Workers and how they help us serve a web app offline, and I hope somebody can tell me whether it's right or not.
Everything I know about Service Workers is that they intercept the browser's network requests and do something with them. So I guess that when a Service Worker intercepts requests, it caches them; then, when there is no network connection, the Service Worker serves users the data it cached.
Thanks for all the replies.
Yes, your thoughts are right. Here I will provide some more details about how the whole thing works.
A service worker (SW), like a web worker, runs on a different thread than the one used by the main web app. This allows the SW to keep running even when the web app is not open, which makes it possible, for instance, to receive and show web notifications.
A SW, unlike a web worker (which serves generic purposes), acts specifically as a proxy between our web application and the network. However, it is up to us to define and implement what the SW caches locally and how; by default, the SW doesn't know what to store in the cache.
For this we have to implement caching strategies that target static assets (like .js or .css files, for instance) or even URLs (but keep in mind that the Cache API used by the SW can only cache GET requests, not PUT/POST).
Once the assets or URLs we are interested in fall within the scope of a specific strategy, the SW will intercept all outgoing requests, check for a match, and eventually serve the data from the local cache instead of going over the network.
Of course, this depends on the strategy we choose/implement.
Since the requested data is already available locally, the SW can deliver it even when the user is offline.
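As a concrete illustration, a minimal cache-first strategy in the SW file might look like this (a sketch; the cache name and asset list are placeholders):

    // sw.js: precache a few static assets at install time
    self.addEventListener('install', function (event) {
      event.waitUntil(
        caches.open('static-v1').then(function (cache) {
          return cache.addAll(['/', '/app.js', '/styles.css']);
        })
      );
    });

    // serve matching requests from the cache, falling back to the network
    self.addEventListener('fetch', function (event) {
      event.respondWith(
        caches.match(event.request).then(function (cached) {
          return cached || fetch(event.request);
        })
      );
    });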
If you're interested, I wrote an article describing service workers in detail, along with some of the most common caching strategies applied to different scenarios.

Pass data between multiple NodeJS servers

I am still pretty new to NodeJS and want to know if I am looking at this in the wrong way.
Background:
I am making an app that runs once a week, generates a report, and then emails it out to a list of recipients. My initial reason for using Node was that I have an existing front end built with Angular, and I wanted to be able to reuse code in order to simplify maintenance. My main idea was to have 4+ individual Node apps running in parallel on our server.
The first app would use node-cron in order to run every Sunday. This would check the database for all scheduled tasks and retrieve the stored parameters for the reports it is running.
The next app is a simple queue that would store the scheduled tasks and pass them to the worker tasks.
The actual pdf generation would be somewhat CPU intensive, so this would be a cluster of n apps that would retrieve and run individual reports from the queue.
When done making the PDF, they would pass it to a final email app that would send the file out.
My main concern is communication between the apps. At the moment I am setting up the three lower levels (i.e., all but the scheduler) on separate ports with Express, and opening HTTP requests to them when needed. Is there a better way to handle this? Would the basic 'net' module work better than the 'http' package? Is Express even necessary for something like this, or would I be better off running everything as a basic http/net server? So far the only real use I've made of Express is to listen on a path for PUT requests and to parse the incoming JSON. I was led to asking here because, in tracking the logs so far, I see that every so often the HTTP request is reset. This doesn't appear to affect the data received by the child process, but I'd still like to avoid errors in my code.
I think that this kind of decoupling could leverage some sort of stateful priority queue, with features like retry on failure, clustering, and so on.
I've used Kue.js in the past with great success; it's Redis-backed and has nice documentation and a nice interface: http://automattic.github.io/kue/
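A rough sketch of how the scheduler/PDF/email pipeline could ride on Kue instead of HTTP between the apps (the queue names, payloads, and the generatePdf()/sendEmail() helpers are hypothetical):

    var kue = require('kue');
    var queue = kue.createQueue({ redis: process.env.REDIS_URL }); // shared Redis acts as the "bus"

    // scheduler process: enqueue one job per scheduled report
    queue.create('generate-report', { reportId: 42, recipients: ['a@example.com'] })
      .attempts(3)                        // retry on failure
      .save();

    // worker process(es): the CPU-heavy PDF generation
    queue.process('generate-report', function (job, done) {
      generatePdf(job.data.reportId, function (err, pdfPath) {    // hypothetical helper
        if (err) return done(err);
        queue.create('send-email', { pdfPath: pdfPath, recipients: job.data.recipients }).save();
        done();
      });
    });

    // email process: pick up finished reports and mail them out
    queue.process('send-email', function (job, done) {
      sendEmail(job.data.recipients, job.data.pdfPath, done);     // hypothetical helper
    });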

How do they do real time website notifications?

So on a site like, say, Stack Overflow, parts of the page update when things happen, like your reputation increasing. How do they do that, lol? Does a script check from time to time, or is it a push notification somehow?
About two years ago Stack Exchange started using WebSockets, as stated here:
https://meta.stackexchange.com/questions/125677/new-feature-real-time-updates-to-questions-answers-and-inbox
If you take a look at the Stack Overflow site source, you will see that a JavaScript function subscribes to a WebSocket server.
There are many different approaches to this technology now. Microsoft, for example, introduced SignalR (http://signalr.net/), which degrades gracefully for older browsers too by switching to other technologies where sockets don't work, like long polling (asking every X seconds whether changes are available).
As a Python guy, you would probably start by looking at something like: https://pypi.python.org/pypi/websockets/1.0
Have fun with web sockets!
If I didn't want to use web sockets, I would do it like this:
Have the server maintain a queue of notifications for a session or user or whatever context you want.
Have a URL for fetching such notifications.
When a client tries to GET that URL, and there are notifications available in the queue, return them immediately.
Otherwise, have the HTTP connection block until there are notifications queued.
On the client side, then, simply try to GET the notification URL over and over again. Normally, the connection will sit blocking until there is data to read, but I don't see why that should be a problem.
I would think this should be easier to implement on the server side than WebSockets, since the HTTP server doesn't have to support any special extensions. On the other hand, depending on the HTTP server you're using, each such open connection may tie up a thread or other system resource that you want to use sparingly.
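A bare-bones sketch of that queue-plus-blocking-GET idea (in Node/Express to match the rest of this page; the route and data shapes are illustrative, and it simplistically allows one pending connection per user):

    const express = require('express');
    const app = express();

    const queues = new Map();    // user -> notifications not yet delivered
    const waiting = new Map();   // user -> the blocked HTTP response, if any

    app.get('/notifications/:user', (req, res) => {
      const user = req.params.user;
      const queued = queues.get(user) || [];
      if (queued.length) {                  // something is queued: return immediately
        queues.set(user, []);
        return res.json(queued);
      }
      waiting.set(user, res);               // otherwise block until something arrives
      req.on('close', () => waiting.delete(user));
    });

    function notify(user, notification) {   // call this when an event happens
      const res = waiting.get(user);
      if (res) {
        waiting.delete(user);
        res.json([notification]);           // release the blocked GET
      } else {
        queues.set(user, (queues.get(user) || []).concat(notification));
      }
    }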

Load test a Backbone App

I've got an NGinx/Node/Express3/Socket.io/Redis/Backbone/Backbone.Marionette app that proxies requests to a PHP/MySQL REST API. I need to load test the entire stack as a whole.
My app takes advantage of static asset caching with NGinx, clustering with Node/Express, and Socket.io made multi-core capable using Redis. All that's to say: I've gone through a lot of trouble to try to make sure it can stand up to the load.
I hit it with 50,000 users in 10 seconds using blitz.io and it didn't even blink... which concerned me, because I wanted to see it crash, or at least breathe a little heavy. But 50k was the max you could throw at it with that tool, indicating to me that they expect you not to reasonably be able to, or need to, handle more than that. That's when I realized it wasn't actually incurring the load I was expecting, because the real load begins after the page loads: the Backbone app starts up, kicks off the socket connection, and requests the data from the REST API endpoint (on a different server).
So, here's my question:
How can I load test the entire app as a whole? I need the load test to tax the server in the same way that the clients actually will, which means:
Request the single page Backbone app from my NGinx/Node/Express server
Kick off requests for the static assets from NGinx (simulating what the browser would do)
Kick off requests to the REST API (PHP/MySQL running on a different server)
Create the connection to the Socket.io service (running on NGinx/Node/Express, utilizing Redis to handle multi-core junk)
If the testing tool uses a browser-like environment to load the page, parsing the JS and running it, everything will be copacetic (the NGinx/Node/Express server will get hit, and so will the PHP/MySQL server). If not, the testing tool will need to simulate this by firing off at least a dozen different kinds of requests nearly simultaneously; otherwise it's like stress testing a door by looking at it 10,000 times (that is to say, it's pointless).
I need to ensure my app can handle 1,000 users hitting it in under a minute all loading the same page.
You should learn to use Apache JMeter: http://jmeter.apache.org/
You can perform stress tests with it;
see this tutorial: https://www.youtube.com/watch?v=8NLeq-QxkSw
As you said, "I need the load test to tax the server in the same way that the clients actually will."
That means the test is agnostic to the technology you are using.
I highly recommend JMeter; it is widely used, and you can integrate it with Jenkins and do a lot of cool stuff with it.
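Once you have a test plan saved, a typical headless run looks like this (the file names are placeholders):

    jmeter -n -t load-test.jmx -l results.jtl

-n runs JMeter without the GUI, -t points at the test plan, and -l writes the results log you can analyze afterwards.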
