PhantomJS scraping with breaks not working - node.js

I'm trying to scrape some URLs from a web service. It's working perfectly, but I need to scrape something like 10,000 pages from the same web service.
I do this by creating multiple PhantomJS processes, and each one opens and evaluates a different URL (it's the same service; all I change is one parameter in the URL of the website).
Problem is I don't want to open 10,000 pages at once, since I don't want their service to crash, and I don't want my server to crash either.
I'm trying to write some logic that opens/evaluates/inserts-to-DB ~10 pages, then sleeps for a minute or so.
Let's say this is what I have now:
var numOfRequests = 10000; // Total requests
for (var dataIndex = 0; dataIndex < numOfRequests; dataIndex++) {
    phantom.create({ 'port': freeport }, function (ph) {
        ph.createPage(function (page) {
            page.open("http://..." + data[dataIncFirstPage], function (status) {
I want to insert somewhere in the middle something like:
if (dataIndex % 10 == 0) {
    sleep(60); // I can use the sleep module
}
Everywhere I try to place the sleep call, the program crashes/freezes/loops forever...
Any idea what I should try?
I've tried placing the above code as the first line after the for loop, but this doesn't work (maybe because of the callback functions that are waiting to fire...)
Placing it inside the phantom.create() callback doesn't work either.

Realize that Node.js runs asynchronously: each call in your for-loop merely schedules work and returns, so the iterations execute one after the other without waiting. The phantom.create call finishes near-immediately, and then the next cycle of the for-loop kicks in, as the snippet below illustrates.
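A minimal illustration with plain setTimeout (no PhantomJS needed):
for (var i = 0; i < 3; i++) {
    setTimeout(function () { console.log("callback fired"); }, 0);
    console.log("scheduled iteration " + i);
}
// Prints the three "scheduled iteration" lines first, then the three callbacks.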
To answer your question, you want the sleep command at the end of the phantom.create block, still inside the for-loop. Like this:
var numOfRequests = 10000; // Total requests
for (var dataIndex = 0; dataIndex < numOfRequests; dataIndex++) {
    phantom.create({ 'port': freeport }, function (ph) {
        // ..whatever in here
    });
    if (dataIndex % 10 == 0) {
        sleep(60); // I can use the sleep module
    }
}
Also, consider using a package to help with these control-flow issues. Async is a good one, and has a method, eachLimit, that will run a number of tasks concurrently, up to a limit. Handy! You will need to create an array with an input object for each iteration you wish to run, like this:
var dataInputs = [ { id: 0, data: "/abc" }, { id: 1, data: "/def" } ];

function processPhantom(dataItem, callback) {
    console.log("Starting processing for " + JSON.stringify(dataItem));
    phantom.create({ 'port': freeport }, function (ph) {
        // ..whatever in here.
        // When done, in the inner-most callback, call:
        // callback(null); // let the next parallel items into the queue
        // or
        // callback(new Error("Something went wrong")); // break the processing
    });
}

async.eachLimit(dataInputs, 10, processPhantom, function (err) {
    // Can check for err.
    // It is here that everything is finished.
    console.log("Finished with async.eachLimit");
});
Sleeping for a minute isn't a bad idea, but in groups of 10 that will take you 1000 minutes, which is over 16 hours! It would be more convenient to only fire a request when there is space in your queue - and be sure to log which requests are in process and which have completed.
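For the full 10,000 requests you would presumably generate dataInputs rather than write it out by hand; a minimal sketch, assuming the one thing that varies is a simple numeric URL parameter (the parameter name here is made up):
// Hypothetical: build one input item per request, varying a single URL parameter.
var numOfRequests = 10000;
var dataInputs = [];
for (var i = 0; i < numOfRequests; i++) {
    dataInputs.push({ id: i, data: "?page=" + i }); // "page" is an assumed parameter name
}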

Related

NodeJS - While loop until certain condition is met or certain time is passed

I've seen some questions/answers that are very similar, but none exactly describing what I would like to achieve. Some background: this is a multi-step provisioning flow. In short, this is the goal:
1. POST an action.
2. GET status based on one variable submitted above. If response == "done" then proceed. Returns an ID.
3. POST an action. Returns an ID.
4. GET status based on ID returned above. If response == "done" then proceed. Returns an ID.
5. (..)
I think there are 6/7 steps in total.
The first question is: are there any modules that could help me achieve this? The only requirement is that each attempt to get the status should be separated by an X amount of delay, and the flow should expire and be marked as failed after an X amount of time.
Nevertheless, the best I could come up with is this, assuming for example step 2:
GetNewDeviceId: function (req, res) {
    const delay = ms => new Promise((resolve, reject) => setTimeout(resolve, ms));
    var ip = req;
    async function main() {
        let response;
        while (true) {
            try {
                response = await service.GetNewDeviceId(ip);
                console.log("Running again for: " + ip + " - " + response);
                if (response["value"] != null) {
                    break;
                }
            } catch {
                // In case it fails
            }
            console.log("Delaying for: " + ip);
            await delay(30000);
        }
        // Call next step
        console.log("Moving on for: " + ip);
    }
    main();
}
This brings up a couple of questions:
I'm not sure this is indeed the best/clean way.
How can I set a global timeout, let's say 30 minutes, forcing it to step out of the loop and call a "failure" function?
The other thing I'm not sure about (Node.js newbie here) is this: assuming this gets called, say, 4 times with different IPs before any of those 4 are finished, Node.js will run each call in its own context, right? I quickly tested this and it seems so.
I'm not sure this is indeed the best/clean way.
I am unsure whether your function GetNewDeviceId involves recursion, that is, whether it invokes itself as service.GetNewDeviceId. That would not make sense; service.GetNewDeviceId should perform a GET request, right? If that is the case, your function seems clean to me.
How can I set a global timeout, let's say 30 minutes, forcing it to step out of the loop and call a "failure" function?
let response;
let failAt = new Date().getTime() + 30 * 60 * 1000; // 30 minutes from now
while (true) {
    if (new Date().getTime() >= failAt)
        return res.status(500).send("Failure");
    try {...}
    ...
    await delay(30000);
}
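Put together with the question's code, a runnable sketch of the whole loop might look like this (service.GetNewDeviceId and the res object are taken from the question; the exact failure response is an assumption):
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

async function pollUntilDone(ip, res) {
    const failAt = new Date().getTime() + 30 * 60 * 1000; // give up 30 minutes from now
    while (true) {
        if (new Date().getTime() >= failAt) {
            return res.status(500).send("Failure"); // global timeout reached
        }
        try {
            const response = await service.GetNewDeviceId(ip); // the question's GET wrapper
            if (response["value"] != null) break; // "done": fall through to the next step
        } catch {
            // ignore the error and retry on the next pass
        }
        await delay(30000); // wait 30 seconds between attempts
    }
    console.log("Moving on for: " + ip);
}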
The other thing I'm not sure about (Node.js newbie here) is this: assuming this gets called, say, 4 times with different IPs before any of those 4 are finished, Node.js will run each call in its own context, right?
Yes. Each invocation of the function GetNewDeviceId establishes a new execution context (called a "closure"), with its own copies of the parameters req and res and the variables response and failAt.
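A self-contained sketch that demonstrates the interleaving, with the service call replaced by a plain delay:
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Each invocation keeps its own `ip` and `attempt`; the output interleaves,
// but the counters never mix.
async function loop(ip) {
    for (let attempt = 1; attempt <= 3; attempt++) {
        console.log(ip + " attempt " + attempt);
        await delay(1000);
    }
}

loop("10.0.0.1");
loop("10.0.0.2");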

Inconsistent request behavior in Node when requesting large number of links?

I am currently using this piece of code to connect to a massive list of links (a total of 2458 links, dumped at https://pastebin.com/2wC8hwad) to get feeds from numerous sources and deliver them to users of my program.
It basically splits one massive array into multiple batches (arrays), then forks a process that requests each stored link in a batch and checks for a 200 status code. Only when a batch is complete is the next batch sent for processing, and when it's all done the forked process is disconnected. However, I'm facing issues with apparent inconsistency in how this performs, particularly the part where it requests the links.
const req = require('./request.js')
const process = require('child_process')
const linkList = require('./links.json')
let processor
console.log(`Total length: ${linkList.length}`) // 2458 links
const batchLength = 400
const batchList = [] // Contains batches (arrays) of links
let currentBatch = []

for (var i in linkList) {
  if (currentBatch.length < batchLength) currentBatch.push(linkList[i])
  else {
    batchList.push(currentBatch)
    currentBatch = []
    currentBatch.push(linkList[i])
  }
}
if (currentBatch.length > 0) batchList.push(currentBatch)

console.log(`Batch list length by default is ${batchList.length}`)
// cutDownBatchList(1)
console.log(`New batch list length is ${batchList.length}`)

const startTime = new Date()
getBatchIsolated(0, batchList)
let failCount = 0

function getBatchIsolated (batchNumber) {
  console.log('Starting batch #' + batchNumber)
  let completedLinks = 0
  const currentBatch = batchList[batchNumber]
  if (!processor) processor = process.fork('./request.js')
  for (var u in currentBatch) { processor.send(currentBatch[u]) }
  processor.on('message', function (linkCompletion) {
    if (linkCompletion === 'failed') failCount++
    if (++completedLinks === currentBatch.length) {
      if (batchNumber !== batchList.length - 1) setTimeout(getBatchIsolated, 500, batchNumber + 1)
      else finish()
    }
  })
}

function finish () {
  console.log(`Completed, time taken: ${((new Date() - startTime) / 1000).toFixed(2)}s. (${failCount}/${linkList.length} failed)`)
  processor.disconnect()
}

function cutDownBatchList (maxBatches) {
  for (var r = batchList.length - 1; batchList.length > maxBatches && r >= 0; r--) {
    batchList.splice(r, 1)
  }
  return batchList
}
Below is request.js, using needle. (However, for some strange reason it may completely hang up on a particular site indefinitely - in that case, I just use this workaround)
const needle = require('needle')

function connect (link, callback) {
  const options = {
    timeout: 10000,
    read_timeout: 8000,
    follow_max: 5,
    rejectUnauthorized: true
  }
  const request = needle.get(link, options)
    .on('header', (statusCode, headers) => {
      if (statusCode === 200) callback(null, link)
      else request.emit('err', new Error(`Bad status code (${statusCode})`))
    })
    .on('err', err => callback(err, link))
}

process.on('message', function (linkRequest) {
  connect(linkRequest, function (err, link) {
    if (err) {
      console.log(`Couldn't connect to ${link} (${err})`)
      process.send('failed')
    } else process.send('success')
  })
})
In theory, I think this should perform perfectly fine - it spawns off a separate process to handle the dirty work in sequential batches so it's not overloaded, and it's super scalable. However, when using the full list of 2458 links, split into 7 batches, I often get massive numbers of "socket hang up" errors on random batches on almost every trial, similar to what would happen if I requested all the links at once.
If I cut the number of batches down to 1 using the cutDownBatchList function, it performs perfectly fine on almost every trial. This is all happening on a Debian Linux VPS with two 3.1 GHz vCores and 4 GB RAM from OVH, on Node v6.11.2.
One thing I also noticed is that if I increase the timeout to 30000 (30 sec) in request.js for 7 batches, it works as intended - yet it works perfectly fine with a much lower timeout when I cut it down to 1 batch. If I try to do all 2458 links at once with a higher timeout, I also face no issues (which basically makes this mini algorithm useless if I can't lower the timeout by handling links in batches). This all goes back to the inconsistent-behavior issue.
The best TL;DR I can do: trying to request a bunch of links in sequential batches in a forked child process - it succeeds almost every time with a lower number of batches, yet fails consistently with the full number of batches, even though the behavior should be the same, since the links are handled in isolated batches.
Any help would be greatly appreciated in solving this issue as I just cannot for the life of me figure it out!

Queue up javascript code in a single process

Let's say I have a bunch of tasks in an array, each with a timestamp. I was wondering if it's even possible to have the tasks run within a single process, each one triggering when its date arrives.
Here's an example:
var tasks = [
    {
        "when": "1501121620",
        "what": function () {
            console.log("hello world");
        }
    },
    {
        "when": "1501121625",
        "what": function () {
            console.log("hello world x2");
        }
    }
];
I'm fine with having these stored in a database and the what script being eval'd from a string. I just need a pointer in the right direction; I've never seen anything like this in the Node world.
I'm thinking about using hotload with the file system so I don't need to deal with databases.
Should I just look into setInterval, or is there something out there that is more sophisticated? I know things like cron exist; the thing is, I need all of these tasks to occur within an already existing running process, and I need to be able to add a new task to the queue without ending the process.
To add a little context, I need some way of queuing up socket.io .emit() calls.
Do not reinvent the wheel. Use the cron package from npm. It is written in pure JS (it uses the second variant from below), so all of these tasks will occur within your already running process. For example, you can create a CronJob like this:
var CronJob = require('cron').CronJob;
var job = new CronJob(1421110908157);
job.addCallback(function() { /* some stuff to do */ });
In pure JavaScript you can do it only through the setTimeout and setInterval methods. There are two variants:
1) Set up an interval callback that will check your task queue and execute the callbacks whose time has come:
setInterval(function () {
    for (var i = 0; i < tasks.length; ++i) {
        var task = tasks[i];
        if (task.when * 1000 < Date.now()) {
            task.what();
            tasks.splice(i, 1);
            --i;
        }
    }
}, 1000);
As you can see, the accuracy of the callback timing depends on the interval length. A shorter interval => more accuracy, but also more CPU usage.
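With this variant, queuing a new task from anywhere in the running process is just a push into the array; a small usage sketch (io stands in for the socket.io server mentioned in the question):
// Queue an emit ten seconds from now; the interval loop above will pick it up.
tasks.push({
    when: Math.floor(Date.now() / 1000) + 10, // epoch seconds, matching the format above
    what: function () {
        io.emit("announcement", "hello world"); // `io` is assumed from the question's socket.io setup
    }
});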
2) Create a wrapper around your tasks. When you want to add a new task, call a method addTask that sets a setTimeout with your task's callback. Beware that the maximum delay for setTimeout is 2147483647 ms (around 25 days). If your delay exceeds the maximum, you must set a timeout for the maximum time with a callback that sets a new timeout for the remaining time. For example:
var MAX_TIME = 2147483647; // setTimeout's maximum delay in milliseconds

function addTask(task) {
    if (task.when * 1000 < MAX_TIME) {
        setTimeout(task.what, task.when * 1000); // task.when is a delay in seconds
    } else {
        task.when -= MAX_TIME / 1000;
        setTimeout(addTask.bind(null, task), MAX_TIME);
    }
}
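Note that this wrapper treats task.when as a delay in seconds rather than an epoch timestamp; hypothetical usage:
// Runs after five seconds.
addTask({ when: 5, what: function () { console.log("hello world"); } });

// A delay beyond setTimeout's cap gets chained automatically.
addTask({ when: 60 * 60 * 24 * 30, what: function () { console.log("hello world x2"); } });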

Node.js setTimeout() behaviour

I want a piece of code to repeat 100 times with 1 sec of delay in between. This is my code:
for (var i = 0; i < 100; i++) {
    setTimeout(function () {
        //do stuff
    }, 1000);
}
While this seems correct to me, it is not. Instead of running "do stuff" 100 times with 1 sec in between, it waits 1 sec and then runs "do stuff" 100 times with no delay.
Does anybody have any idea about this?
You can accomplish this by using setInterval().
It calls the function of our choice repeatedly until clearInterval is called on the timer variable that stores it.
See the example below with comments (and remember to open your developer console - in Chrome: right click -> Inspect Element -> Console - to view the console.log output):
// Total count we have called doStuff()
var count = 0;

/**
 * Timer for calling doStuff() 100 times
 */
var timer = setInterval(function () {
    // If count increased by one is smaller than 100, keep running and return
    if (count++ < 100) {
        return doStuff();
    }
    // Mission complete, clear the interval
    clearInterval(timer); // clearInterval matches setInterval
}, 1000); // One second in milliseconds

/**
 * Method for doing stuff
 */
function doStuff() {
    console.log("doing stuff");
}
Here is also: jsfiddle example
As a bonus: your original method won't work because you are scheduling 100 setTimeout calls as fast as possible. So instead of running with one-second gaps, they are queued as fast as the for loop can place them, all firing about 1000 milliseconds after the current time.
For instance, the following code shows the timestamps when your approach is used:
for (var i = 0; i < 100; i++) {
    setTimeout(function () {
        // Current time in milliseconds
        console.log(new Date().getTime());
    }, 1000);
}
It will output something like (milliseconds):
1404911593267 (14 times called with this timestamp...)
1404911593268 (10 times called with this timestamp...)
1404911593269 (12 times called with this timestamp...)
1404911593270 (15 times called with this timestamp...)
1404911593271 (12 times called with this timestamp...)
You can see the behaviour also in: js fiddle
You need to use a callback; node.js is asynchronous:
function call_100times(callback) {
    var i = 0;
    function do_stuff() {
        //do stuff
        if (i < 100) {
            i = i + 1;
            setTimeout(do_stuff, 1000); // wait one second before the next run
        } else {
            callback();
        }
    }
    do_stuff();
}
Or, more cleanly:
setInterval(function () {
    //do stuff
}, 1000);
Now that you appreciate that the for loop iterates in a matter of milliseconds, another way to do it is simply to adjust the setTimeout delay according to the counter.
for (var i = 0; i < 100; i++) {
    setTimeout(function () {
        //do stuff
    }, i * 1000);
}
For many use-cases this could be seen as bad, but in particular circumstances where you know that you definitely want to run the code x number of times after y number of seconds, it can be useful.
It's also worth noting that some believe using setInterval is bad practice.
I prefer the recursive function. Call the function initially with counter = 0, and then within the function check that counter is less than 100. If so, do your stuff, then call setTimeout with another call to doStuff with counter + 1. The function will run exactly 100 times, once per second, then quit:
const doStuff = counter => {
    if (counter < 100) {
        // do some stuff
        setTimeout(() => doStuff(counter + 1), 1000);
    }
    return;
};

doStuff(0);

How to setTimeout in node.js?

I need to be able to retry in node.js in the event of a failure inside a function. I've set up a while loop as shown below, but I'm getting slightly confused about how to wrap the function call to make sure it won't block my whole server.
What should I do?
while (retryCount < 10 && !success) {
    // Alternative one
    while (new Date().getTime() < now + 1000) {
        myFunction();
    }
    // Or:
    setTimeout(myFunction(), 1000);
}
You can store the number of tries on the function object. That works fine for a cron job; if you need the same behaviour in a request context, you must store the attempt counter in the request scope (not on the function object).
var fnc = function () {
    console.log('try');
    if (true) { // Error condition
        // Error here
        if (!fnc.tries) fnc.tries = 0;
        fnc.tries++;
        console.log(fnc.tries);
        if (fnc.tries <= 10) {
            setTimeout(fnc, 1000); // retry in one second
        } else {
            fnc.tries = 0; // give up and reset the counter
        }
        // Something went wrong
    } else {
        // We have a result
    }
};
fnc();
I'd say use the setTimeout method; that way the client won't be stuck inside the while loop that checks the time.
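A minimal sketch of that pattern, assuming myFunction reports failure through a callback (the callback signature and names here are made up):
// Hypothetical sketch: call myFunction, retrying on failure up to
// `retriesLeft` more times, one second apart, without blocking.
function tryWithRetries(retriesLeft) {
    myFunction(function (err) {
        if (!err) return; // success: stop retrying
        if (retriesLeft <= 0) {
            console.log("Giving up");
            return;
        }
        setTimeout(function () {
            tryWithRetries(retriesLeft - 1);
        }, 1000); // wait a second before the next attempt
    });
}
tryWithRetries(9); // 1 initial try + 9 retries = 10 attempts, as in the question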
That outer while loop is going to block; you'd have to refactor using only setTimeout. However, the fact that you want this sort of thing indicates to me that your code structure is really terrible and needs more reworking. What is it that you are retrying? How are you detecting an error condition? Does doing it 10 times really make the chances of success higher?
I have a gist containing a generic function that will do this sort of thing for you, but I'm reluctant to share if this is an XY problem.
