I have about 5000 links and I need to crawl all of them. I'm wondering whether there is a better approach than this. Here is my code:
const request = require('request');

let urls = [ /* 5000 urls go here */ ];

const doms = await getDoms(urls);
// processing and storing the doms

getDoms = async (urls) => {
  let data = await Promise.all(urls.map(url => {
    return getSiteCrawlPromise(url);
  }));
  return data;
};

getSiteCrawlPromise = (url) => {
  return new Promise((resolve, reject) => {
    let j = request.jar();
    request.get({ url: url, jar: j }, function (err, response, body) {
      if (err)
        return resolve({ body: null, jar: j, error: err });
      return resolve({ body: body, jar: j, error: null });
    });
  });
};
Is there a mechanism built into promises that can divide the job across multiple threads, process it, and then return the output as a whole?
And I don't want to divide the urls into smaller fragments and process those fragments.
The Promise object represents the eventual completion (or failure) of an asynchronous operation, and its resulting value.
There is no in-built mechanism in Promises to "divide jobs into multiple threads and process". If you must do that, you'll have to fragment the urls array into smaller arrays and queue the fragmented arrays onto separate crawler instances simultaneously.
But there is absolutely no need to go that way. Since you're using Node.js and node-crawler, you can use the maxConnections option of node-crawler. This is what it was built for, and the end result will be the same: you'll be crawling the urls over multiple concurrent connections, without wasting time and effort on manual chunking, handling multiple crawler instances, or depending on any concurrency libraries.
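A rough sketch of what that could look like with node-crawler (the result handling here is an assumption; adapt it to however you process and store the DOMs):

const Crawler = require('crawler');

const results = [];
const c = new Crawler({
    maxConnections: 10,                 // crawl up to 10 urls concurrently
    callback: (error, res, done) => {
        // collect whatever you need from each response
        results.push({ body: error ? null : res.body, error: error || null });
        done();                         // free this slot for the next url
    }
});

c.queue(urls);                          // the same 5000-element array
c.on('drain', () => {
    // every queued url has been processed; `results` is complete here
});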
There isn't such a mechanism built into JavaScript, at least not right now.
You can use a third-party promise library that offers more features, like Bluebird, and make use of its concurrency feature:
const Promise = require('bluebird');
// Crawl all URLs, with 10 concurrent "threads".
Promise.map(arrayOfUrls, url => {
return /* promise for crawling the url */;
}, { concurrency: 10 });
Another option is to use a dedicated throttling library (I highly recommend bottleneck), which lets you express any generic kind of rate limit. The syntax in that case would be similar to what you already have:
const Bottleneck = require('bottleneck');
const limit = new Bottleneck({ maxConcurrent: 10 });
const getSiteCrawlPromise = limit.wrap(url => {
  // the body of your getSiteCrawlPromise function, as normal
});
// getDoms stays exactly the same
You can solve this problem yourself, but bringing one (or both!) of the libraries above will save you a lot of code.
Related
I am using the excellent Papa Parse library in Node.js mode to stream a large (500 MB) CSV file of over 1 million rows into a slow persistence API that can only take one request at a time. The persistence API is based on promises, but from Papa Parse I receive each parsed CSV row in a synchronous event, like so: parseStream.on("data", row => { ... })
The challenge I am facing is that Papa Parse dumps its CSV rows from the stream so fast that my slow persistence API can't keep up. Because Papa is synchronous and my API is promise-based, I can't just call await doDirtyWork(row) in the on event handler, because sync and async code don't mix.
Or can they mix and I just don't know how?
My question is, can I make Papa's event handler wait for my API call to finish? Kind of doing the persistence API request directly in the on("data") event, making the on() function linger around somehow until the dirty API work is done?
The solution I have so far is not much better than using Papa's non-streaming mode, in terms of memory footprint. I actually need to queue up the torrent of on("data") events, in the form of generator function iterations. I could also have queued up promise factories in an array and worked them off in a loop. Either way, I end up saving almost the entire CSV file as a huge collection of future promises (promise factories) in memory, until my slow API calls have worked all the way through.
async importCSV(filePath) {
let parsedNum = 0, processedNum = 0;
async function* gen() {
let pf = yield;
do {
pf = yield await pf();
} while (typeof pf === "function");
};
var g = gen();
g.next();
await new Promise((resolve, reject) => {
try {
const dataStream = fs.createReadStream(filePath);
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, {delimiter: ",", header: false});
dataStream.pipe(parseStream);
parseStream.on("data", row => {
// Received a CSV row from Papa.parse()
try {
console.log("PA#", parsedNum, ": parsed", row.filter((e, i) => i <= 2 ? e : undefined)
);
parsedNum++;
// Simulate some really slow async/await dirty work here, for example
// send requests to a one-at-a-time persistence API
g.next(() => { // don't execute now, call in sequence via the generator above
return new Promise((res, rej) => {
console.log(
"DW#", processedNum, ": dirty work START",
row.filter((e, i) => i <= 2 ? e : undefined)
);
setTimeout(() => {
console.log(
"DW#", processedNum, ": dirty work STOP ",
row.filter((e, i) => i <= 2 ? e : undefined)
);
processedNum++;
res();
}, 1000)
})
});
} catch (err) {
console.log(err.stack);
reject(err);
}
});
parseStream.on("finish", () => {
console.log(`Parsed ${parsedNum} rows`);
resolve();
});
} catch (err) {
console.log(err.stack);
reject(err);
}
});
while(!(await g.next()).done);
}
So why the rush Papa? Why not allow me to work down the file a bit slower -- the data in the original CSV file isn't gonna run away, we have hours to finish the streaming, why hammer me with on("data") events that I can't seem to slow down?
So what I really need is for Papa to become more of a grandpa, and minimize or eliminate any queuing or buffering of CSV rows. Ideally I would be able to completely sync Papa's parsing events with the speed (or lack thereof) of my API. So if it weren't for the dogma that async code can't make sync code "sleep", I would ideally send each CSV row to the API inside the Papa event, and only then return control to Papa.
Suggestions? Some kind of "loose coupling" of the event handler with the slowness of my async API is fine too. I don't mind if a few hundred rows get queued up. But when tens of thousands pile up, I will run out of heap fast.
Why hammer me with on("data") events that I can't seem to slow down?
You can; you just weren't asking Papa to stop. You can do this by calling stream.pause(), then later stream.resume(), to make use of the Node stream's built-in back-pressure.
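A minimal sketch of that approach, assuming doDirtyWork returns a promise as described in the question:

parseStream.on("data", row => {
    parseStream.pause();                   // stop Papa from pushing more rows
    doDirtyWork(row)                       // the slow, promise-based API call
        .catch(err => console.error(err))
        .then(() => parseStream.resume()); // ask for the next row once done
});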
However, there's a much nicer API to use than dealing with this yourself in callback-based code: use the stream as an async iterator! When you await in the body of a for await loop, the stream has to pause as well. So you can write:
async importCSV(filePath) {
let parsedNum = 0;
const dataStream = fs.createReadStream(filePath);
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, {delimiter: ",", header: false});
dataStream.pipe(parseStream);
for await (const row of parseStream) {
// Received a CSV row from Papa.parse()
const data = row.filter((e, i) => i <= 2 ? e : undefined);
console.log("PA#", parsedNum, ": parsed", data);
parsedNum++;
await dirtyWork(data);
}
console.log(`Parsed ${parsedNum} rows`);
}
importCSV('sample.csv').catch(console.error);
let processedNum = 0;
function dirtyWork(data) {
// Simulate some really slow async/await dirty work here,
// for example send requests to a one-at-a-time persistence API
return new Promise((res, rej) => {
console.log("DW#", processedNum, ": dirty work START", data)
setTimeout(() => {
console.log("DW#", processedNum, ": dirty work STOP ", data);
processedNum++;
res();
}, 1000);
});
}
Async code in JavaScript can sometimes be a little hard to grok. It's important to remember how Node handles concurrency.
The node process is single-threaded, but it uses a concept called an event loop. The consequence of this is that async code and callbacks are essentially equivalent representations of the same thing.
Of course, you need an async function to use await, but your callback from Papa Parse can be an async function:
parse.on("data", async row => {
await sync(row)
})
Once the await operation completes, the arrow function ends, and all references to row will be eliminated, so the garbage collector can successfully collect row, releasing that memory.
The effect this has is that sync executes concurrently every time a row is parsed, so if you can only sync one record at a time, then I would recommend wrapping the sync function in a debouncer.
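A minimal sketch of such a wrapper; strictly speaking this is a serializing queue rather than a debounce, and the names here are illustrative:

// Wrap an async function so that calls run strictly one after another.
function serialize(fn) {
    let last = Promise.resolve();
    return (...args) => {
        const result = last.then(() => fn(...args));
        last = result.catch(() => {});   // keep the chain alive after a failure
        return result;
    };
}

const syncOneAtATime = serialize(sync);

parse.on("data", async row => {
    await syncOneAtATime(row);
});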
I have a list of promises and currently I am using Promise.all to resolve them. Here is my code for now:
const pageFutures = myQuery.pages.map(async (pageNumber: number) => {
const urlObject: any = await this._service.getResultURL(searchRecord.details.id, authorization, pageNumber);
if (!urlObject.url) {
// throw error
}
const data = await rp.get({
gzip: true,
headers: {
"Accept-Encoding": "gzip,deflate",
},
json: true,
uri: `${urlObject.url}`,
})
const objects = data.objects.filter((object: any) => object.type === "observed-data" && object.created);
return new Promise((resolve, reject) => {
this._resultsDatastore.bulkInsert(
databaseName,
objects
).then(succ => {
resolve(succ)
}, err => {
reject(err)
})
})
})
const all: any = await Promise.all(pageFutures).catch(e => {
console.log(e)
})
So as you can see here, I use Promise.all and it works:
const all: any = await Promise.all(pageFutures).catch(e => {
console.log(e)
})
However, I noticed that it affects the database performance-wise, so I decided to resolve only 3 of them at a time.
For that I was thinking of different ways, like cwait, async-pool, or writing my own iterator,
but I get confused about how to do that.
For example when I use cwait:
let promiseQueue = new TaskQueue(Promise,3);
const all=new Promise.map(pageFutures, promiseQueue.wrap(()=>{}));
I do not know what to pass inside the wrap, so I pass () => {} for now, plus I get:
Property 'map' does not exist on type 'PromiseConstructor'.
So I am OK with whatever way gets it working (my own iterator or any library), as long as I have a good understanding of it.
I'd appreciate it if anyone could shed some light on that and help me out of this confusion.
First some remarks:
Indeed, in your current setup, the database may have to process several bulk inserts concurrently. But that concurrency is not caused by using Promise.all. Even if you had left out Promise.all from your code, it would still have that behaviour. That is because the promises were already created, so the database requests will be executed anyway.
Not related to your issue, but don't use the promise constructor antipattern: there is no need to create a promise with new Promise when you already have a promise in your hands: bulkInsert() returns a promise, so return that one.
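For illustration, using the bulkInsert call from the question:

// Antipattern: wrapping a promise you already have in `new Promise`
return new Promise((resolve, reject) => {
    this._resultsDatastore.bulkInsert(databaseName, objects)
        .then(succ => resolve(succ), err => reject(err));
});

// Instead: just return the promise bulkInsert already gives you
return this._resultsDatastore.bulkInsert(databaseName, objects);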
As your concern is about the database load, I would limit the work initiated by the pageFutures promises to the non-database aspects: they don't have to wait for each other's resolution, so that code can stay as it was.
Let those promises resolve with what you currently store in objects: the data you want to have inserted. Then concatenate all those arrays together to one big array, and feed that to one database bulkInsert() call.
Here is how that could look:
const pageFutures = myQuery.pages.map(async (pageNumber: number) => {
const urlObject: any = await this._service.getResultURL(searchRecord.details.id,
authorization, pageNumber);
if (!urlObject.url) { /* throw error */ }
const data = await rp.get({
gzip: true,
headers: { "Accept-Encoding": "gzip,deflate" },
json: true,
uri: `${urlObject.url}`,
});
// Return here, don't access the database yet...
return data.objects.filter((object: any) => object.type === "observed-data"
&& object.created);
});
const all: any = (await Promise.all(pageFutures).catch(e => {
  console.log(e);
  return []; // in case of error, still return an array
})).flat(); // flatten it, so all data chunks are concatenated in one long array

// Don't create a new Promise with `new`, only to wrap another promise.
// It is an antipattern. Use the promise returned by `bulkInsert`.
return this._resultsDatastore.bulkInsert(databaseName, all);
This uses .flat(), which is rather new. In case you have no support for it, look at the alternatives provided on MDN.
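For example, on runtimes without .flat(), one level of flattening can be done with concat; an equivalent sketch for this case, where chunks stands for the array of arrays that Promise.all resolves with:

const flattened = [].concat(...chunks);
// or, without spread syntax:
const flattened2 = chunks.reduce((acc, chunk) => acc.concat(chunk), []);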
First, you asked a question about a failing solution attempt. That is called an X/Y problem.
So in fact, as I understand your question, you want to delay some DB request.
You don't want to delay the resolving of a promise created by a DB request... No! Don't try that! The promise will resolve when the DB returns a result. It's a bad idea to interfere with that process.
I banged my head for a while against the library you tried... but I could not solve your issue with it. So I came up with the idea of just looping over the data and setting some timeouts.
I made a runnable demo here: Delaying DB request in small batch
Here is the code. Notice that I simulated some data and a DB request. You will have to adapt it. You also will have to adjust the timeout delay. A full second certainly is too long.
// That part is to simulate some data you would like to save.
// Let's make it a random amount for fun.
let howMuch = Math.ceil(Math.random()*20)
// A fake data array...
let someData = []
for(let i=0; i<howMuch; i++){
someData.push("Data #"+i)
}
console.log("Some feak data")
console.log(someData)
console.log("")
// So we have some data that look real. (lol)
// We want to save it by small group
// And that is to simulate your DB request.
let saveToDB = (data, dataIterator) => {
console.log("Requesting DB...")
return new Promise(function(resolve, reject) {
resolve("Request #"+dataIterator+" complete.");
})
}
// Ok, we have everything. Let's proceed!
let batchSize = 3 // The amount of request to do at once.
let delay = 1000 // The delay between each batch.
// Loop through all the data you have.
for(let i=0;i<someData.length;i++){
if(i%batchSize == 0){
console.log("Splitting in batch...")
// Process a batch on one timeout.
let timeout = setTimeout(() => {
// An empty line to clarify the console.
console.log("")
// Grouping the request by the "batchSize" or less if we're almost done.
for(let j=0;j<batchSize;j++){
// If there still is data to process.
if(i+j < someData.length){
// Your real database request goes here.
saveToDB(someData[i+j], i+j).then(result=>{
console.log(result)
// Do something with the result.
// ...
})
} // END if there is still data.
} // END sending requests for that batch.
},delay*i) // Timeout delay.
} // END splitting in batch.
} // END for each data.
Below is code:
var fs = require('fs')
for(let i=0;i<6551200;i++){
fs.appendFile('file',i,function(err){
})
}
When I run this code, after a few seconds it shows:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
and yet there is nothing in the file!
My questions are:
Why is there no byte in the file?
What causes the out of memory error?
How can I write to a file asynchronously in a for loop, no matter how many writes there are?
Thanks in advance.
Bottom line here is that fs.appendFile() is an asynchronous call and you simply are not "awaiting" that call to complete on each loop iteration. This has a number of consequences, including but not limited to:
The callbacks keep getting allocated before they are resolved, which results in the "heap out of memory" eventually being reached.
You are contending for a file handle, since the function you are employing is actually opening/writing/closing the given file, and if you don't wait for each turn to do so, then you're simply going to clash.
So the simple solution here is to "wait", and some modern syntax sugar makes that easy:
const fs = require('mz/fs');
const x = 6551200;
(async function() {
try {
const fd = await fs.open('file','w');
for (let i = 0; i < x; i++) {
await fs.write(fd, `${i}\n`);
}
await fs.close(fd);
} catch(e) {
console.error(e)
} finally {
process.exit();
}
})()
That will of course take a while, but it's not going to "blow up" your system whilst it does its work.
The very first simplification is to just get hold of the mz library, which already wraps common Node.js libraries with modernized versions of each function supporting promises. This will help clean up the syntax a lot as opposed to using callbacks.
The next thing to realize is what was mentioned about fs.appendFile(): it is "opening/writing/closing" all in one call. That's not great, so what you would typically do is simply open, then write the bytes in a loop, and when that is complete you can actually close the file handle.
That "sugar" comes in modern versions, and though "possible" with plain promise chaining, it's still not really that manageable. So if you don't actually have a nodejs environment that supports that async/await sugar or the tools to "transpile" such code, then you might alternately consider using the asyncjs libary with plain callbacks:
const Async = require('async');
const fs = require('fs');
const x = 6551200;
let i = 0;
fs.open('file','w',(err,fd) => {
if (err) throw err;
Async.whilst(
() => i < x,
callback => fs.write(fd,`${i}\n`,err => {
i++;
callback(err)
}),
err => {
if (err) throw err;
fs.closeSync(fd);
process.exit();
}
);
});
The same base principle applies, as we are "waiting" for each callback to complete before continuing. The whilst() helper here allows iteration until the test condition is met, and of course does not start the next iteration until data is passed to the callback of the iterator itself.
There are other ways to approach this, but those are probably the two most sane for a "large loop" of iterations. Common approaches such as "chaining" via .reduce() are really more suited to a "reasonably" sized array of data you already have, and building arrays of such sizes here has inherent problems of its own.
For instance, the following "works" ( on my machine at least ) but it really consumes a lot of resources to do it:
const fs = require('mz/fs');
const x = 6551200;
fs.open('file','w')
.then( fd =>
[ ...Array(x)].reduce(
(p,e,i) => p.then( () => fs.write(fd,`${i}\n`) )
, Promise.resolve()
)
.then(() => fs.close(fd))
)
.catch(e => console.error(e) )
.then(() => process.exit());
So that's really not that practical to essentially build such a large chain in memory and then allow it to resolve. You could put some "governance" on this, but the main two approaches as shown are a lot more straightforward.
For that case, you either have the async/await sugar available, as it is within current LTS versions of Node (LTS 8.x), or, if you're restricted to a version without that support, I would stick with the other tried and true "async helpers" for callbacks.
You can of course "promisify" any function with the last few releases of Node.js right "out of the box", as it were, since Promise has been a global for some time:
const fs = require('fs');

await new Promise((resolve, reject) => fs.open('file', 'w', (err, fd) => {
  if (err) return reject(err);
  resolve(fd);
}));
So there really is no need to import libraries just to do that, but the mz library given as example here does all of that for you. So it's really up to personal preferences on bringing in additional dependencies.
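As a side note, Node 8+ also ships util.promisify, which does that wrapping for you without any extra dependency; a small sketch:

const { promisify } = require('util');
const fs = require('fs');

const open = promisify(fs.open);
const write = promisify(fs.write);
const close = promisify(fs.close);

// These can then be used just like the mz/fs functions in the examples above.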
JavaScript is a single-threaded language, which means your code can execute only one function at a time. So when you kick off an async operation, its callback is queued to be executed later.
So in your code, you are queueing 6,551,200 appendFile calls at once, which of course crashes your app before it starts working on any of them.
You can achieve what you want by splitting your loop into smaller loops, using async/await, or using iterators.
If what you are trying to achieve is as simple as your code, you can use the following:
const fs = require("fs");
function SomeTask(i=0){
fs.appendFile('file',i,function(err){
//err in the write function
if(err) console.log("Error", err);
//check if you want to continue (loop)
if(i<6551200) return SomeTask(i);
//on finish
console.log("done");
});
}
SomeTask();
In the above code, you write a single line, and when that is done, you call the next one.
This function is just for basic usage; it needs a refactor and the use of JavaScript iterators for advanced usage. Check out Iterators and generators on the MDN web docs.
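A rough sketch of that direction, assuming Node 10+ with fs.promises and for await support:

const fsp = require('fs').promises;

// Produce the lines lazily instead of queueing millions of callbacks up front.
async function* lines(max) {
  for (let i = 0; i < max; i++) yield `${i}\n`;
}

(async () => {
  const handle = await fsp.open('file', 'w');
  for await (const line of lines(6551200)) {
    await handle.write(line);   // wait for each write before producing the next line
  }
  await handle.close();
  console.log('done');
})();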
1 - The file is empty because none of the fs.appendFile calls ever finished; the Node.js process crashed before they could.
2 - The Node.js heap memory is limited and has to store every pending callback until it returns, not only the "i" variable.
3 - You could try to use promises to do that:
"use strict";
const Bluebird = require('bluebird');
const fs = Bluebird.promisifyAll(require('fs'));
let promisses = [];
for (let i = 0; i < 6551200; i++){
promisses.push(fs.appendFileAsync('file', i + '\n'));
}
Bluebird.all(promisses)
.then(data => {
console.log(data, 'End.');
})
.catch(e => console.error(e));
But no logic can avoid a heap memory error for a loop this big. You could increase the Node.js heap memory or, the more reasonable way, take chunks of data per interval:
'use strict';
const fs = require('fs');
let total = 6551200;
let interval = setInterval(() => {
fs.appendFile('file', total + '\n', () => {});
total--;
if (total < 1) {
clearInterval(interval);
}
}, 1);
I've been looking around the net for an answer to this, but haven't found anything conclusive.
I have a node application that (potentially) needs to make a large number of HTTP GET requests.
Let's say http://foo.com/bar allows an 'id' query parameter, and I have a large number of IDs to process (~1k), i.e.
http://foo.com/bar?id=100
http://foo.com/bar?id=101
etc.
What libraries that folks have used might be best suited to this task?
I guess I'm looking for something between a queue and a connection pool:
The setup:
A large array of IDs exists to be processed (up to ~1k IDs)
The process:
Some kind of pool containing X number of 'workers' is defined
Each worker takes an ID and makes a request (with up to X concurrent workers running at a time)
When a worker completes, it takes the next ID from the array and processes that
etc. until all IDs have been processed
Any experience welcome
It was actually a lot simpler than I initially thought, and only requires Bluebird (I'm paraphrasing here a little bit since my final code ended up much more complex):
var Promise = require('bluebird');
...
var allResults = [];
...
Promise.map(idList, (id) => {
// For each ID in idList, make a HTTP call
return http.get( ... url: 'http://foo.com/bar?id=' + id ... )
.then((httpResponse) => {
return allResults.push(httpResponse);
})
.catch((err) => {
var errMsg = 'ERROR: [' + err + ']';
console.log(errMsg + (err.stack ? '\n' + err.stack : ''));
});
}, { concurrency: 10 }) // Max of 10 concurrent HTTP calls at once
.then(() => {
// All requests are now complete, return all results
return res.json(allResults);
});
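For reference, roughly the same worker-pool behaviour can be hand-rolled without a library; a sketch, where fetchOne is a stand-in for whatever promise-returning request function you use:

// X workers share a cursor over idList; each worker pulls the next ID
// as soon as it finishes the previous one.
async function runPool(idList, fetchOne, poolSize = 10) {
  const results = [];
  let cursor = 0;
  async function worker() {
    while (cursor < idList.length) {
      const id = idList[cursor++];        // claim the next ID
      results.push(await fetchOne(id));   // process it, then loop for more
    }
  }
  await Promise.all(Array.from({ length: poolSize }, worker));
  return results;
}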
I am working on a Node.js application which uses the WordPress JSON API as a kind of headless CMS. When the application spins up, we query out to the WP database and pull in the information we need (using Axios), manipulate it, and store it temporarily.
Simple enough - but one of our post categories in the CMS has a rather large number of entries. For some godforsaken reason, WordPress has capped the API request limit to a maximum of 99 posts at a time, and requires that we write a loop that can send concurrent API requests until all the data has been pulled.
For instance, if we have 250 posts of some given type, I need to hit that route three separate times, specifying the specific "page" of data I want each time.
Per the docs, https://developer.wordpress.org/rest-api/using-the-rest-api/pagination/, I have access to a ?page= query string that I can use to send these requests concurrently. (i.e. ...&page=2)
I also have access to X-WP-Total in the headers object, which gives me the total number of posts within the given category.
However, these API calls are part of a nested promise chain, and the whole process needs to return a promise I can continue chaining off of.
The idea is to make it dynamic so it will always pull all of the data, and return it as one giant array of posts. Here's what I have, which is functional:
const request = require('axios');
module.exports = (request_url) => new Promise((resolve, reject) => {
// START WITH SMALL ARBITRARY REQUEST TO GET TOTAL NUMBER OF POSTS FAST
request.get(request_url + '&per_page=1').then(
(apiData) => {
// SETUP FOR PROMISE.ALL()
let promiseArray = [];
// COMPUTE HOW MANY REQUESTS WE NEED
// ALWAYS ROUND TOTAL NUMBER OF PAGES UP TO GET ALL THE DATA
const totalPages = Math.ceil(apiData.headers['x-wp-total']/99);
for (let i = 1; i <= totalPages; i++) {
promiseArray.push( request.get(`${request_url}&per_page=99&page=${i}`) )
};
resolve(
Promise.all(promiseArray)
.then((resolvedArray) => {
// PUSH IT ALL INTO A SINGLE ARRAY
let compiledPosts = [];
resolvedArray.forEach((axios_response) => {
// AXIOS MAKES US ACCESS W/RES.DATA
axios_response.data.forEach((post) => {
compiledPosts.push(post);
})
});
// RETURN AN ARRAY OF ALL POSTS REGARDLESS OF LENGTH
return compiledPosts;
}).catch((e) => { console.log('ERROR'); reject(e);})
)
}
).catch((e) => { console.log('ERROR'); reject(e);})
})
Any creative ideas to make this pattern better?
I have exactly the same question. In my case, I use Vue Resource:
this.$resource('wp/v2/media').query().then((response) => {
  let pagesNumber = Math.ceil(response.headers.get('X-WP-TotalPages'));
  for (let i = 1; i <= pagesNumber; i++) {
    this.$resource('wp/v2/media?page=' + i).query().then((response) => {
      this.medias.push(response.data);
      this.medias = _.flatten(this.medias);
      console.log(this.medias);
    });
  }
});
I'm pretty sure there is a better workaround to achieve this.
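For instance, one way to tighten this up is to collect the page requests with Promise.all and assign medias only once everything has arrived; a sketch, assuming the same Vue Resource setup and lodash as above:

this.$resource('wp/v2/media').query().then((response) => {
  const pagesNumber = Math.ceil(response.headers.get('X-WP-TotalPages'));
  const requests = [];
  for (let i = 1; i <= pagesNumber; i++) {
    requests.push(this.$resource('wp/v2/media?page=' + i).query());
  }
  return Promise.all(requests);
}).then((responses) => {
  this.medias = _.flatten(responses.map(r => r.data));
  console.log(this.medias);
});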