I am working on a crawler. I have a list of URLs that need to be requested. Several hundred requests go out at the same time if I don't serialize them. I am afraid that would saturate my bandwidth or produce too much network traffic to the target website. What should I do?
Here is what I am doing:
urlList.forEach((url, index) => {
    console.log('Fetching ' + url);
    request(url, function(error, response, body) {
        // do something with body
    });
});
I want each request to start only after the previous one has completed.
You can use something like the Bluebird Promise library, e.g. this snippet:
const Promise = require("bluebird");
const axios = require("axios");

// Axios wrapper for error handling
const axios_wrapper = (options) => {
    return axios(options)
        .then((r) => {
            return Promise.resolve({
                data: r.data,
                error: null,
            });
        })
        .catch((e) => {
            return Promise.resolve({
                data: null,
                error: e.response ? e.response.data : e,
            });
        });
};
Promise.map(
    urls,
    (k) => {
        return axios_wrapper({
            method: "GET",
            url: k,
        });
    },
    { concurrency: 1 } // Here 1 represents how many requests you want to run in parallel
)
    .then((r) => {
        console.log(r);
        // Here r will be an array of objects like {data: [{}], error: null},
        // where if the request was successful, data will be present;
        // otherwise error will be non-null
    })
    .catch((e) => {
        console.error(e);
    });
The things you need to watch for are:
Whether the target site has rate limiting that may block you if you request too much too fast.
How many simultaneous requests the target site can handle without degrading its performance?
How much bandwidth your server has on its end of things?
How many simultaneous requests your own server can have in flight and process without causing excess memory usage or a pegged CPU.
In general, the scheme for managing all this is to create a way to tune how many requests you launch. There are many different ways to control this: by number of simultaneous requests, by number of requests per second, by amount of data used, etc.
The simplest way to start would be to just control how many simultaneous requests you make. That can be done like this:
function runRequests(arrayOfData, maxInFlight, fn) {
    return new Promise((resolve, reject) => {
        let index = 0;
        let inFlight = 0;

        function next() {
            while (inFlight < maxInFlight && index < arrayOfData.length) {
                ++inFlight;
                fn(arrayOfData[index++]).then(result => {
                    --inFlight;
                    next();
                }).catch(err => {
                    --inFlight;
                    console.log(err);
                    // purposely eat the error and let the rest of the processing continue
                    // if you want to stop further processing, you can call reject() here
                    next();
                });
            }
            if (inFlight === 0) {
                // all done
                resolve();
            }
        }
        next();
    });
}
And then you would use that like this:
const rp = require('request-promise');

// run the whole urlList, no more than 10 at a time
runRequests(urlList, 10, function(url) {
    return rp(url).then(function(data) {
        // process fetched data here for one url
    }).catch(function(err) {
        console.log(url, err);
    });
}).then(function() {
    // all requests done here
});
This can be made as sophisticated as you want by adding a time element to it (no more than N requests per second) or even a bandwidth element to it.
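For example, here is a minimal sketch of the time element (runRequestsThrottled and maxPerSecond are illustrative names, not part of the function above): in addition to maxInFlight, it never launches more than maxPerSecond requests per second.

function runRequestsThrottled(arrayOfData, maxInFlight, maxPerSecond, fn) {
    return new Promise((resolve) => {
        const minInterval = 1000 / maxPerSecond; // minimum ms between two launches
        let index = 0;
        let inFlight = 0;
        let lastLaunch = 0;

        function next() {
            while (inFlight < maxInFlight && index < arrayOfData.length) {
                const wait = lastLaunch + minInterval - Date.now();
                if (wait > 0) {
                    // too soon for another launch; try again once the interval has elapsed
                    setTimeout(next, wait);
                    return;
                }
                lastLaunch = Date.now();
                ++inFlight;
                fn(arrayOfData[index++]).catch(err => {
                    // eat errors and keep going, as in runRequests() above
                    console.log(err);
                }).then(() => {
                    --inFlight;
                    next();
                });
            }
            if (inFlight === 0 && index >= arrayOfData.length) {
                resolve();
            }
        }
        next();
    });
}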
I want each request to start only after the previous one has completed.
That's a very slow way to do things. If you really want that, then you can just pass a 1 for the maxInFlight parameter to the above function, but typically, things would work a lot faster and not cause problems by allowing somewhere between 5 and 50 simultaneous requests. Only testing would tell you where the sweet spot is for your particular target sites and your particular server infrastructure and amount of processing you need to do on the results.
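If you do want strictly sequential behaviour, the call is the same as before with the limit set to 1 (a minimal sketch reusing the names from above):

// strictly sequential: at most one request in flight at any time
runRequests(urlList, 1, function(url) {
    return rp(url).then(function(data) {
        // process fetched data here for one url
    });
});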
You can use setTimeout to schedule each request within the loop. For that, you must know the maximum time a request takes to complete.
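For illustration, a minimal sketch of that idea, assuming a hypothetical maxRequestTime upper bound that you would have to measure yourself:

const request = require('request');

const maxRequestTime = 3000; // hypothetical upper bound for one request, in ms

urlList.forEach((url, index) => {
    // stagger the start times so each request begins after the
    // previous one is assumed to have finished
    setTimeout(() => {
        request(url, function(error, response, body) {
            // do something with body
        });
    }, index * maxRequestTime);
});

Note that this only staggers start times based on an assumed duration; it never confirms completion, which is why the promise-based approaches above are more robust.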
I have a list of APIs. I want to call GET on all of them simultaneously and return as soon as one API finishes the request with a response code of 200.
I tried using a for-loop and break, but that doesn't seem to work; it always uses the first API:
import axios from 'axios';

const listOfApi = ['https://example.com/api/instanceOne', 'https://example.com/api/instanceTwo'];

let response;
for (const api of listOfApi) {
    try {
        response = await axios.get(api, {
            data: {
                url: 'https://example.com/',
            },
        });
        break;
    } catch (error) {
        console.error(`Error occurred: ${error.message}`);
    }
}
You can use Promise.any() to get the result of the first request that succeeds while running all the requests in parallel at the same time:
import axios from 'axios';

const listOfApi = ['https://example.com/api/instanceOne', 'https://example.com/api/instanceTwo'];

Promise.any(listOfApi.map(api => {
    return axios.get(api, {data: {url: 'https://example.com/'}}).then(response => {
        // skip any responses without a status of 200
        if (response.status !== 200) {
            throw new Error(`Response status ${response.status}`, {cause: response});
        }
        return response;
    });
})).then(result => {
    // first result available here
    console.log(result);
}).catch(err => {
    console.log(err);
});
Note, this uses Promise.any() which finds the first promise that resolves successfully (skipping promises that reject). You can also use Promise.race() if you want the first promise that resolves or rejects.
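For example, a minimal sketch of the Promise.race() variant, using the same hypothetical URLs as above:

// Promise.race() settles with the first promise that either fulfills or
// rejects, so a single fast failure ends the race, unlike Promise.any()
Promise.race(listOfApi.map(api => {
    return axios.get(api, {data: {url: 'https://example.com/'}});
})).then(response => {
    console.log('first settled request succeeded:', response.status);
}).catch(err => {
    console.log('first settled request failed:', err.message);
});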
I think jfriend00's answer is good, but I want to expand on it a bit and show how it would look with async/await, because that's what you are already using.
As mentioned, you can use Promise.any (or Promise.race). Both take an array of promises as argument. Promise.any will yield the result of the first promise that resolves successfully, while Promise.race will simply wait for the first promise that finishes (regardless of whether it was fulfilled or rejected) and yield its result.
To keep your code in the style of async/await as it originally was, you can map the array using an async callback function, which will effectively return a promise. This way, you don't have to "branch off into .then territory" and can keep the code more readable and easier to expand with conditions, etc.
The code can then look as follows:
import axios from 'axios';

const listOfApi = ['https://example.com/api/instanceOne', 'https://example.com/api/instanceTwo'];

try {
    const firstResponse = await Promise.any(listOfApi.map(async api => {
        const response = await axios.get(api, {
            data: {
                url: 'https://example.com/',
            },
        });
        if (response.status !== 200) {
            throw new Error(`Response status ${response.status}`, {cause: response});
        }
        return response;
    }));
    // DO SOMETHING WITH firstResponse HERE
} catch (error) {
    console.error('Error occurred:', error);
}
Side note: I changed your console.error slightly. Logging only error.message is a common mistake that hinders effective debugging later on, because it omits a lot of important information: it prints only the message, not the error stack, the error name, or any additional properties the error may have. Using .stack instead of .message is already better, since it includes the name and the stack. Best, though, is to supply the error as a separate argument to console.error, so that inspect gets called on it and the whole error object is printed, with the stack and any additional properties you may be interested in. This is very valuable when you encounter an error in production that is not easy to reproduce.
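To illustrate the difference, a minimal sketch:

try {
    await axios.get(listOfApi[0]);
} catch (error) {
    console.error(`Error occurred: ${error.message}`); // message only: no name, no stack, no properties
    console.error('Error occurred:', error);           // whole error object: name, stack, extra properties
}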
In a NodeJS v10.x.x environment, when trying to create a PDF page from some HTML code, I'm getting a closed-page issue every time I try to do something with the page (setCacheEnabled, setRequestInterception, etc.):
async (page, data) => {
    try {
        const {options, urlOrHtml} = data;
        const finalOptions = { ...config.puppeteerOptions, ...options };

        // Set caching flag (if provided)
        const cache = finalOptions.cache;
        if (cache != undefined) {
            delete finalOptions.cache;
            await page.setCacheEnabled(cache); // THIS LINE IS CAUSING THE PAGE TO BE CLOSED
        }

        // Setup timeout option (if provided)
        let requestOptions = {};
        const timeout = finalOptions.timeout;
        if (timeout != undefined) {
            delete finalOptions.timeout;
            requestOptions.timeout = timeout;
        }
        requestOptions.waitUntil = 'networkidle0';

        if (urlOrHtml.match(/^http/i)) {
            await page.setRequestInterception(true); // THIS LINE IS CAUSING ERROR DUE TO THE PAGE BEING ALREADY CLOSED
            page.once('request', request => {
                if (finalOptions.method === "POST" && finalOptions.payload !== undefined) {
                    request.continue({method: 'POST', postData: JSON.stringify(finalOptions.payload)});
                }
            });
            // Request is for a URL, so request it
            await page.goto(urlOrHtml, requestOptions);
        }
        return await page.pdf(finalOptions);
    } catch (err) {
        logger.info(err);
    }
};
I read somewhere that this issue could be caused by a missing await, but that doesn't look like my case.
I'm not using puppeteer directly, but this library that creates a cluster on top of it and handles its processes:
https://github.com/thomasdondorf/puppeteer-cluster
You already gave the solution, but as this is a common problem with the library (I'm the author 🙂) I would like to provide some more insights.
How the task function works
When a job is queued and ready to be executed, puppeteer-cluster will create a page and call the task function (given to cluster.task) with the created page object and the queued data. The cluster then waits until the Promise is finished (fulfilled or rejected) and will close the page and execute the next job in the queue.
As an async-function is implicitly creating a Promise, this means as soon as the async-function given to the cluster.task function is finished, the page is closed. There is no magic happening to determine if the page might be used in the future.
Waiting for asynchronous events
Below is a code sample with a common mistake. The user might want to wait for an external event before closing the page as in the (not working) example below:
Non-working (!) code sample:
await cluster.task(async ({ page, data }) => {
    await page.goto('...');
    setTimeout(async () => { // user is waiting for an asynchronous event
        await page.evaluate(/* ... */); // will throw an error as the page is already closed
    }, 1000);
});
In this code, the page is already closed before the asynchronous function is executed. The correct way to do this would be to return a Promise instead.
Working code sample:
await cluster.task(async ({ page, data }) => {
    await page.goto('...');
    // will wait until the Promise resolves
    await new Promise(resolve => {
        setTimeout(async () => { // user is waiting for an asynchronous event
            try {
                await page.evaluate(/* ... */);
                resolve();
            } catch (err) {
                // handle error
            }
        }, 1000);
    });
});
In this code sample, the task function waits until the inner Promise is resolved before it resolves its own Promise. This keeps the page open until the asynchronous function calls resolve. In addition, the code uses a try..catch block, as the library is not able to catch errors thrown inside asynchronous code blocks.
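An equivalent and often cleaner pattern (a sketch, not prescribed by the library) is to await the delay itself, so the page usage stays inside the task function's normal control flow and a plain try..catch around it works as usual:

await cluster.task(async ({ page, data }) => {
    await page.goto('...');
    // awaiting the timer keeps the async function alive, so the page stays open
    await new Promise(resolve => setTimeout(resolve, 1000));
    await page.evaluate(/* ... */);
});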
I got it.
I was indeed forgetting an await on the call to the function I posted.
That call was in another file that I use for the cluster instance creation:
async function createCluster() {
    // We will protect our app with a Cluster that handles all the processes running in our headless browser
    const cluster = await Cluster.launch({
        concurrency: Cluster[config.cluster.concurrencyModel],
        maxConcurrency: config.cluster.maxConcurrency
    });

    // Event handler to be called in case of problems
    cluster.on('taskerror', (err, data) => {
        console.log(`Error on cluster task... ${data}: ${err.message}`);
    });

    // Incoming task for the cluster to handle
    await cluster.task(async ({ page, data }) => {
        main.postController(page, data); // <-- I WAS MISSING A return await HERE
    });

    return cluster;
}
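With the missing piece added, the task looks like this; the fix is simply to return (or await) the call so the cluster waits for it before closing the page:

// Incoming task for the cluster to handle
await cluster.task(async ({ page, data }) => {
    return await main.postController(page, data); // cluster now waits for postController to finish
});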
I'm using React, Electron, Node.js, asyncjs, Redux and thunk.
I wrote the following code, which is supposed to download a list of files and write them to disk. When the user presses a button, I call this actionCreator:
export function downloadList(pack) {
    return (dispatch, getState) => {
        const { downloadManager } = getState();
        async.each(downloadManager.downloadQueue[pack].libs, async (url, callback) => {
            const filename = url.split('/').pop().split('#')[0].split('?')[0];
            await downloadFile(url, `dl/${filename}`);
            callback();
        }, (err) => {
            if (err) {
                console.log('A file failed to process');
            } else {
                dispatch({
                    type: DOWNLOAD_COMPLETED,
                    packName: pack
                });
            }
        });
    };
}

async function downloadFile(url, path) {
    const file = fs.createWriteStream(path);
    const request = https.get(url, (response) => {
        response.pipe(file);
        file.on('finish', () => {
            file.close();
        });
    }).on('error', (err) => { // Handle errors
        fs.unlink(path); // Delete the file async. (But we don't check the result)
    });
}
It does what it's supposed to do, but while it does that, it blocks the entire UI. I really can't understand why this happens, since if I use a setTimeout with a 3000ms delay inside the async.each, it doesn't block the UI.
Another strange behaviour is that if I use the eachLimit function of asyncJS, it only downloads the limit of files: if I want to download 100 files and set eachLimit to 10 parallel, it downloads the first 10 files and then stops. Can you enlighten me about this?
I wanted to use axios to download the files, since it doesn't need to know whether the URLs are http or https, but I can't find any resource on using axios with a stream responseType.
I can answer the first part. Pretty much every existing implementation of JavaScript runs on one thread. This means the runtime is concurrent, but not parallel, i.e. the runtime does one and exactly one thing at a time. So if a function call takes a while, it blocks everything else. Therefore, something in the downloadList function is blocking the event loop. However, if you use setTimeout, the work is pushed onto the message queue, which unblocks the event loop and allows the UI to be rendered. For more information on the event loop, check out this video.
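A minimal sketch of the effect (busyWork is a hypothetical stand-in for whatever synchronous part of the download code is expensive):

function busyWork() {
    // a synchronous loop: nothing else, UI rendering included, runs until it returns
    const end = Date.now() + 3000;
    while (Date.now() < end) { /* spin */ }
}

busyWork();              // blocks the event loop immediately for 3 seconds
setTimeout(busyWork, 0); // yields first: pending UI work gets a chance to run before the blocking starts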
Does it matter if you implement the GET method before another method such as POST, for example implementing app.post() before app.get()? I am not sure why there would be significance in the order, but in the Express app I built, if I implemented post before get, my data would buffer and then be posted every other call; the posting was inconsistent. When I switched the order, the issue was fixed.
This is the code for the requests
const xhrPost = new XMLHttpRequest();
const xhrGet = new XMLHttpRequest();

// sends data to DB
xhrPost.open("POST", '/endgame', true);
xhrPost.setRequestHeader("Content-Type", "application/json;charset=UTF-8");
xhrPost.send(JSON.stringify({
    playerScore: score
}));

// when data is done being posted, get list of scores from db
xhrPost.onreadystatechange = function() {
    console.log(this.responseText);
    if (this.readyState === 4 && this.status === 200) {
        xhrGet.open("GET", '/endgame', true);
        xhrGet.setRequestHeader("Content-Type", "application/json;charset=UTF-8");
        xhrGet.send();
    }
}

// when scores are retrieved, display results on console
xhrGet.onreadystatechange = function() {
    if (this.readyState === 4 && this.status === 200) {
        console.table(JSON.parse(this.responseText));
        var data = (JSON.parse(this.responseText));
        ctx.fillText(data[0].playerScore, 50, 150);
    }
};
and this is the server side code
mongodb.MongoClient.connect(url, (error, database) => {
    if (error) return process.exit(1)
    const db = database.db('js-snake-scores')

    app.post('/endgame', (req, res) => {
        let score = req.body
        db.collection('scores')
            .insert(score, (error, results) => {
                if (error) return
                res.send(results)
            })
    })

    app.get('/endgame', (req, res) => {
        db.collection('scores')
            .find({}, {
                playerScore: 1
            }).toArray((err, data) => {
                if (err) return next(err)
                res.send(data)
            })
    })

    app.use(express.static(path.join(__dirname, 'static')))
    app.listen(process.env.PORT || 5000)
})
Does it matter if you implement the get method before another method such as post for example implement app.post() before app.get()?
No. Order matters only when two routes would handle both the same path and the same method. So, since app.post() and app.get() each only intercept different methods, they don't compete in any way and thus their relative ordering to each other does not matter. Only one will ever trigger on a GET and only the other one will ever trigger on a POST regardless of their order of definition.
If you saw a difference in behavior due to the order, then it must have been due to some other effect besides just an app.get() and an app.post() with the same path because those two are not ever activated on the same request. If we could see the two implementations of code where you say order mattered when you switched them, then we could likely offer you a better idea of why you saw a difference in behavior. app.post() and app.get() ordering by themselves would not cause what you described.
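For contrast, here is a minimal sketch of a case where order does matter, because two routes share both the method and the path:

const express = require('express');
const app = express();

app.get('/endgame', (req, res) => {
    res.send('first handler wins'); // matches first and ends the response
});

app.get('/endgame', (req, res) => {
    res.send('never reached'); // same method and path, shadowed by the route above
});

app.listen(5000);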
I need to check the availability of about 300,000 URLs on a local server via HTTP. The files are not in a local file system but in a key-value store, and the goal is to sanity check whether every system needing access to those files can reach them via HTTP.
To do so, I would use HTTP HEAD requests that return HTTP 200 for every file found and 404 for every file not found.
The problem is, if I do too many requests at once, I get rate limited by nginx or a local proxy, hence no info whether a file is really accessible.
My method to look for the availability of files looks as follows:
...
const request = require('request'); // Using the request lib.
...
const checkEntity = entity => {
    logger.debug("HTTP HEAD ", entity);
    return request({ method: "HEAD", uri: entity.url })
        .then(result => {
            logger.debug("Successfully retrieved file: " + entity.url);
            entity.valid = result != undefined;
        })
        .catch(err => {
            logger.debug("Failed to retrieve file.", err);
            entity.valid = false;
        });
}
If I call this function a few times, things work as expected. When trying to run it within recursive promises, I quickly exceed the maximum call stack size. Setting up one promise per URL causes too much memory usage.
How could this be solved?
This problem can be solved in these steps:
Define a queue and store all your entities (all URLs that need to be checked).
Define how many HTTP requests you want to send in parallel. This number should not be too small or too large: if it's too small, the program is inefficient; if it's too large, you will hit the rate limit again. Let's call it N; you can choose a reasonable number according to your server's status.
Send N HTTP requests in parallel at the beginning.
When one request finishes, fetch a new entity from the queue and send a new request. To get notified when a request is done, you can add a callback parameter to your checkEntity function.
This way, the number of in-flight HTTP requests will never exceed N.
Here is a pseudo code example based on your code snippet:
let allEntities = [...]; // 300,000 URLs
let finishedEntities = [];

const request = require('request'); // Using the request lib.
...
const checkEntity = function(entity, callback) {
    logger.debug("HTTP HEAD ", entity);
    return request({ method: "HEAD", uri: entity.url })
        .then(result => {
            logger.debug("Successfully retrieved file: " + entity.url);
            entity.valid = result != undefined;
            callback(entity);
        })
        .catch(err => {
            logger.debug("Failed to retrieve file.", err);
            entity.valid = false;
            callback(entity);
        });
}

function checkEntityCallback(entity) {
    finishedEntities.push(entity);
    let newEntity = allEntities.shift();
    if (newEntity) {
        checkEntity(newEntity, checkEntityCallback); // reuse the entity we just took off the queue
    }
}

// kick off N requests in parallel (N = 10 here)
for (let i = 0; i < 10; i++) {
    checkEntity(allEntities.shift(), checkEntityCallback);
}
To make things easier to understand, you can change the usage of request and remove all the Promise stuff:
const checkEntity = function(entity, callback) {
    logger.debug("HTTP HEAD ", entity);
    request({ method: "HEAD", uri: entity.url }, function(error, response, body) {
        if (error) {
            logger.debug("Failed to retrieve file.", error);
            entity.valid = false;
            callback(entity);
            return;
        }
        logger.debug("Successfully retrieved file: " + entity.url);
        entity.valid = body != undefined;
        callback(entity);
    });
}
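One detail the sketch leaves open is knowing when the whole run has finished. A hypothetical way to detect that is to capture the total count before the first shift() and compare it inside the callback:

const totalCount = allEntities.length; // capture before any shift()

function checkEntityCallback(entity) {
    finishedEntities.push(entity);
    if (finishedEntities.length === totalCount) {
        console.log("All " + totalCount + " URLs checked");
        return;
    }
    let newEntity = allEntities.shift();
    if (newEntity) {
        checkEntity(newEntity, checkEntityCallback);
    }
}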