So I'm trying to create a data scraper in Node.js using the Request module. I'd like to limit the concurrency to 1 domain on a 20ms cycle to get through 50,000 urls.
When I execute the code, I'm DoS-ing the network with the 40Gbps bandwidth my system has access to... This creates both local and remote problems.
Running 5 concurrent scans on a 120ms cycle for 50k domains (if I calculated correctly: 50,000 / 5 = 10,000 cycles × 120ms ≈ 20 minutes) would finish the list in ~20 minutes and, at least remotely, not create any issues.
The code I'm testing with:
var urls = []; // data from mongodb

urls.forEach(function(url) {
  // pseudocode: request the url, then process the response
  request(url, function(err, res, body) {
    // process body
  });
});
The forEach loop executes instantly, "queueing" all the urls, and tries to fetch them all at once. It seems impossible to add a delay on each iteration. All Google searches seem to show how to rate limit incoming requests to your server/API. The same thing happens with a for loop: I can't control how fast the iterations execute. I'm probably missing something, or the code logic is wrong. Any suggestions?
To simplify your code, use async/await and Promises instead of callbacks.
Use a package like got or axios to make promise-based requests.
Use p-map or a similar approach from promise-fun.
Here is an example copied from the p-map readme:
const pMap = require('p-map');

const urls = [
  'sindresorhus.com',
  'ava.li',
  'github.com',
  …
];

console.log(urls.length);
//=> 100

const mapper = url => {
  return fetchStats(url); //=> Promise
};

pMap(urls, mapper, {concurrency: 5}).then(result => {
  console.log(result);
  //=> [{url: 'sindresorhus.com', stats: {…}}, …]
});
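If you also need the pacing from the question (a fixed delay per cycle, e.g. 120ms), one option is to wait inside the mapper, so each of the 5 concurrent slots pauses before picking up its next url. This is only a sketch, using the delay package from the same promise-fun collection, and it assumes fetchStats exists as above:

const delay = require('delay');

const pacedMapper = async url => {
  const stats = await fetchStats(url); //=> Promise
  await delay(120); // this slot waits 120ms before p-map hands it the next url
  return stats;
};

pMap(urls, pacedMapper, {concurrency: 5}).then(console.log);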
The lambda's job is to see if a query returns any results and alert subscribers via an SNS topic. If no rows are returned, all good, no action needed. This has to be done every 10 minutes.
For some reason, I was told that we can't have any triggers added on the database, and no on-prem environment is suitable to host a cron job.
Enter Lambda.
This is what I have in the handler, inside a loop for each database.
sequelize.authenticate()
  .then(() => {
    for (let j = 0; j < databases[i].rawQueries.length; j++) {
      sequelize.query(databases[i].rawQueries[j]).then(results => {
        if (results[0].length > 0) {
          let message = "Temporary message for testing purposes" // + query results
          publishSns("Auto Query Alert", message)
        }
      }).catch(err => {
        publishSns("Auto Query SQL Error", `The following query could not be executed: ${databases[i].rawQueries[j]}\n${err}`)
      })
    }
  })
  .catch(err => {
    publishSns("Auto Query DB Connection Error", `The following database could not be accessed: ${databases[i].database}\n${err}`)
  })
  .then(() => sequelize.close())
// sns publisher
function publishSns(subject, message) {
  const params = {
    Message: message,
    Subject: subject,
    TopicArn: process.env.SNStopic
  }
  // return the promise so callers can await the publish
  return SNS.publish(params).promise()
}
I have 3 separate database configurations, and for those few SELECT queries, I thought I could just loop through the connection instances inside a single lambda.
The process is asynchronous and takes 9 to 12 seconds per invocation, which I assume is far from optimal.
The whole thing feels very suboptimal, but that's my current level :)
To make things worse, I've now read that Lambda and sequelize don't really play well together.
I am using sequelize because that's the only way I could get 3 connections to the database working in the same invocation without issues. I tried the mssql and tedious packages and couldn't get it working with either of them.
It now feels like using an ORM is overkill for this very simple task of a SELECT query, and I would really like to at least have the connections and their queries run asynchronously to save some execution time.
I am looking into different ways to accomplish this, and I went down the rabbit hole; I now have more questions than before! Generators? Are they still useful? Observables with RxJS? Could this apply here? Async/await or just Promises? Do I even need sequelize?
Any guidance/opinion/criticism would be much appreciated.
I'm not familiar with sequelize.js, but I hope I can help. I don't know your level with RxJS and Observables, but it's worth a try.
I think you could definitely use Observables and RxJS.
I would start with an interval() that will run the code at whatever period you define.
You can then pipe the interval, since it's an Observable: do the auth bit, then use map() to get an array of Observables for each .query call. (I'm assuming all your calls, authenticate and query, return Promises, so they can be turned into Observables with from().) You can then use something like forkJoin() with that array to get a response once all the calls are done.
In the .subscribe at the end, you would call publishSns().
You can pipe a catchError() too and process errors there.
The map() part might not even be necessary; you could build the array beforehand and store it in a variable, since it doesn't depend on the authenticate() value.
I'm certain my solution isn't the only one or the best, but I think it would work; there's a rough sketch below.
Hope it helps and let me know if it works!
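A minimal sketch of that pipeline, assuming runQueries() is a hypothetical helper that returns the array of sequelize.query() Promises (the names are illustrative, not from the original post):

const { interval, from, forkJoin, of } = require('rxjs');
const { mergeMap, catchError } = require('rxjs/operators');

// Fire every 10 minutes, per the requirement above.
interval(10 * 60 * 1000).pipe(
  mergeMap(() =>
    from(sequelize.authenticate()).pipe(                        // Promise -> Observable
      mergeMap(() => forkJoin(runQueries().map(q => from(q)))), // wait for all query Promises
      catchError(err => of({ error: err }))                     // keep the outer interval alive on failure
    )
  )
).subscribe(results => {
  // inspect results here and call publishSns() as needed
});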
I've implemented a web scraper with Node.js, cheerio and request-promise that scrapes an endpoint (a basic html page) and returns certain information. The content of the page I'm crawling differs based on a parameter at the end of the url (http://some-url.com?value=12345 where 12345 is my dynamic value).
I need this crawler to run every x minutes and crawl multiple pages, and to do that I've set up a cronjob using Google Cloud Scheduler. (I'm fetching the dynamic values I need from Firebase.)
There could be more than 50 different values for which I'd need to crawl the specific page, but I would like to spread out the requests so the server doesn't choke. To accomplish this, I've tried to add a delay:
1) using setTimeout
2) using setInterval
3) using a custom sleep implementation:
const sleep = require('util').promisify(setTimeout);
All 3 of these methods work locally; all of the requests are made with y seconds delay as intended.
But when run with Firebase Cloud Functions and Google Cloud Scheduler:
1) not all of the requests are sent
2) the delay is NOT consistent (some requests fire with the proper delay, then no requests are made for a while, and other requests are sent with a major delay)
I've tried many things but I wasn't able to solve this problem.
I was wondering if anyone could suggest a different theoretical approach, or a particular library etc., that I could use for this scenario, since the one I have now doesn't seem to work as I intended. I'm adding one of the approaches that works locally below.
Cheers!
courseDataRefArray.forEach(async (dataRefObject: CourseDataRef, index: number) => {
  console.log(`Foreach index = ${index} -- Hello StackOverflow`);
  setTimeout(async () => {
    console.log(`Index in setTimeout = ${index} -- Hello StackOverflow`);
    await CourseUtil.initiateJobForCourse(dataRefObject.ref, dataRefObject.data);
  }, 2000 * index);
});
(Note: I can provide more code samples if necessary, but it's mostly following a loop & async/await & setTimeout pattern, and since it works locally I'm assuming that's not the main problem.)
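For comparison, here is a sequential variant of the same pattern (a sketch; it assumes initiateJobForCourse returns a Promise). It keeps every delay inside the awaited chain, which matters on Cloud Functions: the instance can be throttled or suspended as soon as the handler's returned promise resolves, so setTimeout callbacks that nothing awaits may fire late or never.

const sleep = require('util').promisify(setTimeout);

// Inside the async Cloud Function handler:
for (const [index, dataRefObject] of courseDataRefArray.entries()) {
  console.log(`Sequential index = ${index}`);
  await CourseUtil.initiateJobForCourse(dataRefObject.ref, dataRefObject.data);
  await sleep(2000); // the delay is part of the awaited chain, not a detached timer
}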
We have a NextJS app with an Express server.
The problem we're seeing is lots of network timeouts to the API we are calling (the underlying exception says "socket hang up"). However, that API does not show any errors or a slow response time. It's as if the API calls aren't even making it all the way to the API.
Theories and things we've tried:
Blocked event loop: we tried replacing synchronous logging with the asynchronous "winston" framework, to make sure we're not blocking the event loop. Not sure what else could be blocking it.
High CPU: the CPU can spike up to 60% sometimes. We're trying to minimize that spike by taking out some regexes we were using (since we heard those are expensive, CPU-wise).
Something about how big the JSON response is from the API? We're passing around a lot of data…
Too many complex routes in our Express routing structure: We minimized the number of routes by combining some together (which results in more complicated regexes in the route definitions)…
Any ideas why we would be seeing these fetch timeouts? They only appear during load tests and in production environments, but they can bring down the whole app with heavy load.
The code that emits the error (from Node's internal _http_client.js):
function socketCloseListener() {
  const socket = this;
  const req = socket._httpMessage;
  debug('HTTP socket close');

  // Pull through final chunk, if anything is buffered.
  // the ondata function will handle it properly, and this
  // is a no-op if no final chunk remains.
  socket.read();

  // NOTE: It's important to get parser here, because it could be freed by
  // the `socketOnData`.
  const parser = socket.parser;
  const res = req.res;
  if (res) {
    // Socket closed before we emitted 'end' below.
    if (!res.complete) {
      res.aborted = true;
      res.emit('aborted');
    }
    req.emit('close');
    if (res.readable) {
      res.on('end', function() {
        this.emit('close');
      });
      res.push(null);
    } else {
      res.emit('close');
    }
  } else {
    if (!req.socket._hadError) {
      // This socket error fired before we started to
      // receive a response. The error needs to
      // fire on the request.
      req.socket._hadError = true;
      req.emit('error', connResetException('socket hang up'));
    }
    req.emit('close');
    // …
  }
}
The message is generated when the server does not send a response.
That's the easy bit.
But why would the API server not send a response?
Well, without seeing the minimal code that reproduces this, I can only give you some pointers.
This issue here discusses at length the changes between Node versions 6 and 8, in particular how a GET with a body can now cause it. This change of behaviour is more closely aligned with the REST specs.
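To illustrate the GET-with-body case (a hedged sketch, not from the linked issue): some servers or proxies close the socket as soon as a GET arrives carrying a request body, which surfaces client-side as "socket hang up".

const http = require('http');

// A GET that writes a request body; a server that rejects this may close
// the connection without responding, producing "socket hang up" on the client.
const req = http.request({ method: 'GET', host: 'example.com', path: '/' });
req.on('error', err => console.error(err.message)); //=> e.g. "socket hang up"
req.write(JSON.stringify({ unexpected: 'body' }));
req.end();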
I'm currently looking to set up an endpoint that accepts a request, and returns the response data in increments as they load.
The application of this is that given one upload of data, I would like to calculate a number of different metrics for that data. As each metric gets calculated asynchronously, I want to return this metric's value to the front-end to render.
For testing, my controller looks as follows, trying to use res.write:
uploadData = (req, res) => {
  res.write("test");
  setTimeout(() => {
    res.write("test 2");
    res.end();
  }, 3000);
}
However, I think the issue stems from my client side, which I'm writing in React-Redux, calling that route through an axios call. From my understanding, the axios request closes once it receives the first response, and the connection doesn't stay open. Here is what my axios call looks like:
axios.post('/api', data)
  .then((response) => {
    console.log(response);
  })
  .catch((error) => {
    console.log(error);
  });
Is there an easy way to do this? I've also thought about streaming, but my concern is that I would like each client's connection to be direct, unique, and open only for a short amount of time (i.e. only while the metrics are being calculated).
I should also mention that the resource being uploaded is a db, and I would like to avoid parsing it and opening a connection multiple times as a result of multiple endpoints.
Thanks in advance, and please let me know if I can provide any more context.
One way to handle this while still using a traditional API would be to store the metrics in an object somewhere, either a database or redis for example, and then just long poll the resource.
For a real world example, say you want to calculate the following metrics for foo: time completed, length of request, bar, foobar.
You could create an object in storage that looks like this:
{
  id: 1,
  lengthOfRequest: 123,
  .....
}
Then you would create an endpoint in your API, like metrics/{id}, that returns the object. Just keep calling the route until everything completes.
There are some obvious drawbacks to this of course, but once you get enough information to know how long the metrics will take to complete on average you can tweak the time in between the calls to your API.
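A minimal sketch of both halves (assuming an Express server and axios on the client; metricsStore, the route path, and the done flag are illustrative, not from the original post):

// Server side: a hypothetical in-memory store; in practice this would be
// the database or redis mentioned above.
const metricsStore = {}; // e.g. { 1: { id: 1, done: false, lengthOfRequest: 123 } }

app.get('/api/metrics/:id', (req, res) => {
  const metrics = metricsStore[req.params.id];
  if (!metrics) return res.status(404).end();
  res.json(metrics); // includes a `done` flag so the client knows when to stop
});

// Client side: poll until the metrics object reports completion.
async function pollMetrics(id) {
  const { data } = await axios.get(`/api/metrics/${id}`);
  if (!data.done) {
    await new Promise(resolve => setTimeout(resolve, 2000)); // tune to the average completion time
    return pollMetrics(id);
  }
  return data;
}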
I'm new to Node.js and I wonder whether the snippets of code below have a multisession problem.
Say I have a Node.js server (express) and I listen for a POST request:
app.post('/sync/:method', onPostRequest);

// function declaration, so it's hoisted above the app.post() call
function onPostRequest(req, res) {
  // parse request and fetch email list
  var emails = [....]; // pseudocode
  doJob(emails);
  res.status(200).end('OK');
}
// assumes: const fs = require('fs'); const _ = require('lodash');
function doJob(_emails) {
  try {
    let emailsFromFile = fs.readFileSync(FILE_PATH, "utf8") || {};
    if (_.isString(emailsFromFile)) {
      emailsFromFile = JSON.parse(emailsFromFile);
    }
    _emails.forEach(function(_email) {
      if (!emailsFromFile[_email]) {
        emailsFromFile[_email] = 0;
      } else {
        emailsFromFile[_email] += 1;
      }
    });
    // write the object back
    fs.writeFileSync(FILE_PATH, JSON.stringify(emailsFromFile));
  } catch (e) {
    console.error(e);
  }
}
So the doJob method receives the _emails list, and I update the counters for these emails in the emailsFromFile object loaded from the file.
Say I get 2 requests at the same time and doJob is triggered twice. I'm afraid that while one request has loaded emailsFromFile from the file, the second request might change the file content.
Can anybody shed some light on this issue?
Because the code in the doJob() function is all synchronous, there is no risk of multiple requests causing a concurrency problem.
If you were using async IO in that function, then there would be possible concurrency issues.
To explain, Javascript in node.js is single threaded. So, there is only one thread of Javascript execution running at a time and that thread of execution runs until it returns back to the event loop. So, any sequence of entirely synchronous code like you have in doJob() will run to completion without interruption.
If, on the other hand, you used any asynchronous operations such as fs.readFile() instead of fs.readFileSync(), then the thread of execution would return to the event loop at the point you call fs.readFile(), and another request could run while the file is being read. If that were the case, you could end up with two requests conflicting over the same file, and you would have to implement some form of concurrency protection (some sort of flag or queue). This is the type of thing that databases offer lots of features for.
I have a node.js app running on a Raspberry Pi that uses lots of async file I/O, and I can get conflicts in that code from multiple requests. I solved it by setting a flag any time I'm writing to a specific file; any other request that wants to write to that file first checks the flag, and if it is set, the request goes into my own queue and is served when the prior request finishes its write operation. There are many other ways to solve this too. If it happens in a lot of places, it's probably worth just getting a database that offers features for this type of write contention. A sketch of the queue idea follows.
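This sketch serializes writes with a promise chain rather than the author's exact flag mechanism, and assumes the promise-based fs API; queueWrite and updateFn are illustrative names:

const fs = require('fs').promises;

// All writes to the file are chained onto the previous write's promise,
// so concurrent requests effectively queue behind each other.
let writeChain = Promise.resolve();

function queueWrite(filePath, updateFn) {
  const run = async () => {
    const raw = await fs.readFile(filePath, 'utf8').catch(() => '{}');
    const data = JSON.parse(raw);
    updateFn(data); // apply this request's changes
    await fs.writeFile(filePath, JSON.stringify(data));
  };
  // run after the previous write, whether it succeeded or failed
  const result = writeChain.then(run, run);
  writeChain = result.catch(() => {}); // keep the chain alive on errors
  return result;
}

// Usage, e.g. inside an async doJob():
// await queueWrite(FILE_PATH, data => { data[email] = (data[email] || 0) + 1; });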