Delaying execution of multiple HTTP requests in Google Cloud Function - node.js

I've implemented a web scraper with Node.js, cheerio and request-promise that scrapes an endpoint (a basic HTML page) and returns certain information. The content of the page I'm crawling differs based on a parameter at the end of the URL (http://some-url.com?value=12345 where 12345 is my dynamic value).
I need this crawler to run every x minutes and crawl multiple pages, and to do that I've set up a cron job using Google Cloud Scheduler. (I'm fetching the dynamic values I need from Firebase.)
There could be more than 50 different values for which I'd need to crawl the corresponding page, but I would like to spread out the requests so the target server doesn't choke. To accomplish this, I've tried to add a delay:
1) using setTimeout
2) using setInterval
3) using a custom sleep implementation:
const sleep = require('util').promisify(setTimeout);
All 3 of these methods work locally; all of the requests are made with y seconds delay as intended.
But when run with Firebase Cloud Functions and Google Cloud Scheduler:
1) not all of the requests are sent
2) the delay is NOT consistent (some requests fire with the proper delay, then no requests are made for a while, and others are sent with a major delay)
I've tried many things but I wasn't able to solve this problem.
I was wondering if anyone could suggest a different theoretical approach, or a library, etc., that I could use for this scenario, since the one I have now doesn't seem to work as I intended. I'm adding one of the approaches that works locally below.
Cheers!
courseDataRefArray.forEach(async (dataRefObject: CourseDataRef, index: number) => {
  console.log(`Foreach index = ${index} -- Hello StackOverflow`);
  setTimeout(async () => {
    console.log(`Index in setTimeout = ${index} -- Hello StackOverflow`);
    await CourseUtil.initiateJobForCourse(dataRefObject.ref, dataRefObject.data);
  }, 2000 * index);
});
(Note: I can provide more code samples if necessary; but it's mostly following a loop & async/await & setTimeout pattern, and since it works locally I'm assuming that's not the main problem.)
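For completeness, here is a sketch of the sleep-based variant (option 3) written as a sequential for...of loop; the crawlAll wrapper name is just for illustration, the other names are the same as in the snippet above.

const sleep = require('util').promisify(setTimeout);

// Hypothetical wrapper: crawls the courses one after another instead of
// scheduling them all at once with setTimeout.
async function crawlAll(courseDataRefArray) {
  for (const [index, dataRefObject] of courseDataRefArray.entries()) {
    console.log(`Crawling index = ${index}`);
    await CourseUtil.initiateJobForCourse(dataRefObject.ref, dataRefObject.data);
    await sleep(2000); // wait 2 seconds before starting the next request
  }
}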

Related

Testing multiple URLs using Nightwatchjs & Request

I'm using the Request package with my Nightwatchjs setup to test the status codes of a number of URLs (about 50 in total).
My issue is twofold.
Firstly, my code is currently as follows (for a single URL):
var request = require('request');

module.exports = {
  'Status Code testing': function (statusCode, browser) {
    request(browser.launch_url + browser.globals.reviews + 'news/', function (error, response, body) {
      browser.assert.equal(response.statusCode, 200);
    });
  },
};
but it's failing with a
✖ TypeError: Cannot read property 'launch_url' of undefined
So my first question is, how can I incorporate browser.launch_url + browser.globals.reviews + 'news/' into the script for a request?
Secondly, I have a list of about 50 URLs that I need to test the status code of.
Rather than repeat the code below 50 times, is there a more succinct, readable way of testing these URLs?
Any help would be greatly appreciated.
Many thanks.
The correct call you should be making is browser.launchUrl to access that value; then you can concatenate the additional string paths.
Also, I believe you may be mistaking the purpose of Nightwatch. Nightwatch isn't an API testing tool; it's used to test the UI as part of end-to-end testing. While you can incorporate some data validation to supplement your UI testing with Nightwatch, there are better options out there for that.
But since your question was how to do this with Nightwatch without repeating the code over and over: I created one function for my GET requests and a separate function for my POST requests. My GET function passes in a token for authentication, and my POST function passes in a token as well as the payload (in my case JSON). Hope this helps you out.
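As a rough sketch of that idea (the checkStatusCodes helper name and the path list are illustrative, not from your setup), a single function can loop over an array of paths and assert each status code:

var request = require('request');

// Hypothetical helper: checks every path in the list against the launch URL.
function checkStatusCodes(browser, paths) {
  paths.forEach(function (path) {
    request(browser.launchUrl + path, function (error, response) {
      browser.assert.equal(response.statusCode, 200);
    });
  });
}

module.exports = {
  'Status Code testing': function (browser) {
    checkStatusCodes(browser, [
      browser.globals.reviews + 'news/',
      // ...the other ~50 paths
    ]);
  },
};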

Rate Limit the Nodejs Module Request

So I'm trying to create a data scraper with Node.js using the Request module. I'd like to limit the concurrency to 1 domain on a 20ms cycle to go through 50,000 URLs.
When I execute the code, I'm DoS-ing the network with the 40Gbps bandwidth my system has access to... This creates local problems and remote problems.
Five concurrent scans on a 120ms cycle for 50k domains (if I calculated correctly) would finish the list in ~20 minutes and shouldn't create any issues on the remote end, at least.
The code I'm testing with:
var request = require('request');

var urls = []; // data from mongodb

urls.forEach(function (url) {
  request(url, function (error, response, body) {
    // process the response body here
  });
});
The forEach loop executes instantly, "queueing" all the URLs and trying to fetch them all at once. It seems impossible to add a delay on each iteration, and the same thing happens with a plain for loop: there's no control over how fast the iterations execute. All my Google searches only show how to rate-limit incoming requests to your own server/API. I'm probably missing something, or the code logic is wrong. Any suggestions?
To simplify your implementation, use async/await and Promises instead of callbacks.
Use a package such as got or axios to run promise-based requests.
Use p-map or a similar helper from promise-fun.
Here is a copy-pasted example:
const pMap = require('p-map');

const urls = [
  'sindresorhus.com',
  'ava.li',
  'github.com',
  …
];

console.log(urls.length);
//=> 100

const mapper = url => {
  return fetchStats(url); //=> Promise
};

pMap(urls, mapper, {concurrency: 5}).then(result => {
  console.log(result);
  //=> [{url: 'sindresorhus.com', stats: {…}}, …]
});
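A sketch of adapting that pattern to your case might look like this; got is used for the requests, and processPage() is a placeholder for whatever processing you actually do:

const pMap = require('p-map');
const got = require('got');

const urls = []; // data from mongodb

const mapper = async url => {
  const response = await got(url);   // one HTTP request per URL
  return processPage(response.body); // hypothetical processing step
};

// at most 5 requests are in flight at any time
pMap(urls, mapper, { concurrency: 5 }).then(results => {
  console.log(results.length);
});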

How to perform multiple Nightmare functions without them hanging up

I'm trying to scrape a webpage with NightmareJS and got stuck.
In my program I pass the function an array of links, and I need to scrape the same data from all of them.
The list can be very long (over 60), and if I try to do
async.each(Links, function (url, callback) {
  var nightmare = Nightmare(size);
  ...
});
only the first few instances actually return a value; the others just hang and won't load (blank page). When I try it with only three links, it works perfectly.
How can I fix this? How can I redistribute the work, for example running three in parallel and only starting the next set once they are all done? One more thought: maybe use the same instance and repeat the steps for all the links?
There are two possible solutions:
Use async.eachSeries, which waits until one operation is done before launching the next one.
Or use async.eachLimit, which takes an extra argument limiting how many operations run at the same time; a sketch is below.
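A sketch of the second option, assuming the same Links array and size options object as in your snippet; the goto/evaluate steps are placeholders for your actual scraping steps:

var async = require('async');
var Nightmare = require('nightmare');

var size = { show: false }; // stands in for the options used in your snippet
var Links = [];             // your array of links

// run at most 3 Nightmare instances at a time
async.eachLimit(Links, 3, function (url, callback) {
  var nightmare = Nightmare(size);

  nightmare
    .goto(url)
    .evaluate(function () {
      return document.title; // placeholder for the data you actually scrape
    })
    .end()
    .then(function (result) {
      callback();            // this link is done; eachLimit starts the next one
    })
    .catch(callback);
}, function (err) {
  if (err) console.error(err);
  // all links have been processed
});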

MarkLogic 8 - XQuery write large result set to a file efficiently

UPDATE: See MarkLogic 8 - Stream large result set to a file - JavaScript - Node.js Client API for someone's answer on how to do this in Javascript. This question is specifically asking about XQuery.
I have a web application that consumes rest services hosted in node.js.
Node simply proxies the request to XQuery which then queries MarkLogic.
These queries already have paging setup and work fine in the normal case to return a page of data to the UI.
I need to have an export feature such that when I put a URL parameter of export=all on a request, it doesn't look up a page anymore.
At that point it should get the whole result set, even if it's a million records, and save it to a file.
The actual request needs to return immediately saying, "We will notify you when your download is ready."
One suggestion was to use xdmp:spawn to call the XQuery in the background which would save the results to a file. My actual HTTP request could then return immediately.
For the spawn piece, I think the idea is that I run my query with different options in order to get all results instead of one page. Then I would loop through the data and create a string variable to call xdmp:save with.
Some questions: is this a good idea? Is there a better way? If I loop through the result set and it happens to be very large (gigabytes), it could cause memory issues.
Is there no way to directly stream the results to a file in XQuery?
Note: Another idea I had was to intercept the request at the proxy (Node) layer and then do an xdmp:estimate to get the record count, then loop through, querying each page and flushing it to disk. In that case I would need to find some way to return my request immediately yet keep processing in the background in Node, which this article seems to have some ideas about: http://www.pubnub.com/blog/node-background-jobs-async-processing-for-async-language/
One possible strategy would be to use a self-spawning task that, on each iteration, gets the next page of the results for a query.
Instead of saving the results directly to a file, however, you might want to consider using xdmp:http-post() to send each page to a server:
http://docs.marklogic.com/xdmp:http-post?q=xdmp:http-post&v=8.0&api=true
In particular, the server could be a Node.js server that appends each page as it arrives to a file or any other datasink.
That way, Node.js could handle the long-running asynchronous IO with minimal load on the database server.
When a self-spawned task hits the end of the query, it can again use an HTTP request to notify Node.js to close the file and report that the export is finished.
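On the Node.js side, a minimal sketch of such a receiver might look like the following (the routes, port, and file path are hypothetical, not part of this answer's setup): one endpoint appends each posted page to a file, and a second endpoint lets the final spawned task signal that the export is done.

const http = require('http');
const fs = require('fs');

// Append-mode stream for the export file (hypothetical path).
const out = fs.createWriteStream('/tmp/export.txt', { flags: 'a' });

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/page') {
    req.pipe(out, { end: false });        // append this page as it streams in
    req.on('end', () => res.end('ok'));
  } else if (req.method === 'POST' && req.url === '/done') {
    out.end();                            // close the file; notify the user here
    res.end('export finished');
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(3000);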
Hoping that helps,

150ms delay in performing a HTTPS versus HTTP get request in Node

I don't know much about how the https module in node.js works so if any of you can answer this question then that would be great.
I have noticed in a small app I made that it takes about ~150ms for an https.get(...) call to execute from scratch before any actual request is sent out. This is what I'm talking about:
var http = require('http');
var https = require('https');

console.time("Begin");

function request() {
  console.timeEnd("Begin");
  var myvar = https.get("https://www.fiadkbjadfklblnfthiswebsidedoesnotexist.com", function(res) {
  });
  console.timeEnd("Begin");
}

request();
When I use https.get, the console says that approximately 150ms passed before the code even starts doing anything with the GET request. However, when I use http.get the delay is less than 5ms.
My question is: what exactly is causing this 150ms delay, and is there any way to reduce it? I'm sure that it is not SSL handshaking, because this delay happens even when I enter a non-existent website. It would be great if it were possible to code something earlier in the program so that when I execute an https.get() request, it would not have such a long startup time.
You are using console.timeEnd('Begin') multiple times in your code.
As of Node v6.0.0, timeEnd() deletes the timer to avoid leaking it.
So when you call console.timeEnd('Begin') the first time, it deletes the timer, and the second call of console.timeEnd('Begin') can no longer find a timer with that label.
You can create multiple labels if you want two timers for different measurements, but make sure you call timeEnd() only once for every time().
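For example, here is a sketch with two separate labels (the URL is just a placeholder), where each console.time() is ended by exactly one console.timeEnd():

const https = require('https');

console.time('setup');      // measures the synchronous setup of the request
console.time('total');      // measures setup plus network time

const req = https.get('https://example.com', (res) => {
  console.timeEnd('total'); // ends when the response headers arrive
  res.resume();             // drain the response so the socket can be freed
});

req.on('error', (err) => console.error(err.message));

console.timeEnd('setup');   // ends as soon as https.get() has returned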
