Requesting HTTP HEAD of thousands of URLs via NodeJS

I need to check the availability of about 300,000 URLs on a local server via HTTP. The files are not in a local file system but in a key-value store, and the goal is to sanity-check whether every system that needs access to those files can reach them via HTTP.
To do so, I would use HTTP HEAD requests that return HTTP 200 for every file found and 404 for every file not found.
The problem is that if I issue too many requests at once, I get rate-limited by nginx or a local proxy, and then I get no information about whether a file is really accessible.
My method to look for the availability of files looks as follows:
...
const request = require('request'); // Using the request lib.
...
const checkEntity = entity => {
    logger.debug("HTTP HEAD ", entity);
    return request({ method: "HEAD", uri: entity.url })
        .then(result => {
            logger.debug("Successfully retrieved file: " + entity.url);
            entity.valid = result != undefined;
        })
        .catch(err => {
            logger.debug("Failed to retrieve file.", err);
            entity.valid = false;
        });
}
If I call this function a few times, things work as expected. However, when I try to run it within recursive promises, I quickly exceed the maximum call stack, and setting up one promise per call uses too much memory.
How could this be solved?

This problem can be solved in these steps:
Define a queue and store all your entities (all URLs that need to be checked) in it.
Decide how many HTTP requests you want to send in parallel. This number should be neither too small nor too large: if it is too small, the program is inefficient; if it is too large, you run into the same rate-limit problem again. Call it N, and pick a reasonable value based on your server's capacity.
Send N HTTP requests in parallel at the start.
Whenever one request finishes, fetch a new entity from the queue and send a new request. To get notified when a request is done, add a callback parameter to your checkEntity function.
This way, the number of HTTP requests in flight never exceeds N.
Here is a pseudo code example based on your code snippet:
let allEntities = [...]; // 300000 URLs
let finishedEntities = [];
const request = require('request'); // Using the request lib.
...
const checkEntity = function(entity, callback) {
    logger.debug("HTTP HEAD ", entity);
    return request({ method: "HEAD", uri: entity.url })
        .then(result => {
            logger.debug("Successfully retrieved file: " + entity.url);
            entity.valid = result != undefined;
            callback(entity);
        })
        .catch(err => {
            logger.debug("Failed to retrieve file.", err);
            entity.valid = false;
            callback(entity);
        });
}
function checkEntityCallback(entity) {
    finishedEntities.push(entity);
    let newEntity = allEntities.shift();
    if (newEntity) {
        checkEntity(newEntity, checkEntityCallback); // pass the entity already shifted above, don't shift twice
    }
}

for (let i = 0; i < 10; i++) {
    checkEntity(allEntities.shift(), checkEntityCallback);
}
To make things easier to understand, you can use request's plain callback style and drop the Promise handling entirely:
const checkEntity = function(entity, callback) {
    logger.debug("HTTP HEAD ", entity);
    request({ method: "HEAD", uri: entity.url }, function(error, response, body) {
        if (error) {
            logger.debug("Failed to retrieve file.", error);
            entity.valid = false;
            callback(entity);
            return;
        }
        logger.debug("Successfully retrieved file: " + entity.url);
        // in the callback API a 404 is not a transport error, so check the status code
        entity.valid = response.statusCode === 200;
        callback(entity);
    });
}
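One detail the pseudo code above leaves open is knowing when the whole queue has drained. Here is a minimal sketch of a variant of checkEntityCallback, assuming the same allEntities / finishedEntities arrays and that the total count is captured before the initial for loop:
// Capture the total before any entity is shifted off the queue.
const totalCount = allEntities.length;

function checkEntityCallback(entity) {
    finishedEntities.push(entity);
    if (finishedEntities.length === totalCount) {
        logger.debug("All URLs checked, " + finishedEntities.length + " entities processed.");
        return; // queue fully drained
    }
    let newEntity = allEntities.shift();
    if (newEntity) {
        checkEntity(newEntity, checkEntityCallback);
    }
}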

Related

How to store a response into a variable with the Node.js request module

I am trying to store the response of an HTTP request made with the Node.js request module, but the problem is that I can't access it after the request is completed, i.e. after the callback.
How can I do that?
Here is what I have tried so far:
Tried to use var instead of let.
Tried passing it to a function so that I can use it later, but no luck.
Here is my code. Can anyone help? I'm actually new to Node.js, so this may be a noob question.
var request = require('request')
var response

function sort(body) {
    for (var i = 0; i < body.length; i++) {
        body[i] = body[i].replace("\r", "");
    }
    response = body
    return response
}

request.get(
    "https://api.proxyscrape.com/?request=getproxies&proxytype=http&timeout=10000&country=all&ssl=all&anonymity=all",
    (err, res, body) => {
        if (err) {
            return console.log(err);
        }
        body = body.split("\n");
        sort(body);
    }
);
console.log(response)
Here I am fetching the proxies from this API and trying to store them in a variable called response.
var request = require("request");
var response;
async function sort(body) {
await body.split("\n");
response = await body;
console.log(response); // this console log show you after function process is done.
return response;
}
request.get(
"https://api.proxyscrape.com/?request=getproxies&proxytype=http&timeout=10000&country=all&ssl=all&anonymity=all",
(err, res, body) => {
if (err) {
return console.log(err);
}
sort(body);
}
);
// console.log(response); //This console log runs before the function still on process, so that's why it gives you undefined.
Try this code; it works fine, I just tested it.
Put the console.log inside the function so you can see the result.
The console.log that you had runs before the data has been processed, which is why you are getting "undefined".
You will only get the data after the sort function has finished processing.
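Beyond logging inside the callback, a common alternative (not part of the original answer, just a sketch using a hypothetical getProxies helper) is to wrap the request in a Promise so the caller can wait for the parsed list instead of reading a global variable:
const request = require("request");

// Hypothetical helper: resolves with the cleaned-up proxy list once the request finishes.
function getProxies(url) {
    return new Promise((resolve, reject) => {
        request.get(url, (err, res, body) => {
            if (err) return reject(err);
            resolve(body.split("\n").map((line) => line.replace("\r", "")));
        });
    });
}

getProxies("https://api.proxyscrape.com/?request=getproxies&proxytype=http&timeout=10000&country=all&ssl=all&anonymity=all")
    .then((proxies) => console.log(proxies)) // the data is only available here, after the request completes
    .catch((err) => console.log(err));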

empty response body in HTTP response in github actions

I'm trying to create a github action which requires sending an http request to https://www.instagram.com/<username>/?__a=1.
When I'm running it locally, it runs perfectly fine and gives me the number of followers.
But when I use it in github actions, it isn't able to parse the JSON string as the response is null
Here is a link to the github action file https://github.com/ashawe/actions-check/blob/e80ca115544979cdb3180207b99c7724e4446849/index.js
Here is the code to get the followers (starts at line #94):
promiseArray.push(new Promise((resolve, reject) => {
    const url = 'https://www.instagram.com/' + INSTAGRAM_USERNAME + '/?__a=1';
    core.info("url is");
    core.info(url);
    http.get(url, (response) => {
        let chunks_of_data = [];
        response.on('data', (fragments) => {
            chunks_of_data.push(fragments);
        });
        response.on('end', () => {
            let response_body = Buffer.concat(chunks_of_data);
            core.info(response_body.toString());
            let responseJSON = JSON.parse(response_body.toString());
            resolve((responseJSON.graphql.user.edge_followed_by.count).toString());
        });
        response.on('error', (error) => {
            reject(error);
        });
    });
}));
and then I'm processing it like:
Promise.allSettled(promiseArray).then((results) => {
    results.forEach((result, index) => {
        if (result.status === 'fulfilled') {
            // Succeeded
            // core.info(runnerNameArray[index] + ' runner succeeded. Post count: ' + result.value.length);
            // postsArray.push(result.value);
            instagram_followers = result.value;
        } else {
            jobFailFlag = true;
            // Rejected
            //core.error(runnerNameArray[index] + ' runner failed, please verify the configuration. Error:');
            core.error(result.reason);
        }
    });
}).finally(() => {
    try {
        const followers = instagram_followers;
        const readmeData = fs.readFileSync(README_FILE_PATH, 'utf8');
        // core.info(readmeData);
        const shieldURL = "https://img.shields.io/badge/ %40 " + INSTAGRAM_USERNAME + "-" + followers + "-%23E4405F?style=for-the-badge&logo=instagram";
        const instagramBadge = "<img align='left' alt='instagram-followers' src='" + shieldURL + "' />";
        const newReadme = buildReadme(readmeData, instagramBadge);
        // core.info(newReadme);
        // if there's change in readme file update it
        if (newReadme !== readmeData) {
            core.info('Writing to ' + README_FILE_PATH);
            fs.writeFileSync(README_FILE_PATH, newReadme);
            if (!process.env.TEST_MODE) {
                // noinspection JSIgnoredPromiseFromCall
                commitReadme();
            }
        } else {
            core.info('No change detected, skipping');
            process.exit(0);
        }
    } catch (e) {
        core.error(e);
        process.exit(1);
    }
});
But when I run the action, it gives this error:
which means that response_body isn't a complete JSON response, even though a request to https://www.instagram.com/USERNAME/?__a=1 does send a JSON response.
UPDATE
Basically, every time you hit that endpoint it returns the login HTML page, which causes the JSON parse to fail. It appears that you may need to use the API, which requires you to authenticate before getting info from users, or figure out other scraping methodologies.
I was able to recreate this failure on my local PC by jumping onto a VPN and into a private browser. When I hit the endpoint it took me to the login screen, and when I hit the endpoint through curl in the terminal it returned nothing. But when I got off the VPN, it all worked fine. I think the reason it worked on your local machine is that there's some caching happening in the browser and you're probably not on a VPN. I am thinking there's some network blacklisting happening when on a VPN. I don't know the GitHub-hosted network, so I would recommend opening a ticket with them if you want to learn more about that.
Here are the instagram api docs for quick reference
https://developers.facebook.com/docs/instagram-basic-display-api/getting-started
Previous response (leaving it here for other users' future reference):
You are not passing a username, so it's trying to query the endpoint with an empty username.
Instead of running just node index.js in your action, you need to call your action and provide it with the parameters it needs:
- name: Your github action
  uses: ./ # Uses an action in the root directory
  with:
    username: '_meroware'
Then your code will pick it up properly:
const INSTAGRAM_USERNAME = core.getInput('username');
const url = 'https://www.instagram.com/' + INSTAGRAM_USERNAME + '/?__a=1';
Resources:
https://docs.github.com/en/actions/creating-actions/creating-a-javascript-action

http request takes too much time only when it is actually deployed on server

I'm trying to make a POST request from my front page via jQuery ajax to my server, and then, with the data from the front end, make a POST request from my server to an outer server. Using the final response from that outer request, I want to render new data on my front end in the ajax success function.
It seemed to work well on the local server, but when I deploy this project to Heroku or Azure, the whole process takes 1000~2000 ms and doesn't seem to work at all.
What's wrong with my code?
I'm trying to build a detection system that notifies users when there's a vacancy in a wanted course, so I let the user pick a class and at the same time call a function that checks whether there's a vacancy in that class via a POST request to the school server.
//index.html
//in the front page, when a user picks a course to start observing, I send a POST to my server
function scanEmpty(data, cn, elem) {
    $.ajax({
        url: './getLeftSeat',
        crossDomain: true,
        type: 'POST',
        data: data + `&cn=${cn}`,
        success: function (data) {
            alert('got!')
        }
    })
}
//app.js
// when I get POST from my server I call scanEmpty()
app.post('/getLeftSeat', async (req, res) => {
    scanEmpty(qs.stringify(req.body), req.body["cn"], () => { res.json({ success: true }) })
})
// and that is like this
const scanEmpty = async (data, CN, cb) => {
    if (await parseGetLeftSeat(await getData(data), CN)) cb()
    else {
        await scanEmpty(data, CN, cb)
    }
}
// send POST to school server and get response using axios
async function getData(data) {
    return await axios.post(school_server_url, data, {'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'});
}
// it just parses the response and extracts the data that I want
const parseGetLeftSeat = async (res, CN) => {
    return new Promise((resolve, reject) => {
        const $ = cheerio.load(res.data);
        $("#premier1 > div > table > tbody > tr > td").each((i, e) => {
            if (e.firstChild && e.firstChild.data == CN && e.next) {
                const tmp = e.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.next.firstChild.data.split('/')
                resolve(Number(tmp[1].trim()) - Number(tmp[0].trim()) < 1 ? false : true)
            }
        })
    })
}
It works all right but takes 1000~2000 ms on the deployed server, while it takes 100~200 ms on the local server. I tested some code and it looks like axios.post() is the culprit, but even when I changed it to node-fetch, the result was the same. I really don't know what to do.

Nodejs: Async request with a list of URL

I am working on a crawler. I have a list of URLs that need to be requested. There are several hundred requests going out at the same time if I don't make them sequential. I am afraid that it would blow up my bandwidth or produce too much network traffic to the target website. What should I do?
Here is what I am doing:
urlList.forEach((url, index) => {
    console.log('Fetching ' + url);
    request(url, function(error, response, body) {
        //do sth for body
    });
});
I want each request to be fired only after the previous one has completed.
You can use something like the bluebird Promise library, e.g. this snippet:
const Promise = require("bluebird");
const axios = require("axios");
//Axios wrapper for error handling
const axios_wrapper = (options) => {
return axios(...options)
.then((r) => {
return Promise.resolve({
data: r.data,
error: null,
});
})
.catch((e) => {
return Promise.resolve({
data: null,
error: e.response ? e.response.data : e,
});
});
};
Promise.map(
urls,
(k) => {
return axios_wrapper({
method: "GET",
url: k,
});
},
{ concurrency: 1 } // Here 1 represents how many requests you want to run in parallel
)
.then((r) => {
console.log(r);
//Here r will be an array of objects like {data: [{}], error: null}, where if the request was successfull it will have data value present otherwise error value will be non-null
})
.catch((e) => {
console.error(e);
});
The things you need to watch for are:
Whether the target site has rate limiting and you may be blocked from access if you try to request too much too fast?
How many simultaneous requests the target site can handle without degrading its performance?
How much bandwidth your server has on its end of things?
How many simultaneous requests your own server can have in flight and process without causing excess memory usage or a pegged CPU.
In general, the scheme for managing all this is to create a way to tune how many requests you launch. There are many different ways to control this by number of simultaneous requests, number of requests per second, amount of data used, etc...
The simplest way to start would be to just control how many simultaneous requests you make. That can be done like this:
function runRequests(arrayOfData, maxInFlight, fn) {
    return new Promise((resolve, reject) => {
        let index = 0;
        let inFlight = 0;

        function next() {
            while (inFlight < maxInFlight && index < arrayOfData.length) {
                ++inFlight;
                fn(arrayOfData[index++]).then(result => {
                    --inFlight;
                    next();
                }).catch(err => {
                    --inFlight;
                    console.log(err);
                    // purposely eat the error and let the rest of the processing continue
                    // if you want to stop further processing, you can call reject() here
                    next();
                });
            }
            if (inFlight === 0) {
                // all done
                resolve();
            }
        }
        next();
    });
}
And, then you would use that like this:
const rp = require('request-promise');

// run the whole urlList, no more than 10 at a time
runRequests(urlList, 10, function(url) {
    return rp(url).then(function(data) {
        // process fetched data here for one url
    }).catch(function(err) {
        console.log(url, err);
    });
}).then(function() {
    // all requests done here
});
This can be made as sophisticated as you want by adding a time element to it (no more than N requests per second) or even a bandwidth element to it.
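As a rough illustration of that time element (a sketch only, not part of the original answer): wrapping fn so that its calls are spaced out reuses runRequests() unchanged while also capping how many requests start per second.
// Sketch: limit how many requests are *started* per second, in addition to
// the maxInFlight cap enforced by runRequests() above.
function rateLimited(fn, perSecond) {
    const minInterval = 1000 / perSecond;
    let nextSlot = 0; // earliest timestamp at which the next call may start
    return function (item) {
        const now = Date.now();
        const wait = Math.max(0, nextSlot - now);
        nextSlot = Math.max(now, nextSlot) + minInterval;
        return new Promise((resolve) => setTimeout(resolve, wait)).then(() => fn(item));
    };
}

// no more than 10 in flight and no more than 5 new requests per second
runRequests(urlList, 10, rateLimited(function (url) {
    return rp(url);
}, 5)).then(function () {
    // all requests done here
});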
I want each request to be fired only after the previous one has completed.
That's a very slow way to do things. If you really want that, then you can just pass a 1 for the maxInFlight parameter to the above function, but typically, things would work a lot faster and not cause problems by allowing somewhere between 5 and 50 simultaneous requests. Only testing would tell you where the sweet spot is for your particular target sites and your particular server infrastructure and amount of processing you need to do on the results.
You can use the setTimeout function to spread out all the requests within the loop. For that you must know the maximum time it takes to process one request.
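A minimal sketch of that idea (not from the original answer, and assuming an upper bound on how long a single request takes): give each request its own delay so they start one interval apart.
const request = require('request');

const MAX_REQUEST_MS = 2000; // assumed upper bound for one request

urlList.forEach((url, index) => {
    // each request is scheduled one interval after the previous one, so at most
    // one should be in flight at a time if the assumed upper bound holds
    setTimeout(() => {
        request(url, function (error, response, body) {
            // do sth for body
        });
    }, index * MAX_REQUEST_MS);
});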

nodeJS blocking all requests until it calls back

I've developed a Node.js API (using Express) which allows users to log in and get a list of files that they have stored on a remote server. As you understand, the code must be non-blocking so the web server can still respond to login requests, even while some users are fetching their file lists.
Every time a user makes a request to get his file list, the listOfFiles function is called.
This is the code:
exports.listOfFiles = function(req, res){
    db.Account.find({where: {id: 1}}).then(function(newAcc){
        console.log("encontrou a account");
        getFiles('/', newAcc.accessToken, '0', newAcc, function(error){
            if (error) {
                log.error('Error getting files');
            } else {
                console.log("callback!")
            }
        });
    });
}
The getFiles function is responsible for fetching the file list from the remote server and storing it in a Postgres database:
function getFiles(path, accessToken, parentID, newAcc, callback){
    var client = new ExternalAPI.Client({
        key: config.get("default:clientId"),
        secret: config.get("default:clientSecret")
    });

    client._oauth._token = accessToken;

    var options = {
        removed: false,
        deleted: false,
        readDir: true
    }

    //this is the instruction that fetches an array of items
    //(metadata only) from a remote server
    client.stat(path, options, function(error, entries) {
        if (error) {
            if (error.status == 429) {
                console.log(accessToken + 'timeout')
                setTimeout(
                    getFiles(path, accessToken, parentID, callback),
                    60000);
            } else {
                log.error(error);
                callback(error, null);
            }
        }
        else {
            //When the array of items arrives:
            console.log("RECEIVED FILES")
            var inserted = 0;
            var items = entries._json.contents;
            for (var file in items) {
                var skyItemID = uuid.v1();
                var name = items[file].path.split('/').pop();
                var itemType;
                if (items[file].is_dir) {
                    itemType = 'folder';
                } else {
                    itemType = 'file';
                }
                newAcc.createItem({
                    name: name,
                    lastModified: items[file].modified,
                    skyItemID: skyItemID,
                    parentID: parentID,
                    itemSize: items[file].bytes,
                    itemType: itemType,
                    readOnly: items[file].read_only,
                    mimeType: items[file].mime_type
                }).then(function(item){
                    console.log(item.name)
                    if (++inserted == items.length) {
                        console.log(inserted)
                        console.log(items.length)
                        console.log("callsback")
                        callback();
                    }
                }).catch(function(error){
                    log.error('[DROPBOX] - Filename with special characters');
                    callback(new Error);
                    return;
                });
            }
        }
    });
}
The problem here is that, the moment the web server prints console.log("RECEIVED FILES") to the console, it stops responding to all other requests, such as login or file-fetch requests from other users.
It starts responding again when it prints console.log("callback!"). So I'm assuming that somehow Node.js blocks itself until the getFiles function has finished and called back.
I think this is not normal behaviour. Shouldn't Node.js keep responding to other requests even if there are some operations running in the background? Shouldn't the getFiles function run in the background without affecting or blocking all other requests? What am I doing wrong?
Thanks!
I am facing the same kind of problem: a long-running HTTP request to another server blocks the service from responding to other client requests. This is my topic: What is the correct behavior for Node.js while Node.js is requesting a long-running HTTP request from other servers. Currently I have no answer for it. If you find the answer, please reply to me. Thanks.
