A media spider in Node.js

I'm working on a project named robot, hosted on GitHub. Its job is to fetch media from the URLs given in an XML config file. The XML config file has a defined format, just as you can see in the scripts dir.
My problem is as follows. There are two arguments:
A list which indicates how deep the web links go; according to the selector (CSS selector) in each list item, I can find either the media URL or the sub-page URL where I may finally find the media.
An array which contains the sub-page URLs.
A simplified example:
node_list = { ..., next: { ..., next: null } };
url_arr = [urls];
I want to iterate over all the items in the URL array, so I do this:
var http = require('http');

function fetch(url, node) {
  if (node == null)
    return;
  // here do something with an http request
  var req = http.get(url, function (res) {
    var data = '';
    res.on('data', function (chunk) {
      data += chunk;
    }).on('end', function () {
      // maybe here generate more new urls
      // get another url_list
      node = node.next;
      fetch(url_new, node); // url_new would be extracted from the fetched page
    });
  });
}

// here this needs to run in sync
for (var i = 0; i < url_arr.length; i++) {
  fetch(url_arr[i], node);
}
As you can see, using async HTTP requests like this eats up all the system resources, and I cannot control the process.
So does anyone have a good idea to solve this problem?
Or is Node.js not the proper tool for such jobs?

If the problem is that you make too many HTTP requests simultaneously, you could change the fetch function to operate on a stack of URLs.
Basically you would do this:
When fetch is called, insert the URL into the stack and check if a request is in progress:
If a request is not running, pick the first URL from the stack and process it; otherwise do nothing.
When an HTTP request is finished, have it take a new URL from the stack and process that.
This way you can have the for-loop add all the URLs like now, but only one URL is processed at a time, so there won't be too many resources in use.
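For illustration, here is a minimal sketch of that idea (the processing step is left as a comment, and the function names are just suggestions, not anything from the question):

var http = require('http');

var stack = [];        // URLs waiting to be fetched
var inProgress = false;

function fetch(url) {
  stack.push(url);
  if (!inProgress) next();
}

function next() {
  var url = stack.shift(); // take the first URL, so they are processed in order
  if (url === undefined) {
    inProgress = false;
    return;
  }
  inProgress = true;
  http.get(url, function (res) {
    var data = '';
    res.on('data', function (chunk) { data += chunk; });
    res.on('end', function () {
      // process 'data' here; newly discovered URLs can be added with fetch()
      next(); // only now is the next request started
    });
  }).on('error', function () {
    next(); // keep draining the stack even if one request fails
  });
}

With this in place, the for-loop over url_arr can stay exactly as it is; it only fills the stack, and the requests run one after another.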

Related

How to modify a response in Electron

Let's say that I'm using a GET request on https://example.com and the response is this:
This is a response message.
How would I modify it in my code, so that it changes the response to say something like this:
This is a MODIFIED response message.
For example, if my Electron app were to navigate to https://example.com, the screen would show me the modified content instead of the original content.
Essentially, I am trying to literally modify the response.
I have based my code off of this question, but it only shows a proof of concept with a pre-typed Buffer, whereas in my situation I'd like to modify the response instead of outright replacing it. So, my code looks like this:
const { protocol, net } = require("electron");

protocol.interceptBufferProtocol("http", (req, CALLBACK) => {
  if (req.url.includes("special url here")) {
    const request = net.request({
      method: req.method,
      headers: req.headers,
      url: req.url
    });
    request.on("response", (rp) => {
      const d = [];
      rp.on("data", c => d.push(c));
      rp.on("end", () => {
        const e = Buffer.concat(d);
        console.log(e.toString());
        // do SOMETHING with 'e', the response, then callback it.
        CALLBACK(e);
      });
    });
    request.end();
  } else {
    // Is supposedly going to carry out the request without interception
    protocol.uninterceptProtocol("http");
  }
});
This is supposed to manually request the URL, grab the response and return it. Without the protocol event, it works and gives me a response, but after some debugging, this piece of code consistently calls the same URL over and over with no response.
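For illustration, the "do SOMETHING" step could look like this, reusing the names from the snippet above (the search and replacement strings are placeholders from the example at the top of the question):

rp.on("end", () => {
  const e = Buffer.concat(d);
  // rewrite the body text before handing it back to Electron
  const modified = e.toString("utf8")
    .replace("This is a response message.", "This is a MODIFIED response message.");
  CALLBACK(Buffer.from(modified, "utf8"));
});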
There is also the WebRequest API, but there is no way of modifying the response body with it; you can only modify the request headers and related content.
I haven't looked fully into Chromium-based solutions, but after looking at this, I'm not sure if it is possible to modify the response so it appears on my app's end in the first place. Additionally, I'm not familiar with the Chromium/Puppeteer messages that get sent all over the place.
Is there an elegant way to have Electron to get a URL response/request, call the URL using the headers/body/etc., then save & modify the response to appear different in Electron?

expressJS/multer multiple file upload, render for each file

I am using expressJS with multer, and I want to create a website to upload multiple files.
I already managed to get this working. Currently I am using XMLHttpRequest for the POST requests on the client side, and I also update elements on the page from the client-side script. For 5 files selected to upload with one click on the submit button, I can do 5 POST requests from the client side and update the feedback one by one.
// load file
const formData = new FormData();
formData.append("multi_files", file1); // field name matches upload.array('multi_files') on the backend
let req = new XMLHttpRequest();
req.onreadystatechange = function () {
  if (req.readyState == XMLHttpRequest.DONE) {
    updateView(); // change layout in HTML when POST is DONE
  }
};
req.open("POST", "/upload");
req.send(formData);
Repeat for multiple files [file1, file2, file3, ...].
Question:
So now I would like to use res.render() with parameters instead. I am wondering if it is possible to receive one POST request and render the page multiple times. If I POST 5 files at once, I want to render every time one file has been processed, so the user on the client side sees the feedback. I don't need a progress bar; I just want to show some basic information and the status of each file. I played around a bit with res.render(), but didn't find anything working as I wished.
That way I can avoid adding any HTML code in my JavaScript, and just use the handlebars templates on the backend.
Front end:
const formData = new FormData();
formData.append("multi_files", file1);
formData.append("multi_files", file2);
formData.append("multi_files", file3);
let req = new XMLHttpRequest();
req.open("POST", "/upload");
req.send(formData);
And for backend I want something like this:
router.post('/upload', upload.array('multi_files'), async function (req, res, next) {
  const files = req.files;
  for (const file of files) {
    let result = processFile(file);
    res.render('/', result);
  }
});
But unfortunately I cannot res.render multiple times.

Chain of endpoints in Node and Express: how to prevent one of them from stopping the whole series?

On one page I have to get information from 8 different endpoints. 2 of them are outside of my application and sometimes they cause a delay in displaying the data. The web browser waits until the data is processed. Since they're outside of my app, I can't refactor them to make them faster, but I still need to show the information they provide. In addition, sometimes one of them returns nothing; if so, I use default data to show to the user. The waiting time hurts the user experience.
I'm using promises to call these endpoints. Below is part of the code snippet that I am using.
The code is working fine. The issue is the delay.
First, here is the array that contains all the services that I need to process:
var requests = [{
  // 0
  url: urlLocalApi + '/endpointURL_1/',
  headers: {
    'headers': 'apitoken'
  },
}, {
  // 1
  url: urlLocalApi + '/endpointURL_2/',
  headers: {
    'headers': 'apitoken'
  },
}];
The creation of this array is encapsulated in this method:
const requests = homePageFunctions.createRequest();
Now, here is how the data is processed. I am using both 'request-promise' and 'bluebird', plus a personal logger to check that everything goes fine.
const Promise = require("bluebird");
const request = require('request-promise');

var viewsHelper = {
  getPageData: function (requests) {
    return Promise.map(requests, function (obj) {
      return request(obj).then(function (body) {
        AppLogger.log(`Endpoint parsed`, statusLogger.infodate);
        return JSON.parse(body);
      });
    });
  }
};
module.exports = viewsHelper;
How do I call this?
viewsHelper.getPageData(requests)
  .then(results => {
    var output = [];
    for (var i = 0; i < results.length; i++) {
      output.push(results[i]);
    }
    // render data
    res.render('homepage/index', output);
    AppLogger.log(`PageData is rendered`, statusLogger.infodate);
  })
  .catch(err => {
    console.log(err);
  });
Note that each index of the "output" array holds the data returned by one endpoint.
The problem here is:
If any of the endpoints takes long, the entire chain slows down, even for the ones that have already been processed. The web page waits on a blank screen.
How can I prevent this behavior?
That is an interesting question, but I have questions of my own in order to answer it effectively.
You have a Node server and a client (HTML/JS).
You have 8 endpoints; 2 are slow because you don't have control over them.
Is the client (page) aware of the 8 endpoints, i.e. do you make 8 calls every time you reload the page?
OR
Does the page make one request to your Node.js server, which then synchronously calls the 8 endpoints?
If it is 1, then lazy loading will work easily for you, since the page is making the requests.
If it is 2, lazy loading will only work on the server side; the client will be blocked because it doesn't know (or care) how you load your data. The page made one request and it is blocked waiting for that request.
Obviously each method has its pros and cons.
One way you can solve this is to asynchronously call those endpoints on the Node side and cache the results, so that when the page makes its one request you already have the data ready (see the sketch below).
Again, we know very little about the situation; there are many ways to solve this.
Hope this helps.
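For illustration, a minimal sketch of that caching idea, reusing the 'requests' array and 'request-promise' from the question; the refresh interval and the route are placeholders:

const request = require('request-promise');

let cache = []; // last known result for each endpoint

function refreshCache() {
  requests.forEach((obj, i) => {
    request(obj)
      .then(body => { cache[i] = JSON.parse(body); })
      .catch(() => { /* keep the previous (or default) value on failure */ });
  });
}

refreshCache();
setInterval(refreshCache, 60 * 1000); // refresh once a minute (placeholder)

// the route handler can now respond immediately from the cache
router.get('/', (req, res) => {
  res.render('homepage/index', cache);
});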

Is there a way to limit the amount of data that I get from a response?

Hello, I've got a small challenge where I have to display some data that I get from an API. The main page will display the first 20 results, and clicking a button will add 20 more results to the page.
The API call that I was given returns an array with around 1500 elements, and the API doesn't have a parameter to limit the number of elements in the array, so my question is whether I can limit it somehow with axios, or should I just fetch all of these elements and display them?
This is the API: https://api.chucknorris.io/
There are two answers to your question.
The short answer is:
On your side, there's nothing you can do until pagination is implemented on the API side.
The second answer is:
You can handle it using the http module, like this:
var http = require('http');

http.request(opts, function (response) {
  var request = this;
  console.log("Content-length: ", response.headers['content-length']);
  var str = '';
  response.on('data', function (chunk) {
    str += chunk;
    if (str.length > 10000) {
      request.abort();
    }
  });
  response.on('end', function () {
    console.log('done', str.length);
    ...
  });
}).end();
This will abort the request at around 10,000 bytes, since the data arrives in chunks of various sizes.
Since the API has no parameter to limit the number of results, you are responsible for modifying the response.
Since you're using Axios, you could do this with a response interceptor, so that the response is modified before reaching your application (see the sketch below).
You may want to consider where the best place to do this is, though. If you allow the full response to come back to your application and then store it somewhere, it may be easier to return the next page of 20 results at the user's request rather than repeatedly calling the API.
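For illustration, a minimal sketch of such an interceptor, assuming the response body is the array itself; the page size of 20 comes from the question:

const axios = require('axios');

// trim every array response down to the first 20 elements before the app sees it
axios.interceptors.response.use(response => {
  if (Array.isArray(response.data)) {
    response.data = response.data.slice(0, 20);
  }
  return response;
});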

Node.js - Why does my HTTP GET request return a 404 when I know the data is there at the URL I am using

I'm still new enough with Node that HTTP requests trip me up. I have checked all the answers to similar questions, but none seem to address my issue.
I have been dealt a hand in the Wild of having to go after JSON files in an API. I then parse those JSON files to separate them out into rows that populate a SQL database. The API has one JSON file with an ID of 'keys.json' that looks like this:
{
  "keys": ["5sM5YLnnNMN_1540338527220.json", "5sM5YLnnNMN_1540389571029.json", "6tN6ZMooONO_1540389269289.json"]
}
Each array element in the keys property holds the value of one of the JSON data files in the API.
I am having problems getting either type of file returned to me, but I figure if I can learn what is wrong with the way I am trying to get 'keys.json', I can leverage that knowledge to get the individual JSON data files represented in the keys array.
I am using the npm modules 'request' and 'request-promise-native' as follows:
const request = require('request');
const rp = require('request-promise-native');
My URL is constructed with the following elements (I have used the ... to keep my client anonymous, but other than that it is a direct copy):
let baseURL = 'http://localhost:3000/Users/doug5solas/sandbox/.../server/.quizzes/'; // this is the development value only
let keysID = 'keys.json';
Clearly the localhost aspect will have to go away when we deploy but I am just testing now.
Here is my HTTP call:
let options = {
  method: 'GET',
  uri: baseURL + keysID,
  headers: {
    'User-Agent': 'Request-Promise'
  },
  json: true // Automatically parses the JSON string in the response
};

rp(options)
  .then(function (res) {
    jsonKeysList = res.keys;
    console.log('Fetched', jsonKeysList);
  })
  .catch(function (err) {
    // API call failed
    let errMessage = err.options.uri + ' ' + err.statusCode + ' Not Found';
    console.log(errMessage);
    return errMessage;
  });
Here is my console output:
http://localhost:3000/Users/doug5solas/sandbox/.../server/.quizzes/keys.json 404 Not Found
It is clear to me that the .catch() clause is being taken and not the .then() clause. But I do not know why, because the data is there at that spot; I know it is because I placed it there manually.
Thanks to @Kevin B for the tip regarding serving of static files. I revamped the logic using express.static, served the file using that capability, and everything worked as expected.
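For illustration, a minimal sketch of the express.static approach mentioned above; the directory is a placeholder for the real quizzes folder:

const express = require('express');
const path = require('path');
const app = express();

// serve the quiz JSON files (keys.json and the data files) as static assets
app.use(express.static(path.join(__dirname, '.quizzes')));

app.listen(3000);
// keys.json would then be reachable at http://localhost:3000/keys.json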
