I am trying to build a scraper to download video streams and save them to a private cloud instance using NightmareJS (http://www.nightmarejs.org/).
I have seen the documentation, and it shows how to download simple files like this:
.evaluate(function ev() {
    var el = document.querySelector("[href*='nrc_20141124.epub']");
    var xhr = new XMLHttpRequest();
    xhr.open("GET", el.href, false);
    xhr.overrideMimeType("text/plain; charset=x-user-defined");
    xhr.send();
    return xhr.responseText;
}, function cb(data) {
    var fs = require("fs");
    fs.writeFileSync("book.epub", data, "binary");
})
-- based on the SO post here -> Download a file using Nightmare
But I want to download video streams using the Node.js async streams API. Is there a way to open a stream from a remote URL and pipe it to a local (or another remote) writable stream using Node.js's built-in stream APIs?
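For illustration, here is a minimal sketch of the kind of piping I mean, using only Node's built-in https and fs modules. videoUrl and outPath are placeholders, and this assumes the video is served as a plain HTTP(S) resource (not an HLS/DASH manifest):

// Pipe a remote HTTP response stream straight into a local writable file stream.
// videoUrl and outPath are placeholders for illustration only.
var https = require('https');
var fs = require('fs');

var videoUrl = 'https://example.com/stream.mp4';
var outPath = 'video.mp4';

https.get(videoUrl, function (res) {
    res.pipe(fs.createWriteStream(outPath))
        .on('finish', function () { console.log('download complete'); })
        .on('error', function (err) { console.error('write failed', err); });
});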
You can check whether the server sends the "Accept-Ranges" (RFC 2616, 14.5) and "Content-Length" (14.13) headers through a HEAD request to that file, then request smaller chunks of the file you're trying to download using the "Range" (14.35) request header (each partial response carries a "Content-Range" (14.16) header) and write each chunk to the target file (you can use append mode to reduce management of the file stream).
Of course, this will be quite slow if you're requesting very small chunks sequentially. You could build a pool of requestors (e.g. 4) and only write the next correct chunk to the file (so the other requestors would not take on future chunks if they are already done downloading).
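As a rough sketch of that approach (not production code; fileUrl, outPath and CHUNK_SIZE are assumptions for illustration, and the loop below downloads chunks sequentially rather than with a pool):

// Check range support with a HEAD request, then download the file in ranged chunks
// and append each chunk to the target file.
const https = require('https');
const fs = require('fs');

const fileUrl = 'https://example.com/big-video.mp4'; // placeholder
const outPath = 'big-video.mp4';                     // placeholder
const CHUNK_SIZE = 1024 * 1024;                      // 1 MiB per request

function head(url) {
    return new Promise((resolve, reject) => {
        https.request(url, { method: 'HEAD' }, resolve).on('error', reject).end();
    });
}

function getRange(url, start, end) {
    return new Promise((resolve, reject) => {
        https.get(url, { headers: { Range: `bytes=${start}-${end}` } }, (res) => {
            const chunks = [];
            res.on('data', (c) => chunks.push(c));
            res.on('end', () => resolve(Buffer.concat(chunks)));
            res.on('error', reject);
        });
    });
}

(async () => {
    const headRes = await head(fileUrl);
    if (headRes.headers['accept-ranges'] !== 'bytes') throw new Error('server does not support ranges');
    const total = Number(headRes.headers['content-length']);

    // Append chunks in order so the whole file is never held in memory at once.
    for (let start = 0; start < total; start += CHUNK_SIZE) {
        const end = Math.min(start + CHUNK_SIZE - 1, total - 1);
        fs.appendFileSync(outPath, await getRange(fileUrl, start, end));
    }
})();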
Background: Stream Consumption
This is how you can consume and read a stream of data bytes received in the client:
Get a Response object (for example, a fetch response).
Retrieve the ReadableStream from its body field, i.e. response.body.
Since body is an instance of ReadableStream, we can call body.getReader().
Then use the reader via reader.read().
That is simple enough: we can consume the stream of bytes coming from the server, and we can consume it the same way in Node.js and the browser through the Web Streams API (which Node.js also implements).
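As a concrete sketch of those steps (this uses the global fetch available in modern browsers and Node 18+, inside an ES module or async function; the URL is a placeholder):

// Read the body of a fetch response chunk by chunk via the Web Streams API.
const response = await fetch('https://example.com/data.bin'); // placeholder URL
const reader = response.body.getReader();

let received = 0;
while (true) {
    const { done, value } = await reader.read();
    if (done) break;           // stream exhausted
    received += value.length;  // value is a Uint8Array chunk
}
console.log(`received ${received} bytes`);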
Stream from a path in Node.js
Creating a stream from a file path in Node.js is quite easy (you just pass the path to fs.createReadStream()). You can also switch from a Node stream to a Web stream with ease (e.g. Readable.toWeb()).
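For example, a small sketch (assuming a recent Node version, since Readable.toWeb landed around Node 17; 'data.bin' is a placeholder path):

// Create a Node stream from a file path, then convert it to a Web ReadableStream.
const fs = require('fs');
const { Readable } = require('stream');

const nodeStream = fs.createReadStream('data.bin'); // classic Node Readable
const webStream = Readable.toWeb(nodeStream);       // Web ReadableStream

const reader = webStream.getReader();
reader.read().then(({ done, value }) => {
    console.log('first chunk:', done ? '(empty stream)' : value.length + ' bytes');
});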
Problem: Stream from a user upload action
Can we get a stream directly from user file uploads in the browser? Can you give a simple example?
The point of it would be to process the files as the user is uploading them, not to use the Blob stream.
Is it possible to analyze user data as it is being uploaded, i.e. can it arrive as a stream?
I have a Node.js server running with Hapi.
One of the server's jobs is to send files to a service API (the API only accepts streams; when I send a buffer it returns an error) at the user's request.
All the files are stored in S3.
When I download them using promise(), I get a buffer in the body.
And I can get a PassThrough stream if I use createReadStream().
My problem is that when I try to convert the buffer to a stream and send it, the API rejects it, and the same happens when I use the createReadStream() result.
But when I use fs to save the file and then fs to read it back, the API accepts the stream and it works.
So I need help: how can I produce the same result without saving and reading the file?
Edit:
Here is my code. I know it's the wrong way, but it works; I need a better way that also works.
static async downloadFile(Bucket, Key) {
    const result = await s3Client
        .getObject({
            Bucket,
            Key
        })
        .promise();
    fs.writeFileSync(`${Path.basename(Key)}`, result.Body);
    const file = await fs.createReadStream(`${Path.basename(Key)}`);
    return file;
}
If I understand correctly, you want to get the object from the S3 bucket and stream it to your HTTP response as a stream.
Getting the data into buffers and then figuring out how to convert them to a stream can be complicated and has its limitations. If you really want to leverage the power of streams, don't convert to a buffer and load the entire object into memory: you can create a request that streams the returned data directly into a Node.js Stream object by calling the createReadStream method on the request.
Calling createReadStream returns the raw HTTP stream managed by the request. The raw data stream can then be piped into any Node.js Stream object.
This technique is useful for service calls that return raw data in their payload, such as calling getObject on an Amazon S3 service object to stream data directly into a file, as shown in this example.
// I imagine you have something similar.
server.get('/image', (req, res) => {
    let s3 = new AWS.S3({ apiVersion: '2006-03-01' });
    let params = { Bucket: 'myBucket', Key: 'myImageFile.jpg' };
    let readStream = s3.getObject(params).createReadStream();

    // When the stream is done being read, end the response
    readStream.on('close', () => {
        res.end();
    });
    readStream.pipe(res);
});
When you stream data from a request using createReadStream, only the raw HTTP data is returned. The SDK does not post-process the data; the raw HTTP data can be returned directly.
Note:
Because Node.js is unable to rewind most streams, if the request initially succeeds, retry logic is disabled for the rest of the response. In the event of a socket failure while streaming, the SDK won't attempt to retry or send more data to the stream. Your application logic needs to identify such streaming failures and handle them.
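A hedged sketch of what handling such a failure might look like in the example above (same placeholder bucket/key and the same res object):

// Surface streaming failures instead of letting the response hang.
let readStream = s3.getObject(params).createReadStream();

readStream.on('error', (err) => {
    // e.g. a socket failure mid-stream: the SDK will not retry, so end the response ourselves
    console.error('S3 stream failed:', err);
    if (!res.headersSent) res.statusCode = 500;
    res.end();
});

readStream.pipe(res);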
Edits:
After the edits to the original question, I can see that S3 returns a PassThrough stream object, which is different from a FileStream in Node.js. So to get around the problem, use memory (if your files are not very big and/or you have enough memory).
Use the package memfs; it will replace the native fs in your app:
https://www.npmjs.com/package/memfs
Install the package with npm install memfs and require it as follows:
const {fs} = require('memfs');
and your code will look like this:
static async downloadFile(Bucket, Key) {
    const result = await s3
        .getObject({
            Bucket,
            Key
        })
        .promise();
    fs.writeFileSync(`/${Key}`, result.Body);
    const file = await fs.createReadStream(`/${Key}`);
    return file;
}
Note that the only change I have made to your function is that I changed the path from ${Path.basename(Key)} to /${Key}, because now you don't need to know the path of your original filesystem; we are storing the files in memory. I have tested this and the solution works.
Introduction
Say that on the same local network we have two Node.js servers set up with Express: Server A for the API and Server F for the form.
Server A is an API server that takes requests and saves them to a MongoDB database (files are stored as a Buffer and their details as other fields).
Server F serves up a form, handles the form post and sends the form's data to Server A.
What is the most efficient way to send files between two Node.js servers where the receiving server is an Express API? And where does the file size matter?
1. HTTP Way
If the files I'm sending are PDF files (that won't exceed 50 MB), is it efficient to send the whole contents as a string over HTTP?
The algorithm is as follows:
Server F handles the file request using https://www.npmjs.com/package/multer and saves the file
then Server F reads this file and makes an HTTP request via https://github.com/request/request along with some details on the file
Server A receives this request, turns the file contents from string to Buffer and saves a record in MongoDB along with the file details.
In this algorithm, both Server A (when storing into MongoDB) and Server F (when sending it over to Server A) have read the file into memory, and the request between the two servers was about the same size as the file. (Are 50 MB requests alright?)
However, one thing to consider is that with this method I would be using the Express.js style of API for the whole process, and it would be consistent with the rest of the app, where the /list and /details requests are also defined in the routes. I like consistency.
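For reference, a minimal sketch of step 2 on Server F that streams the file instead of reading it into a string. The host, port, /files endpoint and the X-File-Details header are assumptions for illustration, not part of the original setup:

// Server F: forward the uploaded PDF to Server A as a streamed HTTP request.
const fs = require('fs');
const http = require('http');

function forwardToServerA(filePath, details, callback) {
    const req = http.request({
        host: 'server-a.local',  // placeholder host for Server A
        port: 3000,
        path: '/files',          // assumed endpoint on Server A
        method: 'POST',
        headers: {
            'Content-Type': 'application/pdf',
            'X-File-Details': JSON.stringify(details) // assumed way to pass metadata
        }
    }, (res) => callback(null, res));

    req.on('error', callback);
    // Stream the file: only one chunk at a time is held in memory on Server F.
    fs.createReadStream(filePath).pipe(req);
}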
2. Socket.IO Way
In contrast to this algorithm, I've explored the https://github.com/nkzawa/socket.io-stream way, which breaks away from the consistency of the HTTP API on Server A (the handlers for socket.io events are defined not in the routes but in the file that has var server = http.createServer(app);).
Server F handles the form data as such in routes/some_route.js:
router.post('/', multer({dest: './uploads/'}).single('file'), function (req, res) {
    var api_request = {};
    api_request.name = req.body.name;
    //add other fields to api_request ...
    var has_file = req.hasOwnProperty('file');
    var io = require('socket.io-client');
    var transaction_sent = false;
    var socket = io.connect('http://localhost:3000');
    socket.on('connect', function () {
        console.log("socket connected to 3000");
        if (transaction_sent === false) {
            var ss = require('socket.io-stream');
            var stream = ss.createStream();
            ss(socket).emit('transaction new', stream, api_request);
            if (has_file) {
                var fs = require('fs');
                var filename = req.file.destination + req.file.filename;
                console.log('sending with file: ', filename);
                fs.createReadStream(filename).pipe(stream);
            }
            if (!has_file) {
                console.log('sending without file.');
            }
            transaction_sent = true;
            //get the response via socket
            socket.on('transaction new sent', function (data) {
                console.log('response from 3000:', data);
                //there might be a better way to close socket. But this works.
                socket.close();
                console.log('Closed socket to 3000');
            });
        }
    });
});
I said I'd be dealing with PDF files that are < 50 MB. However, if I use this program to send larger files in the future, is socket.io a better way to handle 1 GB files since it uses streams?
This method does send the file and the details across, but I'm new to this library and don't know if it should be used for this purpose or if there is a better way of utilizing it.
Final thoughts
What alternative methods should I explore?
Should I send the file over SCP and make an HTTP request with the file details, including where I've sent it, thus separating the protocols of file transfer and API requests?
Should I always use streams because they don't store the whole file in memory? (That's how they work, right?)
What about https://github.com/liamks/Delivery.js?
References:
File/Data transfer between two node.js servers (this got me to try the socket-stream way)
Transfer files between two node.js servers over HTTP (for the HTTP way)
There are plenty of ways to achieve this, but not so many to do it right!
Socket.IO and WebSockets are efficient when you use them with a browser, but since you don't, there is no need for them.
The first method you can try is the built-in net module of Node.js; basically it will make a TCP connection between the servers and pass the data.
You should also keep in mind that you need to send chunks of data, not the entire file; the socket.write method of the net module seems to be a good fit for your case. Check it out: https://nodejs.org/api/net.html
But depending on the size of your files and the concurrency, memory consumption can be quite large.
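A rough sketch of that idea, piping a file over a raw TCP socket so it goes out chunk by chunk (host, port and file paths are placeholders):

// Sender: stream a file over a TCP connection (pipe calls socket.write per chunk).
var net = require('net');
var fs = require('fs');

var socket = net.connect(5000, 'server-a.local', function () {       // placeholder host/port
    fs.createReadStream('/var/www/httpdocs/report.pdf').pipe(socket); // placeholder file
});

// Receiver (on the other server): write incoming chunks straight to disk.
net.createServer(function (conn) {
    conn.pipe(fs.createWriteStream('/var/www/httpdocs/report.pdf'));
}).listen(5000);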
If you are running Linux on both servers, you could even send the files at ground zero with a simple Linux command called scp:
nohup scp -rpC /var/www/httpdocs/* remote_user@remote_domain.com:/var/www/httpdocs &
You can even do this from Windows to Linux or the other way around.
The scp client for Windows is pscp.exe: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Hope this helps!
I'm working on a Node.js application which loads images from a web service, generates thumbnails using ImageMagick, and saves the results back into this web service. The workflow is something like:
[Web Service] ==> HTTP Response Stream ==> ImageMagick ==> HTTP Request Stream ==> [Web Service]
I'm using the Node.js http module to retrieve the original image stream and save the thumbnail stream, and the imagemagick-stream module to generate the thumbnail without using any temporary file. Since I also need to send the thumbnail length, I used the memorystream module to hold the thumbnail in memory. So the workflow changed to:
[Web Service] ==> HTTP Response Stream ==> ImageMagick ==> memory stream ==> HTTP Request Stream ==> [Web Service]
The code works well on my laptop (Mac) and workstation (Windows), but it doesn't work on the Ubuntu server with one particular PNG image. That image works well on Mac and Windows, and it also works if I use the ImageMagick command line, so I don't think it's a bad image.
The symptom is that, when processing this image, my application keeps waiting to receive data from ImageMagick, which means it never fires an 'end' or 'error' event.
When I tried it on my Mac I could see ImageMagick emit 3 'readable' events, which means it processed 3 chunks of data. But on Ubuntu it only emitted 2 'readable' events and then my code just sat there.
I'm not sure why this happens. Since the image can be processed on Mac and Windows, and can be processed on Ubuntu from the command line, I don't think the problem is the image or ImageMagick. I strongly suspect there's something wrong with how I process the stream, but I have no idea what.
Some pseudo-code is below:
var http = require('http');
var im = require('imagemagick-stream');
var MemoryStream = require('memorystream');

// define ImageMagick parameters
var thumbnail = im().thumbnail(options.width + '>x' + options.height + '>').quality(options.quality);

// retrieve original image
http.request('my web service', function (res) {
    // generate thumbnail
    res.pipe(thumbnail);
    // pipe the thumbnail into a memory stream (paused) in order to retrieve its length
    var ms = new MemoryStream();
    ms.pause();
    thumbnail.pipe(ms);
    var length = ms._getQueueSize();
    // save back
    var req = http.request('my web service', length, function (res) {
        // ...
    });
    ms.pipe(req);
    ms.resume();
});
I'm testing streaming by creating a basic Node.js app that basically streams a file to the response, using code from here and here.
But if I make a request from http://127.0.0.1:8000/, then open another browser and request another file, the second file will not start to download until the first one has finished. In my example I created a 1 GB file: dd if=/dev/zero of=file.dat bs=1G count=1
But if I request three more files while the first one is downloading, the three files will start downloading simultaneously once the first file has finished.
How can I change the code so that it will respond to each request as it's made and not have to wait for the current download to finish?
var http = require('http');
var fs = require('fs');
var i = 1;

http.createServer(function (req, res) {
    console.log('starting #' + i++);

    // This line opens the file as a readable stream
    var readStream = fs.createReadStream('file.dat', { bufferSize: 64 * 1024 });

    // This will wait until we know the readable stream is actually valid before piping
    readStream.on('open', function () {
        console.log('open');
        // This just pipes the read stream to the response object (which goes to the client)
        readStream.pipe(res);
    });

    // This catches any errors that happen while creating the readable stream (usually invalid names)
    readStream.on('error', function (err) {
        res.end(err);
    });
}).listen(8000);

console.log('Server running at http://127.0.0.1:8000/');
Your code seems fine the way it is.
I checked it with node v0.10.3 by making a few requests in multiple terminal sessions:
$ wget http://127.0.0.1:8000
Two requests ran concurrently.
I get the same result when using two different browsers (i.e. Chrome & Safari).
Further, I can get concurrent downloads in Chrome by just changing the request url slightly, as in:
http://localhost:8000/foo
and
http://localhost:8000/bar
The behavior you describe seems to manifest when making multiple requests from the same browser for the same url.
This may be a browser limitation - it looks like the second request isn't even made until the first is completed or cancelled.
To answer your question, if you need multiple client downloads in a browser:
Ensure that your server code is implemented such that the file-to-URL mapping is 1-to-many (i.e. using a wildcard).
Ensure your client code (i.e. JavaScript in the browser) uses a different URL for each request, as in the sketch below.
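For example, a hedged client-side sketch that gives each download a unique URL (the server above already ignores the path and always streams file.dat):

// Trigger several downloads from the same page without the browser serialising them.
for (var n = 0; n < 3; n++) {
    var a = document.createElement('a');
    a.href = 'http://127.0.0.1:8000/file-' + n + '-' + Date.now(); // unique per request
    a.download = 'file.dat';
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
}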