Using the code below I can get the size in bytes of my (stream) file:
let bytes_copied = tokio::io::copy(&mut stream, &mut file).await?;
Is there a way to get the size of the (stream) file using rust-s3 methods (https://github.com/durch/rust-s3)?
bucket.put_object_stream(&mut stream, filename).await?;
Or is there a way to "count" the bytes as they "pass" through this "stream tunnel"?
Related
I'm making requests for files using surf and I want to save the response body to a file. These files are too large to hold in memory, so I need to stream them to the file.
Surf looks like it supports this, but I have not found a way to save the result to a file.
My attempt looks like this:
let mut result = surf::get(&link)
.await
.map_err(|err| anyhow!(err))
.context(format!("Failed to fetch from {}", &link))?;
let body = surf::http::Body::from_reader(result, None);
let mut image_tempfile: File = tempfile::tempfile()?;
image_tempfile.write(body);
but this does not work, as write() expects a &[u8], which I believe would require reading the whole body into memory. Is there any way I can write the content of the surf Body to a file without holding it all in memory?
Surf's Response implements AsyncRead (a.k.a. async_std::io::Read), and if you convert your temporary file into something that implements AsyncWrite (like async_std::fs::File), you can use async_std::io::copy to move bytes from one to the other asynchronously, without buffering the whole file:
let mut response = surf::get("https://www.google.com").await.unwrap();
let mut tempfile = async_std::fs::File::from(tempfile::tempfile().unwrap());
async_std::io::copy(&mut response, &mut tempfile).await.unwrap();
Description
I have a very large CSV file (around 1 GB) which I want to process in byte chunks of around 10 MB each. For this purpose, I am creating a readable stream with a byte-range option: fs.createReadStream(sampleCSVfile, { start: 0, end: 10000000 })
Problem
Using the above approach, the chunk read from the CSV file ends with a last line that is not complete. I want a way to identify the byte index at which the last line break occurred and start my next readable stream from that byte index.
Example CSV: (ignore header row)
John,New York,52
Stacy,Chicago,19
Lisa,Indianapolis,40
Sample Operation:
fs.createReadStream(sampleCSVfile, { start: 0, end: 99 })
Data Returned: (trimmed to above-specified byte-range)
John,New York,52
Stacy,Chicago,19
Lisa,I
Required or Expected:
John,New York,52
Stacy,Chicago,19
So, suppose the last newline in the fetched chunk ended at byte index 78; my next recursive operation would then be: fs.createReadStream(sampleCSVfile, { start: 79, end: 178 })
Below is my basic code:
const fs = require('fs');
let stream = fs.createReadStream('test.csv', { start: 0, end: 40 });
stream.on('data', (data) => {
  console.log(data.length); // number of bytes in this chunk
  let a = data.toString();
  console.log(a);
  let i = a.lastIndexOf('\n'); // position of the last line break
  console.log(i);
  let substr = a.substring(0, i); // drop the trailing partial line
  console.log(substr);
  let byteLength = Buffer.byteLength(substr); // byte length of the complete lines
  console.log(byteLength);
});
DEMO: https://repl.it/#sandeepp2016/SpiritedRowdyObject
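For what it's worth, here is a minimal sketch of the recursive read described above (my own, not from the demo); the chunk size, file name, and processLines() handler are placeholders, and it assumes no multi-byte characters are split across chunk boundaries:
const fs = require('fs');
const CHUNK_SIZE = 100; // placeholder; ~10 MB in the real case

function readChunk(start) {
  const stream = fs.createReadStream('test.csv', { start: start, end: start + CHUNK_SIZE - 1 });
  let raw = '';
  stream.on('data', (data) => { raw += data.toString(); });
  stream.on('end', () => {
    const i = raw.lastIndexOf('\n');
    if (i === -1) return; // no complete line left (or past end of file), stop
    // note: a final line without a trailing newline would need special handling
    const complete = raw.substring(0, i); // whole lines only
    processLines(complete); // hypothetical handler for the complete lines
    const nextStart = start + Buffer.byteLength(complete) + 1; // resume just after the newline
    readChunk(nextStart);
  });
}

readChunk(0);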
But there are already CSV parsers like fast-csv, or you can use the readline module, which will let you read the stream of data line by line more efficiently.
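As a rough sketch of the readline approach (the file name is a placeholder), which sidesteps the byte-boundary problem entirely because readline only ever hands you complete lines:
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('test.csv'),
  crlfDelay: Infinity, // treat \r\n as a single line break
});

rl.on('line', (line) => {
  console.log(line); // each complete CSV line, never a partial one
});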
I have a function that takes in a stream:
function processStream(stream) {
}
Other things process this stream after the function, so it needs to be left intact. This function only needs the first 20 bytes of a stream that could be gigabytes long in order to complete its processing. I can get this via:
function processStream(stream) {
const data = stream.read(20)
return stream
}
But by consuming those 20 bytes we've changed the stream for future functions, so we have to recombine it. What's the fastest way to do this?
In the end I went with combined-stream2 to combine my streams quickly and efficiently:
const CombinedStream = require('combined-stream2')
async function processStream(stream) {
  const bytes = stream.read(2048)
  const original = CombinedStream.create()
  original.append(bytes)
  if (!stream._readableState.ended) {
    original.append(stream)
  }
  return original
}
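An alternative worth noting (my addition, not part of the original answer) is readable.unshift(), which pushes the peeked bytes back onto the same stream's internal buffer, so no second stream object is needed; it has to be called before the stream has ended:
function processStream(stream) {
  const bytes = stream.read(2048);
  if (bytes !== null) {
    // ... inspect the first bytes here ...
    stream.unshift(bytes); // put them back so downstream consumers see the full stream
  }
  return stream;
}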
All of the examples of stream creation I have encountered are centered around files. I am working with an interface that requires me to pipe a read stream to a write stream. My input is raw bytes I have in memory, not a file.
https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
How do I accomplish the above by passing in 'raw bytes' instead of a file descriptor?
This is what I got working (from How to create streams from string in Node.Js?):
function streamFromString(raw) {
  const Readable = require('stream').Readable;
  const s = new Readable();
  s._read = function noop() {}; // avoid "_read() is not implemented" errors
  s.push(raw); // emit the in-memory bytes as a single chunk
  s.push(null); // signal end of stream
  return s;
}
I have a file in a binary format:
The format is as follows:
[4 header bytes] [8 bytes - an int64 giving how many bytes follow] [a variable number of bytes (as given by the int64) - the actual information to read]
And then it repeats, so I must first read the first 12 bytes to determine how many more bytes I need to read.
I have tried:
var readStream = fs.createReadStream('/path/to/file.bin');
readStream.on('data', function(chunk) { ... })
The problem I have is that chunk always comes back in chunks of 65536 bytes at a time whereas I need to be more specific on the number of bytes that I am reading.
I have also tried readStream.on('readable', function() { readStream.read(4) })
But that is also not very flexible, because it seems to turn asynchronous code into synchronous code, since I have to put the 'reading' in a while loop.
Or maybe readStream is not appropriate in this case and I should use fs.read(fd, buffer, offset, length, position, callback) instead?
Here's what I'd recommend as an abstract handler of a readStream to process abstract data like you're describing:
var pending = Buffer.alloc(9999999);
var cursor = 0;
stream.on('data', function(d) {
  d.copy(pending, cursor);
  cursor += d.length;
  var test = attemptToParse(pending.slice(0, cursor));
  while (test !== false) {
    // test is a valid blob of data
    processTheThing(test);
    var rawSize = test.raw.length; // How many bytes of data did the blob actually take up?
    pending.copy(pending, 0, rawSize, cursor); // Copy the data after the valid blob to the beginning of the pending buffer
    cursor -= rawSize;
    test = attemptToParse(pending.slice(0, cursor)); // Is there more than one valid blob of data in this chunk? Keep processing if so
  }
});
For your use case, ensure the initialized size of the pending Buffer is large enough to hold the largest possible valid blob of data you'll be parsing (you mention an int64; that maximum size plus the header size), plus an extra 65536 bytes in case a blob boundary falls right on the edge of a stream chunk.
My approach requires an attemptToParse() function that takes a buffer and tries to parse the data out of it. It should return false if the buffer is too short (the data hasn't all arrived yet). If it is a valid object, it should return some parsed object that has a way to report the raw bytes it took up (the .raw property in my example). Then you do whatever processing you need with the blob (processTheThing()), trim that valid blob of data out, shift the pending Buffer down to just the remainder, and keep going. That way you don't have a constantly growing pending buffer or some array of "finished" blobs. Maybe the process on the receiving end of processTheThing() keeps an array of the blobs in memory, maybe it writes them to a database; but in this example that's abstracted away, so this code just deals with how to handle the stream data.
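As a rough sketch (my own, with assumed details) of what such an attemptToParse() could look like for the format in the question (4 header bytes, an 8-byte int64 length, then the payload), assuming big-endian byte order and Node 12+ for readBigInt64BE:
function attemptToParse(buf) {
  if (buf.length < 12) return false; // header + length field not buffered yet
  const header = buf.slice(0, 4);
  const payloadLength = Number(buf.readBigInt64BE(4)); // use readBigInt64LE if the file is little-endian
  const total = 12 + payloadLength;
  if (buf.length < total) return false; // payload not fully buffered yet
  return {
    header: header,
    payload: buf.slice(12, total),
    raw: buf.slice(0, total), // lets the caller know how many bytes to trim
  };
}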
Add the chunk to a Buffer and then parse the data from there, being careful not to go beyond the end of the buffer (if your data is large). I'm using my tablet right now so I can't add any example source code. Maybe somebody else can?
Ok, mini source, very skeletal.
var chunks = [];
var bytesRead = 0;
stream.on('data', function(chunk) {
  chunks.push(chunk);
  bytesRead += chunk.length;
  // look at bytesRead...
  var buffer = Buffer.concat(chunks);
  chunks = [buffer]; // trick for next event
  // --> or, if memory is an issue, remove completed data from the beginning of chunks
  // work with the buffer here...
});
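As a small illustration of that last comment (my own addition, reusing attemptToParse() and processTheThing() from the answer above), you could drop fully parsed records instead of re-concatenating everything each time:
function drainCompleteRecords(buffer) {
  let record = attemptToParse(buffer); // returns false while a record is incomplete
  while (record !== false) {
    processTheThing(record);
    buffer = buffer.slice(record.raw.length); // drop the bytes just consumed
    record = attemptToParse(buffer);
  }
  return buffer; // remainder to carry over, e.g. chunks = [remainder]
}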