Libssh2 stops reading large file - Rust

I need to read some files via SSH, but when I loop over the file for reading... it stops.
In my Cargo.toml I have this dependency:
[dependencies]
ssh2 = "0.8"
It depends on libssh2.
My real code is more complex, but I managed to reproduce the problem with this quick-and-dirty code:
use std::fs::File;
use std::io::Read;
use std::io::Write;
#[allow(unused_imports)]
use std::io::Error;
use std::net::TcpStream;
use ssh2::Session;
use std::path::Path;
use std::io::prelude::*;

fn main() {
    println!("Parto!");
    let tcp = TcpStream::connect("myhost:22").unwrap();
    let mut sess = Session::new().unwrap();
    sess.set_tcp_stream(tcp);
    sess.handshake().unwrap();
    sess.userauth_password("user", "password").unwrap();
    assert!(sess.authenticated());
    let (mut remote_file, stat) = sess.scp_recv(Path::new("/home/pi/dbeaver.exe")).unwrap();
    println!("remote file size: {}", stat.size());
    let mut buf: Vec<u8> = vec![0; 1000];
    loop {
        let read_bytes = remote_file.read(&mut buf).unwrap();
        println!("Read bytes: {}.", read_bytes);
        if read_bytes < 1000 {
            break;
        }
    } // ending loop
}
I tested on both Windows 10 and Debian Linux. The problem always appears when the file is larger than a few KB; this read:
let read_bytes = remote_file.read(&mut buf).unwrap();
reads less than the buffer size (but the file has not ended).
Tested with both binary and ASCII files.
No errors, it just stops reading. Odd thing: it sometimes stops at 8 MB, sometimes at 16 KB, with the same file and host involved...
Where do I need to dig, or what do I need to check?

Quoting from the documentation for the io::Read trait (which ssh2 implements):
It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e.g. being close to end-of-file) or because read() was interrupted by a signal.
Since you are reading remotely, this may simply mean that the remaining bytes are still in transit in the network and not yet available on the target computer.
If you are sure that there is enough incoming data to fill the buffer, you can use read_exact, but note that this will return an error if the available data is shorter than the buffer.
Edit: or even better, use read_to_end, which doesn't require knowing the size beforehand (I don't know why I couldn't see it yesterday when I wrote the original answer).
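For illustration, here is a minimal sketch of the question's loop rewritten around read_to_end (the same session, host, and remote path as above are assumed):
let (mut remote_file, stat) = sess.scp_recv(Path::new("/home/pi/dbeaver.exe")).unwrap();
println!("remote file size: {}", stat.size());

// read_to_end keeps calling read() until it returns Ok(0), so short reads
// caused by data still being in transit are handled transparently.
let mut contents = Vec::new();
remote_file.read_to_end(&mut contents).unwrap();
println!("read {} bytes in total", contents.len());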

Related

Detecting EOF without 0-byte read in Rust

I've been working on some code that reads data from a Read type (the input) in chunks and does some processing on each chunk. The issue is that the final chunk needs to be processed with a different function. As far as I can tell, there are a couple of ways to detect EOF from a Read, but none of them feel particularly ergonomic for this case. I'm looking for a more idiomatic solution.
My current approach is to maintain two buffers, so that the previous read result can be kept around if the next read reads zero bytes, which indicates EOF in this case, since the buffer is of non-zero length:
use std::io::{Read, Result};

const BUF_SIZE: usize = 0x1000;

fn process_stream<I: Read>(mut input: I) -> Result<()> {
    // Stores a chunk of input to be processed
    let mut buf = [0; BUF_SIZE];
    let mut prev_buf = [0; BUF_SIZE];
    let mut prev_read = input.read(&mut prev_buf)?;
    loop {
        let bytes_read = input.read(&mut buf)?;
        if bytes_read == 0 {
            break;
        }
        // Some function which processes the contents of a chunk
        process_chunk(&prev_buf[..prev_read]);
        prev_read = bytes_read;
        prev_buf.copy_from_slice(&buf[..]);
    }
    // Some function used to process the final chunk differently from all other chunks
    process_final_chunk(&prev_buf[..prev_read]);
    Ok(())
}
This strikes me as a very ugly way to do this; I shouldn't need to use two buffers here.
An alternative I can think of would be to impose Seek on input and use input.read_exact(). I could then check for an UnexpectedEof ErrorKind to determine that we've hit the end of input, and seek backwards to read the final chunk again (the seek and re-read is necessary here because the contents of the buffer are undefined in the case of an UnexpectedEof error). But this doesn't seem idiomatic at all: encountering an error, seeking back, and reading again just to detect we're at the end of a file is very strange.
My ideal solution would be something like this, using an imaginary input.feof() function that returns true if the last input.read() call reached EOF, like the feof function in C's stdio:
fn process_stream<I: Read>(mut input: I) -> Result<()> {
    // Stores a chunk of input to be processed
    let mut buf = [0; BUF_SIZE];
    let mut bytes_read = 0;
    loop {
        bytes_read = input.read(&mut buf)?;
        if input.feof() {
            break;
        }
        process_chunk(&buf[..bytes_read]);
    }
    process_final_chunk(&buf[..bytes_read]);
    Ok(())
}
Can anyone suggest a way to implement this that is more idiomatic? Thanks!
When read of std::io::Read returns Ok(n), not only does that mean that the buffer buf has been filled in with n bytes of data from this source, but it also indicates that the bytes at index n and beyond are left untouched. With this in mind, you actually don't need a prev_buf at all, because when n is 0, all bytes of the buffer are left untouched (leaving them as the bytes from the previous read).
prog-fh's solution is what you want to go with for the kind of processing you want to do, because it will only hand off full chunks to process_chunk. With read potentially returning a value between 0 and BUF_SIZE, this is needed. For more info, see this part of the Read documentation quoted above:
It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e.g. being close to end-of-file) or because read() was interrupted by a signal.
However, I advise that you think about what should happen when you get a Ok(0) from read that does not represent end of file forever. See this part:
If n is 0, then it can indicate one of two scenarios:
This reader has reached its “end of file” and will likely no longer be able to produce bytes. Note that this does not mean that the reader will always no longer be able to produce bytes.
So if you were to get a sequence of reads that returned Ok(BUF_SIZE), Ok(BUF_SIZE), Ok(0), Ok(BUF_SIZE) (which is entirely possible; it just represents a hitch in the IO), would you want to not consider the last Ok(BUF_SIZE) as a read chunk? If you treat Ok(0) as EOF forever, that may be a mistake here.
The only way to reliably determine what should be considered as the last chunk is to have the expected length (in bytes, not # of chunks) sent beforehand as part of the protocol. Given a variable expected_len, you could then determine the start index of the last chunk through expected_len - expected_len % BUF_SIZE, and the end index just being expected_len itself.
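As a small worked example of that arithmetic (the value of expected_len is just an assumed, protocol-provided number):
const BUF_SIZE: usize = 0x1000; // 4096
let expected_len: usize = 10_240; // assumed to have been sent by the peer beforehand
let last_chunk_start = expected_len - expected_len % BUF_SIZE; // 8192
let last_chunk_len = expected_len - last_chunk_start; // 2048 bytes in the final chunk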
Since you consider read_exact() a possible solution, we can assume that a non-final chunk contains exactly BUF_SIZE bytes.
Then why not just read as much as we can to fill such a buffer and process it with one function, and then, when that is no longer possible (because EOF is reached), process the incomplete last chunk with another function?
Note that feof() in C does not guess that EOF will be reached on the next read attempt; it just reports the EOF flag that could have been set during the previous read attempt.
Thus, for EOF to be set and feof() to report it, a read attempt returning 0 must have been encountered first (as in the example below).
use std::fs::File;
use std::io::{Read, Result};

const BUF_SIZE: usize = 0x1000;

fn process_chunk(bytes: &[u8]) {
    println!("process_chunk {}", bytes.len());
}

fn process_final_chunk(bytes: &[u8]) {
    println!("process_final_chunk {}", bytes.len());
}

fn process_stream<I: Read>(mut input: I) -> Result<()> {
    // Stores a chunk of input to be processed
    let mut buf = [0; BUF_SIZE];
    loop {
        let mut bytes_read = 0;
        while bytes_read < BUF_SIZE {
            let r = input.read(&mut buf[bytes_read..])?;
            if r == 0 {
                break;
            }
            bytes_read += r;
        }
        if bytes_read == BUF_SIZE {
            process_chunk(&buf);
        } else {
            process_final_chunk(&buf[..bytes_read]);
            break;
        }
    }
    Ok(())
}

fn main() {
    let file = File::open("data.bin").unwrap();
    process_stream(file).unwrap();
}
/*
$ dd if=/dev/random of=data.bin bs=1024 count=10
$ cargo run
process_chunk 4096
process_chunk 4096
process_final_chunk 2048
$ dd if=/dev/random of=data.bin bs=1024 count=8
$ cargo run
process_chunk 4096
process_chunk 4096
process_final_chunk 0
*/

Why does Rust's read_line function use a mutable reference instead of a return value?

Consider this code to read user input in Rust:
use std::io;

fn main() {
    let mut input = String::new();
    io::stdin()
        .read_line(&mut input)
        .expect("error: unable to read user input");
    println!("{}", input);
}
Why is there no way to do it like this?
use std::io;

fn main() {
    let mut input = io::stdin()
        .read_line()
        .expect("error: unable to read user input");
    println!("{}", input);
}
It would be more convenient, and closer to how other languages do it.
TL;DR The closest you have is lines(), and the reason read_line works like it does is efficiency. The version that uses lines() looks like this:
use std::io::{self, BufRead};

fn main() {
    // get the next line from stdin and print it
    let input = io::stdin().lock().lines().next().unwrap().expect("IO error");
    println!("{}", input);
}
In general, read_line() is not designed for use in small interactive programs; there are better ways to implement those.
The read_line method comes from the generic io::BufRead trait, and its primary use is reading input, typically redirected from files or other programs, and possibly coming in large quantities. When processing large amounts of data, it is advantageous to minimize the number of allocations performed, which is why read_line is designed to reuse an existing string. A typical pattern would be:
let mut line = String::new();
while input.read_line(&mut line)? != 0 {
    // do something with line
    ...
    line.clear();
}
The number of (re-)allocations is kept minimal, as line will grow only as needed to accommodate the input lines. Once a typical size is reached, allocations will become very rare, and once the largest line is read, they will disappear altogether. If read_line() supported the "convenient" interface, then the above loop would indeed look nicer - for example:
while let Some(line) = read_new_line(some_input)? {
    // process the line
    ...
}
...but would require a new allocation and deallocation for each line. In throw-away or learning programs this can be perfectly fine, but BufRead is intended as a building block for efficient IO, so its read_line method favors performance over convenience.
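For comparison, here is a sketch of what such a read_new_line helper might look like; read_new_line is not part of std, and the sketch only exists to make the per-line allocation visible:
use std::io::{BufRead, Result};

// Hypothetical convenience wrapper: it allocates a fresh String on every call,
// which is exactly the per-line cost that the std read_line API lets you avoid.
fn read_new_line<R: BufRead>(input: &mut R) -> Result<Option<String>> {
    let mut line = String::new();
    if input.read_line(&mut line)? == 0 {
        Ok(None)
    } else {
        Ok(Some(line))
    }
}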

How can I write all of stdin to stdout without using lots of memory?

I want to write a Rust program that takes everything in stdin and copies it to stdout. So far I have this:
use std::io::{self, Read};

fn main() {
    let mut stdin: io::Stdin = io::stdin();
    let mut stdout: io::Stdout = io::stdout();
    let mut buffer: [u8; 1_000_000] = [0; 1_000_000];
    let mut n_bytes_read: usize = 0;
    let mut uninitialized: bool = true;
    while uninitialized || n_bytes_read > 0 {
        n_bytes_read = stdin.read(&mut buffer).expect("Could not read from STDIN.");
        uninitialized = false;
    }
}
I'm copying everything into a buffer of size one million so as not to blow up the memory if someone feeds my program a 3 gigabyte file. So now I want to copy this to stdout, but the only primitive write operation I can find is stdout.write(&mut buffer) - but this writes the whole buffer! I would need a way to write a specific number of bytes, like stdout.write_only(&mut buffer, n_bytes_read).
I'd like to do this in the most basic way possible, with a minimum of standard library imports.
If all you wanted to do was copy from stdin to stdout without using much memory, just use std::io::copy. It streams the data from a reader to a writer.
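A minimal sketch of that approach (binding and locking the handles first is optional; it just avoids re-locking on every chunk):
use std::io;

fn main() {
    let stdin = io::stdin();
    let stdout = io::stdout();
    // io::copy streams from the reader to the writer in fixed-size chunks,
    // so memory use stays bounded regardless of the input size.
    io::copy(&mut stdin.lock(), &mut stdout.lock()).expect("copy failed");
}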
If your goal is to write part of a buffer, then take a slice of that buffer and pass that to write:
stdout.write(&buffer[0..n_bytes_read]);
A slice does not copy the data, so you will not use any more memory.
Note, however, that write may not write everything you asked it to; it returns the number of bytes actually written. If you use write_all, it will write the whole slice.
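Putting that together, a sketch of the loop from the question with both fixes applied (a slice of the buffer, written with write_all):
use std::io::{self, Read, Write};

fn main() {
    let mut stdin = io::stdin();
    let mut stdout = io::stdout();
    let mut buffer = vec![0u8; 1_000_000];
    loop {
        let n_bytes_read = stdin.read(&mut buffer).expect("Could not read from STDIN.");
        if n_bytes_read == 0 {
            break; // read returned 0 bytes: EOF
        }
        // write_all retries until the whole slice has been written
        stdout.write_all(&buffer[..n_bytes_read]).expect("Could not write to STDOUT.");
    }
}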

Read an arbitrary number of bytes from type implementing Read

I have something that is Read; currently it's a File. I want to read a number of bytes from it that is only known at runtime (length prefix in a binary data structure).
So I tried this:
let mut vec = Vec::with_capacity(length);
let count = file.read(vec.as_mut_slice()).unwrap();
but count is zero because vec.as_mut_slice().len() is zero as well.
[0u8;length] of course doesn't work because the size must be known at compile time.
I wanted to do
let mut vec = Vec::with_capacity(length);
let count = file.take(length).read_to_end(vec).unwrap();
but take's receiver parameter is a T and I only have &mut T (and I'm not really sure why it's needed anyway).
I guess I can replace File with BufReader and dance around with fill_buf and consume, which sounds complicated enough, but I still wonder: have I overlooked something?
Like the Iterator adaptors, the IO adaptors take self by value to be as efficient as possible. Also like the Iterator adaptors, a mutable reference to a Read is also a Read.
To solve your problem, you just need Read::by_ref:
use std::io::Read;
use std::fs::File;

fn main() {
    let mut file = File::open("/etc/hosts").unwrap();
    let length = 5;
    let mut vec = Vec::with_capacity(length);
    file.by_ref().take(length as u64).read_to_end(&mut vec).unwrap();

    let mut the_rest = Vec::new();
    file.read_to_end(&mut the_rest).unwrap();
}
1. Fill-this-vector version
Your first solution is close to working. You identified the problem but did not try to solve it! The problem is that whatever the capacity of the vector, it is still empty (vec.len() == 0). Instead, you could actually fill it with zeroed elements, such as:
let mut vec = vec![0u8; length];
The following full code works:
#![feature(convert)] // needed for `as_mut_slice()` as of 2015-07-19

use std::fs::File;
use std::io::Read;

fn main() {
    let mut file = File::open("/usr/share/dict/words").unwrap();
    let length: usize = 100;
    let mut vec = vec![0u8; length];
    let count = file.read(vec.as_mut_slice()).unwrap();
    println!("read {} bytes.", count);
    println!("vec = {:?}", vec);
}
Of course, you still have to check whether count == length, and read more data into the buffer if that's not the case.
2. Iterator version
Your second solution is better because you won't have to check how many bytes have been read, and you won't have to re-read in case count != length. You need to use the bytes() function on the Read trait (implemented by File). This transforms the file into a stream (i.e. an iterator). Because errors can still happen, you don't get an Iterator<Item=u8> but an Iterator<Item=Result<u8, R::Err>>. Hence you need to deal with failures explicitly within the iterator. We're going to use unwrap() here for simplicity:
use std::fs::File;
use std::io::Read;

fn main() {
    let file = File::open("/usr/share/dict/words").unwrap();
    let length: usize = 100;
    let vec: Vec<u8> = file
        .bytes()
        .take(length)
        .map(|r: Result<u8, _>| r.unwrap()) // or deal explicitly with failure!
        .collect();
    println!("vec = {:?}", vec);
}
You can always use a bit of unsafe to create a vector of uninitialized memory. It is perfectly safe to do with primitive types:
let mut v: Vec<u8> = Vec::with_capacity(length);
unsafe { v.set_len(length); }
let count = file.read(v.as_mut_slice()).unwrap();
This way, the vector's len() will be set to its capacity, and all bytes in it will be uninitialized (likely zeros, but possibly some garbage). This way you can avoid zeroing the memory, which is pretty safe for primitive types.
Note that the read() method on Read is not guaranteed to fill the whole slice; it is possible for it to return with a number of bytes less than the slice length. There are several RFCs on adding methods to fill this gap, for example this one.
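For completeness, on current Rust that gap is covered by Read::read_exact, which fills the whole slice or fails with an UnexpectedEof error; a minimal sketch using the same file path and length as above:
use std::fs::File;
use std::io::Read;

fn main() {
    let mut file = File::open("/usr/share/dict/words").unwrap();
    let length: usize = 100;
    let mut vec = vec![0u8; length];
    // read_exact keeps reading until the slice is full, or fails with
    // ErrorKind::UnexpectedEof if the file holds fewer than `length` bytes.
    file.read_exact(&mut vec).unwrap();
    println!("vec = {:?}", vec);
}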

Convert image to bytes and then write to new file

I'm trying to take an image that is converted into a vector of bytes and write those bytes to a new file. The first part is working, and my code is compiling, but the new file that is created ends up empty (nothing is written to it). What am I missing?
Is there a cleaner way to convert Vec<u8> into &[u8] so that it can be written? The way I'm currently doing it seems kind of ridiculous...
use std::os;
use std::io::BufferedReader;
use std::io::File;
use std::io::BufferedWriter;

fn get_file_buffer(path_str: String) -> Vec<u8> {
    let path = Path::new(path_str.as_bytes());
    let file = File::open(&path);
    let mut reader = BufferedReader::new(file);
    match reader.read_to_end() {
        Ok(x) => x,
        Err(_) => vec![0],
    }
}

fn main() {
    let file = get_file_buffer(os::args()[1].clone());
    let mut new_file = File::create(&Path::new("foo.png")).unwrap();
    let mut writer = BufferedWriter::new(new_file);
    writer.write(String::from_utf8(file).unwrap().as_bytes()).unwrap();
    writer.flush().unwrap();
}
Given a Vec<T>, you can get a &[T] out of it in two ways:
Take a reference to a dereference of it, i.e. &*file; this works because Vec<T> implements Deref<Target = [T]>, so *file is effectively of type [T] (though using *file on its own, without borrowing it, is not legal).
Call the as_slice() method.
As the BufWriter docs say, “the buffer will be written out when the writer is dropped”, so that writer.flush().unwrap() is not strictly necessary, serving only to make handling of errors explicit.
But as for the behaviour you describe, I mostly do not observe it. So long as you do not encounter any I/O errors, the version not using the String dance will work fine, while the version with the String dance will panic if the input data is not legal UTF-8 (which, if you're dealing with images, it probably won't be). String::from_utf8 returns an Err in such cases, and unwrapping that panics.
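For reference, here is a sketch of the same copy without the String dance on present-day Rust, using the std::fs convenience functions (the file names are the ones from the question):
use std::env;
use std::fs;

fn main() {
    // Read the input file named on the command line and write its raw bytes to foo.png.
    let input_path = env::args().nth(1).expect("missing input path");
    let bytes: Vec<u8> = fs::read(&input_path).unwrap();
    // fs::write accepts anything that is AsRef<[u8]>, so &Vec<u8> (or bytes itself) works directly.
    fs::write("foo.png", &bytes).unwrap();
}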
