Rust appending unicode escape before creating new struct

I am reading from a file and then creating a struct from those values.
let file = File::open(self.get_state_name().add(".sky")).unwrap();
let reader = BufReader::new(file);
for line in reader.lines() {
let line = line.unwrap();
let key_value = line.split("`").collect::<Vec<&str>>();
let key = key_value[0].to_string();
let data = key_value[1].to_string();
self.set(key, data);
}
The set function creates a new struct named Model:
let model = Model::new(key, data);
The new function just returns a Model:
pub fn new(key: String, data: String) -> Model {
Model { key, data }
}
The value of key comes out prefixed with unicode escapes, like:
Model {
key: "\u{f}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}Test",
data: "Test Data",
},
Update:
I tried keeping only ASCII characters:
pub fn new(key: String, data: String) -> Model {
let key = key.replace(|c: char| !c.is_ascii(), "");
println!("Key: {}", key);
Model { key, data }
}
Update:
Here is how the file is saved:
let mut file = File::create(name.add(".sky")).unwrap();
for i in &data {
let data = i.to_string();
let bytes = bincode::serialize(&data).unwrap();
file.write_all(&bytes).expect("Unable to write to file");
}
The .to_string() method on the struct:
pub(crate) fn to_string(&self) -> String {
format!("{}`{}\n", self.key, self.data)
}
Here key has no unicode escapes; they appear at the Model { key, data } line.
The same doesn't happen when I set the value directly instead of reading it from the file.
How do I remove this, and why is it happening?

You are writing the data with bincode::serialize but reading it back not with bincode::deserialize but with a plain BufReader.
In order to properly serialize the string in a binary fashion, the encoder adds additional information about the data it stores. In bincode's case that is an 8-byte little-endian length prefix: \u{f}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0}\u{0} is the number 15, the byte length of the serialized line "Test`Test Data\n".
If you know that only strings compatible with BufReader::lines will be processed, you can instead write the raw bytes with str::as_bytes. Note that this will cause problems for some inputs, notably ones containing newline characters (or your ` separator).
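For example, a minimal sketch of the writing side using raw bytes instead of bincode, reusing name and data from your saving code (and assuming keys and values never contain ` or newlines):
use std::fs::File;
use std::io::Write;
use std::ops::Add;

let mut file = File::create(name.add(".sky")).unwrap();
for i in &data {
    // Writes the plain "key`data\n" line produced by to_string();
    // unlike bincode::serialize, no u64 length prefix is prepended.
    file.write_all(i.to_string().as_bytes())
        .expect("Unable to write to file");
}
Reading it back with BufReader::lines and split("`") then works as in your loading code, with no bincode involved on either side.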


Rust with Datafusion - Trying to Write DataFrame to Json

Repo with WIP code: https://github.com/jmelm93/rust-datafusion-csv-processing
I started programming with Rust 2 days ago and have been trying to resolve this since about 3 hours into trying out Rust...
Any help would be appreciated.
My goal is to write a Dataframe from Datafusion to JSON (which will eventually be used to respond to HTTP requests in an API with the JSON string).
The DataFrame turns into a datafusion::arrow::record_batch::RecordBatch when you collect the data, and this data type is what I'm having trouble converting.
I've tried -
Using json::writer::record_batches_to_json_rows from Arrow, but it won't let me due to "struct datafusion::arrow::record_batch::RecordBatch and struct arrow::record_batch::RecordBatch have similar names, but are actually distinct types". Haven't been able to successfully convert the types to avoid this.
I tried turning the RecordBatch into a vec and pulling out the headers and the values individually. I was able to get the headers out, but haven't had success with the values.
let mut header = Vec::new();
// let mut rows = Vec::new();
for record_batch in &record_batches {
// get data
println!("record_batch.columns: : {:?}", record_batch.columns());
for col in record_batch.columns() {
for row in 0..col.len() {
// println!("Cow: {:?}", col);
// println!("Row: {:?}", row);
// let value = col.as_any().downcast_ref::<StringArray>().unwrap().value(row);
// rows.push(value);
}
}
// get headers
for field in record_batch.schema().fields() {
header.push(field.name().to_string());
}
};
Anyone know how to accomplish this?
The full script is below:
// datafusion examples: https://github.com/apache/arrow-datafusion/tree/master/datafusion-examples/examples
// datafusion docs: https://arrow.apache.org/datafusion/
use datafusion::prelude::*;
use datafusion::arrow::datatypes::{Schema};
use arrow::json;
// use serde::{ Deserialize };
use serde_json::to_string;
use std::sync::Arc;
use std::str;
use std::fs;
use std::ops::Deref;
type DFResult = Result<Arc<DataFrame>, datafusion::error::DataFusionError>;
struct FinalObject {
schema: Schema,
// columns: Vec<Column>,
num_rows: usize,
num_columns: usize,
}
// to allow debug logging for FinalObject
impl std::fmt::Debug for FinalObject {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
// write!(f, "FinalObject {{ schema: {:?}, columns: {:?}, num_rows: {:?}, num_columns: {:?} }}",
write!(f, "FinalObject {{ schema: {:?}, num_rows: {:?}, num_columns: {:?} }}",
// self.schema, self.columns, self.num_columns, self.num_rows)
self.schema, self.num_columns, self.num_rows)
}
}
fn create_or_delete_csv_file(path: String, content: Option<String>, operation: &str) {
match operation {
"create" => {
match content {
Some(c) => fs::write(path, c.as_bytes()).expect("Problem with writing file!"),
None => println!("The content is None, no file will be created"),
}
}
"delete" => {
// Delete the csv file
fs::remove_file(path).expect("Problem with deleting file!");
}
_ => println!("Invalid operation"),
}
}
async fn read_csv_file_with_inferred_schema(file_name_string: String) -> DFResult {
// create string csv data
let csv_data_string = "heading,value\nbasic,1\ncsv,2\nhere,3".to_string();
// Create a temporary file
create_or_delete_csv_file(file_name_string.clone(), Some(csv_data_string), "create");
// Create a session context
let ctx = SessionContext::new();
// Register a lazy DataFrame using the context
let df = ctx.read_csv(file_name_string.clone(), CsvReadOptions::default()).await.expect("An error occurred while reading the CSV string");
// return the dataframe
Ok(Arc::new(df))
}
#[tokio::main]
async fn main() {
let file_name_string = "temp_file.csv".to_string();
let arc_csv_df = read_csv_file_with_inferred_schema(file_name_string.clone()).await.expect("An error occurred while reading the CSV string (funct: read_csv_file_with_inferred_schema)");
// have to use ".clone()" each time I want to use this ref
let deref_df = arc_csv_df.deref();
// print to console
deref_df.clone().show().await.expect("An error occurred while showing the CSV DataFrame");
// collect to vec
let record_batches = deref_df.clone().collect().await.expect("An error occurred while collecting the CSV DataFrame");
// println!("Data: {:?}", data);
// record_batches == <Vec<RecordBatch>>. Convert to RecordBatch
let record_batch = record_batches[0].clone();
// let json_string = to_string(&record_batch).unwrap();
// let mut writer = datafusion::json::writer::RecordBatchJsonWriter::new(vec![]);
// writer.write(&record_batch).unwrap();
// let json_rows = writer.finish();
let json_rows = json::writer::record_batches_to_json_rows(&[record_batch]);
println!("JSON: {:?}", json_rows);
// get final values from recordbatch
// https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html
// https://users.rust-lang.org/t/how-to-use-recordbatch-in-arrow-when-using-datafusion/70057/2
// https://github.com/apache/arrow-rs/blob/6.5.0/arrow/src/util/pretty.rs
// let record_batches_vec = record_batches.to_vec();
let mut header = Vec::new();
// let mut rows = Vec::new();
for record_batch in &record_batches {
// get data
println!("record_batch.columns: : {:?}", record_batch.columns());
for col in record_batch.columns() {
for row in 0..col.len() {
// println!("Cow: {:?}", col);
// println!("Row: {:?}", row);
// let value = col.as_any().downcast_ref::<StringArray>().unwrap().value(row);
// rows.push(value);
}
}
// get headers
for field in record_batch.schema().fields() {
header.push(field.name().to_string());
}
};
// println!("Header: {:?}", header);
// Delete temp csv
create_or_delete_csv_file(file_name_string.clone(), None, "delete");
}
I am not sure that Datafusion is the perfect place to convert a CSV string into a JSON string, but here is a working version of your code:
#[tokio::main]
async fn main() {
let file_name_string = "temp_file.csv".to_string();
let csv_data_string = "heading,value\nbasic,1\ncsv,2\nhere,3".to_string();
// Create a temporary file
create_or_delete_csv_file(file_name_string.clone(), Some(csv_data_string), "create");
// Create a session context
let ctx = SessionContext::new();
// Register the csv file
ctx.register_csv("t1", &file_name_string, CsvReadOptions::new().has_header(false))
.await.unwrap();
let df = ctx.sql("SELECT * FROM t1").await.unwrap();
// collect to vec
let record_batches = df.collect().await.unwrap();
// get json rows
let json_rows = datafusion::arrow::json::writer::record_batches_to_json_rows(&record_batches[..]).unwrap();
println!("JSON: {:?}", json_rows);
// Delete temp csv
create_or_delete_csv_file(file_name_string.clone(), None, "delete");
}
If you encounter arrow and datafusion struct conflicts, use datafusion::arrow instead of just the arrow library.
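Since the stated goal is a JSON string for an HTTP response, the collected rows can be serialized to one with serde_json, which is already among the question's imports (a sketch):
// json_rows is a Vec of serde_json Maps, which serializes
// directly to a JSON array of objects.
let json_string = serde_json::to_string(&json_rows).unwrap();
println!("JSON string: {}", json_string);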

How to lazy load an iterator and store metadata for future iterations

I have a large CSV on disk that I need to iterate over. The CSV is too large to fit into memory, so I need to iterate over it lazily. The csv crate does a good job of this already. The issue I have is that I need to restart iteration without completing previous iteration passes, and I need to support random lookups of previously seen data without iterating over the file again.
My plan was to use the csv crate's iterator to progress through the file. Every time I reach a record, I was going to record metadata (record number, byte offset, length) so that I could fetch that record again. I could also use this metadata to restart iteration by reading directly from the file at the given byte offsets instead of creating a new csv iterator.
The issue I'm having is that I can't figure out how to get the lifetimes right. I'm new to Rust, and I honestly can't even begin to write down what I'd like to do with the appropriate lifetimes. Given the little experience I have with Rust, it's probably easier if I write what I'd like to do in pseudo-code:
struct IndexRecord {
byte_offset: u64,
length: u64,
}
struct CSVHandler {
csv_iterator: StringRecordIterator, // from csv crate
iterator_done: bool,
index: Vec<IndexRecord>,
}
impl CSVHandler {
pub fn get_iterator(&mut self) -> Iterator<String> {
return IndexIterator{ handler: self, index: 0 }
}
}
struct IndexIterator {
handler: CSVHandler,
index: u64,
}
impl Iterator for IndexIterator {
    type Item = String;
    fn next(&mut self) -> Option<Self::Item> {
        if self.handler.index.len() > self.index {
            let entry = self.handler.index.get(self.index);
            let line = /* read from disk using entry.byte_offset and entry.length */;
            self.index += 1;
            return Some(line);
        }
        if self.handler.iterator_done {
            return None;
        }
        match self.handler.csv_iterator.next() {
            None => {
                self.handler.iterator_done = true;
                None
            }
            Some(entry) => {
                self.handler.index.push(IndexRecord { byte_offset: entry.byte_offset, length: entry.length });
                self.index += 1;
                Some(entry.text())
            }
        }
    }
}
From what you describe, you would like to convert the sequential interface of a Reader/Iterator into a random-access interface. Because the whole data doesn't fit into memory you can't load it all, so the idea is to create an index over that data while keeping the data itself on disk and loading it on demand.
The first part, index creation, could be:
let mut reader = ...;
let positions: Vec<csv::Position> = reader.records().map(|record| {
    record.unwrap().position().unwrap().clone() // TODO: check errors instead of unwrapping
}).collect();
It extracts csv::Position instead of your custom IndexRecord, because it will be needed for seek() in order to reuse the library for reading from disk.
Having this index you could implement your random access reader:
struct RandomAccessCSVReader {
    reader: csv::Reader<std::fs::File>,
    positions: Vec<csv::Position>,
}
impl RandomAccessCSVReader {
    pub fn new() -> Self {
        let mut reader = ...;
        let positions = Self::read_positions(&mut reader);
        Self { reader, positions }
    }
    fn read_positions(reader: &mut csv::Reader<std::fs::File>) -> Vec<csv::Position> {
        reader.records().map(|record| {
            record.unwrap().position().unwrap().clone() // TODO: check errors
        }).collect()
    }
    pub fn read_at(&mut self, i: usize) -> csv::Result<csv::StringRecord> {
        let pos = self.positions[i].clone(); // TODO: check for bounds
        // seek into the file
        self.reader.seek(pos)?;
        // read from that position
        let mut record = csv::StringRecord::new();
        self.reader.read_record(&mut record)?;
        Ok(record)
    }
}
This is a draft, please do the proper error checking.
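To make the draft concrete, here is a hypothetical way to fill in the elided reader construction (my assumption: the data is a headerless CSV file on disk; seek() needs the underlying reader to implement io::Seek, which a File does), plus a usage example:
impl RandomAccessCSVReader {
    // Hypothetical constructor: builds the reader from a path and
    // indexes every record position up front.
    pub fn from_path(path: &str) -> csv::Result<Self> {
        let mut reader = csv::ReaderBuilder::new()
            .has_headers(false)
            .from_path(path)?;
        let positions = Self::read_positions(&mut reader);
        Ok(Self { reader, positions })
    }
}

// Records can now be fetched in any order without re-scanning the file:
let mut reader = RandomAccessCSVReader::from_path("data.csv").unwrap();
let record = reader.read_at(2).unwrap();
println!("third record: {:?}", record);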

How to manage the ownership of a file held within a struct in Rust?

Is there a good way to handle the ownership of a file held within a struct using Rust? As a stripped down example, consider:
// Buffered file IO
use std::io::{BufReader,BufRead};
use std::fs::File;
// Structure that contains a file
#[derive(Debug)]
struct Foo {
file : BufReader <File>,
data : Vec <f64>,
}
// Reads the file and strips the header
fn init_foo(fname : &str) -> Foo {
// Open the file
let mut file = BufReader::new(File::open(fname).unwrap());
// Dump the header
let mut header = String::new();
let _ = file.read_line(&mut header);
// Return our foo
Foo { file : file, data : Vec::new() }
}
// Read the remaining foo data and process it
fn read_foo(mut foo : Foo) -> Foo {
// Strip one more line
let mut header_alt = String::new();
let _ = foo.file.read_line(&mut header_alt);
// Read in the rest of the file line by line
let mut data = Vec::new();
for (lineno,line) in foo.file.lines().enumerate() {
// Strip the error
let line = line.unwrap();
// Print some diagnostic information
println!("Line {}: val {}",lineno,line);
// Save the element
data.push(line.parse::<f64>().unwrap());
}
// Export foo
Foo { data : data, ..foo}
}
fn main() {
// Initialize our foo
let foo = init_foo("foo.txt");
// Read in our data
let foo = read_foo(foo);
// Print out some debugging info
println!("{:?}",foo);
}
This currently gives the compilation error:
error[E0382]: use of moved value: `foo.file`
--> src/main.rs:48:5
|
35 | for (lineno,line) in foo.file.lines().enumerate() {
| -------- value moved here
...
48 | Foo { data : data, ..foo}
| ^^^^^^^^^^^^^^^^^^^^^^^^^ value used here after move
|
= note: move occurs because `foo.file` has type `std::io::BufReader<std::fs::File>`, which does not implement the `Copy` trait
error: aborting due to previous error
For more information about this error, try `rustc --explain E0382`.
error: Could not compile `rust_file_struct`.
To learn more, run the command again with --verbose.
And, to be sure, this makes sense. Here, lines() takes ownership of the buffered file, so we can't use the value in the return. What's confusing me is a better way to handle this situation. Certainly, after the for loop, the file is consumed, so it really can't be used. To better denote this, we could represent file as Option<BufReader<File>>. However, this causes some grief because the second read_line call, inside of read_foo, needs a mutable reference to file, and I'm not sure how to obtain one if it's wrapped inside of an Option. Is there a good way of handling the ownership?
To be clear, this is a stripped down example. In the actual use case, there are several files as well as other data. I've things structured in this way because it represents a configuration that comes from the command line options. Some of the options are files, some are flags. In either case, I'd like to do some processing, but not all, of the files early in order to throw the appropriate errors.
I think you're on track with using the Option within the Foo struct. Assuming the struct becomes:
struct Foo {
file : Option<BufReader <File>>,
data : Vec <f64>,
}
The following code is a possible solution:
// Reads the file and strips the header
fn init_foo(fname : &str) -> Foo {
// Open the file
let mut file = BufReader::new(File::open(fname).unwrap());
// Dump the header
let mut header = String::new();
let _ = file.read_line(&mut header);
// Return our foo
Foo { file : Some(file), data : Vec::new() }
}
// Read the remaining foo data and process it
fn read_foo(foo : Foo) -> Option<Foo> {
let mut file = foo.file?;
// Strip one more line
let mut header_alt = String::new();
let _ = file.read_line(&mut header_alt);
// Read in the rest of the file line by line
let mut data = Vec::new();
for (lineno,line) in file.lines().enumerate() {
// Strip the error
let line = line.unwrap();
// Print some diagnostic information
println!("Line {}: val {}",lineno,line);
// Save the element
data.push(line.parse::<f64>().unwrap());
}
// Export foo
Some(Foo { data : data, file: None})
}
Note in this case that read_foo returns an optional Foo due to the fact that the file could be None.
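As for taking a mutable reference while the reader is still wrapped, Option::as_mut gives one without moving the value out (a small sketch):
// Borrow the BufReader mutably inside the Option; `file` is a
// &mut BufReader<File> here, so read_line works and foo.file stays intact.
if let Some(file) = foo.file.as_mut() {
    let mut header_alt = String::new();
    let _ = file.read_line(&mut header_alt);
}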
On a side note, IMO, unless you absolutely need the BufReader to be travelling along with the Foo, I would discard it. As you've already found, calling lines causes a move, which makes it difficult to retain within another struct. As a suggestion, you could make the file field simply a String so that you could always derive the BufReader and read the file when needed.
For example, here's a solution where a file name (i.e. a &str) can be turned into a Foo with all the line processing done just before the construction of the struct.
// Buffered file IO
use std::io::{BufReader,BufRead};
use std::fs::File;
// Structure that contains a file
#[derive(Debug)]
struct Foo {
file : String,
data : Vec <f64>,
}
trait IntoFoo {
fn into_foo(self) -> Foo;
}
impl IntoFoo for &str {
fn into_foo(self) -> Foo {
// Open the file
let mut file = BufReader::new(File::open(self).unwrap());
// Dump the header
let mut header = String::new();
let _ = file.read_line(&mut header);
// Strip one more line
let mut header_alt = String::new();
let _ = file.read_line(&mut header_alt);
// Read in the rest of the file line by line
let mut data = Vec::new();
for (lineno,line) in file.lines().enumerate() {
// Strip the error
let line = line.unwrap();
// Print some diagnostic information
println!("Line {}: val {}",lineno,line);
// Save the element
data.push(line.parse::<f64>().unwrap());
}
Foo { file: self.to_string(), data }
}
}
fn main() {
// Read in our data from the file
let foo = "foo.txt".into_foo();
// Print out some debugging info
println!("{:?}",foo);
}
In this case, there's no need to worry about the ownership of the BufReader because it's created, used, and discarded in the same function. Of course, I don't fully know your use case, so this may not be suitable for your implementation.

How do I use include_str! for multiple files or an entire directory?

I would like to copy an entire directory to a location in a user's $HOME. Individually copying files to that directory is straightforward:
let contents = include_str!("resources/profiles/default.json");
let fpath = dpath.join(&fname);
fs::write(fpath, contents).expect(&format!("failed to create profile: {}", n));
I haven't found a way to adapt this to multiple files:
for n in ["default"] {
let fname = format!("{}{}", n, ".json");
let x = format!("resources/profiles/{}", fname).as_str();
let contents = include_str!(x);
let fpath = dpath.join(&fname);
fs::write(fpath, contents).expect(&format!("failed to create profile: {}", n));
}
...the compiler complains that x must be a string literal.
As far as I know, there are two options:
Write a custom macro.
Replicate the first code for each file I want to copy.
What is the best way of doing this?
I would create a build script that iterates through a directory, building up an array of tuples containing the name and another macro call to include the raw data:
use std::{
env,
error::Error,
fs::{self, File},
io::Write,
path::Path,
};
const SOURCE_DIR: &str = "some/path/to/include";
fn main() -> Result<(), Box<dyn Error>> {
let out_dir = env::var("OUT_DIR")?;
let dest_path = Path::new(&out_dir).join("all_the_files.rs");
let mut all_the_files = File::create(&dest_path)?;
writeln!(&mut all_the_files, r##"["##,)?;
for f in fs::read_dir(SOURCE_DIR)? {
let f = f?;
if !f.file_type()?.is_file() {
continue;
}
writeln!(
&mut all_the_files,
r##"("{name}", include_bytes!(r#"{name}"#)),"##,
name = f.path().display(),
)?;
}
writeln!(&mut all_the_files, r##"]"##,)?;
Ok(())
}
This has some weaknesses, namely that it requires the path to be expressible as a &str. Since you were already using include_string!, I don't think that's an extra requirement. This also means that the generated string has to be a valid Rust string. We use raw strings inside the generated file, but this can still fail if a filename were to contain the string "#. A better solution would probably use str::escape_default.
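A sketch of that escape_default variant (hypothetical; since the compiler unescapes the literal when it parses the generated file, the include_bytes! path still resolves to the real file):
// Escape quotes, backslashes, etc. so the path survives inside an
// ordinary (non-raw) string literal in the generated source.
let name: String = f.path().display().to_string().escape_default().collect();
writeln!(
    &mut all_the_files,
    r#"("{name}", include_bytes!("{name}")),"#,
    name = name,
)?;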
Since we are including files, I used include_bytes! instead of include_str!, but if you really need to you can switch back. Using raw bytes skips UTF-8 validation at compile time, so it's a small win.
Using it involves importing the generated value:
const ALL_THE_FILES: &[(&str, &[u8])] = &include!(concat!(env!("OUT_DIR"), "/all_the_files.rs"));
fn main() {
for (name, data) in ALL_THE_FILES {
println!("File {} is {} bytes", name, data.len());
}
}
See also:
How can I locate resources for testing with Cargo?
You can use the include_dir macro:
use include_dir::{include_dir, Dir};
use std::path::Path;
const PROJECT_DIR: Dir = include_dir!(".");
// of course, you can retrieve a file by its full path
let lib_rs = PROJECT_DIR.get_file("src/lib.rs").unwrap();
// you can also inspect the file's contents
let body = lib_rs.contents_utf8().unwrap();
assert!(body.contains("SOME_INTERESTING_STRING"));
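Applied to the question's use case, the embedded files can then be written out at runtime (a sketch; assumes the profiles live under resources/profiles as in the question, and dpath is the target directory):
const PROFILES: Dir = include_dir!("resources/profiles");

// Write every embedded profile into the target directory.
for file in PROFILES.files() {
    let fpath = dpath.join(file.path());
    fs::write(&fpath, file.contents()).expect("failed to create profile");
}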
Using a macro:
macro_rules! incl_profiles {
( $( $x:expr ),* ) => {
{
let mut profs = Vec::new();
$(
profs.push(($x, include_str!(concat!("resources/profiles/", $x, ".json"))));
)*
profs
}
};
}
...
let prof_tups: Vec<(&str, &str)> = incl_profiles!("default", "python");
for (prof_name, prof_str) in prof_tups {
let fname = format!("{}{}", prof_name, ".json");
let fpath = dpath.join(&fname);
fs::write(fpath, prof_str).expect(&format!("failed to create profile: {}", prof_name));
}
Note: This is not dynamic. The files ("default" and "python") are specified in the call to the macro.
Updated: Use Vec instead of HashMap.

error: `line` does not live long enough but it's ok in playground

I can't figure out why my local var line does not live long enough. You can see my code below; it works on the Rust playground.
I may have an idea of the issue: I use a structure (load is a method of this structure), and since I want to store the result of the line in a member of my struct, that could be the problem. But I don't see what I should do to resolve it.
pub struct Config<'a> {
file: &'a str,
params: HashMap<&'a str, &'a str>
}
impl<'a> Config<'a> {
pub fn new(file: &str) -> Config {
Config { file: file, params: HashMap::new() }
}
pub fn load(&mut self) -> () {
let f = match fs::File::open(self.file) {
Ok(e) => e,
Err(e) => {
println!("Failed to load {}, {}", self.file, e);
return;
}
};
let mut reader = io::BufReader::new(f);
let mut buffer = String::new();
loop {
let result = reader.read_line(&mut buffer);
if result.is_ok() && result.ok().unwrap() > 0 {
let line: Vec<String> = buffer.split("=").map(String::from).collect();
let key = line[0].trim();
let value = line[1].trim();
self.params.insert(key, value);
}
buffer.clear();
}
}
...
}
And I get this error:
src/conf.rs:33:27: 33:31 error: `line` does not live long enough
src/conf.rs:33 let key = line[0].trim();
^~~~
src/conf.rs:16:34: 41:6 note: reference must be valid for the lifetime 'a as defined on the block at 16:33...
src/conf.rs:16 pub fn load(&mut self) -> () {
src/conf.rs:17 let f = match fs::File::open(self.file) {
src/conf.rs:18 Ok(e) => e,
src/conf.rs:19 Err(e) => {
src/conf.rs:20 println!("Failed to load {}, {}", self.file, e);
src/conf.rs:21 return;
...
src/conf.rs:31:87: 37:14 note: ...but borrowed value is only valid for the block suffix following statement 0 at 31:86
src/conf.rs:31 let line: Vec<String> = buffer.split("=").map(String::from).collect();
src/conf.rs:32
src/conf.rs:33 let key = line[0].trim();
src/conf.rs:34 let value = line[1].trim();
src/conf.rs:35
src/conf.rs:36 self.params.insert(key, value);
...
There are three steps in realizing why this does not work.
let line: Vec<String> = buffer.split("=").map(String::from).collect();
let key = line[0].trim();
let value = line[1].trim();
self.params.insert(key, value);
line is a Vec of Strings, meaning the vector owns the strings it contains. An effect of this is that when the vector is freed from memory, the elements, the strings, are also freed.
If we look at str::trim, we see that it takes and returns a &str. In other words, the function does not allocate anything or transfer ownership - the string it returns is simply a slice of the original string. So if we were to free the original string, the trimmed string would no longer point at valid data.
The signature of HashMap::insert is fn insert(&mut self, k: K, v: V) -> Option<V>. The function moves both the key and the value, because these need to be valid for as long as they may be in the hashmap. We would like to give the hashmap the two strings. However, both key and value are just references to strings which are owned by the vector - we are just borrowing them - so we can't give them away.
The solution is simple: copy the strings after they have been split.
let line: Vec<String> = buffer.split("=").map(String::from).collect();
let key = line[0].trim().to_string();
let value = line[1].trim().to_string();
self.params.insert(key, value);
This will allocate two new strings, and copy the trimmed slices into the new strings.
We could have moved the strings out of the vector (i.e. with Vec::remove) if we didn't trim them afterwards; I was unable to find an easy way of trimming a String without allocating a new one.
In addition, as malbarbo mentions, we can avoid the extra allocation that is done with map(String::from), and the creation of the vector with collect(), by simply omitting them.
In this case you have to use String instead of &str. See this to understand the difference.
You can also eliminate the creation of the intermediate vector and use the iterator returned by split directly:
pub struct Config<'a> {
file: &'a str,
params: HashMap<String, String>
}
...
let mut line = buffer.split("=");
let key = line.next().unwrap().trim().to_string();
let value = line.next().unwrap().trim().to_string();
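Putting the two answers together, a minimal sketch of a corrected load (my combination, assuming params: HashMap<String, String>; note the original loop never terminates, so this version stops when read_line reports 0 bytes, i.e. EOF):
pub fn load(&mut self) {
    let f = match fs::File::open(self.file) {
        Ok(f) => f,
        Err(e) => {
            println!("Failed to load {}, {}", self.file, e);
            return;
        }
    };
    let mut reader = io::BufReader::new(f);
    let mut buffer = String::new();
    // read_line returns Ok(0) at end of file
    while reader.read_line(&mut buffer).map_or(false, |n| n > 0) {
        let mut parts = buffer.split('=');
        if let (Some(key), Some(value)) = (parts.next(), parts.next()) {
            // Owned Strings can be moved into the map; slices borrowed
            // from `buffer` could not outlive this loop iteration.
            self.params.insert(key.trim().to_string(), value.trim().to_string());
        }
        buffer.clear();
    }
}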
