Splitting a Vec of strings into Vec<Vec<String>> - rust

I am attempting to relearn data science in Rust.
I have a Vec<String> that includes a delimiter "|" and an end-of-row marker "!end".
What I'd like to end up with is a Vec<Vec<String>> that can be put into a 2D ndarray array.
I have this Python code:
file = open('somefile.dat')
lst = []
for line in file:
lst += [line.split('|')]
df = pd.DataFrame(lst)
SAMV2FinalDataFrame = pd.DataFrame(lst,columns=column_names)
And I've recreated it here in Rust:
fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {
let file = File::open(filename).expect("no such file");
let buf = BufReader::new(file);
buf.lines()
.map(|l| l.expect("Could not parse line"))
.collect()
}
fn main() {
let lines = lines_from_file(".dat");
let mut new_arr = vec![];
// Here I get an immutable borrow error on lines
for line in lines{
new_arr.push([*line.split("!end")]);
}
// Here I get "expected closure, found str"
let x = lines.split("!end");
let array = Array::from(lines)
What I have: ['1','1','1','!end','2','2','2','!end']
What I need: [['1','1','1'],['2','2','2']]
Edit: also, why does the turbofish disappear when I type it on Stack Overflow?

I think part of the issue you ran into was due to how you worked with arrays. For example, Vec::push adds only a single element, so you would want to use Vec::extend instead. I also ran into a few empty strings, because splitting by "!end" leaves trailing '|' characters on the ends of the substrings. The errors were quite strange; I am not completely sure where the closure came from.
let lines = vec!["1|1|1|!end|2|2|2|!end".to_string()];
let mut new_arr = Vec::new();
// Iterate over &lines so we don't consume lines and it can be used again later
for line in &lines {
new_arr.extend(line.split("!end")
// Remove trailing empty string
.filter(|x| !x.is_empty())
// Convert each &str into a Vec<String>
.map(|x| {
x.split('|')
// Remove empty strings from ends split (Ex split: "|2|2|2|")
.filter(|x| !x.is_empty())
// Convert &str into owned String
.map(|x| x.to_string())
// Turn iterator into Vec<String>
.collect::<Vec<_>>()
}));
}
println!("{:?}", new_arr);
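To make the push/extend distinction concrete, here is a minimal sketch. The helper names (rows_of, with_push, with_extend) are hypothetical and only wrap the same splitting logic as the loop above; the point is the shape of the result, not the exact helpers.

```rust
/// Splits one record string into rows at "!end" and into fields at '|',
/// dropping empty pieces (hypothetical helper for illustration).
fn rows_of(line: &str) -> Vec<Vec<String>> {
    line.split("!end")
        .map(|row| {
            row.split('|')
                .filter(|field| !field.is_empty())
                .map(str::to_string)
                .collect::<Vec<String>>()
        })
        // Drop rows that ended up with no fields (e.g. after the last "!end")
        .filter(|row| !row.is_empty())
        .collect()
}

/// `push` adds the whole Vec of rows as ONE element, so an extra level of nesting appears.
fn with_push(lines: &[String]) -> Vec<Vec<Vec<String>>> {
    let mut out = Vec::new();
    for line in lines {
        out.push(rows_of(line));
    }
    out
}

/// `extend` appends each row individually, giving the flat list of rows.
fn with_extend(lines: &[String]) -> Vec<Vec<String>> {
    let mut out = Vec::new();
    for line in lines {
        out.extend(rows_of(line));
    }
    out
}
```

With the single input line used above, with_push yields one outer element containing both rows, while with_extend yields the two rows directly.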
I also came up with this other version which should handle your use case better. The earlier approach dropped all empty strings, while this one should preserve them while correctly handling the "!end".
use std::io::{self, BufRead, BufReader, Read, Cursor};
fn split_data<R: Read>(buffer: &mut R) -> io::Result<Vec<Vec<String>>> {
let mut sections = Vec::new();
let mut current_section = Vec::new();
for line in BufReader::new(buffer).lines() {
for item in line?.split('|') {
if item != "!end" {
current_section.push(item.to_string());
} else {
sections.push(current_section);
current_section = Vec::new();
}
}
}
Ok(sections)
}
In this example, I used Read for easier testing, but it will also work with a file.
let sample_input = b"1|1|1|!end|2|2|2|!end";
println!("{:?}", split_data(&mut Cursor::new(sample_input)));
// Output: Ok([["1", "1", "1"], ["2", "2", "2"]])
// You can also use a file instead (requires `use std::fs::File;`)
let mut file = File::open("somefile.dat").expect("no such file");
let solution: Vec<Vec<String>> = split_data(&mut file).unwrap();
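One thing to be aware of with split_data as written: items that come after the last "!end" are never pushed into sections. If that matters for your data, a variant along these lines keeps them. This is only a sketch: split_sections is a hypothetical name, and it takes a &str instead of a reader to keep the example short.

```rust
/// Splits '|'-separated text into sections terminated by the "!end" marker.
/// Empty fields are kept; a trailing section with no "!end" is kept as well.
/// (`split_sections` is a hypothetical name, not part of the answer above.)
fn split_sections(input: &str) -> Vec<Vec<String>> {
    let mut sections = Vec::new();
    let mut current = Vec::new();
    for line in input.lines() {
        for item in line.split('|') {
            if item == "!end" {
                // std::mem::take moves the finished section out and leaves an empty Vec behind
                sections.push(std::mem::take(&mut current));
            } else {
                current.push(item.to_string());
            }
        }
    }
    // Flush whatever was collected after the last "!end"
    if !current.is_empty() {
        sections.push(current);
    }
    sections
}
```

Whether an unterminated trailing section should be kept or treated as an error depends on the file format, so treat the final flush as an assumption.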

Related

What is the most efficient way to read the first line of a file separately to the rest of the file?

I am trying to figure out the best way to read the contents of a file. The problem is that I need to read the first line separately, because it has to be parsed as a usize, which I need for the dimension of an Array2 from ndarray.
I tried the following:
use ndarray::prelude::*;
use std::io::{BufRead, BufReader};
use std::fs;
fn read_inputfile(geom_filename: &str) -> (Vec<i32>, Array2<f64>, usize) {
//* Step 1: Read the coord data from input
println!("Inputfile: {geom_filename}");
let geom_file = fs::File::open(geom_filename).expect("Geometry file not found!");
let geom_file_reader = BufReader::new(geom_file);
let geom_file_lines: Vec<String> = geom_file_reader
.lines()
.map(|line| line.expect("Failed to read line!"))
.collect();
//* Read no of atoms first for array size
let no_atoms: usize = geom_file_lines[0].parse().unwrap();
let mut Z_vals: Vec<i32> = Vec::new();
let mut geom_matr: Array2<f64> = Array2::zeros((no_atoms, 3));
for (atom_idx, line) in geom_file_lines[1..].iter().enumerate() {
//* into_iter would do the same
let line_split: Vec<&str> = line.split_whitespace().collect();
Z_vals.push(line_split[0].parse().unwrap());
(0..3).for_each(|cart_coord| {
geom_matr[(atom_idx, cart_coord)] = line_split[cart_coord + 1].parse().unwrap();
});
}
(Z_vals, geom_matr, no_atoms)
}
Does this not kind of defeat the purpose of the BufReader? I am still relatively new to Rust, so I might have misunderstood something, but I thought that one uses the BufReader so that the whole file does not need to be read into memory.
With the Vec<String> for geom_file_lines I am most likely loading the whole file into memory again, right?
Does this not kind of defeat the purpose of the BufReader?
It very much does, yes. lines() gives you an iterator, so you can read them without loading all of them into memory at once. You force them all into memory, though, as you call collect().
Simply don't do that. Use the iterator as an iterator. Especially as you convert it back to an iterator later, via geom_file_lines[1..].iter().
Like this:
use ndarray::prelude::*;
use std::fs;
use std::io::{BufRead, BufReader};
pub fn read_inputfile(geom_filename: &str) -> (Vec<i32>, Array2<f64>, usize) {
//* Step 1: Read the coord data from input
println!("Inputfile: {geom_filename}");
let geom_file = fs::File::open(geom_filename).expect("Geometry file not found!");
let geom_file_reader = BufReader::new(geom_file);
let mut geom_file_lines = geom_file_reader
.lines()
.map(|line| line.expect("Failed to read line!"));
//* Read no of atoms first for array size
let no_atoms: usize = geom_file_lines.next().unwrap().parse().unwrap();
let mut z_vals: Vec<i32> = Vec::new();
let mut geom_matr: Array2<f64> = Array2::zeros((no_atoms, 3));
for (atom_idx, line) in geom_file_lines.enumerate() {
let line_split: Vec<&str> = line.split_whitespace().collect();
z_vals.push(line_split[0].parse().unwrap());
(0..3).for_each(|cart_coord| {
geom_matr[(atom_idx, cart_coord)] = line_split[cart_coord + 1].parse().unwrap();
});
}
(z_vals, geom_matr, no_atoms)
}
You can apply the same logic in your for loop:
for (atom_idx, line) in geom_file_lines.enumerate() {
let mut line_split = line.split_whitespace();
z_vals.push(line_split.next().unwrap().parse().unwrap());
(0..3).for_each(|cart_coord| {
geom_matr[(atom_idx, cart_coord)] = line_split.next().unwrap().parse().unwrap();
});
}
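For trying the pattern out without ndarray or a file on disk, here is a standard-library-only sketch of the same idea: pull the first line with .next(), then keep consuming the same iterator. The function name parse_geometry, the Vec<[f64; 3]> stand-in for Array2, and the String error type are all assumptions made for illustration. It also replaces the unwrap calls with errors and stops after no_atoms records, which is a design choice rather than something the original code does.

```rust
use std::io::BufRead;

/// Reads an atom count from the first line, then one "Z x y z" record per line.
/// `parse_geometry` and its String errors are hypothetical, for illustration only.
fn parse_geometry<R: BufRead>(reader: R) -> Result<(Vec<i32>, Vec<[f64; 3]>), String> {
    let mut lines = reader.lines();

    // Consume only the first line; the remaining lines stay unread in the iterator.
    let first = lines
        .next()
        .ok_or("empty input")?
        .map_err(|e| e.to_string())?;
    let no_atoms = first
        .trim()
        .parse::<usize>()
        .map_err(|_| "bad atom count")?;

    let mut z_vals = Vec::with_capacity(no_atoms);
    let mut coords = Vec::with_capacity(no_atoms);
    // Read at most `no_atoms` records, one line at a time.
    for line in lines.take(no_atoms) {
        let line = line.map_err(|e| e.to_string())?;
        let mut fields = line.split_whitespace();
        let z = fields
            .next()
            .ok_or("missing Z")?
            .parse::<i32>()
            .map_err(|_| "bad Z")?;
        let mut xyz = [0.0_f64; 3];
        for slot in xyz.iter_mut() {
            *slot = fields
                .next()
                .ok_or("missing coordinate")?
                .parse::<f64>()
                .map_err(|_| "bad coordinate")?;
        }
        z_vals.push(z);
        coords.push(xyz);
    }
    Ok((z_vals, coords))
}
```

Because the function is generic over BufRead, it accepts a BufReader wrapped around a File as well as an in-memory std::io::Cursor, which is convenient for tests.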

Read lines in pairs from stdin in rust

My input data is structured as follows:
label_1
value_1
label_2
value_2
...
And my end goal is to read that data into a HashMap
My current working approach is to put even and odd lines in two separate vectors and then read from both vectors to add to the HashMap.
use std::io;
use std::io::prelude::*;
use std::collections::HashMap;
fn main() {
let mut labels: Vec<String> = Vec::new();
let mut values: Vec<String> = Vec::new();
let stdin = io::stdin();
/* Read lines piped from stdin*/
for (i, line) in stdin.lock().lines().enumerate() {
if i % 2 == 0 {
/* store labels (even lines) in labels vector */
labels.push(line.unwrap());
} else {
/* Store values (odd lines) in values vector */
values.push(line.unwrap());
}
}
println!("number of labels: {}", labels.len());
println!("number of values: {}", values.len());
/* Zip labels and values into one iterator */
let double_iter = labels.iter().zip(values.iter());
/* insert (label: value) pairs into hashmap */
let mut records: HashMap<&String, &String> = HashMap::new();
for (label, value) in double_iter {
records.insert(label, value);
}
}
I would like to ask how to achieve this result without going through an intermediary step with vectors.
You can use .tuples() from the itertools crate:
use itertools::Itertools;
use std::io::{stdin, BufRead};
fn main() {
for (label, value) in stdin().lock().lines().tuples() {
println!("{}: {}", label.unwrap(), value.unwrap());
}
}
See also:
This answer on "Are there equivalents to slice::chunks/windows for iterators to loop over pairs, triplets etc?"
You can manually advance an iterator with .next()
use std::io;
use std::io::prelude::*;
use std::collections::HashMap;
fn main() {
let stdin = io::stdin();
let mut lines = stdin.lock().lines();
let mut records = HashMap::new();
while let Some(label) = lines.next() {
let value = lines.next().expect("No value for label");
records.insert(label.unwrap(), value.unwrap());
}
}
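The same loop can be wrapped in a function that is generic over the input, which makes it testable without piping anything to stdin. This is a sketch under assumptions: read_pairs is a hypothetical name, and turning a label without a value into an UnexpectedEof error (instead of the panic from expect above) is my choice.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Reads alternating label/value lines into a map without intermediate vectors.
/// `read_pairs` is a hypothetical helper; a label with no value becomes an error.
fn read_pairs<R: BufRead>(reader: R) -> io::Result<HashMap<String, String>> {
    let mut lines = reader.lines();
    let mut records = HashMap::new();
    while let Some(label) = lines.next() {
        // Advance the same iterator a second time to get the matching value
        let value = lines.next().ok_or_else(|| {
            io::Error::new(io::ErrorKind::UnexpectedEof, "label without a value")
        })?;
        records.insert(label?, value?);
    }
    Ok(records)
}
```

To read from standard input, pass io::stdin().lock() as the reader.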
How about:
fn main() {
let lines = vec![1,2,3,4,5,6];
let mut records = std::collections::HashMap::new();
for i in (0..lines.len()).step_by(2) {
// Make sure that index `i + 1` exists, otherwise this panics on an odd number of elements
println!("{}{}", lines[i], lines[i + 1]);
records.insert(lines[i], lines[i + 1]);
}
}
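If the data is already in a slice, chunks_exact(2) from the standard library does the bounds handling for you, so there is no i + 1 to get wrong. A small sketch (pair_up is a hypothetical name):

```rust
use std::collections::HashMap;

/// Pairs up consecutive elements of a slice. `chunks_exact(2)` never yields a
/// short chunk, so a dangling last element is skipped instead of causing a panic.
/// (`pair_up` is a hypothetical name.)
fn pair_up(lines: &[i32]) -> HashMap<i32, i32> {
    let mut records = HashMap::new();
    for pair in lines.chunks_exact(2) {
        records.insert(pair[0], pair[1]);
    }
    records
}
```

Note that this silently ignores a trailing element without a partner; if that should be an error, check lines.len() % 2 first.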

Rust - Multiple Calls to Iterator Methods

I have the following Rust code:
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut chars = line.char_indices();
for (i, c) in chars {
match c {
'"' => {
if let Some(pos) = chars.position(|(_, x)| x == '"') {
tokens.push(&line[i..=i+pos]);
} else {
// Not a complete string
}
}
// Other options...
}
}
tokens
}
I am trying to elegantly extract a string surrounded by double quotes from the line, but since chars.position takes a mutable reference and chars is moved into the for loop, I get a compilation error - "value borrowed after move". The compiler suggests borrowing chars in the for loop but this doesn't work because an immutable reference is not an iterator (and a mutable one would cause the original problem where I can't borrow mutably again for position).
I feel like there should be a simple solution to this.
Is there an idiomatic way to do this or do I need to regress to appending characters one by one?
Because a for loop will take ownership of chars (because it calls .into_iter() on it) you can instead manually iterate through chars using a while loop:
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut chars = line.char_indices();
while let Some((i, c)) = chars.next() {
match c {
'"' => {
if let Some(pos) = chars.position(|(_, x)| x == '"') {
tokens.push(&line[i..=i+pos]);
} else {
// Not a complete string
}
}
// Other options...
}
}
tokens
}
It works if you just desugar the for-loop:
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut chars = line.char_indices();
while let Some((i, c)) = chars.next() {
match c {
'"' => {
if let Some(pos) = chars.position(|(_, x)| x == '"') {
tokens.push(&line[i..=i+pos]);
} else {
// Not a complete string
}
},
_ => {},
}
}
tokens
}
The normal for-loop prevents additional modification of the iterator because this usually leads to surprising and hard-to-read code. Doing it as a while-loop has no such protection.
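One detail to watch when advancing the iterator inside the loop: Iterator::position returns how many items were consumed before the match, not a byte index into line, so i + pos does not point at the closing quote. A sketch of the same while-let loop using Iterator::find instead, which hands back the (byte index, char) pair directly. The name tokenize_quoted is hypothetical, and the other match arms are left out.

```rust
/// Collects every double-quoted substring of `line`, quotes included.
/// (`tokenize_quoted` is a hypothetical name; other token kinds are omitted.)
fn tokenize_quoted(line: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut chars = line.char_indices();
    while let Some((i, c)) = chars.next() {
        if c == '"' {
            // `find` advances the same iterator and returns the byte index of the closing quote
            if let Some((j, _)) = chars.find(|&(_, x)| x == '"') {
                tokens.push(&line[i..=j]);
            }
            // else: not a complete string, so nothing is pushed
        }
    }
    tokens
}
```

An unterminated quote consumes the rest of the line here, which matches the behaviour of the position-based version.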
If all you want to do is find quoted strings, I would not, however, go with an iterator at all here.
fn tokenize(line: &str) -> Vec<&str> {
let mut tokens = Vec::new();
let mut line = line;
while let Some(pos) = line.find('"') {
line = &line[(pos+1)..];
if let Some(end) = line.find('"') {
tokens.push(&line[..end]);
line = &line[(end+1)..];
} else {
// Not a complete string
}
}
tokens
}

String concatenation in rust

I am trying to concatenate a &str and a &str in a for loop, with the intention of using the new combined string after a number of parts have been added to it. A general layout of the for loop can be seen below, but I am having a lot of trouble combining strings due to numerous errors.
for line in reader.lines() {
let split_line = line.unwrap().split(",");
let mut edited_line = "";
for word in split_line {
if !word.contains("substring") {
let test_string = [edited_line, word].join(",");
edited_line = &test_string;
}
}
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
First error:
error[E0716]: temporary value dropped while borrowed
Comes when running the above.
Second error:
error[E0308]: mismatched types: expected &str, found struct std::string::String
happens when you remove the & from test_string when it is assigned to edited_line
Note: format! and concat! macros both also give error 2.
It seems that if I get error 2 and convert the std::string::String to a &str, I get an error stating that the variables don't live long enough.
How am I supposed to go about building a string of many parts?
Note that Rust has two string types, String and &str (actually, there are more, but that's irrelevant here).
String is an owned string and can grow and shrink dynamically.
&str is a borrowed string and is immutable.
Calling [edited_line, word].join(",") creates a new String, which is allocated on the heap. edited_line = &test_string then borrows the String and implicitly converts it to a &str.
The problem is that its memory is freed as soon as the owner (test_string) goes out of scope, but the borrow lives longer than test_string. This is fundamentally impossible in Rust, since it would otherwise be a use-after-free bug.
The correct and most efficient way to do this is to create an empty String outside of the loop and only append to it in the loop:
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
edited_line.push(',');
edited_line.push_str(word);
}
}
Note that the resulting string will start with a comma, which might not be desired. To avoid it, you can write
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
if !edited_line.is_empty() {
edited_line.push(',');
}
edited_line.push_str(word);
}
}
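Put together as a self-contained function, the append-only approach might look like the sketch below. The names filter_fields and needle are hypothetical; "substring" from the question would be passed as needle.

```rust
/// Rebuilds a comma-separated line, skipping fields that contain `needle`.
/// (`filter_fields` and `needle` are hypothetical names.)
fn filter_fields(line: &str, needle: &str) -> String {
    // One allocation up front; the result can never be longer than the input
    let mut edited_line = String::with_capacity(line.len());
    for word in line.split(',') {
        if !word.contains(needle) {
            if !edited_line.is_empty() {
                edited_line.push(',');
            }
            edited_line.push_str(word);
        }
    }
    edited_line
}
```

One caveat of the is_empty check: if the first kept field is itself empty, the separator after it is not written, so ",a" comes out as "a". If empty fields matter, track a separate "first" flag instead.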
This could be done more elegantly with the itertools crate, which provides a join method for iterators:
use itertools::Itertools;
let edited_line: String = line
.unwrap()
.split(",")
.filter(|word| !word.contains("substring"))
.join(",");
let mut edited_line = ""; makes edited_line a &str with a static lifetime.
To actually make edited_line a string, either append .to_owned(), or use String::new():
let mut edited_line = String::new();
// Or
let mut edited_line = "".to_owned();
See What are the differences between Rust's `String` and `str`? if you are unfamiliar with the differences.
Most importantly for your case, you can't extend a &str, but you can extend a String.
Once you switched edited_line to a String, using the method of setting edited_line to [edited_line, word].join(","); works:
for line in reader.lines() {
let line = line.unwrap(); // Bind the String first so the split does not borrow a temporary
let split_line = line.split(",");
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
let test_string = [edited_line.as_str(), word].join(","); // Added .as_str() to edited_line
edited_line = test_string; // Removed the & here
}
}
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
However, this is neither very efficient nor ergonomic. It also has the (probably unintended) result of prepending a , to each line.
Here is an alternative that uses only one String instance:
for line in reader.lines() {
let line = line.unwrap(); // Bind the String first so the split does not borrow a temporary
let split_line = line.split(",");
let mut edited_line = String::new();
for word in split_line {
if !word.contains("substring") {
edited_line.push(',');
edited_line.push_str(word);
}
}
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
This still prepends the , character before each line however. You can probably fix that by checking if edited_line is not empty before pushing the ,.
The third option is to change the for loop into an iterator:
for line in reader.lines() {
let edited_line = line.unwrap().split(",")
.filter(|word| !word.contains("substring"))
.collect::<Vec<&str>>() // Collecting allows us to use the join function.
.join(",");
let _ = writeln!(outfile, "{}", edited_line).expect("Unable to write to file");
}
This way we can use the join function as intended, neatly eliminating the initial , at the start of each line.
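For completeness, here is the third option as a function that can be run on its own. The question does not show how reader and outfile are created, so this sketch makes them generic parameters; copy_filtered and needle are hypothetical names, and errors are propagated with ? instead of unwrap/expect.

```rust
use std::io::{self, BufRead, Write};

/// Copies lines from `reader` to `outfile`, dropping comma-separated fields that
/// contain `needle`. The generic parameters stand in for the question's
/// `reader` and `outfile`, whose concrete types are not shown there.
fn copy_filtered<R: BufRead, W: Write>(reader: R, mut outfile: W, needle: &str) -> io::Result<()> {
    for line in reader.lines() {
        let line = line?;
        let edited_line = line
            .split(',')
            .filter(|word| !word.contains(needle))
            .collect::<Vec<&str>>() // join is defined on slices, so collect first
            .join(",");
        writeln!(outfile, "{}", edited_line)?;
    }
    Ok(())
}
```

In a real program you would pass a BufReader around the input File and the output File (or a BufWriter around it); in a test, a std::io::Cursor and a Vec<u8> work just as well.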
PS: If you have trouble knowing what types each variable is, I suggest using an IDE like Intellij-rust, which shows type hints for each variable as you write them.

Replacing numbered placeholders with elements of a vector in Rust?

I have the following:
A Vec<&str>.
A &str that may contain $0, $1, etc. referencing the elements in the vector.
I want to get a version of my &str where all occurrences of $i are replaced by the ith element of the vector. So if I have vec!["foo", "bar"] and $0$1, the result would be foobar.
My first naive approach was to iterate over i = 1..N and do a search and replace for every index. However, this is quite an ugly and inefficient solution. Also, it gives undesired output if any of the values in the vector contains the $ character.
Is there a better way to do this in Rust?
This solution is inspired (including copied test cases) by Shepmaster's, but simplifies things by using the replace_all method.
use regex::{Regex, Captures};
fn template_replace(template: &str, values: &[&str]) -> String {
let regex = Regex::new(r#"\$(\d+)"#).unwrap();
regex.replace_all(template, |captures: &Captures| {
values
.get(index(captures))
.unwrap_or(&"")
}).to_string()
}
fn index(captures: &Captures) -> usize {
captures.get(1)
.unwrap()
.as_str()
.parse()
.unwrap()
}
fn main() {
assert_eq!("ab", template_replace("$0$1", &["a", "b"]));
assert_eq!("$1b", template_replace("$0$1", &["$1", "b"]));
assert_eq!("moo", template_replace("moo", &[]));
assert_eq!("abc", template_replace("a$0b$0c", &[""]));
assert_eq!("abcde", template_replace("a$0c$1e", &["b", "d"]));
println!("It works!");
}
I would use a regex
use regex::Regex; // 1.1.0
fn example(s: &str, vals: &[&str]) -> String {
let r = Regex::new(r#"\$(\d+)"#).unwrap();
let mut start = 0;
let mut new = String::new();
for caps in r.captures_iter(s) {
let m = caps.get(0).expect("Regex group 0 missing");
let d = caps.get(1).expect("Regex group 1 missing");
let d: usize = d.as_str().parse().expect("Could not parse index");
// Copy non-placeholder
new.push_str(&s[start..m.start()]);
// Copy placeholder
new.push_str(&vals[d]);
start = m.end()
}
// Copy non-placeholder
new.push_str(&s[start..]);
new
}
fn main() {
assert_eq!("ab", example("$0$1", &["a", "b"]));
assert_eq!("$1b", example("$0$1", &["$1", "b"]));
assert_eq!("moo", example("moo", &[]));
assert_eq!("abc", example("a$0b$0c", &[""]));
}
See also:
Split a string keeping the separators
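If pulling in the regex crate is not an option, the same single-pass idea can be sketched with the standard library only. This is an illustration, not a drop-in replacement for either answer: it reuses the template_replace name from the first answer, and two behaviours are assumptions of mine: an index that is out of range (or too large to parse) becomes an empty string, and a $ that is not followed by a digit is copied through unchanged.

```rust
/// Replaces $0, $1, ... in `template` with the matching entries of `values`
/// in a single left-to-right pass, so inserted values are never re-scanned.
fn template_replace(template: &str, values: &[&str]) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(pos) = rest.find('$') {
        // Copy everything before the '$' verbatim
        out.push_str(&rest[..pos]);
        let after = &rest[pos + 1..];
        // Length in bytes of the run of ASCII digits following the '$'
        let digits_len = after
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(after.len());
        if digits_len == 0 {
            // A lone '$' is not a placeholder (assumption: keep it as-is)
            out.push('$');
        } else {
            // Assumption: unknown or unparsable indices expand to nothing
            let value = after[..digits_len]
                .parse::<usize>()
                .ok()
                .and_then(|idx| values.get(idx).copied())
                .unwrap_or("");
            out.push_str(value);
        }
        rest = &after[digits_len..];
    }
    // Copy the tail after the last placeholder
    out.push_str(rest);
    out
}
```

Because the output is built separately from the input, a value such as "$1" is inserted literally and never expanded again, which addresses the problem with repeated search-and-replace described in the question.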

Resources