How can I query data that is in a Delta Lake table using Rust with delta-rs? In my case, the data is split across multiple Parquet files.
Thank you
Could you please give me a small code example that works on your side?
You will need either Polars or DataFusion to do so.
Here is a naive approach in Rust:
use deltalake::delta::open_table;
use polars::prelude::*;

#[tokio::main]
async fn main() {
    let lf = read_delta_table("delta_test_5m").await;
    println!("{:?}", lf.select([count()]).collect());
}

async fn read_delta_table(path: &str) -> LazyFrame {
    // Open the table and list the Parquet files of its latest version.
    let dt = open_table(path).await.unwrap();
    let files = dt.get_files();
    // Read each Parquet file into its own DataFrame.
    let mut df_collection: Vec<DataFrame> = vec![];
    for file_path in files.into_iter() {
        let full_path = format!("{}/{}", path, file_path.as_ref());
        let mut file = std::fs::File::open(full_path).unwrap();
        let df = ParquetReader::new(&mut file).finish().unwrap();
        df_collection.push(df);
    }
    // Concatenate everything, seeding the fold with an empty frame
    // that has the right schema.
    let empty_head = df_collection[0].clone().lazy().limit(0);
    df_collection
        .into_iter()
        .fold(empty_head, |acc, df| concat([acc, df.lazy()], false, false).unwrap())
}
This code first gets the list of Parquet files belonging to the most recent version of the Delta table.
Then one DataFrame is created per file.
Finally, these DataFrames are concatenated into a single DataFrame.
Note that Polars offers this feature out of the box in Python:
import polars as pl
print(pl.read_delta("path_to_delta"))
I did not find a way to read Delta tables directly through Polars in Rust, but I expect it will be added soon.
Related
Hello, I want to do simple computations on the columns of a Polars DataFrame, but I have no clue how it works and the documentation is not helping. How can I compute the mean of A_col, for example?
fn main() {
    let df: DataFrame = example().unwrap();
    let A_col: Series = df["A"];
}
I'm trying to write a Rust program which uses Polars to read a CSV. This particular CSV encodes an array of floats as a string.
In my program, I want to load the CSV into a DataFrame and then parse this column into an array of floats. In Python you might write code that looks like this:
df = pd.read_csv("...")
df["babbage_search"] = df.babbage_search.apply(eval).apply(np.array)
However in my Rust program it's not clear how to go about this. I could take an approach like this:
let mut df = CsvReader::from_path("...")?
    .has_header(true)
    .finish()?;
df.apply("babbage_search", parse_vector)?;
However the parse_vector function is not really clear. I might write something like this, but this won't compile:
fn parse_vector(series: &Series) -> Series {
    series
        .utf8()
        .unwrap()
        .into_iter()
        .map(|opt_v| match opt_v {
            Some(v) => {
                let vec: Vec<f64> = serde_json::from_str(v).unwrap();
                let series: Series = vec.iter().collect();
                series.f64().unwrap().to_owned() as ChunkedArray<Float64Type>
            }
            None => ChunkedArray::<Float64Type>::default(),
        })
        .collect()
}
I'd appreciate any help in figuring this out.
I want to concatenate an unknown number of files side by side, but I'm making a bit of a mess. The operation would be roughly similar to unix paste. I thought I could iterate line by line over every file and write every element from every file to stdout, but it is proving harder than expected. Maybe there is a far better approach?
Every file looks like
name1 value1
name2 value2
name3 value3
name4 value4
I want to treat the first file specially, because each row has an identifier (name in the example above). The files are known to be sorted and of the same length, so I don't have to check anything while pasting the files together. For every file after the first I don't need to write the name field again, so I can just take the value field. I haven't even started splitting those columns because I'm stuck iterating over all files simultaneously.
The code below doesn't compile, since
use of moved value: `iterfiles`rustcE0382
combine.rs(17, 14): `iterfiles` moved due to this method call, in previous iteration of loop
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::PathBuf;

pub fn combine(calls: Vec<PathBuf>) {
    let file1 = File::open(calls[0].clone()).unwrap();
    let reader = BufReader::new(file1).lines();
    let mut files = Vec::new();
    for file in &calls[1..] {
        files.push(BufReader::new(File::open(file).unwrap()).lines());
    }
    let iterfiles = files.iter();
    for line in reader {
        let mut line_out = Vec::new();
        line_out.push(line.unwrap());
        let rest_of_files: Vec<String> = iterfiles
            .map(|file2| file2.next().unwrap().unwrap())
            .collect();
    }
}
You need to create the iterator inside the loop body instead, using iter_mut rather than iter:
for line in reader {
    let mut line_out = Vec::new();
    line_out.push(line.unwrap());
    let rest_of_files: Vec<String> = files.iter_mut()
        .map(|file2| file2.next().unwrap().unwrap())
        .collect();
}
By the way, you can construct files like this:
let mut files: Vec<_> = calls[1..].iter()
    .map(|file| BufReader::new(File::open(file).unwrap()).lines())
    .collect();
And you don't need the clone for file1:
let file1 = File::open(&calls[0]).unwrap();
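Putting these pieces together, here is a complete sketch under my own assumptions: `combine_to` is a hypothetical variant that writes to any `Write` sink (so it is easy to test), and it splits each extra file's line on whitespace to keep only the value column, matching the paste-like behavior described in the question.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Write};
use std::path::PathBuf;

// Sketch: paste files side by side. The first file keeps its
// "name value" columns; every later file contributes only its value.
pub fn combine_to<W: Write>(calls: &[PathBuf], out: &mut W) {
    let first = BufReader::new(File::open(&calls[0]).unwrap());
    // One line iterator per remaining file.
    let mut rest: Vec<_> = calls[1..]
        .iter()
        .map(|p| BufReader::new(File::open(p).unwrap()).lines())
        .collect();
    for line in first.lines() {
        let mut fields = vec![line.unwrap()];
        for lines in rest.iter_mut() {
            let l = lines.next().unwrap().unwrap();
            // Keep only the value column (second whitespace-separated field).
            fields.push(l.split_whitespace().nth(1).unwrap().to_string());
        }
        writeln!(out, "{}", fields.join(" ")).unwrap();
    }
}
```

Calling `combine_to(&paths, &mut std::io::stdout())` then prints the combined table.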
I previously asked How can I include an arbitrary set of Protobuf-built files without knowing their names? - this is a follow up question based on the results of that.
I now have a file that I include that contains the different modules on their own line - i.e.:
mod foo;
mod bar;
These modules and their names can be totally random depending on what the user has put in the directory for the proto files.
I need to perform operations on those random modules. For instance, the first thing I would like to do is get all the messages that exist in those new modules and present them back as strings that I can push onto a vector.
So really a 2 part question:
Is there a way to use the structures inside these modules generically, without knowing the names of the modules that I am including in this file with include!?
After the above, how do I get all the possible messages inside a protobuf-generated .rs file/module? Each .rs file has a FileDescriptorProto() method, which, looking at the Google protobuf documentation, looks similar to this: Google Protobuf FileDescriptor
What about including a single file that is generated by the build.rs script? The script can scan the given directory and generate the proper file.
I do have an example I can link to, but it includes solutions to Project Euler problems, so I'm not sure how people feel about that.
Here is the build.rs that I use:
// Generate the problem list based on available modules.
extern crate regex;

use std::env;
use std::fs;
use std::fs::File;
use std::io::prelude::*;
use std::path::Path;
use regex::Regex;

fn main() {
    let odir = env::var("OUT_DIR").unwrap();
    let cwd = env::current_dir().unwrap().to_str().unwrap().to_owned();
    let dst = Path::new(&odir);
    let gen_name = dst.join("plist.rs");
    let mut f = File::create(&gen_name).unwrap();
    writeln!(&mut f, "// Auto-generated, do not edit.").unwrap();
    writeln!(&mut f, "").unwrap();
    writeln!(&mut f, "pub use super::Problem;").unwrap();
    writeln!(&mut f, "").unwrap();
    let problems = get_problems();
    // Generate the inputs.
    for &p in problems.iter() {
        writeln!(&mut f, "#[path=\"{1}/src/pr{0:03}.rs\"] mod pr{0:03};", p, cwd).unwrap();
    }
    writeln!(&mut f, "").unwrap();
    // Make the problem set.
    writeln!(&mut f, "pub fn make() -> Vec<Box<Problem + 'static>> {{").unwrap();
    writeln!(&mut f, "    let mut probs = Vec::new();").unwrap();
    for &p in problems.iter() {
        writeln!(&mut f, "    add_problem!(probs, pr{:03}::Solution);", p).unwrap();
    }
    writeln!(&mut f, "    probs").unwrap();
    writeln!(&mut f, "}}").unwrap();
    drop(f);
}
// Get all of the problems, based on standard filenames of "src/prxxx.rs" where xxx is the problem
// number. Returns the result, sorted.
fn get_problems() -> Vec<u32> {
    let mut result = vec![];
    let re = Regex::new(r"^.*/pr(\d\d\d)\.rs$").unwrap();
    for entry in fs::read_dir(&Path::new("src")).unwrap() {
        let entry = entry.unwrap();
        let p = entry.path();
        let n = p.as_os_str().to_str();
        let name = match n {
            Some(n) => n,
            None => continue,
        };
        match re.captures(name) {
            None => continue,
            Some(cap) => {
                let num: u32 = cap.at(1).unwrap().parse().unwrap();
                result.push(num);
            },
        }
    }
    result.sort();
    result
}
Another source file under src then has the following:
include!(concat!(env!("OUT_DIR"), "/plist.rs"));
I have figured out a way to do this, based on @Shepmaster's suggestion in the comment on the original post:
Since Rust doesn't support reflection (at the time of this post), I had to expand my Cargo build script so that the generated file contains symbols I know will always be there.
I generated specific functions for each of the modules that I was including (since I had their module names at that point), and then generated "aggregate" functions with generic names that I could call back in my main code.
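To make the idea concrete, here is a self-contained, hypothetical sketch of what the generated file could look like. The module names (`foo`, `bar`) and message names are illustrative only; the real build.rs would emit one accessor per discovered proto module, plus the fixed-name aggregate.

```rust
// Hypothetical per-module accessors that build.rs would generate,
// shown as inline modules so the sketch is self-contained.
mod foo {
    pub fn message_names() -> Vec<&'static str> {
        vec!["FooRequest", "FooResponse"]
    }
}

mod bar {
    pub fn message_names() -> Vec<&'static str> {
        vec!["BarEvent"]
    }
}

// The aggregate function has a fixed, known name, so the main code
// can call it without knowing which modules were generated.
pub fn all_message_names() -> Vec<String> {
    let mut out = Vec::new();
    out.extend(foo::message_names().iter().map(|s| s.to_string()));
    out.extend(bar::message_names().iter().map(|s| s.to_string()));
    out
}
```

The main code only ever refers to `all_message_names`, which is why the random module names stop mattering.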
Editor's note: This code example is from a version of Rust prior to 1.0 and is not syntactically valid Rust 1.0 code. Updated versions of this code produce different errors, but the answers still contain valuable information.
I've implemented the following method to return me the words from a file in a 2 dimensional data structure:
fn read_terms() -> Vec<Vec<String>> {
    let path = Path::new("terms.txt");
    let mut file = BufferedReader::new(File::open(&path));
    return file.lines()
        .map(|x| x.unwrap().as_slice().words().map(|x| x.to_string()).collect())
        .collect();
}
Is this the right, idiomatic and efficient way in Rust? I'm wondering if collect() needs to be called so often and whether it's necessary to call to_string() here to allocate memory. Maybe the return type should be defined differently to be more idiomatic and efficient?
There is a shorter and more readable way of getting words from a text file.
use std::io::{BufRead, BufReader};
use std::fs::File;
let reader = BufReader::new(File::open("file.txt").expect("Cannot open file.txt"));
for line in reader.lines() {
    for word in line.unwrap().split_whitespace() {
        println!("word '{}'", word);
    }
}
You could instead read the entire file as a single String and then build a structure of references that points to the words inside:
use std::io::{self, Read};
use std::fs::File;
fn filename_to_string(s: &str) -> io::Result<String> {
    let mut file = File::open(s)?;
    let mut s = String::new();
    file.read_to_string(&mut s)?;
    Ok(s)
}

fn words_by_line<'a>(s: &'a str) -> Vec<Vec<&'a str>> {
    s.lines().map(|line| {
        line.split_whitespace().collect()
    }).collect()
}

fn example_use() {
    let whole_file = filename_to_string("terms.txt").unwrap();
    let wbyl = words_by_line(&whole_file);
    println!("{:?}", wbyl)
}
This will read the file with less overhead because it can slurp it into a single buffer, whereas reading lines with BufReader implies a lot of copying and allocating: first into the buffer inside BufReader, then into a newly allocated String for each line, and then into a newly allocated String for each word. It will also use less memory, because the single large String and the vectors of references are more compact than many individual Strings.
A drawback is that you can't directly return the structure of references, because it can't outlive the stack frame that holds the single large String. In example_use above, we have to put the large String into a let binding in order to call words_by_line. It is possible to get around this with unsafe code, wrapping the String and the references in a private struct, but that is much more complicated.
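Short of unsafe code, one safe pattern is to let a struct own the contents and borrow the word structure on demand instead of storing it. This sketch (the `FileWords` name is my own) trades recomputing the vectors per call for not having to fight the borrow checker:

```rust
// A struct that owns the file contents; the reference structure is
// rebuilt on each call rather than stored alongside the String.
struct FileWords {
    contents: String,
}

impl FileWords {
    // Cheap relative to re-reading the file from disk.
    fn words_by_line(&self) -> Vec<Vec<&str>> {
        self.contents
            .lines()
            .map(|line| line.split_whitespace().collect())
            .collect()
    }
}
```

A caller would do `let fw = FileWords { contents: filename_to_string("terms.txt")? };` and then call `fw.words_by_line()` wherever the words are needed.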