Inserting row data as column data using Hashmaps and structs

I'm reading data from a message queue, deserializing it with Serde, and storing it in structs. I want to store that time-series data in polars. From what I've read, my understanding is that polars is built around storing data in columns, so as a workaround I tried the following.
On start I call:
async fn init_dataframe() -> HashMap<String, Equities> {
let mut btc = Equities::new("BTCUSD".to_string());
btc.dataframe = Equities::instantiate_data(&btc);
let dataframes = HashMap::from([("BTCUSD".to_string(), btc)]);
dataframes
}
pub fn instantiate_data(&self) -> DataFrame {
let columns = vec![
Series::new("broker_time", &self.data.broker_time),
Series::new("timeframe", &self.data.timeframe),
Series::new("open", &self.data.open),
Series::new("high", &self.data.high),
Series::new("low", &self.data.low),
Series::new("close", &self.data.close),
Series::new("tick_volume", &self.data.tick_volume),
Series::new("spread", &self.data.spread),
];
DataFrame::new(columns).unwrap()
}
Equities and Candles:
#[derive(Debug)]
pub struct Equities {
pub symbol: String,
pub data: Candles,
pub dataframe: DataFrame,
}
#[derive(Debug)]
pub struct Candles {
pub broker_time: Vec<String>,
pub timeframe: Vec<String>,
pub open: Vec<f32>,
high: Vec<f32>,
low: Vec<f32>,
close: Vec<f32>,
tick_volume: Vec<f32>,
spread: Vec<f64>,
}
As I retrieve data I find the symbol location in the hashmap, access the equities object and try appending the data to each vec series to be stored in polars:
match candle {
candleEnum::CandleJson(candle) => {
dataframes[candle.symbol()].data.broker_time.push(candle.broker_time().to_string());
}
candleEnum::TickJson(candle) => {
dataframes[candle.symbol()].data.broker_time.push(candle.broker_time().to_string());
}
}
Error: trait `IndexMut` is required to modify indexed content, but it is not implemented for
`std::collections::HashMap<std::string::String, handlers::equities::Equities>`
This doesn't work because mutating through dataframes[...] requires IndexMut, which HashMap does not implement. I looked into using the .entry() method, but I am unable to select the dataframe object from the Equities struct to access data and insert the streamed data. Is there a better way to do this? How can I access and modify data in the Equities struct? I'm still very new to Rust, coming from a Python background, so any pointers are much appreciated.
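For reference, here is a minimal sketch of the two usual ways to get mutable access to a value in a HashMap: get_mut for keys that should already exist, and entry for insert-or-update. The Candles and Equities types below are cut-down stand-ins for the ones in the question (one column only), and the helper names push_existing and push_or_insert are made up for illustration.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the structs in the question (only one column kept).
#[derive(Debug, Default)]
struct Candles {
    broker_time: Vec<String>,
}

#[derive(Debug, Default)]
struct Equities {
    data: Candles,
}

/// Append a timestamp to an existing symbol; returns false if the symbol is unknown.
fn push_existing(map: &mut HashMap<String, Equities>, symbol: &str, time: &str) -> bool {
    match map.get_mut(symbol) {
        Some(eq) => {
            eq.data.broker_time.push(time.to_string());
            true
        }
        None => false,
    }
}

/// Append a timestamp, creating a default entry the first time a symbol is seen.
fn push_or_insert(map: &mut HashMap<String, Equities>, symbol: &str, time: &str) {
    map.entry(symbol.to_string())
        .or_default()
        .data
        .broker_time
        .push(time.to_string());
}

fn main() {
    let mut dataframes: HashMap<String, Equities> = HashMap::new();
    push_or_insert(&mut dataframes, "BTCUSD", "2023-01-01T00:00:00Z");
    let known = push_existing(&mut dataframes, "BTCUSD", "2023-01-01T00:01:00Z");
    let unknown = push_existing(&mut dataframes, "ETHUSD", "2023-01-01T00:01:00Z");
    println!("known: {}, unknown: {}", known, unknown);
    // Read-only indexing with [] is fine; only mutation needs get_mut or entry.
    println!("{:?}", dataframes["BTCUSD"].data.broker_time);
}
```

get_mut avoids the panic that unwrap() would cause on an unknown symbol, while entry is convenient when new symbols can show up mid-stream.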
Edit: I was able to fix the mutable HashMap error thanks to @cdhowie. Now, when I print the dataframe, it still shows an empty dataframe despite data being pushed to the Candles vecs.
match candle {
candleEnum::CandleJson(candle) => {
dataframes.get_mut(candle.symbol()).unwrap().data.broker_time.push(candle.broker_time().to_string());
}
candleEnum::TickJson(candle) => {
dataframes.get_mut(candle.symbol()).unwrap().data.broker_time.push(candle.broker_time().to_string());
}
}
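A likely explanation for the empty dataframe: DataFrame has no lifetime parameter, so it cannot borrow the Vecs in Candles. It holds its own copy of whatever was in them when instantiate_data ran, which at start-up was nothing. Pushing to the Vecs afterwards never touches that copy. The sketch below shows the same effect with a hypothetical Frame type standing in for polars' DataFrame, so it runs without any dependencies; to_frame plays the role of instantiate_data.

```rust
// `Frame` is a hypothetical stand-in for polars' DataFrame: it owns a copy of
// the column data instead of borrowing the Vec it was built from.
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    broker_time: Vec<String>,
}

#[derive(Debug, Default)]
struct Candles {
    broker_time: Vec<String>,
}

impl Candles {
    /// Build a fresh snapshot from the current contents of the buffers
    /// (the role instantiate_data plays in the question).
    fn to_frame(&self) -> Frame {
        Frame {
            broker_time: self.broker_time.clone(),
        }
    }
}

fn main() {
    let mut data = Candles::default();

    // Built once at start-up, while the buffers are still empty.
    let stale = data.to_frame();

    // Streaming data arrives later and is pushed to the Vec only.
    data.broker_time.push("2023-01-01T00:00:00Z".to_string());

    // The old snapshot does not see the push; a rebuilt one does.
    let fresh = data.to_frame();
    println!("stale rows: {}", stale.broker_time.len());
    println!("fresh rows: {}", fresh.broker_time.len());
}
```

In the original code that would mean calling instantiate_data again (and assigning the result to the dataframe field) after pushing, or appending new rows to the DataFrame directly instead of keeping a separate set of Vecs.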

Related

Rust find in Vec without creating new iterator each time

I am processing a lot of data. I am working with the cargo dumps, in hopes of creating an interconnected graph of all crates. I get the data from a postgres dump and try to connect it. I'm trying to connect each version of a package with the latest possible version of each dependency. I have to go through each dependency, find its possible versions, parse each version, and check whether it matches the version requirement of the dependency.
I can assume that the first semver match is always the right one, since I get the data from the db in descending order. Now, to my problem. I implemented this logic, but then found that running it is going to take a long time: around 3 hours on my hardware, and a lot more in production. This is not acceptable. I've pinpointed my problem: I always create a new iterator and then consume it with the find method. Is there any way to find a value inside a Vec without consuming the iterator?
I have the following structs:
#[derive(Debug, Clone, sqlx::FromRow)]
pub struct CargoCrateVersionRow {
pub id: i32,
pub crate_id: i32,
pub version_num: String,
pub created_at: sqlx::types::chrono::NaiveDateTime,
pub updated_at: sqlx::types::chrono::NaiveDateTime,
pub downloads: i32,
pub features: sqlx::types::Json<HashMap<String, Vec<String>>>,
pub yanked: bool,
pub license: Option<String>,
pub crate_size: Option<i32>,
pub published_by: Option<i32>,
}
#[derive(Debug, sqlx::FromRow)]
pub struct CargoDependenciesRow {
pub version_id: i32,
pub crate_id: i32,
pub req: String,
pub optional: bool,
pub features: Vec<String>,
pub kind: i32,
}
#[derive(Debug)]
pub struct CargoCrateVersionDependencyEdge {
pub version_id_from: i32,
pub version_id_to: i32,
pub optional: bool,
pub with_features: Vec<String>,
pub kind: i32,
}
SQL command for retrieving crate_dependencies:
select version_id, crate_id, req, optional, features, kind
from dependencies;
SQL command for retrieving crate_versions:
select
id, crate_id, num "version_num", created_at, updated_at,
downloads, features, yanked, license, crate_size, published_by
from
versions
order by
id desc;
My custom logic:
let mut cargo_crate_dependency_edges: Vec<CargoCrateVersionDependencyEdge> = vec![];
let mut i = 0;
for dep in crate_dependencies {
if i % 10_000 == 0 {
println!("done 10k");
}
let req = skip_fail!(VersionReq::parse(dep.req.as_str()));
// I do not wish to create new iter each time, as this includes millions of structs
let possible_crate_version = crate_versions.iter().find(|s| {
if s.crate_id != dep.crate_id {
return false;
}
// Here, I'm using semver crate
if let Ok(parsed_version) = Version::parse(&s.version_num) {
req.matches(&parsed_version)
} else {
false
}
});
if let Some(found_crate_version) = possible_crate_version {
cargo_crate_dependency_edges.push(
CargoCrateVersionDependencyEdge {
version_id_from: dep.version_id,
version_id_to: found_crate_version.id,
optional: dep.optional,
with_features: dep.features.clone(),
kind: dep.kind,
}
);
}
i += 1;
}
println!("{}", cargo_crate_dependency_edges.len());
After some time debugging, I've come to the conclusion that I need to somehow get rid of making the iterator, because that's my bottleneck. I've benchmarked the library, pushing onto the vec, basically everything.
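A note on the diagnosis: calling .iter() on a Vec is cheap, since Rust iterators are lazy and creating one does not copy anything. The expensive part is more likely the find itself, which scans up to the whole versions Vec for every dependency. One common way around that is to group the versions by crate_id once, keeping the descending order inside each group, and then search only the matching group. Below is a minimal sketch using only the standard library. VersionRow is a trimmed stand-in for CargoCrateVersionRow, index_by_crate and find_version are made-up helper names, and the matches closure stands in for the VersionReq check from the semver crate.

```rust
use std::collections::HashMap;

// Trimmed stand-in for CargoCrateVersionRow from the question.
#[derive(Debug)]
struct VersionRow {
    id: i32,
    crate_id: i32,
    version_num: String,
}

/// Group rows by crate_id, preserving the input order inside each group,
/// so "first match wins" still holds for rows sorted by id descending.
fn index_by_crate(rows: &[VersionRow]) -> HashMap<i32, Vec<&VersionRow>> {
    let mut index: HashMap<i32, Vec<&VersionRow>> = HashMap::new();
    for row in rows {
        index.entry(row.crate_id).or_default().push(row);
    }
    index
}

/// Return the first version of `crate_id` accepted by `matches`.
/// `matches` is a placeholder for parsing the version and testing it
/// against a semver requirement.
fn find_version<'a>(
    index: &HashMap<i32, Vec<&'a VersionRow>>,
    crate_id: i32,
    matches: impl Fn(&str) -> bool,
) -> Option<&'a VersionRow> {
    index
        .get(&crate_id)?
        .iter()
        .copied()
        .find(|row| matches(row.version_num.as_str()))
}

fn main() {
    // Rows arrive ordered by id descending, as in the SQL query above.
    let rows = vec![
        VersionRow { id: 30, crate_id: 2, version_num: "0.3.0".into() },
        VersionRow { id: 20, crate_id: 1, version_num: "1.1.0".into() },
        VersionRow { id: 10, crate_id: 1, version_num: "1.0.0".into() },
    ];
    let index = index_by_crate(&rows);

    // Only the two rows of crate 1 are scanned, not the whole Vec.
    let hit = find_version(&index, 1, |v| v.starts_with("1."));
    println!("{:?}", hit.map(|row| row.id)); // prints Some(20)
}
```

If parsing turns out to matter as well, the version strings could be parsed once while building the index rather than inside every find call.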

Why can I not get a specific field from a Result of a Vec of a structure?

I have a structure and a function call:
pub fn get_exchanges(database: &str) -> Result<Vec<Exchanges>> {
todo!()
}
pub struct Exchanges {
pub rowid: u64,
pub table_name: String,
pub profile_id: String,
pub name: String,
pub etl_type: i16,
pub connection_info: String,
pub update_flag: bool,
pub date_last_update: String,
pub gui_show: bool,
pub reports_flag: bool,
}
I want to pull out the name within the structure and do some work on it. I have an iterator but cannot figure out how to get to the specific item called name.
let exchanges_list = get_exchanges("crypto.db");
for name in exchanges_list {
println!("exchange {:?}", name);
//ui.horizontal(|ui| ui.label(exchange));
}
The result is
Exchanges { rowid: 4, table_name: "exchanges", profile_id: "None", name: "coinbasepro_noFlag", etl_type: 2, connection_info: "{\"name\":\"coinbase_pro\",\"base_url\":\"https://api.pro.coinbase.com\",\"key\":\"[redacted]\",\"secret\":\"[redacted]\",\"passphrase\":\"[redacted]\"}", update_flag: false, date_last_update: "2009-01-02 00:00:00 UTC", gui_show: true, reports_flag: true }

get_exchanges returns a Result, which represents either success or failure. This type implements IntoIterator and is therefore a valid value to be iterated in a for loop, but in this case that's just iterating over the single Vec value. If you want to iterate the Vec itself then you need to unwrap the Result. The standard way to do this is using the ? operator, which will propagate any error out of your function and unwrap the Result into the contained value:
let exchanges_list = get_exchanges("crypto.db")?;
// Add this operator ^
If your function does not return a compatible Result type, you can instead use get_exchanges("crypto.db").unwrap(), which will panic and crash the whole program at runtime if the Result represents a failure. This may be an acceptable alternative if your program doesn't have anything useful to do when encountering an error at this point.

First, you need to call .unwrap() since the value is a Result; that’ll let you use the Vec inside. Then, you can iterate with .iter() if you want to get the names:
let exch_lst = get_exchanges("crypto.db").unwrap();
for exch in exch_lst.iter() {
    println!("exchange: {}", exch.name.as_str());
}
There are other methods besides for + iter; this is just the one that seemed to fit best with what you already had.
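To make the difference concrete, here is a small self-contained sketch. The get_exchanges below is a hypothetical stand-in that returns fixed sample data with a String error type, and exchange_names is a made-up helper; only the name field of the struct is kept.

```rust
// Trimmed stand-in for the Exchanges struct: only the field we care about.
#[derive(Debug)]
struct Exchange {
    name: String,
}

// Hypothetical stand-in for get_exchanges: returns fixed sample data.
fn get_exchanges(_database: &str) -> Result<Vec<Exchange>, String> {
    Ok(vec![
        Exchange { name: "coinbasepro_noFlag".to_string() },
        Exchange { name: "sample_exchange".to_string() },
    ])
}

/// Unwrap the Result with `?`, then pull `name` out of each element.
fn exchange_names(database: &str) -> Result<Vec<String>, String> {
    let exchanges = get_exchanges(database)?;
    Ok(exchanges.into_iter().map(|e| e.name).collect())
}

fn main() {
    // Iterating the Result itself yields one item: the whole Vec.
    let items_in_result = get_exchanges("crypto.db").into_iter().count();
    println!("items when iterating the Result: {}", items_in_result);

    // Iterating the unwrapped Vec yields the individual structs.
    match exchange_names("crypto.db") {
        Ok(names) => {
            for name in names {
                println!("exchange {}", name);
            }
        }
        Err(e) => eprintln!("could not load exchanges: {}", e),
    }
}
```

Handling the Err case with match, as in main above, is a third option alongside ? and unwrap() when the function cannot return a Result but should not crash either.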

How do I load SQLX records to Vec of structs in Rust

I have a table named instruments with the following fields:
id,
instrument_token (integer)
tradingsymbol (nullable string field)
I have defined a Rust struct as below
pub struct Instrument {
pub id: i64,
pub instrument_token: i32,
pub tradingsymbol: Option<String>,
}
I query and create a Vec<Instrument> inside a function as follows using SQLX
let records = sqlx::query!(r"select * from instruments").fetch_all(&app_context.db_connection).await?;
let mut all_instruments: Vec<Instrument> = Vec::new();
for rec in records {
all_instruments.push(Instrument {
id: rec.id,
instrument_token: rec.instrument_token,
tradingsymbol: rec.tradingsymbol,
});
}
Here &app_context.db_connection is &pool instance.
Is there a better way to load records into a struct using SQLX?
If yes how?

If your record and data type have the same field names and types, you can use query_as! instead:
let records: Vec<Instrument> =
sqlx::query_as!(Instrument, r"select * from instruments")
.fetch_all(&app_context.db_connection)
.await?;

Rust lifetime scoping in structs

So, I'm working on porting a string tokenizer that I wrote in Python over to Rust, and I've run into an issue I can't seem to get past with lifetimes and structs.
So, the process is basically:
Get an array of files
Convert each file to a Vec<String> of tokens
Use a Counter and UniCase to get counts of individual instances of tokens from each vec
Save that count in a struct, along with some other data
(Future) do some processing on the set of Structs to accumulate the total data along side the per-file data
struct Corpus<'a> {
words: Counter<UniCase<&'a String>>,
parts: Vec<CorpusPart<'a>>
}
pub struct CorpusPart<'a> {
percent_of_total: f32,
word_count: usize,
words: Counter<UniCase<&'a String>>
}
fn process_file(entry: &DirEntry) -> CorpusPart {
let mut contents = read_to_string(entry.path())
.expect("Could not load contents.");
let tokens = tokenize(&mut contents);
let counted_words = collect(&tokens);
CorpusPart {
percent_of_total: 0.0,
word_count: tokens.len(),
words: counted_words
}
}
pub fn tokenize(normalized: &mut String) -> Vec<String> {
// snip ...
}
pub fn collect(results: &Vec<String>) -> Counter<UniCase<&'_ String>> {
results.iter()
.map(|w| UniCase::new(w))
.collect::<Counter<_>>()
}
However, when I try to return CorpusPart it complains that it is trying to reference a local variable tokens. How can/should I deal with this? I tried adding lifetime annotations, but couldn't figure it out...
Essentially, I no longer need the Vec<String>, but I do need some of the Strings that were in it for the counter.
Any help is appreciated, thank you!

The issue here is that you are throwing away Vec<String>, but still referencing the elements inside it. If you no longer need Vec<String>, but still require some of the contents inside, you have to transfer the ownership to something else.
I assume you want Corpus and CorpusPart to both point to the same Strings, so you are not duplicating Strings needlessly. If that is the case, either Corpus or CorpusPart must own the Strings, so that the one that doesn't own them references the Strings owned by the other. (It sounds more complicated than it actually is.)
I will assume CorpusPart owns the String, and Corpus just points to those strings
use std::fs::DirEntry;
use std::fs::read_to_string;
pub struct UniCase<a> {
test: a
}
impl<a> UniCase<a> {
fn new(item: a) -> UniCase<a> {
UniCase {
test: item
}
}
}
type Counter<a> = Vec<a>;
struct Corpus<'a> {
words: Counter<UniCase<&'a String>>, // Will reference the strings in CorpusPart (I assume you implemented this elsewhere)
parts: Vec<CorpusPart>
}
pub struct CorpusPart {
percent_of_total: f32,
word_count: usize,
words: Counter<UniCase<String>> // Has ownership of the strings
}
fn process_file(entry: &DirEntry) -> CorpusPart {
let mut contents = read_to_string(entry.path())
.expect("Could not load contents.");
let tokens = tokenize(&mut contents);
let length = tokens.len(); // Cache the length, as tokens will no longer be valid once passed to collect
let counted_words = collect(tokens);
CorpusPart {
percent_of_total: 0.0,
word_count: length,
words: counted_words
}
}
pub fn tokenize(normalized: &mut String) -> Vec<String> {
Vec::new()
}
pub fn collect(results: Vec<String>) -> Counter<UniCase<String>> {
results.into_iter() // Use into_iter() to consume the Vec that is passed in, and take ownership of the internal items
.map(|w| UniCase::new(w))
.collect::<Counter<_>>()
}
I aliased Counter<a> to Vec<a>, as I don't know what Counter you are using.
Playground
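If you want to try the ownership-transfer idea without the Counter and UniCase crates, here is a dependency-free sketch. It uses a plain HashMap<String, usize> in place of Counter and ASCII lowercasing in place of UniCase, so the case handling is only an approximation of what UniCase does. process_tokens is a made-up helper that takes already-tokenized input.

```rust
use std::collections::HashMap;

pub struct CorpusPart {
    pub word_count: usize,
    // Owns its keys, so nothing here borrows from the token Vec.
    pub words: HashMap<String, usize>,
}

/// Consume the tokens and count them case-insensitively (ASCII only).
/// Each String is moved into the map as a key or dropped if already present.
pub fn collect(tokens: Vec<String>) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for mut token in tokens {
        token.make_ascii_lowercase(); // in place, no new allocation
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}

pub fn process_tokens(tokens: Vec<String>) -> CorpusPart {
    // Read the length before `tokens` is moved into collect.
    let word_count = tokens.len();
    CorpusPart {
        word_count,
        words: collect(tokens),
    }
}

fn main() {
    let tokens = vec!["The".to_string(), "the".to_string(), "cat".to_string()];
    let part = process_tokens(tokens);
    println!("{} tokens, {} distinct", part.word_count, part.words.len());
}
```

Because CorpusPart owns its keys, it can be returned from a function freely; a Corpus that needs totals can either borrow from the parts or build its own owned map by summing them.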

Print all struct fields in Rust

I have around 10 structs with between 5-10 fields each and I want to be able to print them out using the same format.
Most of my structs look like this:
struct Example {
a: Option<String>,
b: Option<i64>,
c: Option<String>,
... etc
}
I would like to be able to define an impl for fmt::Display without having to enumerate the fields again, so there is no chance of missing one if a new one is added.
For the struct:
let eg = Example {
    a: Some("test".to_string()),
    b: Some(123),
    c: None,
};
I would like the output format:
a: test
b: 123
c: -
I currently am using #[derive(Debug)] but I don't like that it prints out Some(X) and None and a few other things.
If I know that all the values inside my structs are Option<T: fmt::Display> can I generate myself a Display method without having to list the fields again?

This may not be the most minimal implementation, but you can derive serialisable and use the serde crate. Here's an example of a custom serialiser: https://serde.rs/impl-serializer.html
In your case it may be much simpler (you need only a handful of types and can panic/ignore on anything unexpected).
Another approach could be to write a macro and create your own lightweight serialisation solution.

I ended up solving this with a macro. While it is not ideal it does the job.
My macro currently looks like this:
macro_rules! MyDisplay {
    ($struct:ident {$( $field:ident:$type:ty ),*,}) => {
        #[derive(Debug)]
        // `pub` sits inside the repetition so every field is public, not just the first
        pub struct $struct { $(pub $field: $type),* }
        impl fmt::Display for $struct {
            fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
                $(
                    write!(f, "{}: {}\n",
                        stringify!($field).to_string(),
                        match &self.$field {
                            None => "-".to_string(),
                            Some(x) => format!("{:#?}", x)
                        }
                    )?;
                )*
                Ok(())
            }
        }
    };
}
Which can be used like this:
MyDisplay! {
Example {
a: Option<String>,
b: Option<i64>,
c: Option<String>,
}
}
Playground with an example:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=cc089f8aecaa04ce86f3f9e0307f8785
My macro is based on the one here https://stackoverflow.com/a/54177889/1355121 provided by Cerberus
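One thing to be aware of: formatting the value with {:#?} prints Strings with their quotes, so the first line comes out as a: "test" rather than a: test. If every field is an Option of something that implements Display, a variant along the following lines should give the format asked for in the question. This is a sketch, not a drop-in replacement: the macro name display_struct is made up, and it accepts an optional trailing comma.

```rust
use std::fmt;

// Variant of the macro above: uses Display ({}) for the inner value,
// so it only works when every field is Option<T> with T: Display.
macro_rules! display_struct {
    ($name:ident { $( $field:ident : $ty:ty ),* $(,)? }) => {
        #[derive(Debug)]
        pub struct $name { $( pub $field: $ty ),* }

        impl fmt::Display for $name {
            fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
                $(
                    // One line per field: "name: value", or "name: -" for None.
                    match &self.$field {
                        Some(x) => writeln!(f, "{}: {}", stringify!($field), x)?,
                        None => writeln!(f, "{}: -", stringify!($field))?,
                    }
                )*
                Ok(())
            }
        }
    };
}

display_struct! {
    Example {
        a: Option<String>,
        b: Option<i64>,
        c: Option<String>,
    }
}

fn main() {
    let eg = Example {
        a: Some("test".to_string()),
        b: Some(123),
        c: None,
    };
    // Prints:
    // a: test
    // b: 123
    // c: -
    print!("{}", eg);
}
```

Writing straight into the formatter with writeln! also avoids building a temporary String for every field.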
