Diagonally concatenate Polars DataFrames but lift nulls of nested structs rather than nulling out child fields - rust-polars

I have a heterogeneous container that holds a timestamp and one of several inner message types, which I serialize into JSON in batches and load into DataFrames.
#[derive(serde::Serialize)]
pub struct Container {
    pub timestamp: Option<Timestamp>,
    pub sender_uid: u64,
    #[serde(flatten)]
    pub msg: Option<Msg>,
}

#[derive(serde::Serialize)]
pub enum Msg {
    ComponentOneStatus(ComponentOneStatus),
    ComponentTwoStatus(ComponentTwoStatus),
    // ... several more variants
}

#[derive(serde::Serialize)]
struct ComponentOneStatus {
    fieldA: Vec<f32>,
    fieldB: SomeOtherStruct,
    // ... many more fields
}

#[derive(serde::Serialize)]
struct ComponentTwoStatus {
    fieldC: u64,
    fieldD: YetAnotherStruct,
    // ... many more fields
}
The JSON looks like:
[
  {
    "timestamp": {"seconds": 21121212, "nanos": 1212121},
    "ComponentOneStatus": {"fieldA": [0.3, -2.3, 3.3], "fieldB": {"nestedOther1": 4, ...}}
  },
  {
    "timestamp": {"seconds": 434334, "nanos": 1212},
    "ComponentTwoStatus": {"fieldC": 9, "fieldD": {"differentNestedProp": 4, ...}}
  }
]
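For context, each batch is loaded from the JSON bytes roughly like this (a sketch; the name batch_of_containers and the reader defaults are illustrative, and polars' json feature is assumed):
// Illustrative only: read one batch of serialized containers into a DataFrame,
// letting polars infer the nested schema from the JSON.
let json_bytes = serde_json::to_vec(&batch_of_containers)?; // batch_of_containers: Vec<Container>
let df = JsonReader::new(std::io::Cursor::new(json_bytes)).finish()?;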
I immediately run some analytics on that batch in a dataframe, then I hold a reference to it and diagonally concatenate it with previous batches and store those in a parquet file:
// Diagonal concat fills columns that are missing from a batch with nulls.
let mut combined = diag_concat_lf([df1.lazy(), df2.lazy(), df3.lazy()], true, true)?
    .collect()?;
let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis();
let buf = File::create(format!("data/parquet/{now}.parquet"))?;
ParquetWriter::new(buf).finish(&mut combined)?;
Post-diagonal concatenation, every record in the Parquet file carries every field of every nested struct. For instance, a record that contained only ComponentTwoStatus looks like:
{
timestamp: {...},
ComponentOneStatus: { fieldA: NULL, fieldB: NULL},
ComponentTwoStatus: {fieldC: 9, fieldD: {differentNestedProp: 4 ...}}
}
Is it possible to lift the nulls so that the record is stored as ComponentOneStatus: NULL?
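What I'm imagining is something along these lines, though I haven't found the right expression yet (a rough, untested sketch; keying off a child field that is always set in a populated ComponentTwoStatus, here fieldC, is my assumption, and the lazy + dtype-struct features are required):
// Rough sketch: if a field that is always present in a populated
// ComponentTwoStatus (assumed here to be fieldC) is null, replace the whole
// struct with a null value instead of a struct full of nulls.
let lifted = combined
    .lazy()
    .with_column(
        when(
            col("ComponentTwoStatus")
                .struct_()
                .field_by_name("fieldC")
                .is_null(),
        )
        .then(lit(NULL)) // may need a cast back to the struct dtype
        .otherwise(col("ComponentTwoStatus"))
        .alias("ComponentTwoStatus"),
    )
    .collect()?;
Doing that for every variant column feels clunky, so a built-in way to lift the nulls during (or after) the diagonal concat would be ideal.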
Additionally, I get a stack overflow if I try to diag_concat more than ~10 batches (1024 records each).
Should I instead concatenate the JSON buffers and re-deserialize?
Is there a way to skip the JSON step entirely?
Note that the nested objects are very complex protobuf messages, so I would prefer not to write a schema by hand because they are externally moving targets.
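If concatenating the buffers and deserializing once turns out to be the better route, this is roughly what I would try (a sketch; it assumes I keep every batch around as all_batches: Vec<Vec<Container>>):
// Illustrative only: serialize everything once so schema inference sees all
// variants together, then write a single Parquet file.
let all: Vec<&Container> = all_batches.iter().flatten().collect();
let json_bytes = serde_json::to_vec(&all)?;
let mut combined = JsonReader::new(std::io::Cursor::new(json_bytes)).finish()?;
ParquetWriter::new(File::create(format!("data/parquet/{now}.parquet"))?)
    .finish(&mut combined)?;
But that seems wasteful compared to working with the frames I already have.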
Thanks!

Related

How to repeat all field keys and set new values in the $project stage of a MongoDB aggregation?

I have a schema model with many fields, like the example below. I want to clone the fields under suggest_foo into the $project stage of an aggregation without manually rewriting them, and then set each field to a new value according to my logic.
Example:
schema model:
const fooSchema = new Schema({
  suggest_foo: {
    foo1: String,
    foo2: String,
    foo3: String,
    foo4: String,
    foo5: String,
  },
  ...
})
seeds data:
{
suggest_foo: {
foo1: 'Foo1',
foo2: 'Foo2',
foo3: 'Foo3',
foo4: 'Foo4',
foo5: 'Foo5',
}
}
aggregate query code:
fooSchema.aggregate([
...
{
$project: {
// I want to clone root in suggest_foo (eg: foo1, foo2, foo(n)...) to be here.
}
}
])
The output result that I expect looks like:
{
foo1: 'Foo1 maybe you like',
foo2: 'Foo2 maybe you like',
foo3: 'Foo3 maybe you like',
foo4: 'Foo4 maybe you like',
foo5: 'Foo5 maybe you like',
}
One option is to use $replaceRoot with $arrayToObject and $objectToArray, since this allows you to manipulate the fields as an array in a loop:
db.collection.aggregate([
  {$replaceRoot: {
    newRoot: {
      $arrayToObject: {
        $map: {
          input: {$objectToArray: "$suggest_foo"},
          in: {
            v: {$concat: ["$$this.v", " maybe you like"]},
            k: "$$this.k"
          }
        }
      }
    }
  }}
])
See how it works on the playground example

Is it possible to group up all documents returned from a query into a dictionary-like structure based on one of their fields? [duplicate]

I have a collection in my MongoDB:
{ userId: 1234, name: 'Mike' }
{ userId: 1235, name: 'John' }
...
I want to get a result of the form
dict[userId] = document
in other words, I want a result that is a dictionary where the userId is the key and the rest of the document is the value.
How can I do that?
You can use $arrayToObject to do that; you just need to format the data into an array of {k, v} pairs first.
It is not clear if you want one dictionary for all documents, or each document in a dictionary format. I guess you want the first option, but I'm showing both:
One dictionary with all data*, which requires a $group (that also formats the data):
db.collection.aggregate([
  {
    $group: {
      _id: null,
      data: {$push: {k: {$toString: "$userId"}, v: "$$ROOT"}}
    }
  },
  {
    $project: {data: {$arrayToObject: "$data"}}
  },
  {
    $replaceRoot: {newRoot: "$data"}
  }
])
See how it works on the playground example - one dict
*Notice that in this option all the data is packed into a single document, and a document has a size limit.
Dictionary format: If you want to get all documents as different results, but with a dictionary format, just replace the first step of the aggregation with this:
{
$project: {
data: [{k: {$toString: "$userId"}, v: "$$ROOT"}],
_id: 0
}
},
See how it works on the playground example - dict per document

Getting the value from a column or row after a query in Mysql or Mysql_async

The examples for all Rust MySQL drivers assume that the data ends up in a known struct.
In the following we see a query mapped into a Payment struct:
struct Payment {
    customer_id: i32,
    amount: i32,
    account_name: Option<String>,
}

// Load payments from database. Type inference will work here.
let loaded_payments = conn.exec_map(
    "SELECT customer_id, amount, account_name FROM payment",
    (),
    |(customer_id, amount, account_name)| Payment { customer_id, amount, account_name },
).await?;
The table schema and the column order need to be known.
What if the unbelievable happens and the schema is unknown? Or we issue SELECT * FROM payment, where the column order is not guaranteed.
I'm using mysql_async, although it seems that mysql has a very similar API.
I managed to get to this point, noting the use of Row since the type is unknown.
let results: Result<Vec<Row>> = conn.query("SELECT * FROM person LIMIT 1");
for row in results?.iter() {
    println!("Row: {:?}", row);
    // Prints the row and shows the columns with Bytes values,
    // such as Row { firstname: Bytes("qCEgkGSJ.."), lastname: Bytes("lYmsvbhT..") ... }
    let columns = row.columns();
    for index in 0..columns.len() { // the index is needed for row.get below
        let col = &columns[index];
        println!("Type: {:?}", col.column_type()); // MYSQL_TYPE_STRING works
        println!("Name: {:?}", col.name_str()); // "firstname" correct
        // Now the difficulty. Not sure if we can get the value from row
        // using the same index.
        // We ask for String because of the MYSQL_TYPE_STRING, and it panics:
        let v: std::option::Option<String> = row.get(index);
        // panicked at 'Could not retrieve alloc::string::String from Value'
    }
}
I'm unsure whether to get the value from row.get(index), and whether the index from columns is valid for row.
I had the same problem with NULL values from DB.
Here is a workaround that might be helpful for someone. Using take_opt, you take control over the panics.
let mut t: String = "".to_string();
let mut b: String = "".to_string();
let title = row.take_opt("title"); // column name assumed to match the variable
if let Some(title) = title {
    match title {
        Ok(title) => t = title,
        Err(_) => t = "".to_string(),
    }
}
let body = row.take_opt("description");
if let Some(body) = body {
    match body {
        Ok(body) => b = body,
        Err(_) => b = "".to_string(),
    }
}
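Another way to avoid the panic entirely is to not ask for a concrete Rust type at all and instead match on the raw value (a sketch, untested; it assumes the Value enum re-exported by mysql/mysql_async and the Row::as_ref(index) accessor from mysql_common):
use mysql_async::Value;

for index in 0..row.columns().len() {
    // as_ref returns the raw wire value without converting it to a Rust type,
    // so NULLs and unexpected types never panic.
    match row.as_ref(index) {
        Some(Value::NULL) => println!("col {index}: NULL"),
        Some(Value::Bytes(bytes)) => println!("col {index}: {}", String::from_utf8_lossy(bytes)),
        Some(Value::Int(i)) => println!("col {index}: {i}"),
        Some(other) => println!("col {index}: {:?}", other),
        None => println!("col {index}: no such column"),
    }
}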

How to group results in ArangoDB into a single record?

I have a list of events of certain types, structured as in the following example:
{
createdAt: 123123132,
type: STARTED,
metadata: {
emailAddress: "foo#bar.com"
}
}
The number of types is predefined (START, STOP, REMOVE, ...). Users produce one or more events over time.
I want to get the following aggregation:
For each user, calculate the number of events for each type.
My AQL query looks like this:
FOR event IN events
COLLECT
email = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
LIMIT 10
RETURN {
email,
t: {type, count}
}
This produces the following output:
{ email: '_84#example.com', t: { type: 'CREATE', count: 203 } }
{ email: '_84#example.com', t: { type: 'DEPLOY', count: 214 } }
{ email: '_84#example.com', t: { type: 'REMOVE', count: 172 } }
{ email: '_84#example.com', t: { type: 'START', count: 204 } }
{ email: '_84#example.com', t: { type: 'STOP', count: 187 } }
{ email: '_95#example.com', t: { type: 'CREATE', count: 189 } }
{ email: '_95#example.com', t: { type: 'DEPLOY', count: 173 } }
{ email: '_95#example.com', t: { type: 'REMOVE', count: 194 } }
{ email: '_95#example.com', t: { type: 'START', count: 213 } }
{ email: '_95#example.com', t: { type: 'STOP', count: 208 } }
...
i.e. I got a row for each type. But I want results like this:
{ email: foo#bar.com, count1: 203, count2: 214, count3: 172 ...}
{ email: aaa#fff.com, count1: 189, count2: 173, count3: 194 ...}
...
OR
{ email: foo#bar.com, CREATE: 203, DEPLOY: 214, ... }
...
i.e. to group the results again.
I also need to sort the results (not the events) by the counts: to return e.g. the top 10 users with the largest number of CREATE events.
How to do that?
ONE SOLUTION
One solution is here, check the accepted answer for more.
FOR a in (FOR event IN events
COLLECT
emailAddress = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
COLLECT email = emailAddress INTO perUser KEEP type, count
RETURN MERGE(PUSH(perUser[* RETURN {[LOWER(CURRENT.type)]: CURRENT.count}], {email})))
SORT a.create desc
LIMIT 10
RETURN a
You could group by user and event type, then group again by user keeping only the type and already calculated event type counts. In the second aggregation, it is important to know into which groups the events fall to construct the result. An array inline projection can be used for that to keep the query short:
FOR event IN events
COLLECT
emailAddress = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
COLLECT email = emailAddress INTO perUser KEEP type, count
RETURN MERGE(PUSH(perUser[* RETURN {[CURRENT.type]: CURRENT.count}], {email}))
Another way would be to group by user and keep event types, then group the types in a subquery. But it is significantly slower in my test (without any indexes defined at least):
FOR event IN events
LET type = event.type
COLLECT
email = event.metadata.emailAddress INTO groups KEEP type
LET byType = (
FOR t IN groups[*].type
COLLECT t2 = t WITH COUNT INTO count
RETURN {[t2]: count}
)
RETURN MERGE(PUSH(byType, {email}))
Returning the top 10 users with the most CREATE events is much simpler. Filter for CREATE event type, then group by user and count the number of events, sort by this number in descending order and return the first 10 results:
FOR event IN events
FILTER event.type == "CREATE"
COLLECT email = event.metadata.emailAddress WITH COUNT INTO count
SORT count DESC
LIMIT 10
RETURN {email, count}
EDIT1: Return one document per user with event types grouped and counted (like in the first query), but capture the MERGE result, sort by the count of one particular event type (here: CREATE) and return the top 10 users for this type. The result is the same as with the solution given in the question. It spares the subquery a la FOR a IN (FOR event IN events ...) ... RETURN a however:
FOR event IN events
COLLECT
emailAddress = event.metadata.emailAddress,
type = event.type WITH COUNT INTO count
COLLECT email = emailAddress INTO perUser KEEP type, count
LET ret = MERGE(PUSH(perUser[* RETURN {[CURRENT.type]: CURRENT.count}], {email}))
SORT ret.CREATE DESC
LIMIT 10
RETURN ret
EDIT2: Query to generate example data (requires a collection events to exist):
FOR i IN 1..100
LET email = CONCAT(RANDOM_TOKEN(RAND()*4+4), "#example.com")
FOR j IN SPLIT("CREATE,DEPLOY,REMOVE,START,STOP", ",")
FOR k IN 1..RAND()*150+50
INSERT {metadata: {emailAddress: email}, type: j} INTO events RETURN NEW

Replacing an object in an object array in Redux Store using Javascript/Lodash

I have an object array in a reducer that looks like this:
[
  { id: 1, name: 'Mark', email: 'mark#email.com' },
  { id: 2, name: 'Paul', email: 'paul#gmail.com' },
  { id: 3, name: 'sally', email: 'sally#email.com' }
]
Below is my reducer. So far, I can add a new object to the currentPeople reducer via the following:
const INITIAL_STATE = { currentPeople: [] };
export default function(state = INITIAL_STATE, action) {
  switch (action.type) {
    case ADD_PERSON:
      return {...state, currentPeople: [ ...state.currentPeople, action.payload]};
  }
  return state;
}
But here is where I'm stuck. Can I UPDATE a person via the reducer using lodash?
If I sent an action payload that looked like this:
{id: 1, name: 'Eric', email: 'Eric#email.com'}
Would I be able to replace the object with the id of 1 with the new fields?
Yes you can absolutely update an object in an array like you want to. And you don't need to change your data structure if you don't want to. You could add a case like this to your reducer:
case UPDATE_PERSON:
  return {
    ...state,
    currentPeople: state.currentPeople.map(person => {
      if (person.id === action.payload.id) {
        return action.payload;
      }
      return person;
    }),
  };
This can be shortened as well, using implicit returns and a ternary:
case UPDATE_PERSON:
  return {
    ...state,
    currentPeople: state.currentPeople.map(person => (person.id === action.payload.id) ? action.payload : person),
  };
Mihir's idea about mapping your data to an object with normalizr is certainly a possibility and technically it'd be faster to update the user with the reference instead of doing the loop (after initial mapping was done). But if you want to keep your data structure, this approach will work.
Also, mapping like this is just one of many ways to update the object, and it requires browser support for Array.prototype.map(). You could use lodash's indexOf() to find the index of the user you want (this is nice because it stops as soon as it succeeds, instead of continuing as .map would do); once you have the index, you can overwrite the object directly using it. Make sure you don't mutate the Redux state, though: you'll need to be working on a clone if you want to assign like this: clonedArray[foundIndex] = action.payload;.
This is a good candidate for data normalization. You can effectively replace your data with the new one, if you normalize the data before storing it in your state tree.
This example is straight from Normalizr.
[{
  id: 1,
  title: 'Some Article',
  author: {
    id: 1,
    name: 'Dan'
  }
}, {
  id: 2,
  title: 'Other Article',
  author: {
    id: 1,
    name: 'Dan'
  }
}]
Can be normalized this way-
{
  result: [1, 2],
  entities: {
    articles: {
      1: {
        id: 1,
        title: 'Some Article',
        author: 1
      },
      2: {
        id: 2,
        title: 'Other Article',
        author: 1
      }
    },
    users: {
      1: {
        id: 1,
        name: 'Dan'
      }
    }
  }
}
What's the advantage of normalization?
You get to extract the exact part of your state tree that you want.
For instance, you have an array of objects containing information about the articles. If you want to select a particular object from that array, you'll have to iterate through the entire array, and in the worst case the desired object is not present at all. To overcome this, we normalize the data.
To normalize the data, store the unique identifiers of each object in a separate array. Let's call that array results.
result: [1, 2, 3 ..]
And transform the array of objects into an object keyed by id (see the second snippet). Call that object entities.
Ultimately, to access the object with id 1, simply do this- entities.articles["1"].
If you want to replace the old data with new data, you can do this-
entities.articles["1"] = newObj;
Use the native splice method of arrays:
/* Find item index using lodash */
var index = _.indexOf(currentPeople, _.find(currentPeople, {id: 1}));
/* Replace item at index using splice */
currentPeople.splice(index, 1, {id: 1, name: 'Mark', email: 'mark#email.com'});
