Writing polars dataframe as nested JSON - rust-polars

Using polars in Rust, how would I go about writing a dataframe as nested JSON? i.e.:
{
"randomKey": "bla",
"dataframe": [{"a": 1, "b": 2}, {"a": 2, "b3"}]// <polars dataframe representation>
}
Perhaps this is also a Rust serde_json question, but I basically want to include the buffer passed to JsonWriter::new(buf) as a value in a larger JSON object.
Perhaps there is no way to do this, and the best I can do is a poor man's approach and start concatenating buffers manually?
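One way to avoid manual buffer concatenation is to let JsonWriter serialize the frame into an in-memory buffer and then parse that buffer back into a serde_json::Value, which can be nested inside any larger object. A minimal sketch, assuming a polars build with the json feature enabled and a version whose JsonWriter accepts JsonFormat::Json (row-oriented array output rather than JSON lines):

use polars::prelude::*;
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut df = df!("a" => &[1, 2], "b" => &[2, 3])?;

    // Serialize the frame as a JSON array of row objects into an in-memory buffer.
    let mut buf = Vec::new();
    JsonWriter::new(&mut buf)
        .with_json_format(JsonFormat::Json) // array of objects, not newline-delimited
        .finish(&mut df)?;

    // Parse the buffer back into a serde_json::Value and nest it in a larger object.
    let rows: Value = serde_json::from_slice(&buf)?;
    let wrapped = json!({
        "randomKey": "bla",
        "dataframe": rows,
    });
    println!("{}", serde_json::to_string_pretty(&wrapped)?);
    Ok(())
}

The extra parse costs one pass over the serialized bytes, but it keeps the nesting logic entirely in serde_json instead of hand-built string concatenation.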

Related

Impute and Add new calculated column with Rust DataFusion?

Consider a JSON data file named test_file.json with the following content:
{"a": 1, "b": "hi", "c": 3}
{"a": 5, "b": null, "c": 7}
Here is how I read the file with the DataFrame API of DataFusion:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let file_path = "datalayers/landing/test_file.json";
    let mut ctx = SessionContext::new();
    let df = ctx.read_json(file_path, NdJsonReadOptions::default()).await?;
    df.show().await?;
    Ok(())
}
I would like to do the following operations:
Impute the null values in the b column with an empty string "", either using a fill-null operation or a CASE WHEN statement.
Create a new calculated column by combining columns a and b: col("a") + col("b").
I have gone through the API documentation but could not find any function like with_column, which Spark has for adding a new column, nor anything on how to impute the null values.
I can add two columns with the column expression col("a").add(col("c")).alias("d"), but I was curious whether it is possible to use something like with_column to add a new column.
DataFusion's DataFrame does not currently have a with_column method, but I think it would be good to add it. I filed an issue for this: https://github.com/apache/arrow-datafusion/issues/2844
Until that is added, you could call DataFrame::select (https://docs.rs/datafusion/9.0.0/datafusion/dataframe/struct.DataFrame.html#method.select) to select the existing columns as well as the new expression:
df.select(vec![col("a"), col("b"), col("c"), col("a").add(col("c")).alias("d")]);
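The imputation can be folded into the same select with a CASE WHEN expression. A rough sketch, assuming the when/is_null/lit expression helpers exported by the DataFusion 9.0 prelude (names and error handling may differ slightly between versions):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = SessionContext::new();
    let df = ctx
        .read_json("datalayers/landing/test_file.json", NdJsonReadOptions::default())
        .await?;

    // Replace nulls in `b` with "" and add a computed column `d` = a + c in one select.
    let df = df.select(vec![
        col("a"),
        when(col("b").is_null(), lit("")).otherwise(col("b"))?.alias("b"),
        col("c"),
        col("a").add(col("c")).alias("d"),
    ])?;

    df.show().await?;
    Ok(())
}

The select lists every existing column explicitly, which is exactly the boilerplate the with_column issue above would streamline.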

Dynamic Time Analysis using PySpark

Suppose we have a dataset with the following structure:
df = sc.parallelize([['a','2015-11-27', 1], ['a','2015-12-27',0], ['a','2016-01-29',0], ['b','2014-09-01', 1], ['b','2015-05-01', 1] ]).toDF(("user", "date", "category"))
What I want to analyze is the users' attributes with regard to their lifetime in months. For example, I want to sum up the column "category" for each month of a user's lifetime. For user 'a', this would look like:
output = sc.parallelize([['a',0, 1], ['a',1,0], ['a',2,0]]).toDF(("user", "user_lifetime_in_months", "sum(category)"))
What is the most efficient way in Spark to do that? E.g., window functions?

couchbase add subdocument unique array values

I have a couchbase document as
{
"last": 123,
"data": [
[0, 1.1],
[1, 2.3]
]
}
I currently have code to upsert the document to change the last property and add values to the data array; however, I cannot find a way to insert only unique values. I'd like to avoid fetching the whole document and doing the filtering in JavaScript. Is there any way to do this in Couchbase?
arrayAddUnique will fail because there are floats in the subarrays, per the Couchbase docs.
.mutateIn(`document`)
.upsert("last", 234)
.arrayAppend("data", newDataArray)
.execute( ... )

How to compare complete JSON objects

Is there any way to compare two JSON objects using ChaiJS? I want to compare them deeply, down to the leaf nodes. Ideally the comparison would ignore the order of the siblings within the object and only validate structure, values, and the datatype of values. Any help is appreciated.
I just saw the following code, but I am not sure what equalRec is:
expect({ a: 3, b: {c: '2'} }).to.equalRec({ a: 3, b: {c: 2} }) //expecting false
expect({ a: 3, b: {c: '2'} }).to.equalRec({ b: {c: '2'}, a: 3 }) //expecting true
First of all, there is no such thing as a "JSON object": you have object literals. And as for "ignore the order of the siblings", object keys have no order.
To compare two objects you could use the deep flag:
expect(obj1).to.be.deep.equal(obj2)

How to attach metadata to a double column in PySpark

I have a double-typed column in a dataframe that holds the class label for a Random Forest training set.
I would like to manually attach metadata to the column so that I don't have to pass the dataframe into a StringIndexer as suggested in another question.
The easiest method of doing this seems to be by using the as method of Column.
However, this method is not available in Python.
Is there an easy workaround?
If there is no easy workaround and the best approach is a Python port of as, then why is the method not ported in Python?
Is there a difficult technical reason, or is it simply that it conflicts with the as keyword in Python and no one has volunteered to port it?
I looked at the source code and found that the alias method in Python internally calls the as method in Scala.
import json
from pyspark.sql.column import Column

def add_meta(col, metadata):
    meta = sc._jvm.org.apache.spark.sql.types \
        .Metadata.fromJson(json.dumps(metadata))
    return Column(getattr(col._jc, "as")('', meta))

# sample invocation
df.withColumn('label',
              add_meta(df.classification,
                       {"ml_attr": {
                           "name": "label",
                           "type": "nominal",
                           "vals": ["0.0", "1.0"]
                       }}))\
  .show()
This solution calls the Scala method as(alias: String, metadata: Metadata) from Python. Because as is a Python keyword, the method is retrieved with getattr(col._jc, "as"), where col is a dataframe column (a Column object).
The returned function must then be called with two arguments: the first is just a string, and the second is a Metadata object. That object is created by calling Metadata.fromJson(), which expects a JSON string as its parameter; the Metadata class itself is reached via the _jvm attribute of the Spark context.
Spark 3.3+
df.withMetadata("col_name", meta_dict)
Spark 2.2+
df.withColumn("col_name", df.col_name.alias("", metadata=meta_dict))
meta_dict can be a complex dictionary, as provided in the other answer:
meta_dict = {
    "ml_attr": {
        "name": "label",
        "type": "nominal",
        "vals": ["0.0", "1.0"]
    }
}
