Impute and add a new calculated column with Rust DataFusion?

Consider a JSON data file named test_file.json with the following content:
{"a": 1, "b": "hi", "c": 3}
{"a": 5, "b": null, "c": 7}
Here is how I read the file with the DataFrame API of DataFusion:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let file_path = "datalayers/landing/test_file.json";
    let ctx = SessionContext::new();
    // Read the newline-delimited JSON file into a DataFrame
    let df = ctx.read_json(file_path, NdJsonReadOptions::default()).await?;
    df.show().await?;
    Ok(())
}
I would like to perform the following operations:
Impute the null values in column b with an empty string "", either using a fill-na style function or a CASE WHEN expression.
Create a new calculated column combining columns a and b, e.g. col("a") + col("b").
I have gone through the API documentation but could not find a function like with_column, which Spark has for adding a new column, nor anything for imputing the null values.
To add two columns I can use the column expression col("a").add(col("c")).alias("d"), but I was curious whether it is possible to use something like with_column to add a new column.

DataFusion's DataFrame does not currently have a with_column method, but I think it would be good to add one. I filed an issue for this: https://github.com/apache/arrow-datafusion/issues/2844
Until that is added, you can call https://docs.rs/datafusion/9.0.0/datafusion/dataframe/struct.DataFrame.html#method.select to select the existing columns as well as the new expression:
df.select(vec![col("a"), col("b"), col("c"), col("a").add(col("c")).alias("d")]);
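For reference, here is a minimal sketch that covers both parts of the question, imputing b with a CASE WHEN expression and adding the calculated column, by re-selecting the existing columns alongside the new expressions. It assumes the when/otherwise case builder and lit are exported from datafusion::prelude in your DataFusion version:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_json("datalayers/landing/test_file.json", NdJsonReadOptions::default())
        .await?;

    // Re-select the existing columns, replacing "b" with a CASE WHEN that
    // imputes NULL as "", and appending the calculated column "d" = a + c.
    let df = df.select(vec![
        col("a"),
        when(col("b").is_null(), lit(""))
            .otherwise(col("b"))?
            .alias("b"),
        col("c"),
        (col("a") + col("c")).alias("d"),
    ])?;

    df.show().await?;
    Ok(())
}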

Related

Need to add a new column to a Dataset/Row in Spark, based on all existing columns

I have this (simplified) Spark dataset with these columns:
"col1", "col2", "col3", "col4"
And I would like to add a new column: "result".
The value of "result" is the return value of a function that takes all the other columns ("col1", "col2", ...) values as parameters.
map/foreach can't change the iterated row, and UDF functions don't take a whole row as a parameter, so I will have to collect all the column names as input, and I will also have to specify each column type in the UDF registration part.
Notes:
The dataset doesn't have a lot of rows, so I don't mind having a low performant solution.
The dataset does have a lot of columns with different types, so specifying all the columns in the UDF registration part doesn't seem like the most elegant solution.
The project is written in Java, so I'm using the Java API to interact with Spark.
How can I achieve that behavior?
You actually could add a new column with a map.
df.map { row =>
  val col1 = row.getAs[String]("col1")
  val col2 = row.getAs[String]("col2")
  // etc, extract all your columns
  ....
  // do what you need to do to obtain the value for the new column
  val newColumn = col1 + col2
  (col1, col2, ..., newColumn)
}.toDF("col1", "col2", ..., "new")
In terms of the Java API this will be much the same, with some adjustments:
data.map((MapFunction<Row, Tuple3<String, String, String>>) row -> {
    String col1 = row.getAs("col1");
    String col2 = row.getAs("col2");
    // whatever you need
    String newColumns = col1 + col2;
    return new Tuple3<>(col1, col2, newColumns);
}, Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
.toDF("col1", "col2", ..., "new")
Alternatively, you could collect all your columns into an array and then process that array in your UDF:
val transformer = udf { arr: Seq[Any] =>
  // do your stuff, but beware of types
}
data.withColumn("array", array($"col1", $"col2", ..., $"colN"))
    .select($"col1", $"col2", ..., transformer($"array") as "newCol")
I've found a solution for my question:
String[] allColumnsAsStrings = dataset.columns();
final Column[] allColumns = Arrays.stream(allColumnsAsStrings).toArray(Column[]::new);

UserDefinedFunction addColumnUdf = udf((Row row) -> {
    double score = 0.0;
    // Calculate stuff based on the row values
    // ...
    return score;
}, DataTypes.DoubleType);

dataset = dataset.withColumn("score", addColumnUdf.apply(functions.struct(allColumns)));

Pandas - Count unique values in column A that satisfy condition in column B grouped by column C

I have a fake dataset representing a list of areas. These areas contain members, and each member has a value.
I would like to count, for each area, the number of unique members whose value satisfies a condition. I managed to deal with the issue, but I would like to know if there is a cleaner way to do so in Pandas.
Here is my attempt so far:
import pandas as pd

# Building the fake dataset
dummy_dict = {
    "area": ["A", "A", "A", "A", "B", "B"],
    "member": ["O1", "O2", "O2", "O3", "O1", "O1"],
    "value": [90, 200, 200, 150, 120, 120],
}
df = pd.DataFrame(dummy_dict)

# Counting the number of unique members that satisfy the condition by area
value_cutoff = 100
df["nb_unique_members"] = df.groupby("area")["member"].transform("nunique")
df.loc[df["value"] >= value_cutoff, "tmp"] = df.loc[df["value"] >= value_cutoff].groupby("area")["member"].transform("nunique")
df["nb_unique_members_above_cutoff"] = df.groupby("area")["tmp"].transform("mean")
df.head()
Is there a better way to do this in Pandas? Thanks in advance!
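Not from the original thread, but one possible cleaner approach (a sketch): compute the filtered nunique per area once and map it back onto the rows, so no temporary column is needed:

import pandas as pd

dummy_dict = {
    "area": ["A", "A", "A", "A", "B", "B"],
    "member": ["O1", "O2", "O2", "O3", "O1", "O1"],
    "value": [90, 200, 200, 150, 120, 120],
}
df = pd.DataFrame(dummy_dict)

value_cutoff = 100
# Count distinct members per area among the rows that meet the cutoff...
counts = df.loc[df["value"] >= value_cutoff].groupby("area")["member"].nunique()
# ...then map the per-area counts back onto every row; areas with no
# qualifying rows get 0 instead of NaN.
df["nb_unique_members_above_cutoff"] = df["area"].map(counts).fillna(0).astype(int)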

Pandas MultiIndex in which one factor is an Enum

I'm having trouble working with a DataFrame whose columns are a MultiIndex in which one of the iterables is an Enum. Consider the code:
from enum import Enum
import pandas as pd

MyEnum = Enum("MyEnum", "A B")
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[MyEnum, [1, 2]]))
This raises
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
This can be worked around by instead putting:
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[
    pd.Series(MyEnum, dtype="category"),
    [1, 2]
]))
but then appending a row with
df.append({(MyEnum.A, 1): "abc", (MyEnum.B, 2): "xyz"}, ignore_index=True)
raises the same TypeError as before.
I've tried various variations on this theme, with no success. (No problems occur if the columns are not a MultiIndex but just an Enum.)
(Note that I can dodge this by using an IntEnum instead of an Enum. But then my columns simply appear as numbers, which is why I wanted to use an Enum in the first place, as opposed to ints.)
Many thanks!
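Not from the original thread, but one possible workaround (a sketch): build the MultiIndex from the members' names (plain strings) instead of the Enum members themselves, so pandas never has to order the Enum values, while the column labels still read as names rather than numbers:

from enum import Enum
import pandas as pd

MyEnum = Enum("MyEnum", "A B")

# Use the member names for the outer level; strings sort fine, so neither
# from_product nor later alignment needs to order the Enum members.
df = pd.DataFrame(
    columns=pd.MultiIndex.from_product(iterables=[[e.name for e in MyEnum], [1, 2]])
)

# Rows are then keyed by (member.name, level); MyEnum.A.name == "A".
row = pd.DataFrame({(MyEnum.A.name, 1): ["abc"], (MyEnum.B.name, 2): ["xyz"]})
df = pd.concat([df, row], ignore_index=True)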

U-SQL Error in Naming the Column

I have a JSON where the order of fields is not fixed.
i.e. I can have [A, B, C] or [B, C, A]
A, B, and C are all JSON objects of the form {Name: x, Value: y}.
So, when I use U-SQL to extract the JSON (I don't know their order) and put it into a CSV (for which I will need column names):
#output =
    SELECT
        A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
        B["Value"] ?? "0" AS "System_" + B["Name"],
        System_da
So, I am trying to use the "Name" field in the JSON as the column name.
But I am getting the error at the ### above:
Message
syntax error. Expected one of: FROM ',' EXCEPT GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE ';' ')'
Resolution
Correct the script syntax, using expected token(s) as a guide.
Description
Invalid syntax found in the script.
Details
at token '(', line 74
near the ###:
**************
I am not able to set the correct column name "dynamically", and it is an absolute necessity for my issue.
Input: [A, B, C], [C, B, A]
Output: A.name B.name C.name
Row 1's values
Row 2's values
This
#output =
    SELECT
        A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
        B["Value"] ?? "0" AS "System_" + B["Name"],
        System_da
is not a valid SELECT clause (neither in U-SQL nor in any other SQL dialect I am aware of).
What is the JSON Array? Is it a key/value pair? Or positional? Or a single value in the array that you want to have a marker for whether it is present in the array?
From your example, it seems that you want something like:
Input:
[["A","B","C"],["C","D","B"]]
Output:
A B C D
true true true false
false true true true
If that is the case, I would write it as:
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

#input =
    SELECT "[[\"A\", \"B\", \"C\"],[\"C\", \"D\", \"B\"]]" AS json
    FROM (VALUES (1)) AS T(x);

#data =
    SELECT JsonFunctions.JsonTuple(arrstring) AS a
    FROM #input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json).Values) AS T(arrstring);

#data =
    SELECT a.Contains("A") AS A, a.Contains("B") AS B, a.Contains("C") AS C, a.Contains("D") AS D
    FROM (SELECT a.Values AS a FROM #data) AS t;

OUTPUT #data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);
If you need something more dynamic, either use the resulting SqlArray or SqlMap or use the above approach to generate the script.
However, I wonder why you would model your information this way in the first place. I would recommend finding a more appropriate way to mark the presence of the value in the JSON.
UPDATE: I missed your comment that the inner array members are objects with two key-value pairs, where one is always called Name (for the property name) and one is always called Value (for the property value). So here is the answer for that case.
First: Modelling key value pairs in JSON using {"Name": "propname", "Value" : "value"} is a complete misuse of the flexible modelling capabilities of JSON and should not be done. Use {"propname" : "value"} instead if you can.
So, changing the input, the following will give you the pivoted values. Note that you will need to know the values ahead of time, and there are several options for how to do the pivot. I do it in the statement where I create the new SqlMap instance (to reduce the over-modelling), and then in the next SELECT where I get the values from the map.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

#input =
    SELECT "[[{\"Name\":\"A\", \"Value\": 1}, {\"Name\": \"B\", \"Value\": 2}, {\"Name\": \"C\", \"Value\":3 }], [{\"Name\":\"C\", \"Value\": 4}, {\"Name\":\"D\", \"Value\": 5}, {\"Name\":\"B\", \"Value\": 6}]]" AS json
    FROM (VALUES (1)) AS T(x);

#data =
    SELECT JsonFunctions.JsonTuple(arrstring) AS a
    FROM #input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json)) AS T(rowid, arrstring);

#data =
    SELECT new SqlMap<string, string>(
               a.Values.Select((kvp) =>
                   new KeyValuePair<string, string>(
                       JsonFunctions.JsonTuple(kvp)["Name"],
                       JsonFunctions.JsonTuple(kvp)["Value"]))
           ) AS kvp
    FROM #data;

#data =
    SELECT kvp["A"] AS A,
           kvp["B"] AS B,
           kvp["C"] AS C,
           kvp["D"] AS D
    FROM #data;

OUTPUT #data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);

How to compare complete JSON objects

Is there any way to compare two JSON objects using ChaiJS? I want to compare them deeply, down to the leaf nodes. Ideally the code would ignore the order of the siblings within the JSON object and only validate structure, values, and the data types of the values. Any help is appreciated.
I just saw the following code, but I am not sure what equalRec is:
expect({ a: 3, b: {c: '2'} }).to.equalRec({ a: 3, b: {c: 2} }) //expecting false
expect({ a: 3, b: {c: '2'} }).to.equalRec({ b: {c: '2'}, a: 3 }) //expecting true
First of all, there is no such thing as a "JSON object": you have object literals. As for "ignore the order of the siblings": object keys have no order.
To compare two objects you could use the deep flag:
expect(obj1).to.be.deep.equal(obj2)
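Applied to the two examples from the question (a quick sketch; Chai's deep equality is strict about primitive types, so '2' and 2 do not match, while key order is irrelevant):

const { expect } = require("chai");

// Types differ ('2' vs 2), so deep equality fails, matching the expected false.
expect({ a: 3, b: { c: "2" } }).to.not.deep.equal({ a: 3, b: { c: 2 } });

// Same structure, values, and types; key order does not matter.
expect({ a: 3, b: { c: "2" } }).to.deep.equal({ b: { c: "2" }, a: 3 });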
