How to compare complete JSON objects - node.js

Is there any way to compare two JSON objects using ChaiJS? I want to compare them deeply, down to the leaf nodes. Ideally the comparison would ignore the order of the siblings within the object and only validate the structure, the values, and the datatypes of the values. Any help is appreciated.
I just saw the following code, but I am not sure what equalRec is:
expect({ a: 3, b: {c: '2'} }).to.equalRec({ a: 3, b: {c: 2} }) //expecting false
expect({ a: 3, b: {c: '2'} }).to.equalRec({ b: {c: '2'}, a: 3 }) //expecting true

First of all, there is no such thing as a "JSON object"; what you have are object literals. As for "ignore the order of the siblings": object keys have no order, so there is nothing to ignore.
To compare two objects deeply you can use the deep flag:
expect(obj1).to.be.deep.equal(obj2)

Impute and Add new calculated column with Rust DataFusion?

Consider that I have a JSON data file named test_file.json with the following content:
{"a": 1, "b": "hi", "c": 3}
{"a": 5, "b": null, "c": 7}
Here is how I read the file with the DataFrame API of DataFusion:
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let file_path = "datalayers/landing/test_file.json";
    let mut ctx = SessionContext::new();
    let df = ctx.read_json(file_path, NdJsonReadOptions::default()).await?;
    df.show().await?;
    Ok(())
}
I would like to do the following operations:
Impute the null values in the b column with an empty string "", using either a fill-na or a case-when statement.
Create a new calculated column by combining columns a and b: col("a") + col("b").
I have gone through the API documentation but could not find any function like with_column, which Spark has for adding a new column, nor anything on how to impute the null values.
I can already add two columns with the column expression col("a").add(col("c")).alias("d"), but I was curious to know whether it is possible to use something like with_column to add a new column.
DataFusion's DataFrame does not currently have a with_column method, but I think it would be good to add one. I filed an issue for this: https://github.com/apache/arrow-datafusion/issues/2844
Until that is added, you could call https://docs.rs/datafusion/9.0.0/datafusion/dataframe/struct.DataFrame.html#method.select to select the existing columns as well as the new expression:
df.select(vec![col("a"), col("b"), col("c"), col("a").add(col("c")).alias("d")]);

pyspark cast multiple columns into different datatypes

Newbie to PySpark. I have a CSV with multiple columns of differing data types, i.e. string, date, float, etc. I am reading all columns as StringType. How can I loop through the dataframe and cast each column to its respective data type without having to write multiple withColumn statements? I have defined a dictionary like:
conversions = {
    "COL1": lambda c: f.col(c).cast("string"),
    "COL2": lambda c: f.from_unixtime(f.unix_timestamp(c, dateFormat)).cast("date"),
    "COL3": lambda c: f.from_unixtime(f.unix_timestamp(c, dateFormat)).cast("date"),
    "COL4": lambda c: f.col(c).cast("float"),
    "COL5": lambda c: f.col(c).cast("string"),
    "COL6": lambda c: f.col(c).cast("string")
}
for k, v in conversions.items():
    convDf = inputDf.withColumn(k, v(k))
This is not casting my input dates into the correct format. What am I doing wrong here?
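One likely culprit (my observation, not from the original post): every loop iteration calls withColumn on the original inputDf and reassigns convDf, so only the last conversion survives. A minimal sketch of a fix, reusing the conversions dictionary above and assuming dateFormat matches the incoming date strings:
from functools import reduce

# Accumulate the casts on the running dataframe instead of restarting
# from inputDf on every iteration.
convDf = inputDf
for k, v in conversions.items():
    convDf = convDf.withColumn(k, v(k))

# Equivalent one-liner using reduce.
convDf = reduce(lambda df, kv: df.withColumn(kv[0], kv[1](kv[0])),
                conversions.items(), inputDf)
A single select with all the casts applied at once would also work and avoids chaining many withColumn calls.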

Pandas MultiIndex in which one factor is an Enum

I'm having trouble working with a dataframe whose columns are a multiindex in which one of the iterables is an Enum. Consider the code:
from enum import Enum
import pandas as pd

MyEnum = Enum("MyEnum", "A B")
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[MyEnum, [1, 2]]))
This raises
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
This can be worked around by instead putting:
df = pd.DataFrame(columns=pd.MultiIndex.from_product(iterables=[
    pd.Series(MyEnum, dtype="category"),
    [1, 2]
]))
but then appending a row with
df.append({(MyEnum.A, 1): "abc", (MyEnum.B, 2): "xyz"}, ignore_index=True)
raises the same TypeError as before.
I've tried various variations on this theme, with no success. (No problems occur if the columns are not a MultiIndex but just an Enum.)
(Note that I can dodge this by using an IntEnum instead of an Enum. But then my columns simply appear as numbers, which is why I wanted to use an Enum in the first place, as opposed to ints.)
Many thanks!
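For completeness, here is a minimal sketch of one possible workaround (my suggestion, not from the original post): build that level of the MultiIndex from the enum members' names as plain strings, which avoids the categorical-ordering error while keeping readable column labels.
from enum import Enum
import pandas as pd

MyEnum = Enum("MyEnum", "A B")

# Use the members' names (plain strings) as the first level of the MultiIndex.
columns = pd.MultiIndex.from_product([[m.name for m in MyEnum], [1, 2]])
df = pd.DataFrame(columns=columns)

# Append a row by assigning a full row of values in column order:
# ("A", 1), ("A", 2), ("B", 1), ("B", 2)
df.loc[len(df)] = ["abc", None, None, "xyz"]
print(df)
The trade-off is that the labels are plain strings rather than Enum members, so lookups use MyEnum.A.name instead of MyEnum.A.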

Is there an inbuilt function to compare RDDs on specific criteria, or is it better to write a UDF?

How do I count the occurrences of elements of a child RDD within a parent RDD?
Say,
I have two RDDs
Parent RDD -
['2 3 5']
['4 5 7']
['5 4 2 3']
Child RDD
['2 3','5 3','4 7','5 7','5 3','2 3']
I need something like -
[['2 3',2],['5 3',2],['4 7',1],['5 7',1],['5 3',2] ...]
It's actually finding the frequent-item candidate set from the parent set.
Now, the child RDD can initially contain string elements or even lists, i.e.
['1 2','2 3'] or [[1,2],[2,3]]
as that's the data structure I would implement, depending on what fits best.
Question -
Are there inbuilt functions which could do something similar to what I am trying to achieve with these two RDDs? Any transformations?
Or do I need to write a UDF that parses each element of the child and compares it to the parent? My data is large, so I doubt this would be efficient.
In case I end up writing a UDF, should I use the foreach function of the RDD?
Or is the RDD framework not a good fit for a custom operation like this, and would dataframes work better here?
I am trying to do this in PySpark. Help or guidance is greatly appreciated!
It's easy enough if you use sets, but the trick is with grouping, as sets cannot be used as keys.
The alternative used here is ordering set elements and generating a string as the corresponding key:
rdd = sc.parallelize(['2 3 5', '4 5 7', '5 4 2 3'])\
    .map(lambda l: l.split())\
    .map(set)

childRdd = sc.parallelize(['2 3', '5 3', '4 7', '5 7', '5 3', '2 3'])\
    .map(lambda l: l.split())\
    .map(set)
# A small utility function to make a string from a set.
# The elements are sorted so that grouping can match keys,
# because sets themselves aren't ordered.
def setToString(theset):
    lst = list(theset)
    lst.sort()
    return ''.join(lst)
Now find the pairs where the child is a subset of the parent:
childRdd.cartesian(rdd)\
    .filter(lambda l: set(l[0]).issubset(set(l[1])))\
    .map(lambda pair: (setToString(pair[0]), pair[1]))\
    .countByKey()
For the above example, the last line returns:
defaultdict(int, {'23': 4, '35': 4, '47': 1, '57': 1})
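As a small follow-up (not part of the original answer): if the keys should keep the question's space-separated style and the result should be a list of [element, count] pairs, the utility and the last step can be adjusted like this:
# Join with a space so keys look like '2 3' instead of '23'.
def setToString(theset):
    return ' '.join(sorted(theset))

counts = childRdd.cartesian(rdd)\
    .filter(lambda pair: pair[0].issubset(pair[1]))\
    .map(lambda pair: (setToString(pair[0]), pair[1]))\
    .countByKey()

# Reshape the defaultdict into the [['2 3', 2], ...] form from the question.
result = [[k, v] for k, v in counts.items()]
Note that duplicate child elements ('2 3' and '5 3' each appear twice) are counted once per copy, which is why the original output shows 4; applying distinct() to the child strings before splitting would give counts per distinct element instead.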

U-SQL Error in Naming the Column

I have a JSON where the order of fields is not fixed.
i.e. I can have [A, B, C] or [B, C, A]
A, B, and C are all JSON objects of the form {Name: x, Value: y}.
So, when I use U-SQL to extract the JSON (I don't know their order) and put it into a CSV (for which I will need column names):
#output =
SELECT
A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
B["Value"] ?? "0" AS "System_" + B["Name"],
System_da
So, I am trying to use the "Name" field from the JSON as the column name.
But I am getting the error at the ### above:
Message
syntax error. Expected one of: FROM ',' EXCEPT GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE ';' ')'
Resolution
Correct the script syntax, using expected token(s) as a guide.
Description
Invalid syntax found in the script.
Details
at token '(', line 74
near the ###:
**************
I am not able to set the column name "dynamically", yet it is an absolute necessity for my issue.
Input: [A, B, C], [C, B, A]
Output: A.name B.name C.name
Row 1's values
Row 2's values
This
#output =
SELECT
A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
B["Value"] ?? "0" AS "System_" + B["Name"],
System_da
is not a valid SELECT clause (neither in U-SQL nor any other SQL dialect I am aware of).
What is the JSON Array? Is it a key/value pair? Or positional? Or a single value in the array that you want to have a marker for whether it is present in the array?
From your example, it seems that you want something like:
Input:
[["A","B","C"],["C","D","B"]]
Output:
A B C D
true true true false
false true true true
If that is the case, I would write it as:
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
#input =
SELECT "[[\"A\", \"B\", \"C\"],[\"C\", \"D\", \"B\"]]" AS json
FROM (VALUES (1)) AS T(x);
#data =
SELECT JsonFunctions.JsonTuple(arrstring) AS a
FROM #input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json).Values) AS T(arrstring);
#data =
SELECT a.Contains("A") AS A, a.Contains("B") AS B, a.Contains("C") AS C, a.Contains("D") AS D
FROM (SELECT a.Values AS a FROM #data) AS t;
OUTPUT #data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);
If you need something more dynamic, either use the resulting SqlArray or SqlMap or use the above approach to generate the script.
However, I wonder why you would model your information this way in the first place. I would recommend finding a more appropriate way to mark the presence of the value in the JSON.
UPDATE: I missed your comment that the inner array members are objects with two key-value pairs, where one is always called Name (for the property name) and one is always called Value (for the property value). So here is the answer for that case.
First: Modelling key value pairs in JSON using {"Name": "propname", "Value" : "value"} is a complete misuse of the flexible modelling capabilities of JSON and should not be done. Use {"propname" : "value"} instead if you can.
So, changing the input accordingly, the following will give you the pivoted values. Note that you will need to know the values ahead of time, and there are several options for how to do the pivot. I do it in the statement where I create the new SqlMap instance (to reduce the over-modelling) and then in the next SELECT, where I get the values from the map.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
#input =
SELECT "[[{\"Name\":\"A\", \"Value\": 1}, {\"Name\": \"B\", \"Value\": 2}, {\"Name\": \"C\", \"Value\":3 }], [{\"Name\":\"C\", \"Value\": 4}, {\"Name\":\"D\", \"Value\": 5}, {\"Name\":\"B\", \"Value\": 6}]]" AS json
FROM (VALUES (1)) AS T(x);
#data =
SELECT JsonFunctions.JsonTuple(arrstring) AS a
FROM #input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json)) AS T(rowid, arrstring);
#data =
SELECT new SqlMap<string, string>(
a.Values.Select((kvp) =>
new KeyValuePair<string, string>(
JsonFunctions.JsonTuple(kvp)["Name"]
, JsonFunctions.JsonTuple(kvp)["Value"])
)) AS kvp
FROM #data;
#data =
SELECT kvp["A"] AS A,
kvp["B"] AS B,
kvp["C"] AS C,
kvp["D"] AS D
FROM #data;
OUTPUT #data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);
