How to explode structs array? - apache-spark

I am working with a JSON object and want to convert object.hours to a relational table, based on a Spark SQL DataFrame/Dataset.
I tried to use "explode", but it does not really support this kind of "structs array".
The JSON object is below:
{
"business_id": "abc",
"full_address": "random_address",
"hours": {
"Monday": {
"close": "02:00",
"open": "11:00"
},
"Tuesday": {
"close": "02:00",
"open": "11:00"
},
"Friday": {
"close": "02:00",
"open": "11:00"
},
"Wednesday": {
"close": "02:00",
"open": "11:00"
},
"Thursday": {
"close": "02:00",
"open": "11:00"
},
"Sunday": {
"close": "00:00",
"open": "11:00"
},
"Saturday": {
"close": "02:00",
"open": "11:00"
}
}
}
I want to convert it to a relational table like this:
CREATE TABLE "business_hours" (
"id" integer NOT NULL PRIMARY KEY,
"business_id" integer NOT NULL FOREIGN KEY REFERENCES "businesses",
"day" integer NOT NULL,
"open_time" time,
"close_time" time
)
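For reference, a minimal way to get the df used in the answer below (a sketch; the file name business.json is hypothetical, and the pretty-printed document is assumed to need the multiLine option):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// business.json is a hypothetical path holding the document(s) above
val df = spark.read.option("multiLine", true).json("business.json")
df.printSchema()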

You can do this using this trick:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import spark.implicits._

// Read the day names from the schema of the hours struct
val days = df.schema
  .fields
  .filter(_.name == "hours")
  .head
  .dataType
  .asInstanceOf[StructType]
  .fieldNames

// Build one struct per day and explode the resulting array into rows
val solution = df
  .select(
    $"business_id",
    $"full_address",
    explode(
      array(
        days.map(d => struct(
          lit(d).as("day"),
          col(s"hours.$d.open").as("open_time"),
          col(s"hours.$d.close").as("close_time")
        )): _*
      )
    )
  )
  .select($"business_id", $"full_address", $"col.*")
scala> solution.show
+-----------+--------------+---------+---------+----------+
|business_id| full_address| day|open_time|close_time|
+-----------+--------------+---------+---------+----------+
| abc|random_address| Friday| 11:00| 02:00|
| abc|random_address| Monday| 11:00| 02:00|
| abc|random_address| Saturday| 11:00| 02:00|
| abc|random_address| Sunday| 11:00| 00:00|
| abc|random_address| Thursday| 11:00| 02:00|
| abc|random_address| Tuesday| 11:00| 02:00|
| abc|random_address|Wednesday| 11:00| 02:00|
+-----------+--------------+---------+---------+----------+
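If you also need the integer day column from the target DDL, a possible follow-up on top of the solution DataFrame above (a sketch; the Monday = 1 numbering and the JDBC write are assumptions, and Spark keeps the times as "HH:mm" strings since it has no TIME type):
import org.apache.spark.sql.functions._

// Hypothetical mapping from day names to the integers the target table expects (Monday = 1)
val dayNumber = typedLit(Map(
  "Monday" -> 1, "Tuesday" -> 2, "Wednesday" -> 3, "Thursday" -> 4,
  "Friday" -> 5, "Saturday" -> 6, "Sunday" -> 7
))

val forTable = solution.withColumn("day", dayNumber($"day"))
// forTable can then be written out, e.g. via forTable.write.jdbc(...),
// letting the database cast the "HH:mm" strings to its time type.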

Related

spark find which persons often go to the same countries

The data source is:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1,1]")
  .config("spark.sql.shuffle.partitions", "1")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()
spark.sparkContext.setLogLevel("error")
import spark.implicits._
val df=Seq(
("tom","America","2019"),
("jim","America","2019"),
("jack","America","2019"),
("tom","Russia","2019"),
("jim","Russia","2019"),
("jack","Russia","2019"),
("alex","Russia","2019"),
("tom","America","2018"),
("jim","America","2018"),
("tom","Germany","2018"),
("jim","England","2018")
).toDF("person","country","year")
I want to find which persons often go to the same countries for each year, and where they went together, so what I expect is JSON like this:
[{
"year": "2019",
"items": [{
"persons": ["tom", "jim", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jim"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["jack", "jim"],
"common": ["America", "Russia"],
"times": 2
}]
},
{
"year": "2018",
"items": [{
"persons": ["tom", "jim"],
"common": ["America"],
"times": 1
}]
}
]
But I am not sure what model I should use. I tried frequent-pattern mining (FPGrowth):
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df1 = df.where('year === 2019)
val rdd1 = df1.groupBy("country").agg(collect_set('person)).drop("country", "year")
  .as[Array[String]].rdd
val fpg = new FPGrowth()
  .setMinSupport(0.3)
  .setNumPartitions(10)
val schema = new StructType()
  .add(StructField("items", ArrayType(StringType)))
  .add(StructField("freq", LongType))
val model = fpg.run(rdd1)
val rdd2 = model.freqItemsets.map(itemset => Row(itemset.items, itemset.freq))
spark.createDataFrame(rdd2, schema).where(size('items) > 1)
  .show()
Then loop over every year:
val df2 = df.where('year === 2018)
val rdd2 = df2.groupBy("country").agg(collect_set('person)).drop("country", "year")
  .as[Array[String]].rdd
....
val model = fpg.run(rdd2)
....
The result is :
for 2019
+----------------+----+
| items|freq|
+----------------+----+
| [jack, tom]| 2|
|[jack, tom, jim]| 2|
| [jack, jim]| 2|
| [tom, jim]| 2|
+----------------+----+
for 2018:
+----------+----+
| items|freq|
+----------+----+
|[tom, jim]| 1|
+----------+----+
But I cannot get when and where they went together, because the RDD I give to FPGrowth must be an RDD[Array[String]]; no more columns are allowed.
Is there any better model? How can I achieve this?
I also want to know how many times each group of persons went together.
Maybe what I should use is collaborative filtering?
Just self-join and aggregate
import org.apache.spark.sql.functions._
df.alias("left")
.join(df.alias("right"), Seq("country", "year"))
.where($"left.person" < $"right.person")
.groupBy(array($"left.person", $"right.person").alias("persons"))
.agg(collect_set(struct($"country", $"year")).alias("common"))
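If you also need the times count and the per-year grouping from the expected JSON, a possible extension of the self-join above (a sketch; it assumes spark.implicits._ is in scope as in the question, and it only produces pairs of persons, not larger groups):
import org.apache.spark.sql.functions._

val pairs = df.alias("left")
  .join(df.alias("right"), Seq("country", "year"))
  .where($"left.person" < $"right.person")
  .groupBy($"year", array($"left.person", $"right.person").alias("persons"))
  .agg(collect_set($"country").alias("common"))
  .withColumn("times", size($"common"))

// Collect one item list per year, roughly matching the shape of the expected JSON
pairs
  .groupBy($"year")
  .agg(collect_list(struct($"persons", $"common", $"times")).alias("items"))
  .toJSON
  .show(false)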
Try this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy("country", "year")
df
  .withColumn("persons", collect_set('person) over window)
  .drop('person)
  .distinct()
  .groupBy('persons)
  .agg(collect_set(struct('country, 'year)).alias("common"))
Output (tested):
+----------+----------------------------------+
|persons |common |
+----------+----------------------------------+
|[jim, tom]|[[America, 2019], [Russia, 2019]] |
|[tom] |[[Germany, 2018], [America, 2018]]|
|[jim] |[[Russia, 2018], [England, 2018]] |
+----------+----------------------------------+

How to reduceByKey in PySpark with custom grouping of rows?

I have a dataframe that looks as below:
items_df
======================================================
| customer item_type brand price quantity |
|====================================================|
| 1 bread reems 20 10 |
| 2 butter spencers 10 21 |
| 3 jam niles 10 22 |
| 1 bread marks 16 18 |
| 1 butter jims 19 12 |
| 1 jam jills 16 6 |
| 2 bread marks 16 18 |
======================================================
I create an rdd that converts the above to a dict:
rdd = items_df.rdd.map(lambda row: row.asDict())
The result looks like:
[
{ "customer": 1, "item_type": "bread", "brand": "reems", "price": 20, "quantity": 10 },
{ "customer": 2, "item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21 },
{ "customer": 3, "item_type": "jam", "brand": "niles", "price": 10, "quantity": 22 },
{ "customer": 1, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 },
{ "customer": 1, "item_type": "butter", "brand": "jims", "price": 19, "quantity": 12 },
{ "customer": 1, "item_type": "jam", "brand": "jills", "price": 16, "quantity": 6 },
{ "customer": 2, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 }
]
I would like to group the above rows first by customer. Then I would like to introduce custom new keys "breads", "butters", "jams" and group all these rows for that customer. So my rdd reduces from 7 rows to 3 rows.
The output would look as below:
[
{
"customer": 1,
"breads": [
{"item_type": "bread", "brand": "reems", "price": 20, "quantity": 10},
{"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18},
],
"butters": [
{"item_type": "butter", "brand": "jims", "price": 19, "quantity": 12}
],
"jams": [
{"item_type": "jam", "brand": "jills", "price": 16, "quantity": 6}
]
},
{
"customer": 2,
"breads": [
{"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18}
],
"butters": [
{"item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21}
],
"jams": []
},
{
"customer": 3,
"breads": [],
"butters": [],
"jams": [
{"item_type": "jam", "brand": "niles", "price": 10, "quantity": 22}
]
}
]
Would anyone know how to achieve the above using PySpark? I would like to know if there is a solution using reduceByKey() or something similar. I am hoping to avoid the use of groupByKey() if possible.
First add a column item_types to pivot the dataframe on.
import pyspark.sql.functions as F

items_df = items_df.withColumn('item_types', F.concat(F.col('item_type'), F.lit('s')))
items_df.show()
+--------+---------+--------+-----+--------+----------+
|customer|item_type| brand|price|quantity|item_types|
+--------+---------+--------+-----+--------+----------+
| 1| bread| reems| 20| 10| breads|
| 2| butter|spencers| 10| 21| butters|
| 3| jam| niles| 10| 22| jams|
| 1| bread| marks| 16| 18| breads|
| 1| butter| jims| 19| 12| butters|
| 1| jam| jills| 16| 6| jams|
| 2| bread| marks| 16| 18| breads|
+--------+---------+--------+-----+--------+----------+
Then you can pivot the table grouped by customer and use F.collect_list() to aggregate the other columns at the same time.
items_df = items_df.groupby(['customer']).pivot("item_types").agg(
F.collect_list(F.struct(F.col("item_type"),F.col("brand"), F.col("price"),F.col("quantity")))
).sort('customer')
items_df.show()
+--------+--------------------+--------------------+--------------------+
|customer| breads| butters| jams|
+--------+--------------------+--------------------+--------------------+
| 1|[[bread, reems, 2...|[[butter, jims, 1...|[[jam, jills, 16,...|
| 2|[[bread, marks, 1...|[[butter, spencer...| []|
| 3| []| []|[[jam, niles, 10,...|
+--------+--------------------+--------------------+--------------------+
Finally you need to set recursive=True to convert the nested Rows into dicts.
rdd = items_df.rdd.map(lambda row: row.asDict(recursive=True))
print(rdd.take(10))
[{'customer': 1,
'breads': [{'item_type': u'bread', 'brand': u'reems', 'price': 20, 'quantity': 10},
{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'jims', 'price': 19, 'quantity': 12}],
'jams': [{'item_type': u'jam', 'brand': u'jills', 'price': 16, 'quantity': 6}]},
{'customer': 2,
'breads': [{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'spencers', 'price': 10, 'quantity': 21}],
'jams': []},
{'customer': 3,
'breads': [],
'butters': [],
'jams': [{'item_type': u'jam', 'brand': u'niles', 'price': 10, 'quantity': 22}]}]
I also tried another approach using reduceByKey() on the RDD. Given the dataframe items_df, first convert it to an RDD:
rdd = items_df.rdd.map(lambda row: row.asDict())
Transform each row into a tuple (customer, [row_obj]) where row_obj is wrapped in a list:
rdd = rdd.map(lambda row: ( row["customer"], [row] ) )
Group by customer using reduceByKey, where the lists are concatenated for a given customer:
rdd = rdd.reduceByKey(lambda x,y: x+y)
Transform each tuple back into a dict where the key is the customer and the value is the list of all associated rows:
rdd = rdd.map(lambda tup: { tup[0]: tup[1] } )
Since all of each customer's data is now in a single row, we can segregate the data into breads, butters and jams using a custom function:
def organize_items_in_customer(row):
    cust_id = list(row.keys())[0]
    items = row[cust_id]
    new_cust_obj = { "customer": cust_id, "breads": [], "butters": [], "jams": [] }
    plurals = { "bread": "breads", "butter": "butters", "jam": "jams" }
    for item in items:
        item_type = item["item_type"]
        key = plurals[item_type]
        new_cust_obj[key].append(item)
    return new_cust_obj
Call the above function to transform rdd:
rdd = rdd.map(organize_items_in_customer)

Parse nested JSON structure with a varying schema with Spark DataFrame or RDD API

I have many JSON documents with a structure like this:
{
"parent_id": "parent_id1",
"devices" : "HERE_IS_STRUCT_SERIALIZED_AS_STRING_SEE BELOW"
}
{
"0x0034" : { "id": "0x0034", "p1": "p1v1", "p2": "p2v1" },
"0xAB34" : { "id": "0xAB34", "p1": "p1v2", "p2": "p2v2" },
"0xCC34" : { "id": "0xCC34", "p1": "p1v3", "p2": "p2v3" },
"0xFFFF" : { "id": "0xFFFF", "p1": "p1v4", "p2": "p2v4" },
....
"0x0023" : { "id": "0x0023", "p1": "p1vN", "p2": "p2vN" },
}
As you can see, instead of making an array of objects, the telemetry developers serialized every element as a property of an object, and the property names vary depending on the id.
Using the Spark DataFrame or RDD API, I want to transform it into a table like this:
parent_id1, 0x0034, p1v1, p2v1
parent_id1, 0xAB34, p1v2, p2v2
parent_id1, 0xCC34, p1v3, p2v3
parent_id1, 0xFFFF, p1v4, p2v4
parent_id1, 0x0023, p1v5, p2v5
Here is sample data:
{
"parent_1": "parent_v1",
"devices" : "{ \"0x0034\" : { \"id\": \"0x0034\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xAB34\" : { \"id\": \"0xAB34\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }, \"0xCC34\" : { \"id\": \"0xCC34\", \"p1\": \"p1v3\", \"p2\": \"p2v3\" }, \"0xFFFF\" : { \"id\": \"0xFFFF\", \"p1\": \"p1v4\", \"p2\": \"p2v4\" }, \"0x0023\" : { \"id\": \"0x0023\", \"p1\": \"p1vN\", \"p2\": \"p2vN\" }}"
}
{
"parent_2": "parent_v1",
"devices" : "{ \"0x0045\" : { \"id\": \"0x0045\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xC5C1\" : { \"id\": \"0xC5C1\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }}"
}
Desired output
parent_id1, 0x0034, p1v1, p2v1
parent_id1, 0xAB34, p1v2, p2v2
parent_id1, 0xCC34, p1v3, p2v3
parent_id1, 0xFFFF, p1v4, p2v4
parent_id1, 0x0023, p1v5, p2v5
parent_id2, 0x0045, p1v1, p2v1
parent_id2, 0xC5C1, p1v2, p2v2
I thought about passing devices as a parameter of the from_json function and then somehow transforming the returned object into a JSON array and exploding it....
But from_json wants a schema as input, and the schema tends to vary...
There is probably a more pythonic or sparkian way to do this but this worked for me:
Input Data
data = {
"parent_id": "parent_v1",
"devices" : "{ \"0x0034\" : { \"id\": \"0x0034\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xAB34\" : { \"id\": \"0xAB34\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }, \"0xCC34\" : { \"id\": \"0xCC34\", \"p1\": \"p1v3\", \"p2\": \"p2v3\" }, \"0xFFFF\" : { \"id\": \"0xFFFF\", \"p1\": \"p1v4\", \"p2\": \"p2v4\" }, \"0x0023\" : { \"id\": \"0x0023\", \"p1\": \"p1vN\", \"p2\": \"p2vN\" }}"
}
Get Dataframe
import json

def get_df_from_json(json_data):
    # convert string to json
    json_data['devices'] = json.loads(json_data['devices'])
    list_of_dicts = []
    for device_name, device_details in json_data['devices'].items():
        row = {
            "parent_id": json_data['parent_id'],
            "device": device_name
        }
        for key in device_details.keys():
            row[key] = device_details[key]
        list_of_dicts.append(row)
    return spark.read.json(sc.parallelize(list_of_dicts), multiLine=True)
display(get_df_from_json(data))
Output
+--------+--------+------+------+-----------+
| device | id | p1 | p2 | parent_id |
+--------+--------+------+------+-----------+
| 0x0034 | 0x0034 | p1v1 | p2v1 | parent_v1 |
| 0x0023 | 0x0023 | p1vN | p2vN | parent_v1 |
| 0xFFFF | 0xFFFF | p1v4 | p2v4 | parent_v1 |
| 0xCC34 | 0xCC34 | p1v3 | p2v3 | parent_v1 |
| 0xAB34 | 0xAB34 | p1v2 | p2v2 | parent_v1 |
+--------+--------+------+------+-----------+
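The from_json idea from the question can also work if the varying keys are modeled as a map. A sketch in Scala (assuming Spark 2.2+, that df is a DataFrame read from the JSON files, and that the parent column is literally named parent_id as in the first example):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// The map keys are the varying device ids; the values share a fixed struct schema
val deviceSchema = MapType(
  StringType,
  new StructType()
    .add("id", StringType)
    .add("p1", StringType)
    .add("p2", StringType)
)

val flattened = df
  .withColumn("devices", from_json(col("devices"), deviceSchema))
  .select(col("parent_id"), explode(col("devices")))  // exploding a map yields key/value columns
  .select(col("parent_id"), col("key").as("device"), col("value.p1"), col("value.p2"))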

iterate nested list in json msg by cql stream analytics

I have a JSON message coming from IoT Hub like:
{
"deviceId": "abc",
"topic": "data",
"data": {
"varname1": [{
"t": "timestamp1",
"v": "value1",
"f": "respondFrame1"
},
{
"t": "timestamp2",
"v": "value2",
"f": "respondFrame2"
}],
"varname2": [{
"t": "timestamp1",
"v": "value1",
"f": "respondFrame1"
},
{
"t": "timestamp2",
"v": "value2",
"f": "respondFrame2"
}]
}
}
and I want to store this via an Azure Stream Analytics job into a Transact-SQL table like this:
ID | deviceId | varname | timestamp | respondFrame | value
-----+------------+-----------+-------------+----------------+--------
1 | abc | varname1 | timestamp1 | respondFrame1 | value1
2 | abc | varname1 | timestamp2 | respondFrame2 | value2
3 | abc | varname2 | timestamp1 | respondFrame1 | value1
4 | abc | varname2 | timestamp2 | respondFrame2 | value2
Does anybody know how to handle these stacked iterations and combine them (CROSS APPLY)?
Something like this "phantomCode":
deviceId = msg.deviceId
for d in msg.data:
    for key in d:
        varname = key.name
        timestamp = key[varname].t
        frame = key[varname].f
        value = key[varname].v
UPDATE regarding JS Azure's answer:
with the code
WITH datalist AS
(
SELECT
iotHubAlias.deviceId,
data.PropertyName as varname,
data.PropertyValue as arrayData
FROM [iotHub] as iotHubAlias
CROSS APPLY GetRecordProperties(iotHubAlias.data) AS data
WHERE iotHubAlias.topic = 'data'
)
SELECT
datalist.deviceId,
datalist.varname,
arrayElement.ArrayValue.t as [timestamp],
arrayElement.ArrayValue.f as respondFrame,
arrayElement.ArrayValue.v as value
INTO [temporary]
FROM datalist
CROSS APPLY GetArrayElements(datalist.arrayData) AS arrayElement
I always get an error:
{
"channels": "Operation",
"correlationId": "f9d4437b-707e-4892-a37b-8ad721eb1bb2",
"description": "",
"eventDataId": "ef5a5f2b-8c2f-49c2-91f0-16213aaa959d",
"eventName": {
"value": "streamingNode0",
"localizedValue": "streamingNode0"
},
"category": {
"value": "Administrative",
"localizedValue": "Administrative"
},
"eventTimestamp": "2018-08-21T18:23:39.1804989Z",
"id": "/subscriptions/46cd2f8f-b46b-4428-8f7b-c7d942ff745d/resourceGroups/fieldtest/providers/Microsoft.StreamAnalytics/streamingjobs/streamAnalytics4fieldtest/events/ef5a5f2b-8c2f-49c2-91f0-16213aaa959d/ticks/636704726191804989",
"level": "Error",
"operationId": "7a38a957-1a51-4da1-a679-eae1c7e3a65b",
"operationName": {
"value": "Process Events: Processing events Runtime Error",
"localizedValue": "Process Events: Processing events Runtime Error"
},
"resourceGroupName": "fieldtest",
"resourceProviderName": {
"value": "Microsoft.StreamAnalytics",
"localizedValue": "Microsoft.StreamAnalytics"
},
"resourceType": {
"value": "Microsoft.StreamAnalytics/streamingjobs",
"localizedValue": "Microsoft.StreamAnalytics/streamingjobs"
},
"resourceId": "/subscriptions/46cd2f8f-b46b-4428-8f7b-c7d942ff745d/resourceGroups/fieldtest/providers/Microsoft.StreamAnalytics/streamingjobs/streamAnalytics4fieldtest",
"status": {
"value": "Failed",
"localizedValue": "Failed"
},
"subStatus": {
"value": "",
"localizedValue": ""
},
"submissionTimestamp": "2018-08-21T18:24:34.0981187Z",
"subscriptionId": "46cd2f8f-b46b-4428-8f7b-c7d942ff745d",
"properties": {
"Message Time": "2018-08-21 18:23:39Z",
"Error": "- Unable to cast object of type 'Microsoft.EventProcessing.RuntimeTypes.ValueArray' to type 'Microsoft.EventProcessing.RuntimeTypes.IRecord'.\r\n",
"Message": "Runtime exception occurred while processing events, - Unable to cast object of type 'Microsoft.EventProcessing.RuntimeTypes.ValueArray' to type 'Microsoft.EventProcessing.RuntimeTypes.IRecord'.\r\n, : OutputSourceAlias:temporary;",
"Type": "SqlRuntimeError",
"Correlation ID": "f9d4437b-707e-4892-a37b-8ad721eb1bb2"
},
"relatedEvents": []
}
and here is an example of a real JSON message coming from a device:
{
"topic": "data",
"data": {
"ExternalFlowTemperatureSensor": [{
"t": "2018-08-22T11:00:11.955381",
"v": 16.64103,
"f": "Q6ES8KJIN1NX2DRGH36RX1WDT"
}],
"AdaStartsP2": [{
"t": "2018-08-22T11:00:12.863383",
"v": 382.363138,
"f": "9IY7B4DFBAMOLH3GNKRUNUQNUX"
},
{
"t": "2018-08-22T11:00:54.172501",
"v": 104.0,
"f": "IUJMP20CYQK60B"
}],
"s_DriftData[4].c32_ZeitLetzterTest": [{
"t": "2018-08-22T11:01:01.829568",
"v": 348.2916,
"f": "MMTPWQVLL02CA"
}]
},
"deviceId": "test_3c27db"
}
and (for completeness) the creation code for the SQL table:
create table temporary (
id int NOT NULL IDENTITY PRIMARY KEY,
deviceId nvarchar(20) NOT NULL,
timestamp datetime NOT NULL,
varname nvarchar(100) NOT NULL,
value float,
respondFrame nvarchar(50)
)
The following query will give you the expected output:
WITH step1 AS
(
SELECT
event.deviceID,
data.PropertyName as varname,
data.PropertyValue as arrayData
FROM blobtest as event
CROSS APPLY GetRecordProperties(event.data) AS data
)
SELECT
event.deviceId,
event.varname,
arrayElement.ArrayValue.t as [timestamp],
arrayElement.ArrayValue.f as frame,
arrayElement.ArrayValue.v as value
FROM step1 as event
CROSS APPLY GetArrayElements(event.arrayData) AS arrayElement
You can find more info about JSON parsing on our documentation page "Parse JSON and Avro data in Azure Stream Analytics"
Let me know if you have any other questions.
JS (Azure Stream Analytics)

how to pptable the dynamic JSON string using aeson in Haskell

I want to use the pptable and aeson libraries to render a JSON string as a table in the console.
The JSON string comes from an ES (Elasticsearch) table and looks like this:
{
"hits": {
"hits": [
{
"_type": "tableName",
"_routing": "key",
"_source": {
"col1": 1,
"col2": 0,
"col3": "1",
"col4": "2",
"col5": 2824,
"col6": "2018-05-26 22:49:24"
},
"_score": 11.97,
"_index": "mysql_",
"_id": "9"
}
],
"total": 1,
"max_score": 11.97
},
"_shards": {
"successful": 30,
"failed": 0,
"total": 30
},
"took": 60,
"timed_out": false
}
And I want to display a table just like
+------+------+------+------+------+---------------------+
| col1 | col2 | col3 | col4 | col5 | col6                |
+------+------+------+------+------+---------------------+
| 1    | 0    | 1    | 2    | 2824 | 2018-05-26 22:49:24 |
+------+------+------+------+------+---------------------+
I can parse the JSON string into an aeson Object and filter out the _source sub-object. But the Object type does not derive Generic,
so I have no idea what to do with it.
