env: Spark 2.4.5
source.json:
{
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9",
...
}
target.json:
{
"factors": [
{
"name": "a",
"key": "1",
"pros": "2",
"cons": "3"
},
{
"name": "b",
"key": "4",
"pros": "5",
"cons": "6"
},
{
"name": "c",
"key": "7",
"pros": "8",
"cons": "9"
},
...
]
}
As you can see, the target 'name' is part of each source key. For instance, 'a' is the 'name' of 'a_key', 'a_pro', and 'a_con'. I really don't know how to extract a value from a key and then do a 'group by' style transformation. Can anybody give me a suggestion?
IIUC, first create the dataframe from the input JSON:
json_data = {
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9"
}
df=spark.createDataFrame(list(map(list,json_data.items())),['key','value'])
df.show()
+-----+-----+
| key|value|
+-----+-----+
|a_key| 1|
|a_pro| 2|
|a_con| 3|
|b_key| 4|
|b_pro| 5|
|b_con| 6|
|c_key| 7|
|c_pro| 8|
|c_con| 9|
+-----+-----+
Now create the required columns from the existing key column:
import pyspark.sql.functions as f
df2 = df.withColumn('Name', f.substring('key',1,1)).\
withColumn('Attributes', f.concat(f.split('key','_')[1],f.lit('s')))
df2.show()
+-----+-----+----+----------+
| key|value|Name|Attributes|
+-----+-----+----+----------+
|a_key| 1| a| keys|
|a_pro| 2| a| pros|
|a_con| 3| a| cons|
|b_key| 4| b| keys|
|b_pro| 5| b| pros|
|b_con| 6| b| cons|
|c_key| 7| c| keys|
|c_pro| 8| c| pros|
|c_con| 9| c| cons|
+-----+-----+----+----------+
Now pivot the dataframe and collect the result as a JSON object:
output_json = df2.groupBy('Name').\
pivot('Attributes').\
agg(f.min('value')).\
select(f.collect_list(f.struct('Name','keys','cons','pros')).alias('factors')).\
toJSON().collect()
import json
print(json.dumps(json.loads(output_json[0]),indent=4))
{
"factors": [
{
"Name": "c",
"keys": "7",
"cons": "9",
"pros": "8"
},
{
"Name": "b",
"keys": "4",
"cons": "6",
"pros": "5"
},
{
"Name": "a",
"keys": "1",
"cons": "3",
"pros": "2"
}
]
}
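Note that the order of the factors in the collected array is not guaranteed (the output above came back c, b, a), and the field names are Name/keys rather than the target's name/key, so adjust the aliases if you need an exact match. If you want the factors sorted by name, wrapping the collect_list in sort_array is one option (a sketch reusing df2 and the f alias from above; Name is the first struct field, so it drives the ordering):
output_json = df2.groupBy('Name').\
    pivot('Attributes').\
    agg(f.min('value')).\
    select(f.sort_array(f.collect_list(f.struct('Name','keys','cons','pros'))).alias('factors')).\
    toJSON().collect()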
No need to involve dataframes for this; some simple string and dictionary manipulation will do:
import json
source = {
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9",
}
factors = {}
# Map the source key suffixes onto the attribute names used in the target JSON
attr_names = {'key': 'key', 'pro': 'pros', 'con': 'cons'}
# Prepare each factor dictionary
for k, v in source.items():
    factor, item = k.split('_')
    d = factors.get(factor, {})
    d[attr_names[item]] = v
    factors[factor] = d
# Prepare result dictionary
target = {
    'factors': []
}
# Move name attribute into dictionary & append
for k, v in factors.items():
    d = v
    d['name'] = k
    target['factors'].append(d)
result = json.dumps(target)
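The same grouping can also be written a bit more compactly with collections.defaultdict; a sketch of the same idea, where attr_names maps the source suffixes onto the attribute names used in the target JSON:
from collections import defaultdict

attr_names = {'key': 'key', 'pro': 'pros', 'con': 'cons'}
grouped = defaultdict(dict)
for k, v in source.items():
    name, item = k.split('_')
    grouped[name][attr_names[item]] = v

# Move the name in and build the target structure
target = {'factors': [dict(attrs, name=name) for name, attrs in grouped.items()]}
print(json.dumps(target, indent=4))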
Your data layout is a bit unusual, but the following code can help you solve it:
source.json:
{
"a_key": "1",
"a_pro": "2",
"a_con": "3",
"b_key": "4",
"b_pro": "5",
"b_con": "6",
"c_key": "7",
"c_pro": "8",
"c_con": "9"
}
code:
import java.util
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

val sparkSession = SparkSession.builder()
.appName("readAndWriteJsonTest")
.master("local[*]").getOrCreate()
val dataFrame = sparkSession.read.format("json").load("R:\\data\\source.json")
// println(dataFrame.rdd.count())
// Each line of the pretty-printed JSON is read as a single string value,
// so the key/value lines are parsed by hand below.
val mapRdd: RDD[(String, (String, String))] = dataFrame.rdd.map(_.getString(0))
.filter(_.split("\\:").length == 2)
.map(line => {
val Array(key1, value1) = line.split("\\:")
val Array(name, key2) = key1.replace("\"", "").trim.split("\\_")
val value2 = value1.replace("\"", "").replace(",", "").trim
(name, (key2, value2))
})
// mapRdd.collect().foreach(println)
val initValue = new ArrayBuffer[(String, String)]
val function1 = (buffer1: ArrayBuffer[(String, String)], t1: (String, String)) => buffer1.+=(t1)
val function2 = (buffer1: ArrayBuffer[(String, String)], buffer2: ArrayBuffer[(String, String)]) => buffer1.++(buffer2)
val aggRdd: RDD[(String, ArrayBuffer[(String, String)])] = mapRdd.aggregateByKey(initValue)(function1, function2)
// aggRdd.collect().foreach(println)
import scala.collection.JavaConverters._
val persons: util.List[Person] = aggRdd.map(line => {
val name = line._1
val keyValue = line._2(0)._2
val prosValue = line._2(1)._2
val consValue = line._2(2)._2
Person(name, keyValue, prosValue, consValue)
}).collect().toList.asJava
import com.google.gson.GsonBuilder
val gson = new GsonBuilder().create
val factors = Factors(persons)
val targetJsonStr = gson.toJson(factors)
println(targetJsonStr)
target.json:
{
"factors": [
{
"name": "a",
"key": "1",
"pros": "2",
"cons": "3"
},
{
"name": "b",
"key": "4",
"pros": "5",
"cons": "6"
},
{
"name": "c",
"key": "7",
"pros": "8",
"cons": "9"
}
]
}
You can put the above code into a test method and run it to see the result you want:
@Test
def readAndSaveJsonTest(): Unit = {
// put the code above here
}
Hope it can help you.
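Note that the snippet relies on Person and Factors classes that are not shown in the answer; a minimal sketch of what they might look like, inferred from how the code constructs them (an assumption, not part of the original answer):
// Hypothetical definitions matching the usage above
case class Person(name: String, key: String, pros: String, cons: String)
case class Factors(factors: java.util.List[Person])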
Related
I am trying to parse/flatten JSON data containing an array and a struct.
For every "id" in the "data_array" column, I need to get the "EstValue" from the "data_struct" column. The column name in "data_struct" is the actual id (from "data_array"). I tried my best to use a dynamic join, but I am getting the error "Column is not iterable". Can't we use dynamic join conditions in PySpark, like we can in SQL? Is there a better way of achieving this?
JSON Input file:
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
Desired output:
Id Name EstValue CompValue
1 ABC 123 1234
2 DEF 456 4567
My PySpark code:
from pyspark.sql.functions import *
rawDF = spark.read.json([f"abfss://{pADLSContainer}@{pADLSGen2}.dfs.core.windows.net/{pADLSDirectory}/InputFile.json"], multiLine = "true")
idDF = rawDF.select(explode("data_array").alias("data_array")) \
.select(col("data_array.id").alias("id"))
idDF.show(n=2,vertical=True,truncate=150)
finalDF = idDF.join(rawDF, (idDF.id == rawDF.select(col("data_struct." + idDF.Id))) )
finalDF.show(n=2,vertical=True,truncate=150)
Error:
def __iter__(self): raise TypeError("Column is not iterable")
Self-joins create problems. In this case, you can avoid the join.
You could make arrays from both columns, zip them together, and use inline to extract them into columns. The most difficult part is creating an array from the "data_struct" column. Maybe there's a better way, but I could only think of first transforming it into a map type.
Input:
s = """
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
"""
rawDF = spark.read.json(sc.parallelize([s]), multiLine = "true")
Script:
# pyspark.sql.functions.transform accepts Python lambdas from Spark 3.1+
from pyspark.sql import functions as F

id = F.transform('data_array', lambda x: x.id).alias('Id')
name = F.transform('data_array', lambda x: x['name']).alias('Name')
map = F.from_json(F.to_json("data_struct"), 'map<string, struct<estimated:struct<value:long>,completed:struct<value:long>>>')
est_val = F.transform(id, lambda x: map[x].estimated.value).alias('EstValue')
comp_val = F.transform(id, lambda x: map[x].completed.value).alias('CompValue')
df = rawDF.withColumn('y', F.arrays_zip(id, name, est_val, comp_val))
df = df.selectExpr("inline(y)")
df.show()
# +---+----+--------+---------+
# | Id|Name|EstValue|CompValue|
# +---+----+--------+---------+
# | 1| ABC| 123| 1234|
# | 2| DEF| 456| 4567|
# +---+----+--------+---------+
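For comparison, a similar join-free lookup can be done by exploding data_array and indexing directly into the map column built above (a sketch reusing rawDF, F and the map variable from the script; the m[string(a.id)] lookup and the column aliases are my own):
df2 = (rawDF
    .select(F.explode("data_array").alias("a"), map.alias("m"))
    .select(F.col("a.id").alias("Id"),
            F.col("a.name").alias("Name"),
            F.expr("m[string(a.id)].estimated.value").alias("EstValue"),
            F.expr("m[string(a.id)].completed.value").alias("CompValue")))
df2.show()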
I have a sample database as below:
SNO     Name    Address
99123   Mike    Texas
88124   Tom     California
I want to keep my SNO as the Elasticsearch _id, to make it easier to update documents by their SNO.
Python code to create an index:
abc = {
"settings": {
"number_of_shards": 2,
"number_of_replicas": 2
}
}
es.indices.create(index='test',body = abc)
The data I receive from Postman is as below:
{
"_index": "test",
"_id": "13",
"_data": {
"FirstName": "Sample4",
"LastName": "ABCDEFG",
"Designation": "ABCDEF",
"Salary": "99",
"DateOfJoining": "2020-05-05",
"Address": "ABCDE",
"Gender": "ABCDE",
"Age": "21",
"MaritalStatus": "ABCDE",
"Interests": "ABCDEF",
"timestamp": "2020-05-05T14:42:46.394115",
"country": "Nepal"
}
}
And the insert code in Python is below:
req_JSON = request.json
input_index = req_JSON['_index']
input_id = req_JSON['_id']
input_data = req_JSON['_data']
doc = input_data
res = es.index(index=input_index, body=doc)
I thought the _id would remain the same as the one I had given, but an auto-generated _id was used instead.
You can simply do it like this, passing the id explicitly:
res = es.index(index=input_index, body=doc, id=input_id)
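Applied to the sample table in the question, indexing each row with SNO as the document _id would look roughly like this (a hypothetical usage sketch; es is the same Elasticsearch client as above):
rows = [
    {"SNO": 99123, "Name": "Mike", "Address": "Texas"},
    {"SNO": 88124, "Name": "Tom", "Address": "California"},
]
for row in rows:
    # Passing id explicitly stops Elasticsearch from auto-generating one
    es.index(index="test", id=row["SNO"], body=row)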
I have an entity:
{
"id": "123",
"col_1": null,
"sub_entities": [
{ "sub_entity_id": "s-1", "col_2": null },
{ "sub_entity_id": "s-2", "col_2": null }
]
}
and I loaded it into Spark: val entities = spark.read.json("...").
entities.filter(size($"sub_entities.col_2") === 0) returns nothing. This seems weird because every col_2 is null, yet the null values are still counted.
I then tried selecting col_2 and noticed it returns an array of null values (2 null values in this case).
entities.select($"col_1", $"sub_entities.col_2").show(false)
+--------+------------------+
|col_1 |sub_entities.col_2|
+--------+------------------+
|null |[,] |
+--------+------------------+
How do I write a query that returns only the objects from the array where col_2 is not null?
To query array objects, we first need to flatten out the array using the explode function and then query the dataframe.
Example:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = spark.read.json(Seq("""{"id": "123","col_1": null,"sub_entities": [ { "sub_entity_id": "s-1", "col_2": null }, { "sub_entity_id": "s-2", "col_2": null }]}""").toDS)
df.selectExpr("explode(sub_entities)","*").select("col.*","id","col_1").show()
//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null| s-1|123| null|
//| null| s-2|123| null|
//+-----+-------------+---+-----+
df.selectExpr("explode(sub_entities)","*").select("col.*","id","col_1").filter(col("col_2").isNull).show()
//+-----+-------------+---+-----+
//|col_2|sub_entity_id| id|col_1|
//+-----+-------------+---+-----+
//| null| s-1|123| null|
//| null| s-2|123| null|
//+-----+-------------+---+-----+
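If you instead want only the entries where col_2 is not null, flip the filter to isNotNull (with this sample data that returns an empty result, since every col_2 is null):
df.selectExpr("explode(sub_entities)","*").select("col.*","id","col_1").filter(col("col_2").isNotNull).show()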
This filters only on the array of col_2, as you mentioned; if you need a different output when you do df.select($"col_1", $"sub_entities").show, I can update the answer:
val json =
"""
{
"id": "123",
"col_1": null,
"sub_entities": [
{ "sub_entity_id": "s-1", "col_2": null },
{ "sub_entity_id": "s-2", "col_2": null }
]
}
"""
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = spark.read.json(Seq(json).toDS)
val removeNulls = udf((arr: Seq[String]) => arr.filter((x: String) => x != null))
df.select($"col_1", removeNulls($"sub_entities.col_2").as("sub_entities.col_2")).show(false)
+-----+------------------+
|col_1|sub_entities.col_2|
+-----+------------------+
|null |[] |
+-----+------------------+
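On Spark 2.4+ the same null-stripping can also be done without a UDF, using the built-in filter higher-order function (a sketch, not part of the original answer):
import org.apache.spark.sql.functions.{col, expr}

df.select(col("col_1"),
  expr("filter(sub_entities.col_2, x -> x IS NOT NULL)").as("col_2_non_null"))
  .show(false)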
The data source is:
val spark = SparkSession.builder().master("local[1,1]").config("spark.sql.shuffle.partitions", "1").config("spark.sql.crossJoin.enabled","true").getOrCreate()
spark.sparkContext.setLogLevel("error")
import spark.implicits._
val df=Seq(
("tom","America","2019"),
("jim","America","2019"),
("jack","America","2019"),
("tom","Russia","2019"),
("jim","Russia","2019"),
("jack","Russia","2019"),
("alex","Russia","2019"),
("tom","America","2018"),
("jim","America","2018"),
("tom","Germany","2018"),
("jim","England","2018")
).toDF("person","country","year")
I want to find which persons often go to the same countries in each year, and which countries they went to together, so what I expect is a JSON like this:
[{
"year": "2019",
"items": [{
"persons": ["tom", "jim", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jim"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["jack", "jim"],
"common": ["America", "Russia"],
"times": 2
}]
},
{
"year": "2018",
"items": [{
"persons": ["tom", "jim"],
"common": ["America"],
"times": 1
}]
}
]
I am not sure what model I should use. I tried frequent pattern mining (FP-Growth):
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df1 = df.where('year === 2019)
val rdd1= df1.groupBy("country").agg(collect_set('person)).drop("country","year")
.as[Array[String]].rdd
val fpg = new FPGrowth()
.setMinSupport(0.3)
.setNumPartitions(10)
val schema = new StructType().add(new StructField("items", ArrayType(StringType))).add(new StructField("freq", LongType))
val model = fpg.run(rdd1);
val rdd2 = model.freqItemsets.map(itemset => Row(itemset.items, itemset.freq))
spark.createDataFrame(rdd2, schema).where(size('items) > 1)
.show()
Then loop over every year:
val df2 = df.where('year === 2018)
val rdd3 = df2.groupBy("country").agg(collect_set('person)).drop("country", "year")
.as[Array[String]].rdd
....
val model2 = fpg.run(rdd3)
....
The result for 2019 is:
+----------------+----+
| items|freq|
+----------------+----+
| [jack, tom]| 2|
|[jack, tom, jim]| 2|
| [jack, jim]| 2|
| [tom, jim]| 2|
+----------------+----+
for 2018:
+----------+----+
| items|freq|
+----------+----+
|[tom, jim]| 1|
+----------+----+
But I cannot get when and where they went together, because the RDD I give to FPGrowth must be an RDD[Array[String]]; no more columns are allowed.
Is there any better model? How can I achieve this?
I also want to know how many times each group of persons went together.
Maybe I should use collaborative filtering?
Just self-join and aggregate
import org.apache.spark.sql.functions._
df.alias("left")
.join(df.alias("right"), Seq("country", "year"))
.where($"left.person" < $"right.person")
.groupBy(array($"left.person", $"right.person").alias("persons"))
.agg(collect_set(struct($"country", $"year")).alias("common"))
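To also get the per-year grouping and the times count from the expected JSON, you can keep year in the groupBy and count the shared countries (a sketch building on the snippet above; it does not reproduce the exact output shape):
df.alias("left")
  .join(df.alias("right"), Seq("country", "year"))
  .where($"left.person" < $"right.person")
  .groupBy($"year", array($"left.person", $"right.person").alias("persons"))
  .agg(collect_set($"country").alias("common"),
       countDistinct($"country").alias("times"))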
Try this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy("country", "year")
df
.withColumn("persons", collect_set('person) over window)
.drop('person)
.distinct()
.groupBy('persons)
.agg(collect_set(struct('country, 'year)).alias("common"))
Output (tested):
+----------+----------------------------------+
|persons |common |
+----------+----------------------------------+
|[jim, tom]|[[America, 2019], [Russia, 2019]] |
|[tom] |[[Germany, 2018], [America, 2018]]|
|[jim] |[[Russia, 2018], [England, 2018]] |
+----------+----------------------------------+
I have a dataframe that looks as below:
items_df
======================================================
| customer item_type brand price quantity |
|====================================================|
| 1 bread reems 20 10 |
| 2 butter spencers 10 21 |
| 3 jam niles 10 22 |
| 1 bread marks 16 18 |
| 1 butter jims 19 12 |
| 1 jam jills 16 6 |
| 2 bread marks 16 18 |
======================================================
I create an rdd that converts the rows above to dicts:
rdd = items_df.rdd.map(lambda row: row.asDict())
The result looks like:
[
{ "customer": 1, "item_type": "bread", "brand": "reems", "price": 20, "quantity": 10 },
{ "customer": 2, "item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21 },
{ "customer": 3, "item_type": "jam", "brand": "niles", "price": 10, "quantity": 22 },
{ "customer": 1, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 },
{ "customer": 1, "item_type": "butter", "brand": "jims", "price": 19, "quantity": 12 },
{ "customer": 1, "item_type": "jam", "brand": "jills", "price": 16, "quantity": 6 },
{ "customer": 2, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 }
]
I would like to group the above rows first by customer. Then I would like to introduce custom new keys "breads", "butters", "jams" and group all these rows for that customer. So my rdd reduces from 7 rows to 3 rows.
The output would look as below:
[
{
"customer": 1,
"breads": [
{"item_type": "bread", "brand": "reems", "price": 20, "quantity": 10},
{"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18},
],
"butters": [
{"item_type": "butter", "brand": "jims", "price": 19, "quantity": 12}
],
"jams": [
{"item_type": "jam", "brand": "jills", "price": 16, "quantity": 6}
]
},
{
"customer": 2,
"breads": [
{"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18}
],
"butters": [
{"item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21}
],
"jams": []
},
{
"customer": 3,
"breads": [],
"butters": [],
"jams": [
{"item_type": "jam", "brand": "niles", "price": 10, "quantity": 22}
]
}
]
Would anyone know how to achieve the above using PySpark? I would like to know if there is a solution using reduceByKey() or something similar. I am hoping to avoid the use of groupByKey() if possible.
First add an item_types column to pivot the dataframe on.
import pyspark.sql.functions as F

items_df = items_df.withColumn('item_types', F.concat(F.col('item_type'),F.lit('s')))
items_df.show()
+--------+---------+--------+-----+--------+----------+
|customer|item_type| brand|price|quantity|item_types|
+--------+---------+--------+-----+--------+----------+
| 1| bread| reems| 20| 10| breads|
| 2| butter|spencers| 10| 21| butters|
| 3| jam| niles| 10| 22| jams|
| 1| bread| marks| 16| 18| breads|
| 1| butter| jims| 19| 12| butters|
| 1| jam| jills| 16| 6| jams|
| 2| bread| marks| 16| 18| breads|
+--------+---------+--------+-----+--------+----------+
Then you can pivot the table grouped by customer, using F.collect_list() to aggregate the other columns at the same time.
items_df = items_df.groupby(['customer']).pivot("item_types").agg(
F.collect_list(F.struct(F.col("item_type"),F.col("brand"), F.col("price"),F.col("quantity")))
).sort('customer')
items_df.show()
+--------+--------------------+--------------------+--------------------+
|customer| breads| butters| jams|
+--------+--------------------+--------------------+--------------------+
| 1|[[bread, reems, 2...|[[butter, jims, 1...|[[jam, jills, 16,...|
| 2|[[bread, marks, 1...|[[butter, spencer...| []|
| 3| []| []|[[jam, niles, 10,...|
+--------+--------------------+--------------------+--------------------+
Finally, you need to set recursive=True to convert the nested Rows into dicts.
rdd = items_df.rdd.map(lambda row: row.asDict(recursive=True))
print(rdd.take(10))
[{'customer': 1,
'breads': [{'item_type': u'bread', 'brand': u'reems', 'price': 20, 'quantity': 10},
{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'jims', 'price': 19, 'quantity': 12}],
'jams': [{'item_type': u'jam', 'brand': u'jills', 'price': 16, 'quantity': 6}]},
{'customer': 2,
'breads': [{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'spencers', 'price': 10, 'quantity': 21}],
'jams': []},
{'customer': 3,
'breads': [],
'butters': [],
'jams': [{'item_type': u'jam', 'brand': u'niles', 'price': 10, 'quantity': 22}]}]
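If you want the grouped rows as JSON strings rather than Python dicts, DataFrame.toJSON is a convenient shortcut (a small usage sketch on the pivoted items_df):
# One JSON document per customer
for doc in items_df.toJSON().take(3):
    print(doc)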
I used another approach as well, using reduceByKey() on the RDD. Given the dataframe items_df, first convert it to an RDD:
rdd = items_df.rdd.map(lambda row: row.asDict())
Transform each row into a tuple (customer, [row_obj]), where row_obj is wrapped in a list:
rdd = rdd.map(lambda row: ( row["customer"], [row] ) )
Group by customer using reduceByKey, where the lists are concatenated for a given customer:
rdd = rdd.reduceByKey(lambda x,y: x+y)
Transform each tuple back into a dict where the key is the customer and the value is the list of all associated rows:
rdd = rdd.map(lambda tup: { tup[0]: tup[1] } )
Since each customer's data is now all in one row, we can segregate it into breads, butters, and jams using a custom function:
def organize_items_in_customer(row):
    cust_id = list(row.keys())[0]
    items = row[cust_id]
    new_cust_obj = { "customer": cust_id, "breads": [], "butters": [], "jams": [] }
    plurals = { "bread": "breads", "butter": "butters", "jam": "jams" }
    for item in items:
        item_type = item["item_type"]
        key = plurals[item_type]
        new_cust_obj[key].append(item)
    return new_cust_obj
Call the above function to transform the rdd:
rdd = rdd.map(organize_items_in_customer)
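Putting those steps together, the whole reduceByKey pipeline can be written as one chain (the same logic as above, just condensed):
rdd = (items_df.rdd
       .map(lambda row: (row["customer"], [row.asDict()]))
       .reduceByKey(lambda x, y: x + y)
       .map(lambda tup: {tup[0]: tup[1]})
       .map(organize_items_in_customer))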