I have a dataframe that looks as below:
items_df
+--------+---------+--------+-----+--------+
|customer|item_type|   brand|price|quantity|
+--------+---------+--------+-----+--------+
|       1|    bread|   reems|   20|      10|
|       2|   butter|spencers|   10|      21|
|       3|      jam|   niles|   10|      22|
|       1|    bread|   marks|   16|      18|
|       1|   butter|    jims|   19|      12|
|       1|      jam|   jills|   16|       6|
|       2|    bread|   marks|   16|      18|
+--------+---------+--------+-----+--------+
I create an RDD that converts each row of the above to a dict:
rdd = items_df.rdd.map(lambda row: row.asDict())
The result looks like:
[
{ "customer": 1, "item_type": "bread", "brand": "reems", "price": 20, "quantity": 10 },
{ "customer": 2, "item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21 },
{ "customer": 3, "item_type": "jam", "brand": "niles", "price": 10, "quantity": 22 },
{ "customer": 1, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 },
{ "customer": 1, "item_type": "butter", "brand": "jims", "price": 19, "quantity": 12 },
{ "customer": 1, "item_type": "jam", "brand": "jills", "price": 16, "quantity": 6 },
{ "customer": 2, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 }
]
I would like to group the above rows first by customer, then introduce custom new keys "breads", "butters", and "jams" and group the rows under them for that customer, so my RDD reduces from 7 rows to 3 rows.
The output would look as below:
[
{
"customer": 1,
"breads": [
{"item_type": "bread", "brand": "reems", "price": 20, "quantity": 10},
{"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18},
],
"butters": [
{"item_type": "butter", "brand": "jims", "price": 19, "quantity": 12}
],
"jams": [
{"item_type": "jam", "brand": "jills", "price": 16, "quantity": 6}
]
},
{
"customer": 2,
"breads": [
{"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18}
],
"butters": [
{"item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21}
],
"jams": []
},
{
"customer": 3,
"breads": [],
"butters": [],
"jams": [
{"item_type": "jam", "brand": "niles", "price": 10, "quantity": 22}
]
}
]
Would anyone know how to achieve the above using PySpark? I would like to know if there is a solution using reduceByKey() or something similar. I am hoping to avoid the use of groupByKey() if possible.
First, add a column item_types that will be used to pivot the dataframe.
from pyspark.sql import functions as F

items_df = items_df.withColumn('item_types', F.concat(F.col('item_type'), F.lit('s')))
items_df.show()
+--------+---------+--------+-----+--------+----------+
|customer|item_type| brand|price|quantity|item_types|
+--------+---------+--------+-----+--------+----------+
| 1| bread| reems| 20| 10| breads|
| 2| butter|spencers| 10| 21| butters|
| 3| jam| niles| 10| 22| jams|
| 1| bread| marks| 16| 18| breads|
| 1| butter| jims| 19| 12| butters|
| 1| jam| jills| 16| 6| jams|
| 2| bread| marks| 16| 18| breads|
+--------+---------+--------+-----+--------+----------+
Then you can group by customer, pivot on item_types, and use F.collect_list() to aggregate the other columns in the same step.
items_df = items_df.groupby(['customer']).pivot("item_types").agg(
    F.collect_list(F.struct(F.col("item_type"), F.col("brand"), F.col("price"), F.col("quantity")))
).sort('customer')
items_df.show()
+--------+--------------------+--------------------+--------------------+
|customer| breads| butters| jams|
+--------+--------------------+--------------------+--------------------+
| 1|[[bread, reems, 2...|[[butter, jims, 1...|[[jam, jills, 16,...|
| 2|[[bread, marks, 1...|[[butter, spencer...| []|
| 3| []| []|[[jam, niles, 10,...|
+--------+--------------------+--------------------+--------------------+
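As a side note, if the set of item types is fixed and known in advance, it can be passed to pivot() so Spark skips the extra job that computes the distinct pivot values; a small sketch of the same pivot step (known_types is an assumed, hard-coded list):

known_types = ["breads", "butters", "jams"]  # assumed fixed set of pivoted columns
items_df = items_df.groupby("customer").pivot("item_types", known_types).agg(
    F.collect_list(F.struct("item_type", "brand", "price", "quantity"))
).sort("customer")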
Finally, pass recursive=True to asDict() so the nested Row objects are converted to dicts as well.
rdd = items_df.rdd.map(lambda row: row.asDict(recursive=True))
print(rdd.take(10))
[{'customer': 1,
'breads': [{'item_type': u'bread', 'brand': u'reems', 'price': 20, 'quantity': 10},
{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'jims', 'price': 19, 'quantity': 12}],
'jams': [{'item_type': u'jam', 'brand': u'jills', 'price': 16, 'quantity': 6}]},
{'customer': 2,
'breads': [{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'spencers', 'price': 10, 'quantity': 21}],
'jams': []},
{'customer': 3,
'breads': [],
'butters': [],
'jams': [{'item_type': u'jam', 'brand': u'niles', 'price': 10, 'quantity': 22}]}]
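If the end goal is JSON strings rather than Python dicts, the pivoted dataframe can also be serialized directly; a brief sketch using DataFrame.toJSON():

json_rdd = items_df.toJSON()  # RDD of JSON strings, one document per customer
print(json_rdd.take(3))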
I also used another approach with reduceByKey() on the RDD. Given the dataframe items_df, first convert it to an RDD of dicts:
rdd = items_df.rdd.map(lambda row: row.asDict())
Transform each row into a tuple (customer, [row_obj]), where row_obj is wrapped in a list:
rdd = rdd.map(lambda row: ( row["customer"], [row] ) )
Group by customer using reduceByKey, where the lists are concatenated for a given customer:
rdd = rdd.reduceByKey(lambda x,y: x+y)
Transform each tuple back into a dict where the key is the customer and the value is the list of all associated rows:
rdd = rdd.map(lambda tup: { tup[0]: tup[1] } )
Since all of a customer's data is now in a single row, we can segregate it into breads, butters, and jams using a custom function:
def organize_items_in_customer(row):
    # The single key is the customer id; its value is the list of that customer's items.
    cust_id = list(row.keys())[0]
    items = row[cust_id]
    new_cust_obj = {"customer": cust_id, "breads": [], "butters": [], "jams": []}
    plurals = {"bread": "breads", "butter": "butters", "jam": "jams"}
    for item in items:
        # Route each item into the list named after the plural of its item_type.
        item_type = item["item_type"]
        key = plurals[item_type]
        new_cust_obj[key].append(item)
    return new_cust_obj
Call the above function to transform the RDD:
rdd = rdd.map(organize_items_in_customer)
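For reference, a variation of the same idea builds the per-type lists directly inside the reduction with combineByKey(), so the intermediate (customer, [row]) lists are never materialized. This is only a sketch under the same assumption that bread, butter, and jam are the only item types; the helper names make_acc, add_item, and merge_accs are just illustrative:

plurals = {"bread": "breads", "butter": "butters", "jam": "jams"}

def add_item(acc, item):
    # Append one item dict to the matching per-type list of the accumulator.
    acc[plurals[item["item_type"]]].append(item)
    return acc

def make_acc(item):
    # Create a fresh accumulator for a customer from its first item.
    return add_item({"breads": [], "butters": [], "jams": []}, item)

def merge_accs(acc1, acc2):
    # Merge two partial accumulators built on different partitions.
    for key, items in acc2.items():
        acc1[key].extend(items)
    return acc1

rdd = (items_df.rdd
       .map(lambda row: (row["customer"], row.asDict()))
       .combineByKey(make_acc, add_item, merge_accs)
       .map(lambda kv: {"customer": kv[0], **kv[1]}))
print(rdd.take(3))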
I am trying to get a list of dictionaries from a list, based on a list of values for a specific property. Any suggestions?
list_of_persons = [
{"id": 2, "name": "name_2", "age": 23},
{"id": 3, "name": "name_3", "age": 43},
{"id": 4, "name": "name_4", "age": 35},
{"id": 5, "name": "name_5", "age": 59}
]
ids_search_list = [2, 4]
I'd like to get the following list
result_list = [
{"id": 2, "name": "name_2", "age": 23},
{"id": 4, "name": "name_4", "age": 35}
]
Looping could be the simplest solution, but there should be a better one in Python.
You can do it with a plain loop:
list_of_persons = [
{"id": 2, "name": "name_2", "age": 23},
{"id": 3, "name": "name_3", "age": 43},
{"id": 4, "name": "name_4", "age": 35},
{"id": 5, "name": "name_5", "age": 59}
]
ids_search_list = [2, 4]
result = []
for person in list_of_persons:
if person["id"] in ids_search_list:
result.append(person)
print(result)
You can use a list comprehension:
result_list = [person for person in list_of_persons if person["id"] in ids_search_list]
If you want some reading material about it: https://realpython.com/list-comprehension-python/
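If ids_search_list grows large, turning it into a set first keeps each membership test O(1) instead of scanning the list; a minimal sketch:

ids_search_set = set(ids_search_list)
result_list = [person for person in list_of_persons if person["id"] in ids_search_set]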
I have this dict:
data_flights = {
"prices": [
{ "city": "Paris", "iataCode": "AAA", "lowestPrice": 54, "id": 2 },
{ "city": "Berlin", "iataCode": "BBB", "lowestPrice": 42, "id": 3 },
{ "city": "Tokyo", "iataCode": "CCC", "lowestPrice": 485, "id": 4 },
{ "city": "Sydney", "iataCode": "DDD", "lowestPrice": 551, "id": 5 },
],
"date": "31/03/2022"
}
Can I access an entry using a key value from one of the dicts, without using a for loop?
Something like this:
data_flights["prices"]["city" == "Berlin"]
You can achieve this by using either a comprehension or the filter built-in.
Comprehension:
[e for e in data_flights['prices'] if e['city'] == 'Berlin']
Filter:
list(filter(lambda e: e['city'] == 'Berlin', data_flights['prices']))
Both would result in:
[{'city': 'Berlin', 'iataCode': 'BBB', 'lowestPrice': 42, 'id': 3}]
You can use a list comprehension:
x = [a for a in data_flights["prices"] if a["city"] == "Berlin"]
>>> x
[{'city': 'Berlin', 'iataCode': 'BBB', 'lowestPrice': 42, 'id': 3}]
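If only the first matching entry is needed, next() with a generator expression stops at the first hit instead of building a full list; a small sketch (the None default avoids a StopIteration when nothing matches):

berlin = next((a for a in data_flights["prices"] if a["city"] == "Berlin"), None)
print(berlin)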
The data source is:
val spark = SparkSession.builder()
  .master("local[1,1]")
  .config("spark.sql.shuffle.partitions", "1")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()
spark.sparkContext.setLogLevel("error")
import spark.implicits._
val df=Seq(
("tom","America","2019"),
("jim","America","2019"),
("jack","America","2019"),
("tom","Russia","2019"),
("jim","Russia","2019"),
("jack","Russia","2019"),
("alex","Russia","2019"),
("tom","America","2018"),
("jim","America","2018"),
("tom","Germany","2018"),
("jim","England","2018")
).toDF("person","country","year")
I want to find which persons often go to the same countries in each year, and which countries they went to together, so what I expect is JSON like this:
[{
"year": "2019",
"items": [{
"persons": ["tom", "jim", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jack"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["tom", "jim"],
"common": ["America", "Russia"],
"times": 2
}, {
"persons": ["jack", "jim"],
"common": ["America", "Russia"],
"times": 2
}]
},
{
"year": "2018",
"items": [{
"persons": ["tom", "jim"],
"common": ["America"],
"times": 1
}]
}
]
I am not sure which model I should use. I tried frequent pattern mining (FP-Growth):
val df1 = df.where('year === 2019)
val rdd1 = df1.groupBy("country").agg(collect_set('person))
  .drop("country", "year")
  .as[Array[String]].rdd

val fpg = new FPGrowth()
  .setMinSupport(0.3)
  .setNumPartitions(10)

val schema = new StructType()
  .add(StructField("items", ArrayType(StringType)))
  .add(StructField("freq", LongType))

val model = fpg.run(rdd1)
val rdd2 = model.freqItemsets.map(itemset => Row(itemset.items, itemset.freq))
spark.createDataFrame(rdd2, schema).where(size('items) > 1).show()
Then the same is repeated for every year:
val df2 = df.where('year === 2018)
val rdd2 = df2.groupBy("country").agg(collect_set('person)).drop("country", "year")
  .as[Array[String]].rdd
....
val model = fpg.run(rdd2)
....
The result is :
for 2019
+----------------+----+
| items|freq|
+----------------+----+
| [jack, tom]| 2|
|[jack, tom, jim]| 2|
| [jack, jim]| 2|
| [tom, jim]| 2|
+----------------+----+
for 2018:
+----------+----+
| items|freq|
+----------+----+
|[tom, jim]| 1|
+----------+----+
But I cannot get when and where they went together, because the RDD I give to FPGrowth must be an RDD[Array[String]]; no extra columns are allowed.
Is there a better model? How can I achieve this?
I also want to know how many times each group of persons went together.
Maybe I should use collaborative filtering?
Just self-join and aggregate
import org.apache.spark.sql.functions._
df.alias("left")
.join(df.alias("right"), Seq("country", "year"))
.where($"left.person" < $"right.person")
.groupBy(array($"left.person", $"right.person").alias("persons"))
.agg(collect_set(struct($"country", $"year")).alias("common"))
Try this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy("country", "year")
df
  .withColumn("persons", collect_set('person) over window)
  .drop('person)
  .distinct()
  .groupBy('persons)
  .agg(collect_set(struct('country, 'year)).alias("common"))
Output (tested):
+----------+----------------------------------+
|persons |common |
+----------+----------------------------------+
|[jim, tom]|[[America, 2019], [Russia, 2019]] |
|[tom] |[[Germany, 2018], [America, 2018]]|
|[jim] |[[Russia, 2018], [England, 2018]] |
+----------+----------------------------------+
I'm developing a Web API, and I have a list from a query result that looks like this:
[{'ORG': 'Asset Management',
'SURVEY_DATE': datetime.date(2018, 4, 23),
'NOS': '1'},
{'ORG': 'Asset Management',
'SURVEY_DATE': datetime.date(2018, 5, 8),
'NOS': '1'},
{'ORG': 'Chief Advocacy Office',
'SURVEY_DATE': datetime.date(2018, 10, 31),
'NOS': '50'},
{'ORG': 'Chief Advocacy Office',
'SURVEY_DATE': datetime.date(2019, 2, 13),
'NOS': '1'},
{'ORG': 'Chief Information Office',
'SURVEY_DATE': datetime.date(2018, 1, 22),
'NOS': '1'},
{'ORG': 'Chief Information Office',
'SURVEY_DATE': datetime.date(2018, 2, 2),
'NOS': '1'}]
I tried converting it into a dataframe first and writing it like this:
df1 = df1.groupby('ORG').apply(lambda x: dict(zip(x['SURVEY_DATE'],x['NOS']))).to_dict()
but is there a way to do this without converting it to a dataframe?
And I want to format it into a dictionary like the one below for my response data:
{
"Asset Management": [
{
"date": "2019-03-30",
"numberOfSurveys": 76
},
{
"date": "2019-03-31",
"numberOfSurveys": 83
}
],
"Chief Advocacy Office": [
{
"date": "2019-03-30",
"numberOfSurveys": 50
},
{
"date": "2019-03-31",
"numberOfSurveys": 40
}
],
"Chief Information Office": [
{
"date": "2019-03-30",
"numberOfSurveys": 50
},
{
"date": "2019-03-31",
"numberOfSurveys": 40
}
]
}
Use collections.defaultdict():
import collections

# original_data is the list of query-result dicts shown above
result = collections.defaultdict(list)
for row in original_data:
    result[row['ORG']].append({
        'date': row['SURVEY_DATE'],
        'numberOfSurveys': row['NOS'],
    })
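The sample response above has dates as ISO strings and counts as integers, while the query rows hold datetime.date objects and numeric strings; assuming those types, a small adjustment of the same loop converts them on the way in:

result = collections.defaultdict(list)
for row in original_data:
    result[row['ORG']].append({
        'date': row['SURVEY_DATE'].isoformat(),  # datetime.date -> "YYYY-MM-DD"
        'numberOfSurveys': int(row['NOS']),       # "50" -> 50
    })
response = dict(result)  # plain dict, ready for JSON serialization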
I want to use the pptable and aeson libraries to render a JSON string as a table in the console.
The JSON string comes from an Elasticsearch query; it looks like this:
{
"hits": {
"hits": [
{
"_type": "tableName",
"_routing": "key",
"_source": {
"col1": 1,
"col2": 0,
"col3": "1",
"col4": "2",
"col5": 2824,
"col6": "2018-05-26 22:49:24"
},
"_score": 11.97,
"_index": "mysql_",
"_id": "9"
}
],
"total": 1,
"max_score": 11.97
},
"_shards": {
"successful": 30,
"failed": 0,
"total": 30
},
"took": 60,
"timed_out": false
}
And I want to display a table like this:
+------+------+------+------+------+---------------------+
| col1 | col2 | col3 | col4 | col5 | col6                |
+------+------+------+------+------+---------------------+
|    1 |    0 |    1 |    2 | 2824 | 2018-05-26 22:49:24 |
+------+------+------+------+------+---------------------+
I can parse the JSON string into an aeson Object and filter out the _source sub-object. But the Object type does not derive Generic, so I have no idea what to do with it.