Grouping in PySpark DataFrames

I am using Spark DataFrames.
The task is to count the cities in each country, grouped by country and region (subcountry), and display the counts in descending order.
Initial data:
from pyspark.sql.functions import col
from pyspark.sql.functions import count
df = spark.read.json("/content/world-cities.json")
df.printSchema()
df.show()
Desired result (screenshot not preserved): the number of cities per country and region, in descending order.
My code below groups only by the country column.
How can I add grouping by the second column, subcountry?
df.groupBy(col('country')).agg(count("*").alias("cnt"))\
.orderBy(col('cnt').desc())\
.show()

If I understand you correctly, you just need to add the second column to your groupBy:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),("USA","usa-subcountry", "usa-city-2"),("USA","usa-subcountry-2", "usa-city"), ("Argentina","argentina-subcountry", "argentina-city")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("cnt"))\
.orderBy(F.col('cnt').desc())\
.show()
Output is:
+---------+--------------------+---+
|  country|          subcountry|cnt|
+---------+--------------------+---+
|      USA|      usa-subcountry|  2|
|      USA|    usa-subcountry-2|  1|
|Argentina|argentina-subcountry|  1|
+---------+--------------------+---+
Edit: another try based on comment:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),
("USA","usa-subcountry", "usa-city-2"),
("USA","usa-subcountry", "usa-city-3"),
("USA","usa-subcountry-2", "usa-city"),
("Argentina","argentina-subcountry", "argentina-city"),
("Argentina","argentina-subcountry-2", "argentina-city-2"),
("UK","UK-subcountry", "UK-city-1")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("city_count"))\
.groupBy(F.col('country')).agg(F.count("*").alias("subcountry_count"), F.sum('city_count').alias("city_count"))\
.orderBy(F.col('city_count').desc())\
.show()
output:
+---------+----------------+----------+
|  country|subcountry_count|city_count|
+---------+----------------+----------+
|      USA|               2|         4|
|Argentina|               2|         2|
|       UK|               1|         1|
+---------+----------------+----------+
I am assuming that cities and subcountries are unique. If they are not, consider using countDistinct instead of count.

Related

Get the distinct elements of a column grouped by another column on a PySpark Dataframe

I have a PySpark DataFrame of ids and purchases which I'm trying to transform for use with FP-growth.
Currently I have multiple rows for a given id, with each row relating to a single purchase.
I'd like to transform this DataFrame so that it has two columns: one for id (with a single row per id) and a second containing a list of the distinct purchases for that id.
I've tried to use a User Defined Function (UDF) to map the distinct purchases onto the distinct ids, but I get a "py4j.Py4JException: Method __getstate__([]) does not exist". Thanks to @Mithril,
I see that "You can't use sparkSession object, spark.DataFrame object or other Spark distributed objects in udf and pandas_udf, because they are unpickled."
So I've implemented the TERRIBLE approach below (which works but is not scalable):
import pandas as pd
import pyspark.sql.functions as f
# Let's create some fake transactions
customers = [1, 2, 3, 1, 1]
purchases = ['cake', 'tea', 'beer', 'fruit', 'cake']
# Let's create a Spark DF to capture the transactions
transactions = zip(customers, purchases)
spk_df_1 = spark.createDataFrame(list(transactions), ["id", "item"])
# Register the DF as a temp view so it can be queried with spark.sql
spk_df_1.createOrReplaceTempView("TBLdf")
# Let's have a look at the resulting Spark DataFrame
spk_df_1.show()
# Let's capture the ids and the list of their distinct purchases in a
# list of tuples
nums1 = []
# For each distinct id, let's get the list of their distinct purchases
for id in spark.sql("SELECT DISTINCT id FROM TBLdf").rdd.map(lambda row: row[0]).collect():
    purchase = spk_df_1.filter(f.col("id") == id).select("item").distinct().rdd.map(lambda row: row[0]).collect()
    nums1.append((id, purchase))
# Let's see what our list of transaction tuples looks like
print(nums1)
print("\n")
# Let's turn the list of transaction tuples into a pandas DataFrame
df_pd = pd.DataFrame(nums1)
# Finally, let's turn our pandas DataFrame into a PySpark DataFrame
df2 = spark.createDataFrame(df_pd)
df2.show()
Output:
+---+-----+
| id| item|
+---+-----+
|  1| cake|
|  2|  tea|
|  3| beer|
|  1|fruit|
|  1| cake|
+---+-----+
[(1, ['fruit', 'cake']), (3, ['beer']), (2, ['tea'])]
+---+-------------+
|  0|            1|
+---+-------------+
|  1|[fruit, cake]|
|  3|       [beer]|
|  2|        [tea]|
+---+-------------+
If anybody has any suggestions I'd greatly appreciate it.
That is a task for collect_set, which creates a set of items without duplicates:
import pyspark.sql.functions as F
# Let's create some fake transactions
customers = [1, 2, 3, 1, 1]
purchases = ['cake', 'tea', 'beer', 'fruit', 'cake']
# Let's create a Spark DF to capture the transactions
transactions = zip(customers, purchases)
spk_df_1 = spark.createDataFrame(list(transactions), ["id", "item"])
spk_df_1.show()
# collect_set gathers the distinct items of each id into an array
spk_df_1.groupby('id').agg(F.collect_set('item')).show()
Output:
+---+-----+
| id| item|
+---+-----+
|  1| cake|
|  2|  tea|
|  3| beer|
|  1|fruit|
|  1| cake|
+---+-----+
+---+-----------------+
| id|collect_set(item)|
+---+-----------------+
|  1|    [fruit, cake]|
|  3|           [beer]|
|  2|            [tea]|
+---+-----------------+

Using a SQL window function inside PySpark

I have data like the example below. I'm trying to create a new column using PySpark that holds the category of each customer's first event, based on the timestamp, as in the example output below.
I have an example below of what I think would accomplish this using a window function in SQL.
I'm pretty new to PySpark. I understand you can run SQL inside PySpark. I'm wondering whether the code below correctly runs the SQL window function in PySpark; that is, whether I can just paste the SQL code inside spark.sql, as I have below.
Input:
eventid customerid category timestamp
1 3 a 1/1/12
2 3 b 2/3/14
4 2 c 4/1/12
Output:
eventid customerid category timestamp first_event
1 3 a 1/1/12 a
2 3 b 2/3/14 a
4 2 c 4/1/12 c
window function example:
select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table
# implementing window function example with pyspark
PySpark:
# Note: assume df is dataframe with structure of table above
# (df is table)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Operations").getOrCreate()
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("Table")
# Triple quotes allow the SQL string to span several lines
sql_results = spark.sql("""select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table""")
# display results
sql_results.show()
You can also use a window function through the PySpark DataFrame API:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>>
>>> df.show()
+-------+----------+--------+---------+
|eventid|customerid|category|timestamp|
+-------+----------+--------+---------+
|      1|         3|       a|   1/1/12|
|      2|         3|       b|   2/3/14|
|      4|         2|       c|   4/1/12|
+-------+----------+--------+---------+
>>> # Order the window by timestamp; without an ordering, first() is not deterministic
>>> window = Window.partitionBy('customerid').orderBy('timestamp')
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>>
>>> df.show()
+-------+----------+--------+---------+-----------+
|eventid|customerid|category|timestamp|first_event|
+-------+----------+--------+---------+-----------+
|      1|         3|       a|   1/1/12|          a|
|      2|         3|       b|   2/3/14|          a|
|      4|         2|       c|   4/1/12|          c|
+-------+----------+--------+---------+-----------+

Aggregating List of Dicts in Spark DataFrame

How can I perform aggregations and analysis on a column in a Spark DataFrame that was created from a column containing multiple dictionaries, such as the one below?
rootKey=[Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3')]
Here is an example of what the column looks like:
>>> df.select('column').show(20, False)
+-----------------------------------------------------------------+
|column |
+-----------------------------------------------------------------+
|[[1,1,1], [1,2,6], [1,2,13], [1,3,3]] |
|[[2,1,1], [2,3,6], [2,4,10]] |
|[[1,1,1], [1,1,6], [1,2,1], [2,2,2], [2,3,6], [1,3,7], [2,4,10]] |
For example, I would like to sum all of the key values, grouped by a different column.
You need f.explode:
json_file.json:
{"idx":1, "col":[{"k":1,"v1":1,"v2":1},{"k":1,"v1":2,"v2":6},{"k":1,"v1":2,"v2":13},{"k":1,"v1":2,"v2":2}]}
{"idx":2, "col":[{"k":2,"v1":1,"v2":1},{"k":2,"v1":3,"v2":6},{"k":2,"v1":4,"v2":10}]}
from pyspark.sql import functions as f
df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
df = df.withColumn('col', f.explode(df['col']))
df = df.groupBy(df['col']['v1']).sum('col.k')
df.show()
# output:
+---------+-----------------+
|col['v1']|sum(col.k AS `k`)|
+---------+-----------------+
|        1|                3|
|        3|                2|
|        2|                3|
|        4|                2|
+---------+-----------------+

Updating a column in pyspark dependent on the column current value

Let's say we are given a DataFrame:
+-----+-----+-----+
|    x|    y|    z|
+-----+-----+-----+
|    3|    5|    9|
|    2|    4|    6|
+-----+-----+-----+
I want to multiply the values in column z by the value in column y wherever z equals 6.
This post shows the solution I am aiming for, using the code
from pyspark.sql import functions as F
df = df.withColumn('z',
F.when(df['z']==6, df['z']*df['y']).
otherwise(df['z']))
The problem is that df['z'] and df['y'] are recognized as Column objects, and casting them doesn't work.
How can I do this correctly?
from pyspark.sql import functions as F
from pyspark.sql.types import LongType
df = df.withColumn('new_col',
F.when(df.z==6,
(df.z.cast(LongType()) * df.y.cast(LongType()))
).otherwise(df.z)
)

Change a columns values in dataframe pyspark

I have two DataFrames in Spark, train and test. Both have a categorical column, say Product_ID. I want to set the value to -1 for those categories that are in test but not present in train.
To do that, I first found the distinct categories that are in test but not in train (p_not_in_test below), but I am not able to proceed further. How can I do that?
p_not_in_test = test.select('Product_ID').subtract(train.select('Product_ID'))
p_not_in_test = p_not_in_test.distinct()
Regards
Here's a reproducible example, first we create dummy data:
test = sc.parallelize([("ID1", 1,5),("ID2", 2,4),
("ID3", 5,8),("ID4", 9,0),
("ID5", 0,3)]).toDF(["PRODUCT_ID", "val1", "val2"])
train = sc.parallelize([("ID1", 4,7),("ID3", 1,4),
("ID5", 9,2)]).toDF(["PRODUCT_ID", "val1", "val2"])
Now we need to extend your definition of p_not_in_test so we get a list as an output:
p_not_in_test = (test.select('PRODUCT_ID')
.subtract(train.select('PRODUCT_ID'))
.rdd.map(lambda x: x[0]).collect())
Finally, we can create a UDF that adds "-1" in front of each ID that is not present in train.
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
addString = udf(lambda x: '-1 ' + x if x in p_not_in_test else x, StringType())
test.withColumn("NEW_ID",addString(test["PRODUCT_ID"])).show()
+----------+----+----+------+
|PRODUCT_ID|val1|val2|NEW_ID|
+----------+----+----+------+
|       ID1|   1|   5|   ID1|
|       ID2|   2|   4|-1 ID2|
|       ID3|   5|   8|   ID3|
|       ID4|   9|   0|-1 ID4|
|       ID5|   0|   3|   ID5|
+----------+----+----+------+
