PySpark: how to group row-based values from a data frame - apache-spark

I need to group row-based values against each index from the data frame below:
+-----+------+------+------+------+------+------+
|index|amount| dept | date |amount| dept | date |
+-----+------+------+------+------+------+------+
|    1|  1000| acnt |2-4-21|  2000| acnt2|2-4-21|
|    2|  1500| sales|2-3-21|  1600|sales2|2-3-21|
+-----+------+------+------+------+------+------+
Since index is unique to each row and the dates are the same, I need to group the row values as below:
+-----+---------+------------+------+
|index|   amount|        dept|  date|
+-----+---------+------------+------+
|    1|1000,2000|  acnt,acnt2|2-4-21|
|    2|1500,1600|sales,sales2|2-3-21|
+-----+---------+------------+------+
I see many options for grouping columns, but nothing specifically for row-based values in PySpark.
Is there any solution to produce the result above?

Ideally this needs to be fixed upstream: check whether your upstream code contains joins, and select only the appropriately aliased columns so that only uniquely named columns are retained (a sketch of that fix follows).
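For example, a minimal sketch of the upstream fix, assuming the duplicate columns come from a join of two hypothetical frames df1 and df2 on index (the names here are illustrative, not from the question):
from pyspark.sql import functions as F

# Hypothetical upstream join; df1/df2 and the *_a/*_b aliases are assumptions.
joined = (df1.alias("a")
          .join(df2.alias("b"), on="index", how="inner")
          .select("index",
                  F.col("a.amount").alias("amount_a"),
                  F.col("b.amount").alias("amount_b"),
                  F.col("a.dept").alias("dept_a"),
                  F.col("b.dept").alias("dept_b"),
                  F.col("a.date")))
# Every column name is now unique, so no downstream deduplication is needed.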
That said, if you cannot change the upstream code, you can create a helper Spark function after building a helper dictionary and a list of deduplicated column names:
from pyspark.sql import functions as F
from itertools import groupby
Create a fresh list with a counter:
l = []
s = {}
for i in df.columns:
    l.append(f"{i}_{s.get(i)}" if i in s else i)
    s[i] = s.get(i, 0) + 1
# l is now: ['index', 'amount', 'dept', 'date', 'amount_1', 'dept_1', 'date_1']
Then use this new list to rename the dataframe's columns, and select through a helper function that concatenates the columns whose names collide:
def mysparkfunc(cols):
    cols = [list(v) for k, v in groupby(sorted(cols), lambda x: x.split("_")[0])]
    return [F.concat_ws(",", *col).alias(col[0])
            if len(col) > 1 and col[0] != 'date'
            else F.col(col[0])
            for col in cols]

df.toDF(*l).select(*mysparkfunc(l)).show()
+---------+------+------------+-----+
| amount| date| dept|index|
+---------+------+------------+-----+
|1000,2000|2-4-21| acnt,acnt2| 1|
|1500,1600|2-3-21|sales,sales2| 2|
+---------+------+------------+-----+
Full Code:
from pyspark.sql import functions as F
from itertools import groupby

l = []
s = {}
for i in df.columns:
    l.append(f"{i}_{s.get(i)}" if i in s else i)
    s[i] = s.get(i, 0) + 1

def mysparkfunc(cols):
    cols = [list(v) for k, v in groupby(sorted(cols), lambda x: x.split("_")[0])]
    return [F.concat_ws(",", *col).alias(col[0])
            if len(col) > 1 and col[0] != 'date'
            else F.col(col[0])
            for col in cols]

df.toDF(*l).select(*mysparkfunc(l)).show()

Let's say you have an initial data frame as shown below.
INPUT:
+------+------+------+------+
|  dept|  dept|amount|amount|
+------+------+------+------+
|sales1|sales2|     1|     1|
|sales1|sales2|     2|     2|
|sales1|sales2|     3|     3|
|sales1|sales2|     4|     4|
|sales1|sales2|     5|     5|
+------+------+------+------+
Rename the columns:
newColumns = ["dept1","dept2","amount1","amount2"]
new_clms_df = df.toDF(*newColumns)
new_clms_df.show()
+------+------+-------+-------+
| dept1| dept2|amount1|amount2|
+------+------+-------+-------+
|sales1|sales2| 1| 1|
|sales1|sales2| 2| 2|
|sales1|sales2| 3| 3|
|sales1|sales2| 4| 4|
|sales1|sales2| 5| 5|
+------+------+-------+-------+
Derive the final output columns:
from pyspark.sql.functions import concat_ws

final_df = new_clms_df \
    .withColumn('dept', concat_ws(',', new_clms_df['dept1'], new_clms_df['dept2'])) \
    .withColumn('amount', concat_ws(',', new_clms_df['amount1'], new_clms_df['amount2']))
final_df.show()
+------+------+-------+-------+-------------+------+
| dept1| dept2|amount1|amount2| dept|amount|
+------+------+-------+-------+-------------+------+
|sales1|sales2| 1| 1|sales1,sales2| 1,1|
|sales1|sales2| 2| 2|sales1,sales2| 2,2|
|sales1|sales2| 3| 3|sales1,sales2| 3,3|
|sales1|sales2| 4| 4|sales1,sales2| 4,4|
|sales1|sales2| 5| 5|sales1,sales2| 5,5|
+------+------+-------+-------+-------------+------+

There are two ways, depending on what you want:
from pyspark.sql.functions import struct, array, col

df = df.withColumn('amount', struct(col('amount1'), col('amount2')))  # struct column
df = df.withColumn('amount', array(col('amount1'), col('amount2')))   # array column
If there are two columns with the same name (like in your example), just recreate your df with unique names; if the duplicates come from a join, there is no need to recreate it, just use aliases. A rough end-to-end sketch follows the snippet below:
cols = ['index','amount1','dept', 'amount2', 'dept2', 'date']
df = df.toDF(*cols)
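As an untested sketch (not from the original answer), once the columns have unique names you could build the comma-joined output the question asked for with concat_ws, reusing the column names from the cols list above:
from pyspark.sql.functions import concat_ws, col

# Sketch only: assumes the renamed columns from the cols list above.
result = (df
          .withColumn('amount', concat_ws(',', col('amount1'), col('amount2')))
          .withColumn('dept', concat_ws(',', col('dept'), col('dept2')))
          .select('index', 'amount', 'dept', 'date'))
result.show()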

Related

Merge two columns in a single DataFrame and count the occurrences using PySpark

I have two columns in my DataFrame, name1 and name2.
I want to combine them and count the occurrences of each name (excluding null/empty values!).
df = spark.createDataFrame([
["Luc Krier","Jeanny Thorn"],
["Jeanny Thorn","Ben Weller"],
[ "Teddy E Beecher","Luc Krier"],
["Philippe Schauss","Jeanny Thorn"],
["Meindert I Tholen","Liam Muller"],
["Meindert I Tholen",""]
]).toDF("name1", "name2")
Desired result:
+-----------------+----------+
|name             |Occurrence|
+-----------------+----------+
|Luc Krier        |2         |
|Jeanny Thorn     |3         |
|Teddy E Beecher  |1         |
|Philippe Schauss |1         |
|Meindert I Tholen|2         |
|Liam Muller      |1         |
|Ben Weller       |1         |
+-----------------+----------+
How can I achieve this?
You can use explode with the array function to merge the columns into one, then simply group by and count, like this:
from pyspark.sql.functions import col, array, explode, count
df.select(explode(array("name1", "name2")).alias("name")) \
.filter("nullif(name, '') is not null") \
.groupBy("name") \
.agg(count("*").alias("Occurrence")) \
.show()
#+-----------------+----------+
#| name|Occurrence|
#+-----------------+----------+
#|Meindert I Tholen| 2|
#| Jeanny Thorn| 3|
#| Luc Krier| 2|
#| Teddy E Beecher| 1|
#|Philippe Schauss| 1|
#| Ben Weller| 1|
#| Liam Muller| 1|
#+-----------------+----------+
Another way is to select each column, union then group by and count:
df.select(col("name1").alias("name")).union(df.select(col("name2").alias("name"))) \
.filter("nullif(name, '') is not null")\
.groupBy("name") \
.agg(count("name").alias("Occurrence")) \
.show()
Many fancy answers out there, but the easiest solution should be to do a union and then aggregate the count:
df2 = (df.select('name1')
.union(df.select('name2'))
.filter("name1 != ''")
.groupBy('name1')
.count()
.toDF('name', 'Occurrence')
)
df2.show()
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
|Meindert I Tholen| 2|
| Jeanny Thorn| 3|
| Luc Krier| 2|
| Teddy E Beecher| 1|
|Philippe Schauss| 1|
| Ben Weller| 1|
| Liam Muller| 1|
+-----------------+----------+
There are better ways to do it; one naive way of doing it is as follows:
from collections import Counter
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("OccurenceCount").getOrCreate()
df = spark.createDataFrame([
["Luc Krier","Jeanny Thorn"],
["Jeanny Thorn","Ben Weller"],
[ "Teddy E Beecher","Luc Krier"],
["Philippe Schauss","Jeanny Thorn"],
["Meindert I Tholen","Liam Muller"],
["Meindert I Tholen",""]
]).toDF("name1", "name2")
counter_dict = dict(Counter(df.select("name1", "name2").rdd.flatMap(lambda x: x).collect()))
counter_list = list(map(list, counter_dict.items()))
frequency_df = spark.createDataFrame(counter_list, ["name", "Occurrence"])
frequency_df.show()
Output:
+-----------------+----------+
| name|Occurrence|
+-----------------+----------+
| | 1|
| Liam Muller| 1|
| Teddy E Beecher| 1|
| Ben Weller| 1|
| Jeanny Thorn| 3|
| Luc Krier| 2|
|Philippe Schauss| 1|
|Meindert I Tholen| 2|
+-----------------+----------+
Does this work?
# Groupby & count both dataframes individually to reduce size.
df_name1 = (df.groupby(['name1']).count()
.withColumnRenamed('name1', 'name')
.withColumnRenamed('count', 'count1'))
df_name2 = (df.groupby(['name2']).count()
.withColumnRenamed('name2', 'name')
.withColumnRenamed('count', 'count2'))
# Join the two dataframes containing frequency counts
# Any null value in the 'count' column can be correctly interpreted as zero.
df_count = (df_name1.join(df_name2, on=['name'], how='outer')
.fillna(0, subset=['count1', 'count2']))
# Sum the two counts and drop the useless columns
df_count = (df_count.withColumn('count', df_count['count1'] + df_count['count2'])
.drop('count1').drop('count2').dropna(subset=['name']))
# (Optional) While any rows with a null name have been removed, rows with an
# empty string ("") for a name are still there. We can drop the empty name
# rows like this.
df_count = df_count[df_count['name'] != '']
df_count.show()
# +-----------------+-----+
# | name|count|
# +-----------------+-----+
# |Meindert I Tholen| 2|
# | Jeanny Thorn| 3|
# | Luc Krier| 2|
# | Teddy E Beecher| 1|
# |Philippe Schauss| 1|
# | Ben Weller| 1|
# | Liam Muller| 1|
# +-----------------+-----+
You can get the required output as follows in Scala:
import org.apache.spark.sql.functions._

val df = Seq(
  ("Luc Krier", "Jeanny Thorn"),
  ("Jeanny Thorn", "Ben Weller"),
  ("Teddy E Beecher", "Luc Krier"),
  ("Philippe Schauss", "Jeanny Thorn"),
  ("Meindert I Tholen", "Liam Muller"),
  ("Meindert I Tholen", "")
).toDF("name1", "name2")

val df1 = df.filter($"name1".isNotNull).filter($"name1" =!= "")
  .groupBy("name1").agg(count("name1").as("count1"))
val df2 = df.filter($"name2".isNotNull).filter($"name2" =!= "")
  .groupBy("name2").agg(count("name2").as("count2"))

val newdf = df1.join(df2, $"name1" === $"name2", "outer")
  .withColumn("count1", when($"count1".isNull, 0).otherwise($"count1"))
  .withColumn("count2", when($"count2".isNull, 0).otherwise($"count2"))
  .withColumn("Count", $"count1" + $"count2")

val finalDF = newdf
  .withColumn("name", when($"name1".isNull, $"name2").when($"name2".isNull, $"name1").otherwise($"name1"))
  .select("name", "Count")

display(finalDF)  // display is Databricks-specific; use finalDF.show() elsewhere
finalDF then contains each distinct name with its total Count.

How to create an index column, a 2 window moving average column and a 2 window difference column with groupby in pyspark

I have a spark dataframe that looks like this
import pandas as pd
dfs = pd.DataFrame({'country':['a','a','a','a','b','b'], 'value':[1,2,3,4,5,6], 'id':[3,5,4,6, 8,7]})
I would like to add 3 new columns in this dataframe.
An index that starts from 1 and increases for each row, by country
A 2 window difference of the value column by country, ordered by id
A 2 window moving average of the value column by country, ordered by id
Any ideas how I can do that in one go?
EDIT
The difference column should be [1,2,-1,2,6,-1] and it is calculated as follows:
The rows are ordered by id. The first row for each country remains unchanged; then for the second row of country a it is 3-1=2, for the third row it is 2-3=-1, and so on.
You can use the rowsBetween window spec with window functions.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Test data
dfs = sqlContext.createDataFrame(
    [('a', 1, 3), ('a', 2, 5), ('a', 3, 4), ('a', 4, 6), ('b', 5, 8), ('b', 6, 7)],
    schema=['country', 'value', 'id'])

# First window to calculate the index and the difference in values
w = Window.partitionBy('country').orderBy('id')

# use row_number() for the index and lag() for the difference
df_id = (dfs.withColumn("id", F.row_number().over(w))
            .withColumn("delta", F.col('value') - F.lag('value', default=0).over(w)))

# Second window (2 rows wide: previous row and current row) for the moving average and sum
w1 = Window.partitionBy('country').orderBy('id').rowsBetween(-1, 0)

df = (df_id.withColumn("movingaverage", F.mean('value').over(w1))
           .withColumn("moving_sum", F.sum('value').over(w1)))

# Additional calculation, not requested by the author
df_res = df.withColumn("moving_difference", F.col('value') - F.col("moving_sum"))
The results
df_res.show()
+-------+-----+---+-----+-------------+----------+-----------------+
|country|value| id|delta|movingaverage|moving_sum|moving_difference|
+-------+-----+---+-----+-------------+----------+-----------------+
| a| 1| 1| 1| 1.0| 1| 0|
| a| 3| 2| 2| 2.0| 4| -1|
| a| 2| 3| -1| 2.5| 5| -3|
| a| 4| 4| 2| 3.0| 6| -2|
| b| 6| 1| 6| 6.0| 6| 0|
| b| 5| 2| -1| 5.5| 11| -6|
+-------+-----+---+-----+-------------+----------+-----------------+

How to do a conditional aggregation after a groupby in pyspark dataframe?

I'm trying to group by an ID column in a pyspark dataframe and sum a column depending on the value of another column.
To illustrate, consider the following dummy dataframe:
+-----+-------+---------+
| ID| type| amount|
+-----+-------+---------+
| 1| a| 55|
| 2| b| 1455|
| 2| a| 20|
| 2| b| 100|
| 3| null| 230|
+-----+-------+---------+
My desired output is:
+-----+--------+----------+----------+
| ID| sales| sales_a| sales_b|
+-----+--------+----------+----------+
| 1| 55| 55| 0|
| 2| 1575| 20| 1555|
| 3| 230| 0| 0|
+-----+--------+----------+----------+
So basically, sales will be the sum of amount, while sales_a and sales_b are the sum of amount when type is a or b respectively.
For sales, I know this could be done like this:
from pyspark.sql import functions as F
df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))
For the others, I'm guessing F.when would be useful but I'm not sure how to go about it.
You could create two columns before the aggregation, based on the value of type:
df.withColumn("sales_a", F.when(col("type") == "a", col("amount"))) \
.withColumn("sales_b", F.when(col("type") == "b", col("amount"))) \
.groupBy("ID") \
.agg(F.sum("amount").alias("sales"),
F.sum("sales_a").alias("sales_a"),
F.sum("sales_b").alias("sales_b"))
from pyspark.sql import functions as F

# keep the original df intact so we can still filter on type below
sales_df = df.groupBy("ID").agg(F.sum("amount").alias("sales"))

dfPivot = df.filter("type is not null").groupBy("ID").pivot("type").agg(F.sum("amount").alias("sales"))

res = sales_df.join(dfPivot, on="ID", how="left")
Then replace null with 0 (a sketch of that step follows below).
This is a generic solution that works irrespective of the values in the type column: if a type c is added to the dataframe, it will simply create a column c.
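A rough sketch of those final steps (not verbatim from the answer), assuming the sales_df/dfPivot names above and that the pivoted columns are named after the type values a and b:
# Rename the pivoted columns to match the desired output, then fill the
# nulls produced by the pivot/left join with 0.
res = (res.withColumnRenamed("a", "sales_a")
          .withColumnRenamed("b", "sales_b")
          .na.fill(0))
res.show()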

Pyspark: Dropping columns with no distinct values only using transformations [duplicate]

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks
You can apply the countDistinct() aggregation function to each column to get the count of distinct values per column. A column with count=1 has only one value across all rows.
from pyspark.sql.functions import countDistinct, col

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# select the cols with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
# drop the selected columns
df.drop(*cols_to_drop).show()
You can use the approx_count_distinct function to count the number of distinct elements in a column. If there is just one distinct value, remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting the result into a dictionary:
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
# Storing those keys (columns) that have just 1 distinct value in a list.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns having only one distinct value:
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+

Aggregating List of Dicts in Spark DataFrame

How can I perform aggregations and analysis on a column in a Spark DataFrame that was created from a column containing multiple dictionaries, such as the below:
rootKey=[Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3')]
Here is an example of what the column looks like:
>>> df.select('column').show(20, False)
+-----------------------------------------------------------------+
|column |
+-----------------------------------------------------------------+
|[[1,1,1], [1,2,6], [1,2,13], [1,3,3]] |
|[[2,1,1], [2,3,6], [2,4,10]] |
|[[1,1,1], [1,1,6], [1,2,1], [2,2,2], [2,3,6], [1,3,7], [2,4,10]] |
An example would be to summarize all of the key values and groupBy a different column.
You need f.explode:
json_file.json:
{"idx":1, "col":[{"k":1,"v1":1,"v2":1},{"k":1,"v1":2,"v2":6},{"k":1,"v1":2,"v2":13},{"k":1,"v1":2,"v2":2}]}
{"idx":2, "col":[{"k":2,"v1":1,"v2":1},{"k":2,"v1":3,"v2":6},{"k":2,"v1":4,"v2":10}]}
from pyspark.sql import functions as f
df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
df = df.withColumn('col', f.explode(df['col']))
df = df.groupBy(df['col']['v1']).sum('col.k')
df.show()
# output:
+---------+-----------------+
|col['v1']|sum(col.k AS `k`)|
+---------+-----------------+
| 1| 3|
| 3| 2|
| 2| 3|
| 4| 2|
+---------+-----------------+
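The question also mentions grouping by a different column; as an untested sketch along the same lines (reusing the same json_file.json, with the v2_total name chosen just for illustration), you could group by idx after the explode:
from pyspark.sql import functions as f

df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
exploded = df.withColumn('col', f.explode(df['col']))
# sum one of the struct fields (v2 here) for each original row
exploded.groupBy('idx').agg(f.sum('col.v2').alias('v2_total')).show()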
