groupby category and sum the count - apache-spark

Let's say I have a table (df) like so:
type count
A 5000
B 5000
C 200
D 123
... ...
... ...
Z 453
How can I sum the column count by type, so that A and B stay separate and all other types fall into an Others category?
I currently have this:
df = df.withColumn('type', when(col("type").isnot("A", "B"))
My expected output would be like so:
type count
A 5000
B 5000
Other 3043

You want to group by a when expression and sum the count:
from pyspark.sql import functions as F
df1 = df.groupBy(
    F.when(
        F.col("type").isin("A", "B"), F.col("type")
    ).otherwise("Others").alias("type")
).agg(
    F.sum("count").alias("count")
)
df1.show()
#+------+-----+
#| type|count|
#+------+-----+
#| B| 5000|
#| A| 5000|
#|Others| 776|
#+------+-----+
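If you prefer your original withColumn approach, a minimal two-step sketch (normalize the type column first, then do a plain groupBy) would be equivalent, assuming the same df:
from pyspark.sql import functions as F
df1 = df.withColumn(
    "type",
    F.when(F.col("type").isin("A", "B"), F.col("type")).otherwise("Others")
).groupBy("type").agg(F.sum("count").alias("count"))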

You can divide the dataframe into two parts based on the type, aggregate a sum for the second part, and do a unionAll to combine them.
import pyspark.sql.functions as F
result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other'), F.sum('count'))
)
result.show()
+-----+-----+
| type|count|
+-----+-----+
| A| 5000|
| B| 5000|
|Other| 776|
+-----+-----+
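Note that unionAll matches columns by position, so the result keeps the type/count names of the first part. If you want the aggregated part to carry explicit names as well, a small variation (same logic, just with aliases) could be:
import pyspark.sql.functions as F
result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other').alias('type'), F.sum('count').alias('count'))
)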

Related

How to remove all the subset from a column except few based on the other column in Pyspark?

I have a pyspark data frame with a column of lists (column a) and another column with numbers (column b). I want to retain all superset rows, and also those subsets that have a greater value in column b than their supersets.
For Example,
Input data frame:
Column a = ([A,B,C],[A,C],[B,C],[J,S,K],[J,S],[J,K])
Column b = (10,15,7,8,9,8)
Expected Outcome:
Column a = ([A,B,C],[A,C],[J,S,K],[J,S])
Column b = (10,15,8,9)
Here [B,C] and [A,C] are subsets of [A,B,C], but we only retain [A,C] because this subset has 15 in column b, which is greater than 10, the column b value of its superset [A,B,C].
Similarly, the superset [J,S,K] is retained along with its subset [J,S], because the subset's value in column b is greater than the superset's column b value.
You can use a self left_anti join to filter out the rows that satisfy that condition (subsets whose column b value is not greater than their superset's).
You'll need an ID column in your dataframe; here I'm using the monotonically_increasing_id function to generate an ID for each row:
import pyspark.sql.functions as F
df = df.withColumn("ID", F.monotonically_increasing_id())
df.show()
#+---------+---+-----------+
#| a| b| ID|
#+---------+---+-----------+
#|[A, B, C]| 10| 8589934592|
#| [A, C]| 15|17179869184|
#| [B, C]| 7|25769803776|
#|[J, S, K]| 8|42949672960|
#| [J, S]| 9|51539607552|
#| [J, K]| 8|60129542144|
#+---------+---+-----------+
Now, to verify that an array arr1 is a subset of another array arr2, you can use the array_intersect and size functions: size(array_intersect(arr1, arr2)) = size(arr1):
df_result = df.alias("df1").join(
    df.alias("df2"),
    (
        (F.size(F.array_intersect("df1.a", "df2.a")) == F.size("df1.a"))
        & (F.col("df1.b") <= F.col("df2.b"))
        & (F.col("df1.ID") != F.col("df2.ID"))  # not the same row
    ),
    "left_anti"
).drop("ID")
df_result.show()
#+---------+---+
#| a| b|
#+---------+---+
#|[A, B, C]| 10|
#| [A, C]| 15|
#|[J, S, K]| 8|
#| [J, S]| 9|
#+---------+---+
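As a side note, an equivalent subset test can be written with array_except: arr1 is a subset of arr2 when array_except(arr1, arr2) is empty. A minimal sketch of the same join with that condition (assuming, as in the sample data, that the arrays have no duplicates that matter):
import pyspark.sql.functions as F
subset_cond = (
    (F.size(F.array_except("df1.a", "df2.a")) == 0)   # every element of df1.a is in df2.a
    & (F.col("df1.b") <= F.col("df2.b"))
    & (F.col("df1.ID") != F.col("df2.ID"))            # not the same row
)
df_result = df.alias("df1").join(df.alias("df2"), subset_cond, "left_anti").drop("ID")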

Populate a column based on previous value and row Pyspark

I have a spark dataframe with 5 columns group, date, a, b, and c and I want to do the following:
given df
group date a b c
a 2018-01 2 3 10
a 2018-02 4 5 null
a 2018-03 2 1 null
expected output
group date a b c
a 2018-01 2 3 10
a 2018-02 4 5 10*3+2=32
a 2018-03 2 1 32*5+4=164
For each group, calculate c as b * c + a and use the output as the c of the next row.
I tried using lag and window functions but couldn't find the right way to do this.
Within a window you cannot access the results of a column that you are currently calculating; that would force Spark to do the calculations sequentially and should be avoided. Another approach is to transform the recursive calculation c_n = func(c_(n-1)) into a closed formula that only uses the (constant) values of a, b and the first value of c:
c_n = c0 * b_1 * ... * b_(n-1) + sum over i = 1..n-1 of a_i * b_(i+1) * ... * b_(n-1)
where c0 is the first c of the group and row n is the current row. All input values for this formula can be collected with a window, and the formula itself is implemented as a UDF:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window
df = ...
w = Window.partitionBy('group').orderBy('date')
df1 = df.withColumn("la", F.collect_list("a").over(w)) \
    .withColumn("lb", F.collect_list("b").over(w)) \
    .withColumn("c0", F.first("c").over(w))

import numpy as np

def calc_c(c0, a, b):
    # closed form of the recursion c_new = b * c + a, evaluated from the first c (c0)
    # and the lists of a and b values collected up to the current row
    if c0 is None:
        return 0.0
    if len(a) == 1:
        return float(c0)
    e1 = c0 * np.prod(b[:-1])                 # c0 * b_1 * ... * b_(n-1)
    e2 = 0.0
    for i, an in enumerate(a[:-1]):
        e2 = e2 + an * np.prod(b[i+1:-1])     # a_i * b_(i+1) * ... * b_(n-1)
    return float(e1 + e2)

calc_c_udf = F.udf(calc_c, T.DoubleType())

df1.withColumn("result", calc_c_udf("c0", "la", "lb")) \
    .show()
Output:
+-----+-------+---+---+----+---------+---------+---+------+
|group| date| a| b| c| la| lb| c0|result|
+-----+-------+---+---+----+---------+---------+---+------+
| a|2018-01| 2| 3| 10| [2]| [3]| 10| 10.0|
| a|2018-02| 4| 5|null| [2, 4]| [3, 5]| 10| 32.0|
| a|2018-03| 2| 1|null|[2, 4, 2]|[3, 5, 1]| 10| 164.0|
+-----+-------+---+---+----+---------+---------+---+------+
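As an alternative sketch, not part of the original answer: on Spark 2.4+ the same fold can be expressed with the aggregate SQL higher-order function, which replays the recursion c_new = b * c + a directly and avoids the Python UDF. It reuses the la, lb and c0 columns built above:
from pyspark.sql import functions as F
df1.withColumn(
    "result",
    F.expr("""
        CASE WHEN size(la) = 1 THEN cast(c0 AS double)
        ELSE aggregate(
            sequence(0, size(la) - 2),            -- indices of all previous rows
            cast(c0 AS double),                   -- start from the first c
            (acc, i) -> acc * lb[i] + la[i]       -- c_new = b * c + a
        ) END
    """)
).show()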

How can I create all pairwise combinations of multiple columns in a Pyspark Dataframe?

Consider the following Pyspark dataframe
Col1 Col2 Col3
A    D    G
B    E    H
C    F    I
How can I create the following dataframe which has all pairwise combinations of all the columns?
Col1 Col2 Col3 Col1_Col2_cross Col1_Col3_cross Col2_Col3_cross
A    D    G    A,D             A,G             D,G
B    E    H    B,E             B,H             E,H
C    F    I    C,F             C,I             F,I
You can generate column combinations using itertools:
import pyspark.sql.functions as F
import itertools
df2 = df.select(
    '*',
    *[F.concat_ws(',', x[0], x[1]).alias(x[0] + '_' + x[1] + '_cross')
      for x in itertools.combinations(df.columns, 2)]
)
df2.show()
+----+----+----+---------------+---------------+---------------+
|Col1|Col2|Col3|Col1_Col2_cross|Col1_Col3_cross|Col2_Col3_cross|
+----+----+----+---------------+---------------+---------------+
| A| D| G| A,D| A,G| D,G|
| B| E| H| B,E| B,H| E,H|
| C| F| I| C,F| C,I| F,I|
+----+----+----+---------------+---------------+---------------+
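The same pattern generalizes to larger tuples; for example, a sketch of three-column combinations (only the combination size changes, assuming the same df):
import pyspark.sql.functions as F
import itertools
df3 = df.select(
    '*',
    *[F.concat_ws(',', *cols).alias('_'.join(cols) + '_cross')
      for cols in itertools.combinations(df.columns, 3)]
)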

Compare each spark dataframe element with all the rest of same dataframe

I'm looking for an efficient way of applying some map function to each pair of elements in a dataframe, e.g.
records = spark.createDataFrame(
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')], \
['id', 'val'])
records.show()
+---+---+
| id|val|
+---+---+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
+---+---+
I want to take values a, b, c, d and compare each of them with all the rest:
a -> b
a -> c
a -> d
b -> c
b -> d
c -> d
By comparison I mean custom function that takes those 2 values and calculates some similarity index between them.
Could you suggest an efficient way to perform this calculation, assuming the input dataframe could contain tens of millions of elements?
Spark version 2.4.6 (AWS emr-5.31.0), using EMR notebook with pyspark
Collect the val column values into a lookup column, then compare each value in the val column with the lookup array.
Check the code below.
import pyspark.sql.functions as F

(records
    .select(F.collect_list(F.struct(F.col("id"), F.col("val"))).alias("data"),
            F.collect_list(F.col("val")).alias("lookup"))
    .withColumn("data", F.explode(F.col("data")))
    .select("data.*", F.expr("filter(lookup, v -> v != data.val)").alias("lookup"))
    # .withColumn("compare", F.expr("transform(lookup, v -> val [.....] )"))  # maybe you can add your logic in this -> [.....]
    .show())
+---+---+---------+
| id|val| lookup|
+---+---+---------+
| 1| a|[b, c, d]|
| 2| b|[a, c, d]|
| 3| c|[a, b, d]|
| 4| d|[a, b, c]|
+---+---+---------+
This is a cross join operation with a collect_list aggregation. If you want a's matches list to contain only [b, c, d], you should apply that filter before doing the collect_list.
result = (records.alias("lhs")
    .crossJoin(records.alias("rhs"))           # self cross join against the same dataframe
    .filter("lhs.val != rhs.val")
    .groupBy("lhs.id", "lhs.val")              # group by all columns of the left side
    .agg(F.collect_list("rhs.val").alias("lookup")))
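Either way, the actual pair comparison can then be applied by exploding the lookup array and calling your similarity function as a UDF. A minimal sketch, assuming the lookup-style output above (here called result) and a hypothetical similarity(left, right) function of your own; if you only want each unordered pair once, also filter with something like "lhs.val < rhs.val" before collecting:
import pyspark.sql.functions as F
from pyspark.sql import types as T

def similarity(left, right):
    # hypothetical placeholder: replace with your own similarity index
    return 1.0 if left == right else 0.0

similarity_udf = F.udf(similarity, T.DoubleType())

pairs = (result
    .select("id", "val", F.explode("lookup").alias("other"))
    .withColumn("similarity", similarity_udf("val", "other")))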

Join two dataframes in pyspark by one column

I have two dataframes that I need to join by one column, taking just the rows from the first dataframe whose id is contained in the same column of the second dataframe:
df1:
id a b
2 1 1
3 0.5 1
4 1 2
5 2 1
df2:
id c d
2 fs a
5 fa f
Desired output:
df:
id a b
2 1 1
5 2 1
I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.
df2("id") is not a valid python syntax for selecting columns, you'd either need df2[["id"]] or use select df2.select("id"); For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
If you need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is the more efficient solution (this is similar to EXISTS and IN in SQL).
df1 = spark.createDataFrame([(2, 1, 1), (3, 5, 1), (4, 1, 2), (5, 2, 1)], "id: int, a: int, b: int")
df2 = spark.createDataFrame([(2, 'fs', 'a'), (5, 'fa', 'f')], ['id', 'c', 'd'])
Collect df2.id as a list and pass it to df1 under isin():
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() if:
- You don't need to return data from the reference dataframe/table
- You have duplicates in the reference dataframe/table (a JOIN can cause duplicate rows if values are repeated)
- You just want to check the existence of a particular value
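If you'd rather not collect the ids to the driver (for example when df2 is large), a left semi join gives the same existence-check semantics entirely on the executors; a minimal sketch:
# keeps only df1 rows whose id appears in df2, without bringing any df2 columns along
df1.join(df2, "id", "left_semi").show()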
