How to intersect/union pyspark dataframes with different values - apache-spark

I have one data frame (this is the overall data frame) with zero and non-zero values:
+---+-----+
|key|value|
+---+-----+
| a| 0.5|
| b| 0.4|
| c| 0.5|
| d| 0.3|
| x| 0.0|
| y| 0.0|
| z| 0.0|
+---+-----+
and the second dataframe is the bad output (it should contain only zeros):
+---+-----+
|key|value|
+---+-----+
| a| 0.0|
| e| 0.0|
| f| 0.0|
| g| 0.0|
+---+-----+
Note: the value of `a` has changed.
How do I write my script so that I get the following output for my second dataframe? (Only zeros, and since the value of `a` is non-zero in the good/overall data frame, I want to remove it from the bad one.)
+---+-----+
|key|value|
+---+-----+
| e| 0.0|
| f| 0.0|
| g| 0.0|
| x| 0.0|
| y| 0.0|
| z| 0.0|
+---+-----+

Keys whose overall value is non-zero can be removed from the bad output with except, and keys whose overall value is zero can be added with union (Scala):
import org.apache.spark.sql.functions.lit
import spark.implicits._

val overall = Seq(
  ("a", 0.5),
  ("b", 0.4),
  ("c", 0.5),
  ("d", 0.3),
  ("x", 0.0),
  ("y", 0.0),
  ("z", 0.0)
).toDF("key", "value")

val badOutput = Seq(
  ("a", 0.0),
  ("e", 0.0),
  ("f", 0.0),
  ("g", 0.0)
).toDF("key", "value")

// drop keys that are non-zero overall (value forced to 0.0 so the rows match),
// then add the keys whose overall value is zero
badOutput
  .except(overall.where($"value" =!= 0).withColumn("value", lit(0.0)))
  .union(overall.where($"value" === 0))
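Since the question is tagged pyspark, here is a rough PySpark equivalent of the same except/union idea (a sketch, not from the original answer; it assumes DataFrames named overall and bad_output with the same key/value columns and an active spark session):
from pyspark.sql import functions as F

# drop keys that are non-zero overall (value forced to 0.0 so the rows line up),
# then add the keys whose overall value is zero
result = (
    bad_output
    .subtract(overall.where(F.col("value") != 0).withColumn("value", F.lit(0.0)))
    .union(overall.where(F.col("value") == 0))
)
result.show()
subtract is the PySpark counterpart of the distinct-based except used in the Scala snippet above.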

You can union the two dataframes and then use groupBy with the array_contains function to get the desired result.
Example:
df.show()
#+---+-----+
#|key|value|
#+---+-----+
#| a| 1|
#| b| 1|
#| c| 1|
#| d| 1|
#| x| 0|
#| y| 0|
#| z| 0|
#+---+-----+
df1.show()
#+---+-----+
#|key|value|
#+---+-----+
#| a| 0|
#| e| 0|
#| f| 0|
#| g| 0|
#+---+-----+
from pyspark.sql.functions import array_contains, array_join, col, collect_list

df2 = df.unionAll(df1)
df3 = df2.groupBy("key").agg(collect_list(col("value")).alias("lst"))
# keep keys whose list never contains 1; array_join expects an array of strings, so cast before joining
df3.filter(~array_contains("lst", 1)) \
   .withColumn("lst", array_join(col("lst").cast("array<string>"), "")) \
   .show()
#+---+---+
#|key|lst|
#+---+---+
#| x| 0|
#| g| 0|
#| f| 0|
#| e| 0|
#| z| 0|
#| y| 0|
#+---+---+
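Not part of either answer above, but as an alternative sketch: the same result can be expressed with a left anti join against the non-zero overall keys, followed by a union with the zero overall rows (this assumes the overall and bad_output DataFrames from the original question):
from pyspark.sql import functions as F

non_zero_keys = overall.where(F.col("value") != 0).select("key")

result = (
    bad_output
    .join(non_zero_keys, on="key", how="left_anti")   # drop bad keys that are non-zero overall
    .union(overall.where(F.col("value") == 0))        # add keys whose overall value is zero
    .dropDuplicates(["key"])                          # guard against a zero key appearing in both frames
)
result.show()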

Related

Iterating through rows to create custom formula structure in PySpark

I have a dataframe with variable names, a numerator, and a denominator.
Each variable is a ratio, e.g. below:
And another dataset with the actual data to compute the attributes:
The goal is to create these attributes using the formulas in the first dataframe and compute them against the second.
Currently my approach is very naive:
df = df.withColumn("var1", col('a') / col('b'))
.
.
.
Desired Output:
Since I have >500 variables, any suggestions for a smarter way to get around this are welcome!
This can be achieved with a cross join, an unpivot (via stack), and a pivot in PySpark.
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *

# formula definitions: each variable is numerator / denominator
data = [
    ("var1", "a", "c"),
    ("var2", "b", "d"),
    ("var3", "b", "a"),
    ("var4", "d", "c")
]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('numerator', StringType(), True),
    StructField('denonminator', StringType(), True)
])
# actual data used to compute the attributes
data2 = [
    ("ID1", 6, 4, 3, 7),
    ("ID2", 1, 2, 3, 9)
]
schema2 = StructType([
    StructField('ID', StringType(), True),
    StructField('a', IntegerType(), True),
    StructField('b', IntegerType(), True),
    StructField('c', IntegerType(), True),
    StructField('d', IntegerType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df2 = spark.createDataFrame(data=data2, schema=schema2)
df.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
""" Cross join to duplicate the formula rows for every ID """
df3 = spark.sql("select * from table1 cross join table2")
df3.createOrReplaceTempView("table3")
""" Unpivot the values and join to fetch the numerator and denominator values """
cols = df2.columns[1:]
df4 = df2.selectExpr('ID', "stack({}, {})".format(len(cols), ', '.join("'{}', {}".format(i, i) for i in cols)))
df4.createOrReplaceTempView("table4")
df5 = spark.sql("select name, B.ID, round(B.col1/C.col1, 2) as value from table3 A left outer join table4 B on A.ID = B.ID and A.numerator = B.col0 left outer join table4 C on A.ID = C.ID and A.denonminator = C.col0 order by name, ID")
""" Pivot to fetch the results """
df_final = df5.groupBy("ID").pivot("name").max("value")
The results of all intermediate and final dataframes:
>>> df.show()
+----+---------+------------+
|name|numerator|denonminator|
+----+---------+------------+
|var1| a| c|
|var2| b| d|
|var3| b| a|
|var4| d| c|
+----+---------+------------+
>>> df2.show()
+---+---+---+---+---+
| ID| a| b| c| d|
+---+---+---+---+---+
|ID1| 6| 4| 3| 7|
|ID2| 1| 2| 3| 9|
+---+---+---+---+---+
>>> df3.show()
+----+---------+------------+---+---+---+---+---+
|name|numerator|denonminator| ID| a| b| c| d|
+----+---------+------------+---+---+---+---+---+
|var1| a| c|ID1| 6| 4| 3| 7|
|var2| b| d|ID1| 6| 4| 3| 7|
|var1| a| c|ID2| 1| 2| 3| 9|
|var2| b| d|ID2| 1| 2| 3| 9|
|var3| b| a|ID1| 6| 4| 3| 7|
|var4| d| c|ID1| 6| 4| 3| 7|
|var3| b| a|ID2| 1| 2| 3| 9|
|var4| d| c|ID2| 1| 2| 3| 9|
+----+---------+------------+---+---+---+---+---+
>>> df4.show()
+---+----+----+
| ID|col0|col1|
+---+----+----+
|ID1| a| 6|
|ID1| b| 4|
|ID1| c| 3|
|ID1| d| 7|
|ID2| a| 1|
|ID2| b| 2|
|ID2| c| 3|
|ID2| d| 9|
+---+----+----+
>>> df5.show()
+----+---+-----+
|name| ID|value|
+----+---+-----+
|var1|ID1| 2.0|
|var1|ID2| 0.33|
|var2|ID1| 0.57|
|var2|ID2| 0.22|
|var3|ID1| 0.67|
|var3|ID2| 2.0|
|var4|ID1| 2.33|
|var4|ID2| 3.0|
+----+---+-----+
>>> df_final.show()
+---+----+----+----+----+
| ID|var1|var2|var3|var4|
+---+----+----+----+----+
|ID2|0.33|0.22| 2.0| 3.0|
|ID1| 2.0|0.57|0.67|2.33|
+---+----+----+----+----+
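For reference, the SQL that builds df5 above can also be sketched with the DataFrame API instead of temp views (this reuses the df3 and df4 built earlier and keeps the column names from the code above, including the denonminator spelling):
from pyspark.sql.functions import col, round as sql_round

b = df4.alias("B")
c = df4.alias("C")
df5_api = (
    df3.alias("A")
    .join(b, (col("A.ID") == col("B.ID")) & (col("A.numerator") == col("B.col0")), "left_outer")
    .join(c, (col("A.ID") == col("C.ID")) & (col("A.denonminator") == col("C.col0")), "left_outer")
    .select(col("A.name"), col("B.ID"), sql_round(col("B.col1") / col("C.col1"), 2).alias("value"))
    .orderBy("name", "ID")
)
df_final_api = df5_api.groupBy("ID").pivot("name").max("value")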

Conditions in Spark window function

I have a dataframe like
+---+---+---+---+
| q| w| e| r|
+---+---+---+---+
| a| 1| 20| y|
| a| 2| 22| z|
| b| 3| 10| y|
| b| 4| 12| y|
+---+---+---+---+
I want to mark the rows with the minimum e and r = z. If there are no rows which have r = z, I want the row with the minimum e, even if r = y.
Essentially, something like
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 1| 20| y| 0|
| a| 2| 22| z| 1|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
I can do it using a number of joins, but that would be too expensive.
So I was looking for a window-based solution.
You can calculate the minimum e per group twice: once only over rows with r = z, and once over all rows in the group. The first non-null of the two can then be compared to e:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = ...
w = Window.partitionBy("q")
#When ordering is not defined, an unbounded window frame is used by default.
df.withColumn("min_e_with_r_eq_z", F.expr("min(case when r = 'z' then e else null end)").over(w)) \
  .withColumn("min_e_overall", F.min("e").over(w)) \
  .withColumn("t", F.coalesce("min_e_with_r_eq_z", "min_e_overall") == F.col("e")) \
  .orderBy("w") \
  .show()
Output:
+---+---+---+---+-----------------+-------------+-----+
| q| w| e| r|min_e_with_r_eq_z|min_e_overall| t|
+---+---+---+---+-----------------+-------------+-----+
| a| 1| 20| y| 22| 20|false|
| a| 2| 22| z| 22| 20| true|
| b| 3| 10| y| null| 10| true|
| b| 4| 12| y| null| 10|false|
+---+---+---+---+-----------------+-------------+-----+
Note: I assume that q is the grouping column for the window.
You can assign row numbers based on whether r = z and the value of column e:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    't',
    F.when(
        F.row_number().over(
            Window.partitionBy('q')
                  .orderBy((F.col('r') == 'z').desc(), 'e')
        ) == 1,
        1
    ).otherwise(0)
)
df2.show()
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 2| 22| z| 1|
| a| 1| 20| y| 0|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
Adding the Spark Scala version of @werner's accepted answer:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, min, when}

val w = Window.partitionBy("q")
df.withColumn("min_e_with_r_eq_z", min(when($"r" === "z", $"e").otherwise(null)).over(w))
  .withColumn("min_e_overall", min("e").over(w))
  .withColumn("t", coalesce($"min_e_with_r_eq_z", $"min_e_overall") === $"e")
  .orderBy("w")
  .show()

Why sum is not displaying after aggregation & pivot?

I have student marks like below, and I want to pivot the subject name column and also get the total marks after the pivot.
Source table like:
+---------+-----------+-----+
|StudentId|SubjectName|Marks|
+---------+-----------+-----+
| 1| A| 10|
| 1| B| 20|
| 1| C| 30|
| 2| A| 20|
| 2| B| 25|
| 2| C| 30|
| 3| A| 10|
| 3| B| 20|
| 3| C| 20|
+---------+-----------+-----+
Destination:
+---------+---+---+---+-----+
|StudentId| A| B| C|Total|
+---------+---+---+---+-----+
| 1| 10| 20| 30| 60|
| 3| 10| 20| 20| 50|
| 2| 20| 25| 30| 75|
+---------+---+---+---+-----+
Please find the below source code:
val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val list = List((1, "A", 10), (1, "B", 20), (1, "C", 30), (2, "A", 20), (2, "B", 25), (2, "C", 30), (3, "A", 10),
(3, "B", 20), (3, "C", 20))
val df = list.toDF("StudentId", "SubjectName", "Marks")
df.show() // source table as per above
val df1 = df.groupBy("StudentId").pivot("SubjectName", Seq("A", "B", "C")).agg(sum("Marks"))
df1.show()
val df2 = df1.withColumn("Total", col("A") + col("B") + col("C"))
df2.show // required destination
val df3 = df.groupBy("StudentId").agg(sum("Marks").as("Total"))
df3.show()
df1 is not displaying the sum/total column; it displays like below:
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 10| 20| 30|
| 3| 10| 20| 20|
| 2| 20| 25| 30|
+---------+---+---+---+
df3 is able to create a new Total column, so why is df1 not able to create one?
Can anybody help me with what I am missing, or point out what is wrong with my understanding of the pivot concept?
This is expected behaviour of Spark's pivot function: the .agg function is applied to each pivoted column, which is why you do not see the sum of marks as a separate new column.
Refer to the official Spark documentation on pivot.
Example:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(sum("Marks") + 2).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 12| 22| 32|
| 3| 12| 22| 22|
| 2| 22| 27| 32|
+---------+---+---+---+
In the above example we have added 2 to all the pivoted columns.
Example2:
To get counts using pivot and agg:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(count("*")).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 1| 1| 1|
| 3| 1| 1| 1|
| 2| 1| 1| 1|
+---------+---+---+---+
The .agg that follows pivot applies only to the pivoted data. To find the total, you should add a new column and sum it, as below.
val cols = Seq("A", "B", "C")
val result = df.groupBy("StudentId")
.pivot("SubjectName")
.agg(sum("Marks"))
.withColumn("Total", cols.map(col _).reduce(_ + _))
result.show(false)
Output:
+---------+---+---+---+-----+
|StudentId|A |B |C |Total|
+---------+---+---+---+-----+
|1 |10 |20 |30 |60 |
|3 |10 |20 |20 |50 |
|2 |20 |25 |30 |75 |
+---------+---+---+---+-----+

Replacing all column values using Window operation?

Hi, the data frame is created like below.
df = sc.parallelize([
    (1, 3),
    (2, 3),
    (3, 2),
    (4, 2),
    (1, 3)
]).toDF(["id", 't'])
It shows like below:
+---+---+
| id| t|
+---+---+
| 1| 3|
| 2| 3|
| 3| 2|
| 4| 2|
| 1| 3|
+---+---+
My main aim is to replace each repeated value in every column with the number of times it is repeated.
So I have tried the following code, but it is not working as expected.
from pyspark.sql import Window
from pyspark.sql.functions import col, count

column_list = ["id", 't']
w = Window.partitionBy(column_list)
dfmax = df.select(*((count(col(c)).over(w)).alias(c) for c in df.columns))
dfmax.show()
+---+---+
| id| t|
+---+---+
| 2| 2|
| 2| 2|
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
My expected output would be:
+---+---+
| id| t|
+---+---+
| 2| 3|
| 1| 3|
| 1| 1|
| 1| 1|
| 2| 3|
+---+---+
If I understand you correctly, what you're looking for is simply:
df.select(*[count(c).over(Window.partitionBy(c)).alias(c) for c in df.columns]).show()
#+---+---+
#| id| t|
#+---+---+
#| 2| 3|
#| 2| 3|
#| 1| 2|
#| 1| 3|
#| 1| 2|
#+---+---+
The difference between this and what you posted is that we only partition by one column at a time.
Remember that DataFrames are unordered. If you wanted to maintain your row order, you could add an ordering column using pyspark.sql.functions.monotonically_increasing_id():
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("order", monotonically_increasing_id())\
.select(*[count(c).over(Window.partitionBy(c)).alias(c) for c in df.columns])\
.sort("order")\
.drop("order")\
.show()
#+---+---+
#| id| t|
#+---+---+
#| 2| 3|
#| 1| 3|
#| 1| 2|
#| 1| 2|
#| 2| 3|
#+---+---+

Applying function to dataframe columns spark scala

I have a large dataset with a considerably large number of columns (150). I want to apply a function (UDF) to all the columns except the first column, which holds the id field. I was able to apply the function dynamically, but now I need the id field back in the final dataframe. The Spark job will be running in cluster mode; here is what I tried.
val df = sc.parallelize(
Seq(("id1", "B", "c","d"), ("id2", "e", "d","k"),("id3", "e", "m","n"))).toDF("id", "dat1", "dat2","dat3")
df.show
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1| B| c| d|
|id2| e| d| k|
|id3| e| m| n|
+---+----+----+----+
df.select(df.columns.slice(1,df.columns.size).map(c => upper(col(c)).alias(c)): _*).show
+----+----+----+
|dat1|dat2|dat3|
+----+----+----+
|   B|   C|   D|
|   E|   D|   K|
|   E|   M|   N|
+----+----+----+
Expected output:
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1|   B|   C|   D|
|id2|   E|   D|   K|
|id3|   E|   M|   N|
+---+----+----+----+
Simply prepend the id column to the other (transformed) columns:
df.select(
col("id") +: df.columns.tail.map(c => upper(col(c)).alias(c)): _*
).show
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1| B| C| D|
|id2| E| D| K|
|id3| E| M| N|
+---+----+----+----+
