Applying function to dataframe columns spark scala - apache-spark

I have a large dataset with a considerably large number of columns (150). I want to apply a function (UDF) to all columns except the first one, which holds the id field. I was able to apply the function dynamically, but now I need the id field back in the final dataframe. The Spark job will be running in cluster mode; here is what I tried.
val df = sc.parallelize(
Seq(("id1", "B", "c","d"), ("id2", "e", "d","k"),("id3", "e", "m","n"))).toDF("id", "dat1", "dat2","dat3")
df.show
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1|   B|   c|   d|
|id2|   e|   d|   k|
|id3|   e|   m|   n|
+---+----+----+----+
df.select(df.columns.slice(1,df.columns.size).map(c => upper(col(c)).alias(c)): _*).show
+----+----+----+
|dat1|dat2|dat3|
+----+----+----+
|   B|   C|   D|
|   E|   D|   K|
|   E|   M|   N|
+----+----+----+
Expected output:
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1|   B|   C|   D|
|id2|   E|   D|   K|
|id3|   E|   M|   N|
+---+----+----+----+

Simply prepend the id column to the other (transformed) columns:
df.select(
col("id") +: df.columns.tail.map(c => upper(col(c)).alias(c)): _*
).show
+---+----+----+----+
| id|dat1|dat2|dat3|
+---+----+----+----+
|id1|   B|   C|   D|
|id2|   E|   D|   K|
|id3|   E|   M|   N|
+---+----+----+----+
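The question mentions a UDF rather than the built-in upper, but the same prepend-the-id pattern works with any column expression. A minimal sketch, with a placeholder UDF standing in for the real one (which is not shown in the question):
import org.apache.spark.sql.functions.{col, udf}

// Placeholder for the actual UDF; replace with the real function.
val myUdf = udf((s: String) => s.toUpperCase)

// Leave the first column (id) untouched and apply the UDF to every other column.
df.select(
  col("id") +: df.columns.tail.map(c => myUdf(col(c)).alias(c)): _*
).show()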

Related

How to intersect/union pyspark dataframes with different values

I have one data frame (this is the overall data frame) with zero and non-zero values:
+---+-----+
|key|value|
+---+-----+
| a| 0.5|
| b| 0.4|
| c| 0.5|
| d| 0.3|
| x| 0.0|
| y| 0.0|
| z| 0.0|
+---+-----+
and the second dataframe (bad output, which should contain only 0s) is:
+---+-----+
|key|value|
+---+-----+
| a| 0.0|
| e| 0.0|
| f| 0.0|
| g| 0.0|
+---+-----+
Note: the value of `a` has changed.
How can I write my script to get the following output for my second dataframe (only 0s, and since the value of `a` is non-zero in the good data frame, I want it removed from the bad one)?
+---+-----+
|key|value|
+---+-----+
| e| 0.0|
| f| 0.0|
| g| 0.0|
| x| 0.0|
| y| 0.0|
| z| 0.0|
+---+-----+
Non-zero overall values can be removed from bad output, and zero overall values added (Scala):
import spark.implicits._
import org.apache.spark.sql.functions.lit

val overall = Seq(
  ("a", 0.5),
  ("b", 0.4),
  ("c", 0.5),
  ("d", 0.3),
  ("x", 0.0),
  ("y", 0.0),
  ("z", 0.0)
).toDF("key", "value")

val badOutput = Seq(
  ("a", 0.0),
  ("e", 0.0),
  ("f", 0.0),
  ("g", 0.0)
).toDF("key", "value")

badOutput
  .except(overall.where($"value" =!= 0).withColumn("value", lit(0.0)))
  .union(overall.where($"value" === 0))
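A hedged alternative sketch of the same logic: a left_anti join on key drops the bad-output rows whose key has a non-zero overall value, which avoids having to overwrite value with lit(0.0) before comparing (same overall and badOutput dataframes as above):
import org.apache.spark.sql.functions.col

badOutput
  .join(overall.where(col("value") =!= 0), Seq("key"), "left_anti") // drop keys that are non-zero overall
  .union(overall.where(col("value") === 0))                         // add keys that are zero overall
  .distinct()
  .show()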
You can union the two dataframes, then use groupBy with the array_contains function to get the desired result.
Example:
df.show()
#+---+-----+
#|key|value|
#+---+-----+
#| a| 1|
#| b| 1|
#| c| 1|
#| d| 1|
#| x| 0|
#| y| 0|
#| z| 0|
#+---+-----+
df1.show()
#+---+-----+
#|key|value|
#+---+-----+
#| a| 0|
#| e| 0|
#| f| 0|
#| g| 0|
#+---+-----+
from pyspark.sql.functions import col, collect_list, array_contains, array_join

df2 = df.unionAll(df1)
df3 = df2.groupBy("key").agg(collect_list(col("value")).alias("lst"))
df3.filter(~array_contains("lst", 1)) \
    .withColumn("lst", array_join(col("lst"), "")) \
    .show()
#+---+---+
#|key|lst|
#+---+---+
#| x| 0|
#| g| 0|
#| f| 0|
#| e| 0|
#| z| 0|
#| y| 0|
#+---+---+
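For reference, a Scala sketch of the same union + groupBy + array_contains idea, assuming Scala DataFrames df and df1 with the data shown above (the cast is there because array_join expects string elements):
import org.apache.spark.sql.functions.{array_contains, array_join, col, collect_list}

val df2 = df.union(df1)
val df3 = df2.groupBy("key").agg(collect_list(col("value")).alias("lst"))

// Keep only keys whose collected values never contain a 1, then flatten the list for display.
df3.filter(!array_contains(col("lst"), 1))
  .withColumn("lst", array_join(col("lst").cast("array<string>"), ""))
  .show()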

Iterating through rows to create custom formula structure in PySpark

I have a dataframe with variable names and numerator and denominator.
Each variable is a ratio, e.g. below:
And another dataset with the actual data to compute the attributes:
The goal is to create these attributes with the formulas in the first dataset and compute them with the second.
Currently my approach is very naive:
df = df.withColumn("var1", col('a')/col('b'))
# ... repeated for each variable
Desired Output:
Since I have >500 variables, any suggestions for a smarter way to get around this are welcome!
This can be achieved with cross join, unpivot, and pivot operations in PySpark.
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
("var1", "a","c"),
("var2", "b","d"),
("var3", "b","a"),
("var4", "d","c")
]
schema = StructType([
StructField('name', StringType(),True), \
StructField('numerator', StringType(),True), \
StructField('denonminator', StringType(),True)
])
data2 = [
("ID1", 6,4,3,7),
("ID2", 1,2,3,9)
]
schema2 = StructType([
StructField('ID', StringType(),True), \
StructField('a', IntegerType(),True), \
StructField('b', IntegerType(),True),\
StructField('c', IntegerType(),True), \
StructField('d', IntegerType(),True)
])
df = spark.createDataFrame(data=data, schema=schema)
df2 = spark.createDataFrame(data=data2, schema=schema2)
df.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
""" CRoss Join for Duplicating the values """
df3=spark.sql("select * from table1 cross join table2")
df3.createOrReplaceTempView("table3")
""" Unpivoting the values and joining to fecth the value of numerator and denominator"""
cols = df2.columns[1:]
df4=df2.selectExpr('ID', "stack({}, {})".format(len(cols), ', '.join(("'{}', {}".format(i, i) for i in cols))))
df4.createOrReplaceTempView("table4")
df5=spark.sql("select name,B.ID,round(B.col1/C.col1,2) as value from table3 A left outer join table4 B on A.ID=B.ID and a.numerator=b.col0 left outer join table4 C on A.ID=C.ID and a.denonminator=C.col0 order by name,ID")
""" Pivot for fetching the results """
df_final=df5.groupBy("ID").pivot("name").max("value")
The results of all intermediate and final dataframes:
>>> df.show()
+----+---------+------------+
|name|numerator|denonminator|
+----+---------+------------+
|var1| a| c|
|var2| b| d|
|var3| b| a|
|var4| d| c|
+----+---------+------------+
>>> df2.show()
+---+---+---+---+---+
| ID| a| b| c| d|
+---+---+---+---+---+
|ID1| 6| 4| 3| 7|
|ID2| 1| 2| 3| 9|
+---+---+---+---+---+
>>> df3.show()
+----+---------+------------+---+---+---+---+---+
|name|numerator|denonminator| ID| a| b| c| d|
+----+---------+------------+---+---+---+---+---+
|var1| a| c|ID1| 6| 4| 3| 7|
|var2| b| d|ID1| 6| 4| 3| 7|
|var1| a| c|ID2| 1| 2| 3| 9|
|var2| b| d|ID2| 1| 2| 3| 9|
|var3| b| a|ID1| 6| 4| 3| 7|
|var4| d| c|ID1| 6| 4| 3| 7|
|var3| b| a|ID2| 1| 2| 3| 9|
|var4| d| c|ID2| 1| 2| 3| 9|
+----+---------+------------+---+---+---+---+---+
>>> df4.show()
+---+----+----+
| ID|col0|col1|
+---+----+----+
|ID1| a| 6|
|ID1| b| 4|
|ID1| c| 3|
|ID1| d| 7|
|ID2| a| 1|
|ID2| b| 2|
|ID2| c| 3|
|ID2| d| 9|
+---+----+----+
>>> df5.show()
+----+---+-----+
|name| ID|value|
+----+---+-----+
|var1|ID1| 2.0|
|var1|ID2| 0.33|
|var2|ID1| 0.57|
|var2|ID2| 0.22|
|var3|ID1| 0.67|
|var3|ID2| 2.0|
|var4|ID1| 2.33|
|var4|ID2| 3.0|
+----+---+-----+
>>> df_final.show()
+---+----+----+----+----+
| ID|var1|var2|var3|var4|
+---+----+----+----+----+
|ID2|0.33|0.22| 2.0| 3.0|
|ID1| 2.0|0.57|0.67|2.33|
+---+----+----+----+----+
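A hedged alternative sketch in Scala: because the formula table is small, it can be collected to the driver and turned directly into column expressions, which avoids the cross join, unpivot and pivot entirely (this assumes Scala DataFrames equivalent to df and df2 above; the column name denonminator is kept exactly as defined in the schema):
import org.apache.spark.sql.functions.col

// Build one ratio expression per formula row.
val ratioCols = df.collect().map { row =>
  val name = row.getAs[String]("name")
  val num  = row.getAs[String]("numerator")
  val den  = row.getAs[String]("denonminator")
  (col(num) / col(den)).alias(name)
}

// Compute all ratios in a single select, keeping the ID column.
df2.select(col("ID") +: ratioCols: _*).show()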

Conditions in Spark window function

I have a dataframe like
+---+---+---+---+
| q| w| e| r|
+---+---+---+---+
| a| 1| 20| y|
| a| 2| 22| z|
| b| 3| 10| y|
| b| 4| 12| y|
+---+---+---+---+
I want to mark the rows with the minimum e and r = z. If there are no rows with r = z, I want the row with the minimum e, even if r = y.
Essentially, something like
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 1| 20| y| 0|
| a| 2| 22| z| 1|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
I can do it using a number of joins, but that would be too expensive.
So I was looking for a window-based solution.
You can calculate the minimum per group twice: once for rows with r = z and once for all rows within a group. The first non-null value can then be compared to e:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = ...
w = Window.partitionBy("q")
#When ordering is not defined, an unbounded window frame is used by default.
df.withColumn("min_e_with_r_eq_z", F.expr("min(case when r='z' then e else null end)").over(w)) \
.withColumn("min_e_overall", F.min("e").over(w)) \
.withColumn("t", F.coalesce("min_e_with_r_eq_z","min_e_overall") == F.col("e")) \
.orderBy("w") \
.show()
Output:
+---+---+---+---+-----------------+-------------+-----+
| q| w| e| r|min_e_with_r_eq_z|min_e_overall| t|
+---+---+---+---+-----------------+-------------+-----+
| a| 1| 20| y| 22| 20|false|
| a| 2| 22| z| 22| 20| true|
| b| 3| 10| y| null| 10| true|
| b| 4| 12| y| null| 10|false|
+---+---+---+---+-----------------+-------------+-----+
Note: I assume that q is the grouping column for the window.
You can assign row numbers based on whether r = z and the value of column e:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
't',
F.when(
F.row_number().over(
Window.partitionBy('q')
.orderBy((F.col('r') == 'z').desc(), 'e')
) == 1,
1
).otherwise(0)
)
df2.show()
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 2| 22| z| 1|
| a| 1| 20| y| 0|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
Adding the Spark Scala version of @werner's accepted answer:
val w = Window.partitionBy("q")
df.withColumn("min_e_with_r_eq_z", min(when($"r" === "z", $"e").otherwise(null)).over(w))
.withColumn("min_e_overall", min("e").over(w))
.withColumn("t", coalesce($"min_e_with_r_eq_z", $"min_e_overall") === $"e")
.orderBy("w")
.show()

PySpark - How to set the default value for pyspark.sql.functions.lag to a value within the current row?

How does one set the default value for pyspark.sql.functions.lag to a value within the current row?
For example, given:
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql.functions import col

testInput = [(1, 'a'), (2, 'c'), (3, 'e'), (1, 'a'), (1, 'b'), (1, 'b')]
columns = ['Col A', 'Col B']
df = sc.parallelize(testInput).toDF(columns)
df.show()
windowSpecification = Window.partitionBy(col('Col A')).orderBy(col('Col B'))
changedRows = col('Col B') != F.lag(col('Col B'), 1).over(windowSpecification)
df.select(col('Col A'), col('Col B'), changedRows.alias('New Col C')).show()
which outputs:
+-----+-----+
|Col A|Col B|
+-----+-----+
| 1| a|
| 2| c|
| 3| e|
| 1| a|
| 1| b|
| 1| b|
+-----+-----+
+-----+-----+---------+
|Col A|Col B|New Col C|
+-----+-----+---------+
| 1| a| null|
| 1| a| false|
| 1| b| true|
| 1| b| false|
| 3| e| null|
| 2| c| null|
+-----+-----+---------+
I would like the output to look like:
+-----+-----+---------+
|Col A|Col B|New Col C|
+-----+-----+---------+
| 1| a| false|
| 1| a| false|
| 1| b| true|
| 1| b| false|
| 3| e| false|
| 2| c| false|
+-----+-----+---------+
My current workaround is to add a second lag call to the changedRows, like so:
changedRows = (col('Col B') != F.lag(col('Col B'), 1).over(windowSpecification)) & F.lag(col('Col B'), 1).over(windowSpecification).isNotNull()
but this does not look clean to me.
I would like to do something like
changedRows = col('Col B') != F.lag(col('Col B'), 1, col('Col B')).over(windowSpecification)
but I get the error TypeError: 'Column' object is not callable.
You can use column values as parameters if you use pyspark.sql.functions.expr. In your case, make the following modification to changedRows:
changedRows = F.expr(
"`Col B` != lag(`Col B`, 1, `Col B`) over (PARTITION BY `Col A` ORDER BY `Col B`)"
)
df.select('Col A', 'Col B', changedRows.alias('New Col C')).show()
#+-----+-----+---------+
#|Col A|Col B|New Col C|
#+-----+-----+---------+
#| 1| a| false|
#| 1| a| false|
#| 1| b| true|
#| 1| b| false|
#| 3| e| false|
#| 2| c| false|
#+-----+-----+---------+
You have to refer to the column names in backticks because of the spaces in them.
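For completeness, a Scala sketch of the same expr-based workaround; the SQL expression is identical and the backtick quoting carries over unchanged (assuming an equivalent Scala DataFrame df):
import org.apache.spark.sql.functions.{col, expr}

val changedRows = expr(
  "`Col B` != lag(`Col B`, 1, `Col B`) over (PARTITION BY `Col A` ORDER BY `Col B`)"
)

df.select(col("Col A"), col("Col B"), changedRows.alias("New Col C")).show()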

Conditionally remove duplicated rows in Spark dataset

I would like to achieve the following thing:
val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 20|
| z| x| 10|
| b| y| 7|
| z| y| 5|
| c| w| 1|
| z| w| 2|
+---+---+---+
should be reduced to
val df2 = Seq(("a","x",30),("b","y",12),("c","w",3)).toDS
+---+---+---+
| _1| _2| _3|
+---+---+---+
| a| x| 30|
| b| y| 12|
| c| w| 3|
+---+---+---+
I am aware of the dropDuplicates() command and its options, but it does not work for what I would like to achieve. Somehow one has to detect the duplicates according to column _2, then always remove the row with the z entry in _1 and add its _3 value to the _3 column of the row that is kept.
Thank you in advance.
As per your question, this is what you are looking for:
import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, sum}

val df1 = Seq(("a","x",20),("z","x",10),("b","y",7),("z","y",5),("c","w",1),("z","w",2)).toDS
val resultDf = df1.groupBy("_2").agg(collect_list("_1")(0).as("_1"), sum("_3").as("_3"))
Output:
+---+---+---+
| _2| _1| _3|
+---+---+---+
| x| a| 30|
| w| c| 3|
| y| b| 12|
+---+---+---+
You will get the result but the order is not guaranteed.
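If that is a concern, a hedged alternative sketch is to pick the non-z key explicitly instead of relying on the element order inside collect_list (this assumes the row to keep is always the one whose _1 is not "z"):
import org.apache.spark.sql.functions.{col, max, sum, when}

val resultDf = df1
  .groupBy("_2")
  .agg(
    max(when(col("_1") =!= "z", col("_1"))).as("_1"), // the "z" rows become null and are ignored by max
    sum("_3").as("_3")
  )
resultDf.show()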
