How to iterate over columns of "spark" dataframe? - apache-spark

I have the following Spark dataframe that is created dynamically
+-----+----------+
| name|    number|
+-----+----------+
| Andy|(20,10,30)|
|Berta|(30,40,20)|
|  Joe|(40,90,60)|
+-----+----------+
Now I need to iterate over each row and column in Spark to print the following output. How can I do this?
Andy 20
Andy 10
Andy 30
Berta 30
Berta 40
Berta 20
Joe 40
Joe 90
Joe 60

Assuming the number column is of string type, you can get the desired result with the following steps.
Original DataFrame:
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell
val df = Seq(("Andy", "20,10,30"), ("Berta", "30,40,20"), ("Joe", "40,90,60"))
  .toDF("name", "number")
Then create an intermediate DataFrame with three number columns by splitting the number column on commas.
val Interim_Df = df.withColumn("n1", split(col("number"), ",").getItem(0))
  .withColumn("n2", split(col("number"), ",").getItem(1))
  .withColumn("n3", split(col("number"), ",").getItem(2))
  .drop("number")
Then build the final result DataFrame by taking the union of the per-index DataFrames in onlyOneIndexDfs.
val columnIndexes = Seq(1, 2, 3)
// One two-column DataFrame (name, number) per split column n1, n2, n3
val onlyOneIndexDfs = columnIndexes.map(x =>
  Interim_Df.select(
    $"name",
    col(s"n$x").alias("number")))
val resultDF = onlyOneIndexDfs.reduce(_ union _)

You need the explode function, which emits one output row for each element of an array column.
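As a rough illustration of what "split, then explode" produces, here is a plain-Python sketch that mimics the semantics on the question's data without needing Spark. It assumes the number column is a comma-separated string; explode_rows is a made-up helper name for this sketch, not a Spark API.

```python
# Plain-Python sketch of "split then explode": one output row per element.
# explode_rows is a hypothetical helper, not part of Spark.
rows = [("Andy", "20,10,30"), ("Berta", "30,40,20"), ("Joe", "40,90,60")]

def explode_rows(rows, sep=","):
    """Yield (name, number) once for every element of the separated string."""
    for name, numbers in rows:
        for number in numbers.split(sep):
            yield name, int(number)

exploded = list(explode_rows(rows))
for name, number in exploded:
    print(name, number)
```

In PySpark the same shape would typically come from something like df.select("name", explode(split("number", ",")).alias("number")), with both functions imported from pyspark.sql.functions.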

Related

Filter DataFrame to delete duplicate values in pyspark

I have the following dataframe
date | value | ID
--------------------------------------
2021-12-06 15:00:00 25 1
2021-12-06 15:15:00 35 1
2021-11-30 00:00:00 20 2
2021-11-25 00:00:00 10 2
I want to join this DF with another one like this:
idUser | Name | Gender
-------------------
1 John M
2 Anne F
My expected output is:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
What I need is to get only the most recent value from the first DataFrame and join only that value with my second DataFrame. However, my Spark script joins both values.
My code:
from pyspark.sql.functions import col, max
from pyspark.sql.types import StringType
df = df1.select(
    col("date"),
    col("value"),
    col("ID"),
).orderBy(
    col("ID").asc(),
    col("date").desc(),
).groupBy(
    col("ID"), col("date").cast(StringType()).substr(0, 10).alias("date")
).agg(
    max(col("value")).alias("value")
)
final_df = df2.join(
    df,
    (col("idUser") == col("ID")),
    how="left"
)
When I perform this join (the column formatting is omitted in this post), I get the following output:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
2 Anne F 10
I use substr to remove the hours and minutes so that I filter only by date. But when the same ID appears on different days, my output df has both values instead of only the most recent one. How can I fix this?
Note: I'm using only PySpark functions to do this (I do not want to use spark.sql(...)).
You can use a window and the row_number function in PySpark:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
# Number the rows within each ID from the newest date to the oldest
windowSpec = Window.partitionBy("ID").orderBy(f.col("date").desc())
# Keep only the newest row per ID
df1_latest_val = df1.withColumn("row_number", f.row_number().over(windowSpec)).filter(
    f.col("row_number") == 1
)
The output of table df1_latest_val will look something like this
date | value | ID | row_number |
-----------------------------------------------------
2021-12-06 15:15:00 35 1 1
2021-11-30 00:00:00 20 2 1
Now you have a DataFrame containing only the latest value per ID, which you can join directly with the other table.
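To make the "keep row_number == 1 per ID, then join" logic concrete, here is a plain-Python sketch on the question's data. It only imitates what the window does; latest_per_id is a hypothetical helper name, and the data is copied from the tables above.

```python
# Plain-Python sketch of "latest row per ID, then left join".
# latest_per_id is a hypothetical helper, not a Spark API.
from datetime import datetime

# Rows of the first DataFrame: (date, value, ID)
df1_rows = [
    ("2021-12-06 15:00:00", 25, 1),
    ("2021-12-06 15:15:00", 35, 1),
    ("2021-11-30 00:00:00", 20, 2),
    ("2021-11-25 00:00:00", 10, 2),
]
# Rows of the second DataFrame: (idUser, Name, Gender)
df2_rows = [(1, "John", "M"), (2, "Anne", "F")]

def latest_per_id(rows):
    """Return {ID: (timestamp, value)} for the newest row of each ID."""
    latest = {}
    for date, value, id_ in rows:
        ts = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
        if id_ not in latest or ts > latest[id_][0]:
            latest[id_] = (ts, value)
    return latest

latest = latest_per_id(df1_rows)
# Left join: every user is kept; the value is None if the ID has no rows
joined = [(uid, name, gender, latest.get(uid, (None, None))[1])
          for uid, name, gender in df2_rows]
print(joined)
```

Because the full timestamp is compared rather than only the date part, two readings on the same day (as for ID 1) are also reduced to a single row.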

How to join the two dataframe by condition in PySpark?

I have two DataFrames as described below.
Dataframe 1
P_ID P_Name P_Description P_Size
100 Moto Mobile 16
200 Apple Mobile 15
300 Oppo Mobile 18
Dataframe 2
P_ID List_Code P_Amount
100 ALPHA 20000
100 BETA 60000
300 GAMMA 15000
Requirement:
I need to join the two DataFrames on P_ID.
Information about the DataFrames:
In DataFrame 1, P_ID is a primary key; DataFrame 2 doesn't have any primary attribute.
How to join the DataFrames:
I need to create new columns in DataFrame 1 from the List_Code values of DataFrame 2, each with "_price" appended. If List_Code in DataFrame 2 contains 20 unique values, 20 columns need to be created in DataFrame 1. Each new column is then filled from the P_Amount column of DataFrame 2 based on P_ID if a value is present, otherwise with zero. Once DataFrame 1 has these columns with the expected values, the DataFrames can be joined on P_ID. My problem is creating the new columns with the expected values.
The expected dataframe is shown below
Expected dataframe
P_ID P_Name P_Description P_Size ALPHA_price BETA_price GAMMA_price
100 Moto Mobile 16 20000 60000 0
200 Apple Mobile 15 0 0 0
300 Oppo Mobile 18 0 0 15000
Can you please help me to solve the problem, thanks in advance.
For your application, you need to pivot the second DataFrame and then left-join the pivoted result onto the first DataFrame on P_ID.
See the code below.
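import pandas as pd
from pyspark.sql import functions as f
# spark is assumed to be an existing SparkSession
df_1 = pd.DataFrame({'P_ID' : [100, 200, 300], 'P_Name': ['Moto', 'Apple', 'Oppo'], 'P_Size' : [16, 15, 18]})
sdf_1 = spark.createDataFrame(df_1)
df_2 = pd.DataFrame({'P_ID' : [100, 100, 300], 'List_Code': ['ALPHA', 'BETA', 'GAMMA'], 'P_Amount' : [20000, 60000, 10000]})
sdf_2 = spark.createDataFrame(df_2)
# One column per List_Code value; missing combinations become 0
sdf_pivoted = sdf_2.groupby('P_ID').pivot('List_Code').agg(f.sum('P_Amount')).fillna(0)
# The left join keeps products without any price row (P_ID 200)
sdf_joined = sdf_1.join(sdf_pivoted, on='P_ID', how='left').fillna(0)
sdf_joined.show()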
+----+------+------+-----+-----+-----+
|P_ID|P_Name|P_Size|ALPHA| BETA|GAMMA|
+----+------+------+-----+-----+-----+
| 300| Oppo| 18| 0| 0|10000|
| 200| Apple| 15| 0| 0| 0|
| 100| Moto| 16|20000|60000| 0|
+----+------+------+-----+-----+-----+
You can change the column names or ordering of the dataframe as needed.
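For readers who want to see the pivot and the "_price" renaming spelled out, here is a plain-Python sketch using the data from the question (so GAMMA is 15000 here). It mimics pivot, fillna(0) and the left join with ordinary dictionaries; none of the variable names come from Spark.

```python
# Plain-Python sketch of "pivot List_Code into columns, then left join".
products = [
    (100, "Moto", "Mobile", 16),
    (200, "Apple", "Mobile", 15),
    (300, "Oppo", "Mobile", 18),
]
prices = [(100, "ALPHA", 20000), (100, "BETA", 60000), (300, "GAMMA", 15000)]

# Distinct List_Code values become the pivoted columns
codes = sorted({code for _, code, _ in prices})

# Pivot: {P_ID: {List_Code: summed P_Amount}}, with 0 for missing codes
pivoted = {}
for p_id, code, amount in prices:
    cell = pivoted.setdefault(p_id, dict.fromkeys(codes, 0))
    cell[code] += amount

# Left join on P_ID; products without prices get 0 in every price column
header = ["P_ID", "P_Name", "P_Description", "P_Size"] + [c + "_price" for c in codes]
table = [list(p) + [pivoted.get(p[0], {}).get(c, 0) for c in codes]
         for p in products]

print(header)
for row in table:
    print(row)
```

In the Spark version, the suffix could be added after the pivot by renaming each pivoted column, for example with withColumnRenamed in a loop over the List_Code values.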

Apache Spark - Finding Array/List/Set subsets

I have 2 dataframes each one having Array[String] as one of the columns. For each entry in one dataframe, I need to find out subsets, if any, in the other dataframe. An example is here:
DF1:
----------------------------------------------------
id : Long | labels : Array[String]
---------------------------------------------------
10 | [label1, label2, label3]
11 | [label4, label5]
12 | [label6, label7]
DF2:
----------------------------------------------------
item : String | labels : Array[String]
---------------------------------------------------
item1 | [label1, label2, label3, label4, label5]
item2 | [label4, label5]
item3 | [label4, label5, label6, label7]
After the subset operation I described, the expected output should be
DF3:
----------------------------------------------------
item : String | id : Long
---------------------------------------------------
item1 | [10, 11]
item2 | [11]
item3 | [11, 12]
It is guaranteed that every row in DF2 has corresponding subsets in DF1, so there won't be any leftover elements.
Can someone please help with the right approach here? It looks like, for each element in DF2, I need to scan DF1 and do a subset operation (or set subtraction) on the second column until I find all the subsets and exhaust the labels in that row, accumulating the list of "id" fields as I go. How do I do this in a compact and efficient manner? Any help is greatly appreciated. Realistically, I may have hundreds of elements in DF1 and thousands of elements in DF2.
I'm not aware of an efficient way to perform this kind of operation. However, here is one possible solution using a UDF together with a Cartesian join.
The UDF takes two sequences and checks whether every string in the first also exists in the second:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, udf}
val matchLabel = udf((array1: Seq[String], array2: Seq[String]) => {
  array1.forall{x => array2.contains(x)}
})
Because a Cartesian join is computationally expensive, it has to be enabled explicitly (in Spark 2.x it is disabled by default).
val spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.crossJoin.enabled", true)
The two DataFrames are then joined using the UDF as the join condition. Afterwards the result is grouped by the item column to collect a list of all matching ids. Using the same DF1 and DF2 as in the question:
val DF3 = DF2.join(DF1, matchLabel(DF1("labels"), DF2("labels")))
  .groupBy("item")
  .agg(collect_list("id").as("id"))
The result is as follows:
+-----+--------+
| item| id|
+-----+--------+
|item3|[11, 12]|
|item2| [11]|
|item1|[10, 11]|
+-----+--------+
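The matching logic itself is small enough to check by hand. The following plain-Python sketch runs the same "every DF1 label is contained in the DF2 labels" test over all pairs, which is what the Cartesian join does at scale. match_label mirrors the UDF above; the lists simply restate the question's data.

```python
# Plain-Python sketch of the pairwise subset test behind the Cartesian join.
df1 = [
    (10, ["label1", "label2", "label3"]),
    (11, ["label4", "label5"]),
    (12, ["label6", "label7"]),
]
df2 = [
    ("item1", ["label1", "label2", "label3", "label4", "label5"]),
    ("item2", ["label4", "label5"]),
    ("item3", ["label4", "label5", "label6", "label7"]),
]

def match_label(array1, array2):
    """True if every string in array1 also appears in array2."""
    return set(array1).issubset(array2)

# For each item, collect the ids whose labels are a subset of the item's labels
df3 = {
    item: [id_ for id_, labels in df1 if match_label(labels, item_labels)]
    for item, item_labels in df2
}
print(df3)
```

With hundreds of rows in DF1 and thousands in DF2 this is a few hundred thousand comparisons, which hints at why the all-pairs approach is workable at that size but would not scale much further.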

pyspark: rolling average using timeseries data

I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week.
Here's an example:
%pyspark
import datetime
from pyspark.sql import functions as F
df1 = sc.parallelize([(17,"2017-03-11T15:27:18+00:00"), (13,"2017-03-11T12:27:18+00:00"), (21,"2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))
w = df2.groupBy(F.window("timestampGMT", "7 days")).agg(F.avg("dollars").alias('avg'))
w.select(w.window.start.cast("string").alias("start"), w.window.end.cast("string").alias("end"), "avg").collect()
This results in two records:
| start | end | avg |
|---------------------|----------------------|-----|
|'2017-03-16 00:00:00'| '2017-03-23 00:00:00'| 21.0|
|---------------------|----------------------|-----|
|'2017-03-09 00:00:00'| '2017-03-16 00:00:00'| 15.0|
|---------------------|----------------------|-----|
The window function binned the time series data rather than performing a rolling average.
Is there a way to perform a rolling average where I'll get back a weekly average for each row with a time period ending at the timestampGMT of the row?
EDIT:
Zhang's answer below is close to what I want, but not exactly what I'd like to see.
Here's a better example to show what I'm trying to get at:
%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                            (13, "2017-03-15T12:27:18+00:00"),
                            (25, "2017-03-18T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', F.avg("dollars").over(Window.partitionBy(F.window("timestampGMT", "7 days"))))
This results in the following dataframe:
dollars timestampGMT rolling_average
25 2017-03-18 11:27:18.0 25
17 2017-03-10 15:27:18.0 15
13 2017-03-15 12:27:18.0 15
I'd like the average to be over the week preceding the date in the timestampGMT column, which would result in this:
dollars timestampGMT rolling_average
17 2017-03-10 15:27:18.0 17
13 2017-03-15 12:27:18.0 15
25 2017-03-18 11:27:18.0 19
In the above results, the rolling_average for 2017-03-10 is 17, since there are no preceding records. The rolling_average for 2017-03-15 is 15 because it averages the 13 from 2017-03-15 and the 17 from 2017-03-10, which falls within the preceding 7-day window. The rolling_average for 2017-03-18 is 19 because it averages the 25 from 2017-03-18 and the 13 from 2017-03-15, which falls within the preceding 7-day window; it does not include the 17 from 2017-03-10 because that does not fall within the preceding 7-day window.
Is there a way to do this rather than the binning window where the weekly windows don't overlap?
I figured out the correct way to calculate a moving/rolling average using this Stack Overflow question:
Spark Window Functions - rangeBetween dates
The basic idea is to convert your timestamp column to seconds, and then you can use the rangeBetween function in the pyspark.sql.Window class to include the correct rows in your window.
Here's the solved example:
%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window
#function to calculate number of seconds from number of days
days = lambda i: i * 86400
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
(13, "2017-03-15T12:27:18+00:00"),
(25, "2017-03-18T11:27:18+00:00")],
["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
#create window by casting timestamp to long (number of seconds)
w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))
df = df.withColumn('rolling_average', F.avg("dollars").over(w))
This results in the exact column of rolling averages that I was looking for:
dollars timestampGMT rolling_average
17 2017-03-10 15:27:18.0 17.0
13 2017-03-15 12:27:18.0 15.0
25 2017-03-18 11:27:18.0 19.0
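The frame selected by rangeBetween(-days(7), 0) can be checked by hand. The plain-Python sketch below applies the same rule, "all rows whose timestamp lies within the 7 days up to and including the current row", to the three example rows. rolling_average is a made-up helper name for this sketch, and it ignores partitioning.

```python
# Plain-Python sketch of a time-based (range) window average.
# rolling_average is a hypothetical helper, not a Spark API.
from datetime import datetime, timedelta

rows = [
    (17, "2017-03-10T15:27:18+00:00"),
    (13, "2017-03-15T12:27:18+00:00"),
    (25, "2017-03-18T11:27:18+00:00"),
]

def rolling_average(rows, days=7):
    """Average dollars over rows with timestamp in [t - days, t] for each row."""
    parsed = [(dollars, datetime.fromisoformat(ts)) for dollars, ts in rows]
    averages = []
    for _, t in parsed:
        start = t - timedelta(days=days)
        frame = [dollars for dollars, u in parsed if start <= u <= t]
        averages.append(sum(frame) / len(frame))
    return averages

averages = rolling_average(rows)
print(averages)
```

The third row averages only 13 and 25 because 2017-03-10 15:27:18 is more than seven days before 2017-03-18 11:27:18, which matches the 19.0 in the table above.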
I will add a variation that I personally found very useful. I hope someone else finds it useful as well:
If you want to group the rows and then calculate the moving average within each group:
Example of the dataframe :
from pyspark.sql.window import Window
from pyspark.sql import functions as F
df = spark.createDataFrame([("tshilidzi", 17.00, "2018-03-10T15:27:18+00:00"),
("tshilidzi", 13.00, "2018-03-11T12:27:18+00:00"),
("tshilidzi", 25.00, "2018-03-12T11:27:18+00:00"),
("thabo", 20.00, "2018-03-13T15:27:18+00:00"),
("thabo", 56.00, "2018-03-14T12:27:18+00:00"),
("thabo", 99.00, "2018-03-15T11:27:18+00:00"),
("tshilidzi", 156.00, "2019-03-22T11:27:18+00:00"),
("thabo", 122.00, "2018-03-31T11:27:18+00:00"),
("tshilidzi", 7000.00, "2019-04-15T11:27:18+00:00"),
("ash", 9999.00, "2018-04-16T11:27:18+00:00")
],
["name", "dollars", "timestampGMT"])
# cast the ISO-8601 string to a timestamp; the window below casts it to seconds
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df.show(10000, False)
Output:
+---------+-------+---------------------+
|name |dollars|timestampGMT |
+---------+-------+---------------------+
|tshilidzi|17.0 |2018-03-10 17:27:18.0|
|tshilidzi|13.0 |2018-03-11 14:27:18.0|
|tshilidzi|25.0 |2018-03-12 13:27:18.0|
|thabo |20.0 |2018-03-13 17:27:18.0|
|thabo |56.0 |2018-03-14 14:27:18.0|
|thabo |99.0 |2018-03-15 13:27:18.0|
|tshilidzi|156.0 |2019-03-22 13:27:18.0|
|thabo |122.0 |2018-03-31 13:27:18.0|
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|
|ash |9999.0 |2018-04-16 13:27:18.0|
+---------+-------+---------------------+
To calculate the moving average per name while keeping all rows:
# function to calculate number of seconds from number of days
days = lambda i: i * 86400
# create window by casting timestamp to long (number of seconds)
w = (Window
     .partitionBy(F.col("name"))
     .orderBy(F.col("timestampGMT").cast('long'))
     .rangeBetween(-days(7), 0))
df2 = df.withColumn('rolling_average', F.avg("dollars").over(w))
df2.show(100, False)
Output:
+---------+-------+---------------------+------------------+
|name |dollars|timestampGMT |rolling_average |
+---------+-------+---------------------+------------------+
|ash |9999.0 |2018-04-16 13:27:18.0|9999.0 |
|tshilidzi|17.0 |2018-03-10 17:27:18.0|17.0 |
|tshilidzi|13.0 |2018-03-11 14:27:18.0|15.0 |
|tshilidzi|25.0 |2018-03-12 13:27:18.0|18.333333333333332|
|tshilidzi|156.0 |2019-03-22 13:27:18.0|156.0 |
|tshilidzi|7000.0 |2019-04-15 13:27:18.0|7000.0 |
|thabo |20.0 |2018-03-13 17:27:18.0|20.0 |
|thabo |56.0 |2018-03-14 14:27:18.0|38.0 |
|thabo |99.0 |2018-03-15 13:27:18.0|58.333333333333336|
|thabo |122.0 |2018-03-31 13:27:18.0|122.0 |
+---------+-------+---------------------+------------------+
It's worth noting that if you don't care about the exact dates, and only want the average over the most recent rows, you can use the rowsBetween function as follows (eurPrices is the input DataFrame here):
w = Window.orderBy('timestampGMT').rowsBetween(-7, 0)
df = eurPrices.withColumn('rolling_average', F.avg('dollars').over(w))
Since you order by the dates, the frame covers the current row and the 7 rows before it, however far apart in time they are.
This saves you all the casting.
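To show how a row-based frame differs from the time-based one, here is a plain-Python sketch of what rowsBetween(-7, 0) selects on the question's three rows: the current row plus up to seven rows before it, regardless of their timestamps. rows_frame_average is a hypothetical helper name.

```python
# Plain-Python sketch of a row-based window average (like rowsBetween(-7, 0)).
# rows_frame_average is a hypothetical helper, not a Spark API.
dollars_sorted_by_time = [17, 13, 25]  # 2017-03-10, 2017-03-15, 2017-03-18

def rows_frame_average(values, preceding=7):
    """Average each value with up to `preceding` values that come before it."""
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + 1]
        averages.append(sum(frame) / len(frame))
    return averages

by_rows = rows_frame_average(dollars_sorted_by_time)
print(by_rows)
```

On this data the last row averages all three values (about 18.33) instead of the 19 that the 7-day window gives, so the two approaches only agree when rows are evenly spaced in time.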
Do you mean this?
from pyspark.sql import functions as f
from pyspark.sql.window import Window
df = spark.createDataFrame([(17, "2017-03-11T15:27:18+00:00"),
                            (13, "2017-03-11T12:27:18+00:00"),
                            (21, "2017-03-17T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', f.avg("dollars").over(Window.partitionBy(f.window("timestampGMT", "7 days"))))
Output:
+-------+-------------------+---------------+
|dollars|timestampGMT |rolling_average|
+-------+-------------------+---------------+
|21 |2017-03-17 19:27:18|21.0 |
|17 |2017-03-11 23:27:18|15.0 |
|13 |2017-03-11 20:27:18|15.0 |
+-------+-------------------+---------------+

Inserting and Deleting data in a Spark Dataframe

I have a PySpark Dataframe input_dataframe as shown below:
cust_id | source_id | value
10      | 11        | test_value
10      | 12        | test_value2
I have another DataFrame, delta_dataframe, which contains updated records from input_dataframe and some new records, as shown below:
cust_id | source_id | value
10      | 11        | update_value
10      | 15        | new_value2
In both DataFrames, the primary key is the combination of cust_id and source_id.
I have to generate a new DataFrame, output_dataframe, which contains the records from input_dataframe with the updated and new records from delta_dataframe applied, so my final DataFrame is as below:
cust_id | source_id | value
10      | 11        | update_value
10      | 12        | test_value2
10      | 15        | new_value2
Can someone please suggest how I can achieve this in PySpark? Any help will be appreciated.
Subtract the two DataFrames based on the primary key. Make an inner join of the output with input_dataframe. Then take the union of it with delta_dataframe. You will get the proper output.
You need to join input_dataframe and delta_dataframe on the two key columns:
output_df = input_df.join(delta_df, (input_df['cust_id'] == delta_df['cust_id']) & (input_df['source_id'] == delta_df['source_id']), 'left_outer')
Then select only the required fields from output_df.
We can use an outer join and pick the value with coalesce, preferring the delta value when it exists:
>>> from pyspark.sql import functions as F
>>> input_dataframe.join(delta_dataframe,['custid','sourceid'],'outer').select('custid','sourceid',F.coalesce(delta_dataframe['value'],input_dataframe['value']).alias('value')).show()
+------+--------+-------------+
|custid|sourceid| value|
+------+--------+-------------+
| 10| 15| new_value2|
| 10| 11|updated_value|
| 10| 12| test_value2|
+------+--------+-------------+
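The effect of the outer join plus coalesce is an "upsert" keyed on (cust_id, source_id). A plain-Python sketch of that rule, using the question's data and ordinary dictionaries rather than Spark, might look like this:

```python
# Plain-Python sketch of the upsert: delta rows win, other input rows are kept.
input_rows = [(10, 11, "test_value"), (10, 12, "test_value2")]
delta_rows = [(10, 11, "update_value"), (10, 15, "new_value2")]

# Key every record by its primary key (cust_id, source_id)
merged = {(cust, src): value for cust, src, value in input_rows}
# Delta values replace matching keys and add new ones (the coalesce step)
merged.update({(cust, src): value for cust, src, value in delta_rows})

output_rows = sorted((cust, src, value) for (cust, src), value in merged.items())
for row in output_rows:
    print(row)
```

Key (10, 11) takes the delta value, (10, 12) keeps its input value, and (10, 15) appears only because the join is a full outer join rather than a left join.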
