How to repartition in Spark based on a column? - apache-spark

I want to repartition the dataframe based on the day column.
For example, I have 90 days of data in the dataframe and I want to partition the data by day, so that each day ends up in its own partition.
I want syntax like the below:
df.repartition("day", 90)
where
day => column in the dataframe
90 => number of partitions I want

You can do that with:
import spark.implicits._
// use the number of distinct days as the partition count
df.repartition(df.select($"day").distinct().count().toInt, $"day")
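The repartition API also accepts the number of partitions together with the column expression, so if you already know you want 90 partitions you can pass both directly, which is essentially the syntax you asked for with the arguments swapped. A minimal PySpark sketch (the day column is from the question; the sample rows are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-in dataframe with a "day" column; the real one has 90 days of data
df = spark.createDataFrame([("2019-01-01", 1), ("2019-01-02", 2)], ["day", "value"])

# 90 partitions, hash-partitioned on day: all rows for a given day
# end up in the same partition
df = df.repartition(90, "day")
print(df.rdd.getNumPartitions())  # 90

Keep in mind this is hash partitioning: every row for a given day lands in the same partition, but two different days can end up sharing a partition, so with exactly 90 partitions for 90 days some partitions may hold more than one day.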

Related

Best way to merge two pandas dataframes having 150k records each. Merging should happen on one common column like ID

I need to merge two pandas dataframes, each with 150k records, on one common column, ID:
df1 (1st dataframe, 150k records)
df2 (2nd dataframe, 150k records)
df3 = pd.merge(df1, df2, on='ID', how='left')
Is this the correct way to merge?
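For what it's worth, that is the standard way to do it: a left merge keeps every row of df1 and attaches the matching df2 columns by ID. A tiny self-contained sketch (made-up frames, just to show the behaviour):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'a': ['x', 'y', 'z']})
df2 = pd.DataFrame({'ID': [1, 2], 'b': [10, 20]})

# left merge: all rows of df1 are kept; ID=3 gets NaN for df2's columns
df3 = pd.merge(df1, df2, on='ID', how='left')
print(df3)

If ID is supposed to be unique in both frames, passing validate='one_to_one' to pd.merge will raise an error when that assumption is violated.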

How to repartition dataframe after filter? pyspark

I have the following code:
df = df.where(df["count"] > 5)
The df before the filter had 500M rows and after the filter it has 10M rows.
I understand that repartitioning can improve performance in this case, because the data size changed dramatically while the number of partitions stayed the same.
My question is how to choose the column to repartition on.
I have a key column that is unique across all values and a category column that is not distinct.
Should I do df.repartition("key")? (key has 10M distinct values out of 10M rows)
Should I do df.repartition("category")? (category has 200k distinct values out of 10M rows)

Add prefix to all values in a column in the most efficient way

I have a dataframe with over 5M rows. I am concerned with just one column in this dataframe. Let's assume the dataframe is named df and the column in consideration is df['id']. An example of the dataframe is shown below:
df['id'] :
id
0 432000000
1 432000010
2 432000020
The column df['id'] is stored as a string.
I want to add a prefix to all the rows of a particular column in this dataframe. Below is the code I use:
for i in tqdm(range(0, len(df['id']))):
    df['id'][i] = 'ABC-1234-' + df['id'][i]
While the above code works, it shows an estimated 15 hours to complete. Is there a more efficient way to perform this task?
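A sketch of the usual vectorized alternative: because df['id'] is already a string column, pandas can broadcast the concatenation over the whole Series at once instead of looping row by row, which is typically orders of magnitude faster for 5M rows:

import pandas as pd

# small sample taken from the question; the real frame has over 5M rows
df = pd.DataFrame({'id': ['432000000', '432000010', '432000020']})

# vectorized: concatenate the prefix onto the whole Series at once
df['id'] = 'ABC-1234-' + df['id']
print(df)

# if the dtype is not guaranteed to be string:
# df['id'] = 'ABC-1234-' + df['id'].astype(str)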

spark - get average of past N records excluding the current record

Given a Spark dataframe that I have
val df = Seq(
("2019-01-01",100),
("2019-01-02",101),
("2019-01-03",102),
("2019-01-04",103),
("2019-01-05",102),
("2019-01-06",99),
("2019-01-07",98),
("2019-01-08",100),
("2019-01-09",47)
).toDF("day","records")
I want to add a new column to this so that I get the average value of the last N records on a given day. For example, if N=3, then on a given day that value should be the average of the last 3 values, EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window frame should be ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING, which translates to rowsBetween(-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg

# frame: the 3 rows before the current row, excluding the current row
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
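For the sample data above, the new column should come out roughly like this (the first row is null because there are no preceding rows, and 2019-01-05 matches the (103+102+101)/3 example from the question):

2019-01-01  null
2019-01-02  100.0
2019-01-03  100.5
2019-01-04  101.0
2019-01-05  102.0
2019-01-06  102.33
2019-01-07  101.33
2019-01-08  99.67
2019-01-09  99.0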

How to filter out duplicate rows based on some columns in spark dataframe?

Suppose I have a dataframe like below:
Here, you can see that transaction numbers 1, 2 and 3 have the same values for columns A, B and C but different values for columns D and E. Column E has date entries.
For the same A, B and C combination (A=1, B=1, C=1), we have 3 rows. I want to take only one row, based on the most recent transaction date in column E. But for the most recent date there are 2 transactions, and I want to take just one of them if two or more rows are found for the same combination of A, B, C and the most recent date in column E.
So my expected output for this combination will be row number 3 or 4 (either one will do).
For the same A, B and C combination (A=2, B=2, C=2), we have 2 rows. But based on column E, the most recent date is the date of row number 5. So we will just take this row for this combination of A, B and C.
So my expected output for this combination will be row number 5
So the final output will be (3 and 5) or (4 and 5).
Now, how should I approach this?
I read this:
Both reduceByKey and groupByKey can be used for the same purposes but reduceByKey works much better on a large dataset. That’s because Spark knows it can combine output with a common key on each partition before shuffling the data.
I tried groupBy on columns A, B, C and max on column E, but that does not give me back the full rows when multiple rows are present for the same date.
What is the most optimized approach to solve this? Thanks in advance.
EDIT: I also need to get back my filtered transactions. How can I do that as well?
I have used Spark window functions to get my solution:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window
  .partitionBy(dataframe("A"), dataframe("B"), dataframe("C"))
  .orderBy(dataframe("E").desc)

// row_number() === 1 keeps exactly one row per (A, B, C): the one with the latest E
val dfWithRowNumber = dataframe.withColumn("row_number", row_number() over window)
// drop the helper column to get the original transactions back (see the EDIT above)
val filteredDf = dfWithRowNumber.filter(dfWithRowNumber("row_number") === 1).drop("row_number")
This is possible in a few steps. First, the aggregated DataFrame:
import org.apache.spark.sql.functions.{col, max}

val agregatedDF = initialDF.select("A", "B", "C", "E").groupBy("A", "B", "C").agg(max("E").as("E_max"))
Then join the initial and aggregated DataFrames, keeping only the rows at the per-group maximum:
initialDF.join(agregatedDF, List("A", "B", "C")).where(col("E") === col("E_max"))
If the initial DataFrame comes from Hive, all of this can be simplified.
import spark.implicits._
val initialDF = Seq((1,1,1,1,"2/28/2017 0:00"), (1,1,1,2,"3/1/2017 0:00"),
  (1,1,1,3,"3/1/2017 0:00"), (2,2,2,1,"2/28/2017 0:00"), (2,2,2,2,"2/25/2017 0:00"))
This will miss out on the corresponding col("D"):
initialDF
.toDS.groupBy("_1","_2","_3")
.agg(max(col("_5"))).show
In case you want the corresponding colD for the max col:
initialDF.toDS.map(x => (x._1, x._2, x._3, (x._5, x._4))).groupBy("_1", "_2", "_3")
  .agg(max(col("_4")).as("_4")).select(col("_1"), col("_2"), col("_3"), col("_4._2"), col("_4._1")).show
For reduceByKey you can convert the dataset to a pair RDD and then work off it. It should be faster in case Catalyst is not able to optimize the groupByKey in the first one. Refer to Rolling your own reduceByKey in Spark Dataset.
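A rough PySpark sketch of that pair-RDD idea, in case it is useful: key each row by (A, B, C) and reduce by keeping the row with the larger E. The column names and sample rows are taken from the example above; E is parsed to a real timestamp first, because the M/D/YYYY strings do not compare correctly as text:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# same sample rows as initialDF above, with E parsed to a real timestamp
df = (spark.createDataFrame(
        [(1,1,1,1,"2/28/2017 0:00"), (1,1,1,2,"3/1/2017 0:00"), (1,1,1,3,"3/1/2017 0:00"),
         (2,2,2,1,"2/28/2017 0:00"), (2,2,2,2,"2/25/2017 0:00")],
        ["A", "B", "C", "D", "E"])
      .withColumn("E", F.to_timestamp("E", "M/d/yyyy H:mm")))

pairs = df.rdd.map(lambda row: ((row["A"], row["B"], row["C"]), row))

# keep the row with the latest E per (A, B, C); rows are combined per key on each
# partition before the shuffle, which is what makes reduceByKey cheaper than groupByKey
latest = pairs.reduceByKey(lambda r1, r2: r1 if r1["E"] >= r2["E"] else r2)

spark.createDataFrame(latest.values()).show()

Ties on E keep an arbitrary one of the tied rows, which matches the "either one will do" requirement in the question.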
