filter nearest row by date in pyspark - apache-spark

I want to filter by date on my pyspark dataframe.
I have a dataframe like this:
+------+---------+------+-------------------+-------------------+----------+
|amount|cost_type|place2| min_ts| max_ts| ds|
+------+---------+------+-------------------+-------------------+----------+
|100000| reorder| 1.0|2020-10-16 10:16:31|2020-11-21 18:50:27|2021-05-29|
|250000|overusage| 1.0|2020-11-21 18:48:02|2021-02-09 20:07:28|2021-05-29|
|100000|overusage| 1.0|2019-05-12 16:00:40|2020-11-21 18:44:04|2021-05-29|
|200000| reorder| 1.0|2020-11-21 19:00:09|2021-05-29 23:56:25|2021-05-29|
+------+---------+------+-------------------+-------------------+----------+
And I want to keep just one row per cost_type: the row whose time range is nearest to ds.
For example, for ds = '2021-05-29' my filter should select the second and fourth rows, but for ds = '2020-05-01' it should select the first and third rows. If ds falls between a row's min_ts and max_ts, my filter should select that row for that cost_type.

A possible way is to assign row numbers based on some conditions:
Whether ds is between min_ts and max_ts.
If not, the smaller of the absolute date difference between ds and min_ts, or between ds and max_ts.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('cost_type').orderBy(
    # rows where ds falls inside [min_ts, max_ts] sort first
    F.col('ds').cast('timestamp').between(F.col('min_ts'), F.col('max_ts')).desc(),
    # then the row whose nearest boundary is closest to ds
    F.least(F.abs(F.datediff('ds', 'max_ts')), F.abs(F.datediff('ds', 'min_ts')))
)

df2 = df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')
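As a sanity check (a sketch, not part of the original answer), you can rebuild the example dataframe from the question and apply the window above; per the expectation stated in the question, for ds = '2021-05-29' only the second and fourth rows should remain, one per cost_type:
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Example data from the question; min_ts/max_ts are cast to timestamps
df = spark.createDataFrame(
    [(100000, 'reorder',   1.0, '2020-10-16 10:16:31', '2020-11-21 18:50:27', '2021-05-29'),
     (250000, 'overusage', 1.0, '2020-11-21 18:48:02', '2021-02-09 20:07:28', '2021-05-29'),
     (100000, 'overusage', 1.0, '2019-05-12 16:00:40', '2020-11-21 18:44:04', '2021-05-29'),
     (200000, 'reorder',   1.0, '2020-11-21 19:00:09', '2021-05-29 23:56:25', '2021-05-29')],
    ['amount', 'cost_type', 'place2', 'min_ts', 'max_ts', 'ds'],
).withColumn('min_ts', F.col('min_ts').cast('timestamp')) \
 .withColumn('max_ts', F.col('max_ts').cast('timestamp'))

w = Window.partitionBy('cost_type').orderBy(
    F.col('ds').cast('timestamp').between(F.col('min_ts'), F.col('max_ts')).desc(),
    F.least(F.abs(F.datediff('ds', 'max_ts')), F.abs(F.datediff('ds', 'min_ts')))
)
df.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn').show()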

Related

How to update an empty Pandas dataframe with the sum of the bottom x rows of a column from another Pandas dataframe

I would like to sum the bottom x rows of each column of a dataframe and store the result in another, empty dataframe.
I tried the code below but I could not update the dataframe.
The master DataFrame is 'df_new_final' and it contains numerical values.
I want to fill 'df_new_final_tail' with the sum of the last 15 rows of each column from the master DataFrame. But df_new_final_tail stays empty, even though I can see that 'sum_x' is being calculated. Not sure why it is not getting updated.
Master DataFrame -> df_new_final
Child DataFrame -> df_new_final_tail
df_series_list = df_series.columns.values.tolist()
df_new_final_tail = pd.DataFrame(columns=df_series_list)
for items in df_series_list:
    sum_x = df_new_final.tail(15)[items + '_buy'].sum()
    df_new_final_tail[items] = sum_x
Thanks
Convert the Series produced by sum to a one-column DataFrame with Series.to_frame, then transpose it with DataFrame.T to get a one-row DataFrame:
df_new_final_tail = df_new_final.tail(15).sum().to_frame().T
If df_series is another DataFrame whose column names, with the _buy suffix appended, match columns in df_new_final, use:
items = df_series.columns
df_new_final_tail = df_new_final.tail(15)[items+'_buy'].sum().to_frame().T
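For illustration, a minimal sketch with a made-up two-column frame (the column names here are hypothetical) showing the shape of the result:
import pandas as pd

# Hypothetical frame with two "_buy" columns
df_new_final = pd.DataFrame({'a_buy': [1, 2, 3], 'b_buy': [4, 5, 6]})

# Sum the last 2 rows of each column; .to_frame() turns the resulting Series
# into a one-column DataFrame and .T transposes it into a single row
df_new_final_tail = df_new_final.tail(2).sum().to_frame().T
print(df_new_final_tail)  # one row: a_buy = 5, b_buy = 11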

How can I get the same result as Pandas iloc in PySpark?

In a Pandas dataframe I can get the first 1000 rows with data.iloc[1:1000, :]. How can I do it in PySpark?
You can use df.limit(1000) to get 1000 rows from your dataframe. Note that Spark has no concept of an index, so it will just return 1000 arbitrary rows. If you need a particular ordering, you can assign a row number based on a certain column and filter on the row number, e.g.
import pyspark.sql.functions as F
from pyspark.sql import Window

df2 = df.withColumn('rn', F.row_number().over(Window.orderBy('col_to_order'))) \
        .filter('rn <= 1000')

Spark drop duplicates and select row with max value

I'm trying to drop duplicates based on column1 and select the row with the max value in column2. column2 holds a year (2019, 2020, etc.) and is of type "String". The solution I have is converting column2 to an integer and selecting the max value.
Dataset<Row> ds ; //The dataset with column1,column2(year), column3 etc.
Dataset<Row> newDs = ds.withColumn("column2Int", col("column2").cast(DataTypes.IntegerType));
newDs = newDs.groupBy("column1").max("column2Int"); // drops all other columns
This approach drops all the other columns in the original dataset 'ds' when I do the "group by", so I have to join 'ds' and 'newDs' to get back all the original columns. Also, casting the String column to Integer looks like a clumsy workaround.
Is it possible to drop the duplicates and get the row with the largest string value from the original dataset itself?
This is a classic de-duplication problem, and you'll need a Window + rank + filter combination for it.
I'm not very familiar with the Java syntax, but the sample code should look something like this:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

Dataset<Row> df = ???;

WindowSpec windowSpec = Window.partitionBy("column1").orderBy(functions.desc("column2Int"));

Dataset<Row> result =
    df.withColumn("column2Int", functions.col("column2").cast(DataTypes.IntegerType))
      .withColumn("rank", functions.rank().over(windowSpec))
      .where("rank == 1")
      .drop("rank");

result.show(false);
Overview of what happens:
The year column cast to integer is added to the df for sorting.
Windows (partitions) are formed in your dataset based on the value of column1.
Within each partition the rows are sorted on the integer-cast column, in descending order since you want the max.
Ranks (similar to row numbers) are assigned to the rows in each partition/window.
The filter keeps only the rows whose rank is 1 (the max value, since the ordering is descending).
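Not part of the Java answer above, but for comparison, a minimal PySpark sketch of the same window + rank + filter idea (using the same column names as the question):
from pyspark.sql import functions as F, Window

# Partition by the dedup key and sort by the year cast to int, descending
w = Window.partitionBy('column1').orderBy(F.col('column2').cast('int').desc())

result = (
    df.withColumn('rank', F.rank().over(w))
      .where('rank = 1')
      .drop('rank')
)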

How to select multiple rows and take the mean value based on the name of the row

From this data frame I would like to select rows with the same concentration and almost the same name. For example, the first three rows have the same concentration and the same name except for the suffixes Dig_I, Dig_II, Dig_III. I would like to select those three rows, take the mean value of each column, and then create a new data frame from the result.
Here is the whole data frame:
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new_df = df.groupby('concentration').mean()
Note: this will only average the columns with dtype float or int, so the img_name column will be dropped.
This may be faster...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js").groupby('concentration').mean()
If you would like to preserve the img_name...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new = df.groupby('concentration').mean()
pd.merge(df, new, left_on = 'concentration', right_on = 'concentration', how = 'inner')
Does that help?
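As an illustrative sketch (with a tiny made-up frame, since the question only links to a gist), the groupby/merge pattern looks like this:
import pandas as pd

# Hypothetical miniature version of the data
df = pd.DataFrame({
    'img_name': ['x_Dig_I', 'x_Dig_II', 'y_Dig_I'],
    'concentration': [1.0, 1.0, 2.0],
    'reading': [10.0, 20.0, 30.0],
})

# Mean of the numeric columns per concentration (non-numeric columns are dropped)
new = df.groupby('concentration', as_index=False).mean(numeric_only=True)

# Merge the means back onto the original rows to keep img_name
merged = pd.merge(df, new, on='concentration', how='inner', suffixes=('', '_mean'))
print(merged)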

spark - get average of past N records excluding the current record

Given a Spark dataframe that I have
val df = Seq(
  ("2019-01-01", 100),
  ("2019-01-02", 101),
  ("2019-01-03", 102),
  ("2019-01-04", 103),
  ("2019-01-05", 102),
  ("2019-01-06", 99),
  ("2019-01-07", 98),
  ("2019-01-08", 100),
  ("2019-01-09", 47)
).toDF("day", "records")
I want to add a new column so that, on a given day, I get the average of the last N records, EXCLUDING the current record. For example, if N=3, the value for day 2019-01-05 would be (103+102+101)/3.
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window frame should be ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg

# The frame covers the 3 rows before the current row and excludes the current row
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
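For reference, a sketch of the same frame written out in Spark SQL (the view name records_by_day is arbitrary, and spark is the usual SparkSession):
df.createOrReplaceTempView("records_by_day")

spark.sql("""
    SELECT day,
           records,
           AVG(records) OVER (
               ORDER BY day
               ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
           ) AS rsum_prev_3_days
    FROM records_by_day
""").show()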
