How to run a user-defined function over a window in a Spark dataframe? - apache-spark

I am trying to detect outliers in my Spark dataframe. Below is a data sample.
pressure | Timestamp
358.64   | 2022-01-01 00:00:00
354.98   | 2022-01-01 00:10:00
350.34   | 2022-01-01 00:20:00
429.69   | 2022-01-01 00:30:00
420.41   | 2022-01-01 00:40:00
413.82   | 2022-01-01 00:50:00
409.42   | 2022-01-01 01:00:00
409.67   | 2022-01-01 01:10:00
413.33   | 2022-01-01 01:20:00
405.03   | 2022-01-01 01:30:00
1209.42  | 2022-01-01 01:40:00
405.03   | 2022-01-01 01:50:00
404.54   | 2022-01-01 02:00:00
405.27   | 2022-01-01 02:10:00
999.45   | 2022-01-01 02:20:00
362.79   | 2022-01-01 02:30:00
349.37   | 2022-01-01 02:40:00
356.2    | 2022-01-01 02:50:00
3200.23  | 2022-01-01 03:00:00
348.39   | 2022-01-01 03:10:00
Here is my function to find outliers over the entire dataset:
# assumes: from pyspark.sql import functions as F
def outlierDetection(df):
    inter_quantile_range = df.approxQuantile("pressure", [0.20, 0.80], relativeError=0)
    Q1 = inter_quantile_range[0]
    Q3 = inter_quantile_range[1]
    inter_quantile_diff = Q3 - Q1
    minimum_Q1 = Q1 - 1.5 * inter_quantile_diff
    maximum_Q3 = Q3 + 1.5 * inter_quantile_diff
    df = df.withColumn("isOutlier", F.when((df["pressure"] > maximum_Q3) | (df["pressure"] < minimum_Q1), 1).otherwise(0))
    return df
It works as expected, but it flags outliers relative to the range of the whole dataset.
Instead, I want to check for outliers within each hourly interval.
I have created another column with the hourly value, as follows:
pressure | Timestamp           | date_hour
358.64   | 2022-01-01 00:00:00 | 2022-01-01 00
354.98   | 2022-01-01 00:10:00 | 2022-01-01 00
350.34   | 2022-01-01 00:20:00 | 2022-01-01 00
429.69   | 2022-01-01 00:30:00 | 2022-01-01 00
420.41   | 2022-01-01 00:40:00 | 2022-01-01 00
413.82   | 2022-01-01 00:50:00 | 2022-01-01 00
409.42   | 2022-01-01 01:00:00 | 2022-01-01 01
409.67   | 2022-01-01 01:10:00 | 2022-01-01 01
413.33   | 2022-01-01 01:20:00 | 2022-01-01 01
405.03   | 2022-01-01 01:30:00 | 2022-01-01 01
I am trying to create a window like below.
w1= Window.partitionBy("date_hour").orderBy("Timestamp")
Is there any way to use my function over each window in the dataframe?

If you're using Spark 3.1+, you can use percentile_approx to calculate the quantiles and do the rest of the calculations in PySpark. If your Spark version does not have that function, you can use a UDF that calls numpy.quantile for the quantile calculation. Both approaches are shown in the code below.
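(data_ls here is assumed to be the question's sample loaded as a list of (pressure, timestamp-string) tuples, e.g.:)
data_ls = [
    (358.64, '2022-01-01 00:00:00'),
    (354.98, '2022-01-01 00:10:00'),
    (350.34, '2022-01-01 00:20:00'),
    # ... remaining rows from the question's sample ...
    (1209.42, '2022-01-01 01:40:00'),
    (405.03, '2022-01-01 01:50:00'),
]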
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['pressure', 'ts']). \
    withColumn('ts', func.col('ts').cast('timestamp')). \
    withColumn('dt_hr', func.date_format('ts', 'yyyyMMddHH'))
# +--------+-------------------+----------+
# |pressure| ts| dt_hr|
# +--------+-------------------+----------+
# | 358.64|2022-01-01 00:00:00|2022010100|
# | 354.98|2022-01-01 00:10:00|2022010100|
# | 350.34|2022-01-01 00:20:00|2022010100|
# | 429.69|2022-01-01 00:30:00|2022010100|
# | 420.41|2022-01-01 00:40:00|2022010100|
# | 413.82|2022-01-01 00:50:00|2022010100|
# | 409.42|2022-01-01 01:00:00|2022010101|
# | 409.67|2022-01-01 01:10:00|2022010101|
# | 413.33|2022-01-01 01:20:00|2022010101|
# | 405.03|2022-01-01 01:30:00|2022010101|
# | 1209.42|2022-01-01 01:40:00|2022010101|
# | 405.03|2022-01-01 01:50:00|2022010101|
# +--------+-------------------+----------+
Getting the quantiles (both methods are shown; use whichever is available to you):
# spark 3.1+ has percentile_approx
pressure_quantile_sdf = data_sdf. \
    groupBy('dt_hr'). \
    agg(func.percentile_approx('pressure', [0.2, 0.8]).alias('quantile_20_80'))
# +----------+----------------+
# | dt_hr| quantile_20_80|
# +----------+----------------+
# |2022010100|[354.98, 420.41]|
# |2022010101|[405.03, 413.33]|
# +----------+----------------+
# lower versions: use a UDF (ArrayType and FloatType come from pyspark.sql.types)
def numpy_quantile_20_80(list_col):
    import numpy as np
    q_20 = np.quantile(list_col, 0.2)
    q_80 = np.quantile(list_col, 0.8)
    return [float(q_20), float(q_80)]

numpy_quantile_20_80_udf = func.udf(numpy_quantile_20_80, ArrayType(FloatType()))

pressure_quantile_sdf = data_sdf. \
    groupBy('dt_hr'). \
    agg(func.collect_list('pressure').alias('pressure_list')). \
    withColumn('quantile_20_80', numpy_quantile_20_80_udf(func.col('pressure_list')))
# +----------+--------------------+----------------+
# | dt_hr| pressure_list| quantile_20_80|
# +----------+--------------------+----------------+
# |2022010100|[358.64, 354.98, ...|[354.98, 420.41]|
# |2022010101|[409.42, 409.67, ...|[405.03, 413.33]|
# +----------+--------------------+----------------+
The outlier calculation is then straightforward with the quantile info:
pressure_quantile_sdf = pressure_quantile_sdf. \
    withColumn('quantile_20', func.col('quantile_20_80')[0]). \
    withColumn('quantile_80', func.col('quantile_20_80')[1]). \
    withColumn('min_q_20', func.col('quantile_20') - 1.5 * (func.col('quantile_80') - func.col('quantile_20'))). \
    withColumn('max_q_80', func.col('quantile_80') + 1.5 * (func.col('quantile_80') - func.col('quantile_20'))). \
    select('dt_hr', 'min_q_20', 'max_q_80')
# +----------+------------------+------------------+
# | dt_hr| min_q_20| max_q_80|
# +----------+------------------+------------------+
# |2022010100|256.83502197265625| 518.5549926757812|
# |2022010101|392.58001708984375|425.77996826171875|
# +----------+------------------+------------------+
# outlier calc -- select only the columns that are required
data_sdf. \
    join(pressure_quantile_sdf, 'dt_hr', 'left'). \
    withColumn('is_outlier', ((func.col('pressure') > func.col('max_q_80')) | (func.col('pressure') < func.col('min_q_20'))).cast('int')). \
    show()
# +----------+--------+-------------------+------------------+------------------+----------+
# | dt_hr|pressure| ts| min_q_20| max_q_80|is_outlier|
# +----------+--------+-------------------+------------------+------------------+----------+
# |2022010100| 358.64|2022-01-01 00:00:00|256.83502197265625| 518.5549926757812| 0|
# |2022010100| 354.98|2022-01-01 00:10:00|256.83502197265625| 518.5549926757812| 0|
# |2022010100| 350.34|2022-01-01 00:20:00|256.83502197265625| 518.5549926757812| 0|
# |2022010100| 429.69|2022-01-01 00:30:00|256.83502197265625| 518.5549926757812| 0|
# |2022010100| 420.41|2022-01-01 00:40:00|256.83502197265625| 518.5549926757812| 0|
# |2022010100| 413.82|2022-01-01 00:50:00|256.83502197265625| 518.5549926757812| 0|
# |2022010101| 409.42|2022-01-01 01:00:00|392.58001708984375|425.77996826171875| 0|
# |2022010101| 409.67|2022-01-01 01:10:00|392.58001708984375|425.77996826171875| 0|
# |2022010101| 413.33|2022-01-01 01:20:00|392.58001708984375|425.77996826171875| 0|
# |2022010101| 405.03|2022-01-01 01:30:00|392.58001708984375|425.77996826171875| 0|
# |2022010101| 1209.42|2022-01-01 01:40:00|392.58001708984375|425.77996826171875| 1|
# |2022010101| 405.03|2022-01-01 01:50:00|392.58001708984375|425.77996826171875| 0|
# +----------+--------+-------------------+------------------+------------------+----------+
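If you'd prefer to stay closer to the Window approach from the question and avoid the groupBy-plus-join, the same aggregate can be computed over a window partitioned by the hour. This is only a sketch, assuming Spark 3.1+ and that percentile_approx is accepted as a window aggregate in your version; the w_hr and outlier_sdf names are just placeholders:
from pyspark.sql.window import Window

# partition by the hour only; with no orderBy, the frame is the whole partition
w_hr = Window.partitionBy('dt_hr')

outlier_sdf = data_sdf. \
    withColumn('quantile_20_80', func.percentile_approx('pressure', [0.2, 0.8]).over(w_hr)). \
    withColumn('iqr', func.col('quantile_20_80')[1] - func.col('quantile_20_80')[0]). \
    withColumn('min_q_20', func.col('quantile_20_80')[0] - 1.5 * func.col('iqr')). \
    withColumn('max_q_80', func.col('quantile_20_80')[1] + 1.5 * func.col('iqr')). \
    withColumn('is_outlier', ((func.col('pressure') > func.col('max_q_80')) | (func.col('pressure') < func.col('min_q_20'))).cast('int'))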

Related

Subtract second datetime row from first datetime row of a column if another column shows duplicate values

I have a dataframe with two columns, Order date and Customer (each duplicated Customer appears exactly twice, and the data has been sorted). I want to subtract the Order date of the second occurrence of a Customer from the Order date of the first occurrence. Order date is in datetime format.
Here is a sample of the table.
For context, I'm trying to calculate the time it takes for a customer to make a second order.
Order date Customer
4260 2022-11-11 16:29:00 (App admin)
8096 2022-10-22 12:54:00 (App admin)
996 2021-09-22 20:30:00 10013
946 2021-09-14 15:16:00 10013
3499 2022-04-20 12:17:00 100151
... ... ...
2856 2022-03-21 13:49:00 99491
2788 2022-03-18 12:15:00 99523
2558 2022-03-08 12:07:00 99523
2580 2022-03-04 16:03:00 99762
2544 2022-03-02 15:40:00 99762
I have tried deleting rows by index, but that returns just the first two values.
The expected output should be another dataframe with just the Customer name and the difference between the second and first Order dates of each duplicated customer, in minutes.
expected output:
| Customer    | difference in minutes |
| ----------- | --------------------- |
| 1232        | 445.0                 |
| (App Admin) | 3432.0                |
| 1145        | 2455.0                |
| 6653        | 32.0                  |
You can use groupby:
df['Order date'] = pd.to_datetime(df['Order date'])

out = (df.groupby('Customer', as_index=False)['Order date']
         .agg(lambda x: (x.iloc[0] - x.iloc[-1]).total_seconds() / 60)
         .query('`Order date` != 0'))
print(out)
# Output:
      Customer  Order date
0  (App admin)     29015.0
1        10013     11834.0
4        99523     14408.0
5        99762      2903.0
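Note that iloc[0] - iloc[-1] relies on each customer's later order appearing first, as in the sample. A minimal sketch (on a hypothetical subset of the sample data) that sorts explicitly and renames the result column, in case the input order is not guaranteed:
import pandas as pd

df = pd.DataFrame({
    'Order date': ['2022-11-11 16:29:00', '2022-10-22 12:54:00',
                   '2021-09-22 20:30:00', '2021-09-14 15:16:00'],
    'Customer': ['(App admin)', '(App admin)', '10013', '10013'],
})
df['Order date'] = pd.to_datetime(df['Order date'])

# sort so the first row per customer is the earliest order,
# then take (last - first) in minutes
out = (df.sort_values('Order date')
         .groupby('Customer', as_index=False)['Order date']
         .agg(lambda x: (x.iloc[-1] - x.iloc[0]).total_seconds() / 60)
         .rename(columns={'Order date': 'difference in minutes'}))
print(out)
#       Customer  difference in minutes
# 0  (App admin)                29015.0
# 1        10013                11834.0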

How to separate datetime string into date and time in PySpark?

I read this CSV file
PurchaseDatetime    | PurchaseId
29/08/2020 10:09:01 | 9
5/10/2020 7:02      | 4
5/10/2020 9:00      | 6
20/06/2020 02:11:36 | 4
23/10/2020 07:02:15 | 3
6/2/2020 10:10      | 7
As you can see, rows 2, 3 and 6 have a different format from the others.
When I open this CSV in Excel, I find these rows in the format 8/12/2022 12:00:00 AM.
I am trying to clean the data and create separate Date and Time columns:
df = df.withColumn("PurchaseDate", to_date(col("PurchaseDatetime"), "dd/MM/yyyy HH:mm:ss")) \
       .withColumn("PurchaseTime", date_format("PurchaseDatetime", "dd/MM/yyyy HH:mm:ss a"))
I want to get this output:
PurchaseDate | Purchasetime | PurchaseId | PurchaseDatetime
29-08-2020   | 10:09:01     | 9          | 29/08/2020 10:09:01
05-10-2020   | 07:02:00     | 4          | 5/10/2020 7:02
05-10-2020   | 09:00:00     | 6          | 5/10/2020 9:00
20-06-2020   | 02:11:36     | 4          | 20/06/2020 02:11:36
23-10-2020   | 07:02:15     | 3          | 23/10/2020 07:02:15
06-02-2020   | 10:10:00     | 7          | 6/2/2020 10:10
But unfortunately I get this:
PurchaseDate | Purchasetime | PurchaseId | PurchaseDatetime
29-08-2020   | null         | 9          | 29/08/2020 10:09:01
05-10-2020   | null         | 4          | 5/10/2020 7:02
05-10-2020   | null         | 6          | 5/10/2020 9:00
20-06-2020   | null         | 4          | 20/06/2020 02:11:36
23-10-2020   | null         | 3          | 23/10/2020 07:02:15
06-02-2020   | null         | 7          | 6/2/2020 10:10
What is the problem?
date_format converts your column into a string with your specified format. But first, Spark needs to understand what timestamp is in your column; when given a string, it only recognizes the default formats ('yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'). Since your strings use a different format, you first need to convert them into timestamps with to_timestamp and an explicit pattern. And because you have two different string formats, a single pattern would leave nulls in some rows, so coalesce is used to attempt a second conversion with a different pattern on those rows.
Example input:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('29/08/2020 10:09:01', 9),
     ('5/10/2020 7:02', 4),
     ('5/10/2020 9:00', 6),
     ('20/06/2020 02:11:36', 4),
     ('23/10/2020 07:02:15', 3),
     ('6/2/2020 10:10', 7)],
    ['PurchaseDatetime', 'PurchaseId'])
Script:
time = F.coalesce(
    F.to_timestamp('PurchaseDatetime', 'd/M/yyyy H:mm:ss'),
    F.to_timestamp('PurchaseDatetime', 'd/M/yyyy H:mm')
)
df = df.withColumn("PurchaseDate", F.to_date(time)) \
       .withColumn("PurchaseTime", F.date_format(time, 'HH:mm:ss'))
df.show()
# +-------------------+----------+------------+------------+
# | PurchaseDatetime|PurchaseId|PurchaseDate|PurchaseTime|
# +-------------------+----------+------------+------------+
# |29/08/2020 10:09:01| 9| 2020-08-29| 10:09:01|
# | 5/10/2020 7:02| 4| 2020-10-05| 07:02:00|
# | 5/10/2020 9:00| 6| 2020-10-05| 09:00:00|
# |20/06/2020 02:11:36| 4| 2020-06-20| 02:11:36|
# |23/10/2020 07:02:15| 3| 2020-10-23| 07:02:15|
# | 6/2/2020 10:10| 7| 2020-02-06| 10:10:00|
# +-------------------+----------+------------+------------+
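If you want the dd-MM-yyyy strings shown in the question's expected output rather than a true date column, the same time expression can be formatted that way as well; a small sketch:
# note: both columns become formatted string columns, not date/timestamp types
df = df.withColumn("PurchaseDate", F.date_format(time, 'dd-MM-yyyy')) \
       .withColumn("PurchaseTime", F.date_format(time, 'HH:mm:ss'))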

Computing First Day of Previous Quarter in Spark SQL

How do I derive the first day of the previous quarter for any given date in a Spark SQL query using the SQL API? A few required samples are below:
input_date | start_date
------------------------
2020-01-21 | 2019-10-01
2020-02-06 | 2019-10-01
2020-04-15 | 2020-01-01
2020-07-10 | 2020-04-01
2020-10-20 | 2020-07-01
2021-02-04 | 2020-10-01
The Quarters generally are:
1 | Jan - Mar
2 | Apr - Jun
3 | Jul - Sep
4 | Oct - Dec
Note: I am using Spark SQL v2.4.
Any help is appreciated. Thanks.
Use date_trunc after shifting the date back by 3 months:
df.withColumn("start_date", to_date(date_trunc("quarter", expr("input_date - interval 3 months")))) \
  .show()
+----------+----------+
|input_date|start_date|
+----------+----------+
|2020-01-21|2019-10-01|
|2020-02-06|2019-10-01|
|2020-04-15|2020-01-01|
|2020-07-10|2020-04-01|
|2020-10-20|2020-07-01|
|2021-02-04|2020-10-01|
+----------+----------+
Personally, I would create a lookup table of these dates for the next twenty years using Excel or something, and just reference that table.
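Since the question asks for the SQL API specifically, the same expression should also work as a plain SQL query; a sketch, assuming input_date is a date column and the data is registered under a hypothetical temporary view name (input_dates):
df.createOrReplaceTempView("input_dates")
spark.sql("""
    SELECT input_date,
           to_date(date_trunc('quarter', input_date - interval 3 months)) AS start_date
    FROM input_dates
""").show()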

Reassigning substring using a dictionary object within pandas dataframe column

The problem below has been simplified.
The solution should be applicable to larger data-sets and larger dictionaries.
Given a pandas.DataFrame
import pandas as pd

df = pd.DataFrame(data={'foo': [1223, 2931, 3781],
                        'bar': ["34 fake st, footown", "88 real crs, barrington", "28 imaginary st, bazington"]})
| | foo | bar |
|---:|------:|:---------------------------|
| 0 | 1223 | 34 fake st, footown |
| 1 | 2931 | 88 real crs, barrington |
| 2 | 3781 | 28 imaginary st, bazington |
and a dictionary object:
my_dictionary = {'st':'street', 'crs':'crescent'}
What is the best way to replace the substrings contained in a pandas.DataFrame column according to my_dictionary?
I expect to have a resulting pandas.DataFrame that looks like:
| | foo | bar |
|---:|------:|:--------------------------------|
| 0 | 1223 | 34 fake street, footown |
| 1 | 2931 | 88 real crescent, barrington |
| 2 | 3781 | 28 imaginary street, bazington |
I have tried the following:
for key, val in my_dictionary.items():
    df.bar.loc[df.bar.str.contains(key)] = df.bar.loc[df.bar.str.contains(key)].apply(lambda x: x.replace(key, val))
df.bar
which gives the following warning and output:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
0 34 fake street, footown
1 88 real crescent, barrington
2 28 imaginary street, bazington
Name: bar, dtype: object
How can I perform the reassignment without getting the above warning message, and without using .copy()?
You can use Series.replace:
df["bar"] = df["bar"].replace(my_dictionary, regex=True)
print (df)
foo bar
0 1223 34 fake street, footown
1 2931 88 real crescent, barrington
2 3781 28 imaginary street, bazington
Do not use df.bar.loc[...]; that's chained indexing, which triggers the warning. You should do this instead:
df.loc[df.bar.str.contains(key), 'bar'] = ...
However, you can just do
for key, val in my_dictionary.items():
    df['bar'] = df['bar'].str.replace(key, val)
But I would be more cautious and make sure that the replacement only happens where it should:
for key, val in my_dictionary.items():
    # word boundaries so e.g. the `st` inside `street` is not replaced again; regex=True is needed in recent pandas
    df['bar'] = df['bar'].str.replace(fr'\b({key})\b', val, regex=True)
Output:
    foo                             bar
0  1223         34 fake street, footown
1  2931    88 real crescent, barrington
2  3781  28 imaginary street, bazington
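As a variation, the whole dictionary can be applied in a single pass by building one word-bounded pattern and passing a callable replacement to str.replace; a sketch, assuming all keys are plain words:
import re
import pandas as pd

my_dictionary = {'st': 'street', 'crs': 'crescent'}
df = pd.DataFrame({'foo': [1223, 2931, 3781],
                   'bar': ["34 fake st, footown", "88 real crs, barrington", "28 imaginary st, bazington"]})

# one alternation of all keys, anchored on word boundaries
pattern = r'\b(' + '|'.join(re.escape(k) for k in my_dictionary) + r')\b'
df['bar'] = df['bar'].str.replace(pattern, lambda m: my_dictionary[m.group(0)], regex=True)
# bar is now:
# ['34 fake street, footown', '88 real crescent, barrington', '28 imaginary street, bazington']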

How to group by dates in Excel

I have two columns in my Excel sheet, Session_Start_time and Time_taken. Session_Start_time has a date and time, and Time_taken has the time taken to complete the session, like below.
For example
Session_Start_time | Time_Taken
01-AUG-2016 00:03:57 | 10
01-AUG-2016 00:07:19 | 15
01-AUG-2016 00:10:28 | 10
02-AUG-2016 00:13:26 | 20
02-AUG-2016 00:20:26 | 30
02-AUG-2016 00:25:26 | 20
03-AUG-2016 03:20:26 | 30
03-AUG-2016 04:13:26 | 40
03-AUG-2016 07:13:26 | 40
I need to group Session_Start_time by date and get the average Time_Taken for each day.
Session_Start_time | Time_Taken
01-AUG-2016 | 11.67
02-AUG-2016 | 23.33
03-AUG-2016 | 36.66
You could add a third column that pulls out just the date of Session_Start_Time with the formula below starting in C2 and drag it down to fill:
=MONTH(A2)&"/"&DAY(A2)&"/"&YEAR(A2)
From there, you could create a pivot table with your new column as your row labels and Time_Taken as your values.
