How to separate datetime string into date and time in PySpark? - apache-spark

I read this CSV file:
PurchaseDatetime    | PurchaseId
--------------------------------
29/08/2020 10:09:01 | 9
5/10/2020 7:02      | 4
5/10/2020 9:00      | 6
20/06/2020 02:11:36 | 4
23/10/2020 07:02:15 | 3
6/2/2020 10:10      | 7
You can see that rows 2, 3 and 6 are in a different format from the others.
When I open this CSV in Excel, these rows appear in the format 8/12/2022 12:00:00 AM.
I am trying to clean the data and create separate Date and Time columns:
df = df.withColumn("PurchaseDate", to_date(col("PurchaseDatetime"), "dd/MM/yyyy HH:mm:ss")) \
       .withColumn("PurchaseTime", date_format("PurchaseDatetime", "dd/MM/yyyy HH:mm:ss a"))
I want to get this output:
PurchaseDate | PurchaseTime | PurchaseId | PurchaseDatetime
-----------------------------------------------------------
29-08-2020   | 10:09:01     | 9          | 29/08/2020 10:09:01
05-10-2020   | 07:02:00     | 4          | 5/10/2020 7:02
05-10-2020   | 09:00:00     | 6          | 5/10/2020 9:00
20-06-2020   | 02:11:36     | 4          | 20/06/2020 02:11:36
23-10-2020   | 07:02:15     | 3          | 23/10/2020 07:02:15
06-02-2020   | 10:10:00     | 7          | 6/2/2020 10:10
But unfortunately I get this:
PurchaseDate | PurchaseTime | PurchaseId | PurchaseDatetime
-----------------------------------------------------------
29-08-2020   | null         | 9          | 29/08/2020 10:09:01
05-10-2020   | null         | 4          | 5/10/2020 7:02
05-10-2020   | null         | 6          | 5/10/2020 9:00
20-06-2020   | null         | 4          | 20/06/2020 02:11:36
23-10-2020   | null         | 3          | 23/10/2020 07:02:15
06-02-2020   | null         | 7          | 6/2/2020 10:10
What is the problem?

date_format converts your column into a string in the format you specify. But first, Spark needs to understand what timestamp your column holds. If you pass it a plain string column, it only understands strings in the default format ('yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'). Since your strings are in a different format, you first need to convert them to timestamps using to_timestamp. And because you have two different string formats, a single format pattern leaves nulls in some rows, so coalesce attempts a second conversion with a different pattern for those rows.
Example input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('29/08/2020 10:09:01', 9),
     ('5/10/2020 7:02', 4),
     ('5/10/2020 9:00', 6),
     ('20/06/2020 02:11:36', 4),
     ('23/10/2020 07:02:15', 3),
     ('6/2/2020 10:10', 7)],
    ['PurchaseDatetime', 'PurchaseId'])
Script:
time = F.coalesce(
    F.to_timestamp('PurchaseDatetime', 'd/M/yyyy H:mm:ss'),
    F.to_timestamp('PurchaseDatetime', 'd/M/yyyy H:mm')
)
df = df.withColumn("PurchaseDate", F.to_date(time)) \
       .withColumn("PurchaseTime", F.date_format(time, 'HH:mm:ss'))
df.show()
# +-------------------+----------+------------+------------+
# | PurchaseDatetime|PurchaseId|PurchaseDate|PurchaseTime|
# +-------------------+----------+------------+------------+
# |29/08/2020 10:09:01| 9| 2020-08-29| 10:09:01|
# | 5/10/2020 7:02| 4| 2020-10-05| 07:02:00|
# | 5/10/2020 9:00| 6| 2020-10-05| 09:00:00|
# |20/06/2020 02:11:36| 4| 2020-06-20| 02:11:36|
# |23/10/2020 07:02:15| 3| 2020-10-23| 07:02:15|
# | 6/2/2020 10:10| 7| 2020-02-06| 10:10:00|
# +-------------------+----------+------------+------------+
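Note that to_date produces a true date column, which show() displays as yyyy-MM-dd. If PurchaseDate should literally be the dd-MM-yyyy string from the expected output, date_format can build that string from the same time expression; a small sketch (the trade-off is that both columns become strings rather than date/time types):
df = df.withColumn("PurchaseDate", F.date_format(time, "dd-MM-yyyy")) \
       .withColumn("PurchaseTime", F.date_format(time, "HH:mm:ss"))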

Related

subtract second datetime row from first datetime row of a column if another column shows duplicate values

I have a dataframe with two columns, Order date and Customer (the Customer column has duplicates of only 2 values and has been sorted). I want to subtract the first Order date of a Customer from the Order date of their second occurrence. Order date is in datetime format.
Here is a sample of the table.
For context, I'm trying to calculate the time it takes for a customer to make a second order.
Order date Customer
4260 2022-11-11 16:29:00 (App admin)
8096 2022-10-22 12:54:00 (App admin)
996 2021-09-22 20:30:00 10013
946 2021-09-14 15:16:00 10013
3499 2022-04-20 12:17:00 100151
... ... ...
2856 2022-03-21 13:49:00 99491
2788 2022-03-18 12:15:00 99523
2558 2022-03-08 12:07:00 99523
2580 2022-03-04 16:03:00 99762
2544 2022-03-02 15:40:00 99762
I have tried deleting by index but it returns just the first two values.
The expected output should be another dataframe with just the Customer name and the difference in minutes between the second and first Order dates of the duplicated customers:
| Customer    | difference in minutes |
| ----------- | --------------------- |
| 1232        | 445.0                 |
| (App Admin) | 3432.0                |
| 1145        | 2455.0                |
| 6653        | 32.0                  |
You can use groupby:
import pandas as pd

df['Order date'] = pd.to_datetime(df['Order date'])
out = (df.groupby('Customer', as_index=False)['Order date']
         .agg(lambda x: (x.iloc[0] - x.iloc[-1]).total_seconds() / 60)
         .query('`Order date` != 0'))
print(out)
# Output:
Customer Order date
0 (App admin) 29015.0
1 10013 11834.0
4 99523 14408.0
5 99762 2903.0
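For reference, here is a minimal reproduction using a few of the sample rows above (treating the customer IDs as strings is an assumption), with the result column renamed to match the requested output:
import pandas as pd

df = pd.DataFrame({
    'Order date': ['2022-11-11 16:29:00', '2022-10-22 12:54:00',
                   '2021-09-22 20:30:00', '2021-09-14 15:16:00',
                   '2022-04-20 12:17:00'],
    'Customer': ['(App admin)', '(App admin)', '10013', '10013', '100151'],
})

df['Order date'] = pd.to_datetime(df['Order date'])
out = (df.groupby('Customer', as_index=False)['Order date']
         .agg(lambda x: (x.iloc[0] - x.iloc[-1]).total_seconds() / 60)
         .query('`Order date` != 0')                                   # drops customers with a single order
         .rename(columns={'Order date': 'difference in minutes'}))
print(out)
#       Customer  difference in minutes
# 0  (App admin)                29015.0
# 1        10013                11834.0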

PySpark: (broadcast) joining two datasets on closest datetimes/unix

I am using PySpark and am close to giving up on my problem.
I have two data sets: one very very very large one (set A) and one that is rather small (set B).
They are of the form:
Data set A:
variable | timestampA
---------------------------------
x | 2015-01-01 09:29:21
y | 2015-01-01 12:01:57
Data set B:
different information | timestampB
-------------------------------------------
info a | 2015-01-01 09:30:00
info b | 2015-01-01 09:30:00
info a | 2015-01-01 12:00:00
info b | 2015-01-01 12:00:00
A has many rows where each row has a different time stamp. B has a time stamp every couple of minutes. The main problem here is, that there are no exact time stamps that match in both data sets.
My goal is to join the data sets on the nearest time stamp. An additional problem arises since I want to join in a specific way.
For each entry in A, I want to attach all the information from B for the closest timestampB, duplicating the entry in A as needed. So the result should look like:
Final data set
variable | timestampA | information | timestampB
--------------------------------------------------------------------------
x | 2015-01-01 09:29:21 | info a | 2015-01-01 09:30:00
x | 2015-01-01 09:29:21 | info b | 2015-01-01 09:30:00
y | 2015-01-01 12:01:57 | info a | 2015-01-01 12:00:00
y | 2015-01-01 12:01:57 | info b | 2015-01-01 12:00:00
I am very new to PySpark (and also Stack Overflow). I figured that I probably need to use a window function and/or a broadcast join, but I really don't know where to start and would appreciate any help. Thank you!
You can use broadcast to avoid shuffling.
If I understand correctly, the timestamps in set_B occur at a regular, known interval? If so, you can do the following:
from pyspark.sql import functions as F
# assuming 5 minutes is your interval in set_B
interval = 'INTERVAL {} SECONDS'.format(5 * 60 / 2)

res = set_A.join(
    F.broadcast(set_B),
    (set_A['timestampA'] > (set_B['timestampB'] - F.expr(interval))) &
    (set_A['timestampA'] <= (set_B['timestampB'] + F.expr(interval)))
)
Output:
+--------+-------------------+------+-------------------+
|variable| timestampA| info| timestampB|
+--------+-------------------+------+-------------------+
| x|2015-01-01 09:29:21|info a|2015-01-01 09:30:00|
| x|2015-01-01 09:29:21|info b|2015-01-01 09:30:00|
| y|2015-01-01 12:01:57|info a|2015-01-01 12:00:00|
| y|2015-01-01 12:01:57|info b|2015-01-01 12:00:00|
+--------+-------------------+------+-------------------+
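For reference, the two sample frames above can be constructed like this (a sketch: the column names follow the question and the output above, and casting the strings to timestamps is my assumption):
from pyspark.sql import functions as F

# hypothetical construction of the sample data; 'info' matches the column name in the output above
set_A = spark.createDataFrame(
    [('x', '2015-01-01 09:29:21'), ('y', '2015-01-01 12:01:57')],
    ['variable', 'timestampA']
).withColumn('timestampA', F.to_timestamp('timestampA'))

set_B = spark.createDataFrame(
    [('info a', '2015-01-01 09:30:00'), ('info b', '2015-01-01 09:30:00'),
     ('info a', '2015-01-01 12:00:00'), ('info b', '2015-01-01 12:00:00')],
    ['info', 'timestampB']
).withColumn('timestampB', F.to_timestamp('timestampB'))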
If you don't have a fixed interval, then a cross join followed by keeping the row with the minimum |timestampA - timestampB| can do the trick. You can do that with a window function and row_number, like the following:
from pyspark.sql import Window  # needed for the window spec
res = set_A.crossJoin(F.broadcast(set_B))  # pair every row of A with every row of B
w = Window.partitionBy('variable', 'info').orderBy(F.abs(F.col('timestampA').cast('int') - F.col('timestampB').cast('int')))
res = res.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')

Computing First Day of Previous Quarter in Spark SQL

How do I derive the first day of the previous quarter for any given date in a Spark SQL query using the SQL API? A few sample inputs and required outputs are below:
input_date | start_date
------------------------
2020-01-21 | 2019-10-01
2020-02-06 | 2019-10-01
2020-04-15 | 2020-01-01
2020-07-10 | 2020-04-01
2020-10-20 | 2020-07-01
2021-02-04 | 2020-10-01
The Quarters generally are:
1 | Jan - Mar
2 | Apr - Jun
3 | Jul - Sep
4 | Oct - Dec
Note: I am using Spark SQL v2.4.
Any help is appreciated. Thanks.
Use date_trunc after subtracting 3 months:
from pyspark.sql.functions import to_date, date_trunc, expr

df.withColumn("start_date", to_date(date_trunc("quarter", expr("input_date - interval 3 months")))) \
  .show()
+----------+----------+
|input_date|start_date|
+----------+----------+
|2020-01-21|2019-10-01|
|2020-02-06|2019-10-01|
|2020-04-15|2020-01-01|
|2020-07-10|2020-04-01|
|2020-10-20|2020-07-01|
|2021-02-04|2020-10-01|
+----------+----------+
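Since the question asks for the SQL API specifically, the same expression should also work as a Spark SQL statement; a sketch, assuming the data is registered as a temp view (the view name dates and the two-row sample frame are my assumptions):
from pyspark.sql import functions as F

# hypothetical sample frame registered as a temp view so the SQL API can be used
df = spark.createDataFrame([('2020-01-21',), ('2021-02-04',)], ['input_date']) \
          .withColumn('input_date', F.to_date('input_date'))
df.createOrReplaceTempView('dates')

spark.sql("""
    SELECT input_date,
           to_date(date_trunc('quarter', input_date - interval 3 months)) AS start_date
    FROM dates
""").show()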
Personally, I would create a lookup table with the dates for the next twenty years using Excel or something and just reference that table.

Separating lines with multiple values in one cell to individual lines in excel [duplicate]

This question already has answers here:
Unnest (explode) a Pandas Series
I have a data set (CSV file) that lists names along with the number of people that have each name and the name's "rank".
I am looking for a way to separate all the names into single lines, ideally in Excel, but maybe something in pandas is an option.
The problem is that many of the lines contain multiple comma-separated names.
The data looks like this:
rank | number of occurrences | name
1 | 10000 | marie
2 | 9999 | sophie
3 | 9998 | ellen
...
...
50 | 122 | jude, allan, jaspar
I would like to have each name on an individual line alongside its corresponding number of occurrences. It's fine that the rank is duplicated.
Something like this:
rank | number of occurrences | name
1 | 10000 | marie
2 | 9999 | sophie
3 | 9998 | ellen
..
...
50 | 122 | jude
50 | 122 | allan
50 | 122 | jaspar
Use df.explode()
df.assign(name=(df.name.str.split(','))).explode('name')
How it works:
df.assign(name=...)        # equivalent of df.name = ...
df.name.str.split(',')     # puts the names into a list
df.explode('name')         # expands each list so every name gets its own row
rank number of occurrences name
0 1 10000 marie
1 2 9999 sophie
2 3 9998 ellen
3 50 122 jude
3 50 122 allan
3 50 122 jaspar
In [60]: df
Out[60]:
rank no name
0 50 122 jude, allan, jaspar
In [61]: df.assign(name=df['name'].str.split(',')).explode('name')
Out[61]:
rank no name
0 50 122 jude
0 50 122 allan
0 50 122 jaspar
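One caveat the answers above don't mention: splitting on ',' keeps the space after each comma (' allan', ' jaspar'). A small sketch that strips each piece before exploding (the column names follow the sample data; the strip step is my addition):
import pandas as pd

df = pd.DataFrame({'rank': [50], 'number of occurrences': [122],
                   'name': ['jude, allan, jaspar']})

# strip the space left after each comma before exploding
out = (df.assign(name=df['name'].str.split(',')
                                .map(lambda names: [n.strip() for n in names]))
         .explode('name'))
print(out['name'].tolist())  # ['jude', 'allan', 'jaspar']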

Excel: Difference in hours between duplicates

I am having a problem, hope you can help.
I need to get the difference in hours between duplicates. Example:
Date Time | SESSION_ID | Column I need
24/01/2020 10:00 | 100 | NaN
24/01/2020 11:00 | 100 | 1
14/03/2020 12:00 | 290 | NaN
16/03/2020 13:00 | 254 | NaN
16/03/2020 14:00 | 100 | 1251
In the SESSION_ID column, there are 3 rows with the value 100.
I need to know the difference in hours between those sessions, which would be 1 hour between the first and the second, and 1251 hours between the second and the third.
Does anyone have any clue how this could be done?
If one has the Dynamic Array formula XLOOKUP, put this in C2 and copy down:
=IF(COUNTIF($B$1:B1,B2),A2-XLOOKUP(B2,$B$1:B1,$A$1:A1,,0,-1),"NaN")
Then format the column with the custom number format [h].
If not, then use INDEX/AGGREGATE in its place:
=IF(COUNTIF($B$1:B1,B2),A2-INDEX(A:A,AGGREGATE(14,7,ROW($B$1:B1)/($B$1:B1=B2),1)),"NaN")
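If the data ever ends up outside Excel, the same per-ID difference can be computed in pandas with a grouped diff; a sketch for comparison only, assuming the two columns shown in the question:
import pandas as pd

df = pd.DataFrame({
    'Date Time': ['24/01/2020 10:00', '24/01/2020 11:00', '14/03/2020 12:00',
                  '16/03/2020 13:00', '16/03/2020 14:00'],
    'SESSION_ID': [100, 100, 290, 254, 100],
})

df['Date Time'] = pd.to_datetime(df['Date Time'], dayfirst=True)
# hours since the previous row with the same SESSION_ID; NaN for first occurrences
df['hours_since_previous'] = (
    df.groupby('SESSION_ID')['Date Time'].diff().dt.total_seconds() / 3600
)
print(df['hours_since_previous'].tolist())  # [nan, 1.0, nan, nan, 1251.0]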
