PySpark: (broadcast) joining two datasets on closest datetimes/unix - apache-spark

I am using PySpark and are close to giving up on my problem.
I have two data sets: one very very very large one (set A) and one that is rather small (set B).
They are of the form:
Data set A:
variable | timestampA
---------------------------------
x | 2015-01-01 09:29:21
y | 2015-01-01 12:01:57
Data set B:
different information | timestampB
-------------------------------------------
info a | 2015-01-01 09:30:00
info b | 2015-01-01 09:30:00
info a | 2015-01-01 12:00:00
info b | 2015-01-01 12:00:00
A has many rows where each row has a different time stamp. B has a time stamp every couple of minutes. The main problem here is, that there are no exact time stamps that match in both data sets.
My goal is to join the data sets on the nearest time stamp. An additional problem arises since I want to join in a specific way.
For each entry in A, I want to map the entire information for the closest timestamp while duplicating the entry in A. So, the result should look like:
Final data set
variable | timestampA | information | timestampB
--------------------------------------------------------------------------
x | 2015-01-01 09:29:21 | info a | 2015-01-01 09:30:00
x | 2015-01-01 09:29:21 | info b | 2015-01-01 09:30:00
y | 2015-01-01 12:01:57 | info a | 2015-01-01 12:00:00
y | 2015-01-01 12:01:57 | info b | 2015-01-01 12:00:00
I am very new to PySpark (and also stackoverflow). I figured that I probably need to use a window function and/or a broadcast join, but I really have no point to start and would appreciate any help. Thank you!

You can you use broadcast to avoid shuffling.
If understand correctly you have timestamps in set_B which are consequent with some determined interval? If so you can do the following:
from pyspark.sql import functions as F
# assuming 5 minutes is your interval in set_B
interval = 'INTERVAL {} SECONDS'.format(5 * 60 / 2)
res = set_A.join(F.broadcast(set_B), (set_A['timestampA'] > (set_B['timestampB'] - F.expr(interval))) & (set_A['timestampA'] <= (set_B['timestampB'] + F.expr(interval))))
Output:
+--------+-------------------+------+-------------------+
|variable| timestampA| info| timestampB|
+--------+-------------------+------+-------------------+
| x|2015-01-01 09:29:21|info a|2015-01-01 09:30:00|
| x|2015-01-01 09:29:21|info b|2015-01-01 09:30:00|
| y|2015-01-01 12:01:57|info a|2015-01-01 12:00:00|
| y|2015-01-01 12:01:57|info b|2015-01-01 12:00:00|
+--------+-------------------+------+-------------------+
If you don't have determined interval then only cross join and then finding min(timestampA - timestampB) interval can do the trick. You can do that with window function and row_number function like following:
w = Window.partitionBy('variable', 'info').orderBy(F.abs(F.col('timestampA').cast('int') - F.col('timestampB').cast('int')))
res = res.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')

Related

subtract second datetime row from first datetime row of a column if another column shows duplicate values

I have a dataframe with two columns Order date and Customer(which have duplicates of only 2 values which has been sorted), I want to subtract the second Order date of the second occurrence of a Customer from the first Order date. Order date is in datetime format
here is a sample of the table
context I'm trying to calculate the time it takes for a customer to make a second order\
Order date Customer
4260 2022-11-11 16:29:00 (App admin)
8096 2022-10-22 12:54:00 (App admin)
996 2021-09-22 20:30:00 10013
946 2021-09-14 15:16:00 10013
3499 2022-04-20 12:17:00 100151
... ... ...
2856 2022-03-21 13:49:00 99491
2788 2022-03-18 12:15:00 99523
2558 2022-03-08 12:07:00 99523
2580 2022-03-04 16:03:00 99762
2544 2022-03-02 15:40:00 99762
I have tried deleting by index but it returns just the first two values.
expected output should be another dataframe with just the Customer name and the difference between the Second and first Order dates of the duplicate customer in minutes
expected output:
| Customer | difference in minutes |
| -------- | -------- |
| 1232 | 445.0 |
|(App Admin)| 3432.0 |
| 1145 | 2455.0 |
|6653 | 32.0 |
You can use groupby:
df['Order date'] = pd.to_datetime(df['Order date'])
out = (df.groupby('Customer', as_index=False)['Order date']
.agg(lambda x: (x.iloc[0] - x.iloc[-1]).total_seconds() / 60)
.query('`Order date` != 0'))
print(out)
# Output:
Customer Order date
0 (App admin) 29015.0
1 10013 11834.0
4 99523 14408.0
5 99762 2903.0

Merge in Panda does not allow second key to join on

After looking for answers and trying everything could not figure out a way out, so here it goes.
I have a list of *.txt files that I want to merge by column. I am 100% sure that they have the same structure, as follows
File1
date | time | model_name1
1850-01-16 | 12:00:00 | 0.10
File2
date | time | model_name2
1850-01-16 | 12:00:00 | 0.50
File3..... and so on
Note: the vertical bars are just for clarity here.
Now my output should look like this:
Output
date | time | model_name1 | model_name2
1850-01-16 | 12:00:00 | 0.10 | 0.50
With the following piece of code
out_list4 = os.listdir(out_directory)
df_list = [pd.read_table(out_path+os.fsdecode(file_x), sep='\s+') for file_x in out_list4]
df_merged = reduce(lambda left,right: ,
pd.merge(left,right,on=['date'], how='outer'), df_list)
pd.DataFrame.to_csv(df_merged, out_path+'merged.txt', sep='\t', index=False)
I manage the following output:
Output
date | time_x | model_name1 |time_y | model_name2
1850-01-16 | 12:00:00 | 0.10 |12:00:00| 0.50
As expected since I only have the key ""on=['date']"".
Now if I try to write time as second key as follows: ""on=['date','time']"", it crashes with the following error:
Key error:'time'
and a long list of tracebacks.
I tried placing left_on/righ_on in case "date" was being handled as index. No use. I know the problem does not lie on the files, the structure is right, it is the code. Any help will be much appreciated. And sorry for readibility on the
So, the problem was before. I had defined ""out_list4"" as a list before:
out_list4 = list()
and it was making a mess at the end. Each data element on the list should have size 1872 x 3, but at the end it was adding them altogether again making one last entry be 1872 x 12 and no 'time' header.
Changing the definition of ""out_list4"" to:
out_list4 = []
did the trick. The tip came from Combine a list of pandas dataframes to one pandas dataframe.

Excel Pivot Table: How do I count the number of working days for employees based on date-time values?

In my theoretical data set, I have a list which shows the date-time of a sale, and the employee who completed the transaction.
I know how to do grouping in order to show how many sales each employee has per day, but I'm wondering if there's a way to count how many grouped days have more than 0 sales.
For example, here's the original data set:
Employee | Order Time
A | 8/12 8:00
B | 8/12 9:00
A | 8/12 10:00
A | 8/12 14:00
B | 8/13 10:00
B | 8/13 11:00
A | 8/13 15:00
A | 8/14 12:00
Here's the pivot table that I have created:
Employee | 8/12 | 8/13 | 8/14
A | 3 | 1 | 1
B | 1 | 2 | 0
And here's what I want to know:
Employee | Working Days
A | 3
B | 2
Split your Order Time column (assumed to be B) into two, say with Text to Columns and Space as the delimiter (might need a little adjustment). Then pivot (using the Data Model) as shown:
and sum the results (outside the PT) such as with:
=SUM(F3:H3)
copied down to suit.
Columns F:G may then be hidden.
I fully support #Andrea's Comment (a correction) on the above:
I think this could have been made simpler. If you remove the "Time" in values of the pivot table and then move "Order" from columns to values and use distinct count as in the example. It should count Employee per date making the sum not needed. If you scale this to make it larger. Say 50 dates then the =Sum() needs to be moved each time.

Excel to calculate capacity levels

I have a table in excel setup as followed:
DATE | TIME | PERSON IDENTIFIER | ARRIVAL OR LEAVING
01/01/15 | 13:00 | AB1234 | A
01/01/15 | 13:01 | AC1234 | A
01/01/15 | 13:03 | AD1234 | A
01/01/15 | 13:05 | AE1234 | A
01/01/15 | 13:09 | AF1234 | A
01/01/15 | 13:10 | AB1234 | L
01/01/15 | 13:15 | AG1234 | A
01/01/15 | 13:13 | AC1234 | L
The table shows when people arrive and leave a medical ward. The ward holds 36 patients and I'm wanting to get an idea of how close it is to capacity (it's normally always full). The ward is open 24/7 and has patients arriving 24/7 but I'd like to show the time it is at the certain capacities.
For example if we inputted 24 hours of data
36 patients (0 empty beds) - 22hr 15min
35 patients (1 empty bed) - 01hr 30min
34 patients (2 empty beds) - 00hr 15min
I'm thinking we just need a count for every time some arrives and a negative count when they leave but I can't figure out how to extract the time from that.
This is going to be pretty ugly (NB using your columns from above):
order the entries sequentially
you can keep a running tally in column E of patients on hand currently with E1 = 36(or whatever starting value you have) and =IF(D2="A",E1+1,E1-1).
Get the time elapsed since the previous entry with =(B3-B2) and put that in column F
Count the chunks where you had less than a full house with =SUMIF(F:F, "<36")

Spark: How to join RDDs by time range

I have a delicate Spark problem, where i just can't wrap my head around.
We have two RDDs ( coming from Cassandra ). RDD1 contains Actions and RDD2 contains Historic data. Both have an id on which they can be matched/joined. But the problem is the two tables have an N:N relation ship. Actions contains multiple rows with the same id and so does Historic. Here are some example date from both tables.
Actions time is actually a timestamp
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic set_at is actually a timestamp
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
How can we join these two tables in a way, that we get a result like this
1 | 100 # 500 - 400 for Actions#1 with time 12:05 because Historic was in that time at 400
1 | 50 # 500 - 450 for Actions#2 with time 12:30 because H. was in that time at 450
2 | 50 # 125 - 75 for Actions#3 with time 12:30 because H. was in that time at 75
I can't come up with a good solution that feels right, without making a lot of iterations over huge datasets. I always have to think about making a range from the Historic set and then somehow check if the Actions fits in the range e.g (11:00 - 12:15) to make the calculation. But that seems to pretty slow to me. Is there any more efficient way to do that? Seems to me, that this kind of problem could be popular, but i couldn't find any hints on this yet. How would you solve this problem in spark?
My current attempts so far ( in half way done code )
case class Historic(id: String, set_at: Long, valueY: Int)
val historicRDD = sc.cassandraTable[Historic](...)
historicRDD
.map( row => ( row.id, row ) )
.reduceByKey(...)
// transforming to another case which results in something like this; code not finished yet
// (List((Range(0, 12:25), 400), (Range(12:25, NOW), 450)))
// From here we could join with Actions
// And then some .filter maybe to select the right Lists tuple
It's an interesting problem. I also spent some time figuring out an approach. This is what I came up with:
Given case classes for Action(id, time, x) and Historic(id, time, y)
Join the actions with the history (this might be heavy)
filter all historic data not relevant for a given action
key the results by (id,time) - differentiate same key at different times
reduce the history by action to the max value, leaving us with relevant historical record for the given action
In Spark:
val actionById = actions.keyBy(_.id)
val historyById = historic.keyBy(_.id)
val actionByHistory = actionById.join(historyById)
val filteredActionByidTime = actionByHistory.collect{ case (k,(action,historic)) if (action.time>historic.t) => ((action.id, action.time),(action,historic))}
val topHistoricByAction = filteredActionByidTime.reduceByKey{ case ((a1:Action,h1:Historic),(a2:Action, h2:Historic)) => (a1, if (h1.t>h2.t) h1 else h2)}
// we are done, let's produce a report now
val report = topHistoricByAction.map{case ((id,time),(action,historic)) => (id,time,action.X -historic.y)}
Using the data provided above, the report looks like:
report.collect
Array[(Int, Long, Int)] = Array((1,43500,100), (1,45000,50), (2,45000,50))
(I transformed the time to seconds to have a simplistic timestamp)
After a few hours of thinking, trying and failing I came up with this solution. I am not sure if it is any good, but due the lack of other options, this is my solution.
First we expand our case class Historic
case class Historic(id: String, set_at: Long, valueY: Int) {
val set_at_map = new java.util.TreeMap[Long, Int]() // as it seems Scala doesn't provides something like this with similar operations we'll need a few lines later
set_at_map.put(0, valueY) // Means from the beginning of Epoch ...
set_at_map.put(set_at, valueY) // .. to the set_at date
// This is the fun part. With .getHistoricValue we can pass any timestamp and we will get the a value of the key back that contains the passed date. For more information look at this answer: http://stackoverflow.com/a/13400317/1209327
def getHistoricValue(date: Long) : Option[Int] = {
var e = set_at_map.floorEntry(date)
if (e != null && e.getValue == null) {
e = set_at_map.lowerEntry(date)
}
if ( e == null ) None else e.getValue()
}
}
The case class is ready and now we bring it into action
val historicRDD = sc.cassandraTable[Historic](...)
.map( row => ( row.id, row ) )
.reduceByKey( (row1, row2) => {
row1.set_at_map.put(row2.set_at, row2.valueY) // we add the historic Events up to each id
row1
})
// Now we load the Actions and map it by id as we did with Historic
val actionsRDD = sc.cassandraTable[Actions](...)
.map( row => ( row.id, row ) )
// Now both RDDs have the same key and we can join them
val fin = actionsRDD.join(historicRDD)
.map( row => {
( row._1.id,
(
row._2._1.id,
row._2._1.valueX - row._2._2.getHistoricValue(row._2._1.time).get // returns valueY for that timestamp
)
)
})
I am totally new to Scala, so please let me know if we could improve this code on some place.
I know that this question has been answered but I want to add another solution that worked for me -
your data -
Actions
id | time | valueX
1 | 12:05 | 500
1 | 12:30 | 500
2 | 12:30 | 125
Historic
id | set_at| valueY
1 | 11:00 | 400
1 | 12:15 | 450
2 | 12:20 | 50
2 | 12:25 | 75
Union Actions and Historic
Combined
id | time | valueX | record-type
1 | 12:05 | 500 | Action
1 | 12:30 | 500 | Action
2 | 12:30 | 125 | Action
1 | 11:00 | 400 | Historic
1 | 12:15 | 450 | Historic
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
Write a custom partitioner and use repartitionAndSortWithinPartitions to partition by id, but sort by time.
Partition-1
1 | 11:00 | 400 | Historic
1 | 12:05 | 500 | Action
1 | 12:15 | 450 | Historic
1 | 12:30 | 500 | Action
Partition-2
2 | 12:20 | 50 | Historic
2 | 12:25 | 75 | Historic
2 | 12:30 | 125 | Action
Traverse through the records per partition.
If it is a Historical record, add it to a map, or update the map if it already has that id - keep track of the latest valueY per id using a map per partition.
If it is a Action record, get the valueY value from the map and subtract it from valueX
A map M
Partition-1 traversal in order
M={ 1 -> 400} // A new entry in map M
1 | 100 // M(1) = 400; 500-400
M={1 -> 450} // update M, because key already exists
1 | 50 // M(1)
Partition-2 traversal in order
M={ 2 -> 50} // A new entry in M
M={ 2 -> 75} // update M, because key already exists
2 | 50 // M(2) = 75; 125-75
You could try to partition and sort by time, but you need to merge the partitions later. And that could add to some complexity.
This, I found it preferable to the many-to-many join that we usually get when using time ranges to join.

Resources