How to merge dataframes with the closest timestamps - apache-spark

I am trying to merge two dataframes with different timestamps using pyspark. I want to be able to merge the data based on the closest timestamp difference.
Here is the sample data. The timestamps are all different, so I can't join on time=time. The first two tables are the input dataframes; the third is the merged output I am trying to get.

df1:
ID | time             | x  | y
1  | 2023-01-02 14:31 | 10 | 20
1  | 2023-01-02 14:35 | 20 | 10

df2:
ID | time             | x1 | y1
2  | 2023-01-02 14:32 | 10 | 20
2  | 2023-01-01 14:36 | 20 | 10

Desired output:
ID | time             | x1 | y1 | ID | time             | x2 | y2
1  | 2023-01-01 14:31 | 10 | 20 | 2  | 2023-01-02 14:32 | 10 | 20
1  | 2023-01-01 14:35 | 20 | 10 | 2  | 2023-01-01 14:36 | 20 | 10
When I simply join the dataframes, it creates thousands of rows and the timestamps are all over the place, even though there are only 200 data points. I am not sure what is going on; please help.
I tried a plain join and it produces far too much data.

Unfortunately, time-based joins are outside of the typical Spark use-cases and usually require workarounds that tend to be inefficient. And if you only have 200 datapoints, there are probably more convenient ways to process the data than via Spark, e.g. https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html.
In addition, I'm not sure if your task is well-defined. If two timestamps from the first dataframe have the same closest timestamp in the second dataframe,
should both be joined to the same line from the second dataframe, or
should only the join with the smaller distance be performed, while the other row from the first dataframe needs a different partner?
The latter would need a sophisticated solution, and it's unlikely that Spark's built-in functions can help you. So for the first scenario, let's go through your options.
If the closest timestamp from the past is sufficient
In this case you need an "AS OF" join. As of 3.3.1 Spark doesn't have those natively, but there's a workaround in Scala and even a simpler one in Python via the Pandas API; see How do you create merge_asof functionality in PySpark?.
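For illustration, here is a minimal sketch of that Pandas-API route, assuming Spark >= 3.3 and that df1 and df2 are the two Spark dataframes (read as in the snippet further down):
import pyspark.pandas as ps

# convert the Spark dataframes to pandas-on-Spark dataframes
psdf1 = df1.pandas_api()
psdf2 = df2.pandas_api()

# merge_asof pairs each row of the left frame with the most recent earlier
# (or equal) row of the right frame; both sides must be sorted by the key.
merged = ps.merge_asof(
    psdf1.sort_values("time"),
    psdf2.sort_values("time"),
    on="time",
    direction="backward",  # closest timestamp from the past
)
merged.to_spark().show()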
If it really has to be the closest timestamp from the past and the future
The easy but inefficient way is to perform the whole join and select the rows to be kept based on temporal distance, i.e.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").getOrCreate()

df1 = spark.read.csv("./df1.csv", header=True).withColumn(
    "time", col("time").cast(T.TimestampType())
)
df2 = spark.read.csv("./df2.csv", header=True).withColumn(
    "time", col("time").cast(T.TimestampType())
)

df_joined = (
    # join every row of df1 with every row of df2 (deliberately no join condition)
    df1.join(df2.withColumnRenamed("time", "time1").withColumnRenamed("ID", "ID1"), how="left")
    # temporal distance between the two timestamps of each row pair
    .withColumn("temporalDiff", F.abs(col("time1") - col("time")))
    # for each row of df1, keep only the df2 values with the smallest distance
    .groupBy("ID", "time", "x", "y")
    .agg(
        F.expr("min_by(ID1, temporalDiff)").alias("ID1"),
        F.expr("min_by(time1, temporalDiff)").alias("time1"),
        F.expr("min_by(x1, temporalDiff)").alias("x1"),
        F.expr("min_by(y1, temporalDiff)").alias("y1"),
    )
)
min_by was introduced in Spark 3.0. For previous versions, see https://sparkbyexamples.com/spark/spark-find-maximum-row-per-group-in-dataframe/.
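If you are stuck on an older version, a window plus row_number can stand in for min_by. A rough sketch, reusing the column names from the code above (and casting the timestamps to seconds before taking the difference):
from pyspark.sql.window import Window

# rank the joined rows of each (ID, time) group by temporal distance
w = Window.partitionBy("ID", "time").orderBy("temporalDiff")

df_closest = (
    df1.join(df2.withColumnRenamed("time", "time1").withColumnRenamed("ID", "ID1"), how="left")
    .withColumn("temporalDiff", F.abs(col("time1").cast("long") - col("time").cast("long")))
    .withColumn("rn", F.row_number().over(w))
    .filter(col("rn") == 1)  # keep only the closest df2 row per df1 row
    .drop("rn")
)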
You could also use the AS OF join solution linked above and expand it to work in both directions, which would be more efficient, but also more complicated to implement and read.
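If you want to try that route, one possible outline with the pandas-on-Spark API (reusing psdf1/psdf2 from the sketch above; only an outline, not tested) is to run the as-of join in both directions and keep the closer match per row:
# keep a copy of the right-hand timestamp so the distances can be compared later
right = psdf2.assign(time_right=psdf2["time"]).sort_values("time")
left = psdf1.sort_values("time")

back = ps.merge_asof(left, right, on="time", direction="backward")
fwd = ps.merge_asof(left, right, on="time", direction="forward")
# ...then, row by row, keep whichever of the two matches has the smaller
# abs(time - time_right); that last step is plain column arithmetic.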

Related

Multilevel index of rows of a dataframe using pandas [duplicate]

I've spent hours browsing everywhere now trying to create a multi-index from a dataframe in pandas. This is the dataframe I have (posting an Excel sheet mockup; I do have this in a pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access it via new_df.index, where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
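A tiny self-contained example of the set_index route, with made-up values just to show the resulting index (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "account_num": [10, 11, 20],
    "dates": ["2023-01-01", "2023-01-02", "2023-01-01"],
    "sales": [100, 50, 75],
})

new_df = df.set_index(["user_id", "account_num", "dates"])
print(new_df.index.names)  # ['user_id', 'account_num', 'dates'] -> levels 0, 1, 2
print(type(new_df.index))  # <class 'pandas.core.indexes.multi.MultiIndex'>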
To clarify for future users, I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with an optional inplace=True does the job.
type(df) then gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
import pandas as pd

lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
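To actually use it, you would then assign the index back to the dataframe (a small usage sketch with the same names as above):
currentDataFrame.index = midx  # currentDataFrame is now indexed by the MultiIndex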
There are two ways to do it; they are not exactly what you have shown, but they work.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append=True, drop=False).reorder_levels(order=[1, 0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.

calculate average difference between dates using pyspark

I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between the dates using RDD functions (such as reduce and map), not SQL.
The dates for each ID need to be sorted in order before calculating the difference, as I need the difference between consecutive dates.
ID | Date
1  | 2020-09-03
1  | 2020-09-03
2  | 2020-09-02
1  | 2020-09-04
2  | 2020-09-06
2  | 2020-09-16
The needed outcome for this example will be:
ID | average difference
1  | 0.5
2  | 7
thanks for helping!
You can use datediff with a window function to calculate the difference, then take the average.
lag is one of the window functions; it takes a value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# define the window: rows of the same ID, ordered by Date
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second);
# lag('Date').over(w) is the previous date of the same ID.
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
 .groupby('ID')  # aggregate over ID
 .agg(F.avg(F.col('diff')).alias('average difference'))
)
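Since the question explicitly asks for RDD functions, here is a rough map/reduceByKey sketch of the same computation; it assumes the Date column holds ISO-formatted strings (adjust the parsing if it is already a date type):
from datetime import date

def avg_diff(dates):
    # sort the dates of one ID and average the gaps between consecutive dates
    ds = sorted(dates)
    diffs = [(b - a).days for a, b in zip(ds, ds[1:])]
    return sum(diffs) / len(diffs)

result = (
    df.rdd
    .map(lambda row: (row["ID"], [date.fromisoformat(row["Date"])]))
    .reduceByKey(lambda a, b: a + b)  # collect all dates per ID
    .mapValues(avg_diff)
)
print(result.collect())  # e.g. [(1, 0.5), (2, 7.0)]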

Trying to divide two columns of a dataframe but get Nan

Background:
I am dealing with a dataframe and want to divide two of its columns to get a new column. The code is shown below:
import pandas as pd
df = {'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]}
df = pd.DataFrame(df)
df['price/km'] = df[['drive_mile', 'price']].apply(lambda x: x[1]/x[0])
print(df)
And I get the below result:
drive_mile price price/km
0 15.10 40 NaN
1 2.10 9 NaN
2 7.12 31 NaN
Why would this happen? And how can I fix it?
As pointed out in the comments, you missed the axis=1 parameter, which apply needs to perform the division row-wise. Without it, apply works column by column, and the result has a different index, so the values become NaN when joined back into the DataFrame.
However, more importantly, do not use apply to perform a division! apply is often much less efficient than vectorized operations.
Use div:
df['price/km'] = df['drive_mile'].div(df['price'])
Or /:
df['price/km'] = df['price'] / df['drive_mile']
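For completeness, the apply-based fix mentioned above would look like this (axis=1 makes apply work row-wise); the vectorized division remains the better choice:
df['price/km'] = df[['drive_mile', 'price']].apply(lambda x: x['price'] / x['drive_mile'], axis=1)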

% difference over columns in PySpark for each row

I am trying to compute percentage difference over columns for each row in a dataframe. Here is my dataset:
For example, for the first row, I am trying to get the variation rate of 2016 compared to 2015, of 2017 compared to 2016, and so on. Only 2015 and 2019 should be removed, so that there will be 5 columns at the end.
I know that window and lag can help achieve this, but I have been unsuccessful so far.
No window functions should be needed. You just need to calculate the % change by arithmetic operations on the columns, if I understood the question correctly.
import pyspark.sql.functions as F

df2 = df.select(
    'city', 'postal_code',
    *[((F.col(str(year)) - F.col(str(year - 1))) / F.col(str(year - 1))).alias('percent_change_%s' % year)
      for year in [2016, 2017, 2018, 2019]]
)
Also I don't understand why you want 5 columns at the end. Isn't it 6? Why is 2019 removed? You can calculate % change by (2019-2018)/2018, for instance.
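As a quick sanity check, here is the same select run on a single made-up row (the city/postal_code and year columns are assumed from the question's screenshot):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = [("Paris", "75001", 100.0, 110.0, 121.0, 133.1, 146.41)]
cols = ["city", "postal_code", "2015", "2016", "2017", "2018", "2019"]
df = spark.createDataFrame(data, cols)

df2 = df.select(
    'city', 'postal_code',
    *[((F.col(str(year)) - F.col(str(year - 1))) / F.col(str(year - 1))).alias('percent_change_%s' % year)
      for year in [2016, 2017, 2018, 2019]]
)
df2.show()  # every percent_change_* value should be approximately 0.1 here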

How to get dot product of 2 columns containing dense vectors in spark

I have a Spark dataframe containing dense vectors in the columns Col_W_DensV1 and Col_w_DenseV2, and I want to calculate the cosine similarity between them, so I need the dot product. I am currently using a UDF and doing row operations; it is incredibly slow and uses just one core. Can someone suggest a better way to achieve this?
My spark Dataframe
Col1 | Col2 | Col_W_DensV1 | Col_w_DenseV2
a | b | [0.1 0.1 0.2..]| [0.3 0.5 0.8..]
I need x.dot(y) at the column level instead of the row level, and parallelized.
My current function (row-level) takes a gazillion years to run on the data!
from pyspark.sql.functions import udf
import numpy as np

@udf("double")
def cosim(x, y):
    return float(x.dot(y) / np.sqrt(x.dot(x)) / np.sqrt(y.dot(y)))

cs_table1 = cs_table.withColumn("similarity", cosim(cs_table.p_result, cs_table.result))
cs_table1.show()
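One UDF-free direction worth trying is to convert the ML vectors to arrays and compute the dot products with built-in higher-order functions. A sketch, assuming Spark >= 3.1 and that the vector columns can be converted with vector_to_array (not tested on the asker's data):
import pyspark.sql.functions as F
from pyspark.ml.functions import vector_to_array

def dot(a, b):
    # sum of element-wise products of two array columns
    return F.aggregate(F.zip_with(a, b, lambda x, y: x * y), F.lit(0.0), lambda acc, v: acc + v)

arr = (cs_table
       .withColumn("a1", vector_to_array("p_result"))
       .withColumn("a2", vector_to_array("result")))

cs_table1 = arr.withColumn(
    "similarity",
    dot(F.col("a1"), F.col("a2"))
    / F.sqrt(dot(F.col("a1"), F.col("a1")))
    / F.sqrt(dot(F.col("a2"), F.col("a2"))),
)
cs_table1.show()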
