Get the closest currency rate - apache-spark

I would like to join two dataframes based on the currency exchange rate and the date from the second dataframe. I have tried the approach cited here, but datediff only gives the difference between the dates, so it doesn't give me the right rate.
df1:
from_curr | to_curr | Date       | value_to_convert
AED       | EUR     | 2017-03-24 | 2000
AED       | EUR     | 2017-03-27 | 189
DZD       | EUR     | 2017-01-12 | 130
EUR       | EUR     | 2020-01-01 | 11
df2 (currency_table):
transacti | local | DateTra    | rate_exchange
AED       | EUR   | 2017-03-24 | -5,123
AED       | EUR   | 2017-03-26 | -9.5
DZD       | EUR   | 2017-01-01 | -6,12
The output should look like this:
from_curr | to_curr | Date       | value_to_convert | value_converted
AED       | EUR     | 2017-03-24 | 2000             | 390.39
AED       | EUR     | 2017-03-27 | 189              | 19.89
DZD       | EUR     | 2017-01-12 | 130              | 21.24
EUR       | EUR     | 2020-01-01 | 11               | 11
The only method that works for me is subtracting the two dates ("Date" and "DateTra") and taking the rate whose "DateTra" is closest.
Could you please propose another, cleaner method than subtracting date strings?
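For illustration only, a minimal sketch of that "closest date" idea (assuming df1 and df2 are the dataframes shown above, with one row per from_curr/to_curr/Date combination) could use abs(datediff) with a window:
from pyspark.sql import functions as F, Window

# Rank every candidate rate by its distance in days from the transaction date
# and keep only the closest one per df1 row.
w = Window.partitionBy('from_curr', 'to_curr', 'Date').orderBy(F.abs(F.datediff('Date', 'DateTra')))
closest = (df1
    .join(df2, (df1.from_curr == df2.transacti) & (df1.to_curr == df2.local), 'left')
    .withColumn('rn', F.row_number().over(w))
    .filter('rn = 1')
    .withColumn('value_converted',
                F.coalesce(F.col('value_to_convert') / F.abs('rate_exchange'),
                           F.col('value_to_convert'))))
closest.show()
The answer below avoids the extra window by collecting the rates into an array first.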

You could aggregate your smaller dataframe (df2) to collect all the dates and rates into one cell per currency pair. Then join the dataframes, take out what you need and do the division.
Inputs:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [('AED', 'EUR', '2017-03-24', 2000),
     ('AED', 'EUR', '2017-03-27', 189),
     ('DZD', 'EUR', '2017-01-12', 130),
     ('EUR', 'EUR', '2020-01-01', 11)],
    ['from_curr', 'to_curr', 'Date', 'value_to_convert'])
df2 = spark.createDataFrame(
    [('AED', 'EUR', '2017-03-24', -5.123),
     ('AED', 'EUR', '2017-03-26', -9.5),
     ('DZD', 'EUR', '2017-01-01', -6.12)],
    ['transacti', 'local', 'DateTra', 'rate_exchange'])
Script which gets the rate from the closest day (the match may also come from a future date). The structs are sorted by date difference and, on ties, by the later date, so the nearest rate ends up first:
df2 = df2.groupBy('transacti', 'local').agg(
    F.collect_list(F.struct('DateTra', 'rate_exchange')).alias('_vals')
)

# For every df1 row, sort the collected (date, rate) structs by their distance
# to 'Date' and take the rate of the first (i.e. closest) element.
rate = F.array_sort(F.transform(
    '_vals',
    lambda x: F.struct(
        F.abs(F.datediff('Date', x.DateTra)).alias('diff'),
        (-F.unix_timestamp(x.DateTra, 'yyyy-MM-dd')).alias('DateTra'),
        F.abs(x.rate_exchange).alias('rate_exchange')
    )
))[0]['rate_exchange']
df = (df1
    .join(df2, (df1.from_curr == df2.transacti) & (df1.to_curr == df2.local), 'left')
    .select(
        df1['*'],
        F.coalesce(
            F.col('value_to_convert') / rate,
            F.when(df1.from_curr == df1.to_curr, df1.value_to_convert)
        ).alias('value_converted')
    )
)
df.show()
# +---------+-------+----------+----------------+------------------+
# |from_curr|to_curr| Date|value_to_convert| value_converted|
# +---------+-------+----------+----------------+------------------+
# | AED| EUR|2017-03-24| 2000| 390.3962521959789|
# | AED| EUR|2017-03-27| 189|19.894736842105264|
# | EUR| EUR|2020-01-01| 11| 11.0|
# | DZD| EUR|2017-01-12| 130|21.241830065359476|
# +---------+-------+----------+----------------+------------------+
Script which gets the most recent rate on or before the transaction date (never a rate from the future):
# Starting again from the original (ungrouped) df2.
df2 = df2.groupBy('transacti', 'local').agg(
    F.sort_array(F.collect_list(F.struct('DateTra', 'rate_exchange')), False).alias('_vals')
)

# The structs are sorted by date descending, so the first one on or before 'Date'
# is the most recent applicable rate.
rate = F.abs(F.filter('_vals', lambda x: x.DateTra <= F.col('Date'))[0]['rate_exchange'])

df = (df1
    .join(df2, (df1.from_curr == df2.transacti) & (df1.to_curr == df2.local), 'left')
    .select(
        df1['*'],
        F.coalesce(
            F.col('value_to_convert') / rate,
            F.when(df1.from_curr == df1.to_curr, df1.value_to_convert)
        ).alias('value_converted')
    )
)
df.show()
# +---------+-------+----------+----------------+------------------+
# |from_curr|to_curr| Date|value_to_convert| value_converted|
# +---------+-------+----------+----------------+------------------+
# | AED| EUR|2017-03-24| 2000| 390.3962521959789|
# | AED| EUR|2017-03-27| 189|19.894736842105264|
# | EUR| EUR|2020-01-01| 11| 11.0|
# | DZD| EUR|2017-01-12| 130|21.241830065359476|
# +---------+-------+----------+----------------+------------------+

Related

How to dynamically select columns in pandas based on data in another column

Let's say I have the following dataframes
df1
date_time | value1 | column_value
12-Mar-22 | 17345  | 17200CE
13-Mar-22 | 17400  | 17200PE
...
df2
date_time | value1 | 17200CE | 17200PE | 17300CE | ...
12-Mar-22 | 17345  | 23.3    | 21.2    | 24.5
13-Mar-22 | 17345  | 24.3    | 22.2    | 22.5
Now I want to add a column to df1 which, for each row, fetches the value from the df2 column whose name matches the value in df1's column_value column, on the same date_time.
So finally it would look like
date_time | value1 | column_value | mapped_value_result
12-Mar-22 | 17345  | 17200CE      | 23.3
13-Mar-22 | 17345  | 17200PE      | 22.2
One way to achieve this result is to merge df1 and df2 on date_time and then use df.values:
md = df2.merge(df1, on='date_time')
df1['mapped_value_result'] = md.values[md.index, md.columns.get_indexer(md.column_values)]
Output (for your sample data):
date_time value1 column_values mapped_value_result
0 12-Mar-22 17345 17200CE 23.3
1 13-Mar-22 17400 17200PE 22.2
Another alternative (also using a merge) is to use apply to pick, for each row, the value of the column named in column_values from the merged dataframe:
md = df2.merge(df1, on='date_time')
df1['mapped_value_result'] = md.apply(lambda x: x[x['column_values']], axis=1)
The output is the same.
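As a further sketch (not part of the answer above), the same lookup can be done by reshaping df2 to long form with melt and merging on both keys; the frames below are hypothetical reconstructions of the sample data, with the lookup column named column_values as in the answer:
import pandas as pd

# Hypothetical reconstruction of the sample data.
df1 = pd.DataFrame({'date_time': ['12-Mar-22', '13-Mar-22'],
                    'value1': [17345, 17400],
                    'column_values': ['17200CE', '17200PE']})
df2 = pd.DataFrame({'date_time': ['12-Mar-22', '13-Mar-22'],
                    'value1': [17345, 17345],
                    '17200CE': [23.3, 24.3],
                    '17200PE': [21.2, 22.2],
                    '17300CE': [24.5, 22.5]})

# Melt df2 into (date_time, column name, value) rows, then merge on both keys.
long_df2 = df2.melt(id_vars=['date_time', 'value1'],
                    var_name='column_values', value_name='mapped_value_result')
out = df1.merge(long_df2[['date_time', 'column_values', 'mapped_value_result']],
                on=['date_time', 'column_values'], how='left')
print(out)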

How to add the new columns based on counts identified by conditions applied to multiple columns in Pyspark?

Suppose the following is my dataframe:
df
userId | deviceID | Clean_date
ABC123 | 202030   | 28-Jul-22
XYZ123 | 304050   | 27-Jul-22
ABC123 | 405032   | 28-Jul-22
PQR123 | 385625   | 22-Jun-22
WER123 | 465728   | 2-May-22
XYZ123 | 935452   | 22-Mar-22
I want output where a user_id with multiple devices on the same day is marked 'P1', a user_id with multiple devices on different days is marked 'P2', and everything else 'NA'.
The following is a sample output:
df_output
userId | deviceID | Clean_date | Priority
ABC123 | 202030   | 28-Jul-22  | P1
XYZ123 | 304050   | 27-Jul-22  | P2
ABC123 | 405032   | 28-Jul-22  | P1
PQR123 | 385625   | 22-Jun-22  | NA
WER123 | 465728   | 2-May-22   | NA
XYZ123 | 935452   | 22-Mar-22  | P2
Please suggest a solution in pyspark.
You can count the distinct deviceID values per userId and per userId + Clean_date using a Window, then use a when expression to derive Priority from those counts, like this:
from pyspark.sql import functions as F, Window

df = spark.createDataFrame([
    ("ABC123", 202030, "28-Jul-22"), ("XYZ123", 304050, "27-Jul-22"),
    ("ABC123", 405032, "28-Jul-22"), ("PQR123", 385625, "22-Jun-22"),
    ("WER123", 465728, "02-May-22"), ("XYZ123", 935452, "22-Mar-22")
], ["userId", "deviceID", "Clean_date"])

w = Window.partitionBy("userId")
w2 = Window.partitionBy("userId", "Clean_date")

df = df.withColumn(
    "Priority",
    F.when(F.size(F.collect_set("deviceID").over(w2)) > 1, "P1")
    .when(F.size(F.collect_set("deviceID").over(w)) > 1, "P2")
    .otherwise("NA")
)
df.show()
df.show()
# +------+--------+----------+--------+
# |userId|deviceID|Clean_date|Priority|
# +------+--------+----------+--------+
# |ABC123| 202030| 28-Jul-22| P1|
# |ABC123| 405032| 28-Jul-22| P1|
# |PQR123| 385625| 22-Jun-22| NA|
# |WER123| 465728| 02-May-22| NA|
# |XYZ123| 935452| 22-Mar-22| P2|
# |XYZ123| 304050| 27-Jul-22| P2|
# +------+--------+----------+--------+
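An equivalent formulation, shown only as a sketch, precomputes the distinct counts with groupBy/countDistinct and joins them back onto the original dataframe (df here is the dataframe before the Priority column is added); it should yield the same labels:
# Distinct devices per user and per user/day, computed once and joined back.
per_user = df.groupBy("userId").agg(F.countDistinct("deviceID").alias("n_user"))
per_day = df.groupBy("userId", "Clean_date").agg(F.countDistinct("deviceID").alias("n_day"))

result = (df.join(per_user, "userId")
            .join(per_day, ["userId", "Clean_date"])
            .withColumn("Priority",
                        F.when(F.col("n_day") > 1, "P1")
                         .when(F.col("n_user") > 1, "P2")
                         .otherwise("NA"))
            .drop("n_user", "n_day"))
result.show()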

Pyspark to map the exchange rate value in dataframe using an exchange rate file

I have two dataframes df1 and a separate dataframe for USD exchange_rate df2:
#df1#
from_curr | to_curr | Date       | value_to_convert
AED       | EUR     | 2017-01-12 | 2000
AED       | EUR     | 2018-03-20 | 189
UAD       | EUR     | 2021-05-18 | 12.5
DZD       | EUR     | 2017-01-12 | 130
SEK       | EUR     | 2017-01-12 | 1000
GNF       | EUR     | 2017-08-03 | 1300
df2: #currency_table#
from_curr | To_curr | Date       | rate_exchange
AED       | EUR     | 2017-01-01 | -5,123
UAD       | EUR     | 2021-05-01 | -9.5
AED       | EUR     | 2018-03-01 | -5,3
DZD       | EUR     | 2017-01-01 | -6,12
GNF       | EUR     | 2017-08-01 | -7,03
SEK       | EUR     | 2017-01-01 | -12
Do you have any idea how to create a PySpark function that converts value_to_convert from df1 using the rate_exchange from currency_table (looking up the exchange rate row that corresponds to the currency's date group), joining both dataframes on the from_curr and date fields? Each value should be converted with the rate_exchange from the right date, to get a df3 like:
from_curr | to_curr | dt         | value_to_convert | converted_value
AED       | EUR     | 2017-01-12 | 2000             | 390
AED       | EUR     | 2018-03-20 | 189              | 35,66
UAD       | EUR     | 2021-05-18 | 12.5             | 1,31
DZD       | EUR     | 2017-01-12 | 130              | 21,24
SEK       | EUR     | 2017-01-12 | 1000             | 83,33
GNF       | EUR     | 2017-08-03 | 1300             | 184,92
I have tried splitting the 'Date' field into year and month (same for dt) and joining on from_curr, year and month.
Based on the data provided, there aren't any cases where there can be 2 exchange rates in a single year-month for a currency. So, it'll be an easy join if we create a column with just year and month in both the dataframes. Notice the yyyymm fields that I've created in the approach below.
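(The snippets below reference a func alias and two lists, data_ls and curr_ls, that aren't shown in the answer; a plausible setup, reconstructed from the tables above, could be:)
from pyspark.sql import functions as func

# Reconstructed from the question's tables; the exact lists are not shown in the answer.
data_ls = [
    ('AED', 'EUR', '2017-01-12', 2000.0),
    ('AED', 'EUR', '2018-03-20', 189.0),
    ('UAD', 'EUR', '2021-05-18', 12.5),
    ('DZD', 'EUR', '2017-01-12', 130.0),
    ('SEK', 'EUR', '2017-01-12', 1000.0),
    ('GNF', 'EUR', '2017-08-03', 1300.0),
]
curr_ls = [
    ('AED', 'EUR', '2017-01-01', -5.123),
    ('UAD', 'EUR', '2021-05-01', -9.5),
    ('AED', 'EUR', '2018-03-01', -5.3),
    ('DZD', 'EUR', '2017-01-01', -6.12),
    ('GNF', 'EUR', '2017-08-01', -7.03),
    ('SEK', 'EUR', '2017-01-01', -12.0),
]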
data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['from_curr', 'to_curr', 'dt', 'val_to_convert']). \
    withColumn('dt', func.col('dt').cast('date')). \
    withColumn('yyyymm', (func.year('dt') * 100 + func.month('dt')).cast('int'))
# +---------+-------+----------+--------------+------+
# |from_curr|to_curr| dt|val_to_convert|yyyymm|
# +---------+-------+----------+--------------+------+
# | AED| EUR|2017-01-12| 2000.0|201701|
# | AED| EUR|2018-03-20| 189.0|201803|
# | UAD| EUR|2021-05-18| 12.5|202105|
# | DZD| EUR|2017-01-12| 130.0|201701|
# | SEK| EUR|2017-01-12| 1000.0|201701|
# | GNF| EUR|2017-08-03| 1300.0|201708|
# +---------+-------+----------+--------------+------+
curr_sdf = spark.sparkContext.parallelize(curr_ls). \
    toDF(['from_curr', 'to_curr', 'dt', 'rate_exchange']). \
    withColumn('dt', func.col('dt').cast('date')). \
    withColumn('yyyymm', (func.year('dt') * 100 + func.month('dt')).cast('int')). \
    withColumnRenamed('dt', 'from_curr_start_dt')
# +---------+-------+------------------+-------------+------+
# |from_curr|to_curr|from_curr_start_dt|rate_exchange|yyyymm|
# +---------+-------+------------------+-------------+------+
# | AED| EUR| 2017-01-01| -5.123|201701|
# | UAD| EUR| 2021-05-01| -9.5|202105|
# | AED| EUR| 2018-03-01| -5.3|201803|
# | DZD| EUR| 2017-01-01| -6.12|201701|
# | GNF| EUR| 2017-08-01| -7.03|201708|
# | SEK| EUR| 2017-01-01| -12.0|201701|
# +---------+-------+------------------+-------------+------+
The dataframes can be joined on currencies and the yyyymm (year-month) fields to map the exchange rates.
data_sdf. \
    join(curr_sdf, ['from_curr', 'to_curr', 'yyyymm'], 'left'). \
    withColumn('converted_value', func.col('val_to_convert') / func.abs(func.col('rate_exchange'))). \
    show()
# +---------+-------+------+----------+--------------+------------------+-------------+------------------+
# |from_curr|to_curr|yyyymm| dt|val_to_convert|from_curr_start_dt|rate_exchange| converted_value|
# +---------+-------+------+----------+--------------+------------------+-------------+------------------+
# | AED| EUR|201701|2017-01-12| 2000.0| 2017-01-01| -5.123| 390.3962521959789|
# | SEK| EUR|201701|2017-01-12| 1000.0| 2017-01-01| -12.0| 83.33333333333333|
# | GNF| EUR|201708|2017-08-03| 1300.0| 2017-08-01| -7.03| 184.9217638691323|
# | AED| EUR|201803|2018-03-20| 189.0| 2018-03-01| -5.3|35.660377358490564|
# | UAD| EUR|202105|2021-05-18| 12.5| 2021-05-01| -9.5|1.3157894736842106|
# | DZD| EUR|201701|2017-01-12| 130.0| 2017-01-01| -6.12|21.241830065359476|
# +---------+-------+------+----------+--------------+------------------+-------------+------------------+
Post this, select() the columns that are required in the final dataframe.
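For example, a select along the lines below (kept as a sketch; column names follow the answer's dataframes and may need renaming to match the question exactly) would produce the requested shape:
# Keep only the columns of the expected df3.
df3 = (data_sdf
       .join(curr_sdf, ['from_curr', 'to_curr', 'yyyymm'], 'left')
       .withColumn('converted_value',
                   func.col('val_to_convert') / func.abs(func.col('rate_exchange')))
       .select('from_curr', 'to_curr', 'dt', 'val_to_convert', 'converted_value'))
df3.show()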

How to find Coefficient Correlation of two rows in pyspark

I have the below pyspark dataframe:
stat       | col_A | col_B | col_C  | col_D
count      | 14    | 14    | 14     | 14
Actual     | 4     | 4001  | 160987 | 49
Regression | 3     | 3657  | 131225 | 38
I want to find the coefficient of variation of the rows Actual and Regression, and add the result as a new row CV.
stat       | col_A | col_B | col_C  | col_D
count      | 14    | 14    | 14     | 14
Actual     | 4     | 4001  | 160987 | 49
Regression | 3     | 3657  | 131225 | 38
CV         |       |       |        |
The Spark documentation offers the corr(col1, col2, method=None) method, but that works on columns, whereas I want to compute across rows.
In pandas I have done it like this:
(df1.loc[['Actual','Regression']].std(axis = 0, ddof=0,skipna = True))/(df1.loc[['Actual','Regression']].mean(axis = 0))*100
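(The answer below assumes a Spark dataframe df shaped like the table above; a minimal, purely illustrative construction could be:)
from pyspark.sql import functions as F

# Hypothetical input dataframe, rebuilt from the question's table.
df = spark.createDataFrame(
    [('count', 14.0, 14.0, 14.0, 14.0),
     ('Actual', 4.0, 4001.0, 160987.0, 49.0),
     ('Regression', 3.0, 3657.0, 131225.0, 38.0)],
    ['stat', 'col_A', 'col_B', 'col_C', 'col_D'])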
You can filter the Actual and Regression rows, compute stddev_pop / mean * 100 per column in a single aggregating select, and union the resulting CV row back onto the dataframe:
result = df.union(
    df.filter("stat in ('Actual', 'Regression')")
      .select(
          F.lit('CV').alias('stat'),
          *[(F.stddev_pop(c) / F.mean(c) * 100).alias(c) for c in df.columns[1:]]
      )
)
result.show()
result.show()
+----------+------------------+-----------------+------------------+------------------+
| stat| col_A| col_B| col_C| col_D|
+----------+------------------+-----------------+------------------+------------------+
| count| 14.0| 14.0| 14.0| 14.0|
| Actual| 4.0| 4001.0| 160987.0| 49.0|
|Regression| 3.0| 3657.0| 131225.0| 38.0|
| CV|14.285714285714285|4.492034473752938|10.185071112753755|12.643678160919542|
+----------+------------------+-----------------+------------------+------------------+
which agrees with the result of your pandas computation:
(df1.loc[['Actual','Regression']].std(axis = 0, ddof=0,skipna = True))/(df1.loc[['Actual','Regression']].mean(axis = 0))*100
col_A 14.285714
col_B 4.492034
col_C 10.185071
col_D 12.643678

Python Spark: How to join 2 datasets containing >2 elements for each tuple

I'm trying to join data from these two datasets, based on the common "stock" key
stock, sector
GOOG   Tech

stock, date, volume
GOOG   2015   5759725
The join method should join these together; however, the resulting RDD I got is of the form:
GOOG, (Tech, 2015)
I'm trying to obtain:
(Tech, 2015) 5759726
Additionally, how do I go about reducing the results by the keys (e.g. (Tech, 2015)) in order to obtain a numerical summation for each sector and year?
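Since the question is phrased in terms of RDD tuples, here is a minimal RDD sketch (the pair RDDs below are assumed reconstructions of the two datasets, keyed by stock); the answer that follows takes a DataFrame route instead:
# Hypothetical pair RDDs mirroring the two datasets in the question.
sectors = sc.parallelize([('GOOG', 'Tech'), ('AAPL', 'Tech'), ('XOM', 'Oil')])
volumes = sc.parallelize([('GOOG', ('2015', 5759725)),
                          ('AAPL', ('2015', 123)),
                          ('XOM', ('2015', 234)),
                          ('XOM', ('2016', 789))])

joined = sectors.join(volumes)                      # (stock, (sector, (year, volume)))
by_key = joined.map(lambda kv: ((kv[1][0], kv[1][1][0]), kv[1][1][1]))  # ((sector, year), volume)
totals = by_key.reduceByKey(lambda a, b: a + b)     # sum volumes per (sector, year)
print(totals.collect())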
from pyspark.sql.functions import struct, col, sum

# sample data
df1 = sc.parallelize([['GOOG', 'Tech'],
                      ['AAPL', 'Tech'],
                      ['XOM', 'Oil']]).toDF(["stock", "sector"])
df2 = sc.parallelize([['GOOG', '2015', '5759725'],
                      ['AAPL', '2015', '123'],
                      ['XOM', '2015', '234'],
                      ['XOM', '2016', '789']]).toDF(["stock", "date", "volume"])

# final output
df = df1.join(df2, ['stock'], 'inner').\
    withColumn('sector_year', struct(col('sector'), col('date'))).\
    drop('stock', 'sector', 'date')
df.show()

# numerical summation for each sector and year
df.groupBy('sector_year').agg(sum('volume')).show()
Output is:
+-------+-----------+
| volume|sector_year|
+-------+-----------+
| 123|[Tech,2015]|
| 234| [Oil,2015]|
| 789| [Oil,2016]|
|5759725|[Tech,2015]|
+-------+-----------+
+-----------+-----------+
|sector_year|sum(volume)|
+-----------+-----------+
|[Tech,2015]| 5759848.0|
| [Oil,2015]| 234.0|
| [Oil,2016]| 789.0|
+-----------+-----------+
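If tuples of the form ((sector, year), total) are preferred over a DataFrame, the aggregated result can be pulled back out as an RDD of pairs; a small sketch using the columns above (example values taken from the summation table):
# Collect the aggregation as ((sector, year), total_volume) pairs.
pairs = (df.groupBy('sector_year')
           .agg(sum('volume').alias('total_volume'))
           .rdd
           .map(lambda r: ((r['sector_year']['sector'], r['sector_year']['date']),
                           r['total_volume'])))
print(pairs.collect())
# e.g. [(('Tech', '2015'), 5759848.0), (('Oil', '2015'), 234.0), (('Oil', '2016'), 789.0)]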
