Approximate previous year for each row - apache-spark

I have a dataframe that has the following sample rows:
Product Date Revenue
A 2021-05-10 20
A 2021-03-20 10
A 2020-01-10 5
A 2020-03-10 6
A 2020-04-10 7
For each product and date, I'd like to get the date closest to one year before the original date. For example, the first row's date is 2021-05-10; one year earlier is 2020-05-10, and the closest date in the data before that is 2020-04-10. The resulting output I'd like is the following:
Product Date Revenue PrevDate PrevRevenue
A 2021-05-10 20 2020-04-10 7
A 2021-03-20 10 2020-03-10 6
A 2020-01-10 5 null null
A 2020-03-10 6 null null
A 2020-04-10 7 null null

Say df is your dataframe
data = [['A', '2021-05-10', 20],
        ['A', '2021-03-20', 10],
        ['A', '2020-01-10', 5],
        ['A', '2020-03-10', 6],
        ['A', '2020-04-10', 7]]
df = spark.createDataFrame(data, "Product:string, Date:string, Revenue:long")
df.show()
# +-------+----------+-------+
# |Product| Date|Revenue|
# +-------+----------+-------+
# | A|2021-05-10| 20|
# | A|2021-03-20| 10|
# | A|2020-01-10| 5|
# | A|2020-03-10| 6|
# | A|2020-04-10| 7|
# +-------+----------+-------+
Then you can get the date a year ago using the add_months function, join the dataframe with itself to pair each date with candidate previous dates, rank each prevDate with the row_number function over a window ordered by the number of days between last_year and prevDate, and finally filter to keep the nearest date.
from pyspark.sql.functions import col, add_months, row_number, datediff
from pyspark.sql.window import Window
df = (df
      .withColumn('last_year', add_months(col('Date'), -12))
      .join(df.selectExpr('Product pr', 'Date prevDate', 'Revenue prevRevenue'),
            [col('Product') == col('pr'), col('last_year') > col('prevDate')],
            'left')
      .withColumn('closest', row_number().over(Window
                  .partitionBy('product', 'date')
                  .orderBy(datediff(col('last_year'), col('prevDate')))))
      .filter('closest = 1')
      .drop(*['pr', 'closest'])
      )
df.show()
# +-------+----------+-------+----------+----------+-----------+
# |Product| Date|Revenue| last_year| prevDate|prevRevenue|
# +-------+----------+-------+----------+----------+-----------+
# | A|2020-01-10| 5|2019-01-10| null| null|
# | A|2020-03-10| 6|2019-03-10| null| null|
# | A|2020-04-10| 7|2019-04-10| null| null|
# | A|2021-03-20| 10|2020-03-20|2020-03-10| 6|
# | A|2021-05-10| 20|2020-05-10|2020-04-10| 7|
# +-------+----------+-------+----------+----------+-----------+
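If you also want the exact output shape from the question (PrevDate and PrevRevenue columns, no helper column), a small follow-up does it; a minimal sketch, assuming df is the result of the chain above:
result = (df
          .withColumnRenamed('prevDate', 'PrevDate')
          .withColumnRenamed('prevRevenue', 'PrevRevenue')
          .drop('last_year'))
result.show()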

Related

How to subtract the row-wise mean from each column value and get the row-wise max after subtracting the mean value (pyspark)

I want to calculate the row-wise mean, subtract the mean from each value of the row, and get the maximum at the end.
Here is my dataframe:
col1 | col2 | col3
0 | 2 | 3
4 | 2 | 3
1 | 0 | 3
0 | 0 | 0
df = df.withColumn("mean_value", (sum(col(x) for x in df.columns[0:3]) / 3).alias("mean"))
I can calculate the row-wise mean with this line of code, but I want to subtract the mean value from each value of the row and get the maximum value of the row after the subtraction.
Required results:
col1 | col2 | col3 | mean_Value | Max_difference_Value
0 | 2 | 3 | 1.66 | 1.34
4 | 2 | 3 | 3.0 | 1.0
1 | 0 | 3 | 1.33 | 1.67
1 | 0 | 1 | 0.66 | 0.66
Note, this is the main formula: abs(mean - column value).max()
Using greatest and a list comprehension:
from pyspark.sql import functions as func

data_ls = [(0, 2, 3), (4, 2, 3), (1, 0, 3), (0, 0, 0)]

spark.sparkContext.parallelize(data_ls).toDF(['col1', 'col2', 'col3']). \
    withColumn('mean_value', (sum(func.col(x) for x in ['col1', 'col2', 'col3']) / 3)). \
    withColumn('max_diff_val',
               func.greatest(*[func.abs(func.col(x) - func.col('mean_value')) for x in ['col1', 'col2', 'col3']])
               ). \
    show()
# +----+----+----+------------------+------------------+
# |col1|col2|col3| mean_value| max_diff_val|
# +----+----+----+------------------+------------------+
# | 0| 2| 3|1.6666666666666667|1.6666666666666667|
# | 4| 2| 3| 3.0| 1.0|
# | 1| 0| 3|1.3333333333333333|1.6666666666666667|
# | 0| 0| 0| 0.0| 0.0|
# +----+----+----+------------------+------------------+
Have you tried UDFs?
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
import numpy as np

@udf(FloatType())
def udf_mean(col1, col2, col3):
    # row-wise mean of the three values
    return float(np.mean([col1, col2, col3]))

df = df.withColumn("mean_value", udf_mean('col1', 'col2', 'col3'))
Similarly, you can write one for the max difference value.
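For completeness, a sketch of what that could look like, following the same UDF pattern (the name udf_max_diff and the output column name are made up for illustration):
@udf(FloatType())
def udf_max_diff(col1, col2, col3):
    # max absolute difference between each value and the row mean
    vals = [col1, col2, col3]
    m = np.mean(vals)
    return float(max(abs(v - m) for v in vals))

df = df.withColumn("max_diff_value", udf_max_diff('col1', 'col2', 'col3'))
That said, the greatest-based version above stays inside Spark's built-in functions and avoids the UDF serialization overhead, so prefer it where you can.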

Counting total rows, rows with null value, rows with zero values, and their ratios on PySpark

I have a table structure like this:
unique_id | group | value_1 | value_2 | value_3
abc_xxx | 1 | 200 | null | 100
def_xxx | 1 | 0 | 3 | 40
ghi_xxx | 2 | 300 | 1 | 2
that I need to extract the following information from:
Total number of rows per group.
Count of rows per group that contain null values.
Count of rows per group that contain zero values.
I can do the first one with a simple groupBy and count
df.groupBy('group').count()
I'm not so sure how to approach the next two, which I need in order to compute the null and zero rates from the total rows per group.
data= [
('abc_xxx', 1, 200, None, 100),
('def_xxx', 1, 0, 3, 40 ),
('ghi_xxx', 2, 300, 1, 2 ),
]
df = spark.createDataFrame(data, ['unique_id','group','value_1','value_2','value_3'])
# new edit
from pyspark.sql.functions import col, when, isnull, lit, count, sum  # note: this `sum` is Spark's, shadowing the builtin

df = df\
    .withColumn('contains_null', when(isnull(col('value_1')) | isnull(col('value_2')) | isnull(col('value_3')), lit(1)).otherwise(lit(0)))\
    .withColumn('contains_zero', when((col('value_1')==0) | (col('value_2')==0) | (col('value_3')==0), lit(1)).otherwise(lit(0)))
df.groupBy('group')\
    .agg(count('unique_id').alias('total_rows'),
         sum('contains_null').alias('null_value_rows'),
         sum('contains_zero').alias('zero_value_rows')).show()
+-----+----------+---------------+---------------+
|group|total_rows|null_value_rows|zero_value_rows|
+-----+----------+---------------+---------------+
| 1| 2| 1| 1|
| 2| 1| 0| 0|
+-----+----------+---------------+---------------+
# total_count = (count('value_1') + count('value_2') + count('value_3'))
# null_count = (sum(when(isnull(col('value_1')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_2')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_3')), lit(1)).otherwise(lit(0))))
# zero_count = (sum(when(col('value_1')==0, lit(1)).otherwise(lit(0)) + when(col('value_2')==0, lit(1)).otherwise(lit(0)) + when(col('value_3')==0, lit(1)).otherwise(lit(0))))
# df.groupBy('group')\
# .agg(total_count.alias('total_numbers'), null_count.alias('null_values'), zero_count.alias('zero_values')).show()
#+-----+-------------+-----------+-----------+
#|group|total_numbers|null_values|zero_values|
#+-----+-------------+-----------+-----------+
#| 1| 5| 1| 1|
#| 2| 3| 0| 0|
#+-----+-------------+-----------+-----------+
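To get the actual null and zero rates the question asks about, divide the per-group counts by total_rows; a minimal sketch building on the grouped result above (the rate column names are my own):
agg = df.groupBy('group')\
    .agg(count('unique_id').alias('total_rows'),
         sum('contains_null').alias('null_value_rows'),
         sum('contains_zero').alias('zero_value_rows'))

agg.withColumn('null_rate', col('null_value_rows') / col('total_rows'))\
    .withColumn('zero_rate', col('zero_value_rows') / col('total_rows'))\
    .show()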

Create multiple columns by pivoting even when pivoted value doesn't exist

I have a PySpark df:
Store_ID | Category | ID | Sales
1 | A | 123 | 23
2 | A | 123 | 45
1 | A | 234 | 67
1 | B | 567 | 78
2 | B | 567 | 34
3 | D | 789 | 12
1 | A | 890 | 12
Expected:
Store_ID | A_ID | B_ID | C_ID | D_ID | Sales_A | Sales_B | Sales_C | Sales_D
1 | 3 | 1 | 0 | 0 | 102 | 78 | 0 | 0
2 | 1 | 1 | 0 | 0 | 45 | 34 | 0 | 0
3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 12
I am able to transform this way using SQL (created a temp view):
SELECT Store_Id,
SUM(IF(Category='A',Sales,0)) AS Sales_A,
SUM(IF(Category='B',Sales,0)) AS Sales_B,
SUM(IF(Category='C',Sales,0)) AS Sales_C,
SUM(IF(Category='D',Sales,0)) AS Sales_D,
COUNT(DISTINCT NULLIF(IF(Category='A',ID,0),0)) AS A_ID,
COUNT(DISTINCT NULLIF(IF(Category='B',ID,0),0)) AS B_ID,
COUNT(DISTINCT NULLIF(IF(Category='C',ID,0),0)) AS C_ID,
COUNT(DISTINCT NULLIF(IF(Category='D',ID,0),0)) AS D_ID
FROM df
GROUP BY Store_Id;
How do we achieve the same in PySpark using native functions, since it's much faster?
This operation is called pivoting. You need:
- a couple of aggregations, since you want both the count of ID and the sum of Sales;
- alias on the aggregations, for changing the column names;
- provided values in pivot, for cases where you want columns for Category C even though C doesn't exist in the data; providing values boosts performance too.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'A', 123, 23),
     (2, 'A', 123, 45),
     (1, 'A', 234, 67),
     (1, 'B', 567, 78),
     (2, 'B', 567, 34),
     (3, 'D', 789, 12),
     (1, 'A', 890, 12)],
    ['Store_ID', 'Category', 'ID', 'Sales'])
Script:
df = (df
      .groupBy('Store_ID')
      .pivot('Category', ['A', 'B', 'C', 'D'])
      .agg(
          F.countDistinct('ID').alias('ID'),
          F.sum('Sales').alias('Sales'))
      .fillna(0))
df.show()
# +--------+----+-------+----+-------+----+-------+----+-------+
# |Store_ID|A_ID|A_Sales|B_ID|B_Sales|C_ID|C_Sales|D_ID|D_Sales|
# +--------+----+-------+----+-------+----+-------+----+-------+
# | 1| 3| 102| 1| 78| 0| 0| 0| 0|
# | 3| 0| 0| 0| 0| 0| 0| 1| 12|
# | 2| 1| 45| 1| 34| 0| 0| 0| 0|
# +--------+----+-------+----+-------+----+-------+----+-------+
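Note the pivoted columns come out as A_ID, A_Sales, and so on. If you need the exact names from the expected output (Sales_A rather than A_Sales), a quick rename pass works; a minimal sketch:
# rename <Category>_Sales to Sales_<Category>; the *_ID columns already match
for cat in ['A', 'B', 'C', 'D']:
    df = df.withColumnRenamed(f'{cat}_Sales', f'Sales_{cat}')
df.show()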

Creating structs with column values

I'm trying to convert my dataframe into JSON so that it can be pushed into ElasticSearch. Here's what my dataframe looks like:
Provider Market Avg. Deviation
XM NY 10 5
TL AT 8 6
LM CA 7 8
I want to have it like this:
Column
XM: {
NY: {
Avg: 10,
Deviation: 5
}
}
How can I create something like this?
Check the code below; you can modify it as per your requirement.
scala> :paste
// Entering paste mode (ctrl-D to finish)
df
  .select(
    to_json(
      struct(
        map(
          $"provider",
          map(
            $"market",
            struct($"avg", $"deviation")
          )
        ).as("json_data")
      )
    ).as("data")
  )
  .select(get_json_object($"data", "$.json_data").as("data"))
  .show(false)
Output
+--------------------------------------+
|data |
+--------------------------------------+
|{"XM":{"NY":{"avg":10,"deviation":5}}}|
|{"TL":{"AT":{"avg":8,"deviation":6}}} |
|{"LM":{"CA":{"avg":7,"deviation":8}}} |
+--------------------------------------+
In case anyone wants it done the PySpark way (Spark 2.0+):
from pyspark.sql import Row
from pyspark.sql.functions import get_json_object, to_json, struct, create_map

row = Row('Provider', 'Market', 'Avg', 'Deviation')
row_df = spark.createDataFrame(
    [row('XM', 'NY', '10', '5'),
     row('TL', 'AT', '8', '6'),
     row('LM', 'CA', '7', '8')])
row_df.show()

row_df.select(
    to_json(struct(
        create_map(
            row_df.Provider,
            create_map(row_df.Market,
                       struct(row_df.Avg, row_df.Deviation))
        )
    )).alias("json")
).select(get_json_object('json', '$.col1').alias('json')).show(truncate=False)
Output:
+--------+------+---+---------+
|Provider|Market|Avg|Deviation|
+--------+------+---+---------+
| XM| NY| 10| 5|
| TL| AT| 8| 6|
| LM| CA| 7| 8|
+--------+------+---+---------+
+------------------------------------------+
|json |
+------------------------------------------+
|{"XM":{"NY":{"Avg":"10","Deviation":"5"}}}|
|{"TL":{"AT":{"Avg":"8","Deviation":"6"}}} |
|{"LM":{"CA":{"Avg":"7","Deviation":"8"}}} |
+------------------------------------------+
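One difference from the Scala output: Avg and Deviation come out as JSON strings here because the sample rows were built with string values. If you want numbers like the Scala result, you could cast them before building the map and then run the same to_json/create_map chain; a minimal sketch, assuming the same row_df as above:
from pyspark.sql.functions import col

# cast the string columns to integers so to_json emits numbers rather than strings
typed_df = row_df.withColumn('Avg', col('Avg').cast('int')) \
                 .withColumn('Deviation', col('Deviation').cast('int'))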

Remove value from different datasets

I have 2 pyspark datasets:
df_1
name | number <Array>
-------------------------
12 | [1, 2, 3]
-------------------------
34 | [9, 8, 7]
-------------------------
46 | [10]
-------------------------
68 | [2, 88]
-------------------------
df_2
number_to_be_deleted <String>
------------------
1
------------------
2
------------------
10
------------------
I would like to delete the numbers in df_2 from the arrays in df_1 wherever they appear.
If an array ends up empty, I want to change its value to null.
I used array_remove
df = df_1.select(F.array_remove(df_1.number, df_2.number_to_be_deleted)).collect()
I got :
TypeError: 'Column' object is not callable in array_remove
Expected result:
df_1
name | number <Array>
-------------------------
12 | [3]
-------------------------
34 | [9, 8, 7]
-------------------------
46 | null
-------------------------
68 | [88]
-------------------------
Any suggestions, please?
Thank you
You can join df1 with df2 using a cross join, then use array_except to remove the values. Finally, using when, you can check whether the resulting array is empty and replace it with null.
from pyspark.sql.functions import collect_list, array_except, when, size, col, lit

df2 = df2.groupBy().agg(collect_list("number_to_be_deleted").alias("to_delete"))

df1.crossJoin(df2).withColumn("number", array_except("number", "to_delete"))\
    .withColumn("number", when(size(col("number")) == 0, lit(None)).otherwise(col("number")))\
    .select("name", "number")\
    .show()
#+----+---------+
#|name| number|
#+----+---------+
#| 12| [3]|
#| 34|[9, 8, 7]|
#| 46| null|
#| 68| [88]|
#+----+---------+
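Since df2 collapses to a single row after the groupBy, the cross join is cheap. If you'd rather avoid the join entirely and the list of values to delete is small, you could collect it to the driver and build a literal array instead. A sketch under that assumption, using the original two-column df_2 (and assuming the collected values have the same type as the array elements; cast them if not):
from pyspark.sql import functions as F

# collect the values to remove; assumed small enough to sit on the driver
to_delete = [r[0] for r in df2.select("number_to_be_deleted").distinct().collect()]

df1.withColumn("number", F.array_except("number", F.array(*[F.lit(v) for v in to_delete])))\
    .withColumn("number", F.when(F.size("number") == 0, F.lit(None)).otherwise(F.col("number")))\
    .select("name", "number")\
    .show()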
