Remove value from different datasets - apache-spark

I have 2 PySpark DataFrames:
df_1
name | number <Array>
-------------------------
12   | [1, 2, 3]
34   | [9, 8, 7]
46   | [10]
68   | [2, 88]

df_2
number_to_be_deleted <String>
------------------
1
2
10
I would like to remove the numbers listed in df_2 from the arrays in df_1 wherever they appear.
If an array ends up empty, I want to change its value to null.
I tried array_remove:
df = df_1.select(F.array_remove(df_1.number, df_2.number_to_be_deleted)).collect()
but I got:
TypeError: 'Column' object is not callable in array_remove
Expected result:
df_1
name | number <Array>
-------------------------
12   | [3]
34   | [9, 8, 7]
46   | null
68   | [88]
Any suggestions, please?
Thank you

You can collect the values of df2 into a single array, cross join it with df1, and then use array_except to remove the values. Finally, with when you can check whether the resulting array is empty and replace it with null.
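For reference, here is a minimal setup sketch for the example data (it assumes a SparkSession named spark; the string values in df2 are cast to long so they match the integer arrays in df1):
# minimal setup sketch, column names taken from the question
df1 = spark.createDataFrame(
    [(12, [1, 2, 3]), (34, [9, 8, 7]), (46, [10]), (68, [2, 88])],
    ["name", "number"])
df2 = spark.createDataFrame([("1",), ("2",), ("10",)], ["number_to_be_deleted"])
# cast to long so the values compare cleanly against the integer arrays in df1
df2 = df2.withColumn("number_to_be_deleted", df2.number_to_be_deleted.cast("long"))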
from pyspark.sql.functions import array_except, col, collect_list, lit, size, when

# collapse df2 into a single row holding all values to delete
df2 = df2.groupBy().agg(collect_list("number_to_be_deleted").alias("to_delete"))

df1.crossJoin(df2).withColumn("number", array_except("number", "to_delete"))\
   .withColumn("number", when(size(col("number")) == 0, lit(None)).otherwise(col("number")))\
   .select("name", "number")\
   .show()
#+----+---------+
#|name| number|
#+----+---------+
#| 12| [3]|
#| 34|[9, 8, 7]|
#| 46| null|
#| 68| [88]|
#+----+---------+

Related

Approximate previous year for each row

I have a dataframe that has the following sample rows:
Product Date Revenue
A 2021-05-10 20
A 2021-03-20 10
A 2020-01-10 5
A 2020-03-10 6
A 2020-04-10 7
For each product and date, I'd like to get the date closest to one year before the original date. For example, the first row's date is 2021-05-10; the date closest to a year before it is 2020-04-10. The resulting output I'd like is the following:
Product Date Revenue PrevDate PrevRevenue
A 2021-05-10 20 2020-04-10 7
A 2021-03-20 10 2020-03-10 6
A 2020-01-10 5 null null
A 2020-03-10 6 null null
A 2020-04-10 7 null null
Say df is your dataframe
data = [['A', '2021-05-10', 20],
        ['A', '2021-03-20', 10],
        ['A', '2020-01-10', 5],
        ['A', '2020-03-10', 6],
        ['A', '2020-04-10', 7]]
df = spark.createDataFrame(data, "Product:string, Date:string, Revenue:long")
df.show()
# +-------+----------+-------+
# |Product| Date|Revenue|
# +-------+----------+-------+
# | A|2021-05-10| 20|
# | A|2021-03-20| 10|
# | A|2020-01-10| 5|
# | A|2020-03-10| 6|
# | A|2020-04-10| 7|
# +-------+----------+-------+
Then you can compute the date one year earlier with the add_months function, join the dataframe with itself to pair each date with the candidate previous dates, rank each prevDate with the row_number function over a window ordered by the number of days between last_year and prevDate, and finally filter to keep the nearest date.
from pyspark.sql.functions import col, add_months, row_number, datediff
from pyspark.sql.window import Window
df = (df
      .withColumn('last_year', add_months(col('Date'), -12))
      .join(df.selectExpr('Product pr', 'Date prevDate', 'Revenue prevRevenue'),
            [col('Product') == col('pr'), col('last_year') > col('prevDate')],
            'left')
      .withColumn('closest', row_number().over(Window
                                               .partitionBy('product', 'date')
                                               .orderBy(datediff(col('last_year'), col('prevDate')))))
      .filter('closest = 1')
      .drop(*['pr', 'closest'])
)
df.show()
# +-------+----------+-------+----------+----------+-----------+
# |Product| Date|Revenue| last_year| prevDate|prevRevenue|
# +-------+----------+-------+----------+----------+-----------+
# | A|2020-01-10| 5|2019-01-10| null| null|
# | A|2020-03-10| 6|2019-03-10| null| null|
# | A|2020-04-10| 7|2019-04-10| null| null|
# | A|2021-03-20| 10|2020-03-20|2020-03-10| 6|
# | A|2021-05-10| 20|2020-05-10|2020-04-10| 7|
# +-------+----------+-------+----------+----------+-----------+
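If you want the exact columns of the expected output, a small follow-up sketch that drops the helper column and renames the joined ones:
# optional: match the column names from the expected output
result = (df.drop('last_year')
            .withColumnRenamed('prevDate', 'PrevDate')
            .withColumnRenamed('prevRevenue', 'PrevRevenue'))
result.show()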

get distinct count from an array of each rows using pyspark

I am looking for the distinct count of the array in each row of a PySpark DataFrame:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used the code below, but it gives me the length of each array instead:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()
output:
3
3
4
Please help me figure out how I can achieve this using a PySpark DataFrame.
Thanks in advance!
For Spark 2.4+ you can use array_distinct and then take the size of the result to get the count of distinct values in each array. A UDF will be very slow and inefficient for big data; always try to use Spark's built-in functions.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
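For reference, a minimal setup sketch for the example below (it assumes a SparkSession named spark and the usual functions import):
from pyspark.sql import functions as F

# sample data from the question
df = spark.createDataFrame([([1, 1, 1],), ([3, 4, 5],), ([1, 2, 1, 2],)], ["col1"])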
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+

Add different arrays from numpy to each row of dataframe

I have a Spark SQL DataFrame and a 2D NumPy matrix with the same number of rows. I intend to add each row of the NumPy matrix to the existing PySpark DataFrame as a new column, so that the list added to each row is different.
For example, the PySpark dataframe is like this
| Id | Name |
| ------ | ------ |
| 1 | Bob |
| 2 | Alice |
| 3 | Mike |
And the numpy matrix is like this
[[2, 3, 5]
[5, 2, 6]
[1, 4, 7]]
The resulting expected dataframe should be like this
| Id | Name | customized_list |
| ------ | ------ | --------------- |
| 1 | Bob | [2, 3, 5] |
| 2 | Alice | [5, 2, 6] |
| 3 | Mike | [1, 4, 7] |
The Id column corresponds to the order of the entries in the NumPy matrix.
I wonder whether there is an efficient way to implement this?
Create a DataFrame from your numpy matrix and add an Id column to indicate the row number. Then you can join to your original PySpark DataFrame on the Id column.
import numpy as np
a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#| 1| [2, 3, 5]|
#| 2| [5, 2, 6]|
#| 3| [1, 4, 7]|
#+---+---------------+
Here I used enumerate(..., start=1) to add the row number.
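For completeness, a minimal sketch of the original DataFrame assumed by the join below:
# sample data from the question
df = spark.createDataFrame([(1, "Bob"), (2, "Alice"), (3, "Mike")], ["Id", "Name"])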
Now just do an inner join:
df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#| 1| Bob| [2, 3, 5]|
#| 3| Mike| [1, 4, 7]|
#| 2|Alice| [5, 2, 6]|
#+---+-----+---------------+
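If you need the result in the original row order (the join does not guarantee it, as the output above shows), just sort by Id:
df.join(list_df, on="Id", how="inner").orderBy("Id").show()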

GroupByKey and create lists of values pyspark sql dataframe

So I have a spark dataframe that looks like:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:
a | b_list
5 | (2,4)
2 | (4,3)
How would I go about doing this with a pyspark sql dataframe?
Thank you! :)
Here are the steps to get that DataFrame.
>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 5| 2| 1|
| 5| 4| 3|
| 2| 4| 2|
| 2| 3| 7|
+---+---+---+
>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 5| [2, 4]|
| 2| [4, 3]|
+---+---------------+
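If you want the column named b_list as in the expected output, you can alias the aggregation:
>>> df1 = df.groupBy('a').agg(F.collect_list('b').alias('b_list'))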
