Subtract Two Arrays to Get a New Array in PySpark

I am new to Spark. I can sum, subtract, or multiply arrays in Python with pandas and NumPy, but I am having difficulty doing something similar in Spark (Python). I am on Databricks.
For example, this kind of approach throws a long error message that I won't paste here:
differencer=udf(lambda x,y: x-y, ArrayType(FloatType()))
df.withColumn('difference', differencer('Array1', 'Array2'))
Schema looks like this:
root
|-- col1: integer (nullable = true)
|-- time: timestamp (nullable = true)
|-- num: integer (nullable = true)
|-- part: integer (nullable = true)
|-- result: integer (nullable = true)
|-- Array1: array (nullable = true)
| |-- element: float (containsNull = true)
|-- Array2: array (nullable = false)
| |-- element: float (containsNull = true)
I just want to create a new column by subtracting those two array columns. Eventually I will compute the RMSE between them, but I think I can handle that once I learn how to get this difference.
The arrays look like this (I am just typing in some integers):
Array1_row1[5, 4, 2, 4, 3]
Array2_row1[4, 3, 1, 2, 1]
So the resulting array for row1 should be:
DiffCol_row1[1, 1, 1, 2, 2]
Thanks for any suggestions or directions.

You can use arrays_zip and transform:
from pyspark.sql.functions import expr
df = spark.createDataFrame(
    [([5, 4, 2, 4, 3], [4, 3, 1, 2, 1])], ("array1", "array2")
)

df.withColumn(
    "array3",
    expr("transform(arrays_zip(array1, array2), x -> x.array1 - x.array2)")
).show()
# +---------------+---------------+---------------+
# | array1| array2| array3|
# +---------------+---------------+---------------+
# |[5, 4, 2, 4, 3]|[4, 3, 1, 2, 1]|[1, 1, 1, 2, 2]|
# +---------------+---------------+---------------+
A valid udf would require equivalent element-wise logic (a plain Python list does not support -), i.e.
from pyspark.sql.functions import udf

@udf("array<double>")
def differencer(xs, ys):
    if xs and ys:
        return [float(x - y) for x, y in zip(xs, ys)]

df.withColumn("array3", differencer("array1", "array2")).show()
# +---------------+---------------+--------------------+
# | array1| array2| array3|
# +---------------+---------------+--------------------+
# |[5, 4, 2, 4, 3]|[4, 3, 1, 2, 1]|[1.0, 1.0, 1.0, 2...|
# +---------------+---------------+--------------------+

You can use zip_with (since Spark 2.4):
from pyspark.sql.functions import expr
df = spark.createDataFrame(
    [([5, 4, 2, 4, 3], [4, 3, 1, 2, 1])], ("array1", "array2")
)

df.withColumn(
    "array3",
    expr("zip_with(array1, array2, (x, y) -> x - y)")
).show()
# +---------------+---------------+---------------+
# | array1| array2| array3|
# +---------------+---------------+---------------+
# |[5, 4, 2, 4, 3]|[4, 3, 1, 2, 1]|[1, 1, 1, 2, 2]|
# +---------------+---------------+---------------+
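Since the question mentions that the end goal is the RMSE between the two arrays, here is a sketch of that step (not part of the original answers; it assumes Spark 2.4+ and non-null, equal-length arrays):
from pyspark.sql.functions import expr

# Sum the squared element-wise differences, divide by the array length, take the square root.
df.withColumn(
    "rmse",
    expr("""
        sqrt(
            aggregate(
                zip_with(array1, array2, (x, y) -> pow(x - y, 2)),
                cast(0 as double),
                (acc, v) -> acc + v
            ) / size(array1)
        )
    """)
).show()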

Related

PySpark: sort an array by its inner arrays' values

I have the following df:
+--------------------+
| id| id_info|
+--------------------+
|id_1| [[1, 8, 2, "bar"], [5, 9, 2, "foo"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]] |
I would like this sorted by the second element in descending order, so:
+--------------------+
| id| id_info|
+--------------------+
|id_1| [[5, 9, 2, "foo"], [1, 8, 2, "bar"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]] |
I came up with something like this:
def def_sort(x):
    return sorted(x, key=lambda x: x[1], reverse=True)

udf_sort = F.udf(def_sort, T.ArrayType(T.ArrayType(T.IntegerType())))
df.select("id", udf_sort("id_info"))
I'm not sure how to handle null values this way. Also, is there maybe a built-in function for this? Can I somehow do it with F.array_sort?
The elements of the array contain integers and a string, so I assume that the column id_info is an array of structs.
So the schema of the input data would be similar to:
root
|-- id: string (nullable = true)
|-- id_info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col1: integer (nullable = true)
| | |-- col2: integer (nullable = true)
| | |-- col3: integer (nullable = true)
| | |-- col4: string (nullable = true)
The names of the elements of the struct might be different.
With this schema information we can use array_sort to order the array:
df.selectExpr("array_sort(id_info, (l,r) -> \
case when l['col2'] > r['col2'] then -1 else 1 end) as sorted") \
.show(truncate=False)
prints
+----------------------------------------------------------------------------------+
|sorted |
+----------------------------------------------------------------------------------+
|[{5, 9, 2, foo}, {1, 8, 2, bar}, {4, 3, 2, something}, {9, null, 2, this_is_null}]|
+----------------------------------------------------------------------------------+
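If you want to control explicitly where entries whose col2 is null end up (the question raised this), the comparator can test for null first. A sketch assuming the same struct field names (it needs a Spark version that supports the comparator form of array_sort, like the example above):
df.selectExpr("""
    array_sort(id_info, (l, r) ->
        case
            when l['col2'] is null and r['col2'] is null then 0
            when l['col2'] is null then 1
            when r['col2'] is null then -1
            when l['col2'] > r['col2'] then -1
            else 1
        end) as sorted
""").show(truncate=False)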
You can try explode followed by orderBy on id and on the second element in descending order, then groupBy + collect_list:
out = (sdf.select("*", F.explode("id_info").alias("element"))
          .withColumn("second_ele", F.element_at("element", 2))
          .orderBy("id", F.desc("second_ele"))
          .groupBy("id")
          .agg(F.collect_list("element").alias("id_info"))
      )
out.show(truncate=False)
+----+-----------------------------------------------------------------------+
|id |id_info |
+----+-----------------------------------------------------------------------+
|id_1|[[5, 9, 2, null], [1, 8, 2, null], [4, 3, 2, null], [9, null, 2, null]]|
+----+-----------------------------------------------------------------------+

Filtering array values using pyspark

I am new to PySpark and need a solution for the question below.
In an array [[-1, 1, 2, 4, 5], [3, 5, 6, -6]], remove the elements which are <= 0 and get the square of the positive, non-zero numbers.
Use the transform & filter higher-order functions.
df.printSchema()
root
|-- ids: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: integer (containsNull = false)
from pyspark.sql import functions as F
df.withColumn("new_ids",F.expr("transform(ids,o -> filter(o, i -> i > 0))")).show()
+---------------------------------+-------------------------+
|ids |new_ids |
+---------------------------------+-------------------------+
|[[-1, 1, 2, 4, 5], [3, 5, 6, -6]]|[[1, 2, 4, 5], [3, 5, 6]]|
+---------------------------------+-------------------------+
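The snippet above only does the filtering; the question also asks for the square of the remaining positive numbers. A sketch that nests a second transform for the squaring (same higher-order functions, not from the original answer):
df.withColumn(
    "new_ids",
    F.expr("transform(ids, o -> transform(filter(o, i -> i > 0), i -> i * i))")
).show(truncate=False)
For the sample row this yields [[1, 4, 16, 25], [9, 25, 36]].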

PySpark - getting number of elements of array with same value

I'm learning Spark and I came across a problem that I'm unable to overcome.
What I would like to achieve is to get the number of positions at which two arrays hold the same value. I'm able to get what I want via a Python UDF, but I'm wondering if there is a way using only Spark functions.
df_bits = sqlContext.createDataFrame([[[0, 1, 1, 0, 0],
                                        [1, 1, 1, 0, 1],
                                      ]], ['bits1', 'bits2'])
df_bits_with_result = df_bits.select('bits1', 'bits2', some_magic('bits1', 'bits2')).show()
+--------------------+--------------------+---------------------------------+
|bits1 |bits2 |some_magic(bits1, bits2)|
+--------------------+--------------------+---------------------------------+
|[0, 1, 1, 0, 0]     |[1, 1, 1, 0, 1]     |3                                |
+--------------------+--------------------+---------------------------------+
Why 3? bits1[1] == bits2[1] AND bits1[2] == bits2[2] AND bits1[3] == bits2[3]
I tried to play with rdd.reduce but with no luck.
Perhaps this is helpful-
Spark >= 2.4
Use aggregate and zip_with (the first example is Scala; a PySpark version follows).
val df = spark.sql("select array(0, 1, 1, 0, 0, null) as bits1, array(1, 1, 1, 0, 1, null) as bits2")
df.show(false)
df.printSchema()
/**
* +----------------+----------------+
* |bits1 |bits2 |
* +----------------+----------------+
* |[0, 1, 1, 0, 0,]|[1, 1, 1, 0, 1,]|
* +----------------+----------------+
*
* root
* |-- bits1: array (nullable = false)
* | |-- element: integer (containsNull = true)
* |-- bits2: array (nullable = false)
* | |-- element: integer (containsNull = true)
*/
df.withColumn("x", expr("aggregate(zip_with(bits1, bits2, (x, y) -> if(x=y, 1, 0)), 0, (acc, x) -> acc + x)"))
.show(false)
/**
* +----------------+----------------+---+
* |bits1 |bits2 |x |
* +----------------+----------------+---+
* |[0, 1, 1, 0, 0,]|[1, 1, 1, 0, 1,]|3 |
* +----------------+----------------+---+
*/
PySpark: use arrays_zip, as mentioned in the comments
from pyspark.sql import functions as F
df_bits.withColumn("sum", \
F.expr("""aggregate(arrays_zip(bits1,bits2),0,(acc,x)-> IF(x.bits1==x.bits2,1,0)+acc)""")).show()
#+---------------+---------------+---+
#| bits1| bits2|sum|
#+---------------+---------------+---+
#|[0, 1, 1, 0, 0]|[1, 1, 1, 0, 1]| 3|
#+---------------+---------------+---+
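For completeness, here is a sketch of a PySpark translation of the Scala zip_with variant above (Spark 2.4+):
from pyspark.sql import functions as F

df_bits.withColumn(
    "sum",
    F.expr("aggregate(zip_with(bits1, bits2, (x, y) -> if(x = y, 1, 0)), 0, (acc, v) -> acc + v)")
).show()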

Pyspark UDF to return result similar to groupby().sum() between two columns

I have the following sample dataframe
fruit_list = ['apple', 'apple', 'orange', 'apple']
qty_list = [16, 2, 3, 1]
spark_df = spark.createDataFrame([(101, 'Mark', fruit_list, qty_list)], ['ID', 'name', 'fruit', 'qty'])
and I would like to create another column which contains a result similar to what I would achieve with a pandas groupby('fruit').sum()
        qty
fruits
apple    19
orange    3
The above result could be stored in the new column in any form (either a string, dictionary, list of tuples...).
I've tried an approach similar to the following one, which does not work:
sum_cols = udf(lambda x: pd.DataFrame({'fruits': x[0], 'qty': x[1]}).groupby('fruits').sum())
spark_df.withColumn('Result', sum_cols(F.struct('fruit', 'qty'))).show()
One example of result dataframe could be
+---+----+--------------------+-------------+-------------------------+
| ID|name| fruit| qty| Result|
+---+----+--------------------+-------------+-------------------------+
|101|Mark|[apple, apple, or...|[16, 2, 3, 1]|[(apple,19), (orange,3)] |
+---+----+--------------------+-------------+-------------------------+
Do you have any suggestion on how I could achieve that?
Thanks
Edit: running on Spark 2.4.3
As @pault mentioned, as of Spark 2.4+ you can use Spark SQL built-in functions to handle this task. Here is one way with array_distinct + transform + aggregate:
from pyspark.sql.functions import expr
# set up data
spark_df = spark.createDataFrame([
      (101, 'Mark', ['apple', 'apple', 'orange', 'apple'], [16, 2, 3, 1])
    , (102, 'Twin', ['apple', 'banana', 'avocado', 'banana', 'avocado'], [5, 2, 11, 3, 1])
    , (103, 'Smith', ['avocado'], [10])
], ['ID', 'name', 'fruit', 'qty']
)
>>> spark_df.show(5,0)
+---+-----+-----------------------------------------+----------------+
|ID |name |fruit |qty |
+---+-----+-----------------------------------------+----------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|
|103|Smith|[avocado] |[10] |
+---+-----+-----------------------------------------+----------------+
>>> spark_df.printSchema()
root
|-- ID: long (nullable = true)
|-- name: string (nullable = true)
|-- fruit: array (nullable = true)
| |-- element: string (containsNull = true)
|-- qty: array (nullable = true)
| |-- element: long (containsNull = true)
Set up the SQL statement:
stmt = '''
    transform(array_distinct(fruit), x -> (x, aggregate(
        transform(sequence(0, size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
        , 0
        , (y, z) -> int(y + z)
    ))) AS sum_fruit
'''
>>> spark_df.withColumn('sum_fruit', expr(stmt)).show(10,0)
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|ID |name |fruit |qty |sum_fruit |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |[[apple, 19], [orange, 3]] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|[[apple, 5], [banana, 5], [avocado, 12]]|
|103|Smith|[avocado] |[10] |[[avocado, 10]] |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
Explanation:
Use array_distinct(fruit) to find all distinct entries in the array fruit
transform this new array (with element x) from x to (x, aggregate(..x..))
The above aggregate(..x..) takes the simple form of summing up all elements in array_T:
aggregate(array_T, 0, (y, z) -> y + z)
where array_T comes from the following transformation:
transform(sequence(0, size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
which iterates through the array fruit: if fruit[i] = x, it returns the corresponding qty[i], otherwise it returns 0. For example, for ID=101 and x = 'orange', it returns the array [0, 0, 3, 0].
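As a quick sanity check of that last step (a sketch, not from the original answer), summing that intermediate array with the same aggregate call gives the per-fruit total:
spark.sql("SELECT aggregate(array(0, 0, 3, 0), 0, (y, z) -> int(y + z)) AS orange_total").show()
# +------------+
# |orange_total|
# +------------+
# |           3|
# +------------+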
There may be a fancy way to do this using only the API functions on Spark 2.4+, perhaps with some combination of arrays_zip and aggregate, but I can't think of any that don't involve an explode step followed by a groupBy. With that in mind, using a udf may actually be better for you in this case.
I think creating a pandas DataFrame just for the purpose of calling .groupby().sum() is overkill. Furthermore, even if you did do it that way, you'd need to convert the final output to a different data structure because a udf can't return a pandas DataFrame.
Here's one way with a udf using collections.defaultdict:
from collections import defaultdict
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

def sum_cols_func(frt, qty):
    d = defaultdict(int)
    for x, y in zip(frt, map(int, qty)):
        d[x] += y
    return list(d.items())

sum_cols = udf(
    lambda x: sum_cols_func(*x),
    ArrayType(
        StructType([StructField("fruit", StringType()), StructField("qty", IntegerType())])
    )
)
Then call this by passing in the fruit and qty columns:
from pyspark.sql.functions import array, col

spark_df.withColumn(
    "Result",
    sum_cols(array([col("fruit"), col("qty")]))
).show(truncate=False)
#+---+----+-----------------------------+-------------+--------------------------+
#|ID |name|fruit |qty |Result |
#+---+----+-----------------------------+-------------+--------------------------+
#|101|Mark|[apple, apple, orange, apple]|[16, 2, 3, 1]|[[orange, 3], [apple, 19]]|
#+---+----+-----------------------------+-------------+--------------------------+
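A hypothetical variant (not in the original answer): passing the two columns as separate udf arguments avoids the implicit cast that array(col("fruit"), col("qty")) performs (the quantities end up as strings inside the udf, which is why the original uses map(int, ...)):
from collections import defaultdict
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

@udf(ArrayType(StructType([StructField("fruit", StringType()), StructField("qty", IntegerType())])))
def sum_cols_v2(frt, qty):
    # qty arrives here as a list of integers, so no string round-trip is needed
    d = defaultdict(int)
    for x, y in zip(frt, qty):
        d[x] += y
    return list(d.items())

spark_df.withColumn("Result", sum_cols_v2("fruit", "qty")).show(truncate=False)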
If you have Spark < 2.4, use the following to explode (otherwise check the answers above):
from pyspark.sql import functions as F

df_split = (spark_df.rdd
            .flatMap(lambda row: [(row.ID, row.name, f, q) for f, q in zip(row.fruit, row.qty)])
            .toDF(["ID", "name", "fruit", "qty"]))
df_split.show()
Output:
+---+----+------+---+
| ID|name| fruit|qty|
+---+----+------+---+
|101|Mark| apple| 16|
|101|Mark| apple| 2|
|101|Mark|orange| 3|
|101|Mark| apple| 1|
+---+----+------+---+
Then prepare the result you want. First find the aggregated dataframe:
df_aggregated = df_split.groupby('ID', 'fruit').agg(F.sum('qty').alias('qty'))
df_aggregated.show()
Output:
+---+------+---+
| ID| fruit|qty|
+---+------+---+
|101|orange| 3|
|101| apple| 19|
+---+------+---+
And finally change it to the desired format:
df_aggregated.groupby('ID').agg(
    F.collect_list(F.struct(F.col('fruit'), F.col('qty'))).alias('Result')
).show(truncate=False)
Output:
+---+--------------------------+
|ID |Result |
+---+--------------------------+
|101|[[orange, 3], [apple, 19]]|
+---+--------------------------+
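If the goal is a new column on the original dataframe, as in the question, the aggregated Result can be joined back on ID. A sketch, assuming ID uniquely identifies rows:
df_result = df_aggregated.groupby('ID').agg(
    F.collect_list(F.struct(F.col('fruit'), F.col('qty'))).alias('Result'))
spark_df.join(df_result, on='ID', how='left').show(truncate=False)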

PySpark DF column creation with UDF to mimic np.roll function from numpy

Trying to create a new column with a PySpark UDF, but the values come out null!
Create the DF
data_list = [['a', [1, 2, 3]], ['b', [4, 5, 6]],['c', [2, 4, 6, 8]],['d', [4, 1]],['e', [1,2]]]
all_cols = ['COL1','COL2']
df = sqlContext.createDataFrame(data_list, all_cols)
df.show()
+----+------------+
|COL1| COL2|
+----+------------+
| a| [1, 2, 3]|
| b| [4, 5, 6]|
| c|[2, 4, 6, 8]|
| d| [4, 1]|
| e| [1, 2]|
+----+------------+
df.printSchema()
root
|-- COL1: string (nullable = true)
|-- COL2: array (nullable = true)
| |-- element: long (containsNull = true)
Create a function
def cr_pair(idx_src, idx_dest):
    idx_dest.append(idx_dest.pop(0))
    return idx_src, idx_dest

lst1 = [1, 2, 3]
lst2 = [1, 2, 3]
cr_pair(lst1, lst2)
([1, 2, 3], [2, 3, 1])
Create and register a UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType
get_idx_pairs = udf(lambda x: cr_pair(x, x), ArrayType(IntegerType()))
Add a new column to the DF
df = df.select('COL1', 'COL2', get_idx_pairs('COL2').alias('COL3'))
df.printSchema()
root
|-- COL1: string (nullable = true)
|-- COL2: array (nullable = true)
| |-- element: long (containsNull = true)
|-- COL3: array (nullable = true)
| |-- element: integer (containsNull = true)
df.show()
+----+------------+------------+
|COL1| COL2| COL3|
+----+------------+------------+
| a| [1, 2, 3]|[null, null]|
| b| [4, 5, 6]|[null, null]|
| c|[2, 4, 6, 8]|[null, null]|
| d| [4, 1]|[null, null]|
| e| [1, 2]|[null, null]|
+----+------------+------------+
Here is where the problem is: I am getting all 'null' values in the COL3 column.
The intended outcome should be:
+----+------------+----------------------------+
|COL1| COL2| COL3|
+----+------------+----------------------------+
| a| [1, 2, 3]|[[1 ,2, 3], [2, 3, 1]] |
| b| [4, 5, 6]|[[4, 5, 6], [5, 6, 4]] |
| c|[2, 4, 6, 8]|[[2, 4, 6, 8], [4, 6, 8, 2]]|
| d| [4, 1]|[[4, 1], [1, 4]] |
| e| [1, 2]|[[1, 2], [2, 1]] |
+----+------------+----------------------------+
Your UDF should return ArrayType(ArrayType(IntegerType())) since you are expecting a list of lists in your column; besides, it only needs one parameter:
def cr_pair(idx_src):
    return idx_src, idx_src[1:] + idx_src[:1]

get_idx_pairs = udf(cr_pair, ArrayType(ArrayType(IntegerType())))
df.withColumn('COL3', get_idx_pairs(df['COL2'])).show(5, False)
+----+------------+----------------------------+
|COL1|COL2        |COL3                        |
+----+------------+----------------------------+
|a   |[1, 2, 3]   |[[1, 2, 3], [2, 3, 1]]      |
|b   |[4, 5, 6]   |[[4, 5, 6], [5, 6, 4]]      |
|c   |[2, 4, 6, 8]|[[2, 4, 6, 8], [4, 6, 8, 2]]|
|d   |[4, 1]      |[[4, 1], [1, 4]]            |
|e   |[1, 2]      |[[1, 2], [2, 1]]            |
+----+------------+----------------------------+
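On Spark 2.4+ the same circular shift can also be written with built-in array functions instead of a udf, for example with slice and concat (a sketch, not from the original answers):
from pyspark.sql import functions as F

df.withColumn(
    "COL3",
    F.expr("array(COL2, concat(slice(COL2, 2, size(COL2) - 1), slice(COL2, 1, 1)))")
).show(truncate=False)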
It seems like what you want to do is circularly shift the elements in your list. Here is a non-udf approach using pyspark.sql.functions.posexplode() (Spark version 2.1 and above):
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy("COL1", "COL2").orderBy(f.col("pos") == 0, "pos")
df = df.select("*", f.posexplode("COL2"))\
.select("COL1", "COL2", "pos", f.collect_list("col").over(w).alias('COL3'))\
.where("pos = 0")\
.drop("pos")\
.withColumn("COL3", f.array("COL2", "COL3"))
df.show(truncate=False)
#+----+------------+----------------------------------------------------+
#|COL1|COL2 |COL3 |
#+----+------------+----------------------------------------------------+
#|a |[1, 2, 3] |[WrappedArray(1, 2, 3), WrappedArray(2, 3, 1)] |
#|b |[4, 5, 6] |[WrappedArray(4, 5, 6), WrappedArray(5, 6, 4)] |
#|c |[2, 4, 6, 8]|[WrappedArray(2, 4, 6, 8), WrappedArray(4, 6, 8, 2)]|
#|d |[4, 1] |[WrappedArray(4, 1), WrappedArray(1, 4)] |
#|e |[1, 2] |[WrappedArray(1, 2), WrappedArray(2, 1)] |
#+----+------------+----------------------------------------------------+
Using posexplode returns two columns: the position in the list (pos) and the value (col). The trick here is that we order by f.col("pos") == 0 first and then by "pos"; since False sorts before True, the element at position 0 is collected last, which produces the circular shift.
Though this output prints differently than you would expect with a list of lists in Python, the contents of COL3 are indeed a list of lists of integers.
df.printSchema()
#root
# |-- COL1: string (nullable = true)
# |-- COL2: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- COL3: array (nullable = false)
# | |-- element: array (containsNull = true)
# | | |-- element: long (containsNull = true)
Update
The "WrappedArray prefix" is just the way Spark prints nested lists. The underlying array is exactly as you need it. One way to verify this is by calling collect() and inspecting the data:
results = df.collect()
print([(r["COL1"], r["COL3"]) for r in results])
#[(u'a', [[1, 2, 3], [2, 3, 1]]),
# (u'b', [[4, 5, 6], [5, 6, 4]]),
# (u'c', [[2, 4, 6, 8], [4, 6, 8, 2]]),
# (u'd', [[4, 1], [1, 4]]),
# (u'e', [[1, 2], [2, 1]])]
Or if you converted df to a pandas DataFrame:
print(df.toPandas())
# COL1 COL2 COL3
#0 a [1, 2, 3] ([1, 2, 3], [2, 3, 1])
#1 b [4, 5, 6] ([4, 5, 6], [5, 6, 4])
#2 c [2, 4, 6, 8] ([2, 4, 6, 8], [4, 6, 8, 2])
#3 d [4, 1] ([4, 1], [1, 4])
#4 e [1, 2] ([1, 2], [2, 1])
