PySpark - getting the number of elements of arrays with the same value

I'm learning Spark and I came across a problem that I'm unable to overcome.
What I would like to achieve is to get the number of positions at which two arrays hold the same value. I'm able to get what I want via a Python UDF, but I'm wondering if there is a way using only Spark functions.
df_bits = sqlContext.createDataFrame([[[0, 1, 1, 0, 0],
                                       [1, 1, 1, 0, 1]]],
                                     ['bits1', 'bits2'])
df_bits_with_result = df_bits.select('bits1', 'bits2', some_magic('bits1', 'bits2'))
df_bits_with_result.show()
+---------------+---------------+------------------------+
|bits1          |bits2          |some_magic(bits1, bits2)|
+---------------+---------------+------------------------+
|[0, 1, 1, 0, 0]|[1, 1, 1, 0, 1]|3                       |
+---------------+---------------+------------------------+
Why 3? bits1[1] == bits2[1] AND bits1[2] == bits2[2] AND bits1[3] == bits2[3]
I tried to play with rdd.reduce but with no luck.
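For reference, here is a minimal sketch of the UDF route mentioned above (the name count_equal and the integer return type are my own choices, not from the question):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Count the positions at which both arrays hold the same value.
@F.udf(IntegerType())
def count_equal(xs, ys):
    return sum(1 for x, y in zip(xs, ys) if x == y)

df_bits.select('bits1', 'bits2', count_equal('bits1', 'bits2').alias('n_equal')).show()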

Perhaps this is helpful.
Spark >= 2.4
Use aggregate and zip_with (Scala):
val df = spark.sql("select array(0, 1, 1, 0, 0, null) as bits1, array(1, 1, 1, 0, 1, null) as bits2")
df.show(false)
df.printSchema()
/**
* +----------------+----------------+
* |bits1 |bits2 |
* +----------------+----------------+
* |[0, 1, 1, 0, 0,]|[1, 1, 1, 0, 1,]|
* +----------------+----------------+
*
* root
* |-- bits1: array (nullable = false)
* | |-- element: integer (containsNull = true)
* |-- bits2: array (nullable = false)
* | |-- element: integer (containsNull = true)
*/
df.withColumn("x", expr("aggregate(zip_with(bits1, bits2, (x, y) -> if(x=y, 1, 0)), 0, (acc, x) -> acc + x)"))
.show(false)
/**
* +----------------+----------------+---+
* |bits1 |bits2 |x |
* +----------------+----------------+---+
* |[0, 1, 1, 0, 0,]|[1, 1, 1, 0, 1,]|3 |
* +----------------+----------------+---+
*/

PySpark: use arrays_zip, as mentioned in the comments.
from pyspark.sql import functions as F
df_bits.withColumn("sum", \
F.expr("""aggregate(arrays_zip(bits1,bits2),0,(acc,x)-> IF(x.bits1==x.bits2,1,0)+acc)""")).show()
#+---------------+---------------+---+
#| bits1| bits2|sum|
#+---------------+---------------+---+
#|[0, 1, 1, 0, 0]|[1, 1, 1, 0, 1]| 3|
#+---------------+---------------+---+
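For completeness, the zip_with form used in the Scala snippet also works from PySpark through expr on Spark 2.4+; a minimal sketch, assuming the same df_bits as in the question:
from pyspark.sql import functions as F

# Compare element-wise with zip_with, then fold the 0/1 flags with aggregate.
df_bits.withColumn(
    "sum",
    F.expr("aggregate(zip_with(bits1, bits2, (x, y) -> if(x = y, 1, 0)), 0, (acc, v) -> acc + v)")
).show()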

Related

PySpark - sort an array of arrays by an inner array's value

I have the following df:
+--------------------+
| id| id_info|
+--------------------+
|id_1| [[1, 8, 2, "bar"], [5, 9, 2, "foo"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]] |
I would like this sorted by the second element in descending order, so:
+--------------------+
| id| id_info|
+--------------------+
|id_1| [[5, 9, 2, "foo"], [1, 8, 2, "bar"], [4, 3, 2, "something"], [9, null, 2, "this_is_null"]] |
I came up with something like this:
def def_sort(x):
    return sorted(x, key=lambda x: x[1], reverse=True)

udf_sort = F.udf(def_sort, T.ArrayType(T.ArrayType(T.IntegerType())))
df.select("id", udf_sort("id_info"))
I'm not sure how to handle null values this way, and is there maybe a built-in function for this? Can I somehow do it with F.array_sort?
The elements of the array contain integers and a string, so I assume that the column id_info is an array of structs.
So the schema of the input data would be similar to
root
|-- id: string (nullable = true)
|-- id_info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col1: integer (nullable = true)
| | |-- col2: integer (nullable = true)
| | |-- col3: integer (nullable = true)
| | |-- col4: string (nullable = true)
The names of the elements of the struct might be different.
With this schema information we can use array_sort to order the array:
df.selectExpr("array_sort(id_info, (l,r) -> \
case when l['col2'] > r['col2'] then -1 else 1 end) as sorted") \
.show(truncate=False)
prints
+----------------------------------------------------------------------------------+
|sorted |
+----------------------------------------------------------------------------------+
|[{5, 9, 2, foo}, {1, 8, 2, bar}, {4, 3, 2, something}, {9, null, 2, this_is_null}]|
+----------------------------------------------------------------------------------+
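Since the question also asks about null values: the comparator above leaves them to the case expression's else branch (the comparison itself yields null), so where they land is not explicit. A hedged variant that sorts descending and always puts entries with a null second field last, assuming the same struct field names as above (the two-argument array_sort comparator needs Spark 3.0+):
df.selectExpr("""
    array_sort(id_info, (l, r) ->
        case
            when l['col2'] is null and r['col2'] is null then 0
            when l['col2'] is null then 1
            when r['col2'] is null then -1
            when l['col2'] > r['col2'] then -1
            when l['col2'] < r['col2'] then 1
            else 0
        end) as sorted
""").show(truncate=False)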
You can try explode followed by orderBy on id and the second element in descending order, then groupBy + collect_list:
out = (sdf.select("*",F.explode("id_info").alias("element"))
.withColumn("second_ele",F.element_at("element",2))
.orderBy(*["id",F.desc("second_ele")])
.groupBy("id").agg(F.collect_list("element").alias("id_info"))
)
out.show(truncate=False)
+----+-----------------------------------------------------------------------+
|id |id_info |
+----+-----------------------------------------------------------------------+
|id_1|[[5, 9, 2, null], [1, 8, 2, null], [4, 3, 2, null], [9, null, 2, null]]|
+----+-----------------------------------------------------------------------+

Calculate cumulative sum of pyspark array column

I have a spark dataframe with an array column that looks like this:
+--------------+
| x |
+--------------+
| [1, 1, 0, 1] |
| [0, 0, 0, 0] |
| [0, 0, 1, 1] |
| [0, 0, 0, 1] |
| [1, 0, 1] |
+--------------+
I want to add a new column with another array that contains the cumulative sum of x at each index. The result should look like this:
+--------------+---------------+
| x | x_running_sum |
+--------------+---------------+
| [1, 1, 0, 1] | [1, 2, 2, 3] |
| [0, 0, 0, 0] | [0, 0, 0, 0] |
| [0, 0, 1, 1] | [0, 0, 1, 2] |
| [0, 0, 0, 1] | [0, 0, 0, 1] |
| [1, 0, 1] | [1, 1, 2] |
+--------------+---------------+
How can I create the x_running_sum column? I've tried using some of the higher order functions like transform, aggregate, and zip_with, but I haven't found a solution yet.
To perform the cumulative sum, I sliced the array up to each index position and reduced the values of each slice:
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(x=[1, 1, 0, 1]),
    Row(x=[0, 0, 0, 0]),
    Row(x=[0, 0, 1, 1]),
    Row(x=[0, 0, 0, 1]),
    Row(x=[1, 0, 1])
])

(df
 .selectExpr('x', "TRANSFORM(sequence(1, size(x)), index -> REDUCE(slice(x, 1, index), CAST(0 as BIGINT), (acc, el) -> acc + el)) AS x_running_sum")
 .show(truncate=False))
Output
+------------+-------------+
|x |x_running_sum|
+------------+-------------+
|[1, 1, 0, 1]|[1, 2, 2, 3] |
|[0, 0, 0, 0]|[0, 0, 0, 0] |
|[0, 0, 1, 1]|[0, 0, 1, 2] |
|[0, 0, 0, 1]|[0, 0, 0, 1] |
|[1, 0, 1] |[1, 1, 2] |
+------------+-------------+
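The slice-plus-REDUCE version recomputes every prefix, which is fine for short arrays. As a different route (not from the original answer), a posexplode plus a running sum over a window also works; a sketch, assuming the same df as above:
from pyspark.sql import functions as F, Window

# Explode positions, take a per-row running sum ordered by position,
# then collect the sums back into an array in position order.
w = Window.partitionBy("_row").orderBy("pos")
out = (df.withColumn("_row", F.monotonically_increasing_id())
         .select("_row", "x", F.posexplode("x").alias("pos", "val"))
         .withColumn("cum", F.sum("val").over(w))
         .groupBy("_row", "x")
         .agg(F.sort_array(F.collect_list(F.struct("pos", "cum"))).alias("pairs"))
         .select("x", F.col("pairs.cum").alias("x_running_sum")))
out.show(truncate=False)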

How to split an array into chunks and find the sum of the chunks and store the output as an array in pyspark

I have a dataframe as shown below:
+-----+------------------------+
|Index| finalArray |
+-----+------------------------+
|1 |[0, 2, 0, 3, 1, 4, 2, 7]|
|2 |[0, 4, 4, 3, 4, 2, 2, 5]|
+-----+------------------------+
I want to break the array into chunks of 2, find the sum of each chunk, and store the resulting array in the column finalArray. It will look like below:
+-----+---------------------+
|Index| finalArray |
+-----+---------------------+
|1 |[2, 3, 5, 9] |
|2 |[4, 7, 6, 7] |
+-----+---------------------+
I am able to do it by creating a UDF, but I'm looking for a better and more optimised way. Preferably I'd like to handle it using withColumn and passing flagArray, without having to write a UDF.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

@udf(ArrayType(DoubleType()))
def aggregate(finalArray, chunkSize):
    n = int(chunkSize)
    aggsum = []
    final = [finalArray[i * n:(i + 1) * n] for i in range((len(finalArray) + n - 1) // n)]
    for item in final:
        agg = 0
        for j in item:
            agg += j
        aggsum.append(agg)
    return aggsum
I am not able to use the below expression in the UDF, hence I used loops:
[sum(finalArray[x:x+2]) for x in range(0, len(finalArray), chunkSize)]
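As an aside (not part of the original question): the list comprehension does work inside a UDF if the chunk size is captured in a closure rather than passed in as a column; a minimal sketch, assuming the dataframe with the finalArray column shown above:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType

# Build the UDF with the chunk size baked in via a closure.
def chunk_sum_udf(chunk_size):
    return F.udf(
        lambda arr: [sum(arr[i:i + chunk_size]) for i in range(0, len(arr), chunk_size)],
        ArrayType(LongType()),
    )

df.withColumn("finalArray", chunk_sum_udf(2)("finalArray")).show(truncate=False)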
For Spark 2.4+, you can try sequence + transform:
from pyspark.sql.functions import expr

df = spark.createDataFrame([
    (1, [0, 2, 0, 3, 1, 4, 2, 7]),
    (2, [0, 4, 4, 3, 4, 2, 2, 5])
], ["Index", "finalArray"])

df.withColumn("finalArray", expr("""
    transform(
      sequence(0, ceil(size(finalArray)/2) - 1),
      i -> finalArray[2*i] + ifnull(finalArray[2*i+1], 0))
""")).show(truncate=False)
+-----+------------+
|Index|finalArray |
+-----+------------+
|1 |[2, 3, 5, 9]|
|2 |[4, 7, 6, 7]|
+-----+------------+
For a chunk size of any N, use the aggregate function to compute the sub-totals:
N = 3

sql_expr = """
transform(
  /* create a sequence from 0 to number_of_chunks-1 */
  sequence(0, ceil(size(finalArray)/{0}) - 1),
  /* iterate the above sequence */
  i ->
    /* create a sequence from 0 to chunk_size-1 and
       sum the values of every chunk_size items by their indices */
    aggregate(
      sequence(0, {0}-1),
      0L,
      (acc, y) -> acc + ifnull(finalArray[i*{0}+y], 0)
    )
)
"""
df.withColumn("finalArray", expr(sql_expr.format(N))).show()
+-----+----------+
|Index|finalArray|
+-----+----------+
| 1| [2, 8, 9]|
| 2| [8, 9, 7]|
+-----+----------+
Here is a slightly different version of @jxc's solution, using the slice function together with transform and aggregate.
The logic is: for each element of the array, check whether its index is a multiple of the chunk size; if it is, use slice to take a sub-array of chunk size and aggregate to sum its elements, otherwise emit null. Finally, filter removes the nulls (corresponding to indexes that do not satisfy i % chunk = 0).
chunk = 2

transform_expr = f"""
filter(transform(finalArray,
                 (x, i) -> IF(i % {chunk} = 0,
                              aggregate(slice(finalArray, i+1, {chunk}), 0L, (acc, y) -> acc + y),
                              null)),
       x -> x is not null)
"""
df.withColumn("finalArray", expr(transform_expr)).show()
#+-----+------------+
#|Index| finalArray|
#+-----+------------+
#| 1|[2, 3, 5, 9]|
#| 2|[4, 7, 6, 7]|
#+-----+------------+

Subtract Two Arrays to Get A New Array in Pyspark

I am new to Spark.
I can sum, subtract, or multiply arrays in Python with Pandas & NumPy, but I am having difficulty doing something similar in Spark (Python). I am on Databricks.
For example, this kind of approach gives a huge error message which I don't want to copy-paste here:
differencer=udf(lambda x,y: x-y, ArrayType(FloatType()))
df.withColumn('difference', differencer('Array1', 'Array2'))
Schema looks like this:
root
|-- col1: integer (nullable = true)
|-- time: timestamp (nullable = true)
|-- num: integer (nullable = true)
|-- part: integer (nullable = true)
|-- result: integer (nullable = true)
|-- Array1: array (nullable = true)
| |-- element: float (containsNull = true)
|-- Array2: array (nullable = false)
| |-- element: float (containsNull = true)
I just want to create a new column subtracting those 2 array columns. Actually, I will get the RMSE between them. But I think I can handle it once I learn how to get this difference.
Arrays look like this (I am just typing in some integers):
Array1_row1[5, 4, 2, 4, 3]
Array2_row1[4, 3, 1, 2, 1]
So the resulting array for row1 should be:
DiffCol_row1[1, 1, 1, 2, 2]
Thanks for any suggestions or directions. Thank you.
You can use arrays_zip and transform:
from pyspark.sql.functions import expr
df = spark.createDataFrame(
    [([5, 4, 2, 4, 3], [4, 3, 1, 2, 1])], ("array1", "array2")
)

df.withColumn(
    "array3",
    expr("transform(arrays_zip(array1, array2), x -> x.array1 - x.array2)")
).show()
# +---------------+---------------+---------------+
# | array1| array2| array3|
# +---------------+---------------+---------------+
# |[5, 4, 2, 4, 3]|[4, 3, 1, 2, 1]|[1, 1, 1, 2, 2]|
# +---------------+---------------+---------------+
A valid UDF would require equivalent logic, i.e.
from pyspark.sql.functions import udf

@udf("array<double>")
def differencer(xs, ys):
    if xs and ys:
        return [float(x - y) for x, y in zip(xs, ys)]

df.withColumn("array3", differencer("array1", "array2")).show()
# +---------------+---------------+--------------------+
# | array1| array2| array3|
# +---------------+---------------+--------------------+
# |[5, 4, 2, 4, 3]|[4, 3, 1, 2, 1]|[1.0, 1.0, 1.0, 2...|
# +---------------+---------------+--------------------+
You can use zip_with (since Spark 2.4):
from pyspark.sql.functions import expr
df = spark.createDataFrame(
    [([5, 4, 2, 4, 3], [4, 3, 1, 2, 1])], ("array1", "array2")
)

df.withColumn(
    "array3",
    expr("zip_with(array1, array2, (x, y) -> x - y)")
).show()
# +---------------+---------------+---------------+
# | array1| array2| array3|
# +---------------+---------------+---------------+
# |[5, 4, 2, 4, 3]|[4, 3, 1, 2, 1]|[1, 1, 1, 2, 2]|
# +---------------+---------------+---------------+
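Since the question mentions that the end goal is the RMSE between the two arrays, that too can be done without a UDF by aggregating the squared differences; a sketch on Spark 2.4+, assuming the same df as above:
from pyspark.sql.functions import expr

# Sum the squared element-wise differences, divide by the array length,
# then take the square root.
df.withColumn(
    "rmse",
    expr("""
        sqrt(
          aggregate(
            zip_with(array1, array2, (x, y) -> pow(x - y, 2)),
            cast(0 as double),
            (acc, v) -> acc + v
          ) / size(array1)
        )
    """)
).show()
# For [5, 4, 2, 4, 3] vs [4, 3, 1, 2, 1] this is sqrt(11 / 5) ≈ 1.483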

PySpark DF column creation with UDF to mimic np.roll function from numpy

Trying to create a new column with a PySpark UDF, but the values are null!
Create the DF
data_list = [['a', [1, 2, 3]], ['b', [4, 5, 6]],['c', [2, 4, 6, 8]],['d', [4, 1]],['e', [1,2]]]
all_cols = ['COL1','COL2']
df = sqlContext.createDataFrame(data_list, all_cols)
df.show()
+----+------------+
|COL1| COL2|
+----+------------+
| a| [1, 2, 3]|
| b| [4, 5, 6]|
| c|[2, 4, 6, 8]|
| d| [4, 1]|
| e| [1, 2]|
+----+------------+
df.printSchema()
root
|-- COL1: string (nullable = true)
|-- COL2: array (nullable = true)
| |-- element: long (containsNull = true)
Create a function
def cr_pair(idx_src, idx_dest):
    idx_dest.append(idx_dest.pop(0))
    return idx_src, idx_dest

lst1 = [1, 2, 3]
lst2 = [1, 2, 3]
cr_pair(lst1, lst2)
([1, 2, 3], [2, 3, 1])
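Worth noting as an aside (not from the original post): cr_pair mutates idx_dest in place, and since the UDF registered below calls it as cr_pair(x, x), both returned references point at the same, already rotated list:
# Both elements of the returned tuple alias the same mutated list.
lst = [1, 2, 3]
print(cr_pair(lst, lst))  # ([2, 3, 1], [2, 3, 1])
print(lst)                # [2, 3, 1] -- the input list itself was rotated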
Create and register a UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType
get_idx_pairs = udf(lambda x: cr_pair(x, x), ArrayType(IntegerType()))
Add a new column to the DF
df = df.select('COL1', 'COL2', get_idx_pairs('COL2').alias('COL3'))
df.printSchema()
root
|-- COL1: string (nullable = true)
|-- COL2: array (nullable = true)
| |-- element: long (containsNull = true)
|-- COL3: array (nullable = true)
| |-- element: integer (containsNull = true)
df.show()
+----+------------+------------+
|COL1| COL2| COL3|
+----+------------+------------+
| a| [1, 2, 3]|[null, null]|
| b| [4, 5, 6]|[null, null]|
| c|[2, 4, 6, 8]|[null, null]|
| d| [4, 1]|[null, null]|
| e| [1, 2]|[null, null]|
+----+------------+------------+
Here is where the problem is.
I am getting all 'null' values in the COL3 column.
The intended outcome should be:
+----+------------+----------------------------+
|COL1| COL2| COL3|
+----+------------+----------------------------+
| a| [1, 2, 3]|[[1 ,2, 3], [2, 3, 1]] |
| b| [4, 5, 6]|[[4, 5, 6], [5, 6, 4]] |
| c|[2, 4, 6, 8]|[[2, 4, 6, 8], [4, 6, 8, 2]]|
| d| [4, 1]|[[4, 1], [1, 4]] |
| e| [1, 2]|[[1, 2], [2, 1]] |
+----+------------+----------------------------+
Your UDF should return ArrayType(ArrayType(IntegerType())) since you are expecting a list of lists in your column; besides, it only needs one parameter:
def cr_pair(idx_src):
    return idx_src, idx_src[1:] + idx_src[:1]

get_idx_pairs = udf(cr_pair, ArrayType(ArrayType(IntegerType())))
df.withColumn('COL3', get_idx_pairs(df['COL2'])).show(5, False)
+----+------------+----------------------------+
|COL1|COL2        |COL3                        |
+----+------------+----------------------------+
|a   |[1, 2, 3]   |[[1, 2, 3], [2, 3, 1]]      |
|b   |[4, 5, 6]   |[[4, 5, 6], [5, 6, 4]]      |
|c   |[2, 4, 6, 8]|[[2, 4, 6, 8], [4, 6, 8, 2]]|
|d   |[4, 1]      |[[4, 1], [1, 4]]            |
|e   |[1, 2]      |[[1, 2], [2, 1]]            |
+----+------------+----------------------------+
It seems like what you want to do is circularly shift the elements in your list. Here is a non-udf approach using pyspark.sql.functions.posexplode() (Spark version 2.1 and above):
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy("COL1", "COL2").orderBy(f.col("pos") == 0, "pos")
df = df.select("*", f.posexplode("COL2"))\
.select("COL1", "COL2", "pos", f.collect_list("col").over(w).alias('COL3'))\
.where("pos = 0")\
.drop("pos")\
.withColumn("COL3", f.array("COL2", "COL3"))
df.show(truncate=False)
#+----+------------+----------------------------------------------------+
#|COL1|COL2 |COL3 |
#+----+------------+----------------------------------------------------+
#|a |[1, 2, 3] |[WrappedArray(1, 2, 3), WrappedArray(2, 3, 1)] |
#|b |[4, 5, 6] |[WrappedArray(4, 5, 6), WrappedArray(5, 6, 4)] |
#|c |[2, 4, 6, 8]|[WrappedArray(2, 4, 6, 8), WrappedArray(4, 6, 8, 2)]|
#|d |[4, 1] |[WrappedArray(4, 1), WrappedArray(1, 4)] |
#|e |[1, 2] |[WrappedArray(1, 2), WrappedArray(2, 1)] |
#+----+------------+----------------------------------------------------+
Using posexplode will return two columns: the position in the list (pos) and the value (col). The trick here is that we order by f.col("pos") == 0 first and then by "pos". This moves the first position in the array to the end of the list.
Though this output prints differently than you would expect for a list of lists in Python, the contents of COL3 are indeed a list of lists of integers.
df.printSchema()
#root
# |-- COL1: string (nullable = true)
# |-- COL2: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- COL3: array (nullable = false)
# | |-- element: array (containsNull = true)
# | | |-- element: long (containsNull = true)
Update
The "WrappedArray prefix" is just the way Spark prints nested lists. The underlying array is exactly as you need it. One way to verify this is by calling collect() and inspecting the data:
results = df.collect()
print([(r["COL1"], r["COL3"]) for r in results])
#[(u'a', [[1, 2, 3], [2, 3, 1]]),
# (u'b', [[4, 5, 6], [5, 6, 4]]),
# (u'c', [[2, 4, 6, 8], [4, 6, 8, 2]]),
# (u'd', [[4, 1], [1, 4]]),
# (u'e', [[1, 2], [2, 1]])]
Or if you converted df to a pandas DataFrame:
print(df.toPandas())
# COL1 COL2 COL3
#0 a [1, 2, 3] ([1, 2, 3], [2, 3, 1])
#1 b [4, 5, 6] ([4, 5, 6], [5, 6, 4])
#2 c [2, 4, 6, 8] ([2, 4, 6, 8], [4, 6, 8, 2])
#3 d [4, 1] ([4, 1], [1, 4])
#4 e [1, 2] ([1, 2], [2, 1])
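On Spark 2.4+ there is also a UDF-free and window-free way to build the rotated copy, using slice and concat; a minimal sketch, assuming the same df as above:
import pyspark.sql.functions as f

# Rotate COL2 left by one with slice + concat, then pair the original
# and rotated arrays.
rotated = f.expr("concat(slice(COL2, 2, size(COL2) - 1), slice(COL2, 1, 1))")
df.withColumn("COL3", f.array(f.col("COL2"), rotated)).show(truncate=False)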
