Processing a complex object (array) in PySpark

I am trying to figure out possible ways to process complex objects in PySpark. In the example below, one of the columns of the dataframe is an array of integers. The processing simply adds one to each value. Are these acceptable methods, or is there a better practice?
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.enableHiveSupport().appName('learn').getOrCreate()

data = [('a', 1, [1, 3, 5]),
        ('b', 2, [4, 6, 9]),
        ('c', 3, [50, 60, 70, 80])]
df = spark.createDataFrame(data, ['nam', 'q', 'compl'])
# process complex object, method 1 using explode and collect_list (dataframe API)
res = df.withColumn('id', f.monotonically_increasing_id()).withColumn('compl_exploded', f.explode(f.col('compl')))
res = res.withColumn('compl_exploded', f.col('compl_exploded')+1)
res = res.groupby('id').agg(f.first('nam'), f.first('q'), f.collect_list('compl_exploded').alias('compl')).drop('id')
res.show()
# process complex object, method 2 using explode and collect_list (SQL)
df.withColumn('id', f.monotonically_increasing_id()).createOrReplaceTempView('tmp_view')
res = spark.sql("""
    SELECT first(nam) AS nam, first(q) AS q, collect_list(compl_exploded + 1) AS compl FROM (
        SELECT *, explode(compl) AS compl_exploded FROM tmp_view
    ) x
    GROUP BY id
""")
res.show()
# process complex object, method 3 using python UDF
from typing import List
from pyspark.sql.types import ArrayType, LongType

def process(x: List[int]) -> List[int]:
    return [v + 1 for v in x]

process_udf = f.udf(process, ArrayType(LongType()))
res = df.withColumn('compl', process_udf('compl'))
res.show()

For such operations you can take advantage of built-in functions.
For example, in your use case you can use transform as below:
pyspark<=3.0
# Option 1
import pyspark.sql.functions as f
df.withColumn('add_one',f.expr('transform(compl, x -> x+1)')).show()
+---+---+----------------+----------------+
|nam| q| compl| add_one|
+---+---+----------------+----------------+
| a| 1| [1, 3, 5]| [2, 4, 6]|
| b| 2| [4, 6, 9]| [5, 7, 10]|
| c| 3|[50, 60, 70, 80]|[51, 61, 71, 81]|
+---+---+----------------+----------------+
# Or either of the options below; all give the same output
# Option 2
df.select('nam', 'q', 'compl', f.expr('transform(compl, x -> x+1) as add_one')).show()
# Option 3
df.createOrReplaceTempView('tmp_view')
spark.sql('select nam, q, compl, transform(compl, x -> x+1) as add_one from tmp_view').show()
pyspark>=3.1.0
If you are using a newer version of Spark, this function is available directly in the Python API and you can use it without expr.
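A minimal sketch with the native function (assuming the df defined in the question); f.transform applies the lambda to every element of the array column:

import pyspark.sql.functions as f

# transform applies the lambda to every element of the array column
df.withColumn('add_one', f.transform('compl', lambda x: x + 1)).show()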

Related

PySpark RDD: Manipulating Inner Array

I have a dataset (for example)
sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))
The print statement returns [(1, [2, 3, 4, 5])]
I now need to multiply everything in the sub-array by 2 across the RDD. Since I have already parallelized, I can't further break down "y.take(1)" to multiply [2, 3, 4, 5] by 2.
How can I essentially isolate the inner array across my worker nodes to then do the multiplication?
I think you can use map with a lambda function:
y = sc.parallelize(x).map(lambda x: (x[0], [2*t for t in x[1]]))
Then y.take(2) returns:
[(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
It will be more efficient if you use the DataFrame API instead of RDDs: in that case all your processing happens without the serialization to Python that occurs when you use the RDD API.
For example you can use the transform function to apply transformation to array values:
import pyspark.sql.functions as F

df = spark.createDataFrame([(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])],
                           schema="id int, arr array<int>")
df2 = df.select("id", F.transform("arr", lambda x: x * 2).alias("arr"))
df2.show()
will give you the desired output:
+---+---------------+
| id| arr|
+---+---------------+
| 1| [4, 6, 8, 10]|
| 2|[4, 14, 16, 20]|
+---+---------------+

Return map values sorted by keys

Here is a simple example
from pyspark.sql.functions import map_values
df = spark.sql("SELECT map('a', 1, 'c', 2, 'b', 3) as data")
df.show(20, False)
df.select(map_values("data").alias("values")).show()
What I want is the values in the order of the keys 'a', 'b', 'c', i.e. [1, 3, 2].
How to achieve this? In addition - does the result from map_values function always maintain the order in the df.show() above, i.e., [1, 2, 3]?
An option using map_keys
from pyspark.sql import functions as F

df = spark.sql("SELECT map('a', 1, 'c', 2, 'b', 3) as data")
df = df.select(
    F.transform(F.array_sort(F.map_keys("data")), lambda x: F.col("data")[x]).alias("values")
)
df.show()
# +---------+
# | values|
# +---------+
# |[1, 3, 2]|
# +---------+
A map's contract is that it delivers a value for a given key, and the ordering of entries is not preserved; order guarantees are provided by arrays.
What you can do is turn your map into an array with the map_entries function, then sort the entries using array_sort, and then use transform to get the values. A little convoluted, but it works.
with data as (SELECT map('a', 1, 'c', 2, 'b', 3) as m)
select
  transform(
    array_sort(
      map_entries(m),
      (left, right) -> case when left.key < right.key then -1 when left.key > right.key then 1 else 0 end
    ),
    e -> e.value
  )
from data;
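For reference, a rough PySpark equivalent of the same idea (assumes Spark 3.1+ for the lambda form of transform; array_sort orders an array of structs by its fields in order, so the key comes first):

from pyspark.sql import functions as F

# sort the (key, value) entries by key, then keep only the values
df.select(
    F.transform(
        F.array_sort(F.map_entries('data')),
        lambda e: e['value']
    ).alias('values')
).show()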

How to split an array into chunks and find the sum of the chunks and store the output as an array in pyspark

I have a dataframe as shown below:
+-----+------------------------+
|Index| finalArray |
+-----+------------------------+
|1 |[0, 2, 0, 3, 1, 4, 2, 7]|
|2 |[0, 4, 4, 3, 4, 2, 2, 5]|
+-----+------------------------+
I want to break the array into chunks of 2, find the sum of each chunk, and store the resultant array in the column finalArray. It will look like below:
+-----+---------------------+
|Index| finalArray |
+-----+---------------------+
|1 |[2, 3, 5, 9] |
|2 |[4, 7, 6, 7] |
+-----+---------------------+
I am able to do it by creating a UDF, but I am looking for a better and optimised way. Preferably I would like to handle it using withColumn and passing flagArray, without having to write a UDF.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

@udf(ArrayType(DoubleType()))
def aggregate(finalArray, chunkSize):
    n = int(chunkSize)
    aggsum = []
    final = [finalArray[i * n:(i + 1) * n] for i in range((len(finalArray) + n - 1) // n)]
    for item in final:
        agg = 0
        for j in item:
            agg += j
        aggsum.append(agg)
    return aggsum
I am not able to use the below expression in the UDF, hence I used loops:
[sum(finalArray[x:x+2]) for x in range(0, len(finalArray), chunkSize)]
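For reference, a decorated UDF like the one above would be called with the chunk size passed as a literal column (a sketch, assuming pyspark.sql.functions is imported as F):

df.withColumn('finalArray', aggregate('finalArray', F.lit(2))).show()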
For spark 2.4+, you can try sequence + transform:
from pyspark.sql.functions import expr

df = spark.createDataFrame([
    (1, [0, 2, 0, 3, 1, 4, 2, 7]),
    (2, [0, 4, 4, 3, 4, 2, 2, 5])
], ["Index", "finalArray"])

df.withColumn("finalArray", expr("""
    transform(
      sequence(0, ceil(size(finalArray)/2) - 1),
      i -> finalArray[2*i] + ifnull(finalArray[2*i+1], 0))
""")).show(truncate=False)
+-----+------------+
|Index|finalArray |
+-----+------------+
|1 |[2, 3, 5, 9]|
|2 |[4, 7, 6, 7]|
+-----+------------+
For a chunk size of any N, use the aggregate function to compute the sub-totals:
N = 3

sql_expr = """
    transform(
      /* create a sequence from 0 to number_of_chunks-1 */
      sequence(0, ceil(size(finalArray)/{0}) - 1),
      /* iterate over the above sequence */
      i ->
        /* create a sequence from 0 to chunk_size-1 and
           calculate the sum of the chunk_size items at the corresponding indices */
        aggregate(
          sequence(0, {0}-1),
          0L,
          (acc, y) -> acc + ifnull(finalArray[i*{0}+y], 0)
        )
    )
"""
df.withColumn("finalArray", expr(sql_expr.format(N))).show()
+-----+----------+
|Index|finalArray|
+-----+----------+
| 1| [2, 8, 9]|
| 2| [8, 9, 7]|
+-----+----------+
Here is a slightly different version of @jxc's solution using the slice function with the transform and aggregate functions.
The logic is: for each element of the array, we check whether its index is a multiple of the chunk size and use slice to get a sub-array of chunk size. With aggregate, we sum the elements of each sub-array. Finally, we use filter to remove the nulls (corresponding to indexes that do not satisfy i % chunk = 0).
chunk = 2

transform_expr = f"""
    filter(transform(finalArray,
                     (x, i) -> IF(i % {chunk} = 0,
                                  aggregate(slice(finalArray, i+1, {chunk}), 0L, (acc, y) -> acc + y),
                                  null)),
           x -> x is not null)
"""
df.withColumn("finalArray", expr(transform_expr)).show()
#+-----+------------+
#|Index| finalArray|
#+-----+------------+
#| 1|[2, 3, 5, 9]|
#| 2|[4, 7, 6, 7]|
#+-----+------------+

Pyspark UDF to return result similar to groupby().sum() between two columns

I have the following sample dataframe
fruit_list = ['apple', 'apple', 'orange', 'apple']
qty_list = [16, 2, 3, 1]
spark_df = spark.createDataFrame([(101, 'Mark', fruit_list, qty_list)], ['ID', 'name', 'fruit', 'qty'])
and I would like to create another column which contains a result similar to what I would achieve with a pandas groupby('fruit').sum()
        qty
fruits
apple    19
orange    3
The above result could be stored in the new column in any form (either a string, dictionary, list of tuples...).
I've tried an approach similar to the following one, which does not work:
sum_cols = udf(lambda x: pd.DataFrame({'fruits': x[0], 'qty': x[1]}).groupby('fruits').sum())
spark_df.withColumn('Result', sum_cols(F.struct('fruit', 'qty'))).show()
One example of result dataframe could be
+---+----+--------------------+-------------+-------------------------+
| ID|name| fruit| qty| Result|
+---+----+--------------------+-------------+-------------------------+
|101|Mark|[apple, apple, or...|[16, 2, 3, 1]|[(apple,19), (orange,3)] |
+---+----+--------------------+-------------+-------------------------+
Do you have any suggestion on how I could achieve that?
Thanks
Edit: running on Spark 2.4.3
As @pault mentioned, as of Spark 2.4+ you can use Spark SQL built-in functions to handle your task; here is one way with array_distinct + transform + aggregate:
from pyspark.sql.functions import expr
# set up data
spark_df = spark.createDataFrame([
      (101, 'Mark', ['apple', 'apple', 'orange', 'apple'], [16, 2, 3, 1])
    , (102, 'Twin', ['apple', 'banana', 'avocado', 'banana', 'avocado'], [5, 2, 11, 3, 1])
    , (103, 'Smith', ['avocado'], [10])
], ['ID', 'name', 'fruit', 'qty'])
>>> spark_df.show(5,0)
+---+-----+-----------------------------------------+----------------+
|ID |name |fruit |qty |
+---+-----+-----------------------------------------+----------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|
|103|Smith|[avocado] |[10] |
+---+-----+-----------------------------------------+----------------+
>>> spark_df.printSchema()
root
 |-- ID: long (nullable = true)
 |-- name: string (nullable = true)
 |-- fruit: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- qty: array (nullable = true)
 |    |-- element: long (containsNull = true)
Set up the SQL statement:
stmt = '''
    transform(array_distinct(fruit), x -> (x, aggregate(
        transform(sequence(0, size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
        , 0
        , (y,z) -> int(y + z)
    ))) AS sum_fruit
'''
>>> spark_df.withColumn('sum_fruit', expr(stmt)).show(10,0)
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|ID |name |fruit |qty |sum_fruit |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
|101|Mark |[apple, apple, orange, apple] |[16, 2, 3, 1] |[[apple, 19], [orange, 3]] |
|102|Twin |[apple, banana, avocado, banana, avocado]|[5, 2, 11, 3, 1]|[[apple, 5], [banana, 5], [avocado, 12]]|
|103|Smith|[avocado] |[10] |[[avocado, 10]] |
+---+-----+-----------------------------------------+----------------+----------------------------------------+
Explanation:
Use array_distinct(fruit) to find all distinct entries in the array fruit.
Then transform this new array (with element x) from x to (x, aggregate(..x..)).
The function aggregate(..x..) takes the simple form of summing up all elements in array_T:
aggregate(array_T, 0, (y,z) -> y + z)
where array_T comes from the following transformation:
transform(sequence(0, size(fruit)-1), i -> IF(fruit[i] = x, qty[i], 0))
which iterates through the array fruit: if fruit[i] = x, it returns the corresponding qty[i], otherwise it returns 0. For example, for ID=101, when x = 'orange', it returns the array [0, 0, 3, 0].
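To make that last step concrete, the inner aggregate over such an array can be checked on its own (a quick hypothetical sanity check, not part of the original answer):

spark.sql("SELECT aggregate(array(0, 0, 3, 0), 0, (y, z) -> int(y + z)) AS s").show()  # s = 3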
There may be a fancy way to do this using only the API functions on Spark 2.4+, perhaps with some combination of arrays_zip and aggregate, but I can't think of any that don't involve an explode step followed by a groupBy. With that in mind, using a udf may actually be better for you in this case.
I think creating a pandas DataFrame just for the purpose of calling .groupby().sum() is overkill. Furthermore, even if you did do it that way, you'd need to convert the final output to a different data structure because a udf can't return a pandas DataFrame.
Here's one way with a udf using collections.defaultdict:
from collections import defaultdict
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

def sum_cols_func(frt, qty):
    d = defaultdict(int)
    for x, y in zip(frt, map(int, qty)):
        d[x] += y
    return list(d.items())

sum_cols = udf(
    lambda x: sum_cols_func(*x),
    ArrayType(
        StructType([StructField("fruit", StringType()), StructField("qty", IntegerType())])
    )
)
Then call this by passing in the fruit and qty columns:
from pyspark.sql.functions import array, col
spark_df.withColumn(
    "Result",
    sum_cols(array([col("fruit"), col("qty")]))
).show(truncate=False)
#+---+----+-----------------------------+-------------+--------------------------+
#|ID |name|fruit |qty |Result |
#+---+----+-----------------------------+-------------+--------------------------+
#|101|Mark|[apple, apple, orange, apple]|[16, 2, 3, 1]|[[orange, 3], [apple, 19]]|
#+---+----+-----------------------------+-------------+--------------------------+
If you have Spark < 2.4, use the following to explode (otherwise check this answer):
df_split = (spark_df.rdd
            .flatMap(lambda row: [(row.ID, row.name, f, q) for f, q in zip(row.fruit, row.qty)])
            .toDF(["ID", "name", "fruit", "qty"]))
df_split.show()
Output:
+---+----+------+---+
| ID|name| fruit|qty|
+---+----+------+---+
|101|Mark| apple| 16|
|101|Mark| apple| 2|
|101|Mark|orange| 3|
|101|Mark| apple| 1|
+---+----+------+---+
Then prepare the result you want. First find the aggregated dataframe:
from pyspark.sql import functions as F

df_aggregated = df_split.groupby('ID', 'fruit').agg(F.sum('qty').alias('qty'))
df_aggregated.show()
Output:
+---+------+---+
| ID| fruit|qty|
+---+------+---+
|101|orange| 3|
|101| apple| 19|
+---+------+---+
And finally change it to the desired format:
df_aggregated.groupby('ID').agg(
    F.collect_list(F.struct(F.col('fruit'), F.col('qty'))).alias('Result')
).show(truncate=False)
Output:
+---+--------------------------+
|ID |Result |
+---+--------------------------+
|101|[[orange, 3], [apple, 19]]|
+---+--------------------------+
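The two aggregation steps can also be chained into one expression (same logic as above, just without the intermediate variable):

(df_split
 .groupby('ID', 'fruit').agg(F.sum('qty').alias('qty'))
 .groupby('ID').agg(F.collect_list(F.struct('fruit', 'qty')).alias('Result'))
 .show(truncate=False))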

Subset one array column with another (boolean) array column

I have a Dataframe like this (in Pyspark 2.3.1):
from pyspark.sql import Row
my_data = spark.createDataFrame([
    Row(a=[9, 3, 4], b=['a', 'b', 'c'], mask=[True, False, False]),
    Row(a=[7, 2, 6, 4], b=['w', 'x', 'y', 'z'], mask=[True, False, True, False])
])
my_data.show(truncate=False)
#+------------+------------+--------------------------+
#|a |b |mask |
#+------------+------------+--------------------------+
#|[9, 3, 4] |[a, b, c] |[true, false, false] |
#|[7, 2, 6, 4]|[w, x, y, z]|[true, false, true, false]|
#+------------+------------+--------------------------+
Now I'd like to use the mask column in order to subset the a and b columns:
my_desired_output = spark.createDataFrame([
    Row(a=[9], b=['a']),
    Row(a=[7, 6], b=['w', 'y'])
])
my_desired_output.show(truncate=False)
#+------+------+
#|a |b |
#+------+------+
#|[9] |[a] |
#|[7, 6]|[w, y]|
#+------+------+
What's the "idiomatic" way to achieve this? The current solution I have involves mapping over the underlying RDD and subsetting with NumPy, which seems inelegant:
import numpy as np

def subset_with_mask(row):
    mask = np.asarray(row.mask)
    a_masked = np.asarray(row.a)[mask].tolist()
    b_masked = np.asarray(row.b)[mask].tolist()
    return Row(a=a_masked, b=b_masked)

my_desired_output = spark.createDataFrame(my_data.rdd.map(subset_with_mask))
Is this the best way to go, or is there something better (less verbose and/or more efficient) I can do using Spark SQL tools?
One option is to use a UDF, which you can optionally specialize by the data type in the array:
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
def _mask_list(lst, mask):
    return np.asarray(lst)[mask].tolist()

mask_array_int = F.udf(_mask_list, T.ArrayType(T.IntegerType()))
mask_array_str = F.udf(_mask_list, T.ArrayType(T.StringType()))

my_desired_output = my_data
my_desired_output = my_desired_output.withColumn(
    'a', mask_array_int(F.col('a'), F.col('mask'))
)
my_desired_output = my_desired_output.withColumn(
    'b', mask_array_str(F.col('b'), F.col('mask'))
)
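A quick check, dropping the mask column, should reproduce the desired output shown in the question:

my_desired_output.drop('mask').show(truncate=False)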
The UDFs mentioned in the previous answer are probably the way to go prior to the array functions added in Spark 2.4. For the sake of completeness, here is a "pure SQL" implementation before 2.4.
from pyspark.sql.functions import *

df = my_data.withColumn("row", monotonically_increasing_id())
df1 = df.select("row", posexplode("a").alias("pos", "a"))
df2 = df.select("row", posexplode("b").alias("pos", "b"))
df3 = df.select("row", posexplode("mask").alias("pos", "mask"))

df1\
    .join(df2, ["row", "pos"])\
    .join(df3, ["row", "pos"])\
    .filter("mask")\
    .groupBy("row")\
    .agg(collect_list("a").alias("a"), collect_list("b").alias("b"))\
    .select("a", "b")\
    .show()
Output:
+------+------+
| a| b|
+------+------+
|[7, 6]|[w, y]|
| [9]| [a]|
+------+------+
A better way to do this is to use pyspark.sql.functions.expr, filter, and transform:
import pandas as pd
from pyspark.sql import (
    functions as F,
    SparkSession
)

spark = SparkSession.builder.master('local[4]').getOrCreate()

bool_df = pd.DataFrame([
    ['a', [0, 1, 2, 3, 4], [True]*4 + [False]],
    ['b', [5, 6, 7, 8, 9], [False, True, False, True, False]]
], columns=['id', 'int_arr', 'bool_arr'])
bool_sdf = spark.createDataFrame(bool_df)

def filter_with_mask(in_col, mask_col, out_name="masked_arr"):
    filt_input = f'arrays_zip({in_col}, {mask_col})'
    filt_func = f'x -> x.{mask_col}'
    trans_func = f'x -> x.{in_col}'
    result = F.expr(f'''transform(
        filter({filt_input}, {filt_func}), {trans_func}
    )''').alias(out_name)
    return result
Using the function:
bool_sdf.select(
    '*', filter_with_mask('int_arr', 'bool_arr')
).toPandas()
Results in:
id  int_arr          bool_arr                           masked_arr
a   [0, 1, 2, 3, 4]  [True, True, True, True, False]    [0, 1, 2, 3]
b   [5, 6, 7, 8, 9]  [False, True, False, True, False]  [6, 8]
This should be possible with pyspark >= 2.4.0 and python >= 3.6.
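On Spark 3.1+, the same zip-filter-transform idea can be written with the native Python API instead of an expr string; a minimal sketch, assuming the bool_sdf dataframe from above:

from pyspark.sql import functions as F

# keep only the zipped structs whose mask field is true, then project the values back out
masked = F.transform(
    F.filter(F.arrays_zip('int_arr', 'bool_arr'), lambda s: s['bool_arr']),
    lambda s: s['int_arr']
).alias('masked_arr')
bool_sdf.select('*', masked).show(truncate=False)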
