Alternative of groupby in Pyspark to improve performance of Pyspark code


My PySpark DataFrame looks like this. I need to remove the groupBy call from my PySpark code to improve its performance. The operation has to run on 100k records.
[Initial Data]
To create the DataFrame:
df = spark.createDataFrame([
(0, ['-9.53', '-9.35', '0.18']),
(1, ['-7.77', '-7.61', '0.16']),
(2, ['-5.80', '-5.71', '0.10']),
(0, ['1', '2', '3']),
(1, ['4', '5', '6']),
(2, ['8', '98', '32'])
], ["id", "Array"])
The expected output is produced by this code:
import pyspark.sql.functions as f
df.groupBy('id').agg(f.collect_list(f.col("Array")).alias('Array')).\
select("id",f.flatten("Array")).show()
The code above gives me the output below. I need to produce the same output without the groupBy function.
+---+-------------------------------+
|id |flatten(Array) |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+

If you don't want to do group by, you can use window functions:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df2 = df.select(
"id",
f.flatten(f.collect_list(f.col("Array")).over(Window.partitionBy("id"))).alias("Array")
).distinct()
df2.show(truncate=False)
+---+-------------------------------+
|id |Array |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+
You can also try
df.select(
'id',
f.explode('Array').alias('Array')
).groupBy('id').agg(f.collect_list('Array').alias('Array'))
Although I'm not sure if it'll be faster.
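To check any of the alternatives above, it can help to have a plain-Python reference for the result they should produce. The sketch below is not Spark code; it only models what collect_list followed by flatten computes per id, and the helper name flatten_by_id is made up for illustration. Keep in mind that a window partitioned by id still has to bring all rows with the same id together, so it is not guaranteed to be cheaper than groupBy, and that collect_list does not promise a particular element order after a shuffle.

```python
from collections import defaultdict

# Same sample rows as in the question: (id, Array)
rows = [
    (0, ['-9.53', '-9.35', '0.18']),
    (1, ['-7.77', '-7.61', '0.16']),
    (2, ['-5.80', '-5.71', '0.10']),
    (0, ['1', '2', '3']),
    (1, ['4', '5', '6']),
    (2, ['8', '98', '32']),
]


def flatten_by_id(rows):
    """Concatenate the arrays of all rows sharing an id, in input order."""
    merged = defaultdict(list)
    for row_id, values in rows:
        merged[row_id].extend(values)
    return dict(merged)


result = flatten_by_id(rows)
for row_id in sorted(result):
    print(row_id, result[row_id])
```

Comparing a Spark result against this reference is safest after sorting each array, since the Spark versions may return the elements in a different order.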

Related

Spark - Wide/sparse dataframe persistence

I want to persist a very wide Spark Dataframe (>100'000 columns) that is sparsely populated (>99% of values are null) while keeping only non-null values (to avoid storage cost):
What is the best format for such use case (HBase, Avro, Parquet, ...) ?
What should be specified Spark side to ignore nulls when writing?
Note that I've tried already Parquet and Avro with a simple df.write statement - for a df of size ca. 100x130k Parquet is performing the worst (ca. 55MB) vs. Avro (ca. 15MB). To me this suggests that ALL null values are stored.
Thanks !

Spark to JSON / SparseVector (from thebluephantom)
This uses PySpark with pyspark.ml. Convert it to Scala if you need to.
%python
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.linalg import SparseVector, VectorUDT
temp_rdd = sc.parallelize([
(0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
(1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])
schema = StructType([
StructField("label", DoubleType(), False),
StructField("features", VectorUDT(), False)
])
df = temp_rdd.toDF(schema)
df.printSchema()
df.write.json("/FileStore/V.json")
df2 = spark.read.schema(schema).json("/FileStore/V.json")
df2.show()
returns upon read:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 1.0|(4,[0,2],[-1.0,0.5])|
| 0.0| (4,[1,3],[1.0,5.5])|
+-----+--------------------+
Spark to Avro / Avro2TF (from py-r)
The Avro2TF library presented in this tutorial seems to be an interesting alternative that directly leverages Avro. As a result, a sparse vector would be encoded as follows:
+---------------------+--------------------+
|genreFeatures_indices|genreFeatures_values|
+---------------------+--------------------+
| [2, 4, 1, 8, 11]|[1.0, 1.0, 1.0, 1...|
| [11, 10, 3]| [1.0, 1.0, 1.0]|
| [2, 4, 8]| [1.0, 1.0, 1.0]|
| [11, 10]| [1.0, 1.0]|
| [4, 8]| [1.0, 1.0]|
| [2, 4, 7, 3]|[1.0, 1.0, 1.0, 1.0]|
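Both answers come down to the same encoding: keep the positions and the values of the populated cells and drop everything else. Here is a minimal plain-Python sketch of that idea, assuming a row is a list in which missing cells are None. The function names to_sparse and to_dense are hypothetical and are not part of Spark or Avro2TF. Note that SparseVector treats missing entries as 0.0 rather than null, so this model only matches it when "empty" and 0.0 mean the same thing for your data.

```python
def to_sparse(row):
    """Return (size, indices, values) for the non-null cells of a row."""
    indices = [i for i, v in enumerate(row) if v is not None]
    values = [row[i] for i in indices]
    return len(row), indices, values


def to_dense(size, indices, values):
    """Rebuild the full row, filling the missing cells with None."""
    dense = [None] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


row = [None, 1.0, None, 5.5]
size, indices, values = to_sparse(row)
print(size, indices, values)  # 4 [1, 3] [1.0, 5.5]
restored = to_dense(size, indices, values)
```

The printed triple has the same shape as the (4,[1,3],[1.0,5.5]) vector shown in the JSON answer.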

Append a value after every element in PySpark list Dataframe

I have a DataFrame like this:
Data ID
[1,2,3,4] 22
I want to create a new column in which every value from the Data field is followed by the ID, separated by the | symbol, and the resulting entries are joined with the ~ symbol, like below:
Data ID New_Column
[1,2,3,4] 22 [1|22~2|22~3|22~4|22]
Note: the array in the Data field does not have a fixed size. It may be empty or contain any number of entries.
Can anyone please help me solve this?

package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object DF extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val df = Seq(
(22, Seq(1,2,3,4)),
(23, Seq(1,2,3,4,5,6,7,8)),
(24, Seq())
).toDF("ID", "Data")
val arrUDF = udf((id: Long, array: Seq[Long]) => {
val r = array.size match {
case 0 => ""
case _ => array.map(x => s"$x|$id").mkString("~")
}
s"[$r]"
})
val resDF = df.withColumn("New_Column", lit(arrUDF('ID, 'Data)))
resDF.show(false)
//+---+------------------------+-----------------------------------------+
//|ID |Data |New_Column |
//+---+------------------------+-----------------------------------------+
//|22 |[1, 2, 3, 4] |[1|22~2|22~3|22~4|22] |
//|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
//|24 |[] |[] |
//+---+------------------------+-----------------------------------------+
}

Spark 2.4+
The PySpark equivalent looks like this:
import pyspark.sql.functions as f
df = spark.createDataFrame([(22, [1,2,3,4]),(23, [1,2,3,4,5,6,7,8]),(24, [])],['Id','Data'])
df.show()
+---+--------------------+
| Id| Data|
+---+--------------------+
| 22| [1, 2, 3, 4]|
| 23|[1, 2, 3, 4, 5, 6...|
| 24| []|
+---+--------------------+
df.withColumn('ff', f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))'''))).show(20,False)
+---+------------------------+---------------------------------------+
|Id |Data |ff |
+---+------------------------+---------------------------------------+
|22 |[1, 2, 3, 4] |1|22~2|22~3|22~4|22 |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23|
|24 |[] | |
+---+------------------------+---------------------------------------+
If you want the final output as an array:
df.withColumn('ff',f.array(f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))''')))).show(20,False)
+---+------------------------+-----------------------------------------+
|Id |Data |ff |
+---+------------------------+-----------------------------------------+
|22 |[1, 2, 3, 4] |[1|22~2|22~3|22~4|22] |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
|24 |[] |[] |
+---+------------------------+-----------------------------------------+
Hope this helps

A udf can help:
def func(array, suffix):
    # Build "value|suffix" for each element and join the pieces with "~"
    return '~'.join([str(x) + '|' + str(suffix) for x in array])
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
my_udf = F.udf(func, StringType())
df.withColumn("New_Column", my_udf("Data", "ID")).show()
prints
+------------+---+-------------------+
| Data| ID| New_Column |
+------------+---+-------------------+
|[1, 2, 3, 4]| 22|1|22~2|22~3|22~4|22|
+------------+---+-------------------+
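Because func is plain Python, its behaviour can be checked without a Spark session. The sketch below repeats func from this answer and adds a hypothetical wrapper, bracketed, that puts the result in the square brackets shown in the question. If the Data column can be null, the UDF would receive None and func would need a guard for that case. This sketch does not handle it.

```python
def func(array, suffix):
    # Build "value|suffix" for each element and join the pieces with "~"
    return '~'.join([str(x) + '|' + str(suffix) for x in array])


def bracketed(array, suffix):
    # Hypothetical helper: wrap the joined string in square brackets
    return '[' + func(array, suffix) + ']'


print(func([1, 2, 3, 4], 22))       # 1|22~2|22~3|22~4|22
print(bracketed([1, 2, 3, 4], 22))  # [1|22~2|22~3|22~4|22]
print(bracketed([], 24))            # []
```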

How to aggregate on percentiles in PySpark?

I want to be able to aggregate based on percentiles (or, more accurately in my case, complement percentiles).
Consider the following code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
['a', 1, 'w'],
['a', 1, 'y'],
['a', 11, 'x'],
['a', 111, 'zzz'],
['a', 1111, 'zz'],
['a', 1111, 'zz'],
['b', 2, 'w'],
['b', 2, 'w'],
['b', 2, 'w'],
['b', 22, 'y'],
['b', 2222, 'x'],
['b', 2222, 'z'],
],
['grp', 'val1', 'val2'])
grouped = df.groupby('grp').agg(
F.count('*').alias('count'),
F.expr('percentile(val1, array(0.5, 0.75)) as percentiles'),
# val2 manipulation....
)
grouped.show()
In addition to the grouping and the percentile calculation, I would like to count the distinct values of val2 in the complement of each percentile, i.e. among the rows whose val1 lies above it.
For group b, for example, the 50th percentile of val1 is 12, and its complement is the last 3 rows, which contain 3 distinct values of val2 (y, x, z).
Similarly, the 75th percentile of val1 is 1672, and its complement is the last 2 rows, which contain 2 distinct values (x, z).
So my desired output would be:
+---+-----+--------------+--------------+
|grp|count|   percentiles|distinct count|
+---+-----+--------------+--------------+
|  a|    6| [61.0, 861.0]|        [2, 1]|
|  b|    6|[12.0, 1672.0]|        [3, 2]|
+---+-----+--------------+--------------+
How can I achieve this?

For Spark 2.3.2, you can use a window function to calculate the percentiles, keep the val2 values whose val1 exceeds each percentile, and then aggregate:
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('grp')
df1 = df.withColumn('percentiles', F.expr('percentile(val1, array(0.5, 0.75))').over(w1)) \
.withColumn('c1', F.expr('IF(val1>percentiles[0],val2,NULL)')) \
.withColumn('c2', F.expr('IF(val1>percentiles[1],val2,NULL)'))
grouped = df1.groupby('grp').agg(
F.count('*').alias('count'),
F.first('percentiles').alias('percentiles'),
F.array(F.countDistinct('c1'), F.countDistinct('c2')).alias('distinct_count')
)
grouped.show()
+---+-----+--------------+--------------+
|grp|count| percentiles|distinct_count|
+---+-----+--------------+--------------+
| b| 6|[12.0, 1672.0]| [3, 2]|
| a| 6| [61.0, 861.0]| [2, 1]|
+---+-----+--------------+--------------+
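The numbers in this output can be reproduced by hand. The plain-Python sketch below assumes that the percentile is computed with linear interpolation between the two closest ranks, which matches the values in the question (12 and 1672 for group b). The function names percentile and complement_distinct are made up for this illustration. This is not Spark code.

```python
import math


def percentile(sorted_vals, p):
    """Percentile with linear interpolation between the closest ranks."""
    pos = (len(sorted_vals) - 1) * p
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def complement_distinct(rows, ps=(0.5, 0.75)):
    """For each percentile, count distinct val2 among rows with val1 above it."""
    vals = sorted(v for v, _ in rows)
    cutoffs = [percentile(vals, p) for p in ps]
    counts = [len({tag for v, tag in rows if v > c}) for c in cutoffs]
    return cutoffs, counts


# (val1, val2) pairs of the two groups from the question
group_a = [(1, 'w'), (1, 'y'), (11, 'x'), (111, 'zzz'), (1111, 'zz'), (1111, 'zz')]
group_b = [(2, 'w'), (2, 'w'), (2, 'w'), (22, 'y'), (2222, 'x'), (2222, 'z')]

print(complement_distinct(group_a))  # ([61.0, 861.0], [2, 1])
print(complement_distinct(group_b))  # ([12.0, 1672.0], [3, 2])
```

For group b the 50th percentile sits at position 2.5 of the sorted values, halfway between 2 and 22, which gives 12. The three rows above it carry y, x and z.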

Subset one array column with another (boolean) array column

I have a Dataframe like this (in Pyspark 2.3.1):
from pyspark.sql import Row
my_data = spark.createDataFrame([
Row(a=[9, 3, 4], b=['a', 'b', 'c'], mask=[True, False, False]),
Row(a=[7, 2, 6, 4], b=['w', 'x', 'y', 'z'], mask=[True, False, True, False])
])
my_data.show(truncate=False)
#+------------+------------+--------------------------+
#|a |b |mask |
#+------------+------------+--------------------------+
#|[9, 3, 4] |[a, b, c] |[true, false, false] |
#|[7, 2, 6, 4]|[w, x, y, z]|[true, false, true, false]|
#+------------+------------+--------------------------+
Now I'd like to use the mask column in order to subset the a and b columns:
my_desired_output = spark.createDataFrame([
Row(a=[9], b=['a']),
Row(a=[7, 6], b=['w', 'y'])
])
my_desired_output.show(truncate=False)
#+------+------+
#|a |b |
#+------+------+
#|[9] |[a] |
#|[7, 6]|[w, y]|
#+------+------+
What's the "idiomatic" way to achieve this? My current solution maps over the underlying RDD and subsets with NumPy, which seems inelegant:
import numpy as np
def subset_with_mask(row):
    # Use the boolean mask to index the NumPy versions of both arrays
    mask = np.asarray(row.mask)
    a_masked = np.asarray(row.a)[mask].tolist()
    b_masked = np.asarray(row.b)[mask].tolist()
    return Row(a=a_masked, b=b_masked)
my_desired_output = spark.createDataFrame(my_data.rdd.map(subset_with_mask))
Is this the best way to go, or is there something better (less verbose and/or more efficient) I can do using Spark SQL tools?

One option is to use a UDF, which you can optionally specialize by the data type in the array:
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
def _mask_list(lst, mask):
    # Boolean-index the list via NumPy and convert back to a Python list
    return np.asarray(lst)[mask].tolist()
mask_array_int = F.udf(_mask_list, T.ArrayType(T.IntegerType()))
mask_array_str = F.udf(_mask_list, T.ArrayType(T.StringType()))
my_desired_output = my_data
my_desired_output = my_desired_output.withColumn(
'a', mask_array_int(F.col('a'), F.col('mask'))
)
my_desired_output = my_desired_output.withColumn(
'b', mask_array_str(F.col('b'), F.col('mask'))
)

The UDFs mentioned in the previous answer are probably the way to go before the array functions added in Spark 2.4. For the sake of completeness, here is a "pure SQL" implementation that works before 2.4.
from pyspark.sql.functions import *
df = my_data.withColumn("row", monotonically_increasing_id())
df1 = df.select("row", posexplode("a").alias("pos", "a"))
df2 = df.select("row", posexplode("b").alias("pos", "b"))
df3 = df.select("row", posexplode("mask").alias("pos", "mask"))
df1\
.join(df2, ["row", "pos"])\
.join(df3, ["row", "pos"])\
.filter("mask")\
.groupBy("row")\
.agg(collect_list("a").alias("a"), collect_list("b").alias("b"))\
.select("a", "b")\
.show()
Output:
+------+------+
| a| b|
+------+------+
|[7, 6]|[w, y]|
| [9]| [a]|
+------+------+

A better way to do this is to use pyspark.sql.functions.expr, filter, and transform:
import pandas as pd
from pyspark.sql import (
functions as F,
SparkSession
)
spark = SparkSession.builder.master('local[4]').getOrCreate()
bool_df = pd.DataFrame([
['a', [0, 1, 2, 3, 4], [True]*4 + [False]],
['b', [5, 6, 7, 8, 9], [False, True, False, True, False]]
], columns=['id', 'int_arr', 'bool_arr'])
bool_sdf = spark.createDataFrame(bool_df)
def filter_with_mask(in_col, mask_col, out_name="masked_arr"):
    # Zip values with their mask, keep the pairs whose mask is true,
    # then project the values back out of the remaining pairs.
    filt_input = f'arrays_zip({in_col}, {mask_col})'
    filt_func = f'x -> x.{mask_col}'
    trans_func = f'x -> x.{in_col}'
    result = F.expr(f'''transform(
        filter({filt_input}, {filt_func}), {trans_func}
    )''').alias(out_name)
    return result
Using the function:
bool_sdf.select(
    '*', filter_with_mask('int_arr', 'bool_arr')
).toPandas()
Results in:
id int_arr bool_arr masked_arr
a [0, 1, 2, 3, 4] [True, True, True, True, False] [0, 1, 2, 3]
b [5, 6, 7, 8, 9] [False, True, False, True, False] [6, 8]
This should be possible with pyspark >= 2.4.0 and python >= 3.6.
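The three SQL steps can be modelled in plain Python to see what each one contributes. This is only a sketch of the logic, not Spark code, and the name mask_list is hypothetical. One difference to be aware of: Python's zip stops at the shorter input, whereas arrays_zip pads the shorter array with nulls, so the two only agree when both arrays have the same length.

```python
def mask_list(values, mask):
    """Keep the elements of values whose mask entry is true."""
    zipped = list(zip(values, mask))                 # arrays_zip(values, mask)
    kept = [pair for pair in zipped if pair[1]]      # filter(..., x -> x.mask)
    return [pair[0] for pair in kept]                # transform(..., x -> x.values)


row_a = mask_list([0, 1, 2, 3, 4], [True] * 4 + [False])
row_b = mask_list([5, 6, 7, 8, 9], [False, True, False, True, False])
print(row_a)  # [0, 1, 2, 3]
print(row_b)  # [6, 8]
```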

How to do Rdd and broadcasted Rdd multiplication in pyspark?

I have two data frames like below:
data frame1:(df1)
+---+----------+
|id |features |
+---+----------+
|8 |[5, 4, 5] |
|9 |[4, 5, 2] |
+---+----------+
data frame2:(df2)
+---+----------+
|id |features |
+---+----------+
|1 |[1, 2, 3] |
|2 |[4, 5, 6] |
+---+----------+
After that, I converted the DataFrame to an RDD:
rdd1=df1.rdd
Calling rdd1.collect() gives:
[Row(id=8, f=[5, 4, 5]), Row(id=9, f=[4, 5, 2])]
rdd2=df2.rdd
broadcastedrddif = sc.broadcast(rdd2.collectAsMap())
Now broadcastedrddif.value gives:
{1: [1, 2, 3], 2: [4, 5, 6]}
Now I want the sum of the element-wise products of each row of rdd1 with each entry of broadcastedrddif, i.e. it should return output like below:
((8,[(1,(5*1+4*2+5*3)),(2,(5*4+4*5+5*6))]),(9,[(1,(4*1+5*2+2*3)),(2,(4*4+5*5+2*6))]))
So my final output should be
((8,[(1,28),(2,70)]),(9,[(1,20),(2,53)]))
where (1, 28) is a tuple, not a float.
Please help me with this.

I did not understand why you used sc.broadcast(), but I used it anyway.
mapValues on the last RDD is very useful in this case, and I used a list comprehension to run the computation against the dictionary.
x1=sc.parallelize([[8,5,4,5], [9,4,5,2]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x1.collect()
x2=sc.parallelize([[1,1,2,3], [2,4,5,6]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x2.collect()
# I started directly from an RDD because it is simpler to test
broadcastedrddif = sc.broadcast(x2.collectAsMap())
d2=broadcastedrddif.value
def sum_prod(x,y):
    # Sum of element-wise products (dot product) of two equal-length sequences
    c=0
    for i in range(0,len(x)):
        c+=x[i]*y[i]
    return c
x1.mapValues(lambda x: [(i, sum_prod(list(x),list(d2[i]))) for i in [k for k in d2.keys()]]).collect()
Out[19]: [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
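The computation itself is a dot product of every vector on the left with every vector in the broadcast dictionary. A plain-Python sketch of the same logic, without Spark and using ordinary dicts in place of the RDD and the broadcast value (the names left and right are assumptions for this example), makes the expected result easy to verify:

```python
def sum_prod(x, y):
    """Sum of element-wise products of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


left = {8: [5, 4, 5], 9: [4, 5, 2]}   # stands in for rdd1
right = {1: [1, 2, 3], 2: [4, 5, 6]}  # stands in for broadcastedrddif.value

result = [
    (key, [(other, sum_prod(vec, other_vec)) for other, other_vec in sorted(right.items())])
    for key, vec in sorted(left.items())
]
print(result)  # [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
```

Each inner pair stays a tuple such as (1, 28), as requested in the question.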
