My PySpark DataFrame looks like this. I need to remove the groupBy from my PySpark code to improve its performance. The operations run on about 100k records.
[Initial Data]
To create the DataFrame:
df = spark.createDataFrame([
(0, ['-9.53', '-9.35', '0.18']),
(1, ['-7.77', '-7.61', '0.16']),
(2, ['-5.80', '-5.71', '0.10']),
(0, ['1', '2', '3']),
(1, ['4', '5', '6']),
(2, ['8', '98', '32'])
], ["id", "Array"])
The expected output is produced by this code:
import pyspark.sql.functions as f
df.groupBy('id').agg(f.collect_list(f.col("Array")).alias('Array')).\
select("id",f.flatten("Array")).show()
The code above gives the output below. I need to produce the same output without using groupBy.
+---+-------------------------------+
|id |flatten(Array) |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+
If you don't want to do group by, you can use window functions:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df2 = df.select(
"id",
f.flatten(f.collect_list(f.col("Array")).over(Window.partitionBy("id"))).alias("Array")
).distinct()
df2.show(truncate=False)
+---+-------------------------------+
|id |Array |
+---+-------------------------------+
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
+---+-------------------------------+
You can also try
df.select(
'id',
f.explode('Array').alias('Array')
).groupBy('id').agg(f.collect_list('Array').alias('Array'))
Note that this still uses groupBy, and I'm not sure it will be faster.
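To compare the two approaches above on a small sample, it can help to have a Spark-free reference for the expected result. The following is only a sketch in plain Python; the helper name flatten_by_id is made up for illustration and is not part of PySpark.

```python
from collections import defaultdict


def flatten_by_id(rows):
    """Concatenate the arrays of all rows sharing an id, in input order.

    Hypothetical helper for checking results on a small sample.
    """
    merged = defaultdict(list)
    for row_id, values in rows:
        merged[row_id].extend(values)
    return dict(merged)


rows = [
    (0, ['-9.53', '-9.35', '0.18']),
    (1, ['-7.77', '-7.61', '0.16']),
    (2, ['-5.80', '-5.71', '0.10']),
    (0, ['1', '2', '3']),
    (1, ['4', '5', '6']),
    (2, ['8', '98', '32']),
]
expected = flatten_by_id(rows)
```

Spark does not guarantee the order of elements returned by collect_list after a shuffle, so when comparing against Spark output it is safer to compare sorted arrays.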
I want to persist a very wide Spark Dataframe (>100'000 columns) that is sparsely populated (>99% of values are null) while keeping only non-null values (to avoid storage cost):
What is the best format for such use case (HBase, Avro, Parquet, ...) ?
What should be specified Spark side to ignore nulls when writing?
Note that I've tried already Parquet and Avro with a simple df.write statement - for a df of size ca. 100x130k Parquet is performing the worst (ca. 55MB) vs. Avro (ca. 15MB). To me this suggests that ALL null values are stored.
Thanks !
Spark to JSON / SparseVector (from thebluephantom)
This is in PySpark, using pyspark.ml. Translate it to Scala if needed.
%python
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.linalg import SparseVector, VectorUDT
temp_rdd = sc.parallelize([
(0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
(1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])
schema = StructType([
StructField("label", DoubleType(), False),
StructField("features", VectorUDT(), False)
])
df = temp_rdd.toDF(schema)
df.printSchema()
df.write.json("/FileStore/V.json")
df2 = spark.read.schema(schema).json("/FileStore/V.json")
df2.show()
returns upon read:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 1.0|(4,[0,2],[-1.0,0.5])|
| 0.0| (4,[1,3],[1.0,5.5])|
+-----+--------------------+
Spark to Avro / Avro2TF (from py-r)
The Avro2TF library presented in this tutorial seems to be an interesting alternative that directly leverages Avro. As a result, a sparse vector would be encoded as follows:
+---------------------+--------------------+
|genreFeatures_indices|genreFeatures_values|
+---------------------+--------------------+
| [2, 4, 1, 8, 11]|[1.0, 1.0, 1.0, 1...|
| [11, 10, 3]| [1.0, 1.0, 1.0]|
| [2, 4, 8]| [1.0, 1.0, 1.0]|
| [11, 10]| [1.0, 1.0]|
| [4, 8]| [1.0, 1.0]|
| [2, 4, 7, 3]|[1.0, 1.0, 1.0, 1.0]|
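The indices/values encoding shown above (and the SparseVector representation before it) comes down to storing only the positions and values of non-null cells. A minimal plain-Python sketch of that idea follows; the function names to_sparse and to_dense are illustrative assumptions, not part of Spark or Avro2TF.

```python
def to_sparse(row):
    """Keep only non-null cells as (size, indices, values)."""
    indices = [i for i, v in enumerate(row) if v is not None]
    values = [row[i] for i in indices]
    return len(row), indices, values


def to_dense(size, indices, values):
    """Rebuild the full row, filling the missing positions with None."""
    row = [None] * size
    for i, v in zip(indices, values):
        row[i] = v
    return row


size, indices, values = to_sparse([None, 1.0, None, 5.5])
restored = to_dense(size, indices, values)
```

With more than 99% nulls, the stored size is then driven by the number of non-null cells rather than by the number of columns.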
I have a DataFrame like this:
Data ID
[1,2,3,4] 22
I want to create a new column in which each value from the Data array is joined to the ID with a | symbol, and the resulting entries are separated by a ~ symbol, like below:
Data ID New_Column
[1,2,3,4] 22 [1|22~2|22~3|22~4|22]
Note: the array in the Data field does not have a fixed size. It may be empty or contain any number of entries.
Can anyone please help me solve this?
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object DF extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val df = Seq(
(22, Seq(1,2,3,4)),
(23, Seq(1,2,3,4,5,6,7,8)),
(24, Seq())
).toDF("ID", "Data")
val arrUDF = udf((id: Long, array: Seq[Long]) => {
val r = array.size match {
case 0 => ""
case _ => array.map(x => s"$x|$id").mkString("~")
}
s"[$r]"
})
val resDF = df.withColumn("New_Column", arrUDF('ID, 'Data))
resDF.show(false)
//+---+------------------------+-----------------------------------------+
//|ID |Data |New_Column |
//+---+------------------------+-----------------------------------------+
//|22 |[1, 2, 3, 4] |[1|22~2|22~3|22~4|22] |
//|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
//|24 |[] |[] |
//+---+------------------------+-----------------------------------------+
}
Spark 2.4+
The PySpark equivalent, assuming import pyspark.sql.functions as f, looks like this:
df = spark.createDataFrame([(22, [1,2,3,4]),(23, [1,2,3,4,5,6,7,8]),(24, [])],['Id','Data'])
df.show()
+---+--------------------+
| Id| Data|
+---+--------------------+
| 22| [1, 2, 3, 4]|
| 23|[1, 2, 3, 4, 5, 6...|
| 24| []|
+---+--------------------+
df.withColumn('ff', f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))'''))).show(20,False)
+---+------------------------+---------------------------------------+
|Id |Data |ff |
+---+------------------------+---------------------------------------+
|22 |[1, 2, 3, 4] |1|22~2|22~3|22~4|22 |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23|
|24 |[] | |
+---+------------------------+---------------------------------------+
If you want the final output as an array:
df.withColumn('ff',f.array(f.when(f.size('Data')==0,'').otherwise(f.expr('''concat_ws('~',transform(Data, x->concat(x,'|',Id)))''')))).show(20,False)
+---+------------------------+-----------------------------------------+
|Id |Data |ff |
+---+------------------------+-----------------------------------------+
|22 |[1, 2, 3, 4] |[1|22~2|22~3|22~4|22] |
|23 |[1, 2, 3, 4, 5, 6, 7, 8]|[1|23~2|23~3|23~4|23~5|23~6|23~7|23~8|23]|
|24 |[] |[] |
+---+------------------------+-----------------------------------------+
Hope this helps
A udf can help:
def func(array, suffix):
    # Join every element to the suffix with '|', then join the pieces with '~'
    return '~'.join([str(x) + '|' + str(suffix) for x in array])
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
my_udf = F.udf(func, StringType())
df.withColumn("New_Column", my_udf("Data", "ID")).show()
prints
+------------+---+-------------------+
|        Data| ID|         New_Column|
+------------+---+-------------------+
|[1, 2, 3, 4]| 22|1|22~2|22~3|22~4|22|
+------------+---+-------------------+
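Because the UDF body is an ordinary Python function, it can be checked without a Spark session. The sketch below is a variant that also wraps the result in brackets, as in the question, and treats a null array like an empty one. The name join_with_id and the null handling are assumptions added for illustration.

```python
def join_with_id(array, suffix):
    """Build '[v1|id~v2|id~...]'; an empty or null array gives '[]'."""
    body = '~'.join(f'{x}|{suffix}' for x in (array or []))
    return f'[{body}]'


sample = join_with_id([1, 2, 3, 4], 22)
empty = join_with_id([], 24)
```

It can then be registered the same way as above, for example with F.udf(join_with_id, StringType()).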
I want to be able to aggregate based on percentiles (or, more accurately in my case, complement percentiles).
Consider the following code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
['a', 1, 'w'],
['a', 1, 'y'],
['a', 11, 'x'],
['a', 111, 'zzz'],
['a', 1111, 'zz'],
['a', 1111, 'zz'],
['b', 2, 'w'],
['b', 2, 'w'],
['b', 2, 'w'],
['b', 22, 'y'],
['b', 2222, 'x'],
['b', 2222, 'z'],
],
['grp', 'val1', 'val2'])
grouped = df.groupby('grp').agg(
F.count('*').alias('count'),
F.expr('percentile(val1, array(0.5, 0.75)) as percentiles'),
# val2 manipulation....
)
grouped.show()
In addition to the grouping and the percentiles calculation, I would like to count the distinct values of val2 in the complement percentiles respectively.
For group b, for example, the 50th percentile of val1 is 12, and its complement is the last 3 rows, which contain 3 distinct values of val2 (y, x, z).
Similarly, the 75th percentile of val1 is 1672, and its complement is the last 2 rows, which contain 2 distinct values (x, z).
So my desired output would be:
+---+-----+--------------+--------------+
|grp|count|   percentiles|distinct count|
+---+-----+--------------+--------------+
|  a|    6| [61.0, 861.0]|        [2, 1]|
|  b|    6|[12.0, 1672.0]|        [3, 2]|
+---+-----+--------------+--------------+
How can I achieve this?
For Spark 2.3.2, you can use a Window function to calculate the percentiles, keep the val2 values of the rows whose val1 exceeds each percentile, and then aggregate:
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('grp')
df1 = df.withColumn('percentiles', F.expr('percentile(val1, array(0.5, 0.75))').over(w1)) \
.withColumn('c1', F.expr('IF(val1>percentiles[0],val2,NULL)')) \
.withColumn('c2', F.expr('IF(val1>percentiles[1],val2,NULL)'))
grouped = df1.groupby('grp').agg(
F.count('*').alias('count'),
F.first('percentiles').alias('percentiles'),
F.array(F.countDistinct('c1'), F.countDistinct('c2')).alias('distinct_count')
)
grouped.show()
+---+-----+--------------+--------------+
|grp|count| percentiles|distinct_count|
+---+-----+--------------+--------------+
| b| 6|[12.0, 1672.0]| [3, 2]|
| a| 6| [61.0, 861.0]| [2, 1]|
+---+-----+--------------+--------------+
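The numbers in this output can be reproduced by hand. The sketch below is a plain-Python reference, assuming the percentile is computed by linear interpolation between the closest ranks (which matches the values shown above) and that the complement is every row with val1 strictly greater than the percentile, as in the IF(val1>percentiles[i], ...) expressions. The function names are illustrative only.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def complement_distinct(rows, probs):
    """Return the percentiles and the distinct val2 counts above each one."""
    vals = sorted(v for v, _ in rows)
    cuts = [percentile(vals, p) for p in probs]
    counts = [len({tag for v, tag in rows if v > cut}) for cut in cuts]
    return cuts, counts


group_a = [(1, 'w'), (1, 'y'), (11, 'x'), (111, 'zzz'), (1111, 'zz'), (1111, 'zz')]
group_b = [(2, 'w'), (2, 'w'), (2, 'w'), (22, 'y'), (2222, 'x'), (2222, 'z')]
result_a = complement_distinct(group_a, [0.5, 0.75])
result_b = complement_distinct(group_b, [0.5, 0.75])
```

For group b, the position for 0.5 is (6 - 1) * 0.5 = 2.5, halfway between 2 and 22, which gives 12.0.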
I have a Dataframe like this (in Pyspark 2.3.1):
from pyspark.sql import Row
my_data = spark.createDataFrame([
Row(a=[9, 3, 4], b=['a', 'b', 'c'], mask=[True, False, False]),
Row(a=[7, 2, 6, 4], b=['w', 'x', 'y', 'z'], mask=[True, False, True, False])
])
my_data.show(truncate=False)
#+------------+------------+--------------------------+
#|a |b |mask |
#+------------+------------+--------------------------+
#|[9, 3, 4] |[a, b, c] |[true, false, false] |
#|[7, 2, 6, 4]|[w, x, y, z]|[true, false, true, false]|
#+------------+------------+--------------------------+
Now I'd like to use the mask column in order to subset the a and b columns:
my_desired_output = spark.createDataFrame([
Row(a=[9], b=['a']),
Row(a=[7, 6], b=['w', 'y'])
])
my_desired_output.show(truncate=False)
#+------+------+
#|a |b |
#+------+------+
#|[9] |[a] |
#|[7, 6]|[w, y]|
#+------+------+
What's the "idiomatic" way to achieve this? The current solution I have involves map-ing over the underlying RDD and subsetting with Numpy, which seems inelegant:
import numpy as np
def subset_with_mask(row):
    # Boolean-index both arrays with the row's mask
    mask = np.asarray(row.mask)
    a_masked = np.asarray(row.a)[mask].tolist()
    b_masked = np.asarray(row.b)[mask].tolist()
    return Row(a=a_masked, b=b_masked)
my_desired_output = spark.createDataFrame(my_data.rdd.map(subset_with_mask))
Is this the best way to go, or is there something better (less verbose and/or more efficient) I can do using Spark SQL tools?
One option is to use a UDF, which you can optionally specialize by the data type in the array:
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
def _mask_list(lst, mask):
    # Keep only the elements whose mask entry is True
    return np.asarray(lst)[mask].tolist()
mask_array_int = F.udf(_mask_list, T.ArrayType(T.IntegerType()))
mask_array_str = F.udf(_mask_list, T.ArrayType(T.StringType()))
my_desired_output = my_data
my_desired_output = my_desired_output.withColumn(
'a', mask_array_int(F.col('a'), F.col('mask'))
)
my_desired_output = my_desired_output.withColumn(
'b', mask_array_str(F.col('b'), F.col('mask'))
)
The UDFs mentioned in the previous answer are probably the way to go before the array functions added in Spark 2.4. For the sake of completeness, here is a "pure SQL" implementation that works before 2.4.
from pyspark.sql.functions import *
df = my_data.withColumn("row", monotonically_increasing_id())
df1 = df.select("row", posexplode("a").alias("pos", "a"))
df2 = df.select("row", posexplode("b").alias("pos", "b"))
df3 = df.select("row", posexplode("mask").alias("pos", "mask"))
df1\
.join(df2, ["row", "pos"])\
.join(df3, ["row", "pos"])\
.filter("mask")\
.groupBy("row")\
.agg(collect_list("a").alias("a"), collect_list("b").alias("b"))\
.select("a", "b")\
.show()
Output:
+------+------+
| a| b|
+------+------+
|[7, 6]|[w, y]|
| [9]| [a]|
+------+------+
A better way to do this is to use pyspark.sql.functions.expr with the SQL higher-order functions filter and transform:
import pandas as pd
from pyspark.sql import (
functions as F,
SparkSession
)
spark = SparkSession.builder.master('local[4]').getOrCreate()
bool_df = pd.DataFrame([
['a', [0, 1, 2, 3, 4], [True]*4 + [False]],
['b', [5, 6, 7, 8, 9], [False, True, False, True, False]]
], columns=['id', 'int_arr', 'bool_arr'])
bool_sdf = spark.createDataFrame(bool_df)
def filter_with_mask(in_col, mask_col, out_name="masked_arr"):
    # Zip values with their mask, keep the pairs whose mask is true,
    # then project the values back out of the zipped structs
    filt_input = f'arrays_zip({in_col}, {mask_col})'
    filt_func = f'x -> x.{mask_col}'
    trans_func = f'x -> x.{in_col}'
    result = F.expr(f'''transform(
        filter({filt_input}, {filt_func}), {trans_func}
    )''').alias(out_name)
    return result
Using the function:
bool_sdf.select(
    '*', filter_with_mask('int_arr', 'bool_arr')
).toPandas()
Results in:
id int_arr bool_arr masked_arr
a [0, 1, 2, 3, 4] [True, True, True, True, False] [0, 1, 2, 3]
b [5, 6, 7, 8, 9] [False, True, False, True, False] [6, 8]
This should be possible with pyspark >= 2.4.0 and python >= 3.6.
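The SQL expression above performs three steps: zip, filter, project. A plain-Python sketch of the same steps can serve as a reference when checking results on a few rows. The name apply_mask is made up for illustration.

```python
def apply_mask(values, mask):
    """Mirror arrays_zip -> filter -> transform on two Python lists."""
    zipped = list(zip(values, mask))               # arrays_zip(values, mask)
    kept = [pair for pair in zipped if pair[1]]    # filter(..., x -> x.mask)
    return [pair[0] for pair in kept]              # transform(..., x -> x.values)


masked_a = apply_mask([7, 2, 6, 4], [True, False, True, False])
masked_b = apply_mask(['a', 'b', 'c'], [True, False, False])
```

The standard library's itertools.compress(values, mask) gives the same result in a single call.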
I have two data frames like the ones below:
data frame1:(df1)
+---+----------+
|id |features |
+---+----------+
|8 |[5, 4, 5] |
|9 |[4, 5, 2] |
+---+----------+
data frame2:(df2)
+---+----------+
|id |features |
+---+----------+
|1 |[1, 2, 3] |
|2 |[4, 5, 6] |
+---+----------+
After that, I converted the DataFrames to RDDs:
rdd1=df1.rdd
If I run rdd1.collect(), the result is:
[Row(id=8, f=[5, 4, 5]), Row(id=9, f=[4, 5, 2])]
rdd2=df2.rdd
broadcastedrddif = sc.broadcast(rdd2.collectAsMap())
Now broadcastedrddif.value gives:
{1: [1, 2, 3], 2: [4, 5, 6]}
Now I want the sum of products of each vector in rdd1 with each vector in broadcastedrddif, i.e. it should return output like below:
((8,[(1,(5*1+4*2+5*3)),(2,(5*4+4*5+5*6))]),(9,[(1,(4*1+5*2+2*3)),(2,(4*4+5*5+2*6))]))
so my final output should be
((8,[(1,28),(2,70)]),(9,[(1,20),(2,53)]))
where (1, 28) is a tuple not a float.
Please help me on this.
I did not understand why you used sc.broadcast(), but I used it anyway.
mapValues on the last RDD is very useful in this case, and I used a list comprehension to compute the sums from the dictionary.
x1=sc.parallelize([[8,5,4,5], [9,4,5,2]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x1.collect()
x2=sc.parallelize([[1,1,2,3], [2,4,5,6]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x2.collect()
# I built the RDDs directly because that is simpler to test
broadcastedrddif = sc.broadcast(x2.collectAsMap())
d2=broadcastedrddif.value
def sum_prod(x, y):
    # Sum of element-wise products of two equal-length sequences
    c = 0
    for i in range(0, len(x)):
        c += x[i] * y[i]
    return c
x1.mapValues(lambda x: [(i, sum_prod(x, d2[i])) for i in d2]).collect()
Out[19]: [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
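The per-row computation can be checked without Spark. The following is a small plain-Python sketch of the same sum-of-products logic; the names dot and score_against are illustrative assumptions. Keys are sorted here so that the output order does not depend on dictionary ordering.

```python
def dot(u, v):
    """Sum of element-wise products of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def score_against(lookup, features):
    """Pair every key in the lookup dict with its dot product against features."""
    return [(key, dot(features, vec)) for key, vec in sorted(lookup.items())]


lookup = {1: [1, 2, 3], 2: [4, 5, 6]}
row_8 = score_against(lookup, [5, 4, 5])
row_9 = score_against(lookup, [4, 5, 2])
```

In the Spark job this would replace the lambda body, for example x1.mapValues(lambda feats: score_against(d2, feats)).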