Spark - Wide/sparse dataframe persistence - apache-spark

I want to persist a very wide Spark Dataframe (>100'000 columns) that is sparsely populated (>99% of values are null) while keeping only non-null values (to avoid storage cost):
What is the best format for such a use case (HBase, Avro, Parquet, ...)?
What should be specified on the Spark side to ignore nulls when writing?
Note that I've already tried Parquet and Avro with a simple df.write statement. For a df of size ca. 100x130k, Parquet performs the worst (ca. 55MB) vs. Avro (ca. 15MB). To me this suggests that ALL null values are stored.
Thanks !

Spark to JSON / SparseVector (from thebluephantom)
In pyspark and using ml. Convert to Scala otherwise.
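from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.linalg import SparseVector, VectorUDT
temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])
schema = StructType([
    StructField("label", DoubleType(), False),
    StructField("features", VectorUDT(), False)
])
df = temp_rdd.toDF(schema)
# Write as JSON, then read back with the same schema so that
# "features" is parsed as a vector again.
df.write.json("/FileStore/V.json")
df2 = spark.read.schema(schema).json("/FileStore/V.json")
df2.show()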
returns upon read:
|label| features|
| 1.0|(4,[0,2],[-1.0,0.5])|
| 0.0| (4,[1,3],[1.0,5.5])|
Spark to Avro / Avro2TF (from py-r)
The Avro2TF library presented in this tutorial seems to be an interesting alternative that directly leverages Avro. As a result, a sparse vector would be encoded as follows:
| [2, 4, 1, 8, 11]|[1.0, 1.0, 1.0, 1...|
| [11, 10, 3]| [1.0, 1.0, 1.0]|
| [2, 4, 8]| [1.0, 1.0, 1.0]|
| [11, 10]| [1.0, 1.0]|
| [4, 8]| [1.0, 1.0]|
| [2, 4, 7, 3]|[1.0, 1.0, 1.0, 1.0]|
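To make the index/value encoding used in both answers concrete, here is a minimal plain-Python sketch (no Spark needed) of how one wide, mostly-null row could be reduced to its non-null entries before being stored. The helper names to_sparse and to_dense are hypothetical and only illustrate the idea. Keep in mind that a SparseVector treats missing entries as 0.0 rather than null.

```python
def to_sparse(row):
    """Return (size, indices, values), keeping only the non-null entries."""
    indices = [i for i, v in enumerate(row) if v is not None]
    values = [row[i] for i in indices]
    return len(row), indices, values


def to_dense(size, indices, values):
    """Rebuild the full row, restoring None at the missing positions."""
    row = [None] * size
    for i, v in zip(indices, values):
        row[i] = v
    return row


wide_row = [None, 1.0, None, 5.5]
print(to_sparse(wide_row))  # (4, [1, 3], [1.0, 5.5])
```

With >99% nulls, the indices and values lists stay short no matter how many columns the row nominally has.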


Why does the VectorAssembler transform return a column in a PySpark dataframe that contains both sparse and dense vectors?

This is the dataset df:
I applied the VectorAssembler transform as follows:
from pyspark.ml.feature import VectorAssembler
final_vect = VectorAssembler(inputCols=['sex_indexer','smoker_indexer','day_indexer','time_indexer','size','tip'], outputCol='Independent_feature')
This is the vectorized, transformed dataframe:
As we can see, there are sparse vectors in the last few rows of the dataset.
Why is VectorAssembler not working properly here? Is there a specific reason?
Is there any other method to get vectorized data?
The vectorizer actually works as expected. VectorAssembler returns whichever representation takes less storage: a SparseVector when the row contains many zeros and a DenseVector otherwise.
from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame(
    [(0.0, 0.0, 0.0, 0.0, 3.0, 3.35), (.1, .2, .3, .4, .5, .5)],
    ['a', 'b', 'c', 'd', 'e', 'f']
)
final_vect = VectorAssembler(inputCols=['a', 'b', 'c', 'd', 'e', 'f'], outputCol='X')
>>> final_vect.transform(df).show(truncate=False)
|a |b |c |d |e |f |X |
|0.0|0.0|0.0|0.0|3.0|3.35|(6,[4,5],[3.0,3.35]) |
|0.1|0.2|0.3|0.4|0.5|0.5 |[0.1,0.2,0.3,0.4,0.5,0.5]|
>>> final_vect.transform(df).collect()
[Row(a=0.0, b=0.0, c=0.0, d=0.0, e=3.0, f=3.35, X=SparseVector(6, {4: 3.0, 5: 3.35})), Row(a=0.1, b=0.2, c=0.3, d=0.4, e=0.5, f=0.5, X=DenseVector([0.1, 0.2, 0.3, 0.4, 0.5, 0.5]))]
Spark displays a sparse vector as a 3-tuple (size, indices, values), where size is the length of the vector, indices is the list of positions whose value is not zero, and values contains the corresponding values.
The way it is displayed in Python when you call collect is a bit clearer. You can see the name of the class, and for sparse vectors it displays a dictionary of the non-zero values.
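As far as I know, the choice between the two representations comes from the `compressed` method that Spark applies to the assembled vector. Below is a rough plain-Python sketch of that decision, assuming the rule `1.5 * (nnz + 1) < size` (nnz being the number of non-zero entries). The function name choose_representation is made up for illustration and is not part of Spark.

```python
def choose_representation(values):
    """Mimic the sparse-vs-dense choice for one assembled row.

    Assumption: sparse is chosen when 1.5 * (nnz + 1) < size,
    which is the rule I believe Spark's Vector.compressed uses.
    """
    size = len(values)
    nnz = sum(1 for v in values if v != 0.0)
    if 1.5 * (nnz + 1.0) < size:
        indices = [i for i, v in enumerate(values) if v != 0.0]
        return ("sparse", size, indices, [values[i] for i in indices])
    return ("dense", list(values))


print(choose_representation([0.0, 0.0, 0.0, 0.0, 3.0, 3.35]))
# ('sparse', 6, [4, 5], [3.0, 3.35])
print(choose_representation([0.1, 0.2, 0.3, 0.4, 0.5, 0.5]))
# ('dense', [0.1, 0.2, 0.3, 0.4, 0.5, 0.5])
```

Both rows from the example above come out the same way Spark shows them, which is why one column can legitimately mix the two formats.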

PySpark RDD: Manipulating Inner Array

I have a dataset (for example)
from pyspark import SparkContext
sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))
The print statement returns [(1, [2, 3, 4, 5])]
I now need to multiply everything in the sub-array by 2 across the RDD. Since I have already parallelized, I can't further break down "y.take(1)" to multiply [2, 3, 4, 5] by 2.
How can I essentially isolate the inner array across my worker nodes to then do the multiplication?
I think you can use map with a lambda function:
y = sc.parallelize(x).map(lambda x: (x[0], [2*t for t in x[1]]))
Then y.take(2) returns:
[(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
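The same per-record logic can also be written as a named function and checked locally with Python's built-in map before handing it to the RDD (for example via y.map(double_inner)). The name double_inner is only illustrative.

```python
def double_inner(record):
    """Take one (key, list) record and double every element of the list."""
    key, values = record
    return key, [2 * v for v in values]


x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
# Plain Python map: applies the function record by record, like RDD.map does.
print(list(map(double_inner, x)))
# [(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
```

Spark applies the function to each record on whichever worker holds it, so there is no need to isolate the inner array yourself.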
It will be more efficient to use the DataFrame API instead of RDDs: in that case all processing happens inside Spark, without the serialization to Python that the RDD API incurs.
For example, you can use the transform function (available in pyspark.sql.functions since Spark 3.1) to apply a transformation to array values:
import pyspark.sql.functions as F
df = spark.createDataFrame([(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])],
                           schema="id int, arr array<int>")
df2 = df.select("id", F.transform("arr", lambda x: x*2).alias("arr"))
df2.show()
This gives the desired result:
| id| arr|
| 1| [4, 6, 8, 10]|
| 2|[4, 14, 16, 20]|

Alternative of groupby in Pyspark to improve performance of Pyspark code

My PySpark data frame looks like this. I have to remove the groupBy from my PySpark code to improve its performance. I have to perform these operations on 100k records.
[Initial Data]
To create Dataframe
df = spark.createDataFrame([
(0, ['-9.53', '-9.35', '0.18']),
(1, ['-7.77', '-7.61', '0.16']),
(2, ['-5.80', '-5.71', '0.10']),
(0, ['1', '2', '3']),
(1, ['4', '5', '6']),
(2, ['8', '98', '32'])
], ["id", "Array"])
And the expected output is produced using this code.
import pyspark.sql.functions as f
The code above gives me the output below. I need to produce the same output without using the groupby function.
|id |flatten(Array) |
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
If you don't want to do group by, you can use window functions:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
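df2 = df.select(
    "id",
    f.flatten(f.collect_list("Array").over(Window.partitionBy("id"))).alias("Array")
).distinct()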
|id |Array |
|0 |[-9.53, -9.35, 0.18, 1, 2, 3] |
|1 |[-7.77, -7.61, 0.16, 4, 5, 6] |
|2 |[-5.80, -5.71, 0.10, 8, 98, 32]|
You can also try
Although I'm not sure if it'll be faster.
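For reference, the result being asked for is simply "concatenate all arrays that share an id". A plain-Python sketch of that semantics (not Spark code, and not a performance recommendation) can be handy for checking any rewrite against a small sample:

```python
from collections import defaultdict

rows = [
    (0, ['-9.53', '-9.35', '0.18']),
    (1, ['-7.77', '-7.61', '0.16']),
    (2, ['-5.80', '-5.71', '0.10']),
    (0, ['1', '2', '3']),
    (1, ['4', '5', '6']),
    (2, ['8', '98', '32']),
]

# Append each array to the running list for its id, in input order.
flattened = defaultdict(list)
for row_id, arr in rows:
    flattened[row_id].extend(arr)

print(flattened[0])  # ['-9.53', '-9.35', '0.18', '1', '2', '3']
```

Whatever Spark formulation you end up with still has to bring rows with the same id together, so it is worth measuring before assuming a given variant avoids the shuffle.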

Subset one array column with another (boolean) array column

I have a Dataframe like this (in Pyspark 2.3.1):
from pyspark.sql import Row
my_data = spark.createDataFrame([
    Row(a=[9, 3, 4], b=['a', 'b', 'c'], mask=[True, False, False]),
    Row(a=[7, 2, 6, 4], b=['w', 'x', 'y', 'z'], mask=[True, False, True, False])
])
#|a |b |mask |
#|[9, 3, 4] |[a, b, c] |[true, false, false] |
#|[7, 2, 6, 4]|[w, x, y, z]|[true, false, true, false]|
Now I'd like to use the mask column in order to subset the a and b columns:
my_desired_output = spark.createDataFrame([
    Row(a=[9], b=['a']),
    Row(a=[7, 6], b=['w', 'y'])
])
#|a |b |
#|[9] |[a] |
#|[7, 6]|[w, y]|
What's the "idiomatic" way to achieve this? The current solution I have involves map-ing over the underlying RDD and subsetting with Numpy, which seems inelegant:
import numpy as np
def subset_with_mask(row):
    mask = np.asarray(row.mask)
    a_masked = np.asarray(row.a)[mask].tolist()
    b_masked = np.asarray(row.b)[mask].tolist()
    return Row(a=a_masked, b=b_masked)
my_desired_output = spark.createDataFrame(my_data.rdd.map(subset_with_mask))
Is this the best way to go, or is there something better (less verbose and/or more efficient) I can do using Spark SQL tools?
One option is to use a UDF, which you can optionally specialize by the data type in the array:
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
def _mask_list(lst, mask):
    return np.asarray(lst)[mask].tolist()
mask_array_int = F.udf(_mask_list, T.ArrayType(T.IntegerType()))
mask_array_str = F.udf(_mask_list, T.ArrayType(T.StringType()))
my_desired_output = my_data
my_desired_output = my_desired_output.withColumn(
    'a', mask_array_int(F.col('a'), F.col('mask'))
)
my_desired_output = my_desired_output.withColumn(
    'b', mask_array_str(F.col('b'), F.col('mask'))
)
The UDFs mentioned in the previous answer are probably the way to go prior to the array functions added in Spark 2.4. For the sake of completeness, here is a "pure SQL" implementation that works before 2.4.
from pyspark.sql.functions import *
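df = my_data.withColumn("row", monotonically_increasing_id())
df1 = df.select("row", posexplode("a").alias("pos", "a"))
df2 = df.select("row", posexplode("b").alias("pos", "b"))
df3 = df.select("row", posexplode("mask").alias("pos", "mask"))
# Join the exploded columns on (row, pos), keep only masked-in elements,
# then collect them back into arrays per original row.
df1\
    .join(df2, ["row", "pos"])\
    .join(df3, ["row", "pos"])\
    .where("mask")\
    .groupBy("row")\
    .agg(collect_list("a").alias("a"), collect_list("b").alias("b"))\
    .select("a", "b")\
    .show()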
| a| b|
|[7, 6]|[w, y]|
| [9]| [a]|
A better way to do this is to use pyspark.sql.functions.expr, filter, and transform:
import pandas as pd
from pyspark.sql import (
    functions as F,
    SparkSession,
)
spark = SparkSession.builder.master('local[4]').getOrCreate()
bool_df = pd.DataFrame([
['a', [0, 1, 2, 3, 4], [True]*4 + [False]],
['b', [5, 6, 7, 8, 9], [False, True, False, True, False]]
], columns=['id', 'int_arr', 'bool_arr'])
bool_sdf = spark.createDataFrame(bool_df)
def filter_with_mask(in_col, mask_col, out_name="masked_arr"):
    filt_input = f'arrays_zip({in_col}, {mask_col})'
    filt_func = f'x -> x.{mask_col}'
    trans_func = f'x -> x.{in_col}'
    result = F.expr(f'''transform(
        filter({filt_input}, {filt_func}), {trans_func}
    )''').alias(out_name)
    return result
Using the function:
bool_sdf.select(
    '*', filter_with_mask('int_arr', 'bool_arr')
).toPandas()
Results in:
id int_arr bool_arr masked_arr
a [0, 1, 2, 3, 4] [True, True, True, True, False] [0, 1, 2, 3]
b [5, 6, 7, 8, 9] [False, True, False, True, False] [6, 8]
This should be possible with pyspark >= 2.4.0 and python >= 3.6.
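If the nested SQL expression is hard to read, this plain-Python sketch mirrors its three steps (zip, filter, transform) on ordinary lists. It is only meant to show what the expression computes; filter_with_mask_py is a hypothetical name and nothing here runs on Spark.

```python
def filter_with_mask_py(values, mask):
    """List-based equivalent of transform(filter(arrays_zip(...)))."""
    zipped = list(zip(values, mask))             # arrays_zip(in_col, mask_col)
    kept = [pair for pair in zipped if pair[1]]  # filter(..., x -> x.mask_col)
    return [pair[0] for pair in kept]            # transform(..., x -> x.in_col)


print(filter_with_mask_py([0, 1, 2, 3, 4], [True, True, True, True, False]))
# [0, 1, 2, 3]
print(filter_with_mask_py([5, 6, 7, 8, 9], [False, True, False, True, False]))
# [6, 8]
```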

How to do Rdd and broadcasted Rdd multiplication in pyspark?

I have two data frames like the ones below:
data frame1:(df1)
|id |features |
|8 |[5, 4, 5] |
|9 |[4, 5, 2] |
data frame2:(df2)
|id |features |
|1 |[1, 2, 3] |
|2 |[4, 5, 6] |
After that I converted the DataFrames to RDDs.
If I call rdd1.collect(), the result looks like this:
[Row(id=8, f=[5, 4, 5]), Row(id=9, f=[4, 5, 2])]
broadcastedrddif = sc.broadcast(rdd2.collectAsMap())
Now if I look at broadcastedrddif.value, I get:
{1: [1, 2, 3], 2: [4, 5, 6]}
Now I want to compute the sum of products of rdd1 and broadcastedrddif, i.e. it should return output like below.
((8,[(1,(5*1+4*2+5*3)),(2,(5*4+4*5+5*6))]),(9,[(1,(4*1+5*2+2*3)),(2,(4*4+5*5+2*6))]))
So my final output should be
[(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
where (1, 28) is a tuple, not a float.
Please help me on this.
I did not understand why you used sc.broadcast(), but I used it anyway...
mapValues on the last RDD is very useful in this case, and I used a list comprehension to compute the results from the dictionary.
x1=sc.parallelize([[8,5,4,5], [9,4,5,2]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x2=sc.parallelize([[1,1,2,3], [2,4,5,6]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
# I started directly from RDDs because that is simpler to test
broadcastedrddif = sc.broadcast(x2.collectAsMap())
d2 = broadcastedrddif.value
def sum_prod(x, y):
    # Sum of element-wise products (dot product) of two sequences
    c = 0
    for i in range(0, len(x)):
        c += x[i] * y[i]
    return c
x1.mapValues(lambda x: [(i, sum_prod(list(x), list(d2[i]))) for i in d2.keys()]).collect()
Out[19]: [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
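To check the arithmetic without a cluster, the same computation can be sketched in plain Python. Here the dict lookup stands in for broadcastedrddif.value and the list rows stands in for the collected rdd1. Both names, and the helper dot, are assumptions for illustration only.

```python
def dot(x, y):
    """Sum of element-wise products of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


rows = [(8, [5, 4, 5]), (9, [4, 5, 2])]   # stands in for rdd1
lookup = {1: [1, 2, 3], 2: [4, 5, 6]}     # stands in for broadcastedrddif.value

# For every row, pair each lookup key with the dot product against its vector.
result = [(key, [(k, dot(feats, v)) for k, v in lookup.items()])
          for key, feats in rows]
print(result)
# [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
```

Each inner element is a (key, value) tuple, which matches the output format asked for in the question.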
