Custom aggregation on PySpark dataframes [duplicate] - apache-spark

This question already has answers here:
Applying UDFs on GroupedData in PySpark (with functioning python example)
(4 answers)
Closed 1 year ago.
I have a PySpark DataFrame with one column containing one-hot encoded vectors. I want to aggregate the one-hot encoded vectors by vector addition after a groupBy.
e.g. df[userid, action] Row1: ["1234", [1, 0, 0]] Row2: ["1234", [0, 1, 0]]
I want the output as Row: ["1234", [1, 1, 0]], i.e. the vector is the sum of all vectors grouped by userid.
How can I achieve this? PySpark's sum aggregate operation does not support vector addition.

You have several options:
1. Create a user-defined aggregate function (UDAF). The problem is that you will need to write the UDAF in Scala and wrap it for use from Python.
2. Use the collect_list function to collect all values into a list and then write a UDF to combine them (see the sketch below).
3. Move to the RDD API and use aggregate or aggregateByKey.
Options 2 and 3 are relatively inefficient (costing both CPU and memory).

Related

How to efficiently perform a lookup from a column in a large spark dataframe into a small (broadcastable) array

I have a smallish (a couple of thousand entries) list/array of pairs of doubles and a very large (> 100 million rows) Spark dataframe. The large dataframe has an integer column which I want to use to index into the smaller list. I want to return a dataframe with all the original columns plus the two related values from the list.
I could obviously create a dataframe from the list and do an inner join, but that seems inefficient, as the optimiser doesn't know it only needs to fetch a single pair from the small list and that it can index directly into the list using the integer column from the large dataframe.
What's the most efficient way of doing this? Happy for answers using any API - Scala, PySpark, SQL, DataFrame or RDD.
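One way to exploit the broadcastable size hinted at in the question is to ship the small list to the executors as a broadcast variable and index into it from a UDF. A rough PySpark sketch, where pairs, idx, a and b are hypothetical names:

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Hypothetical small lookup list of (double, double) pairs
pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
pairs_bc = spark.sparkContext.broadcast(pairs)

# Hypothetical large dataframe with an integer index column
big_df = spark.createDataFrame([(0,), (2,), (1,)], ["idx"])

# Return both doubles as a struct so they can be pulled out as two columns
@F.udf(returnType=T.StructType([
    T.StructField("a", T.DoubleType()),
    T.StructField("b", T.DoubleType()),
]))
def lookup(i):
    return pairs_bc.value[i]

result = big_df.withColumn("pair", lookup("idx")).select("*", "pair.a", "pair.b")
result.show()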

Combine ‘n’ data files to make a single Spark Dataframe [duplicate]

This question already has answers here:
How to perform union on two DataFrames with different amounts of columns in Spark?
(22 answers)
Closed 4 years ago.
I have ‘n’ delimited data sets, possibly CSVs. One of them might have a few extra columns. I am trying to read all of them as dataframes and combine them into one. How can I merge them with a unionAll into a single dataframe?
P.S.: I can do this when I know what ‘n’ is, and it's a simple unionAll when the column counts are equal.
There is another approach besides the solutions mentioned in the first two comments:
Read all CSV files into a single RDD, producing an RDD[String].
Map it to create an RDD[Row] of the appropriate length, filling missing values with null or any other suitable value.
Create the DataFrame schema.
Create the DataFrame from the RDD[Row] using the created schema.
This may not be a good approach if the CSVs have a large number of columns. A sketch of these steps follows below.
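A minimal PySpark sketch of these steps, assuming hypothetical file paths, a three-column superset schema, and string-typed fields throughout:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# 1. Read all CSV files into a single RDD[String] (hypothetical paths)
lines = sc.textFile("data/file1.csv,data/file2.csv")

# Superset of columns across all files, kept as strings for simplicity
columns = ["c1", "c2", "c3"]
schema = StructType([StructField(c, StringType(), True) for c in columns])

# 2. Map each line to a tuple of the schema length, padding with None
def to_row(line):
    fields = line.split(",")
    return tuple(fields + [None] * (len(columns) - len(fields)))

# 3. and 4. Create the DataFrame from the mapped RDD using the schema
df = spark.createDataFrame(lines.map(to_row), schema)
df.show()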
Hope this helps

Spark/Scala any working difference between groupBy function of Rdd and DataFrame [duplicate]

This question already has an answer here:
DataFrame / Dataset groupBy behaviour/optimization
(1 answer)
Closed 4 years ago.
I have checked, and I am a bit curious about the groupBy function of RDD versus DataFrame. Is there any performance difference or anything else?
Please suggest.
Come to think of a difference between DataFrame.groupBy and RDD.groupBy: RDD's groupBy variant doesn't preserve ordering, unlike DataFrame's groupBy variant.
df.orderBy($"date").groupBy($"id").agg(first($"date") as "start_date")
The above works as expected, i.e. the aggregated results are ordered by date. Since the name is the same for both RDD and DataFrame, one might think it would work the same way for an RDD, but that's not the case. The reason is that the implementations of RDD's groupBy and DataFrame's groupBy are very different: RDD's groupBy may shuffle data according to the keys, which does not preserve any prior ordering.
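For comparison, a rough PySpark sketch of the RDD side (hypothetical id/date pairs): after groupBy the values inside each group arrive in no guaranteed order, so the earliest date has to be computed explicitly rather than by relying on a prior sort:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical (id, date) records
rdd = sc.parallelize([(1, "2020-01-03"), (1, "2020-01-01"), (2, "2020-01-02")])

# groupBy shuffles by key; any earlier ordering of the values is not preserved
grouped = rdd.groupBy(lambda record: record[0])
start_dates = grouped.mapValues(lambda records: min(date for _, date in records))
print(start_dates.collect())  # e.g. [(1, '2020-01-01'), (2, '2020-01-02')]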

How to use Spark dataset GroupBy() [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I have a Hive table with the schema:
id bigint
name string
updated_dt bigint
There are many records with the same id but different name and updated_dt. For each id, I want to return the record (whole row) with the largest updated_dt.
My current approach is:
After reading the data from Hive, I can use a case class to convert the data to an RDD, then use groupBy() to group all the records with the same id together, and later pick the one with the largest updated_dt. Something like:
dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))
However, since I use Spark 2.1, the data is first converted to a Dataset using the case class, and the approach above then converts it to an RDD in order to use groupBy(). There may be some overhead in converting the Dataset to an RDD. So I was wondering if I can achieve this at the Dataset level without converting to an RDD?
Thanks a lot
Here is how you can do it using Dataset:
data.groupBy($"id").agg(max($"updated_dt") as "Max")
There is not much overhead if you convert it to an RDD. If you choose the RDD route, it can be optimized further by using .reduceByKey() instead of .groupBy():
dataRdd.keyBy(_.id).reduceByKey((a,b) => if(a.updated_dt > b.updated_dt) a else b).values
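Note that agg(max(...)) above returns only the id and the largest updated_dt, not the whole row. To keep the entire row at the DataFrame/Dataset level, the linked duplicate ("How to select the first row of each group?") uses a window function; a rough PySpark sketch with made-up data:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: (id, name, updated_dt)
df = spark.createDataFrame(
    [(1, "a", 100), (1, "b", 200), (2, "c", 50)],
    ["id", "name", "updated_dt"],
)

# Rank rows within each id by updated_dt, newest first, and keep the top one
w = Window.partitionBy("id").orderBy(F.col("updated_dt").desc())
latest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
latest.show()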

More efficient way to Iterate & compute over columns [duplicate]

This question already has answers here:
Spark columnar performance
(2 answers)
Closed 5 years ago.
I have a very wide dataframe (> 10,000 columns) and I need to compute the percentage of nulls in each column. Right now I am doing:
threshold = 0.9
for c in df_a.columns[:]:
    if df_a[df_a[c].isNull()].count() >= (df_a.count() * threshold):
        # print(c)
        df_a = df_a.drop(c)
Of course this is a slow process and crashes on occasion. Is there a more efficient method I am missing?
Thanks!
There are a few strategies you can take, depending on the size of the dataframe. The code looks fine to me: you need to go through each column and count the number of null values.
One strategy is to cache the input dataframe, which makes the filtering faster. This only works, however, if the dataframe is not huge.
Also
df_a=df_a.drop(c)
I am a little skeptical about this, as it changes the dataframe inside the loop. It is better to collect the names of the null-heavy columns and drop them from the dataframe later in a single step (see the sketch below).
If the dataframe is huge and you can't cache it completely, you can partition it into manageable groups of columns: take, say, 100 columns at a time, cache that smaller dataframe, and run the analysis over the groups in a loop.
In that case you might want to keep the list of analyzed columns separate from the yet-to-be-analyzed ones, so that even if the job fails you can resume the analysis from the remaining columns.
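A minimal sketch of that idea with the same df_a and threshold: collect the names of the mostly-null columns first, then drop them all in a single call (each count is still a separate job, but the dataframe is no longer mutated inside the loop):

threshold = 0.9

# Cache so the repeated per-column scans do not re-read the source each time
df_a.cache()
total = df_a.count()

# Collect the offending column names first...
to_drop = [
    c for c in df_a.columns
    if df_a.filter(df_a[c].isNull()).count() >= total * threshold
]

# ...then drop them all at once
df_a = df_a.drop(*to_drop)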
You should avoid iterating column by column in PySpark, since each iteration triggers a separate job and the work is no longer done in one distributed computation.
Using count on a column will compute the count of non-null elements.
threshold = 0.9
import pyspark.sql.functions as psf

count_df = df_a \
    .agg(*([psf.count("*").alias("count")] +
           [psf.count(c).alias(c) for c in df_a.columns])) \
    .toPandas().transpose()
The first element is the number of rows in the dataframe:
total_count = count_df.iloc[0, 0]
kept_cols = count_df[count_df[0] > (1 - threshold)*total_count].iloc[1:,:]
df_a.select(list(kept_cols.index))
