How to use Spark dataset GroupBy() [duplicate] - apache-spark

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I have a Hive table with the schema:
id bigint
name string
updated_dt bigint
There are many records with the same id but different name and updated_dt. For each id, I want to return the record (the whole row) with the largest updated_dt.
My current approach is:
After reading the data from Hive, I can use a case class to convert the data to an RDD, then use groupBy() to group all the records with the same id together, and later pick the one with the largest updated_dt. Something like:
dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))
However, since I use Spark 2.1, it first converts the data to a Dataset using the case class, and then the above approach converts the data to an RDD in order to use groupBy(). There may be some overhead in converting the Dataset to an RDD. So I was wondering: can I achieve this at the Dataset level without converting to an RDD?
Thanks a lot

Here is how you can do it using Dataset:
data.groupBy($"id").agg(max($"updated_dt") as "Max")
There is not much overhead if you convert it to an RDD. If you choose to use the RDD, it can be optimized further by using .reduceByKey() instead of .groupBy():
dataRdd.keyBy(_.id).reduceByKey((a,b) => if(a.updated_dt > b.updated_dt) a else b).values
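If the whole row is needed rather than just the maximum value, here is a minimal sketch that stays at the Dataset level, assuming a case class matching the Hive schema and a SparkSession named spark:

import spark.implicits._

// Hypothetical case class mirroring the Hive schema from the question.
case class Record(id: Long, name: String, updated_dt: Long)

val ds = data.as[Record]

// groupByKey + reduceGroups keeps whole rows and retains only the latest one per id.
val latest = ds
  .groupByKey(_.id)
  .reduceGroups((a, b) => if (a.updated_dt > b.updated_dt) a else b)
  .map { case (_, record) => record }

Like reduceByKey on the RDD side, reduceGroups combines records pairwise instead of collecting each group first.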

Related

What's the difference between RDD and Dataframe in Spark? [duplicate]

This question already has answers here:
Difference between DataFrame, Dataset, and RDD in Spark
(14 answers)
Closed 3 years ago.
Hi, I am relatively new to Apache Spark. I want to understand the difference between RDD, DataFrame, and Dataset.
For example, I am pulling data from an S3 bucket.
df=spark.read.parquet("s3://output/unattributedunattributed*")
In this case, when I am loading data from S3, what would be the RDD? Also, since an RDD is immutable and I can change the value of df, df couldn't be an RDD.
I'd appreciate it if someone could explain the difference between RDD, DataFrame, and Dataset.
df=spark.read.parquet("s3://output/unattributedunattributed*")
With this statement, you are creating a DataFrame.
To create an RDD, use:
rdd = spark.sparkContext.textFile("s3://output/unattributedunattributed*")
RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records. The RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations.
In a DataFrame, data is organized into named columns, like a table in a relational database. It is an immutable distributed collection of data. A DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.
If you want to apply a map or filter to the whole dataset, use an RDD.
If you want to work on an individual column or perform operations/calculations on a column, then use a DataFrame.
For example, if you want to replace 'A' with 'B' across the whole data,
then an RDD is useful:
rdd = rdd.map(lambda x: x.replace('A', 'B'))
If you want to update the data type of a column, then use a DataFrame:
dff = dff.withColumn("LastmodifiedTime_timestamp", col('LastmodifiedTime_time').cast('timestamp'))  # needs: from pyspark.sql.functions import col
An RDD can be converted into a DataFrame and vice versa (a quick sketch of the round trip is below).
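As a quick Scala illustration of that round trip (assuming a SparkSession named spark; the column names and types are hypothetical):

import spark.implicits._

// DataFrame -> RDD: every DataFrame exposes its underlying RDD[Row].
val df  = spark.read.parquet("s3://output/unattributedunattributed*")
val rdd = df.rdd

// RDD -> DataFrame: attach column names (or a full schema) to the records.
val backToDf = rdd
  .map(row => (row.getString(0), row.getString(1)))  // hypothetical: first two columns are strings
  .toDF("col_a", "col_b")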

Combine ‘n’ data files to make a single Spark Dataframe [duplicate]

This question already has answers here:
How to perform union on two DataFrames with different amounts of columns in Spark?
(22 answers)
Closed 4 years ago.
I have ‘n’ delimited data sets, possibly CSVs. But one of them might have a few extra columns. I am trying to read all of them as DataFrames and put them into one. How can I merge them with a unionAll and make them a single DataFrame?
P.S.: I can do this when I know what ‘n’ is. And it’s a simple unionAll when the column counts are equal.
There is another approach besides the solutions mentioned in the first two comments (a rough sketch follows below):
Read all the CSV files into a single RDD, producing an RDD[String].
Map it to create an RDD[Row] of the appropriate length, filling the missing values with null or any other suitable value.
Create the DataFrame schema.
Create the DataFrame from the RDD[Row] using the created schema.
This may not be a good approach if the CSVs have a large number of columns.
Hope this helps
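A rough sketch of those steps, assuming a SparkSession named spark, comma-delimited files without header rows, and a known superset of three columns, all read as strings (the path is hypothetical):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val numCols = 3
val schema  = StructType((1 to numCols).map(i => StructField(s"col$i", StringType, nullable = true)))

val rows = spark.sparkContext
  .textFile("s3://input/*.csv")                            // all CSV files into one RDD[String]
  .map(_.split(",", -1))                                   // keep empty trailing fields
  .map(fields => Row.fromSeq(fields.padTo(numCols, null).take(numCols)))  // pad missing columns with null

val combined = spark.createDataFrame(rows, schema)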

Spark/Scala any working difference between groupBy function of Rdd and DataFrame [duplicate]

This question already has an answer here:
DataFrame / Dataset groupBy behaviour/optimization
(1 answer)
Closed 4 years ago.
I have checked and am a bit curious to know about the groupBy function of RDD and DataFrame. Is there any performance difference or something else?
Please suggest.
Come to think of a difference between DataFrame.groupBy and RDD.groupBy: RDD's groupBy variant doesn't preserve ordering, unlike the DataFrame variant.
df.orderBy($"date").groupBy($"id").agg(first($"date") as "start_date")
The above works as expected, i.e. the aggregated results will be ordered by date. Since the name is the same for both RDD and DataFrame, one might think it works the same way for an RDD as well, but that's not the case. The reason is that the implementations of RDD's groupBy and DataFrame's groupBy are very different: RDD's groupBy shuffles the data according to the keys and gives no ordering guarantee within a group.
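For contrast, a hypothetical sketch of the RDD side, assuming a case class Event(id: String, date: String) and an existing RDD[Event] named eventsRdd; because the values inside each group arrive in no guaranteed order, the earliest date has to be taken explicitly:

case class Event(id: String, date: String)

val startDates = eventsRdd
  .groupBy(_.id)                                 // RDD[(String, Iterable[Event])]
  .mapValues(events => events.map(_.date).min)   // pick the earliest date per id explicitly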

Custom aggregation on PySpark dataframes [duplicate]

This question already has answers here:
Applying UDFs on GroupedData in PySpark (with functioning python example)
(4 answers)
Closed 1 year ago.
I have a PySpark DataFrame with one column containing one-hot encoded vectors. I want to aggregate the different one-hot encoded vectors by vector addition after a groupBy.
e.g. df[userid, action] Row1: ["1234", [1, 0, 0]] Row2: ["1234", [0, 1, 0]]
I want the output as Row: ["1234", [1, 1, 0]], so the vector is the sum of all the vectors grouped by userid.
How can I achieve this? PySpark's sum aggregate operation does not support vector addition.
You have several options:
Create a user-defined aggregate function. The problem is that you will need to write the user-defined aggregate function in Scala and wrap it for use in Python (a rough Scala sketch follows below).
You can use the collect_list function to collect all the values into a list and then write a UDF to combine them.
You can move to the RDD API and use aggregate or aggregateByKey.
Options 2 and 3 would be relatively inefficient (costing both CPU and memory).
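For option 1, here is a minimal Scala sketch of such an aggregate function, assuming the one-hot vectors are plain integer arrays of a known, fixed length (the class name VectorSum and its size parameter are illustrative, and nulls are not handled):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class VectorSum(size: Int) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = new StructType().add("vec", ArrayType(IntegerType))
  override def bufferSchema: StructType = new StructType().add("acc", ArrayType(IntegerType))
  override def dataType: DataType = ArrayType(IntegerType)
  override def deterministic: Boolean = true

  // Start each group with a zero vector of the expected length.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.fill(size)(0)

  // Add one input vector to the running sum, element by element.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Int](0).zip(input.getSeq[Int](0)).map { case (a, b) => a + b }

  // Combine two partial sums coming from different partitions.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Int](0).zip(buffer2.getSeq[Int](0)).map { case (a, b) => a + b }

  override def evaluate(buffer: Row): Any = buffer.getSeq[Int](0)
}

It could then be registered on the JVM side, e.g. spark.udf.register("vector_sum", new VectorSum(3)), and invoked from PySpark through SQL or expr.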

Join Spark dataframe with Cassandra table [duplicate]

DataFrame A (millions of records) has columns including create_date and modified_date.
DataFrame B (500 records) has start_date and end_date.
Current approach:
SELECT a.*, b.* FROM a JOIN b ON a.create_date BETWEEN b.start_date AND b.end_date
The above job takes half an hour or more to run.
How can I improve the performance?
DataFrames currently don't have an approach for direct joins like that; Spark will fully read both tables before performing the join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
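A heavily simplified sketch of that pattern (the keyspace and table names and the id key column are hypothetical; joinWithCassandraTable joins on the table's partition key, so only the matching Cassandra rows are fetched instead of the whole table):

import com.datastax.spark.connector._

// Assumed: dfA is DataFrame A and "id" is the partition key of the Cassandra table.
val joined = dfA.rdd
  .map(row => Tuple1(row.getAs[String]("id")))         // keys to look up in Cassandra
  .joinWithCassandraTable("my_keyspace", "table_b")    // RDD[(Tuple1[String], CassandraRow)]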
As others suggested, one approach is to broadcast the smaller DataFrame. This can also be done automatically by configuring the parameter below:
spark.sql.autoBroadcastJoinThreshold
If a DataFrame is smaller than the value specified here, Spark automatically broadcasts it and performs a broadcast join instead of a shuffle join.
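For example, a minimal sketch assuming the two DataFrames are named dfA and dfB:

import org.apache.spark.sql.functions.broadcast

// Explicitly hint that the 500-row DataFrame should be shipped to every executor,
// turning the range join into a broadcast (map-side) join.
val joinedDf = dfA.join(
  broadcast(dfB),
  dfA("create_date").between(dfB("start_date"), dfB("end_date"))
)

// Or raise the automatic threshold (in bytes) so Spark broadcasts it on its own.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)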
