Apply a custom function to a spark dataframe group - apache-spark

I have a very big table of time series data that has these columns:
Timestamp
LicensePlate
UberRide#
Speed
Each collection of LicensePlate/UberRide data should be processed considering the whole set of data. In other words, I do not need to process the data row by row, but all rows grouped by (LicensePlate/UberRide) together.
I am planning to use Spark with the DataFrame API, but I am confused about how to perform a custom calculation over a grouped Spark DataFrame.
What I need to do is:
1. Get all data
2. Group by some columns
3. For each group of the Spark DataFrame, apply a function f(x) that returns a custom object per group
4. Get the final result by applying g(x) to those per-group objects, returning a single custom object
How can I do steps 3 and 4? Any hints over which spark API (dataframe, dataset, rdd, maybe pandas...) should I use?
The whole workflow can be seen below:

What you are looking for has existed since Spark 2.3: pandas vectorized UDFs. They let you group a DataFrame and apply a custom pandas transformation to each group, with the groups distributed across the cluster:
df.groupBy("groupColumn").apply(myCustomPandasTransformation)
It is very easy to use, so I will just point to Databricks' presentation of pandas UDFs.
However, I don't yet know of an equally practical way to do grouped transformations in Scala, so any additional advice is welcome.
EDIT: in Scala, you can achieve the same thing since earlier versions of Spark, using Dataset's groupByKey + mapGroups/flatMapGroups.

While Spark provides some ways to integrate with Pandas, it doesn't make Pandas distributed. So whatever you do with Pandas in Spark is simply a local operation (local to the driver, or to an executor when used inside transformations).
If you're looking for a distributed system with Pandas-like API you should take a look at dask.
You can define User Defined Aggregate Functions or Aggregators to process grouped Datasets, but this part of the API is directly accessible only in Scala. It is not that hard to write a Python wrapper once you have created one.
RDD API provides a number of functions which can be used to perform operations in groups starting with low level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Which one is applicable in your case depends on the properties of the function you want to apply (is it associative and commutative, can it work on streams, does it expect specific order).
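To make the point about associativity concrete, here is a minimal plain-Python sketch (no Spark involved) of the idea behind combineByKey-style aggregation: each partition is reduced to a small partial result per key, and only those partials are merged. The sample data and the helper names (create, merge_value, merge_combiners, combine_partition, merge_partials) are assumptions made for illustration, not Spark APIs.

```python
from functools import reduce

# Hypothetical data, already split into two "partitions" of (key, value) pairs.
partitions = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0), ("b", 6.0), ("a", 5.0)],
]

def create(value):
    """Start a (sum, count) accumulator from the first value seen for a key."""
    return (value, 1)

def merge_value(acc, value):
    """Fold one more value into an accumulator within a partition."""
    return (acc[0] + value, acc[1] + 1)

def merge_combiners(left, right):
    """Merge two accumulators; associative and commutative, so order is irrelevant."""
    return (left[0] + right[0], left[1] + right[1])

def combine_partition(partition):
    """Reduce one partition to a dict of key -> (sum, count)."""
    acc = {}
    for key, value in partition:
        acc[key] = merge_value(acc[key], value) if key in acc else create(value)
    return acc

def merge_partials(left, right):
    """Merge the per-partition dicts key by key."""
    merged = dict(left)
    for key, combiner in right.items():
        merged[key] = merge_combiners(merged[key], combiner) if key in merged else combiner
    return merged

partials = [combine_partition(p) for p in partitions]
totals = reduce(merge_partials, partials, {})
means = {key: total / count for key, (total, count) in totals.items()}
print(means)
```

Because only one (sum, count) pair per key leaves each partition, far less data has to be moved than when every row of a group is gathered in one place. This only works when the per-group function can be expressed through such mergeable partial results.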
The most general but inefficient approach can be summarized as follows:
h(rdd.keyBy(f).groupByKey().mapValues(g).collect())
where f maps from value to key, g corresponds to per-group aggregation and h is a final merge. Most of the time you can do much better than that so it should be used only as the last resort.
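As a rough illustration of what that expression computes, the following plain-Python sketch emulates the keyBy / groupByKey / mapValues / collect chain on an in-memory list. It does not use Spark; the sample rows, the statistics computed in g, the choice of h, and the helper group_apply are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows: (timestamp, license_plate, ride_id, speed)
rows = [
    (1, "ABC-1", 10, 30.0),
    (2, "ABC-1", 10, 50.0),
    (3, "XYZ-9", 11, 20.0),
    (4, "XYZ-9", 11, 40.0),
    (5, "XYZ-9", 11, 90.0),
]

def f(row):
    """Key function: group by (license plate, ride id)."""
    return (row[1], row[2])

def g(group_rows):
    """Per-group function: receives every row of one group at once."""
    speeds = [r[3] for r in group_rows]
    return {
        "n": len(speeds),
        "max_speed": max(speeds),
        "mean_speed": sum(speeds) / len(speeds),
    }

def h(per_group):
    """Final merge: here, the key of the group with the highest mean speed."""
    return max(per_group, key=lambda kv: kv[1]["mean_speed"])[0]

def group_apply(rows, f, g):
    """Local stand-in for keyBy(f).groupByKey().mapValues(g).collect()."""
    groups = defaultdict(list)
    for row in rows:
        groups[f(row)].append(row)
    return [(key, g(values)) for key, values in groups.items()]

per_group = group_apply(rows, f, g)
fastest = h(per_group)
print(per_group)
print(fastest)
```

In this sketch g needs all rows of a group in memory at once, which mirrors why the groupByKey-based approach is the expensive last resort.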
Relatively complex logic can be expressed using DataFrames / Spark SQL and window functions.
See also Applying UDFs on GroupedData in PySpark (with functioning python example)

Related

Why are RDD, DataFrame and Dataset in Spark called APIs?

I started reading the book called "Spark definitive guide-big data processing made simple" to learn Spark. While I was reading I saw a line saying "A DataFrame is the most common Structured API and simply represents a table of data with rows and columns." I am not able to understand why RDDs and DataFrames are called APIs.
They're called APIs because they're essentially just different interfaces to exactly the same data. A DataFrame can be built on top of an RDD, and an RDD can be extracted from a DataFrame. They just have different sets of functions defined on that data; the main differences are the semantics and the way you work with the data, with RDD being the lower-level API and DataFrame the higher-level one. For example, you can use the Spark SQL interface with a DataFrame, which provides all common SQL functions, but if you decide to use RDDs, you would need to write those SQL functions yourself using RDD transformations.
And of course, they both exist because it really comes down to your use case.

Disadvantages of Spark Dataset over DataFrame

I know the advantages of Dataset (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to make use of Datasets for all our new flows, so knowing all the limitations/disadvantages of Dataset would help us.
EDIT: This is not the same as "Spark 2.0 Dataset vs DataFrame", which explains some operations on DataFrame/Dataset, or as other questions, most of which explain the differences between RDD, DataFrame and Dataset and how they evolved. This question is about when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
There are also many untyped Dataframe functions (like statistical functions) that are implemented for Dataframes but not typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe because the functions work by creating new columns, modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
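import java.sql.Date
import org.apache.spark.sql.functions.reverse
// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import spark.implicits._

case class Birth(hospitalName: String, birthDate: Date)
val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()
// reverse expects a string; the date is silently cast instead of being rejected
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()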
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+

how to make different sum over the same line in Spark

I have a Spark DataFrame with some numeric columns.
I would like to perform several aggregation operations on these columns, creating a new column for each function, some of which may also be user defined.
The easy solution would be using the DataFrame and withColumn. For instance, if I wanted to calculate the mean (by hand) and the function my_function on fields field_1 and field_2, I would do:
df = df.withColumn("mean", (df["field_1"] + df["field_2"]) / 2)
df = df.withColumn("foo", my_function(df["field_1"], df["field_2"]))
My doubt is about efficiency. Each of the 2 above functions scans the whole database while a smarter approach would calculate both results using one single scan.
Any hint on how to do that?
Thanks
Mauro
TL;DR You're trying to solve a problem which doesn't exist.
SQL transformations are lazy and declarative. A series of operations is converted into a logical execution plan, and then into a physical execution plan. At the first stage the Spark optimizer has the freedom to reorder, combine or even remove any part of the plan. You do, however, have to distinguish between two cases:
Python udf.
SQL expression.
The first requires separate conversion to Python RDD. It cannot be combined with native processing. The second one is processed natively using generated code.
Once you request the results, the physical plan is converted into stages and executed.

Implementing a Spark SQL UserDefinedAggregateFunction that performs multiple passes over a column

I've been experimenting with the UserDefinedAggregateFunction class to write aggregate functions for use in Spark SQL.
It works well for implementing single pass operations like sum(), avg() etc., but is there a trick you can use to perform multiple passes over a column?
For example, calculating variance using the naive approach, i.e. with a first pass calculating the column mean and then a second pass that uses this value to calculate the variance. I know that there are single-pass algorithms for doing this that give good approximations (as in fact implemented by Spark). I was just using this as an example of a two-pass operation.
It would be nice to be able to do the following,
spark.sql("SELECT product, MultiPassAgg(price) FROM products GROUP BY product")
I appreciate that I can do this kind of thing using Dataset / DataFrame operations in stages etc., but I was just looking for a clean approach as illustrated in the SQL above.
Any ideas or suggestions?
This should be possible, though the following suggestion could potentially use a large amount of memory if a large number of rows are involved in any given partition.
In the implementation of your UserDefinedAggregateFunction, set up the bufferSchema having a StructField that includes a DataType that is a collection (such as ArrayType) to act as an internal collection of inputs provided via update.
Then, in update you append each input to your collection, and in merge you combine all of the collections into a single collection. This allows you to have the full partition available for use in evaluate.
Finally, during evaluate you can operate across the entire collection of rows in any way you see fit.
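The buffer-collecting idea can be sketched in plain Python as follows. This is not Spark code: the class CollectingVariance is a hypothetical stand-in whose methods only mirror the initialize / update / merge / evaluate lifecycle of a UserDefinedAggregateFunction, with a Python list playing the role of the ArrayType buffer, and the two "partitions" are made-up sample data.

```python
class CollectingVariance:
    """Stand-in for a UDAF whose buffer collects every input value."""

    def initialize(self):
        # Empty collection; in Spark this would be the ArrayType buffer field.
        return []

    def update(self, buffer, value):
        # Called once per input row: append instead of aggregating.
        if value is not None:
            buffer.append(value)
        return buffer

    def merge(self, left, right):
        # Combine the buffers built on different partitions.
        return left + right

    def evaluate(self, buffer):
        # All values are now available, so multiple passes are possible.
        if not buffer:
            return None
        mean = sum(buffer) / len(buffer)                            # first pass
        return sum((x - mean) ** 2 for x in buffer) / len(buffer)   # second pass

# Simulate one group whose rows are spread over two partitions.
agg = CollectingVariance()
buffers = []
for partition in ([1.0, 2.0], [3.0, 4.0]):
    buf = agg.initialize()
    for price in partition:
        buf = agg.update(buf, price)
    buffers.append(buf)

merged = agg.merge(buffers[0], buffers[1])
variance = agg.evaluate(merged)
print(variance)
```

The sketch computes the population variance; the memory caveat mentioned above applies directly, since the merged buffer holds every value of the group.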

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what you are trying to accomplish, but it is safe to say that a lot of converting between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
Hope this helps.
