Pyspark UDF experience - apache-spark

Hi there.
I am very new to PySpark and am teaching myself UDFs. I understand that a UDF can sometimes slow down your code, so I would like to hear about your experience. Which UDFs have you written for things that could not be achieved with built-in PySpark functions alone? Are there any UDFs you find useful for cleaning data? Apart from the PySpark documentation, are there any resources that can help me learn about UDFs?

You can find most of the functionality you need in Spark's standard library of functions.
Import pyspark.sql.functions and check the docs here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions
Sometimes you do have to write a custom UDF, but be aware that it slows things down: Spark has to call your Python function for every DataFrame row, and it cannot optimize the function the way it optimizes its own built-ins.
Try to avoid UDFs as much as you can.
When you have no other option, use one, but keep it simple and minimize the external libraries it depends on.
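To make the trade-off concrete, here is a minimal sketch of a data-cleaning UDF. The function names, the `plate` column and the cleaning rule are hypothetical, chosen only for illustration. The cleaning logic is an ordinary Python function, so it can be tested without Spark; the wrapper assumes PySpark is installed and that `df` has a string column.

```python
import re


def normalize_plate(raw):
    """Uppercase a licence plate and drop everything except A-Z and 0-9."""
    if raw is None:  # Spark hands SQL NULL to a Python UDF as None
        return None
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def add_clean_plate(df, column="plate"):
    """Return df with an extra '<column>_clean' column computed by a Python UDF."""
    # Imported lazily so normalize_plate can be unit-tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    clean_udf = F.udf(normalize_plate, StringType())
    return df.withColumn(column + "_clean", clean_udf(F.col(column)))
```

This particular rule does not actually need a UDF: `F.regexp_replace(F.upper(F.col("plate")), "[^A-Z0-9]", "")` expresses the same thing with built-in functions and stays inside Spark's native execution, which is the point of the advice above.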
Another approach is to use an RDD: convert your DataFrame to an RDD (MYDF.rdd)
and then call map or mapPartitions, each of which accepts a function that manipulates your data.
map calls your function once per Row, while mapPartitions calls it once per partition and passes it an iterator over that partition's Row objects.
Read more about mapPartitions and map here: https://sparkbyexamples.com/spark/spark-map-vs-mappartitions-transformation/
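As a rough sketch of what a mapPartitions function looks like, the generator below fills in missing values for one partition. The `(plate, speed)` layout and the function name are assumptions made for this example. Because a Spark Row is a tuple, the same function can be exercised with plain tuples.

```python
def fill_missing_speed(rows, default=0.0):
    """Process one partition; `rows` is an iterator, so stream through it."""
    for plate, speed in rows:
        # Replace a missing speed with the default, otherwise normalise to float.
        yield (plate, default if speed is None else float(speed))


# Hypothetical usage on a DataFrame `df` with columns (plate, speed):
#   cleaned = df.rdd.mapPartitions(fill_missing_speed).toDF(["plate", "speed"])
```

Writing the function as a generator means the partition is never materialised as a list in memory.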

Related

Is it possible to use a `dask` array as input for `pyspark`?

I have a dask array that I would like to feed to pyspark.mllib.clustering.StreamingKMeans. Is it possible to use a dask array as input for pyspark?
There was once a proof-of-concept for using Dask as a preprocessing layer for handing off work to Spark, where the dask and spark workers were co-located. I don't believe the effort was ever pushed far or used in any kind of production, so the short answer is "no", there's no way to directly pass a dask array to spark. As things stand, you would need to compute the whole thing on the client, or write it to a storage system that both frameworks can see.

Disadvantages of Spark Dataset over DataFrame

I know the advantages of Datasets (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to use Datasets for all our new flows, so knowing the limitations and disadvantages of Datasets would help us.
EDIT: This is not a duplicate of "Spark 2.0 Dataset vs DataFrame", which explains some operations on DataFrames/Datasets, or of other questions that mostly explain the differences between RDD, DataFrame and Dataset and how they evolved. This question is specifically about when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
There are also many untyped Dataframe functions (like statistical functions) that are implemented for Dataframes but not typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe because the functions work by creating new columns, modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
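import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._ // toDS() and $"..."; assumes a SparkSession named spark
case class Birth(hospitalName: String, birthDate: Date)
val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()
// The date is silently treated as a string and reversed instead of being rejected
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()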
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+

how to make different sum over the same line in Spark

I have a Spark DataFrame with some numeric columns.
I would like to perform several aggregation operations on these columns, creating a new column for each function, some of which may also be user defined.
The easy solution would be to use the DataFrame and withColumn. For instance, if I wanted to calculate the mean (by hand) and the function my_function on fields field_1 and field_2, I would do:
df=df.withColumn("mean",(df["field_1"]+df["field_2"])/2)
df=df.withColumn("foo", my_function(df["field_1"],df["field_2"]))
My doubt is about efficiency. Each of the two calls above appears to scan the whole dataset, while a smarter approach would calculate both results in a single scan.
Any hint on how to do that?
Thanks
Mauro
TL;DR You're trying to solve a problem that doesn't exist.
SQL transformations are lazy and declarative. A series of operations is converted into a logical execution plan, and then into a physical execution plan. At the first stage the Spark optimizer is free to reorder, combine or even remove any part of the plan. You do, however, have to distinguish between two cases:
Python udf.
SQL expression.
The first requires a separate conversion to a Python RDD and cannot be combined with native processing. The second is processed natively using generated code.
Once you request the results, the physical plan is converted into stages and executed.

Apply a custom function to a spark dataframe group

I have a very big table of time series data that have these columns:
Timestamp
LicensePlate
UberRide#
Speed
Each collection of LicensePlate/UberRide data should be processed considering the whole set of data. In other words, I do not need to process the data row by row, but all rows grouped by (LicensePlate/UberRide) together.
I am planning to use Spark with the DataFrame API, but I am confused about how to perform a custom calculation over a grouped Spark DataFrame.
What I need to do is:
1. Get all data
2. Group by some columns
3. For each Spark DataFrame group, apply a function f(x) and return a custom object for each group
4. Get the final result by applying g(x) to those objects, returning a single custom object
How can I do steps 3 and 4? Any hints on which Spark API (DataFrame, Dataset, RDD, maybe pandas...) I should use?
The whole workflow can be seen below:
What you are looking for has existed since Spark 2.3: pandas vectorized UDFs. They let you group a DataFrame and apply a custom pandas transformation to each group, distributed across the cluster:
df.groupBy("groupColumn").apply(myCustomPandasTransformation)
It is very easy to use so I will just put a link to Databricks' presentation of pandas UDF.
However, I don't know of such a practical way to do grouped transformations in Scala yet, so any additional advice is welcome.
EDIT: in Scala, you can achieve the same thing, and could in earlier versions of Spark too, using Dataset's groupByKey + mapGroups/flatMapGroups.
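For the PySpark side, a per-group pandas function might look like the sketch below. The function name, the output columns `mean_speed` and `max_speed`, and the column name `UberRide` (standing in for `UberRide#`) are hypothetical. In Spark 3.0 and later the grouped-map call is spelled `applyInPandas`, which takes the function together with the output schema.

```python
import pandas as pd


def summarize_ride(group: pd.DataFrame) -> pd.DataFrame:
    """Receive every row of one (LicensePlate, UberRide) group, return one summary row."""
    first = group.iloc[0]
    return pd.DataFrame({
        "LicensePlate": [first["LicensePlate"]],
        "UberRide": [int(first["UberRide"])],
        "mean_speed": [float(group["Speed"].mean())],
        "max_speed": [float(group["Speed"].max())],
    })


# Hypothetical usage on a Spark DataFrame `df` (Spark 3.0+):
#   result = df.groupBy("LicensePlate", "UberRide").applyInPandas(
#       summarize_ride,
#       schema="LicensePlate string, UberRide long, mean_speed double, max_speed double",
#   )
```

Each group has to fit in the memory of a single executor, because the whole group is handed to the function as one pandas DataFrame.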
While Spark provides some ways to integrate with Pandas, it doesn't make Pandas distributed. So whatever you do with Pandas in Spark is simply a local operation (local to the driver, or to an executor when used inside transformations).
If you're looking for a distributed system with Pandas-like API you should take a look at dask.
You can define User Defined Aggregate functions or Aggregators to process grouped Datasets but this part of the API is directly accessible only in Scala. It is not that hard to write a Python wrapper when you create one.
RDD API provides a number of functions which can be used to perform operations in groups starting with low level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Which one is applicable in your case depends on the properties of the function you want to apply (is it associative and commutative, can it work on streams, does it expect specific order).
The most general but inefficient approach can be summarized as follows:
h(rdd.keyBy(f).groupByKey().mapValues(g).collect())
where f maps from value to key, g corresponds to per-group aggregation and h is a final merge. Most of the time you can do much better than that so it should be used only as the last resort.
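To illustrate the roles of f, g and h, here is a plain-Python model of that pipeline. It does not use Spark at all: the three helper functions only imitate the semantics of `keyBy`, `groupByKey` and `mapValues` on a local list, and the record layout `(plate, speed)` is an assumption made for this example.

```python
from collections import defaultdict


def key_by(records, f):
    """Local stand-in for rdd.keyBy(f): pair each record with its key."""
    return [(f(r), r) for r in records]


def group_by_key(pairs):
    """Local stand-in for groupByKey(): collect all values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def map_values(pairs, g):
    """Local stand-in for mapValues(g): transform each group, keep the key."""
    return [(key, g(values)) for key, values in pairs]


def f(record):
    """Key function: group records by plate."""
    return record[0]


def g(group):
    """Per-group aggregation: highest speed seen in the group."""
    return max(speed for _, speed in group)


def h(per_group):
    """Final merge: the (plate, speed) pair with the highest speed overall."""
    return max(per_group, key=lambda kv: kv[1])


records = [("A", 10), ("B", 40), ("A", 25)]
per_group = map_values(group_by_key(key_by(records, f)), g)
result = h(per_group)
```

Because `max` is associative and commutative, this particular aggregation would not need `groupByKey` on a real cluster; a `reduceByKey` over the speeds would avoid shipping whole groups across the network.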
Relatively complex logic can be expressed using DataFrames / Spark SQL and window functions.
See also Applying UDFs on GroupedData in PySpark (with functioning python example)

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what you are trying to accomplish, but it is safe to say that repeatedly converting data between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
Hope this helps.

Resources