How to post-process Spark SQL results w/o using UDF - apache-spark

I read
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
It suggests avoiding UDFs to save serialization/deserialization costs.
In my case, I ran a query like this:
select MYFUN(f1, f2, ...)
from A ...
I use MYFUN to post-process the query results row by row, for example, sending them to another service.
def my_fun(f1, f2, ...):
    service.send(f1, f2, ...)

session.udf.register('MYFUN', my_fun)
Without using a UDF, I could save the query results to a Python dataframe, or to a Parquet table on HDFS that I then read back into a dataframe, and process that dataframe row by row.
The problem is that the result table is large, maybe 1M rows.
In such a case, does it still make sense to remove the UDF?
What is the best practice to populate a Spark SQL result to another service?

Python UDFs are not recommended from a performance point of view, but there is nothing wrong with using them when needed, as in this case: the serialization/deserialization cost is probably negligible compared to the I/O waits introduced by your send call. So it probably doesn't make sense to remove the UDF.
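For reference, a minimal sketch of keeping the UDF, based on the snippet in the question; service is the client from the question and f1, f2 stand in for the real columns. Registering an explicit return type just makes the SQL call well defined:
from pyspark.sql.types import BooleanType

def my_fun(f1, f2):
    # Side effect only: push the row's values to the external service
    service.send(f1, f2)
    return True  # a UDF must return a value; it is ignored here

session.udf.register('MYFUN', my_fun, BooleanType())
# collect() forces evaluation; only the small boolean results come back to the driver
session.sql("select MYFUN(f1, f2) from A").collect()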
In the more general case, there are two ways to reduce the memory footprint of processing a dataframe. One, which you already mentioned, is to save the results to a file and process the file.
The other is to use toLocalIterator on your dataframe. The iterator pulls the dataframe back one partition at a time, and you can repartition the dataframe beforehand to make the partitions an arbitrary size:
df = df.repartition(100)
for row in df.toLocalIterator():
    # toLocalIterator yields rows, holding only one partition in driver memory at a time
    send(row)
This way your local memory requirement is reduced to the size of the biggest partition of the repartitioned dataframe.

Related

Spark SQL Update/Delete

Currently, I am working on a project in PySpark that reads in a few Hive tables, stores them as dataframes, and has to perform a few updates/filters on them. I am avoiding Spark-specific syntax at all costs, so that the framework only takes SQL from a parameter file and runs it through my PySpark framework.
Now the problem is that I have to perform UPDATE/DELETE queries on my final dataframe. Are there any possible workarounds for performing these operations on my dataframe?
Thank you so much!
A DataFrame is immutable: you cannot change it, so you are not able to update/delete in place.
If you want to "delete", there is .filter: it creates a new DF that excludes the records matching the condition you apply in the filter.
If you want to "update", the closest equivalent is .map, where you "modify" each record and the new values end up in a new DF; keep in mind that the function will be applied to every record in the DF.
Another thing to keep in mind: if you load data into a DF from some source (e.g. a Hive table) and perform some operations, the updated data won't be reflected in your source data. DFs live in memory until you persist them.
So you cannot work with a DF like a SQL table for those operations. Depending on your requirements, you need to decide whether Spark is the right tool for your specific problem.
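A minimal PySpark sketch of both workarounds, using withColumn/when as the DataFrame-level counterpart of the .map mentioned above; the table name, columns, and conditions are made up for illustration:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("some_hive_table")  # hypothetical source table

# "DELETE": keep only the rows that should survive
kept = df.filter(F.col("status") != "obsolete")

# "UPDATE": rebuild the column with the new value where a condition holds
updated = kept.withColumn(
    "status",
    F.when(F.col("amount") > 100, F.lit("reviewed")).otherwise(F.col("status"))
)

# Nothing is reflected in the source table until the result is written back
updated.write.mode("overwrite").saveAsTable("some_hive_table_updated")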

Spark DataFrame / Dataset groupBy optimization via bucketBy

I'm researching options for a use-case where we store the dataset as parquet files and want to run efficient groupBy queries for a specific key later on when we read the data.
I've read a bit about optimizations for groupBy, but couldn't really find much about it (other than RDD-level reduceByKey).
What I have in mind is writing the dataset bucketed by the key that will also be used in the groupBy. Theoretically the groupBy could then be optimized, since all the rows containing a given key will be co-located (and even consecutive, if the data is also stored sorted on the same key).
One idea I have is to apply the transformation via mapPartitions and then groupBy; however, that requires splitting my functions in two, which isn't really desirable. I believe that for some classes of functions (say sum/count) Spark would optimize the query in a similar fashion anyway, but that optimization would be triggered by the choice of function and would work regardless of the co-location of the rows, not because of it.
Can Spark leverage the co-location of the rows to optimize a subsequent groupBy with an arbitrary function?
It seems like bucketing's main use-case is for doing JOINs on the bucketed key, which allows Spark to avoid doing a shuffle across the whole table. If Spark knows that the rows are already partitioned across the buckets, I don't see why it wouldn't know to use the pre-partitioned buckets in a GROUP BY. You might need to sort by the group by key as well though.
I'm also interested in this use-case so will be trying it out and seeing if a shuffle occurs.
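A rough sketch of such an experiment in PySpark; the table name, bucket count, and user_id column are made up, and bucketBy requires writing with saveAsTable:
# Write bucketed (and sorted) by the grouping key
df.write.bucketBy(32, "user_id").sortBy("user_id") \
    .format("parquet").mode("overwrite").saveAsTable("events_bucketed")

# Read it back, group by the same key, and inspect the physical plan:
# an Exchange step in the plan means a shuffle still happens
grouped = spark.table("events_bucketed").groupBy("user_id").count()
grouped.explain()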

What is the fastest way to get a large number of time ranges using Apache Spark?

I have about 100 GB of time series data in Hadoop. I'd like to use Spark to grab all data from 1000 different time ranges.
I have tried this using Apache Hive by creating an extremely long SQL statement that has about 1000 'OR BETWEEN X AND Y OR BETWEEN Q AND R' clauses.
I have also tried using Spark. In this technique I've created a dataframe that holds the time ranges in question and loaded it into Spark with:
spark_session.createDataFrame()
and
df.registerTempTable()
With this, I'm joining the newly created time-range dataframe against the larger set of timestamped data.
This query is taking an extremely long time and I'm wondering if there's a more efficient way to do this.
Especially if the data is not partitioned or ordered in any special way, you or Spark will need to scan all of it no matter what.
I would define a predicate given the set of time ranges:
import scala.collection.immutable.Range

val ranges: List[Range] = ??? // load your ranges here

def matches(timestamp: Int): Boolean = {
  // This is not efficient; a better data structure than a List
  // should be used, but this is just an example
  ranges.exists(_.contains(timestamp))
}

val data: RDD[(Int, T)] = ??? // load the data into an RDD
val filtered = data.filter(x => matches(x._1))
You can do the same with DataFrame/DataSet and UDFs.
This works well if the set of ranges is provided in the driver. If it instead comes from a table, like the 100 GB data, first collect it back to the driver, provided it is not too big.
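A minimal PySpark sketch of the DataFrame/UDF variant mentioned above; df, the timestamp column, and the example ranges are placeholders:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Hypothetical list of (start, end) timestamp ranges, known on the driver
ranges = [(1000, 2000), (5000, 6000)]

def in_any_range(ts):
    # Linear scan for clarity; an interval tree would be faster for ~1000 ranges
    return any(start <= ts <= end for start, end in ranges)

in_any_range_udf = F.udf(in_any_range, BooleanType())
filtered = df.filter(in_any_range_udf(F.col("timestamp")))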
Your Spark job goes through the 100 GB dataset to select the relevant data.
I don't think there is a big difference between using SQL or the DataFrame API, as under the hood the full scan happens anyway.
I would consider restructuring your data so that it is optimised for your specific queries.
In your case, partitioning by time can give quite a significant improvement (for example, a Hive table with partitioning).
If you search on the same field that was used for partitioning, the Spark job will only look into the relevant partitions.
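As a sketch, assuming the timestamp is stored in a timestamp-typed column and with a made-up output path, a time-partitioned layout could look like this:
from pyspark.sql import functions as F

# Write the time series partitioned by a date derived from the timestamp
df.withColumn("event_date", F.to_date(F.col("timestamp"))) \
    .write.partitionBy("event_date") \
    .mode("overwrite").parquet("/data/events_by_date")

# A query that filters on the partition column only reads the matching directories
week = spark.read.parquet("/data/events_by_date") \
    .filter(F.col("event_date").between("2023-01-01", "2023-01-07"))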

What is preferred, bucket or repartition?

I have 2 Spark jobs: one is a pre-process and the second is the process.
The process job needs to run a calculation for each user in the data.
I want to avoid a shuffle (like the one groupBy causes), so I'm thinking about saving the result of the pre-process either bucketed by user in Parquet, or repartitioned by user, and then saving the result.
Which is preferred, and why?
The choice between partitionBy and bucketBy can be reduced to determining the data's cardinality:
Low cardinality -> partition
High cardinality -> bucket
However, neither is used for aggregations. They are used for predicate pushdown, nothing more. Therefore they won't be of much use when your goal is to avoid a shuffle such as groupBy, although this might change in the future with the new API.
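A quick sketch of that rule of thumb, with made-up column names (country as the low-cardinality column, user_id as the high-cardinality one):
# Low cardinality: one directory per value on disk
df.write.partitionBy("country").mode("overwrite").parquet("/data/by_country")

# High cardinality: a fixed number of bucket files per table (requires saveAsTable)
df.write.bucketBy(64, "user_id").sortBy("user_id").mode("overwrite").saveAsTable("by_user")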
Please read this twice or thrice to understand it.
My recommendation is to use repartition, as partitionBy causes a lot of shuffling: it creates a folder in HDFS for every partition key and then spreads the data across many files, which is a very expensive process. bucketBy adds much the same cost, but creates the files inside the folders according to the previous partitioning.
Repartition, on the other hand, hashes all the data being stored in the files by the key you mention, and the data shuffle is only what is needed to match the number of files you request in the repartition call, which is less expensive and pretty fast. Also, if you later group by this data, the running time will be about the same as with partitionBy; by repartitioning you mainly reduce the runtime of the pre-process.
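For completeness, a minimal sketch of the repartition-then-write approach recommended here; the column name, file count, and output path are made up:
# Repartition the pre-process output by the user key before writing;
# 200 is an arbitrary target number of output files
pre = pre_processed_df.repartition(200, "user_id")
pre.write.mode("overwrite").parquet("/data/preprocessed_by_user")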

SparkSQL Dataframe access internal partitions (Columns)

I am working with some algorithms that require intensive computations over a dataset of numeric variables. I have really improved my pipeline by using Parquet for storage and reading the data via the Spark SQL API, which allows me to read only the variables (columns) that are required for computing a given statistic.
However, once I obtain the corresponding DataFrame, I need to compute a single value by performing a series of operations on each case in the data (rows) and then aggregating them. Neither SQL operations nor a UDF are efficient for this, so I convert the DataFrame into an RDD and apply a mapPartitions function to it.
df.rdd.mapPartitions( ... )
The resulting RDD is an RDD[Row], which is organised row-wise. So when mapPartitions runs, I get an iterator of Row objects and access my dataset case by case.
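For illustration, a minimal sketch of that row-wise pattern, with a made-up pair of columns x and y and a per-partition partial aggregate:
def per_partition(rows):
    # rows is an iterator of Row objects; fields are accessed by name, case by case
    total = 0.0
    for row in rows:
        total += row["x"] * row["y"]
    yield total

partial_sums = df.rdd.mapPartitions(per_partition)
result = partial_sums.sum()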
I wonder if there is an implicit way to access the data column by column, as the Spark SQL docs say that the DataFrame structure is based on columnar storage, so there must be a conversion when obtaining the RDD[Row] from it.
There must be a way to access the internal partitions of a dataframe, in order to get at the raw columns and operate on them directly, as it would be much more natural for me to define my algorithm that way (like working with the columns of a matrix).
I don't know whether this conversion is buried deep in the implementation or can be accessed. I know that Spark reads the Parquet file (which is also column-wise organised) via the HDFS API, so I don't really know where that information is lost...
Any thoughts?
