SparkSQL Dataframe access internal partitions (Columns)

I am working with some algorithms that require intensive computations over a dataset of numeric variables. I have really improved my pipeline by using Parquet for storage and reading the data via the SparkSQL API, which allows me to read only the variables (columns) required for a given statistic.
However, once I obtain the corresponding DataFrame, I need to compute a single value by performing a series of operations on each case in the data (rows) and then aggregating them. Neither the SQL operations nor a UDF are efficient for this, so I convert the DataFrame into an RDD and then apply a mapPartitions function to it.
df.rdd.mapPartitions( ... )
The resulting RDD is an RDD[Row] object, which is organised row-wise. So when triggering mapPartitions, I get an iterator of Row objects and access my dataset case by case.
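For reference, this is roughly what my mapPartitions usage looks like; the column names and the per-partition statistic are placeholders, not my real variables:

def per_partition(rows):
    # rows is an iterator of Row objects, so access is case by case
    total, count = 0.0, 0
    for row in rows:
        total += row["x"] * row["y"]
        count += 1
    yield (total, count)

partials = df.rdd.mapPartitions(per_partition)
total, count = partials.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
result = total / count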
I wonder if there is an implicit way to access the data column by column, since the SparkSQL docs say that the DataFrame structure is based on columnar storage, so there must be a conversion when obtaining the RDD[Row] from it.
There must be a way to access the internal partitions of a DataFrame in order to operate on the raw columns, as it would be much more natural for me to define my algorithm that way (like working with the columns of a matrix).
I don't know if this conversion is buried deep in the implementation or can be accessed. I know that Spark reads the Parquet file (which is also column-wise organised) via the HDFS API, so I don't really know where the information is lost...
Any thoughts?

Related

Spark SQL Update/Delete

Currently, I am working on a pySpark project that reads in a few Hive tables, stores them as DataFrames, and has to perform a few updates/filters on them. I am avoiding Spark syntax at all costs, so that the framework only takes SQL from a parameter file and runs it through my pySpark framework.
Now the problem is that I have to perform UPDATE/DELETE queries on my final DataFrame. Are there any possible workarounds for performing these operations on my DataFrame?
Thank you so much!
A DataFrame is immutable; you cannot change it, so you are not able to update or delete in place.
If you want to "delete" there is a .filter option (it will create a new DF excluding records based on the validation that you applied on filter).
If you want to "update", the closer equivalent is .map, where you can "modify" your record and that value will be on a new DF, the thing is that function will iterate all the records on the .df.
Another thing to keep in mind: if you load data into a DataFrame from some source (i.e. a Hive table) and perform some operations, that updated data won't be reflected in your source data. DataFrames live in memory until you persist the data.
So you cannot work with a DataFrame like a SQL table for those operations. Depending on your requirements, you need to analyze whether Spark is a solution for your specific problem.
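As a minimal pySpark sketch of both operations (the Hive table and column names are hypothetical, and withColumn/when stands in for the row-wise "map" described above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("my_db.customers")          # hypothetical Hive table

# "DELETE": keep only the rows you want; the result is a new DataFrame.
kept = df.filter(col("status") != "inactive")

# "UPDATE": derive a new DataFrame with a modified column.
updated = kept.withColumn(
    "status", when(col("country") == "US", "domestic").otherwise(col("status"))
)

# Nothing above touches the source table; persist explicitly if you need the result.
updated.write.mode("overwrite").saveAsTable("my_db.customers_updated")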

Spark DataFrame / Dataset groupBy optimization via bucketBy

I'm researching options for a use-case where we store the dataset as parquet files and want to run efficient groupBy queries for a specific key later on when we read the data.
I've read a bit about optimizations for groupBy, but couldn't really find much about it (other than reduceByKey at the RDD level).
What I have in mind is: if the dataset is written bucketed by the key that will also be used in the groupBy, then theoretically the groupBy could be optimized, since all the rows containing the key will be co-located (and even consecutive if the data is also stored sorted on the same key).
One idea I have in mind is to apply the transformation via mapPartitions and then groupBy; however, this would require breaking my functions down into two, which is not really desirable. I believe that for some class of functions (say sum/count) Spark would optimize the query in a similar fashion anyway, but that optimization would be triggered by the choice of function and would apply regardless of the co-location of the rows, not because of it.
Can Spark leverage the co-location of the rows to optimize a subsequent groupBy, whatever the aggregation function?
It seems like bucketing's main use-case is doing JOINs on the bucketed key, which allows Spark to avoid a shuffle across the whole table. If Spark knows that the rows are already partitioned across the buckets, I don't see why it wouldn't use the pre-partitioned buckets in a GROUP BY. You might need to sort by the group-by key as well, though.
I'm also interested in this use-case, so I will be trying it out and seeing whether a shuffle occurs.
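One way to check, sketched below with hypothetical table, bucket count, and column names: write the data bucketed (and sorted) by the key, then inspect the physical plan of the groupBy for an Exchange (shuffle) step.

(df.write
   .bucketBy(16, "key")
   .sortBy("key")
   .mode("overwrite")
   .saveAsTable("bucketed_events"))

grouped = spark.table("bucketed_events").groupBy("key").sum("value")
grouped.explain()   # no Exchange in the physical plan means no shuffle for the groupBy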

What's the overhead of converting an RDD to a DataFrame and back again?

It was my assumption that Spark DataFrames were built from RDDs. However, I recently learned that this is not the case, and "Difference between DataFrame, Dataset, and RDD in Spark" does a good job explaining that they are not.
So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?
In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
RDDs have a double role in Spark. First, they are the internal data structure that tracks lineage between stages in order to recover from failures; second, until Spark 1.3 they were the main interface for interacting with users. Since Spark 1.3, DataFrames constitute the main interface, offering much richer functionality than RDDs.
There is no significant overhead when converting a DataFrame to an RDD with df.rdd, since DataFrames already keep an instance of their RDD initialized, so returning a reference to it should not have any additional cost. On the other hand, generating a DataFrame from an RDD requires some extra effort. There are two ways to convert an RDD to a DataFrame: calling rdd.toDF() or spark.createDataFrame(rdd, schema). Both methods evaluate lazily, although there is extra overhead for schema validation and building the execution plan (you can check the toDF() code here for more details). Of course, that would be identical to the overhead you have just by initializing your data with spark.read.text(...), but with one less step, the conversion from RDD to DataFrame.
This is the first reason I would go directly with DataFrames instead of working with two different Spark interfaces.
The second reason is that when using the RDD interface you miss some significant performance features that DataFrames and Datasets offer, related to the Spark optimizer (Catalyst) and memory management (Tungsten).
Finally, I would use the RDD interface only if I needed features that are missing from DataFrames, such as key-value pairs or the zipWithIndex function. But even then you can access those via df.rdd, which is costless as already mentioned. As for your case, I believe it would be faster to use a DataFrame directly and use that DataFrame's map function, so that Spark leverages Tungsten and ensures efficient memory management.
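For illustration, here is a rough pySpark sketch of the two routes, assuming a hypothetical comma-separated data.txt; the second route keeps the data in the DataFrame representation (and thus under Tungsten) the whole way:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.getOrCreate()

# Route 1 (what the question describes): text file -> RDD of Rows -> DataFrame.
rdd = spark.sparkContext.textFile("data.txt").map(
    lambda line: Row(name=line.split(",")[0], value=int(line.split(",")[1]))
)
df1 = spark.createDataFrame(rdd)           # pays schema inference/validation overhead

# Route 2: read directly into a DataFrame and parse with column expressions.
raw = spark.read.text("data.txt")          # single string column named "value"
df2 = raw.select(
    split(col("value"), ",").getItem(0).alias("name"),
    split(col("value"), ",").getItem(1).cast("int").alias("value"),
)

# Going back is cheap: df2.rdd exposes the DataFrame's RDD[Row] representation.
rows = df2.rdd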

How to post-process Spark SQL results w/o using UDF

I read
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
It suggests avoiding UDFs to save serialization/deserialization costs.
In my case, I did a query like this
select MYFUN(f1, f2, ...)
from A ...
I use MYFUN to post-process the query results row by row, for example, sending them to another service.
def my_fun(f1, f2, ...):
    service.send(f1, f2, ...)

session.udf.register('MYFUN', my_fun)
Without using a UDF, I could save the query results to a Python data frame, or to a Parquet table on HDFS that I then read back into a DataFrame, and process that DataFrame row by row.
The problem is that the result table is large, maybe 1M rows.
In such a case, does it still make sense to remove the UDF?
What is the best practice to populate a Spark SQL result to another service?
Python UDFs are not recommended from a performance point of view, but there is nothing wrong with using them when needed, as in this case: the serialization/deserialization cost is probably negligible compared to the I/O waits introduced by your send. So it probably doesn't make sense to remove the UDF.
In the more general case, there are two ways to reduce the memory footprint of processing a DataFrame. One, which you already mentioned, is to save the results to a file and process the file.
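A sketch of that route, reusing the f1/f2 query from the question (the HDFS path is hypothetical):

results = spark.sql("select f1, f2 from A")
results.write.mode("overwrite").parquet("hdfs:///tmp/query_results")

# Later, possibly from a separate lighter job, read the stored results back
# and push them to the other service from there.
stored = spark.read.parquet("hdfs:///tmp/query_results")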
Another way is using toLocalIterator on your DataFrame. It returns an iterator over the rows, fetching them one partition at a time, so the driver only ever holds one partition in memory; you can repartition the DataFrame beforehand to make the partitions an arbitrary size:
df = df.repartition(100)
for row in df.toLocalIterator():
    send(row)
This way, your local memory requirements are reduced to the size of the biggest partition of your repartitioned DataFrame.

Spark SQL with different data sources

Is it possible to create DataFrames from two different sources and perform operations on them?
For example,
df1 = <create from a file or folder from S3>
df2 = <create from a hive table>
df1.join(df2).where("df1Key" === "df2Key")
If this is possible, what are the implications of doing so?
Yes, it is possible to read from different data sources and perform operations on them.
In fact, many applications have exactly this kind of requirement.
df1.join(df2).where("df1Key" === "df2Key")
This will do a Cartesian join and then apply a filter on it.
df1.join(df2,$"df1Key" === $"df2Key")
This should produce the same output.
A DataFrame is a source-independent abstraction. I would encourage you to read the original paper on RDDs and the wiki.
The abstraction is source-independent and keeps track of the location of the data and the underlying DAG of operations. The DataFrame API adds a schema on top of an RDD.
You can have DataFrames from any source, but they are all homogenized to the same API. The DataFrame API provides a reader interface which any underlying source can implement to create a DataFrame on top of it; the Cassandra connector is another example of this.
One caveat is that the speed of data retrieval from the different sources might vary. For example, if your data is in S3 versus HDFS, operations on the DataFrame created on top of HDFS will probably be faster. But you will nonetheless be able to perform any joins on DataFrames created from different sources.
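As a minimal pySpark sketch of the scenario in the question (the S3 path, Hive table name, and key columns are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df1 = spark.read.parquet("s3a://my-bucket/my-folder/")   # file/folder on S3
df2 = spark.table("my_db.my_hive_table")                 # Hive table

# An explicit join condition lets Spark plan a real join instead of a Cartesian product.
joined = df1.join(df2, df1["df1Key"] == df2["df2Key"])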
