Does dropping columns that are not used in computation affect performance in Spark?

I have a large dataset (hundreds of millions of rows) that I need to heavily process using Spark with Databricks. This dataset has tens of columns, typically integers, floats, or arrays of integers.
My question is: does it make any difference, in terms of memory and/or processing speed, if I drop some columns that are not needed before processing the data?

It depends on what you are going to do with this dataset. Spark is smart enough to figure out which columns are really needed, but it's not always that easy. For example, when you use a UDF (user-defined function) that operates on a case class with all columns defined, all columns are going to be selected from the source, because from Spark's perspective such a UDF is a black box.
You can check which columns are selected for your job via the Spark UI. For example, check out this blog post:
In your plan you can look for this line: PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string>
In ReadSchema you can see which columns are read by Spark and whether they are really needed in your processing.
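If you would rather check this programmatically than by eye, one option is to pull the column names out of the ReadSchema entry of the plan text. The sketch below is only an illustration: read_schema_columns is a hypothetical helper name, it assumes the plan has already been captured as a string, and it assumes ReadSchema is the last entry on its line, as in the example above. Long schemas may be truncated in the plan output, in which case the result would be incomplete.

```python
import re


def read_schema_columns(plan_text):
    """Return the top-level column names listed in a 'ReadSchema: struct<...>' entry.

    Hypothetical helper: assumes ReadSchema is the last entry on its line.
    """
    match = re.search(r"ReadSchema: struct<(.*)>", plan_text)
    if match is None:
        return []
    body = match.group(1)

    columns = []
    depth = 0      # nesting level inside array<...>, struct<...>, map<...>
    current = ""
    for ch in body:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma separates two columns; keep only the name.
            columns.append(current.split(":", 1)[0].strip())
            current = ""
        else:
            current += ch
    if current:
        columns.append(current.split(":", 1)[0].strip())
    return columns


line = "PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string>"
print(read_schema_columns(line))
```

Comparing the returned list with the columns your job actually uses shows whether Spark pruned the rest on its own.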


Spark 2.4.6 + JDBC Reader: When predicate pushdown set to false, is data read in parallel by spark from the engine?

I am trying to extract data from a big table in SAP HANA, which is around 1.5 TB in size, and the best way is to run in parallel across nodes and threads. Spark JDBC is the perfect candidate for the task, but in order to actually extract in parallel it requires the partition column, lower/upper bound and number of partitions options to be set. To make the extraction easier to operate, I considered adding a partition column computed with the row_number() function and using MIN() and MAX() as the lower and upper bounds respectively. The operations team would then only need to provide the number of partitions.
The problem is that HANA runs out of memory and it is very likely that row_number() is too costly on the engine. I can only imagine that over 100 threads run the same query during every fetch to apply the where filters and retrieve the corresponding chunk.
So my question is: if I disable the predicate pushdown option, how does Spark behave? Is the data read by only one executor, with the filters then applied on the Spark side? Or does it do some magic to split the fetching part from the DB?
What could you suggest for extracting such a big table using the available JDBC reader?
Thanks in advance.
Before executing your primary query from Spark, run a pre-ingestion query to fetch the size of the dataset being loaded, i.e. Min(), Max() etc., as you have mentioned.
Assuming that the data is uniformly distributed between the Min and Max keys, you can partition across executors in Spark by providing Min/Max/Number of Executors.
You don't need (or want) to change your primary data source by adding additional columns to support data ingestion in this case.
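To make the Min/Max/number-of-partitions idea concrete, here is a rough sketch of how a key range could be cut into one WHERE clause per partition. It is a simplified illustration and not Spark's actual implementation; partition_predicates and the column name "id" are hypothetical, and it assumes an integer key that is roughly uniformly distributed.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper] into num_partitions range predicates.

    The first and last predicates are left open-ended so that rows
    outside the bounds (and NULL keys) are not silently dropped.
    """
    if num_partitions <= 1 or lower >= upper:
        return ["1=1"]  # a single partition reads everything

    stride = max((upper - lower) // num_partitions, 1)
    predicates = []
    bound = lower
    for i in range(num_partitions):
        lo = None if i == 0 else bound
        bound += stride
        hi = None if i == num_partitions - 1 else bound
        if lo is None:
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif hi is None:
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates


for predicate in partition_predicates("id", 0, 100, 4):
    print(predicate)
```

Each predicate corresponds to one query against the database, so every task fetches only its own slice. If the key is skewed, the slices will be uneven, which is why the uniform-distribution assumption matters.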

How to post-process Spark SQL results w/o using UDF

I read
It suggests not using UDFs, in order to save the deserialization/serialization cost.
In my case, I did a query like this
select MYFUN(f1, f2, ...)
from A ...
I use MYFUN to post-process the query results row by row, for example, sending them to another service.
def my_fun(f1, f2, ...):
    service.send(f1, f2, ...)
session.udf.register('MYFUN', my_fun)
Without using a UDF, I could save the query results to a Python data frame, or to a Parquet table on HDFS that is then read into a dataframe, and process the rows one by one.
The problem is that the result table is large, maybe 1M rows.
In such a case, does it still make sense to remove the UDF?
What is the best practice to populate a Spark SQL result to another service?
Python UDFs are not recommended from a performance point of view, but there is nothing wrong with using them when needed, as in this case: the serialization/deserialization cost is probably negligible compared to the I/O waits introduced by your send. So it probably doesn't make sense to remove the UDF.
In a more general case, there are two ways in which you can reduce the memory footprint of processing a dataframe. One, which you already mentioned, is to save to a file and process the file.
Another way is using toLocalIterator on your dataframe. It yields the rows one by one while pulling only one partition at a time to the driver, and you can repartition the dataframe first to make partitions of an arbitrary size:
df = df.repartition(100)
for row in df.toLocalIterator():
    service.send(*row)
This way your local memory requirements are reduced to the biggest partition of your repartitioned dataframe.
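If the per-call overhead of the service matters, the rows coming out of such an iterator could also be grouped into small batches before sending. The sketch below is plain Python and works on any iterable; batched is a hypothetical helper, and whether your service accepts several rows per call is an assumption.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for the rows of a dataframe; any iterator works the same way.
for batch in batched(range(5), 2):
    print(batch)
```

Only one batch is held in memory at a time on top of whatever the underlying iterator buffers.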

Spark: Most efficient way to sort and partition data to be written as parquet

My data is in principle a table, which contains a column ID and a column GROUP_ID, besides other 'data'.
In the first step I read CSVs into Spark, do some processing to prepare the data for the second step, and write the data as parquet.
The second step does a lot of groupBy('GROUP_ID') and Window.partitionBy('GROUP_ID').orderBy('ID').
The goal now is -- in order to avoid shuffling in the second step -- to efficiently load the data in the first step, as this is a one-timer.
Question Part 1: AFAIK, Spark preserves the partitioning when loading from parquet (which is actually the basis of any "optimized write consideration" to be made) - correct?
I came up with three possibilities:
df.orderBy('ID').repartition(n, 'TRIP_ID').write.parquet('/path/to/parquet')
df.repartition(n, 'TRIP_ID').sortWithinPartitions('ID').write.parquet('/path/to/parquet')
I would set n such that the individual parquet files would be ~100MB.
Question Part 2: Is it correct that the three options produce "the same"/similar results with regard to the goal (avoid shuffling in the 2nd step)? If not, what is the difference? And which one is 'better'?
Question Part 3: Which of the three options performs better regarding step 1?
Thanks for sharing your knowledge!
EDIT 2017-07-24
After doing some tests (writing to and reading from parquet) it seems that Spark is not able to recover partitionBy and orderBy information by default in the second step. The number of partitions (as obtained from df.rdd.getNumPartitions()) seems to be determined by the number of cores and/or by spark.default.parallelism (if set), but not by the number of parquet partitions. So the answer to question 1 would be WRONG, and questions 2 and 3 would be irrelevant.
So it turns out the REAL QUESTION is: is there a way to tell Spark, that the data is already partitioned by column X and sorted by column Y?
You probably will be interested in bucketing support in Spark.
See details here
.bucketBy(4, "id")
Notice that Spark 2.4 added support for bucket pruning (like partition pruning).
The more direct functionality you're looking for is Hive's bucketed-sorted tables.
This is not yet available in Spark (see PS section below).
Also notice that the sorting information will not be loaded by Spark automatically, but since the data is already sorted, the sorting operation on it will actually be much faster as there is not much work to do, e.g. one pass over the data just to confirm that it is already sorted.
Spark and Hive bucketing are slightly different.
This is the umbrella ticket to provide compatibility in Spark for bucketed tables created in Hive -
As far as I know, no, there is no way to read data from parquet and tell Spark that it is already partitioned by some expression and ordered.
In short, one file on HDFS etc. is too big for one Spark partition. And even if you read a whole file into one partition by playing with Parquet properties such as parquet.split.files=false, parquet.task.side.metadata=true etc., the cost would be higher than that of a single shuffle.
Try bucketBy. Also, partition discovery can help.
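As a rough illustration of why bucketing helps here: rows are assigned to a fixed number of buckets by hashing the key, so all rows with the same GROUP_ID end up in the same bucket, and a later group-by on that key does not need to move data between buckets. The sketch below is plain Python and not Spark code; bucket_id and bucketize are hypothetical names, and CRC32 is only a stand-in, since Spark uses its own hash function.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    # CRC32 is a deterministic stand-in, NOT the hash Spark uses.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key_field, sort_field, num_buckets):
    """Group rows into buckets by hashed key and sort each bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[key_field], num_buckets)].append(row)
    for bucket_rows in buckets.values():
        bucket_rows.sort(key=lambda r: r[sort_field])
    return dict(buckets)


rows = [
    {"GROUP_ID": g, "ID": i}
    for i, g in zip([7, 3, 5, 1, 6, 2, 4], [3, 1, 2, 1, 3, 2, 1])
]
for bucket, bucket_rows in sorted(bucketize(rows, "GROUP_ID", "ID", 4).items()):
    print(bucket, bucket_rows)
```

Because the bucket of a row depends only on its key and the bucket count, two datasets bucketed the same way on the same key line up bucket by bucket.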

Spark DataFrame vs sqlContext

For the purposes of comparison, suppose we have a table "T" with two columns "A","B". We also have a hiveContext operating in some HDFS database. We make a data frame:
In theory, which of the following is faster:
sqlContext.sql("SELECT A,SUM(B) FROM T GROUP BY A")
where "df" is a dataframe referring to T. For these simple kinds of aggregate operations, is there any reason why one should prefer one method over the other?
No, these should boil down to the same execution plan. Underneath, the Spark SQL engine uses the same optimization engine, the Catalyst optimizer. You can always check this yourself by looking at the Spark UI, or even by calling explain on the resultant DataFrame.
Spark developers have made a great effort to optimise this. The performance of the DataFrame API in Scala and of SQL is indistinguishable. Even for the DataFrame API in Python, the difference only shows up when data is collected to the driver.
It opens a new world
It doesn't have to be one vs. another
We can just choose whichever way we are comfortable with
The performance comparison published by Databricks

Which is efficient, Dataframe or RDD or hiveql?

I am a newbie to Apache Spark.
My job is to read two CSV files, select some specific columns from them, merge them, aggregate the result and write it into a single CSV file.
For example,
I want to get a third CSV file with
I am loading both the CSV into dataframes.
I am then able to get the third dataframe using several DataFrame methods such as join, select, filter and drop.
I am also able to do the same using several
And I am also able to do the same by executing HiveQL using HiveContext.
I want to know which is the most efficient way if my CSV files are huge, and why?
This blog contains the benchmarks. DataFrames are much more efficient than RDDs.
Here is the snippet from the blog
At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic.
Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
Here is the performance benchmark
Both DataFrames and Spark SQL queries are optimized using the Catalyst engine, so I would guess they will produce similar performance
(assuming you are using version >= 1.3).
And both should be better than simple RDD operations, because for RDDs Spark doesn't have any knowledge about the types of your data, so it can't do any special optimizations.
The overall direction for Spark is to go with DataFrames, so that queries are optimized through Catalyst.
