I am currently running some Spark code and I need to query a data frame that is taking a long time (over 1 hour) per query. I need to query multiple times to check if the data frame is in fact correct.
I am relatively new to Spark and I understand that Spark uses lazy evaluation which means that the commands are executed only once I do a call for some action (in my case .show()).
Is there a way to do this process once for the whole DF and then quickly call on the data?
Currently I am saving the DF as a temporary table and then running queries in beeline (HIVE). This seems a little bit overkill as I have to save the table in a database first, which seems like a waste of time.
I have looked into the following functions .persist, .collect but I am confused on how to use them and query from them.
I would really like to learn the correct way of doing this.
Yes, you can keep your RDD in memory using rddName.cache() (or persist()). More information can be found in the RDD Persistence section of the Spark programming guide.

Using a temporary table (registerTempTable in Spark 1.6 or createOrReplaceTempView in Spark 2.x) does not "save" any data. It only creates a view with the lifetime of your Spark session. If you wish to save the table, you should use .saveAsTable, but I assume that this is not what you are looking for.
For an RDD, .cache() is equivalent to .persist(StorageLevel.MEMORY_ONLY); for a DataFrame it defaults to MEMORY_AND_DISK. If your data is large and can't fit in memory, you should use .persist(StorageLevel.MEMORY_AND_DISK) explicitly.
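To make the cache/persist advice concrete, here is a minimal sketch. The helper name materialize is made up for illustration, and it assumes df is the slow DataFrame from the question. The point is that cache() and persist() are lazy themselves, so one action has to run before later queries become fast.

```python
def materialize(df, storage_level=None):
    """Mark `df` for caching, then run one action so the cache is filled.

    `materialize` is a hypothetical helper name, not part of the Spark API.
    """
    if storage_level is None:
        df.cache()                 # use Spark's default storage level
    else:
        df.persist(storage_level)  # e.g. StorageLevel.MEMORY_AND_DISK
    df.count()                     # an action: runs the expensive plan once
    return df

# Usage, assuming `df` already exists (`some_column` is hypothetical):
#   from pyspark import StorageLevel
#   df = materialize(df, StorageLevel.MEMORY_AND_DISK)
#   df.filter(df.some_column > 0).show()   # served from the cache
#   df.unpersist()                         # release the cache when done
```

The first count() still pays the full cost; only the queries after it should be quick, and only for as long as the cached partitions are not evicted.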
PySpark: pull data to driver and then upload to dataframe

I am trying to create a pyspark dataframe from data stored in an external database. I use the pyodbc module to connect to the database and pull the required data, after which I use spark.createDataFrame to send my data to the cluster for analysis.
I run the script using --deploy-mode client, so the driver runs on the master node, but the executors can be distributed to other machines. The problem is pyodbc is not installed on any of the worker nodes (this is fine since I don't want them all querying the database anyway), so when I try to import this module in my scripts, I get an import error (unless all the executors happen to be on the master node).
My question is how can I specify that I want a certain portion of my code (in this case, importing pyodbc and querying the database) to run on the driver only? I am thinking something along the lines of
if __name__ == '__driver__':
<do stuff>
<wait until stuff is done>
The imports in your Python driver script DO run only on the driver. The only time you will see errors on your executors about missing imports is when a function you pass to an RDD/DataFrame operation (and which is therefore shipped to the executors) references some object or function from one of those imports. I would look carefully at any Python code you are running in RDD/DataFrame calls for unintended references. If you post your code, we can give you more specific guidance.
Also, routing data through your driver is usually not a great idea because it will not scale well. If you have lots of data, you are going to force it all through a single point, which defeats the purpose of distributed processing!
Depending on which database you are using, there is probably a Spark connector that loads it directly into a DataFrame. If you are using ODBC, then maybe you are using SQL Server? In that case you should be able to use the JDBC driver instead, like for example in this post:
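A rough sketch of the JDBC route, assuming the database really is SQL Server and that Microsoft's JDBC driver jar is available to Spark (for example via --jars). The helper names, host, database and table below are placeholders, not anything from the question.

```python
def sqlserver_jdbc_options(host, port, database, table, user, password):
    """Build options for Spark's built-in JDBC source (SQL Server assumed)."""
    return {
        "url": "jdbc:sqlserver://{}:{};databaseName={}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

def load_table(spark, options):
    """Read the table through the executors instead of through the driver."""
    return spark.read.format("jdbc").options(**options).load()

# Usage, assuming an existing SparkSession `spark` (all values are placeholders):
#   opts = sqlserver_jdbc_options("dbhost", 1433, "mydb", "dbo.mytable", "user", "secret")
#   df = load_table(spark, opts)
```

With this approach pyodbc is not needed at all, so the missing module on the worker nodes stops being an issue.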
This is not how Spark is supposed to work. Spark collections (RDDs or DataFrames) are inherently distributed. What you're describing is creating a dataset locally, by reading the whole dataset into the driver's memory, and then sending it over to the executors for further processing by creating an RDD or DataFrame out of it. That does not make much sense.
How to execute some instructions on selected nodes in a cluster?

I don't have any RDD to use, I just want to execute some of my own functions on some nodes of my cluster, with Apache Spark. So I don't have any data to distribute, but only code (which depends on the node that is executing it).
Is it possible? Is Spark compatible with this goal?
Is it possible?
I think it is possible, and I've been asked about it a few times already (so I've had time to think about it :))
Is Spark compatible with this goal?
The way Spark could handle it is to launch as many executors as there are nodes you want to use for the distributed work. It is the cluster manager's job to spread the work across the nodes of a cluster, so Spark can only use the nodes it is given.
With the nodes assigned, you simply execute a computation on a fake dataset that you build an RDD on top of.
If the computation runs on a node that should not be used, you can check the hostname inside the code to see which node you are on and decide whether to continue or stop.
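Here is a small sketch of the fake-dataset idea with a hostname check. The helper only_on_hosts and the node names are invented for illustration, and note that Spark does not promise one partition per node, so a node may run the function several times or not at all.

```python
import socket

def only_on_hosts(task, allowed_hosts):
    """Wrap `task` so it only does work on hosts listed in `allowed_hosts`."""
    def wrapper(_):
        host = socket.gethostname()
        if host not in allowed_hosts:
            return (host, None)   # wrong node: skip the work
        return (host, task())     # right node: run the user function
    return wrapper

# Usage, assuming an existing SparkContext `sc` and a user function `my_function`:
#   n = 8  # number of dummy partitions, roughly one per executor
#   results = (sc.parallelize(range(n), n)
#                .map(only_on_hosts(my_function, {"node-1", "node-2"}))
#                .collect())
```

The dummy RDD carries no real data; it only exists so that Spark schedules n tasks across the executors it was given.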
Spark on localhost

For testing purposes, while I don't have a production cluster, I am using Spark locally:
from pyspark import SparkConf, SparkContext

print('Setting SparkContext...')
sconf = SparkConf()
sc = SparkContext(conf=sconf)
print('Setting SparkContext...OK!')
Also, I am using a very small dataset, consisting of only 20 rows in a PostgreSQL database (~2 KB).
Also(!), my code is quite simple as well: it only groups the 20 rows by a key and applies a trivial map operation:
params = [object1, object2]
rdd = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)) \
    .groupByKey() \
    .mapValues(lambda value: self.__data_interpolation(value, params))

def __data_interpolation(self, data, params):
    # TODO: only for testing
    return data
What bothers me is that the whole execution takes about 5 minutes!!
Inspecting the Spark UI, I see that most of the time was spent in Stage 6: byKey method. (Stage 7, collect() method was also slow...)
Some info:
These numbers make no sense to me... Why do I need 22 tasks, executing for 54 sec, to process less than 1 KB of data?
Can it be a network issue, trying to figure out the ip address of localhost?
I don't know... Any clues?
It appears the main reason for the slow performance in your code snippet is the use of groupByKey(). The issue with groupByKey is that it shuffles all of the key-value pairs, so a lot of data is transferred unnecessarily. A good reference that explains this issue is Avoid GroupByKey.
To work around this issue, you can:
Try using reduceByKey which should be faster (more info is also included in the above Avoid GroupByKey link).
Use DataFrames (instead of RDDs) as DFs include performance optimizations (and the DF GroupBy statement is faster than the RDD version). As well, as you're using Python, you can avoid the Python-to-JVM issues with PySpark RDDs. More information on this can be seen in PySpark Internals
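As a rough illustration of the reduceByKey suggestion, the sketch below computes a per-key mean. The mean is a stand-in chosen for the example; whether your interpolation can be written as a pairwise merge like this is something only you can tell. The function names are hypothetical.

```python
def merge_sum_count(a, b):
    """Combine two (sum, count) pairs; associative, so Spark can pre-combine per partition."""
    return (a[0] + b[0], a[1] + b[1])

def mean_by_key(pair_rdd):
    """Per-key mean of a (key, number) RDD without groupByKey."""
    return (pair_rdd
            .mapValues(lambda v: (v, 1))               # value -> (sum, count)
            .reduceByKey(merge_sum_count)              # merged before and after the shuffle
            .mapValues(lambda s: s[0] / float(s[1])))  # (sum, count) -> mean

# Usage, assuming `df` has key columns a, b, c and a numeric column
# `value` (the column name `value` is an assumption):
#   pairs = df.rdd.map(lambda x: ((x.a, x.b, x.c), x.value))
#   means = mean_by_key(pairs).collect()
#
# DataFrame alternative that stays in the JVM:
#   from pyspark.sql import functions as F
#   df.groupBy("a", "b", "c").agg(F.avg("value")).show()
```

With only 20 rows the shuffle volume is tiny either way, so this is more about the general habit than about explaining a 5-minute run.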
By the way, reviewing the Spark UI diagram above, the #22 refers to the task # within the DAG (not the number of tasks executed).
I suppose PostgreSQL is the key to solving that puzzle.
keyBy is probably the first operation that really uses the data, so its execution time is longer because it needs to fetch the data from the external database. You can verify this by adding at the beginning:
df.count() # forces the data to be read from the database first
If I am right, you need to optimize the database access. The cause may be:
Network issue (slow network to DB server)
Complicated (and slow) SQL on this database (try running it in the psql shell)
Some authorization difficulties on DB server
Problem with JDBC driver you use
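To check whether the database read really dominates, something like the timing helper below could be wrapped around the suggested df.count() and around the later stages. timed is a hypothetical helper, and the numbers it gives are only wall-clock time as seen from the driver.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.time()
    result = action()
    return result, time.time() - start

# Usage, assuming `df` and `rdd` are the objects from the question:
#   rows, read_secs = timed(df.count)   # mostly the read from PostgreSQL
#   out, calc_secs = timed(rdd.collect) # keyBy/groupByKey/mapValues plus the read again
#   print(read_secs, calc_secs)
```

If read_secs already accounts for most of the five minutes, the Spark code is probably not the place to look.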
From what I have seen happening in my system while running spark:
When we run a Spark job, it internally creates map and reduce tasks and runs them. In your case, to process the data you have, it created 22 such tasks. The bigger the data, the larger this number may be.
Spark Decision tree fit runs in 1 task

I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster in Amazon EMR. I can see that around 50 executors are added. The features are created by querying a Postgres database using SparkSQL and stored in a DataFrame.
The DecisionTree fit method takes many hours even though the dataset is not that big (10,000 db entries with a couple of hundred bytes per row). I can see that there is only one task for this, so I assume this is the reason it is so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague but I don't know if the code that retrieves the data is relevant, or is it a parameter in the algorithm (although I didn't find anything online), or is it just Spark tuning?
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place, so Spark uses a single partition to process it. You could apply a repartition, or state the number of partitions you would like to use at load time. I would also look into the decision tree API and see if you can set the number of partitions for it specifically.
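Both options can be sketched roughly as follows, assuming the features come in through Spark's JDBC source and the table has a numeric column to split on. The helper name, the column name id and the bounds are assumptions for illustration only.

```python
def partitioned_jdbc_options(base_options, column, lower, upper, num_partitions):
    """Return a copy of `base_options` with Spark's JDBC partitioning options added."""
    opts = dict(base_options)            # leave the caller's dict untouched
    opts.update({
        "partitionColumn": column,       # must be a numeric, date or timestamp column
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    })
    return opts

# Usage, assuming an existing SparkSession `spark` and a dict `base` holding
# the url/dbtable/user/password options (`id` is a hypothetical column):
#   opts = partitioned_jdbc_options(base, "id", 1, 10000, 50)
#   df = spark.read.format("jdbc").options(**opts).load()
#   print(df.rdd.getNumPartitions())
#
# Or, for a DataFrame that was already loaded into a single partition:
#   df = df.repartition(50)
```

For only 10,000 rows, 50 partitions may well be more than is useful; the number is just there to match the executor count mentioned in the question.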
Baseline for measuring Apache Spark jobs execution times

I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.
I use Spark to compute dynamic reports from data stored in a NoSQL database (Cassandra). So far I have created several reports and they are computed correctly. Inside them I use DataFrame .unionAll(), .join(), .count(), .map(), etc.
I am running a 1.4.1 Spark cluster on my local machine with the following setup:
I have also populated the database with test data which is around 10-12k records per table.
By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.
I have configured Spark to use the KryoSerializer, I have set the compression codec to lzf, and I have optimized the jobs as much as my knowledge allows.
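For readers who want to try the same tuning, the settings described above might look like the sketch below. The name of the compression setting is missing from the post, so spark.io.compression.codec is an assumption about what was meant; the helper apply_settings is hypothetical.

```python
TUNING = {
    # Kryo serialization, as mentioned in the post.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # ASSUMPTION: the post does not name the setting that was switched to lzf.
    "spark.io.compression.codec": "lzf",
}

def apply_settings(conf, settings):
    """Copy every key/value pair onto a SparkConf-like object and return it."""
    for key, value in settings.items():
        conf.set(key, value)
    return conf

# Usage:
#   from pyspark import SparkConf, SparkContext
#   conf = apply_settings(SparkConf(), TUNING)
#   sc = SparkContext(conf=conf)
```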
This brought the jobs down to 20s-30s (which I think is a good improvement). The problem is that, because this is my first Spark project, I have no baseline to compare the job times against, so I have no idea whether the execution is slow or fast and whether there is some problem in the code or with the Spark config.
What is the best way to proceed? Is there a graph or benchmark that shows how much time an action with N data should take?
