Spark on localhost - apache-spark

For testing purposes, while I don't have a production cluster, I am using Spark locally:
from pyspark import SparkConf, SparkContext

print('Setting SparkContext...')
sconf = SparkConf()
sconf.setAppName('myLocalApp')
sconf.setMaster('local[*]')
sc = SparkContext(conf=sconf)
print('Setting SparkContext...OK!')
Also, I am using a very, very small dataset, consisting of only 20 rows in a PostgreSQL database (~2 kB).
Also(!), my code is quite simple, only grouping the 20 rows by a key and applying a trivial map operation:
params = [object1, object2]

rdd = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)) \
            .groupByKey() \
            .mapValues(lambda value: self.__data_interpolation(value, params))

def __data_interpolation(self, data, params):
    # TODO: only for testing
    return data
What bothers me is that the whole execution takes about 5 minutes!!
Inspecting the Spark UI, I see that most of the time was spent in Stage 6: byKey method. (Stage 7, collect() method was also slow...)
Some info (numbers from the Spark UI):
These numbers make no sense to me... Why do I need 22 tasks, executing for 54 seconds, to process less than 1 kB of data?
Can it be a network issue, trying to figure out the ip address of localhost?
I don't know... Any clues?

It appears the main reason for the slow performance in your code snippet is the use of groupByKey(). The issue with groupByKey is that it ends up shuffling all of the key-value pairs, resulting in a lot of data being transferred unnecessarily. A good reference explaining this issue is Avoid GroupByKey.
To work around this issue, you can:
Try using reduceByKey, which should be faster (more info is also included in the Avoid GroupByKey link above).
Use DataFrames (instead of RDDs), as DataFrames include performance optimizations (and the DataFrame groupBy is faster than the RDD version). Also, since you're using Python, staying with DataFrames avoids the Python-to-JVM overhead of PySpark RDDs. More information on this can be seen in PySpark Internals. (A sketch of both options follows.)
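For illustration only (not from the original answer), here is a minimal PySpark sketch of both options, assuming df exposes the key columns a, b and c from the question:

from pyspark.sql import functions as F

# Option 1: reduceByKey -- combine values map-side instead of shuffling every
# key-value pair (here we simply build per-key lists as a placeholder combine).
rdd = (df.rdd
       .keyBy(lambda x: (x.a, x.b, x.c))
       .mapValues(lambda v: [v])
       .reduceByKey(lambda left, right: left + right))

# Option 2: stay in the DataFrame API, which benefits from Catalyst/Tungsten.
grouped = df.groupBy("a", "b", "c").agg(F.collect_list(F.struct(*df.columns)).alias("rows"))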
By the way, reviewing the Spark UI diagram above, the #22 refers to the task # within the DAG (not the number of tasks executed).
HTH!

I suppose "postgresql" is the key to solving that puzzle.
keyBy is probably the first operation that really uses the data, so its execution time is bigger because it needs to get the data from the external database. You can verify this by adding, at the beginning:
df.cache()
df.count() # to fill the cache
df.rdd.keyBy....
If I am right, you need to optimize the database. The issue may be:
A network issue (slow network to the DB server)
Complicated (and slow) SQL on the database (try it in the Postgres shell)
Authorization difficulties on the DB server
A problem with the JDBC driver you use

From what I have seen happening in my system while running Spark:
When we run a Spark job, it internally creates map and reduce tasks and runs them. In your case, to process the data you have, it created 22 such tasks. The bigger the data, the larger that number may be.
Hope this helps.

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use Pyspark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I want to make a Pyspark dataframe in the kernel function of my map function, resulting in an error:
from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()

file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use the plain-old Python (not PySpark) map function to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark dataframe, but that doesn't seem to leverage PySpark much. Does this seem like a good route? (A sketch of this route follows these questions.)
Can you teach me a good, ideally a "tied-for-best" way to do this?
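For concreteness, the route described in question 2 might look like the following sketch (not from the original post; it reuses the sc session and file_path_list from the snippet above, and unionByName assumes all files share the same schema):

import functools

# Read each file on the driver (lazily), then fold the DataFrames together.
dfs = [sc.read.options(header=True, delimiter=",").csv(p) for p in file_path_list]
big_df = functools.reduce(lambda a, b: a.unionByName(b), dfs)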
I don't have a definitive answer, just some comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work to the scheduler, which would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read does not happen immediately. I feel like if it worked, it would pass a DAG back, not data.
That route doesn't sound good, because you want the loading of files to be lazy. Given you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluate immediately, so you would run out of memory even sooner.
Speaking of memory, Spark lets you perform out-of-memory computation, but there are limits to how much larger than the available memory the data can be. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.
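A minimal sketch of the wildcard read, reusing the sc session from the question (the directory pattern data/*.csv is an assumption):

# One call reads every matching CSV in parallel and returns a single DataFrame
# (this assumes all the files share the same schema).
df = sc.read.options(header=True, delimiter=",").csv("data/*.csv")

# A list of explicit paths also works:
# df = sc.read.options(header=True, delimiter=",").csv(file_path_list)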

Spark SQL joining multiple tables design

I am developing a Spark SQL analytics solution using a set of tables. Suppose there are 5 tables which I need to build my solution, and finally I am creating one output table.
Here is my flow:
dataframe1 = table1 join table2
dataframe2 = dataframe1 join table3
dataframe3 = datamframe2 + filter + agg
dataframe4 = dataframe3 join table4 join table 5
// finally
dataframe4.saveAsTable
When I save the final dataframe, that's when all the above dataframes are evaluated.
Is my approach good? Or
do I need to cache/persist intermediate dataframes?
This is a very generic question and it is hard to provide a definitive answer.
Depending on the size of the tables, you would want to use a broadcast hint for any tables that are relatively small.
You can do this via
table_i.join(broadcast(table_j), ....)
Automatic broadcasting also depends on the value of spark.sql.autoBroadcastJoinThreshold.
Now, a broadcast hint will be honoured only if Spark is able to evaluate the size of the table, so you might need to cache() it first.
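For illustration, a hedged PySpark sketch of both the configuration knob and the explicit hint (spark, table_i, table_j and join_key are placeholder names):

from pyspark.sql.functions import broadcast

# Raise the automatic broadcast threshold (the default is 10 MB) so that small
# tables are broadcast without an explicit hint.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

# Or hint explicitly, regardless of the estimated table size:
result = table_i.join(broadcast(table_j), "join_key")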
Another option is Spark checkpoints, which can help truncate the logical plan for optimisation (this also allows you to resume jobs from the checkpoint location; it is similar to writing to HDFS, but with some overhead).
In case of broadcasting tables of a few hundred MB, you might need to increase your Kryo buffer:
--conf spark.kryoserializer.buffer.max=1g
It also depends on which join types you will use.
You would probably want to do filtering and aggregation as early as possible, since it will reduce the join surface.
There are many other considerations to take into account in order to properly optimise this. In case of a power-law distribution of join keys in any of the joins, you would need to do salting and explode the smaller table; a sketch is shown below.
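A minimal, hypothetical salting sketch (big_df, small_df, join_key and the bucket count are placeholder names, not from the question):

from pyspark.sql import functions as F

N = 8  # number of salt buckets
big_salted = big_df.withColumn("salt", (F.rand() * N).cast("int"))
small_exploded = small_df.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))

# Join on the original key plus the salt; hot keys are now spread over N tasks.
joined = big_salted.join(small_exploded, ["join_key", "salt"])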
In your case, in principle, there is not really a cache or persist required. Why?
As there are no reuse paths evident (for other Actions or other Transformations within the same Action), it is all sequential.
Also, lazy evaluation and Catalyst.
Try .explain and see how Spark will process the query.
However, due to memory eviction possibilities on the Cluster, there may be the need to re-compute on a Worker. There are various settings that you could apply via .cache and .persist, but Spark handles memory and disk spills without explicit .cache or .persist. See https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Also, using .cache can affect performance, so use .explain. See this excellent post: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
So, each case is different, but yours seems OK to answer as I have. In summary: an RDD or DF that is not cached or checkpointed is re-evaluated each time an Action is invoked on it, or when it is re-accessed within the current Action and no skipped-stage situation applies. In your case there is no issue; doing otherwise would in fact slow your app down.

Spark createDataFrame(df.rdd, df.schema) vs checkPoint for breaking lineage

I'm currently using
val df=longLineageCalculation(....)
val newDf=sparkSession.createDataFrame(df.rdd, df.schema)
newDf.join......
In order to save time when calculating plans. However, the docs say that checkpointing is the suggested way to "cut" lineage, BUT I don't want to pay the price of saving the RDD to disk.
My process is a batch process which is not-so-long and can be restarted without issues, so checkpointing is of no benefit to me (I think).
What are the problems which can arise using "my" method? (Docs suggest checkpointing, which is more expensive, instead of this one for breaking lineages, and I would like to know the reason.)
The only thing I can guess is that if some node fails after my "lineage breaking", maybe my process will fail while the checkpointed one would have worked correctly? (What if the DF is cached instead of checkpointed?)
Thanks!
EDIT:
From SMaZ's answer, my own knowledge and the article which he provided: using createDataFrame (which is a developer API, so use at "my"/your own risk) will keep the lineage in memory (not a problem for me, since I don't have memory problems and the lineage is not big).
With this, it looks (not 100% tested) like Spark should be able to rebuild whatever is needed if it fails.
As I'm not using the data in the following executions, I'll go with cache + createDataFrame rather than checkpointing (which, if I'm not wrong, is actually cache + saveToHDFS + "createDataFrame").
My process is not that critical (if it crashes), since a user will always be expecting the result and they launch it manually, so if it gives problems they can relaunch it (plus Spark will relaunch it) or call me. So I can take some risk anyway, but I'm 99% sure there's no risk :)
Let me start with the line that creates the dataframe:
val newDf=sparkSession.createDataFrame(df.rdd, df.schema)
If we take a close look at the SparkSession class, this method is annotated with @DeveloperApi. To understand what this annotation means, please take a look at the following lines from the DeveloperApi class:
A lower-level, unstable API intended for developers.
Developer API's might change or be removed in minor versions of Spark.
So it is not advised to use this method for production solutions; it is what the open-source world calls a "use at your own risk" implementation.
However, let's dig deeper into what happens when we call createDataFrame from an RDD. It calls the internalCreateDataFrame private method and creates a LogicalRDD.
LogicalRDD is created when:
Dataset is requested to checkpoint
SparkSession is requested to create a DataFrame from an RDD of internal binary rows
So it is essentially the same as the checkpoint operation, without saving the dataset physically. It just creates a DataFrame from an RDD of internal binary rows and a schema. This might truncate the lineage in memory, but not at the physical level.
So I believe it just adds the overhead of creating another RDD and cannot be used as a replacement for checkpoint.
Now, checkpointing is the process of truncating the lineage graph and saving the data to a reliable distributed/local file system (a minimal sketch follows the list below).
Why checkpoint?
If the computation takes a long time, or the lineage is too long, or it depends on too many RDDs
Keeping heavy lineage information comes at the cost of memory.
The checkpoint file will not be deleted automatically even after the Spark application terminates, so it can be used by other processes
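For illustration, a minimal PySpark sketch of checkpointing (the spark, df and directory names are assumptions; on a cluster the directory should be a reliable location like HDFS/S3):

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df_cp = df.checkpoint()   # eager by default: materialises df and truncates its lineage
df_cp.explain()           # the plan now starts from the checkpointed data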
What are the problems which can arise using "my" method? (Docs suggest checkpointing, which is more expensive, instead of this one for breaking lineages, and I would like to know the reason.)
This article gives detailed information on cache and checkpoint. IIUC, your question is more about where we should use checkpoint, so let's discuss some practical scenarios where checkpointing is helpful.
Let's take a scenario where we have one dataset on which we want to perform 100 iterative operations, where each iteration takes the last iteration's result as input (Spark MLlib use cases). During this iterative process the lineage keeps growing. Here, checkpointing the dataset at a regular interval (say, every 10 iterations) will ensure that in case of any failure we can restart the process from the last checkpoint, as in the sketch below.
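For example, such an iterative loop might look like the following sketch (one_iteration and dataset are hypothetical names, and a checkpoint directory is assumed to be set as in the sketch above):

result = dataset
for i in range(100):
    result = one_iteration(result)    # hypothetical per-iteration transformation
    if (i + 1) % 10 == 0:
        result = result.checkpoint()  # truncate the lineage every 10 iterations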
Let's take a batch example. Imagine we have a batch job which creates one master dataset with heavy lineage or complex computations. Then, at regular intervals, we receive new data which should use the earlier calculated master dataset. If we checkpoint our master dataset, it can be reused by all subsequent processes from different SparkSessions.
My process is a batch process which is not-so-long and can be restarted without issues, so checkpointing is of no benefit to me (I think).
That's correct: if your process does not involve heavy computation or a big lineage, then there is no point in checkpointing. The rule of thumb is: if your dataset is not used multiple times, and can be rebuilt faster than the time and resources taken by checkpoint/cache, then we should avoid it. That will give more resources to your process.
I think sparkSession.createDataFrame(df.rdd, df.schema) will impact the fault-tolerance properties of Spark.
But checkpoint() will save the RDD to HDFS or S3, and hence if a failure occurs, it will recover from the last checkpointed data.
And in case of createDataFrame(), it just breaks the lineage graph.

Spark code is taking a long time to return query. Help on speeding this up

I am currently running some Spark code and I need to query a data frame that is taking a long time (over 1 hour) per query. I need to query multiple times to check if the data frame is in fact correct.
I am relatively new to Spark and I understand that Spark uses lazy evaluation which means that the commands are executed only once I do a call for some action (in my case .show()).
Is there a way to do this process once for the whole DF and then quickly call on the data?
Currently I am saving the DF as a temporary table and then running queries in beeline (HIVE). This seems a little bit overkill as I have to save the table in a database first, which seems like a waste of time.
I have looked into the functions .persist and .collect, but I am confused about how to use them and how to query the data afterwards.
I would really like to learn the correct way of doing this.
Many thanks for the help in advance!!
Yes, you can keep your RDD in memory using rddName.cache() (or persist()). More information about RDD persistence can be found here.
Using a temporary table (registerTempTable in Spark 1.6 or createOrReplaceTempView in Spark 2.x) does not "save" any data. It only creates a view with the lifetime of your Spark session. If you wish to save the table, you should use .saveAsTable, but I assume that this is not what you are looking for.
Using .cache on an RDD is equivalent to .persist(StorageLevel.MEMORY_ONLY). If your table is large and thus can't fit in memory, you should use .persist(StorageLevel.MEMORY_AND_DISK).
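For example, a short sketch of that pattern (the column name is a placeholder):

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                            # an action that materialises the cache
df.filter(df["some_col"] > 0).show()  # later queries reuse the cached data
df.unpersist()                        # release the storage when done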
Also, it is possible that you simply need more nodes in your cluster. In case you are running locally, make sure you deploy with --master local[*] to use all available cores on your machine. If you are running on a standalone cluster or with a cluster manager like YARN or Mesos, you should make sure that all necessary/available resources are assigned to your job.

Writing a dataframe to disk taking an unrealistically long time in Pyspark (Spark 2.1.1)

I'm running Pyspark on a single server with multiple CPUs. All other operations (reading, joining, filtering, custom UDFs) are executed quickly, except for writing to disk. The dataframe I'm trying to save is around 400 GB in size, with 200 partitions.
sc.getConf().getAll()
The driver memory is 16g, and working directory has enough space (> 10TB)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
Wondering if anyone has run into the same issue. Also will changing any of the config parameters before pyspark is invoked help solve the issue?
Edits (a few clarifications):
When I say other operations were executed quickly, there was always an action after each transformation; in my case they were row counts. So all those operations were executed super fast. I still haven't figured out why writing takes such a ridiculous amount of time.
One of my colleagues brought up the fact that the disks in our server might have a limit on concurrent writing, which might be slowing things down; I'm still investigating this. I'm interested in knowing if others are seeing slow write times on a Spark cluster too. I have confirmation from one user regarding this on an AWS cluster.
All other operations (reading, joining, filtering, custom UDFs)
That is because these are transformations - they don't do anything until data has to be saved.
The dataframe I'm trying to save is of size around ~400 gb
(...)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
That just cannot work well. Even ignoring the part where you use a single machine, saving 400 GB with a single thread (!) is just hopeless. Even if it succeeds, it is no better than using a plain bash script.
Spark aside, sequential writes of 400 GB will take a substantial amount of time, even on an average-sized disk. And given the multiple shuffles (join, repartition), the data will be written to disk multiple times.
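For what it's worth, a sketch of the alternative: let the existing partitions write in parallel, or coalesce to a modest number instead of 1 (the output paths are placeholders):

# All 200 existing partitions write their own part files in parallel.
df.write.mode("overwrite").csv("out_dir")

# If fewer, larger files are wanted, coalesce to something modest rather than 1.
df.coalesce(32).write.mode("overwrite").csv("out_dir")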
After a lot of trial and error, I realized that the issue was due to the method I used to read the file from disk. I was using the built-in read.csv function, and when I switched over to the read function in the databricks-csv package the problem went away. I'm now able to write files to disk in a reasonable time. It's really strange; maybe it's a bug in 2.1.1, or the databricks-csv package is really well optimized.
1. The read.csv method
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("model") \
    .config("spark.worker.dir", "xxxx") \
    .getOrCreate()

df = spark.read.load("file.csv", format="csv", header=True)
df.write.csv("file_after_processing.csv")
2. Using the databricks-csv package
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')
