How are the task results being processed on Spark? - apache-spark

I am new to Spark and I am currently trying to understand its architecture.
As far as I know, the cluster manager assigns tasks to worker nodes and sends them partitions of the data. Once there, each worker node performs the transformations (like mapping etc.) on its own specific partition of the data.
What I don't understand is: where do all the results of these transformations from the various workers go? Are they sent back to the cluster manager / driver and reduced there (e.g. the sum of the values of each unique key)? If so, is there a specific way this happens?
It would be nice if someone could enlighten me; neither the Spark docs nor other resources on the architecture have been able to do so.

Good question, I think you are asking how a shuffle works...
Here is a good explanation.
When does shuffling occur in Apache Spark?
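As a hedged illustration of where the per-worker results end up, here is a minimal PySpark sketch (all names and values are made up): reduceByKey combines values within each partition first, then shuffles the partial results between executors, so the data never funnels through the driver until an action asks for it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

# Each worker applies map() to its own partition locally.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).map(lambda kv: (kv[0], kv[1] * 10))

# reduceByKey first combines values within each partition, then shuffles the
# partial sums between executors so all values for a key meet on one node.
totals = pairs.reduceByKey(lambda x, y: x + y)

# Only collect() sends the final, already-reduced results to the driver.
print(totals.collect())  # e.g. [('a', 40), ('b', 20)]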

Related

PySpark Parallelism

I am new to Spark and am trying to read data from a parquet file and, after some transformations, return it to the web UI in a paginated way. Everything works; no issue there.
Now I want to improve the performance of my application. After some searching on Google and Stack Overflow I found out about PySpark parallelism.
What I know is that:
PySpark parallelism works by default, and it creates parallel processes based on the number of cores the system has.
Also, for this to work the data should be partitioned.
Please correct me if my understanding is not right.
Questions/doubts:
I am reading data from one parquet file, so my data is not partitioned, and using the .repartition() method on my dataframe is expensive. So how should I use PySpark parallelism here?
Also, I could not find any simple implementation of PySpark parallelism that explains how to use it.
In a Spark cluster, one core reads one partition, so if you are on a multinode Spark cluster you need to leave some memory for the existing system manager, such as YARN.
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
You can use repartition and specify the number of partitions:
df.repartition(n)
where n is the number of partitions. Repartitioning is for parallelism; it will be less expensive than processing your single file without any partitioning.
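A minimal sketch of that suggestion, assuming a hypothetical file path and a target of 8 partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-parallelism").getOrCreate()

# A single small parquet file often loads as just one or a few partitions.
df = spark.read.parquet("s3://my-bucket/data.parquet")  # hypothetical path
print(df.rdd.getNumPartitions())

# Spread the rows across 8 partitions so 8 cores can work in parallel; the
# one-time shuffle cost is usually repaid by the parallel transformations.
df = df.repartition(8)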

PySpark: How to speed up sqlContext.read.json?

I am using the PySpark code below to read thousands of JSON files from an S3 bucket:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)
sqlContext.read.json("s3://bucket_name/*/*/*.json")
This takes a lot of time (~16 min) to read and parse the JSON files. How can I parallelize or speed up the process?
The short answer is: it depends, both on the underlying infrastructure and on the distribution within the data (called skew, which only matters when you're performing anything that causes a shuffle).
If the code you posted is being run on, say, AWS EMR or MapR, it's best to optimize the number of executors on each cluster node such that each executor has three to five cores. This number is important for reading from and writing to S3.
Another possible reason behind the slowness can be the dreaded corporate proxy. If all your requests to the S3 service are being routed via a corporate proxy, then the latter is going to be a huge bottleneck. It's best to bypass the proxy to the S3 service via the NO_PROXY JVM argument on the EMR cluster.
This talk from Cloudera, alongside their excellent blogs one and two, is an excellent introduction to tuning the cluster. Since we're using sqlContext.read.json, the underlying DataFrame will be split into the number of partitions given by the spark.sql.shuffle.partitions Spark SQL setting described here. It's best to set it to 2 * number of executors * cores per executor. That will speed up reading on a cluster where that calculated value exceeds the default of 200.
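As a sketch of that rule of thumb, assuming a hypothetical cluster of 25 executors with 5 cores each (so 2 * 25 * 5 = 250 shuffle partitions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-read")
         # Hypothetical sizing: 2 * 25 executors * 5 cores per executor.
         .config("spark.sql.shuffle.partitions", "250")
         .getOrCreate())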
Also, as mentioned in the other answer, if you know the schema of the JSON, it can speed things up to supply it explicitly rather than have Spark infer it.
I would also implore you to look at the Spark UI and dig into the DAG for slow jobs. It's an invaluable tool for performance tuning on Spark.
I am planning on consolidating as many infrastructure optimizations for AWS EMR as I can into a blog post, and will update the answer with the link once it is done.
There are at least two ways to speed up this process:
Avoid wildcards in the path if you can. If it is possible, provide a full list of paths to be loaded instead.
Provide the schema argument to avoid schema inference.
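A minimal sketch combining both suggestions, with a hypothetical two-field schema and explicit paths instead of wildcards:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("json-schema").getOrCreate()

# Hypothetical schema; supplying it spares Spark an extra pass over the
# files to infer one.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# Explicit paths (hypothetical) avoid listing the whole bucket tree on S3.
paths = ["s3://bucket_name/2017/01/a.json", "s3://bucket_name/2017/01/b.json"]
df = spark.read.json(paths, schema=schema)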

Spark Decision tree fit runs in 1 task

I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster in Amazon EMR. Even though I can see that there are around 50 Executors added and that the features are created by querying a Postgres database using SparkSQL and stored in a DataFrame.
The DesisionTree fit method takes for many hours even though the Dataset is not that big (10.000 db entries with a couple of hundreds of bytes each row).I can see that there is only one task for this so I assume this is the reason that it's been so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague, but I don't know whether the code that retrieves the data is relevant, whether it is a parameter of the algorithm (although I didn't find anything online), or whether it is just a matter of Spark tuning.
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place, hence Spark uses a single partition to process it. You could apply a repartition, or state the number of partitions you would like to use at load time, as in the sketch below. I would also look into the decision tree API and see if you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.
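Since the features come from Postgres via SparkSQL, one way to state the number of partitions at load time is a partitioned JDBC read; this is only a sketch, and the URL, table, column, and bounds below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dtree-input").getOrCreate()

# Without these options a JDBC read comes back as a single partition, which
# would explain a one-task fit. partitionColumn must be numeric.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/features")  # hypothetical
      .option("dbtable", "feature_table")                        # hypothetical
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "10000")
      .option("numPartitions", "8")
      .load())
print(df.rdd.getNumPartitions())  # should report 8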

Monitor Spark actual work time vs. communication time

On a Spark cluster, if the jobs are very small, I assume the cluster will be used inefficiently, since most of the time will be spent on communication between nodes rather than on utilizing the processors on the nodes.
Is there a way to monitor how much time out of a job submitted with spark-submit is wasted on communication, and how much on actual computation?
I could then monitor this ratio to check how efficient my file aggregation scheme or processing algorithm is in terms of distribution efficiency.
I looked through the Spark docs, and couldn't find anything relevant, though I'm sure I'm missing something. Ideas anyone?
You can see this information in the Spark UI, assuming you are running Spark 1.4.1 or higher (sorry, but I don't know how to do this for earlier versions of Spark).
A brief summary: you can view a timeline of all the events happening in your Spark job within the Spark UI. From there, you can zoom in on each individual job and each individual task. Each task is divided into scheduler delay, serialization / deserialization, computation, shuffle, etc.
Now, this is obviously a very pretty UI, but you might want something more robust so that you can check this info programmatically. It appears that you can use the REST API to export the logging info as JSON.
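As a rough sketch of the programmatic route, the monitoring REST API served by the Spark UI (default port 4040 on the driver; the host and port below are assumptions) exposes per-stage metrics such as executor run time and shuffle read size:

import json
import urllib.request

base = "http://localhost:4040/api/v1"  # driver UI; host/port may differ

# List the running applications, then pull stage-level metrics.
apps = json.load(urllib.request.urlopen(base + "/applications"))
app_id = apps[0]["id"]

stages = json.load(urllib.request.urlopen(
    "%s/applications/%s/stages" % (base, app_id)))

# Comparing executorRunTime against shuffle sizes gives a crude
# computation-versus-communication ratio per stage.
for s in stages:
    print(s["stageId"], s.get("executorRunTime"), s.get("shuffleReadBytes"))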

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
The master node in Spark allocates the resources for a particular job; once the resources are allocated, the driver ships the complete code with all its dependencies to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository like a database, filesystem, web service, etc.
Once loaded, the data is wrapped in an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions through the various RDD APIs, but you should do so only when you have valid reasons.
All operations are then performed on RDDs using the methods/operations exposed by the RDD API. The RDD keeps track of the partitions and the partitioned data, and depending on the need or request it automatically queries the appropriate partition.
In a nutshell, you do not have to worry about how the data is partitioned by the RDD, which partition stores which data, or how partitions communicate with each other; but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data, as sketched below.
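For the custom-partitioner point, here is a minimal RDD-level sketch (the tile keying scheme is made up for the image use case): partitionBy takes a target partition count and a function from key to partition index.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("custom-partitioner").getOrCreate()
sc = spark.sparkContext

# Key each image tile by (row, col) so neighbouring tiles can be co-located
# (hypothetical scheme for the image-processing case discussed above).
tiles = sc.parallelize([((row, col), "pixel-data")
                        for row in range(4) for col in range(4)])

def band_partitioner(key):
    # Send both rows of a 2-row band to the same partition.
    return key[0] // 2

partitioned = tiles.partitionBy(2, band_partitioner)
print(partitioned.glom().map(len).collect())  # tiles per partition, e.g. [8, 8]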
Secondly, if your data cannot be partitioned, I do not think Spark would be an ideal choice, because everything would end up being processed on a single machine, which is contrary to the idea of distributed computing.
I am not sure what exactly your use case is, but there are people who have been leveraging Spark for image processing; see here for the comments from Databricks.
