unable to launch more tasks in spark cluster - azure

I have a 6 node cluster with 8 cores and 32 GB RAM each. I am reading a simple CSV file from Azure Blob Storage and writing it to a Hive table.
When the job runs I see only a single task getting launched and a single executor working, while all the other executors and instances sit idle/dead.
How do I increase the number of tasks so the job can run faster?
Any help appreciated.

I'm guessing that your CSV file is in one block. Therefore your data is on only one partition, and since Spark "only" creates one task per partition, you only have one.
You can call repartition(X) on your dataframe/RDD just after reading it to increase the number of partitions. Reading won't be faster, but all your transformations and the writing will be parallelized.
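A minimal Scala sketch of that suggestion, assuming a hypothetical blob path, partition count, and Hive table name (spark is an existing SparkSession):

// Read the CSV; a single small file typically lands in one partition.
val df = spark.read
  .option("header", "true")
  .csv("wasbs://container@account.blob.core.windows.net/input/data.csv")  // hypothetical path

// Spread the rows across more partitions so downstream work runs as parallel tasks.
// 48 is an assumption (6 nodes x 8 cores); tune it to your cluster.
val repartitioned = df.repartition(48)

// The write to the Hive table is now parallelized across those tasks.
repartitioned.write.mode("overwrite").saveAsTable("default.my_table")  // hypothetical table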

Related

Partitioning in Spark 3.1 with Java

I am using Spark 3.1 with Java. In my code, I am writing the final result dataset to GCP storage, and it creates multiple files because my dataset is large. I am running the Spark job in a GCP Dataproc cluster. It is configured to use 250 worker nodes (each has 8 vCPUs). The Spark command is configured to run 2 executors per node and 3 cores for each executor. When the Spark job is triggered, the YARN ResourceManager shows only 25% of worker cores being used by containers per node. I also configured the shuffle partition count as 5500 (spark.sql.shuffle.partitions=5500), and I used
mydataset.coalesce(4500) to reduce the number of result files created in Cloud Storage. But it creates 5499 files for one of the datasets, which has nearly 45000 rows, and 3500 files for another dataset, which has nearly 85000 rows. It's really confusing on what basis it creates the file partitions. Can't I control that? Is there a default value? If yes, can I get that default value in Java code?
Thanks in Advance
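Since the question is about coalesce and partition counts, here is a small Scala sketch of the relevant API (the question uses Java, but the Dataset API is equivalent); mydataset is the dataset from the question:

// coalesce(n) narrows the existing partitioning without a shuffle.
val fewerFiles = mydataset.coalesce(4500)

// repartition(n) does a full shuffle and yields exactly n partitions.
val exactParts = mydataset.repartition(100)

// To inspect the partition count a dataset actually has (this is what drives
// the number of output files):
println(mydataset.rdd.getNumPartitions)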

How Spark and S3 interact

I'm wondering how data is loaded into Spark in the scenario below:
There is 10 GB of transaction data stored in S3 in Parquet format, and I'm going to run a Spark program to categorize every record in that 10 GB Parquet file (e.g. Income, Shopping, Dining).
I have the following questions:
How would this 10 GB be distributed to the different workers in the Spark cluster? Is the 10 GB file loaded into the Spark master, which then splits the data and sends it to the executors?
Does all of this happen in memory? What if one of the executors crashes during a job run: will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?
How would this 10 GB be distributed to the different workers in the Spark cluster? Is the 10 GB file loaded into the Spark master, which then splits the data and sends it to the executors?
Answer:
Spark follows a master-slave architecture. We have one master (driver/coordinator) and multiple distributed worker nodes. The driver process runs on the master node, and the main method of the program runs in the driver process. The driver process creates the SparkSession or SparkContext. The driver converts user code to tasks based on the transformation and action operations in the code, using the lineage graph. The driver creates the logical and physical plan, and once the physical plan is ready it coordinates with the cluster manager to get executors to complete the tasks. The driver just keeps track of the state of the data (metadata) for each of the executors.
So, the 10 GB file does not get loaded onto the master node. S3 is distributed storage, and Spark reads from it in splits. The driver process just decides how the data gets split and what each executor needs to work on. Even if you cache the data, it gets cached on the executor nodes only, based on the partitions/data that each executor is working on. Also, nothing gets triggered unless you call an action operation like count, collect, etc. Spark creates a lineage graph plus a DAG to keep track of this information.
Does all of this happen in memory? What if one of the executors crashes during a job run: will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?
Answer:
As answered in the first question, anything gets loaded into memory only when an action is performed. "Loaded into memory" does not mean loaded into the driver's memory. Depending on the action, data gets loaded into the memory of the driver or of the executors. If you use the collect operation, everything gets loaded into driver memory; but for another operation like count, if you have cached the dataframe, the data gets loaded into memory on each of the executor nodes.
Now if one of the executors crashes while the job runs, the driver has the lineage graph information and the metadata for the data that the crashed executor held, so it runs the same lineage graph on another executor and performs the task. This is what makes Spark resilient and fault tolerant.
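A hedged Scala sketch of the flow described above, with a hypothetical S3 path, column, and categorization rule, showing that nothing is read until an action runs and that cached data lives on the executors:

import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// Lazy: this only records the plan; no Parquet bytes are pulled from S3 yet.
val txns = spark.read.parquet("s3a://my-bucket/transactions/")  // hypothetical path

// Still lazy: an example categorization rule (not from the post).
val categorized = txns.withColumn("category",
  when(col("merchant") === "EMPLOYER", "Income")
    .when(col("merchant").isin("GROCER", "MALL"), "Shopping")
    .otherwise("Dining"))

// Mark the data to be kept on whichever executors hold each partition.
categorized.persist(StorageLevel.MEMORY_AND_DISK)

// Actions: count() aggregates on the executors and returns a single number to the
// driver; collect() would pull every row into driver memory, so avoid it for 10 GB.
println(categorized.count())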
Each worker will issue one or more GET requests on the ranges of the Parquet file it has been given; more as it seeks around the file. The whole 10 GB file is never loaded anywhere.
Each worker does its own read of its own split; this counts against the overall IO capacity of the store/shard.

SparkDataframe.load(): when I execute a load command, where does my data actually get stored?

If I am loading one table from Cassandra using spark dataframe.load(), where will my data get loaded? Is it in Spark memory, or in DataNode blocks if I am using the YARN resource manager?
It will try to store the data in memory, per the number of partitions, on the Worker Nodes, which in this context is a slightly better term than Data Nodes.
It will spill to disk if there is not enough memory on the Worker Nodes.
Processing occurs per the number of Cores / Executors. E.g. if you have, say, 20 Executors with 1 Core each, your processing concurrency is 20 and spilling will occur via eviction. If you run out of disk, an error will result.
Worker Nodes is a better term here than Data Nodes, unless you have HDFS and process locally, in which case a Worker Node is equal to a Data Node. Although you could argue what's in a name?
Of course, an Action will need to have been initiated.
And repartition, join, or union later in the data pipeline affect things, but that goes without saying.
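A short Scala sketch of that load path, assuming the DataStax spark-cassandra-connector is on the classpath; the keyspace and table names are hypothetical:

import org.apache.spark.storage.StorageLevel

// load() is lazy: it builds a plan against the Cassandra table; nothing lands on the driver.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // hypothetical names
  .load()

// Partitions are kept in executor memory and spill to the workers' local disks
// when memory runs short, matching the behaviour described above.
df.persist(StorageLevel.MEMORY_AND_DISK)

// Only an action actually pulls the data onto the workers.
println(df.count())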

Spark SQL slow execution with resource idle

I have a Spark SQL job that used to execute in under 10 minutes but has been running for 3 hours since a cluster migration, and I need to deep dive into what it's actually doing. I'm new to Spark, so please don't mind if I'm asking something unrelated.
I increased spark.executor.memory but had no luck.
Env: Azure HDInsight Spark 2.4 on Azure Storage
SQL: Read and Join some data and finally write result to a Hive metastore.
The spark.sql script ends with below code:
.write.mode("overwrite").saveAsTable("default.mikemiketable")
Application Behavior:
Within the first 15 minutes, it loads and completes most tasks (199/200); then only 1 executor process is left alive, continually doing shuffle read / write. Because only 1 executor is left, we need to wait 3 hours until the application finishes.
Left only 1 executor alive
Not sure what the executor is doing:
From time to time, we can tell the shuffle read increased:
Therefore I increased the spark.executor.memory to 20g, but nothing changed. From Ambari and YARN I can tell the cluster has many resources left.
Release of almost all executors
Any guidance is greatly appreciated.
I would like to start with some observations for your case:
From the task list you can see that Shuffle Spill (Disk) and Shuffle Spill (Memory) both have very high values. The max block size for each partition during the exchange of data should not exceed 2 GB, so you should try to keep the size of the shuffled data as low as possible. As a rule of thumb, the size of each partition should be ~200-500 MB. For instance, if the total data is 100 GB you need at least 250-500 partitions to keep the partition size within the mentioned limits (see the small sketch after this list).
The co-existence of the two previous points also means that the executor memory was not sufficient and Spark was forced to spill data to disk.
The duration of the tasks is too high. A normal task should last between 50-200 ms.
Too many killed executors is another sign that you are facing OOM problems.
Locality is RACK_LOCAL, which is considered one of the lowest values you can achieve within a cluster. Briefly, that means the task is being executed on a different node than the one where the data is stored.
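A tiny Scala sketch of that partition-size rule of thumb, using the figures quoted above:

// Aim for ~200-500 MB per shuffle partition; 256 MB is an assumed target.
val totalBytes      = 100L * 1024 * 1024 * 1024                               // e.g. 100 GB of shuffled data
val targetPartBytes = 256L * 1024 * 1024
val numPartitions   = math.ceil(totalBytes.toDouble / targetPartBytes).toInt  // = 400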
As a solution I would try the next few things (a short sketch follows the list):
Increase the number of partitions by using repartition() or via the Spark setting spark.sql.shuffle.partitions, to a number that meets the requirements above, i.e. 1000 or more.
Change the way you store the data and introduce partitioned data, i.e. day/month/year, using partitionBy.
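A hedged Scala sketch of both suggestions; the partition count, key column, and date columns are assumptions, and default.mikemiketable is the table from the question:

import org.apache.spark.sql.functions.col

// Raise shuffle parallelism so each shuffle partition stays in the 200-500 MB range.
spark.conf.set("spark.sql.shuffle.partitions", "1000")

// Or repartition explicitly before the heavy join/aggregation.
val balanced = df.repartition(1000, col("customer_id"))  // hypothetical key column

// Store the result partitioned by date so later reads can prune whole partitions.
balanced.write
  .mode("overwrite")
  .partitionBy("year", "month", "day")                   // hypothetical date columns
  .saveAsTable("default.mikemiketable")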

How will we come to know that the data is evenly distributed across the cluster in Spark?

How will we come to know that the data is evenly distributed across the cluster in Spark?
You can check this in the Spark Web UI, where you can see how many tasks are getting created and how they are executing on different nodes. You can also check whether your executors are getting skewed and taking time to write. You can also work through a real example: take a file of 15 GB and process it on your 4-node, 16 GB, 4-core machines. After reading, do a repartition to 10, do some simple aggregation, and write to some other directory. You will be able to see how parallel tasks are getting created and executed on the task nodes.
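A Scala sketch of that walk-through; the input path, grouping column, and output directory are hypothetical:

import org.apache.spark.sql.functions._

// Read the ~15 GB file; the input partitions show up as tasks in the Web UI.
val df = spark.read.option("header", "true").csv("hdfs:///data/big_file.csv")  // hypothetical path

// Repartition to 10 as suggested, then a simple aggregation.
val agg = df.repartition(10)
  .groupBy("some_key")                      // hypothetical column
  .agg(count(lit(1)).as("row_count"))

// Write to another directory and watch the Stages/Executors tabs in the Spark Web UI
// to see how the tasks spread across the worker nodes.
agg.write.mode("overwrite").parquet("hdfs:///data/output/")  // hypothetical directory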
