Spark - I cannot increase number of tasks in local mode - apache-spark

I tried to submit my application and vary coalesce(k) in my code across different combinations:
Firstly, I read some data from my local disk:
val df = spark.read.option("encoding", "gbk").option("wholeFile",true).option("multiline",true).option("sep", "|+|").schema(schema).csv("file:///path/to/foo.txt")
Situation 1
I think local[*] means there are 56 cores in total. And I specify 4 * 4 = 16 tasks:
spark-submit:
spark-submit --master local[*] --class foo --driver-memory 8g --executor-memory 4g --executor-cores 4 --num-executors 4 foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").partitionBy("date").orc("hdfs://xxx:9000/user/hive/warehouse/ods/foo")
But when I look at the Spark History Server UI, there is only 1 task. In the data set, the 'date' column has only a single value.
So I tried another combination and removed partitionBy:
Situation 2
spark-submit:
spark-submit --master local[*] --class foo foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").orc("hdfs://xxxx:9000/user/hive/warehouse/ods/foo")
But the history server shows there is still only 1 task.
There are 56 cores and 256GB memory on my local machine.
I know that in local mode Spark creates one JVM for both the driver and the executor, so we have one executor with as many cores as the machine has (56 here, when running with local[*]).
Here are the questions:
Could anyone explain why my task number is always 1?
How can I increase the number of tasks so that I can make use of parallelism?
Will my local file be read into different partitions?

Spark reads this CSV with only a single task, as there is only a single file (and with wholeFile/multiline enabled, a file cannot be split).
Compare that to files located in a distributed file system such as HDFS, where a single file can be stored in multiple blocks and read into multiple partitions. That means your resulting DataFrame df has only a single partition. You can check that using df.rdd.getNumPartitions. See also my answer on How is a Spark Dataframe partitioned by default?
Note that coalesce can only reduce the number of partitions (it avoids a shuffle by collapsing partitions already on the same worker), so calling coalesce(16) has no impact at all: the single partition of your DataFrame already sits on one worker.
In order to increase parallelism you may want to use repartition(16) instead.
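A minimal sketch of that suggestion, reusing the df and output path from the question (the counts in the comments are what you would typically see, not guaranteed):
println(df.rdd.getNumPartitions)   // typically 1 for a single multiline local CSV
val df16 = df.repartition(16)      // repartition shuffles and can increase the partition count
println(df16.rdd.getNumPartitions) // 16
df16.write.mode("overwrite").partitionBy("date").orc("hdfs://xxx:9000/user/hive/warehouse/ods/foo")
// the write stage now runs 16 tasks; with a single distinct 'date' value they all land in one date= directory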

Related

Dataproc Didn't Process Big Data in Parallel Using pyspark

I launched a Dataproc cluster in GCP, with one master node and 3 worker nodes. Every node has 8 vCPUs and 30 GB of memory.
I developed a PySpark job which reads one CSV file from GCS. The CSV file is about 30 GB in size.
df_raw = (
    spark
    .read
    .schema(schema)
    .option('header', 'true')
    .option('quote', '"')
    .option('multiline', 'true')
    .csv(infile)
)
df_raw = df_raw.repartition(20, "Product")
print(df_raw.rdd.getNumPartitions())
Here is how I submitted the PySpark job to Dataproc:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION}
I got a partition count of only 1.
I attached the node usage image here for your reference.
It seems only one vCore from one worker node was used.
How can I make this run in parallel with multiple partitions, using all nodes and more vCores?
I tried repartitioning to 20, but it still only used one vCore from one worker node, as below:
PySpark's default shuffle partition count is 200, so I was surprised to see that Dataproc didn't use all available resources for this kind of task.
This isn't a Dataproc issue, but a pure Spark/PySpark one.
In order to parallelize your data, it needs to be split into multiple partitions - a number larger than the number of executors (total worker cores) you have, e.g. roughly 2x or 3x that number.
There are various ways to do this, e.g.:
Split the data into multiple files or folders, parallelize the list of files/folders, and work on each one (or use a data source that already does this and preserves that partitioning when read into Spark).
Repartition your data after you get a Spark DataFrame, e.g. read the number of executor cores, multiply it by N, and repartition to that many partitions (see the sketch at the end of this answer). When you do this, you must choose columns which divide your data well, i.e. into many parts rather than only a few: for example by day or by customer ID, not by a status ID.
df = df.repartition(num_partitions, 'partition_by_col1', 'partition_by_col2')
The driver code runs on the master node, and the parallel stages are distributed among the worker nodes, e.g.:
df = (
    df.withColumn(...).select(...)...
    .write(...)
)
Since Spark transformations are lazy, they only run when you reach a step like write or collect, which forces the DataFrame to be evaluated.
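Here is a minimal sketch of the "executor cores times N" idea (written in Scala; the PySpark calls have the same names, the multiplier of 3 is only an example, and df_raw and the "Product" column are taken from the question):
import org.apache.spark.sql.functions.col
// defaultParallelism is the total number of cores across all executors
val targetPartitions = spark.sparkContext.defaultParallelism * 3
// repartition by a column that divides the data well; "Product" is the column from the question
val dfRepart = df_raw.repartition(targetPartitions, col("Product"))
println(dfRepart.rdd.getNumPartitions)   // now equals targetPartitions
// note: if "Product" has only a few distinct values, most partitions will be empty;
// a plain repartition(targetPartitions) spreads rows evenly instead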
You might want to try increasing the number of executors by passing Spark configuration via the --properties flag of the Dataproc command line, for example:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION} \
--properties=spark.executor.instances=5

What happens when we allocate more executors to a Spark job than the number of partitions of a Kafka topic

Suppose I am running a Spark batch job and I set
--num-executors 40
The job reads a kafka topic with 20 partitions.
The job writes to a kafka topic with 20 partitions.
My questions are:
How many executors will be used by the Spark job
a. while reading from Kafka?
b. while writing to Kafka?
What changes when I set the parameter below while running the same job with 40 executors?
--conf spark.dynamicAllocation.enabled=false
First of all, to answer the question directly: Spark will use only 20 executors (matching the number of input Kafka partitions); the remaining executors will not be allocated any tasks.
Also, executor usage depends on the transformations and actions you perform on the data. For example:
If you apply a foreach, the partition count stays the same, so the executors used stay the same.
If you apply a map and then repartition, executors will be used according to the new partition count.
The best practice is to maintain 2 to 3 times the default number of partitions.
So once you have the RDD, use sparkContext.defaultParallelism to get the default parallelism, and then repartition the RDD to 2 to 3 times that value.
It should look like this:
val newRDD = rdd.repartition(2 * sparkContext.defaultParallelism)
If spark.dynamicAllocation.enabled=false, then Spark cannot allocate additional executors based on the load; it keeps the fixed number you requested.
Always use spark.dynamicAllocation.enabled=true and repartition the RDD to 2 to 3 times the default parallelism.
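As a sketch of how this plays out with the DataFrame Kafka batch source (the broker address and topic name are placeholders; the answer above uses the RDD API, but the partitioning behaviour is the same):
val kafkaDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "input_topic")                  // placeholder topic with 20 partitions
  .load()
println(kafkaDf.rdd.getNumPartitions)   // 20: one Spark partition per Kafka topic partition
// spread the work over more tasks before heavy transformations, per the 2-3x guideline above
val widened = kafkaDf.repartition(2 * spark.sparkContext.defaultParallelism)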

Need help in understanding pyspark execution on yarn as Master

I already have some picture of the YARN architecture as well as the Spark architecture. But when I try to understand them together (that is, what happens when a Spark job runs on YARN as master) on a Hadoop cluster, I get confused. So first I will describe my understanding with the example below, and then I will come to my confusions.
Say I have a file "orderitems" stored on HDFS with some replication factor.
Now I am processing the data by reading this file into a Spark RDD (say, for calculating order revenue).
I have written the code and configured the spark submit as given below
spark-submit \
--master yarn \
--conf spark.ui.port=21888 \
--num-executors 2 \
--executor-memory 512M \
src/main/python/order_revenue.py
Let's assume that I have created the RDD with 5 partitions and that I run in yarn-client mode.
Now, as per my understanding, once I submit the Spark job on YARN:
The request goes to the Application Manager, which is a component of the Resource Manager.
The Application Manager will find one Node Manager and ask it to launch a container.
This is the first container of the application, and we call it the Application Master.
The Application Master takes over the responsibility of executing and monitoring the job.
Since I have submitted in client mode, the driver program will run on my edge node/gateway node.
I have provided num-executors as 2 and executor memory as 512 MB.
Also, I have set the number of partitions for the RDD to 5, which I assume means it will create 5 partitions of the data read and distribute them over 5 nodes.
Now here are my confusions about this:
I have read in the user guide that partitions of an RDD will be distributed to different nodes. Are these nodes the same as the 'Data Nodes' of the HDFS cluster? I mean, there are 5 partitions here; does that mean they sit on 5 data nodes?
I have mentioned num-executors as 2, so these 5 partitions of data will utilize 2 executors (CPU). So my next question is: from where will these 2 executors (CPU) be picked? I mean, the 5 partitions are on 5 nodes, right? So are these 2 executors also on some of those nodes?
The Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc. Also, a container is a Linux control group, which is a Linux kernel feature that lets users allocate CPU, memory, disk I/O, and bandwidth to a user process. So my final question is: are containers actually provided by the "scheduler"?
I am confused here. I have referred to the architecture, the release documents, and some videos and got mixed up.
Expecting helping hands here.
To answer your questions first:
1) Very simply, an executor is Spark's worker process and the driver is the manager process; they have nothing to do with Hadoop data nodes directly. Think of executors as processing units (say 2 here): repartition(5) divides the data into 5 chunks, and on some basis those chunks are distributed among the 2 executors for processing. Repartitioning data does not create nodes.
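A tiny sketch of that point, with hypothetical values (2 executors as in the question's --num-executors, and a made-up HDFS path):
// submitted with --num-executors 2 (illustrative)
val orderItems = sc.textFile("hdfs:///user/foo/orderitems")   // hypothetical path
val fiveParts = orderItems.repartition(5)
println(fiveParts.getNumPartitions)   // 5 partitions -> 5 tasks, run wherever the 2 executors
                                      // have free cores, not necessarily on the data nodes
                                      // that hold the original HDFS blocks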
Diagrams (not reproduced here): Spark cluster architecture; Spark on YARN client mode; Spark on YARN cluster mode.
For other details you can read the blog posts https://sujithjay.com/2018/07/24/Understanding-Apache-Spark-on-YARN/ and https://0x0fff.com/spark-architecture/

Spark job fails when cluster size is large, succeeds when small

I have a spark job which takes in three inputs and does two outer joins. The data is in key-value format (String, Array[String]). Most important part of the code is:
val partitioner = new HashPartitioner(8000)
val joined = inputRdd1.fullOuterJoin(inputRdd2.fullOuterJoin(inputRdd3, partitioner), partitioner).cache
saveAsSequenceFile(joined, filter="X")
saveAsSequenceFile(joined, filter="Y")
I'm running the job on EMR with r3.4xlarge driver node and 500 m3.xlarge worker nodes. The spark-submit parameters are:
spark-submit --deploy-mode client --master yarn-client --executor-memory 3g --driver-memory 100g --executor-cores 3 --num-executors 4000 --conf spark.default.parallelism=8000 --conf spark.storage.memoryFraction=0.1 --conf spark.shuffle.memoryFraction=0.2 --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.network.timeout=600s
UPDATE: with these settings, the number of executors seen in the Spark jobs UI was 500 (one per node).
The exception I see in the driver log is the following:
17/10/13 21:37:57 WARN HeartbeatReceiver: Removing executor 470 with no recent heartbeats: 616136 ms exceeds timeout 600000 ms
17/10/13 21:39:04 ERROR ContextCleaner: Error cleaning broadcast 5
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [600 seconds]. This timeout is controlled by spark.network.timeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
...
Some of the things I tried that failed:
I thought the problem might be that too many executors were being spawned and the driver had the overhead of tracking them all. I tried reducing the number of executors by increasing executor-memory to 4g. This did not help.
I tried changing the instance type of the driver to r3.8xlarge; this did not help either.
Surprisingly, when I reduce the number of worker nodes to 300, the job runs fine. Does anyone have any other hypothesis on why this would happen?
Well, this is partly a problem of understanding how Spark's resource allocation works.
According to your information, you have 500 nodes with 4 cores each, so you have 2,000 cores in total. What you are doing with your request is creating 4,000 executors with 3 cores each, which means you are requesting 12,000 cores for your cluster, and there is nothing like that available.
This RPC timeout error is regularly associated with how many JVMs you start on the same machine; the machine is then unable to respond in time because too much is happening at once.
You need to know that --num-executors is best matched to your number of nodes, and the number of cores should match the cores you have in each node.
For example, the configuration of an m3.xlarge is 4 cores with 15 GB of RAM. What is the best configuration to run a job there? That depends on what you are planning to do. If you are going to run just one job, I suggest you set it up like this:
spark-submit --deploy-mode client --master yarn-client --executor-memory 10g --executor-cores 4 --num-executors 500 --conf spark.default.parallelism=2000 --conf spark.yarn.executor.memoryOverhead=4000
This will allow your job to run fine. If you don't have a problem fitting your data onto your workers, it is better to keep spark.default.parallelism at 2000, or you are going to lose a lot of time in shuffles.
But the best approach, I think, is to keep the dynamic allocation that EMR enables by default: just set the number of cores, the parallelism, and the memory, and your job will run like a charm.
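A sketch of that last suggestion as code, in case you prefer setting it when building the session (the values mirror the suggested spark-submit above; the app name is made up):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("emr-outer-join-job")                       // hypothetical name
  .config("spark.dynamicAllocation.enabled", "true")   // EMR enables this by default
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "10g")
  .config("spark.default.parallelism", "2000")
  .getOrCreate()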
I experimented with a lot of configurations, modifying one parameter at a time, with 500 nodes. I finally got the job to work by lowering the number of partitions in the HashPartitioner from 8000 to 3000.
val partitioner = new HashPartitioner(3000)
So probably the driver is overwhelmed by the large number of shuffle blocks it has to track when there are more partitions, and hence the lower partition count helps.

How to write huge data (almost 800 GB) as a Hive ORC table in HDFS using Spark?

I have been working on a Spark project for the last 3-4 months.
I am doing some calculations with a huge history file (800 GB) and a small incremental file (3 GB).
The calculation happens very fast in Spark using hqlContext and DataFrames, but writing the calculated result as a Hive table in ORC format, which will contain almost 20 billion records with a data size of almost 800 GB, takes too much time (more than 2 hours) and finally fails.
My cluster details are: 19 nodes, 1.41 TB of total memory, 361 total vCores.
For tuning I am using
--num-executors 67
--executor-cores 6
--executor-memory 60g
--driver-memory 50g
--driver-cores 6
--master yarn-cluster
--total-executor-cores 100
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"
at run time.
If I take a count of the result, it completes within 15 minutes, but if I want to write that result to HDFS as a Hive table
[ UPDATED_RECORDS.write.format("orc").saveAsTable("HIST_ORC_TARGET") ]
then I am facing the above issue.
Please provide me with a suggestion or anything regarding this, as I have been stuck on this for the last couple of days.
Code format:
val BASE_RDD_HIST = hqlContext.sql("select * from hist_orc")
val BASE_RDD_INCR = hqlContext.sql("select * from incr_orc")
some spark calculation using dataframe, hive query & udf.....
Finally:
result.write.format("orc").saveAsTable("HIST_ORC_TARGET_TABLE")
Hello friends, I found the answer to my own question a few days back, so I am writing it here.
Whenever we execute a Spark program without specifying the queue parameter, the default queue may have limitations that do not allow you to run as many executors or tasks as you want. This can cause slow processing and, later on, job failure with memory issues because you are running fewer executors/tasks than needed. So don't forget to mention a queue name in your execution command:
spark-submit --class com.xx.yy.FactTable_Merging.ScalaHiveHql
--num-executors 25
--executor-cores 5
--executor-memory 20g
--driver-memory 10g
--driver-cores 5
--master yarn-cluster
--name "FactTable HIST & INCR Re Write After Null Merging Seperately"
--queue "your_queue_name"
/tmp/ScalaHiveProgram.jar
/user/poc_user/FactTable_INCR_MERGED_10_PARTITION
/user/poc_user/FactTable_HIST_MERGED_50_PARTITION
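Equivalently, the queue can be set from code via the standard spark.yarn.queue property (a sketch; the queue name is a placeholder, and this must be set before the SparkContext is created):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("FactTable HIST & INCR Re Write")
  .set("spark.yarn.queue", "your_queue_name")   // same effect as --queue on spark-submit
val sc = new SparkContext(conf)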
