Dataproc Didn't Process Big Data in Parallel Using pyspark - apache-spark

I launched a Dataproc cluster in GCP with one master node and 3 worker nodes. Each node has 8 vCPUs and 30 GB of memory.
I wrote a PySpark job which reads one CSV file from GCS. The CSV file is about 30 GB in size.
df_raw = (
    spark
    .read
    .schema(schema)
    .option('header', 'true')
    .option('quote', '"')
    .option('multiline', 'true')
    .csv(infile)
)
df_raw = df_raw.repartition(20, "Product")
print(df_raw.rdd.getNumPartitions())
Here is how I submitted the PySpark job to Dataproc:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION}
I got a partition count of only 1.
I attached the node usage image here for your reference.
It seems only one vCore on one worker node was used.
How can I make this run in parallel, with multiple partitions, using all nodes and more vCores?
I tried repartitioning to 20, but it still used only one vCore on one worker node, as below:
Spark's default shuffle partition count is 200, so I was surprised to see Dataproc not use all available resources for this kind of task.

This isn't a Dataproc issue, but a pure Spark/PySpark one.
In order to parallelize your data, it needs to be split into multiple partitions - a number larger than the number of executors (total worker cores) you have, e.g. ~2x, ~3x, ...
There are various ways to do this, e.g.:
Split the data into multiple files or folders and parallelize the list of files/folders, working on each one (or use a data source that already does this and preserves the partitioning when read into Spark).
Repartition your data after you get a Spark DataFrame, e.g. read the number of executors, multiply by N, and repartition to that many partitions. When you do this, you must choose columns which divide your data well, i.e. into many parts, not just a few - e.g. by day or by customer ID, not by a status ID.
df = df.repartition(num_partitions, 'partition_by_col1', 'partition_by_col2')
The code runs on the driver (master node), and the parallel stages are distributed amongst the worker nodes, e.g.
(
    df.withColumn(...).select(...)...
    .write.save(...)
)
Since Spark transformations are lazy, they only run when you reach an action such as write or collect, which causes the DataFrame to be evaluated.
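As a concrete illustration of the rule of thumb above, the target partition count can be derived from the cluster shape. The sketch below is a minimal, hypothetical example: the helper is pure Python, the PySpark calls are shown in comments because they need a live SparkSession, and the multiplier of 3 is an assumption, not a fixed rule.

```python
def target_partitions(num_executors, cores_per_executor, factor=3):
    """Rule of thumb: 2-3 tasks per executor core, so aim for
    (total cores * factor) partitions."""
    return num_executors * cores_per_executor * factor

# For the cluster in the question: 3 workers x 8 vCPUs each.
n = target_partitions(num_executors=3, cores_per_executor=8)

# With a live SparkSession (sketch, column names are illustrative):
# df = df.repartition(n, "partition_by_col1", "partition_by_col2")
# print(df.rdd.getNumPartitions())
```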

You might want to try increasing the number of executors by passing Spark configuration via the --properties flag of the Dataproc command line, e.g.
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION} \
--properties=spark.executor.instances=5

Related

Spark - I cannot increase number of tasks in local mode

I tried to submit my application, changing the coalesce(k) in my code to different combinations:
Firstly, I read some data from my local disk:
val df = spark.read
  .option("encoding", "gbk")
  .option("wholeFile", true)
  .option("multiline", true)
  .option("sep", "|+|")
  .schema(schema)
  .csv("file:///path/to/foo.txt")
Situation 1
I think local[*] means there are 56 cores in total, and I specify 4 * 4 = 16 tasks:
spark-submit:
spark-submit --master local[*] --class foo --driver-memory 8g --executor-memory 4g --executor-cores 4 --num-executors 4 foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").partitionBy("date").orc("hdfs://xxx:9000/user/hive/warehouse/ods/foo")
But when I look at the Spark history server UI, there is only 1 task. In the data set, the 'date' column has only a single value.
So I tried another combination and removed partitionBy:
Situation 2
spark-submit:
spark-submit --master local[*] --class foo foo.jar
spark.write:
df.coalesce(16).write.mode("overwrite").orc("hdfs://xxxx:9000/user/hive/warehouse/ods/foo")
But the history server shows there is still only 1 task.
There are 56 cores and 256GB memory on my local machine.
I know that in local mode Spark creates one JVM for both the driver and the executor, so we have a single executor with the number of cores of the machine (say 56) if we run it with local[*].
Here are the questions:
Could anyone explain why my task number is always 1?
How can I increase the number of tasks so that I can make use of parallelism?
Will my local file be read into different partitions?
Spark reads this CSV file with a single task because, with wholeFile/multiline enabled, the file is not splittable, so the whole file ends up in one partition.
Compare this to splittable files on a distributed file system such as HDFS, where a single file can be read into multiple partitions. It means your resulting DataFrame df has only a single partition. You can check that using df.rdd.getNumPartitions. See also my answer on How is a Spark Dataframe partitioned by default?
Note that coalesce can only merge partitions, never create new ones, so calling coalesce(16) will not have any impact at all, as the one partition of your DataFrame is already located on a single worker.
In order to increase parallelism, you may want to use repartition(16) instead.
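The difference between the two calls comes down to how they treat the partition count. The snippet below is a simplified pure-Python model of that behaviour (the real operations live on the DataFrame/RDD API, shown in the comments); it only captures the counts, not the data movement.

```python
def coalesce_count(current, requested):
    # coalesce() only merges existing partitions; it never splits
    # them, so it cannot raise the partition count.
    return min(current, requested)

def repartition_count(current, requested):
    # repartition() performs a full shuffle and can both increase
    # and decrease the partition count.
    return requested

# A single-partition DataFrame, as in the question:
# df.coalesce(16)    -> still 1 partition
# df.repartition(16) -> 16 partitions
print(coalesce_count(1, 16))     # stays 1
print(repartition_count(1, 16))  # becomes 16
```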

Does the Spark Shell JDBC read numPartitions value depend on the number of executors?

I have Spark set up in standalone mode on a single node with 2 cores and 16GB of RAM to make some rough POCs.
I want to load data from a SQL source using val df = spark.read.format("jdbc")...option("numPartitions", n).load(). When I tried to measure the time taken to read a table for different numPartitions values by calling df.rdd.count, I saw that the time was the same regardless of the value I gave. I also noticed on the Spark web UI that the number of active executors was 1, even though I set SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1 in my spark-env.sh file.
I have 2 questions:
Does the number of partitions actually created depend on the number of executors?
How do I start spark-shell with multiple executors in my current setup?
Thanks!
The number of partitions doesn't depend on your number of executors - although there is a best practice (a few partitions per core), it isn't determined by the number of executor instances.
In the case of reading from JDBC, to parallelize the read you need a partition column, e.g.:
spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "table")
  .option("user", user)
  .option("password", password)
  .option("numPartitions", numPartitions)
  .option("partitionColumn", "<partition_column>")
  .option("lowerBound", 1)
  .option("upperBound", 10000)
  .load()
That will parallelize the read: each of the numPartitions queries fetches a range of roughly (10000 - 1) / numPartitions values of the partition column.
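To see where those per-query ranges come from, here is a simplified pure-Python sketch of how Spark's JDBC source splits [lowerBound, upperBound] into numPartitions WHERE clauses (the real logic lives in Spark's JDBCRelation and differs in detail; the column name is illustrative). Note that the bounds only shape the split, they do not filter rows: the first and last partitions are open-ended.

```python
def jdbc_partition_clauses(column, lower, upper, num_partitions):
    """Approximate the WHERE clauses Spark generates for a
    partitioned JDBC read. The first clause is open below and the
    last is open above, so no rows are skipped."""
    stride = (upper - lower) // num_partitions
    clauses = []
    bound = lower
    for i in range(num_partitions):
        if i == 0:
            clauses.append(f"{column} < {bound + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            clauses.append(f"{column} >= {bound}")
        else:
            clauses.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    return clauses

# The bounds from the example read above:
for c in jdbc_partition_clauses("id", 1, 10000, 4):
    print(c)
```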
As for your second question, you can find all the Spark configuration options here: https://spark.apache.org/docs/latest/configuration.html (e.g. spark-shell --num-executors, or the configuration --conf spark.executor.instances).
Note that specifying the number of executors explicitly turns dynamic allocation off, so be aware of that.

What happens when we allocate more executors to a spark job than numbers of partitions of a kafka topic

Suppose I am running spark batch job and I am setting
--num-executors 40
The job reads from a Kafka topic with 20 partitions.
The job writes to a Kafka topic with 20 partitions.
My questions are:
How many executors will be used by the Spark job:
a. while reading from Kafka?
b. while writing to Kafka?
What changes when I set the parameter below while running the same job with 40 executors?
--conf spark.dynamicAllocation.enabled=false
First of all, to answer the question directly: Spark will use only 20 executors (matching the number of input Kafka partitions); the remaining executors will not be allocated any tasks.
Beyond that, executor usage depends on the transformations and actions you perform on the data. For example:
If you apply a foreach, the partition count stays the same, so the executors used stay the same.
If you apply a map and then repartition, executors will be used according to the new partition count.
A good practice is to maintain 2 to 3 times the default number of partitions.
So once you have an RDD, use sparkContext.defaultParallelism to get the default parallelism, then repartition the RDD to 2 to 3 times that.
It should look like this:
newRDD = rdd.repartition(2 * sparkContext.defaultParallelism)
If spark.dynamicAllocation.enabled=false, Spark can't adjust the number of executors based on the load.
Prefer spark.dynamicAllocation.enabled=true, and repartition the RDD to 2 to 3 times the default parallelism.
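The executor-usage claim at the start of the answer can be stated as a small model: with one read task per Kafka partition, the number of busy executors at that stage is capped by the partition count. This is a pure-Python simplification of Spark's scheduling (it assumes one task slot per executor), not a Spark API.

```python
def busy_executors(num_executors, kafka_partitions):
    # One read task per Kafka topic partition, so at most
    # `kafka_partitions` executors can have work at that stage.
    return min(num_executors, kafka_partitions)

# The scenario in the question: 40 executors, a 20-partition topic.
print(busy_executors(40, 20))  # 20 executors do work, 20 sit idle
```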

How to process Kafka partitions separately and in parallel with Spark executors?

I use Spark 2.1.1.
I read messages from 2 Kafka partitions using Structured Streaming. I am submitting my application to Spark Standalone cluster with one worker and 2 executors (2 cores each).
./bin/spark-submit \
--class MyClass \
--master spark://HOST:IP \
--deploy-mode cluster \
/home/ApplicationSpark.jar
I want the messages from each Kafka partition to be processed independently by a separate executor. But what happens now is that the executors read and .map the partition data separately, yet after mapping, the unbounded table that is formed is shared and contains data from both partitions.
When I run a structured query on the table, the query therefore has to deal with data from both partitions (a larger amount of data).
select product_id, max(smr.order_time), max(product_price) , min(product_price)
from OrderRecords
group by WINDOW(order_time, "120 seconds"), product_id
where the Kafka topic is partitioned on product_id.
Is there any way to run the same structured query in parallel, but separately on the data from the Kafka partition to which each executor is mapped?
But now what is happening is that the executors read and .map the partition data separately, yet after mapping, the unbounded table that is formed is shared and contains data from both partitions. Hence when I run a structured query on the table, the query has to deal with data from both partitions (a larger amount of data).
That's the key to understanding what can be executed, and how, without causing a shuffle and sending data across partitions (possibly even over the wire).
The definitive answer depends on what your queries are. If they work on groups of records where the groups are spread across multiple topic partitions, and hence across two different Spark executors, you'd have to be extra careful with your algorithm/transformation to do the processing on separate partitions (using only what's available within each partition) and aggregate the results only afterwards.

How to specify/check # of partitions on Dataproc cluster

If I spin up a Dataproc cluster of 1 master n1-standard-4 and 4 worker machines, also n1-standard-4, how do I tell how many partitions are created by default? If I want to make sure I have 32 partitions, what syntax do I use in my PySpark script? I am reading in a .csv file from a Google Storage bucket.
Is it simply
myRDD = sc.textFile("gs://PathToFile", 32)
How do I tell how many partitions are being used (e.g. from the Dataproc jobs output screen)?
Thanks
To get the number of partitions of an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.getNumPartitions
To repartition an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.repartition
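For completeness: sc.textFile's minPartitions argument only sets a lower bound, and its default is small - min(defaultParallelism, 2). The helper below models that default in pure Python; the sc.* calls are shown in comments because they need a live SparkContext, and the defaultParallelism value for this cluster shape is an assumption.

```python
def default_min_partitions(default_parallelism):
    """PySpark's sc.textFile default: min(defaultParallelism, 2),
    which is why small reads often start with very few partitions."""
    return min(default_parallelism, 2)

# Even if defaultParallelism were 16 on the 4-worker n1-standard-4
# cluster, the default minPartitions would still be only 2:
print(default_min_partitions(16))

# With a live SparkContext (sketch):
# myRDD = sc.textFile("gs://PathToFile", 32)
# print(myRDD.getNumPartitions())  # at least 32
```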
