How to specify/check # of partitions on Dataproc cluster - apache-spark

If I spin up a Dataproc cluster of 1 master n1-standard-4 and 4 worker machines, also n1-standard-4, how do I tell how many partitions are created by default? If I want to make sure I have 32 partitions, what syntax do I use in my PySpark script? I am reading in a .csv file from a Google Storage bucket.
Is it simply
myRDD = sc.textFile("gs://PathToFile", 32)
How do I tell how many partitions are running (using the Dataproc jobs output screen)?
Thanks

To get the number of partitions in an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.getNumPartitions
To repartition an RDD: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.repartition
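Putting those two together, a minimal PySpark sketch (the gs:// path is a placeholder, and sc is the usual SparkContext):
myRDD = sc.textFile("gs://PathToFile", minPartitions=32)  # minPartitions is only a hint
print(myRDD.getNumPartitions())  # check what was actually created
myRDD = myRDD.repartition(32)    # force exactly 32 partitions if needed
print(myRDD.getNumPartitions())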

Related

Partitioning in Spark 3.1 with java

I am using Spark 3.1 with Java. In my code, I write the final result dataset to GCP storage, which creates multiple files because my dataset is large. I run the Spark job in a GCP Dataproc cluster. It is configured to use 250 worker nodes (each with 8 vCPUs). The Spark command is configured to run 2 executors per node and 3 cores for each executor. When the Spark job is triggered, the YARN ResourceManager shows only 25% of worker cores being used by containers per node. I also configured the shuffle partition count as 5500 (spark.sql.shuffle.partitions=5500). And I used
mydataset.coalesce(4500) to reduce the number of result files created in Cloud Storage. But it creates 5499 files for one dataset, which has nearly 45000 rows, and 3500 files for another dataset, which has nearly 85000 rows. It's really confusing on what basis it creates the file partitions. Can't I control that? Is there a default value? If yes, can I get that default value in Java code?
Thanks in advance

Dataproc Didn't Process Big Data in Parallel Using pyspark

I launched a Dataproc cluster in GCP, with one master node and 3 worker nodes. Every node has 8 vCPUs and 30 GB of memory.
I developed a PySpark script which reads one CSV file from GCS. The CSV file is about 30 GB in size.
df_raw = (
    spark
    .read
    .schema(schema)
    .option('header', 'true')
    .option('quote', '"')
    .option('multiline', 'true')
    .csv(infile)
)
df_raw = df_raw.repartition(20, "Product")
print(df_raw.rdd.getNumPartitions())
Here is how I launched the PySpark job on Dataproc:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION}
I got a partition number of only 1.
I attached the node usage image here for your reference.
It seems it used only one vCore from one worker node.
How can I make this run in parallel with multiple partitions, using all nodes and more vCores?
I tried repartitioning to 20, but it still only used one vCore from one worker node, as shown below:
PySpark's default shuffle partition count is 200, so I was surprised to see that Dataproc didn't use all available resources for this kind of task.
This isn't a Dataproc issue, but a pure Spark/PySpark one.
In order to parallelize your data, it needs to be split into multiple partitions - a number larger than the number of executors (total worker cores) you have (e.g. roughly 2x or 3x that number).
There are various ways to do this, e.g.:
Split the data into files or folders, parallelize the list of files/folders, and work on each one (or use a database that already does this and preserves that partitioning when read into Spark).
Repartition your data after you get a Spark DataFrame, e.g. read the number of executors, multiply it by N, and repartition to that many partitions (see the sketch after this list). When you do this, you must choose columns which divide your data well, i.e. into many parts rather than just a few; e.g. by day or by customer ID, not by a status ID.
df = df.repartition(num_partitions, 'partition_by_col1', 'partition_by_col2')
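A minimal PySpark sketch of that second approach, assuming a SparkSession named spark is already available; the multiplier and the partitioning columns below are placeholders to adapt to your data:
# Derive a partition count from the cluster's default parallelism (roughly the
# total executor cores) times a small multiplier, then repartition by columns
# that spread the data well (column names here are hypothetical).
num_partitions = spark.sparkContext.defaultParallelism * 3
df = df.repartition(num_partitions, 'Product', 'OrderDate')
print(df.rdd.getNumPartitions())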
The code runs on the master node and the parallel stages are distributed amongst the worker nodes, e.g.
df = (
df.withColumn(...).select(...)...
.write(...)
)
Since Spark functions are lazy, they only run when you reach a step like write or collect which causes the DF to be evaluated.
You might want to try to increase the number of executors by passing Spark configuration via the --properties flag of the Dataproc command line, for example:
gcloud dataproc jobs submit pyspark gs://<my-gcs-bucket>/<my-program>.py \
--cluster=${CLUSTER} \
--region=${REGION} \
--properties=spark.executor.instances=5

Understand how Spark is transforming input file to worker nodes

I have a Spark cluster with 3 worker nodes. Take the simplified word count as an example:
val textFile = sc.textFile("hdfs://input/words")
textFile.count
This application creates an RDD and calculates how many lines it has. Since the input file is huge, when actually performing the count, does Spark split the input into 3 parts and separately move them to the 3 worker nodes? If so, how does Spark partition the input file (how does Spark determine which line is sent to which worker node)?
You are trying to process the file "hdfs://input/words". This file is already split as soon as you store it on HDFS (since you have used an HDFS file in your example). If the file has 3 blocks, Spark will see it as 3 partitions of the file.
Spark does not need to move the file to the worker nodes, since the file is on HDFS: it is already on the machines which will be used as worker nodes by Spark.
I hope this is clear.
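As a quick check, here is a hedged PySpark sketch of the same idea (the Scala API is analogous): the initial partition count of an RDD read from HDFS typically matches the number of HDFS blocks/splits of the file.
# Path taken from the question; expect roughly one partition per HDFS block,
# e.g. 3 partitions for a file stored in 3 blocks.
rdd = sc.textFile("hdfs://input/words")
print(rdd.getNumPartitions())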

What is the correct way to query Hive on Spark for maximum performance?

Spark newbie here.
I have a pretty large table in Hive (~130M records, 180 columns) and I'm trying to use Spark to write it out as a Parquet file.
I'm using the default EMR cluster configuration, 6 r3.xlarge instances, to submit my Spark application written in Python. I then run it on YARN in cluster mode, usually giving a small amount of memory (a couple of GB) to the driver and the rest to the executors. Here's my code to do so:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName="ParquetTest")
hiveCtx = HiveContext(sc)
data = hiveCtx.sql("select * from my_table")
data.repartition(20).write.mode('overwrite').parquet("s3://path/to/myfile.parquet")
Later, I submit it with something similar to this:
spark-submit --master yarn --deploy-mode cluster --num-executors 5 --driver-memory 4g --driver-cores 1 --executor-memory 24g --executor-cores 2 --py-files test_pyspark.py test_pyspark.py
However, my task takes forever to complete. Spark shuts down all but one worker very quickly after the job starts, since the others are not being used, and it takes a few hours before it has read all the data from Hive. The Hive table itself is not partitioned or clustered yet (I also need some advice on that).
Could you help me understand what I'm doing wrong, where I should go from here, and how to get the maximum performance out of the resources I have?
Thank you!
I had a similar use case where I used Spark to write to S3 and had performance issues. The primary reason was that Spark was creating a lot of zero-byte part files, and renaming the temp files to their actual file names was slowing down the write process. I tried the approaches below as workarounds:
Write the output of Spark to HDFS and use Hive to write to S3. Performance was much better, as Hive was creating a smaller number of part files. The problem I had (it was also present when using Spark directly) was that the delete action was not allowed by policy in the prod environment for security reasons. The S3 bucket was KMS-encrypted in my case.
Write the Spark output to HDFS, copy the HDFS files to local disk, and use aws s3 cp to push the data to S3. This gave the second-best results. I created a ticket with Amazon and they suggested going with this one.
Use s3-dist-cp to copy files from HDFS to S3. This worked with no issues, but was not performant.
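For reference, a hedged PySpark sketch of the first workaround: write the Spark output to an HDFS staging path instead of S3 (the path below is hypothetical), and then copy to S3 outside Spark (via Hive, aws s3 cp, or s3-dist-cp as described above).
# Same write as in the question, but targeting HDFS; committing renames on
# HDFS avoids the slow temp-file renaming on S3 described above.
data = hiveCtx.sql("select * from my_table")
data.repartition(20).write.mode('overwrite').parquet("hdfs:///tmp/staging/my_table.parquet")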

How does Spark parallelize the processing of a 1TB file?

Imaginary problem
A gigantic CSV log file, let's say 1 TB in size; the file is located on a USB drive.
The log contains activity logs of users around the world; let's assume each line contains 50 columns, among them Country.
We want a line count per country, in descending order.
Let's assume the Spark cluster has enough nodes with RAM to process the entire 1TB in memory (20 nodes, 4 cores CPU, each node has 64GB RAM)
My poor man's conceptual solution
Using SparkSQL & Databricks spark-csv
$ ./spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
val dfBigLog = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/media/username/myUSBdrive/bogusBigLog1TB.log")

dfBigLog.select("Country")
  .groupBy("Country")
  .agg(count($"Country") as "CountryCount")
  .orderBy($"CountryCount".desc).show
Question 1: How does Spark parallelize the processing?
I suppose the majority of the execution time (99%?) of the above solution is spent reading the 1TB file from the USB drive into the Spark cluster. Reading the file from the USB drive is not parallelizable. But after reading the entire file, what does Spark do under the hood to parallelize the processing?
How many nodes are used for creating the DataFrame? (Maybe only one?)
How many nodes are used for groupBy & count? Let's assume there are 100+ countries (but Spark doesn't know that yet). How would Spark partition the data to distribute the 100+ country values across 20 nodes?
Question 2: How to make the Spark application the fastest possible?
I suppose the area of improvement would be to parallelize the reading of the 1TB file.
Convert the CSV file into the Parquet file format, using Snappy compression. Let's assume this can be done in advance.
Copy the Parquet file to HDFS. Let's assume the Spark cluster is within the same Hadoop cluster and the datanodes are independent from the 20-node Spark cluster.
Change the Spark application to read from HDFS. I suppose Spark would now use several nodes to read the file as Parquet is splittable.
Let's assume the Parquet file compressed by Snappy is 10x smaller, size = 100GB, HDFS block = 128 MB in size. Total 782 HDFS blocks.
But then how does Spark manage to use all the 20 nodes for both creating the DataFrame and the processing (groupBy and count)? Does Spark use all the nodes each time?
Question 1: How does Spark parallelize the processing (of reading a file from a USB drive)?
This scenario is not possible.
Spark relies on a Hadoop-compliant filesystem to read a file. When you mount the USB drive, you can only access it from the local host. Attempting to execute
.load("/media/username/myUSBdrive/bogusBigLog1TB.log")
will fail in a cluster configuration, as executors in the cluster will not have access to that local path.
It would be possible to read the file with Spark in local mode (master=local[*]), in which case you will only have one host, and hence the rest of the questions would not apply.
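For completeness, a hedged PySpark sketch of that local-mode case, using the newer SparkSession API rather than the sqlContext used elsewhere in this thread; the path is the one from the question.
from pyspark.sql import SparkSession

# In local mode the driver and executor run on the same machine, so the local
# USB path is readable, but there is no cluster parallelism.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("usb-local-read")
         .getOrCreate())

df = (spark.read
      .option("header", "true")
      .csv("/media/username/myUSBdrive/bogusBigLog1TB.log"))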
Question 2: How to make the Spark application the fastest possible?
Divide and conquer.
The strategy outlined in the question is good. Using Parquet will allow Spark to do a projection on the data and read only the .select("Country") column, further reducing the amount of data that needs to be ingested and hence speeding things up.
The cornerstone of parallelism in Spark is partitions. Again, as we are reading from a file, Spark relies on the Hadoop filesystem. When reading from HDFS, the partitioning will be dictated by the splits of the file on HDFS. Those splits will be evenly distributed among the executors. That's how Spark will initially distribute the work across all available executors for the job.
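As an illustration, a hedged PySpark sketch of that Parquet-on-HDFS variant (the path is hypothetical; the HDFS splits determine the initial partitions, and column pruning means only Country is actually scanned):
df = sqlContext.read.parquet("hdfs:///data/bogusBigLog1TB.parquet")

(df.select("Country")
   .groupBy("Country")
   .count()                             # rows per country
   .orderBy("count", ascending=False)
   .show())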
I'm not deeply familiar with the Catalyst optimizations, but I think I can assume that .groupBy("Country").agg(count($"Country")) will become something similar to: rdd.map(country => (country, 1)).reduceByKey(_ + _)
The map operation will not affect partitioning, so it can be applied in place on each partition.
The reduceByKey will be applied first locally on each partition, and the partial results will then be shuffled and combined across the executors; only the small final per-country counts reach the driver (via show). So most of the counting happens distributed in the cluster, and only the final collection is centralized.
Reading the file from the USB drive is not parallelizable.
Whether it is a USB drive or any other data source, the same rules apply: either the source is accessible from the driver and all worker machines, and the data is accessed in parallel (up to the source's limits), or the data is not accessed at all and you get an exception.
How many nodes are used for creating the DataFrame? (Maybe only one?)
Assuming that the file is accessible from all machines, it depends on the configuration. For starters, you should take a look at the split size.
How many nodes are used for the groupBy & count?
Once again, it depends on the configuration.
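As a hedged illustration of the split-size knob mentioned above: in newer Spark versions (2.x+, not the sqlContext-era API used in this thread), file-based DataFrame reads size their partitions according to spark.sql.files.maxPartitionBytes. Assuming a SparkSession named spark; the value and path below are purely illustrative.
# Smaller max partition size -> more, smaller read partitions (64 MB here).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

df = spark.read.option("header", "true").csv("hdfs:///data/bogusBigLog1TB.csv")
print(df.rdd.getNumPartitions())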
