Spark 2.0.0 Partition size for a parquet - apache-spark

I am trying to understand how I could improve (or increase) the parallelism of tasks that run for a particular spark job.
Here is my observation...
scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size()
25
$ hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
200
$ hadoop fs -du -h -s hdfs://somefile
2.2 G
I notice that, depending on the repartition / coalesce value used, the corresponding number of part files is created in HDFS during the save operation. In other words, the number of part files can be tweaked with that parameter.
But how do I control the read's 'partitions.size()'? I want it to be 200 (without having to repartition after the read) so that more tasks run for this job.
This has a major impact in terms of the time it takes to perform query operations in this job.
On a side note, I do understand that 200 parquet part files for the above 2.2 G is overkill for a 128 MB block size; ideally it should be around 18 parts (2.2 G / 128 MB ≈ 18).
Please advise.
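(For reference, one knob that influences the read-side split in Spark 2.x is spark.sql.files.maxPartitionBytes. A minimal PySpark sketch, assuming spark is an existing session and using a purely illustrative 32 MB cap:)
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)  # illustrative cap, not a recommendation
df = spark.read.parquet("hdfs://somefile")  # path taken from the question
print(df.rdd.getNumPartitions())            # a smaller cap generally yields more read partitions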

Related

How to maintain the partition in spark?

I have a parquet folder partitioned by sensor_name, and each sensor has the same count of readings. When I read it using select, my dataframe looks like the one below:
sensor_name | reading
---------------|---------------
a | 0.0
b | 2.0
c | 1.0
a | 0.0
b | 0.0
c | 1.0
...
I want to do some transformation for each sensor (say multiply by 10) and then store it as a parquet folder with the same partitioning (i.e) partition by sensor_name.
When I run the code below, I realized Spark does its own partitioning:
df.write.format("parquet").mode("overwrite").save("path")
So I changed it as below to do the partitioning, and it was tremendously slow:
df.write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path")
Then I tried to repartition first, and it was better than before but still slow:
df.repartition("sensor_name").write.format("parquet").partitionBy("sensor_name").mode("overwrite").save("path")
Is there a way to tell Spark not to repartition it and honor my partition while doing select?
Is there a way to tell Spark not to repartition it and honor my partition while doing select?
There is none. If you need physical partitions on disk, you need to use partitionBy, unless you want to read each individual partition's data, enrich it, and write it back to that partition's directory yourself. That would require combining plain Python code with the PySpark API (I would do that in Scala, though).
The code you are using is the most efficient one, and Spark will optimize it. You may be seeing a performance bottleneck either because you are running in standalone mode or because a join operation is involved, which requires a shuffle.
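For what it's worth, a rough sketch of that per-partition read/enrich/write alternative in PySpark (the paths, the sensor list, and the multiply-by-10 transformation are all placeholders; in practice the sensor names would be discovered by listing the source directory, and spark is assumed to be an existing session):
from pyspark.sql import functions as F
sensor_names = ["a", "b", "c"]  # placeholder; list the real sensor_name=... directories instead
for sensor in sensor_names:
    part = spark.read.parquet("in_path/sensor_name=%s" % sensor)   # one physical partition
    enriched = part.withColumn("reading", F.col("reading") * 10)   # the per-sensor transformation
    enriched.write.mode("overwrite").parquet("out_path/sensor_name=%s" % sensor)  # same layout on disk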

hadoop fs -du output does not reflect replication factor

As discussed in several other questions (here and here), the hadoop fs -du -s -h command (or equivalently hdfs dfs -du -s -h) shows two values:
The pure file size
The file size taking into account replication
e.g.
19.9 M 59.6 M /path/folder/test.avro
So normally we'd expect the second number to be 3x the first number, on our cluster with replication factor 3.
But when checking up on a running Spark job recently, the first number was 246.9 K, and the second was 3.4 G - approximately 14,000 times larger!
Does this indicate a problem? Why isn't the replicated size 3x the raw size?
Is this because one of the values takes into account block size, and the other doesn't, perhaps?
The Hadoop documentation on this command isn't terribly helpful, stating only:
The du returns three columns with the following format
size disk_space_consumed_with_all_replicas full_path_name

load parquet file and keep same number hdfs partitions

I have a parquet file /df saved in hdfs with 120 partitions. The size of each partition on hdfs is around 43.5 M.
Total size
hdfs dfs -du -s -h /df
5.1 G 15.3 G /df
hdfs dfs -du -h /df
43.6 M 130.7 M /df/pid=0
43.5 M 130.5 M /df/pid=1
...
43.6 M 130.9 M /df/pid=119
I want to load that file into Spark and keep the same number of partitions.
However, Spark will automatically load the file into 60 partitions.
df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60
HDFS settings:
'parquet.block.size' is not set.
sc._jsc.hadoopConfiguration().get('parquet.block.size')
returns nothing.
'dfs.blocksize' is set to 128.
float(sc._jsc.hadoopConfiguration().get("dfs.blocksize"))/2**20
returns
128
Changing either of those values to something lower does not result in the parquet file loading into the same number of partitions that are in hdfs.
For example:
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 64*2**20)
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", 64*2**20)
I realize 43.5 M is well below 128 M. However, for this application, I am going to immediately complete many transformations that will result in each of the 120 partitions getting much closer to 128 M.
I am trying to save myself from having to repartition in the application immediately after loading.
Is there a way to force Spark to load the parquet file with the same number of partitions that are stored on the hdfs?
First, I'd start by checking how Spark splits the data into partitions.
By default it depends on the nature and size of your data & cluster.
This article should provide you with the answer why your data frame was loaded to 60 partitions:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/sparksqlshufflepartitions_draft.html
In general, it's Catalyst that takes care of all the optimization (including the number of partitions), so unless there is a really good reason for custom settings, I'd let it do its job. If any of the transformations you use are wide, Spark will shuffle the data anyway.
I can use the spark.sql.files.maxPartitionBytes property to keep the partition sizes where I want when importing.
The Other Configuration Options documentation for the spark.sql.files.maxPartitionBytes property states:
The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
Example (where spark is a working SparkSession):
spark.conf.set("spark.sql.files.maxPartitionBytes", 67108864) ## 64Mbi
To control the number of partitions during transformations, I can set spark.sql.shuffle.partitions, for which the documentation states:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Example (where spark is a working SparkSession):
spark.conf.set("spark.sql.shuffle.partitions", 500)
Additionally, I can set spark.default.parallelism, for which the Execution Behavior documentation states:
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
Example (where spark is a working SparkSession):
spark.conf.set("spark.default.parallelism", 500)

How to optimize Hadoop MapReduce compressing Spark output in Google Dataproc?

The goal: Millions of rows in Cassandra need to be extracted and compressed into a single file as quickly and efficiently as possible (on a daily basis).
The current setup uses a Google Dataproc cluster to run a Spark job that extracts the data directly into a Google Cloud Storage bucket. I've tried two approaches:
Using the (now deprecated) FileUtil.copyMerge() to combine the roughly 9000 Spark partition files into a single uncompressed file, then submitting a Hadoop MapReduce job to compress that single file.
Leaving the roughly 9000 Spark partition files as the raw output, and submitting a Hadoop MapReduce job to merge and compress those files into a single file.
Some job details:
About 800 Million rows.
About 9000 Spark partition files outputted by the Spark job.
Spark job takes about an hour to complete running on a 1 Master, 4 Worker (4vCPU, 15GB each) Dataproc cluster.
Default Dataproc Hadoop block size, which is, I think, 128 MB.
Some Spark configuration details:
spark.task.maxFailures=10
spark.executor.cores=4
spark.cassandra.input.consistency.level=LOCAL_ONE
spark.cassandra.input.reads_per_sec=100
spark.cassandra.input.fetch.size_in_rows=1000
spark.cassandra.input.split.size_in_mb=64
The Hadoop job:
hadoop jar file://usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar \
  -Dmapred.reduce.tasks=1 \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dstream.map.output.field.separator=, \
  -Dmapred.textoutputformat.separator=, \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input gs://bucket/with/either/single/uncompressed/csv/or/many/spark/partition/file/csvs \
  -output gs://output/bucket \
  -mapper /bin/cat \
  -reducer /bin/cat \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat
The Spark job took about 1 hour to extract the Cassandra data to the GCS bucket. Using FileUtil.copyMerge() added about 45 minutes to that; it was performed by the Dataproc cluster but underutilized resources, as it only seems to use one node. The Hadoop job to compress that single file took an additional 50 minutes. This is not an optimal approach, as the cluster has to stay up longer even though it is not using its full resources.
The info output from that job:
INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=5072098452
FILE: Number of bytes written=7896333915
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
GS: Number of bytes read=47132294405
GS: Number of bytes written=2641672054
GS: Number of read operations=0
GS: Number of large read operations=0
GS: Number of write operations=0
HDFS: Number of bytes read=57024
HDFS: Number of bytes written=0
HDFS: Number of read operations=352
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Killed map tasks=1
Launched map tasks=353
Launched reduce tasks=1
Rack-local map tasks=353
Total time spent by all maps in occupied slots (ms)=18495825
Total time spent by all reduces in occupied slots (ms)=7412208
Total time spent by all map tasks (ms)=6165275
Total time spent by all reduce tasks (ms)=2470736
Total vcore-milliseconds taken by all map tasks=6165275
Total vcore-milliseconds taken by all reduce tasks=2470736
Total megabyte-milliseconds taken by all map tasks=18939724800
Total megabyte-milliseconds taken by all reduce tasks=7590100992
Map-Reduce Framework
Map input records=775533855
Map output records=775533855
Map output bytes=47130856709
Map output materialized bytes=2765069653
Input split bytes=57024
Combine input records=0
Combine output records=0
Reduce input groups=2539721
Reduce shuffle bytes=2765069653
Reduce input records=775533855
Reduce output records=775533855
Spilled Records=2204752220
Shuffled Maps =352
Failed Shuffles=0
Merged Map outputs=352
GC time elapsed (ms)=87201
CPU time spent (ms)=7599340
Physical memory (bytes) snapshot=204676702208
Virtual memory (bytes) snapshot=1552881852416
Total committed heap usage (bytes)=193017675776
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=47132294405
File Output Format Counters
Bytes Written=2641672054
I expected this to perform as well as or better than the other approach, but it performed much worse. The Spark job remained unchanged. Skipping the FileUtil.copyMerge() and jumping straight into the Hadoop MapReduce job, the map portion was only at about 50% after an hour and a half. The job was cancelled at that point, as it was clear it was not going to be viable.
I have complete control over the Spark job and the Hadoop job. I know we could create a bigger cluster, but I'd rather do that only after making sure the job itself is optimized. Any help is appreciated. Thanks.
Can you provide some more details of your Spark job? Which Spark API are you using - RDD or DataFrame?
Why not perform the merge phase completely in Spark (with repartition().write()) and avoid chaining Spark and MR jobs?
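A sketch of that suggestion, assuming df is the DataFrame extracted from Cassandra (the output path, separator, and gzip choice are placeholders; coalesce(1) funnels everything through a single writer task, which is the price of a single output file):
(df
 .coalesce(1)                      # one output part file => one writer task
 .write
 .mode("overwrite")
 .option("sep", ",")
 .option("compression", "gzip")    # gzip-compressed CSV output
 .csv("gs://output/bucket/daily-extract"))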

PySpark Number of Output Files

I am a Spark newbie. I have a simple pyspark script. It reads a JSON file, flattens it, and writes it to an S3 location as a compressed parquet file.
The read and transformation steps run very fast and use 50 executors (which I set in the conf). But the write stage takes a long time and writes only one large file (480 MB).
How is the number of files saved decided?
Can the write operation be sped up somehow?
Thanks,
Ram.
The number of files output is equal to the number of partitions of the RDD being saved. In the following sample, the RDD is repartitioned to control the number of output files.
Try:
repartition(numPartitions) - Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them.
This always shuffles all data over the network.
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")
The number of files output is the same as the number of partitions of the RDD.
$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r-- 1 cloudera cloudera 1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001
Also check this: coalesce(numPartitions)
Update:
The textFile method also takes an optional second argument for
controlling the number of partitions of the file. By default, Spark
creates one partition for each block of the file (blocks being 64MB by
default in HDFS), but you can also ask for a higher number of
partitions by passing a larger value. Note that you cannot have fewer
partitions than blocks.
... but this is the minimum number of possible partitions, so the exact count is not guaranteed.
So if you want to partition on read, you should use this:
dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
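Since the question uses the DataFrame API and writes parquet, a rough DataFrame-level equivalent (the path, partition count, and codec are placeholders, not recommendations):
(df
 .repartition(50)                          # => roughly 50 output part files
 .write
 .mode("overwrite")
 .option("compression", "snappy")
 .parquet("s3a://bucket/output/flattened/"))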
There are two different things to consider:
HDFS block size: The HDFS block size is configurable in hdfs-site.xml (128 MB by default). If a file is larger than the block size, the rest of the file data is assigned to a new block. But that is not something you can see; it is done internally, and the whole process is sequential.
Partitions: When Spark comes into the picture, so does parallelism. Ideally, if you do not manually provide the number of partitions, it is derived from the block size in the default configuration (roughly one partition per block). On the other hand, if you want to customize the number of partitioned files, you can use repartition(n), where n is the number of partitions.
These partitions are visible to you in HDFS when you browse it.
Also, to increase performance, you can give specifications such as the number of executors, executor memory, cores per executor, etc. with spark-submit / pyspark / spark-shell. The performance of writing any file also depends heavily on the format and compression codec used.
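For example, the same kind of resource settings can be supplied when building the session in PySpark (the numbers below are placeholders, and some of these are normally passed as spark-submit flags instead):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("write-tuning")
         .config("spark.executor.instances", "50")   # number of executors (placeholder)
         .config("spark.executor.memory", "4g")      # memory per executor (placeholder)
         .config("spark.executor.cores", "4")        # cores per executor (placeholder)
         .getOrCreate())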
Thanks for reading.
