hadoop fs -du output does not reflect replication factor - apache-spark

As discussed in several other questions (here and here), the hadoop fs -du -s -h command (or equivalently hdfs dfs -du -s -h) shows two values:
The pure file size
The file size taking into account replication
e.g.
19.9 M 59.6 M /path/folder/test.avro
So normally we'd expect the second number to be 3x the first number, on our cluster with replication factor 3.
But when checking up on a running Spark job recently, the first number was 246.9 K, and the second was 3.4 G - approximately 14,000 times larger!
Does this indicate a problem? Why isn't the replicated size 3x the raw size?
Is this because one of the values takes into account block size, and the other doesn't, perhaps?
The Hadoop documentation on this command isn't terribly helpful, stating only:
The du returns three columns with the following format
size disk_space_consumed_with_all_replicas full_path_name
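One way to see where the extra space is going (not from the original thread, just a hedged diagnostic sketch using the Hadoop FileSystem API through PySpark's JVM gateway; the path is illustrative):
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
path = spark._jvm.org.apache.hadoop.fs.Path("/path/folder/test.avro")
summary = fs.getContentSummary(path)
print(summary.getLength())          # raw bytes: the first du column
print(summary.getSpaceConsumed())   # bytes including all replicas: the second du column
print(fs.getFileStatus(path).getReplication())  # the file's actual replication factor
If the ratio of the second number to the first is far from the expected replication factor, checking the per-file replication and block layout (for example with hdfs fsck on the path) is a reasonable next step.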

Related

load parquet file and keep same number hdfs partitions

I have a parquet file /df saved in hdfs with 120 partitions. The size of each partition on hdfs is around 43.5 M.
Total size
hdfs dfs -du -s -h /df
5.1 G 15.3 G /df
hdfs dfs -du -h /df
43.6 M 130.7 M /df/pid=0
43.5 M 130.5 M /df/pid=1
...
43.6 M 130.9 M /df/pid=119
I want to load that file into Spark and keep the same number of partitions.
However, Spark will automatically load the file into 60 partitions.
df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60
HDFS settings:
'parquet.block.size' is not set.
sc._jsc.hadoopConfiguration().get('parquet.block.size')
returns nothing.
'dfs.blocksize' is set to 128 M.
float(sc._jsc.hadoopConfiguration().get("dfs.blocksize"))/2**20
returns
128
Changing either of those values to something lower does not result in the parquet file loading into the same number of partitions that are in hdfs.
For example:
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 64*2**20)
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", 64*2**20)
I realize 43.5 M is well below 128 M. However, for this application, I am going to immediately apply many transformations that will bring each of the 120 partitions much closer to 128 M.
I am trying to save myself having to repartition in the application immediately after loading.
Is there a way to force Spark to load the parquet file with the same number of partitions that are stored on the hdfs?
First, I'd start by checking how Spark splits the data into partitions.
By default it depends on the nature and size of your data & cluster.
This article should provide you with the answer why your data frame was loaded to 60 partitions:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/sparksqlshufflepartitions_draft.html
In general, it's Catalyst that takes care of all the optimization (including the number of partitions), so unless there is a really good reason for custom settings, I'd let it do its job. If any of the transformations you use are wide, Spark will shuffle the data anyway.
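To illustrate that last point, a hedged sketch (not from the original answer; pid is the partition column from the question): after any wide transformation, the partition count is driven by spark.sql.shuffle.partitions rather than by the read layout.
df = spark.read.parquet('df')
counts = df.groupBy('pid').count()    # wide transformation, triggers a shuffle
print(counts.rdd.getNumPartitions())  # typically 200, the default spark.sql.shuffle.partitions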
I can use the spark.sql.files.maxPartitionBytes property to keep the partition sizes where I want when importing.
The Other Configuration Options documentation for the spark.sql.files.maxPartitionBytes property states:
The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
Example (where spark is a working SparkSession):
spark.conf.set("spark.sql.files.maxPartitionBytes", 67108864) ## 64Mbi
To control the number of partitions during transformations, I can set spark.sql.shuffle.partitions, for which the documentation states:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Example (where spark is a working SparkSession):
spark.conf.set("spark.sql.shuffle.partitions", 500)
Additionally, I can set spark.default.parallelism, for which the Execution Behavior documentation states:
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
Example (where spark is a working SparkSession):
spark.conf.set("spark.default.parallelism", 500)

Spark Partitions: Loading a file from the local file system on a Single Node Cluster

I am interested in finding out how Spark creates partitions when loading a file from the local file system.
I am using the Databricks Community Edition to learn Spark. When I load a file that is just a few kilobytes in size (about 300 KB) using the sc.textFile command, Spark by default creates 2 partitions (as given by partitions.length). When I load a file that is about 500 MB, it creates 8 partitions (which is equal to the number of cores on the machine).
What is the logic here?
Also, I learnt from the documentation that if we are loading from the local file system and using a cluster, the file has to be in the same location on all the machines that belong to the cluster. Will this not create duplicates? How does Spark handle this scenario? If you can point to articles that shed light on this, it will be of great help.
Thanks!
When Spark reads from the local file system, the default number of partitions (given by defaultParallelism) is the number of available cores.
sc.textFile calculates its default minimum number of partitions as the minimum of defaultParallelism (the available cores in the case of the local FS) and 2:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
Referred from: spark code
In the 1st case (file size ~300 KB): the number of partitions is calculated as 2, since the file is very small.
In the 2nd case (file size ~500 MB): the number of partitions equals defaultParallelism, which in your case is 8.
When reading from HDFS, sc.textFile will take the maximum of minPartitions and the number of splits computed from the Hadoop input size divided by the block size.
However, when using textFile with compressed files (file.txt.gz rather than file.txt, or similar), Spark disables splitting, which results in an RDD with only 1 partition (since reads against gzipped files cannot be parallelized).
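A quick way to see these defaults on your own setup (a hedged sketch; the local paths are hypothetical placeholders):
print(sc.defaultParallelism)    # available cores on a local / single-node setup
print(sc.defaultMinPartitions)  # min(defaultParallelism, 2)
small = sc.textFile("file:///tmp/small_300kb.txt")
large = sc.textFile("file:///tmp/large_500mb.txt")
print(small.getNumPartitions(), large.getNumPartitions())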
For your 2nd query regarding reading data from Local path in Cluster:
Files need to be available on all machines in the cluster, because Spark may launch executors on any machine in the cluster, and each executor reads the file using a file:// path.
To avoid copying the files to every machine, if your data is already on a network file system such as NFS, AFS, or MapR's NFS layer, then you can use it as input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node.
Please Refer to: https://community.hortonworks.com/questions/38482/loading-local-file-to-apache-spark.html

PySpark Number of Output Files

I am a Spark newbie. I have a simple pyspark script. It reads a json file, flattens it, and writes it to an S3 location as a compressed parquet file.
The read and transformation steps run very fast and use 50 executors (which I set in the conf). But the write stage takes a long time and writes only one large file (480 MB).
How is the number of files saved decided?
Can the write operation be sped up somehow?
Thanks,
Ram.
The number of files output is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.
Try:
repartition(numPartitions) - Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them.
This always shuffles all data over the network.
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")
The number of output files is the same as the number of partitions of the RDD.
$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r-- 1 cloudera cloudera 1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001
Also check this: coalesce(numPartitions)
source-1 | source-2
Update:
The textFile method also takes an optional second argument for
controlling the number of partitions of the file. By default, Spark
creates one partition for each block of the file (blocks being 64MB by
default in HDFS), but you can also ask for a higher number of
partitions by passing a larger value. Note that you cannot have fewer
partitions than blocks.
... but this is only the minimum number of partitions, so the exact count is not guaranteed.
So if you want to repartition on read, you should use this:
dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
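Since the original script writes parquet to S3 through the DataFrame API rather than saveAsTextFile, the same idea applies there too; a hedged sketch (the DataFrame name, partition count and S3 path are illustrative):
flat_df.repartition(50).write.mode("overwrite").parquet("s3a://some-bucket/some-prefix/")
With 50 partitions (matching the 50 executors), the write should be spread across the executors instead of producing one large 480 MB file.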
There are 2 different things to consider:
HDFS block size: the block size of HDFS is configurable in hdfs-site.xml (128 MB by default). If a file is larger than the block size, the rest of the file data is assigned to a new block. That is not something you see; it is done internally, and the whole process is sequential.
Partitions: when Spark comes into the picture, so does parallelism. Ideally, if you do not manually provide the number of partitions, it follows from the block size in the default configuration. On the other hand, if you want to customize the number of partitioned files, you can go ahead and use the repartition(n) API, where n is the number of partitions.
These partitions (the part files) are visible to you in HDFS when you browse it.
Also, to increase performance, you can give specifications such as the number of executors, executor memory, and cores per executor to spark-submit / pyspark / spark-shell. The performance when writing any file depends heavily on the format and compression codec used.
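For example, the same knobs can also be set when building the session rather than on the spark-submit command line (a hedged sketch; the values are illustrative, not tuned, and spark.executor.instances only applies on cluster managers such as YARN):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.executor.instances", "50")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "4")
         .getOrCreate())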
Thanks for reading.

How does spark determine the number of tasks?

I am a bit confused by the number of tasks that are created by Spark when reading a number of text files.
Here is the code:
val files = List("path/to/files/a/23",
  "path/to/files/b/",
  "path/to/files/c/0")
val ds = spark.sqlContext.read.textFile(files: _*)
ds.count()
Each of the folders a, b, c contains 24 files, so that there are a total of 26 files since the complete b folder is read. Now if I execute an action, like .count(), the Spark UI shows me that there are 24 tasks. However, I would have thought that there are 26 tasks, as in 1 task per partition and 1 partition for each file.
It would be great if someone could give me some more insights into what is actually happening.
Text files are loaded using Hadoop input formats. The number of partitions depends on:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
minPartitions argument if provided
block size
compression if present (splittable / not splittable).
You'll find example computations here: Behavior of the parameter "mapred.min.split.size" in HDFS
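As a hedged illustration of the minPartitions argument from the list above (it applies to the RDD API; the count is only a hint, and the result still depends on the split-size settings and on whether the files are splittable):
rdd = sc.textFile("path/to/files/b/", minPartitions=48)
print(rdd.getNumPartitions())  # typically at least 48 when the files are splittable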

Spark 2.0.0 Partition size for a parquet

I am trying to understand how I could improve (or increase) the parallelism of tasks that run for a particular spark job.
Here is my observation...
scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size()
25
$ hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
200
$ hadoop fs -du -h -s hdfs://somefile
2.2 G
I notice that the number of part files created in HDFS during the save operation depends on the repartition / coalesce used, meaning the number of part files can be tweaked with that parameter.
But how do I control the read's partitions.size()? That is, I want it to be 200 (without having to repartition during the read operation), so that more tasks run for this job.
This has a major impact in terms of the time it takes to perform query operations in this job.
On a side note, I do understand that 200 parquet part files for the above 2.2 G seems like overkill for a 128 MB block size; ideally it should be around 18 parts.
Please advise.
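Along the lines of the spark.sql.files.maxPartitionBytes answer earlier on this page, a hedged sketch (not from this thread): with 200 part files of roughly 11 MB each, lowering the packing limit to about one file's size should yield close to 200 read partitions (the exact count also depends on spark.sql.files.openCostInBytes).
spark.conf.set("spark.sql.files.maxPartitionBytes", 12 * 1024 * 1024)
print(spark.read.parquet("hdfs://somefile").rdd.getNumPartitions())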
