I am a bit confused by the number of tasks that are created by Spark when reading a number of text files.
Here is the code:
val files = List("path/to/files/a/23",
  "path/to/files/b/",
  "path/to/files/c/0")
val ds = spark.sqlContext.read.textFile(files: _*)
ds.count()
Each of the folders a, b, and c contains 24 files, but since a single file is read from a (a/23) and from c (c/0) while the complete b folder is read, a total of 26 files are read. Now, if I execute an action like .count(), the Spark UI shows me that there are 24 tasks. However, I would have expected 26 tasks, i.e. 1 task per partition and 1 partition for each file.
It would be great if someone could give me some more insights into what is actually happening.
Text files are loaded using Hadoop formats. Number of partitions depends on:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
minPartitions argument if provided
block size
compression, if present (splittable / not splittable).
You'll find example computations here: Behavior of the parameter "mapred.min.split.size" in HDFS
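For example, here is a hedged PySpark sketch of adjusting one of these knobs; the path and the 64MB value are placeholders, and the "spark.hadoop." prefix is what forwards the setting to the Hadoop input format.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize",
                 str(64 * 1024 * 1024))     # cap each split at ~64MB
         .getOrCreate())

# minPartitions is only a hint for the minimum number of splits.
rdd = spark.sparkContext.textFile("path/to/files/b/", minPartitions=48)
print(rdd.getNumPartitions())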
Related
I have millions of Gzipped files to process and converting to Parquet. I'm running a simple Spark batch job on EMR to do the conversion, and giving it a couple million files at a time to convert.
However, I've noticed that there is a big delay from when the job starts to when the files are listed and split up into a batch for the executors to do the conversion. From what I have read and understood, the scheduler has to get the metadata for those files and schedule the tasks. However, this step is taking 15-20 minutes for a million files to be split up into tasks for a batch. Even though the actual task of listing the files and doing the conversion only takes 15 minutes with my cluster of instances, the overall job takes over 30 minutes. It appears that it takes a lot of time for the driver to index all the files to split them up into tasks. Is there any way to increase parallelism for this initial stage of indexing the files and splitting up tasks for a batch?
I've tried tinkering with and increasing spark.driver.cores thinking that it would increase parallelism, but it doesn't seem to have an effect.
You can try setting the config below:
spark.conf.set("spark.default.parallelism", x)
where x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5
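As a sketch only, here is how that rule of thumb could be plugged into a session; the cluster numbers below are hypothetical placeholders.
from pyspark.sql import SparkSession

# Hypothetical cluster shape; replace with your own numbers.
total_nodes_in_cluster = 10
total_cores_in_node = 16
x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5

spark = (SparkSession.builder
         .config("spark.default.parallelism", str(x))
         .getOrCreate())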
This is a common problem with Spark (and other big data tools), as it uses only the driver to list all the files from the source (S3) and their paths.
Some more info here
I have found this article really helpful for solving this issue.
Instead of using Spark to list and get the metadata of the files, we can use PureTools to create a parallelized RDD of the file paths and pass that to Spark for processing.
S3 Specific Solution
If you do not want to install and set up the tools from the guide above, you can also use an S3 manifest file to list all the files present in a bucket and iterate over them with RDDs in parallel.
Steps for S3 Manifest Solution
# Create an RDD from the list of file paths
pathRdd = sc.parallelize([file1, file2, file3, ......., file100])

# Define a function which reads the data of one file
def s3_path_to_data(path):
    # Get the data from S3 and return it in whichever format you like,
    # i.e. a string, an array of strings, etc.
    ...

# Call flatMap on the pathRdd
dataRdd = pathRdd.flatMap(s3_path_to_data)
Details
Spark will create pathRdd with the default number of partitions, then call the s3_path_to_data function on each partition's rows in parallel.
Partitions play an important role in Spark parallelism. For example, if you have 4 executors and 2 partitions, only 2 executors will do the work.
You can play with the number of partitions and the number of executors to achieve the best performance for your use case.
The following attributes are useful for getting insights into your DataFrame or RDD specs so you can fine-tune the Spark parameters:
rdd.getNumPartitions
rdd.partitions.length
rdd.partitions.size
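For example, here is a minimal sketch (with hypothetical executor numbers) of spreading the path RDD over enough partitions to keep every core busy; pathRdd and s3_path_to_data are the names from the snippet above.
# Hypothetical resources; replace with the shape of your cluster.
num_executors = 4
cores_per_executor = 4

pathRdd = pathRdd.repartition(num_executors * cores_per_executor)
print(pathRdd.getNumPartitions())

dataRdd = pathRdd.flatMap(s3_path_to_data)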
Input Data:
a hive table (T) with 35 files (~1.5GB each, SequenceFile)
files are in a gs bucket
default fs.gs.block.size=~128MB
all other parameters are default
Experiment 1:
create a dataproc with 2 workers (4 core per worker)
run select count(*) from T;
Experiment 1 Result:
~650 tasks created to read the hive table files
each task read ~85MB data
Experiment 2:
create a dataproc with 64 workers (4 core per worker)
run select count(*) from T;
Experiment 2 Result:
~24,480 tasks created to read the hive table files
each task read ~2.5MB data
(It seems to me that having one task read only 2.5MB is not a good idea, since the time to open the file is probably longer than the time to read 2.5MB.)
Q1: Any idea how spark determines the number of tasks to read hive table data files?
I repeated the same experiments by putting the same data in hdfs and I got similar results.
My understanding is that the number of tasks to read hive table files should be the same as the number of blocks in hdfs. Q2: Is that correct? Q3: Is that also correct when data is in gs bucket (instead of hdfs)?
Thanks in advance!
The number of tasks in one stage is equal to the number of partitions of the input data, which is in turn determined by the data size and the related configs (dfs.blocksize (HDFS), fs.gs.block.size (GCS), mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize). For a complex query which involves multiple stages, it is the sum of the number of tasks of all stages.
There is no difference between HDFS and GCS, except they use different configs for block size, dfs.blocksize vs fs.gs.block.size.
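As a hedged sketch (not a recommendation, and the 256MB value is only illustrative), one way that may keep tasks from becoming too small on the larger cluster is to raise the minimum split size before reading the table; the "spark.hadoop." prefix forwards the setting to the Hadoop input format.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
                 str(256 * 1024 * 1024))
         .enableHiveSupport()
         .getOrCreate())

spark.sql("select count(*) from T").show()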
See the following related questions:
How are stages split into tasks in Spark?
How does Spark SQL decide the number of partitions it will use when loading data from a Hive table?
I have a parquet directory having 5 files as shown below:
I am using Spark 2.2 and reading this directory using the code below:
It is not clear to me why Spark determines 7 partitions (alternateDF.rdd().getNumPartitions()) when there are 5 files (each smaller than the block size) in the parquet directory.
5 tasks have input records, but the last 2 tasks have 0 input records and non-zero input data. Could you please explain the behavior of each task?
@Aman,
You can follow this old question link.
Simply put, the following are the 3 parameters the number of partitions depends on (from the link above):
spark.default.parallelism (roughly translates to the number of cores available to the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Spark source code for reference.
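For intuition, here is a small Python sketch of the split-size calculation those parameters feed into, modeled on FilePartition.maxSplitBytes in the Spark source; treat it as an approximation, since the exact logic can differ between Spark versions.
# Approximate reconstruction of Spark SQL's file-split sizing.
def max_split_bytes(total_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes
                    default_parallelism=8):                 # spark.default.parallelism
    bytes_per_core = (total_bytes + num_files * open_cost) // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

# Hypothetical example: five ~10MB parquet files on an 8-core application.
print(max_split_bytes(total_bytes=5 * 10 * 1024 * 1024, num_files=5))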
I am a Spark Newbie. I have a simple pyspark script. It reads a json file, flattens it and writes it to S3 location as parquet compressed file.
The read and transformation steps run very fast and use 50 executors (which I set in the conf). But the write stage takes a long time and writes only one large file (480MB).
How is the number of files saved decided?
Can the write operation be sped up somehow?
Thanks,
Ram.
The number of files output is equal to the number of partitions of the RDD being saved. In this sample, the RDD is repartitioned to control the number of output files.
Try:
repartition(numPartitions) - Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them.
This always shuffles all data over the network.
>>> dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")
The number of files output is the same as the number of partitions of the RDD.
$ hadoop fs -ls /user/cloudera/sqoop_import/orders_test
Found 3 items
-rw-r--r-- 1 cloudera cloudera 0 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 1499519 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00000
-rw-r--r-- 1 cloudera cloudera 1500425 2016-12-28 12:52 /user/cloudera/sqoop_import/orders_test/part-00001
Also check this: coalesce(numPartitions)
source-1 | source-2
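The same idea applies to the DataFrame writer in the question; here is a hedged sketch where the DataFrame name and the S3 path are hypothetical placeholders.
# Repartition (or coalesce) before writing to control how many files come out.
(flattened_df
    .repartition(50)            # roughly one output file per partition
    .write
    .mode("overwrite")
    .parquet("s3://your-bucket/output/"))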
Update:
The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
... but this is the minimum number of possible partitions, so it is not guaranteed.
So if you want to control partitioning on read, you can do this:
dataRDD=sc.textFile("/user/cloudera/sqoop_import/orders").repartition(2)
There are 2 different things to consider:
HDFS block size: The HDFS block size is configurable in hdfs-site.xml (128MB by default). If a file is larger than the block size, the remaining data is assigned to a new block. This is not something you can see; it is done internally, and the whole process is sequential.
Partitions: When Spark comes into the picture, so does parallelism. Ideally, if you do not manually provide the number of partitions, the partition size defaults to the block size (one partition per block) in the default configuration. On the other hand, if you want to customize the number of partitioned files, you can use an API such as repartition(n), where n is the number of partitions you want.
These partitioned files are visible in HDFS when you browse it.
Also, to increase performance, you can give specifications such as the number of executors, executor memory, and cores per executor when launching spark-submit / pyspark / spark-shell. The performance of writing any file depends heavily on the format and compression codec used.
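For instance, a sketch of setting those knobs when building the session; the values below are illustrative, not recommendations.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.instances", "4")   # num executors
         .config("spark.executor.cores", "4")       # cores per executor
         .config("spark.executor.memory", "8g")     # executor memory
         .getOrCreate())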
Thanks for reading.
I have a bunch of compressed text files each line of which containing a JSON object. Simplified my workflow looks like this:
string_json = sc.textFile('/folder/with/gzip/textfiles/')
json_objects = string_json.map(make_a_json)
DataRDD = json_objects.map(extract_data_from_json)
DataDF = sqlContext.createDataFrame(DataRDD,schema).collect()
'''followed by some transformations to the dataframe'''
Now, the code works fine. The problem arises as soon as the number of files cannot be evenly divided among the executors.
As far as I understand it, that is because Spark is not extracting the files and then distributing the rows to the executors; rather, each executor gets one file to work with.
E.g. if I have 5 files and 4 executors, the first 4 files are processed in parallel and then the 5th file.
Because the 5th is not processed in parallel with the other 4 and cannot be divided among the 4 executors, it takes the same amount of time as the first 4 together.
This happens at every stage of the program.
Is there a way to transform this kind of compartmentalized RDD into an RDD or DataFrame that is not?
I'm using python 3.5 and spark 2.0.1
Spark operations are divided into tasks, or units of work that can be done in parallel. There are a few things to know about sc.textFile:
If you're loading multiple files, you're going to get 1 task per file, at minimum.
If you're loading gzipped files, you're going to get 1 task per file, at maximum.
Based on these two premises, your use case is going to see one task per file. You're absolutely right about how the tasks/cores ratio affects wall-clock time: having 5 tasks running on 4 cores will take roughly the same time as 8 tasks on 4 cores (though not quite, because stragglers exist and the first core to finish will take on the 5th task).
A rule of thumb is that you should have roughly 2-5 tasks per core in your Spark cluster to see good performance. But if you only have 5 gzipped text files, you're not going to see this. You could try to repartition your RDD (which uses a somewhat expensive shuffle operation) if you're doing a lot of work downstream:
repartitioned_string_json = string_json.repartition(100)  # repartition always shuffles; shuffle=True is an argument of coalesce, not repartition
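As a variation on the line above, the target count can also be derived from the rule of thumb rather than hard-coded; the core count here is a hypothetical value for your application.
# Aim for roughly 3 tasks per core (within the 2-5 rule of thumb).
total_cores = 4
repartitioned_string_json = string_json.repartition(total_cores * 3)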