Number of files generated by a Spark Job - apache-spark

I want to monitor the number of files that Spark generates, and maybe raise an exception if it is generating too many. Is there any way to see this?

It depends on how you perform the write. Assuming you are writing the contents of a DataFrame or RDD, the easiest way is to check the number of partitions in the final DataFrame/RDD, because each partition is written as a separate file.
Assuming you are using Scala, this gives you the number of partitions:
df.rdd.getNumPartitions
Instead of raising an exception and failing the job, I would suggest using coalesce to reduce the DataFrame to a number of partitions that suits your needs. For example, if the output is not too large (1 GB or less), I use coalesce(1) and write a single file.
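If you do want a hard check after the write, one option is to count the part files in the output directory. This is only a sketch under some assumptions: the output path is on a local or mounted filesystem (HDFS or S3 would need their own listing API), and the names TooManyFilesError, count_part_files and check_file_count are made up for this example.

```python
from pathlib import Path


class TooManyFilesError(RuntimeError):
    """Raised when an output directory holds more part files than allowed."""


def count_part_files(output_dir):
    # Spark names its data files part-*. The pattern skips _SUCCESS markers
    # and hidden checksum files such as .part-00000.crc. rglob also descends
    # into partitionBy subdirectories.
    return sum(1 for p in Path(output_dir).rglob("part-*") if p.is_file())


def check_file_count(output_dir, max_files):
    # Hypothetical guard: fail loudly when the job wrote too many files.
    n = count_part_files(output_dir)
    if n > max_files:
        raise TooManyFilesError(
            f"{output_dir} holds {n} part files (limit {max_files})"
        )
    return n
```

Calling check_file_count(path, 200) right after the write would return the count, or raise if the job produced more than 200 part files.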

Related

How Apache Spark can preserve order of lines in the output textFile?

Can anyone help me understand how Apache Spark is able to preserve the order of lines in the output when the input is read with textFile? Consider the code snippet below:
sparkContext.textFile(<inputTextFilePath>)
.coalesce(1)
.saveAsTextFile(<outputTextFilePath>)
The text file is several GB in size. I can see that the data is read in parallel by the worker nodes and written to the destination folder as a single file (since the partition count is set to 1). When I open the output file, all the lines are in order. How does Spark achieve this ordering?
There is no guarantee in general.
coalesce has optimization logic based on partition locality. A large file has many partitions, several of which may sit on the same worker, and coalesce may group them by locality to reduce data movement, so there is no guarantee that order is preserved. It may hold in some cases, but not always.
For Parquet and ORC other considerations apply, but you state this is a text file.

Spark - In this case, when does repartition occur?

I need to output a single file for each prefix, so the code is written like this: ds.repartition(1).write.partitionBy("prefix").mode(SaveMode.Overwrite).csv(output)
Before I added repartition, each prefix had thousands of files and the job completed in 2 hours. After adding repartition, each prefix has 1 file, but the job runs for more than 7 hours. At what stage is repartition executed? Am I using it appropriately?
Whenever you call repartition, Spark does a full shuffle and distributes the data as evenly as possible.
In your case, ds.repartition(1) shuffles all the data and brings it into a single partition on one of the worker nodes.
When you then perform the write, only one worker node/executor does the writing after partitioning by prefix. Because a single worker is doing all the work, it takes a long time.
A few things you could take into consideration:
If there is no real reason to have only one CSV file, try to avoid it.
Instead of repartition(1), use coalesce(1), which avoids the full shuffle that repartition(1) performs.
By saving a single CSV file, you are not using Spark's parallelism.
If you want to use prefix as the partition column, you need to run:
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
You can also use coalesce(1) instead of repartition(1), because in this case coalesce does not shuffle, whereas repartition does. With a single partition, just one task has to process all the data, which is why it takes 7 hours.
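To see how the file count and the parallelism pull against each other here, a small pure-Python model may help. It is not Spark code. It assumes, as a simplification, that each write task emits one file per distinct prefix value it holds, and the function names and the crc32-based assignment are invented for the illustration.

```python
import zlib


def count_output_files(prefixes, num_tasks, assign):
    # Each distinct (task, prefix) pair stands for one output file.
    pairs = set()
    for i, prefix in enumerate(prefixes):
        pairs.add((assign(i, prefix, num_tasks), prefix))
    return len(pairs)


def single_task(i, prefix, num_tasks):
    # Models repartition(1) / coalesce(1): every row goes to task 0.
    return 0


def round_robin(i, prefix, num_tasks):
    # Models rows spread over tasks without regard to prefix.
    return i % num_tasks


def by_prefix(i, prefix, num_tasks):
    # Models sending all rows of a prefix to the same task.
    return zlib.crc32(prefix.encode("utf-8")) % num_tasks


rows = ["a", "b", "c"] * 100
print(count_output_files(rows, 4, single_task))  # 3 files, written by 1 task
print(count_output_files(rows, 4, round_robin))  # 12 files, written by 4 tasks
print(count_output_files(rows, 4, by_prefix))    # 3 files, up to 3 tasks
```

The last strategy is roughly what repartitioning on the prefix column itself would do: still one file per prefix, but the write is no longer forced through a single task. Whether that helps depends on how evenly the data is spread across prefixes.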

How to control size of Parquet files in Glue?

I'm loading a data set into a DynamicFrame, performing a transformation and then writing it back to S3:
datasink = glueContext.write_dynamic_frame.from_options(
frame = someDateFrame,
connection_type = "s3",
connection_options = {
"path": "s3://the-bucket/some-data-set"
},
format = "parquet"
)
The result is 12 Parquet files with an average size of about 3MB.
First of all, I don't get why Glue/Spark doesn't by default create a single file of about 36MB instead, given that almost all consuming software (Presto/Athena, Spark) prefers a file size of about 100MB and not a pile of small files. If somebody has an insight here, I'd appreciate hearing about it.
But practically speaking I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Using coalesce(1) will hurt Glue's performance in the long run. It may work for small files, but it takes an extremely long time for larger ones.
coalesce(1) makes a single Spark executor write the file, whereas without coalesce() all the Spark executors would share the write.
coalesce(1) also costs more: one executor running for a long time costs more than all executors running for a fraction of that time.
coalesce(1) took 4 hours 48 minutes to process 1 GB of Snappy-compressed Parquet data.
coalesce(9) took 48 minutes for the same data.
Without coalesce(), the same job took 25 minutes.
I haven't tried this yet, but you can set accumulator_size in write_from_options.
See https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py for how to pass the value.
Alternatively, you can convert to a PySpark DataFrame and coalesce it to 1 partition before the write, to make sure it writes to one file only.
df.coalesce(1).write.format('parquet').save('s3://the-bucket/some-data-set')
Note that writing to 1 file does not take advantage of parallel writing and will therefore increase the time to write.
You could try repartition(1) before writing the dynamic dataframe to S3. Refer here to understand why coalesce(1) is a bad choice for merging. It might also cause out-of-memory (OOM) exceptions if a single node cannot hold all the data to be written.
I don't get why Glue/Spark won't by default instead create a single file about 36MB large given that almost all consuming software (Presto/Athena, Spark) prefer a file size of about 100MB and not a pile of small files.
The number of output files is directly linked to the number of partitions. Spark cannot assume a default size for output files because that is application dependent. The only way to control the size of the output files is to act on the number of partitions.
I'm wondering if it is possible to make Glue/Spark produce a large file or at least larger files. Is that possible?
Yes, it is possible but there is no rule of thumb. You have to try different settings according to your data.
If you are using the AWS Glue API [1], you can control how small files are grouped into a single partition while you read data:
glueContext.write_dynamic_frame.from_options(
frame = someDateFrame,
connection_type = "s3",
connection_options = {
"path": "s3://the-bucket/some-data-set",
"groupFiles": "inPartition",
"groupSize": "10485760" # 10485760 bytes (10 MB)
},
format = "parquet"
)
If your transformation code does not change the data distribution much (no filtering, no joining, etc.), you should expect the output files to be almost the same size as the input that was read (not considering the compression rate). In general, though, Spark transformations are fairly complex, with joins, aggregates and filtering. This changes the data distribution and the number of final partitions.
In this case, you should use either coalesce() or repartition() to control the number of partitions you expect.
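As a starting point for picking that number, you can derive it from the data volume and the file size you are aiming for. The helper below is only a rough sketch: partitions_for_target_size is a made-up name, and it assumes you already know the approximate output size in bytes, which on disk depends on the format and compression.

```python
import math

MB = 1024 * 1024


def partitions_for_target_size(total_bytes, target_file_bytes):
    # Number of partitions (and so roughly of output files) needed so that
    # each file is at most about target_file_bytes. Always at least 1.
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative and target must be positive")
    return max(1, math.ceil(total_bytes / target_file_bytes))


# The 36 MB data set from the question with a 100 MB target fits in one file.
print(partitions_for_target_size(36 * MB, 100 * MB))    # 1
# 1 GB of output with a 128 MB target needs 8 partitions.
print(partitions_for_target_size(1024 * MB, 128 * MB))  # 8
```

The result would then be passed to coalesce() or repartition() before the write. Treat it as a first guess and adjust after looking at the files that actually come out.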
[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-job-output-large-files/

Why so many Parquet files created? Can we not limit Parquet output files?

Why are so many Parquet files created in Spark SQL? Can we not limit the number of Parquet output files?
In general, when you write to Parquet, Spark writes one file (or more, depending on various options) per partition. If you want to reduce the number of files, you can call coalesce on the DataFrame before writing, e.g.:
df.coalesce(20).write.parquet(filepath)
Of course, if you use other options (e.g. partitionBy), the number of files can increase dramatically.
Also note that coalescing to a very small number of partitions can become very slow (both because data is copied between partitions and because parallelism is reduced if the number is small enough). You might also get OOM errors if the data in a single partition is too large (when you coalesce, the partitions naturally get bigger).
A couple of things to note:
saveAsParquetFile is deprecated since version 1.4.0. Use write.parquet(path) instead.
Depending on your use case, searching for a specific string in Parquet files might not be the most efficient way to go.

How to combine small parquet files with Spark?

I have a Hive table that has a lot of small parquet files and I am creating a Spark data frame out of it to do some processing using SparkSQL. Since I have a large number of splits/files my Spark job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provides, that is, to combine these small input splits into larger ones by specifying a max split size setting. How can I achieve this with Spark? I tried using the coalesce function, but I can only specify the number of partitions with it (I can only control the number of output files with it). Instead I really want some control over the (combined) input split size that a task processes.
Edit: I am using Spark itself, not Hive on Spark.
Edit 2: Here is the current code I have:
//create a data frame from a test table
val df = sqlContext.table("schema.test_table").filter($"my_partition_column" === "12345")
//coalesce it to a fixed number of partitions. But as I said in my question
//with coalesce I cannot control the file sizes, I can only specify
//the number of partitions
df.coalesce(8).write.mode(org.apache.spark.sql.SaveMode.Overwrite)
.insertInto("schema.test_table")
I have not tried it, but I read in the getting started guide that setting this property should work: "hive.merge.sparkfiles=true"
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
If you are using Spark on Hive, Spark's abstraction doesn't provide an explicit way to split the data. However, we can control the parallelism in several ways.
You can use DataFrame.repartition(numPartitions: Int) to explicitly control the number of partitions.
If you are using HiveContext, ensure that hive-site.xml contains the CombinedInputFormat. That may help.
For more info, take a look at following documentation about Spark data parallelism - http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism.
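To make the idea of a "max split size" concrete, here is a toy model of packing small files into larger input splits. It is a simplified greedy sketch, not Spark's or Hive's actual algorithm, and pack_files is a name made up for this example.

```python
def pack_files(file_sizes, max_split_bytes):
    # Greedily group consecutive files so that each group stays within
    # max_split_bytes. A file larger than the limit gets a group of its own.
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


MB = 1024 * 1024
small_files = [3 * MB] * 12                     # twelve 3 MB files
print(len(pack_files(small_files, 10 * MB)))    # 4 splits of three files each
print(len(pack_files(small_files, 100 * MB)))   # 1 split holding all twelve
```

With a setting like this, the number of tasks follows from the data volume and the split size rather than from the number of files, which is the behaviour the question is asking for.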
