Spark DataFrame write operation is missing some partitions - apache-spark

We have a PySpark DataFrame with 100 partitions. When we write this DataFrame as Parquet files to an HDFS location, we get 100 files most of the time. However, sometimes fewer than 100 files are generated, and no exception or error is raised. Because of these missing files, we lose some records.
The Spark DAG shows all 100 tasks, so we are not able to figure out why this is happening.
Troubleshooting step 1: added code to print the per-partition record count before writing the DataFrame. It printed all 100 partitions with non-zero (evenly distributed) counts, which proves there is no empty partition.
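For reference, a minimal PySpark sketch of that per-partition count check (the 100-partition figure comes from the description above; the column name pid and the exact shape of the check are illustrative assumptions, not the poster's actual code):
from pyspark.sql.functions import spark_partition_id
# Count records per in-memory partition just before the write;
# an empty partition would show up here as a missing or zero row.
(df.withColumn("pid", spark_partition_id())
   .groupBy("pid")
   .count()
   .orderBy("pid")
   .show(100, truncate=False))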

Related

Repartition by dates for high concurrency and big output files

I'm running a Spark job on AWS Glue. The job transforms the data and saves the output to parquet files, partitioned by date (year, month, day directories). The job must be able to handle terabytes of input data and uses hundreds of executors, each with a 5.5 GB memory limit.
Input covers over 2 years of data. The output parquet files for each date should be as big as possible, optionally split into 500 MB chunks. Creating multiple small files for each day is not wanted.
A few tested approaches:
Repartitioning by the same columns as used in the write results in out-of-memory errors on the executors:
df = df.repartition(*output_partitions)
(df
.write
.partitionBy(output_partitions)
.parquet(output_path))
Repartitioning with an additional column containing a random value results in many small output files being written (corresponding to the spark.sql.shuffle.partitions value):
df = df.repartition(*output_partitions, "random")
(df
.write
.partitionBy(output_partitions)
.parquet(output_path))
Setting the number of partitions in the repartition function, for example to 10, gives 10 quite big output files, but I'm afraid it will cause out-of-memory errors when the actual data (TBs in size) is loaded:
df = df.repartition(10, *output_partitions, "random")
(df
.write
.partitionBy(output_partitions)
.parquet(output_path))
(df in the code snippets is a regular Spark DataFrame)
I know I can limit the output file size with the maxRecordsPerFile write option, but this only limits the output created from a single in-memory partition, so I would first need partitions created by date.
So the question is how to repartition the data in memory to:
split it over multiple executors to prevent out-of-memory errors,
save the output for each day to a limited number of big parquet files,
and write the output files in parallel (using as many executors as possible)?
I've read those sources but did not find a solution:
https://mungingdata.com/apache-spark/partitionby/
https://stackoverflow.com/a/42780452
https://stackoverflow.com/a/50812609
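For reference, one approach that is sometimes suggested for this kind of problem, shown only as a sketch: repartition by the date columns plus a bounded salt, so each day spreads over a small fixed number of shuffle partitions, and cap file sizes with maxRecordsPerFile. The names df, output_partitions, and output_path come from the question above; the files_per_day and max_records values are made-up illustrative numbers, not recommendations.
from pyspark.sql.functions import rand, floor
files_per_day = 8        # assumed target number of files per date partition
max_records = 5_000_000  # assumed per-file row budget, tuned toward ~500 MB files
(df
 .withColumn("salt", floor(rand() * files_per_day))
 .repartition(*output_partitions, "salt")  # each date lands in at most files_per_day shuffle partitions
 .drop("salt")                             # projection only, no extra shuffle
 .write
 .option("maxRecordsPerFile", max_records)
 .partitionBy(output_partitions)
 .parquet(output_path))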

PYSPARK - Solution for slow performance when writing a dataframe to parquet files when using repartition() before partitionBy()?

I want to write my data (contained in a dataframe) into parquet files.
I need to partition the data by two variables: "month" and "level" (the data is always filtered on these two variables).
If I do the following
data.write.format("parquet").partitionBy("month", "level").save("...")
I end up with the expected partitions; however, I have a lot of files per partition. Some of these files are really small, which hurts the performance of queries run on the data.
In order to correct that, I tried to apply repartition before writing the data :
data.repartition("month", "level").write.format("parquet").partitionBy("month", "level").save("...")
which gives me exactly what I want (one file per partition, with a decent size for each file).
===> the problem here is that the repartition causes a full shuffle of the data, which means that for 400 GB of input data, I end up with a few TB of shuffle...
Is there any way to optimize the repartition() before the partitionBy(), or to do this some other way?
Thanks!
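For reference, one hedged variant: the shuffle is hard to avoid if each (month, level) directory should be written by a single task, but you can trim what gets shuffled by selecting only the needed columns first, and guard against oversized files with the maxRecordsPerFile write option. The column name "value", the row budget, and the output path below are placeholders, not part of the question.
(data
 .select("month", "level", "value")        # hypothetical: keep only the columns you actually need
 .repartition("month", "level")            # one shuffle partition per (month, level)
 .write
 .format("parquet")
 .option("maxRecordsPerFile", 5_000_000)   # assumed per-file row budget
 .partitionBy("month", "level")
 .save("/path/to/output"))                 # placeholder path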

How Spark SQL reads Parquet partitioned files

I have a parquet file of around 1 GB. Each data record is a reading from an IoT device, capturing the energy consumed by the device over the last minute.
Schema: houseId, deviceId, energy
The parquet file is partitioned on houseId and deviceId. A file contains the data for the last 24 hours only.
I want to execute some queries on the data residing in this parquet file using Spark SQL. An example query finds the average energy consumed per device for a given house in the last 24 hours.
Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
df4.as(encoder).registerTempTable("deviceReadings");
ss.sql("Select avg(energy) from deviceReadings where houseId=3123).show();
The above code works well. I want to understand how Spark executes this query.
Does Spark read the whole Parquet file in memory from HDFS without looking at the query? (I don't believe this to be the case)
Does Spark load only the required partitions from HDFS as per the query?
What if there are multiple queries which need to be executed? Will Spark look at multiple queries while preparing an execution plan? One query may work with just one partition whereas the second query may need all the partitions, so a consolidated plan would load the whole file from disk into memory (if memory limits allow).
Will it make a difference in execution time if I cache df4 dataframe above?
Does Spark read the whole Parquet file in memory from HDFS without looking at the query?
It shouldn't scan all the data files, but in general it might access the metadata of all files.
Does Spark load only the required partitions from HDFS as per the query?
Yes, it does.
Will Spark look at multiple queries while preparing an execution plan?
It does not. Each query has its own execution plan.
Will it make a difference in execution time if I cache df4 dataframe above?
Yes, at least for now, it will make a difference - see Caching dataframes while keeping partitions
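To see the pruning directly, one possible check (a PySpark sketch, assuming an active spark session and the same /readings.parquet path as above) is to look at the physical plan: when the predicate uses a partition column, the FileScan node lists it under PartitionFilters.
df4 = spark.read.parquet("/readings.parquet")
# Expect something like "PartitionFilters: [isnotnull(houseId), (houseId = 3123)]" in the plan output.
df4.where("houseId = 3123").explain()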

Spark DataFrame Repartitioning Causing Uneven Partitions

I am using spark repartition to change the number of partitions in the dataframe.
While writing the data after repartitioning, I saw that parquet files of different sizes were created.
Here is the code I am using to repartition:
df.repartition(partitionCount).write.mode(SaveMode.Overwrite).parquet("/test")
Most of the partitions are only a few KB in size, while some are around 100 MB, which is the size I want to keep per partition.
Here is a sample
20.2 K /test/part-00010-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
20.2 K /test/part-00011-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
99.9 M /test/part-00012-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
Now if I open one of the 20.2 K parquet files and run a count action, the result is 0. For the 99.9 M file, the same count gives a non-zero result.
As per my understanding, repartition on a DataFrame does a full shuffle and tries to keep every partition the same size. However, the example above contradicts that.
Could someone please help me here?
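One way to gather more evidence (a PySpark sketch; the /test path comes from the question, everything else is assumed) is to read the output back and count rows per physical file, rather than opening the files one by one:
from pyspark.sql.functions import input_file_name
(spark.read.parquet("/test")
 .groupBy(input_file_name().alias("file"))  # group rows by the file they were read from
 .count()
 .orderBy("count")
 .show(truncate=False))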

How do I increase the number of partitions when I read in a hive table in Spark

So, I am trying to read in a Hive table in Spark with hiveContext.
The job basically reads data from two tables into two DataFrames, which are subsequently converted to RDDs. I then join them on a common key.
However, this join is failing due to a MetadataFetchFailException (What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?).
I want to avoid that by spreading my data across more nodes.
Currently, even though I have 800 executors, most data is being read into 10 nodes, each of which is using > 50% of its memory.
The question is: how do I spread the data over more partitions during the read operation? I do not want to repartition later on.
val tableDF = hiveContext.read.table("tableName")
  .select("colId1", "colId2")
  .rdd
  .flatMap(sqlRow => {
    // pull the two selected columns out of each Row
    Array((sqlRow.getAs[Any]("colId1"), sqlRow.getAs[Any]("colId2")))
  })
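One knob that is sometimes suggested here, sketched in PySpark under the assumption of a Spark 2.x+ session and a table backed by splittable files (e.g. Parquet or ORC) read through Spark's native file reader: lower the input split size so the scan produces more, smaller partitions before the join. The 32 MB value is only an illustration.
# Smaller input splits -> more read partitions (the default split size is 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)
table_df = spark.table("tableName").select("colId1", "colId2")
print(table_df.rdd.getNumPartitions())  # should increase as the split size shrinks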

Resources