How many files does a PySpark Parquet write generate? I have read that the output is one file per in-memory partition. However, this does not always seem to be true.
I am running a cluster with 6 executors and 6 GB of executor memory per executor. All the rest (PySpark memory, overhead, off-heap) is set to 2 GB.
using the following data:
import numpy as np
import pandas as pd

dummy_data = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1,2,3,4,5,6,7,8,9,10], 100000)}))
The following code, where I repartition without specifying a column to repartition by, always produces a number of files equal to the number of in-memory partitions:
df_dummy = dummy_data.repartition(200)
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_wo_id")
#files generated 200
However, the following code, where I do specify the column to repartition the data by, produces some random number of files:
df_dummy = dummy_data.repartition(200,'a')
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_w_id")
#files generated 11
Can you help me understand the number of output files that get generated by the PySpark Parquet writer?
This answer does not explain everything you're noticing, but it probably contains enough useful information that it would be a pity not to share it.
The reason you're seeing a different number of output files is the way your data is laid out after those two repartition operations.
dummy_data.repartition(200) repartitions your individual rows using round-robin partitioning;
the result is that your data has a random ordering, because your input data has a random ordering.
dummy_data.repartition(200, 'a') uses hash partitioning on the values of column a;
the result is that your data is chopped up in a very specific way: hashing the column values means all rows where a == 1 always land in the same partition.
Since the number of distinct values of a is much smaller than your number of partitions, each non-empty partition will contain only one distinct value of a, and most of the 200 partitions will hold no rows at all.
Now, there is a pattern in the number of output part-files you receive:
In the case of dummy_data.repartition(200), you simply get the same number of part-files as partitions: 200 in your example.
In the other case, you get 11 part-files. If you have a look at the content of those part-files, you will see that there is 1 empty file + 10 filled files, one for each distinct value of your original dataset. This leads to the conclusion that while writing your files, something is being smart and merging those minuscule and identical files. I'm not sure whether this is Spark, or the PARQUET_OUTPUT_COMMITTER_CLASS, or something else.
Conclusion
In general, you get the same number of part-files as the number of partitions.
In your specific case, when you repartition by the column (which is the only column in the Row), your Parquet part-files will each contain a bunch of identical values. It seems that something (I don't know what) is being smart and merging files with the same values.
In your case, you got 11 part-files because there is 1 empty file plus 10 files, one for each distinct value in your dataframe. Try changing np.random.choice([1,2,3,4,5,6,7,8,9,10], ...) to np.random.choice([1,2,3,4,5,6,7,8], ...) and you will see that you get 9 part-files (8 + 1).
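One way to see this distribution for yourself is to tag each row with the partition it lands in after the hash repartition and count rows per partition. A small sketch (reusing dummy_data from the question; only non-empty partitions appear in the output):

from pyspark.sql.functions import spark_partition_id

# Tag each row with the id of the partition it landed in after hash-partitioning on 'a'
per_partition = (dummy_data.repartition(200, 'a')
                 .withColumn('pid', spark_partition_id())
                 .groupBy('pid')
                 .count()
                 .orderBy('pid'))
per_partition.show()
# Expect at most 10 non-empty partitions (one per distinct value of 'a',
# fewer if two values happen to hash into the same partition);
# the remaining partitions out of the 200 hold no rows and do not appear here.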
Most likely, the reason you see 11 files being written after you do a .repartition(200, 'a') is that your first partition (with partition id = 0) ends up empty. Spark lets the task working on that empty partition proceed with the write, but suppresses writing the empty Parquet files for all other empty partitions. This behavior can be traced back to the changes made for JIRA SPARK-21435 "Empty files should be skipped while write to file", and the corresponding code in FileFormatWriter.scala:
:
val dataWriter =
  if (sparkPartitionId != 0 && !iterator.hasNext) {
    // In case of empty job, leave first partition to save meta for file format like parquet.
    new EmptyDirectoryDataWriter(description, taskAttemptContext, committer)
  } else if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {
:
So, if you repartition your dataset such that partition 0 becomes non-empty, you would not see any empty files written.
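A quick way to confirm this on the example above (a sketch, again using dummy_data from the question) is to count the rows in every partition at the RDD level, empty partitions included:

df_dummy = dummy_data.repartition(200, 'a')

# Number of rows per partition, including the empty ones
sizes = df_dummy.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()

print(len(sizes))                       # 200 partitions in total
print(sizes[0])                         # rows in partition 0
print(sum(1 for s in sizes if s == 0))  # how many partitions are empty

If partition 0 turns out to be one of the empty ones, you should see exactly one empty part-file in the output, which matches the SPARK-21435 behaviour described above.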
Related
I've got a problem with Spark, its driver, and an OoM issue.
Currently I have a dataframe which is built by joining several sources (different tables in Parquet format), and it contains thousands of tuples. The rows have a date which represents the creation date of the record, and the distinct dates are only a few.
I do the following:
from pyspark.sql.functions import year, month
# ...
selectionRows = inputDataframe.select(year('registration_date').alias('year'), month('registration_date').alias('month')).distinct()
selectionRows.show() # correctly shows 8 tuples
selectionRows = selectionRows.collect() # goes heap space OoM
print(selectionRows)
Reading the memory consumption statistics shows that the driver does not exceed ~60%. I thought that the driver should load only the distinct subset, not the entire dataframe.
Am I missing something? Is it possible to collect those few rows in a smarter way? I need them as a pushdown predicate to load a secondary dataframe.
Thank you very much!
EDIT / SOLUTION
After reading the comments and working out my actual needs, I now cache the dataframe at every "join/elaborate" step, so that in a timeline I do the following:
Join with loaded table
Queue required transformations
Apply the cache transformation
Print the count to keep track of cardinality (mainly for tracking / debugging purposes), which materializes all queued transformations plus the cache
Unpersist the cache of the previous sibling step, if available (tick/tock paradigm)
This reduced some complex ETL jobs to about 20% of the original time (previously every count re-applied the transformations of every earlier step).
Lesson learned :)
After reading the comments, I worked out the solution for my use case.
As mentioned in the question, I join several tables one after another into a "target dataframe", and at each iteration I apply some transformations, like so:
# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)
print(f'Target after table "other": {target.count()}')
The problem of slowness / OoM was that Spark was forced to redo all the transformations from start to finish for every table because of the count at the end, making it slower and slower at each table / iteration.
The solution I found is to cache the dataframe at each iteration, like so:
cache = None  # holds the previously cached DataFrame, if any
# ...
# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)
target = target.cache()
target_count = target.count() # actually do the cache
if cache is not None:
    cache.unpersist()  # free the memory held by the old cached version
cache = target
print(f'Target after table "other": {target_count}')
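As a side note on the pushdown-predicate need from the original question: once the handful of distinct (year, month) rows has been collected, it can be turned into a single filter condition for the secondary dataframe. A minimal sketch, assuming selectionRows is still the dataframe of distinct values, that the secondary table also has a registration_date column, and that the path is a placeholder:

from pyspark.sql import functions as F

# The distinct set is tiny (8 rows here), so collect() is cheap once it can be computed
pairs = [(r['year'], r['month']) for r in selectionRows.collect()]

# Build one OR-ed condition out of the collected pairs
predicate = None
for y, m in pairs:
    cond = (F.year('registration_date') == y) & (F.month('registration_date') == m)
    predicate = cond if predicate is None else (predicate | cond)

secondary_df = spark.read.parquet('path/to/secondary_table').filter(predicate)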
My objective is to re-partition the data from the source and save it at the destination path. I intend to create only one S3 object per partition, and I have achieved that using the following:
df.repartition("created_year", "created_month", "created_day").write.mode('overwrite').partitionBy( "created_year", "created_month", "created_day").parquet(dest_path)
I want to ensure that all the data has been transferred, and I read that re-partitioning might drop duplicates. So I decided to check whether the distinct counts of the source and the destination match. So I did the following:
source_df.distinct().count() == destination.distinct().count()
This returns False, indicating that the distinct counts differ between source and destination, even in jobs where all tasks completed.
Is this the right way to check whether the complete data was re-partitioned and saved? What is the better/right way?
The source and destination are the two different buckets on Amazon S3.
A minimal example of the check is:
def count_distinct(src_path, spark):
    try:
        df = spark.read.parquet(f'{src_path}')
        distinct_count = df.distinct().count()
        print(distinct_count)
        return distinct_count
    except:
        log_failed_bucket(src_path)
        return None

def compare_distinct(spark, bucket_name):
    src_path = form_path_string(bucket_name)
    original_distinct_count = count_distinct(src_path, spark)
    dest_path = form_path_string(bucket_name, repartitioned_data=True)
    final_distinct_count = count_distinct(dest_path, spark)
    return original_distinct_count == final_distinct_count
Unless you've included every column in partitionBy, it is not possible for duplicates to be removed while writing, and providing all columns in partitionBy is not practical anyway.
If any of the partition columns contains nulls or empty values, those rows will end up under a __HIVE_DEFAULT_PARTITION__ folder for the respective partition column.
If multiple paths are read using spark.read.format().load(), then you should provide the `basePath` option (there is a chance of missing paths if they are formed dynamically); otherwise you can load `basePath` directly and follow the sanity-check approach mentioned below.
You can compare the count / distinct count after grouping on the partition columns between the source and target datasets.
The total count can be checked from the basePath against the sourcePath.
Check the distinct count of partition-column value combinations between source and target.
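A small sketch of that per-partition sanity check (the column names are the ones from the question; the paths and the basePath usage are placeholders):

from pyspark.sql import functions as F

part_cols = ["created_year", "created_month", "created_day"]

source_df = spark.read.parquet(src_path)                          # src_path / dest_path are placeholders
dest_df = spark.read.option("basePath", dest_path).parquet(dest_path)

# Row counts per partition-column combination on both sides
src_counts = source_df.groupBy(*part_cols).count().withColumnRenamed("count", "src_cnt")
dst_counts = dest_df.groupBy(*part_cols).count().withColumnRenamed("count", "dst_cnt")

# Any combination whose counts differ, or that exists on one side only, shows up here
diff = (src_counts.join(dst_counts, part_cols, "full_outer")
        .where(~F.col("src_cnt").eqNullSafe(F.col("dst_cnt"))))

print(diff.count())  # 0 means the per-partition row counts match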
I am working with Spark SQL where I need to find the diff between two large CSVs.
The diff should give:
Inserted rows or new records // comparing only IDs
Changed rows (not including inserted ones) - comparing all column values
Deleted rows // comparing only IDs
Spark 2.4.4 + Java
I am using Databricks to read/write the CSVs.
Dataset<Row> insertedDf = newDf_temp.join(oldDf_temp,
        oldDf_temp.col(key).equalTo(newDf_temp.col(key)), "left_anti");
Long insertedCount = insertedDf.count();
logger.info("Inserted File Count == " + insertedCount);

Dataset<Row> deletedDf = oldDf_temp.join(newDf_temp,
        oldDf_temp.col(key).equalTo(newDf_temp.col(key)), "left_anti")
        .select(oldDf_temp.col(key));
Long deletedCount = deletedDf.count();
logger.info("deleted File Count == " + deletedCount);

Dataset<Row> changedDf = newDf_temp.exceptAll(oldDf_temp); // This gives rows (New + changed Records)
Dataset<Row> changedDfTemp = changedDf.join(insertedDf,
        changedDf.col(key).equalTo(insertedDf.col(key)), "left_anti"); // This gives only changed records
Long changedCount = changedDfTemp.count();
logger.info("Changed File Count == " + changedCount);
This works well for CSVs with up to 50 or so columns.
The above code fails even for a single-row CSV with 300+ columns, so I am sure this is not a file-size problem.
But if I have a CSV with 300+ columns, it fails with the exception:
Max iterations (100) reached for batch Resolution – Spark Error
If I set the below property in Spark, it works!
sparkConf.set("spark.sql.optimizer.maxIterations", "500");
But my question is: why do I have to set this?
Is there something wrong with what I am doing?
Or is this behaviour expected for CSVs with many columns?
Can I optimize it in any way to handle CSVs with a large number of columns?
The issue you are running into is related to how Spark takes the instructions you give it and transforms them into the actual work it is going to do. It first needs to understand your instructions by running the Analyzer, then it tries to improve them by running its Optimizer. The setting appears to apply to both.
Specifically, your code is failing during a step in the Analyzer. The Analyzer is responsible for figuring out, when you refer to things, what you are actually referring to. For example, mapping function names to implementations, or mapping column names across renames and different transforms. It does this in multiple passes, resolving additional things in each pass, then checking again to see whether it can resolve more.
I think what is happening in your case is that each pass probably resolves one column, and 100 passes isn't enough to resolve all of the columns. By increasing the limit you are giving it enough passes to get entirely through your plan. This is definitely a red flag for a potential performance issue, but if your code is working then you can probably just increase the value and not worry about it.
If it isn't working, then you will probably need to do something to reduce the number of columns used in your plan, perhaps combining all the columns into one encoded string column as the key. You might also benefit from checkpointing the data before doing the join so you can shorten your plan; see the sketch below.
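A rough sketch of the checkpoint idea, in PySpark for brevity (the Java API is analogous; paths, file names and the key column are placeholders):

spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")   # placeholder checkpoint location

old_df = spark.read.option("header", True).csv("old.csv")        # placeholder inputs
new_df = spark.read.option("header", True).csv("new.csv")

# checkpoint() materializes the data and truncates the logical plan,
# so the join below starts from a short, already-resolved plan
old_df = old_df.checkpoint()
new_df = new_df.checkpoint()

inserted_df = new_df.join(old_df, on="id", how="left_anti")      # 'id' stands in for your key column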
EDIT:
Also, I would refactor your code above so you can do it all with only one join. This should be a lot faster, and might solve your other problem.
Each join leads to a shuffle (data being sent between compute nodes), which adds time to your job. Instead of computing adds, deletes and changes independently, you can do them all at once. Something like the code below. It's in Scala pseudo-code because I'm more familiar with that than with the Java APIs.
import org.apache.spark.sql.functions._

var oldDf = ..
var newDf = ..

val changeCols = newDf.columns.filter(_ != "id").map(col)

// Make the columns you want to compare into a single struct column for easier comparison
newDf = newDf.select($"id", struct(changeCols: _*) as "compare_new")
oldDf = oldDf.select($"id", struct(changeCols: _*) as "compare_old")

// Outer join on ID
val combined = oldDf.join(newDf, Seq("id"), "outer")

// Figure out the status of each row based upon the presence of old/new:
// IF the old side is missing, it must be an ADD
// IF the new side is missing, it must be a DELETE
// IF both sides are present but different, it's a CHANGE
// ELSE it's NOCHANGE
val status = when($"compare_old".isNull, lit("add")).
  when($"compare_new".isNull, lit("delete")).
  when($"compare_new" =!= $"compare_old", lit("change")).
  otherwise(lit("nochange"))

val labeled = combined.select($"id", status as "status")
At this point, we have every ID labeled ADD/DELETE/CHANGE/NOCHANGE, so we can just do a groupBy/count. This aggregation can be done almost entirely map-side, so it will be a lot faster than a join.
labeled.groupBy("status").count.show
I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~500 (500 features in ML parlance). The total number of rows in the dataset is ~160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, as in Python pandas), where each empty feature value is filled with the previously available value for that column, per ID and date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) //first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
  originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2 hours without completing.
Am I doing something wrong at the code level? Given the size of the data and the instances, I would assume it should finish in a reasonable amount of time. And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those two lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently withColumn can be very slow for a large number of columns, and I was also getting StackOverflowErrors because of it.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes with query-plan execution that makes repeated withColumn calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)
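For reference, a rough PySpark equivalent of the same single-select approach (a sketch; it assumes the same ID/DATE column names, and inputLocation / outputLocation are placeholders):

from pyspark.sql import Window, functions as F

raw = spark.read.option("header", True).csv(inputLocation)

w = (Window.partitionBy("ID")
     .orderBy("DATE")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# One select over all columns instead of one withColumn call per column:
# each value is coalesced with the last non-null value seen so far in the window
filled = raw.select([
    F.coalesce(F.col(c), F.last(c, ignorenulls=True).over(w)).alias(c)
    for c in raw.columns
])

filled.write.option("header", True).csv(outputLocation)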
I'm running code on Apache Spark in a multi-node environment (one master and two slave nodes) in which I'm manipulating a dataframe and then fitting a logistic regression to it. In between, I'm also writing out the interim transformed files. I have witnessed a peculiar observation (and yes, I've double-checked and triple-checked it) which I'm not able to explain, and want to confirm whether this could be because of my code or whether there might be other factors in play.
I have a dataframe like
df
uid rank text
a 1 najn
b 2 dak
c 1 kksa
c 3 alkw
b 1 bdsj
c 2 asma
I sort it with the following code
sdf = df.orderBy("uid", "rank")
sdf.show()
uid rank text
a 1 najn
b 1 bdsj
b 2 dak
c 1 kksa
c 2 asma
c 3 alkw
and write the transformed df to HDFS using
sdf.repartition(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("/someLocation")
Now when I again try to view the data, it seems to have lost its sorting:
sdf.show()
uid rank text
a 1 najn
c 2 asma
b 2 dak
c 1 kksa
c 3 alkw
b 1 bdsj
When I skip the writing code, it works fine.
Does anyone have any pointers as to whether this is a valid case and whether we can do something to resolve it?
P.S. I tried various variations of the writing code: increasing the number of partitions, removing the partitioning altogether, and saving to other formats.
The problem is not writing to HDFS but rather the repartition, as stated in the comments by zero323.
If you are planning to write everything down to a single file you should do it like this:
sdf.coalesce(1).orderBy("uid", "rank").write...
coalesce avoids a full repartition (it just copies the partitions one after the other instead of shuffling everything by hash), which means your data would still be ordered within the original partitions and is therefore faster to order (of course you can still lose the original ordering, although it won't help much here).
Note that this is not scalable, as you are pulling everything into a single partition. If you had written without any repartitioning, you would get a number of files matching the original number of partitions of sdf. Each file would be ordered internally, so you could easily combine them.
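If you do end up keeping multiple output files, one related option (an aside, not part of the answer above) is sortWithinPartitions, which sorts the rows inside each partition before writing, so every part-file comes out ordered even though there is no global order across files. A small PySpark sketch:

# Each part-file is sorted by (uid, rank); global ordering across files is not guaranteed
(sdf.sortWithinPartitions("uid", "rank")
    .write.format("com.databricks.spark.csv")
    .option("header", "true")
    .save("/someLocation_sorted"))   # placeholder output path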