I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~ 500 (500 features in ML parlance). The total number of rows in the dataset is ~ 160 millions.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function(fillna style form python pandas), where each empty feature value gets set with the previously available value for that column, per Id and Date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) //first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on a 4 m4.large instances on Amazon EMR, with spark 2.2.1. and dynamic allocation enabled.
The job runs for over 2h without completing.
Am I doing something wrong, at the code level? Given the size of the data, and the instances, I would assume it should finish in a reasonable amount of time? And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill
threshold of
4096 rows, switching to
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is theresome other setting I should know about? Those 2 lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the fold left logic without using withColumn calls. Apparently they can be very slow for large number of columns, and I was also getting stackoverflow errors because of that.
I would be curious to know why this massive difference - and what exactly happens behind the scenes with the query plan execution, which makes repeated withColumns calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)
Related
I've got a problem with Spark, its driver and an OoM issue.
Currently I have a dataframe which is being built with several, joined sources (actually different tables in parquet format), and there are thousands of tuples. They have a date which represents the date of creation of the record, and distinctly they are a few.
I do the following:
from pyspark.sql.functions import year, month
# ...
selectionRows = inputDataframe.select(year('registration_date').alias('year'), month('registration_date').alias('month')).distinct()
selectionRows.show() # correctly shows 8 tuples
selectionRows = selectionRows.collect() # goes heap space OoM
print(selectionRows)
Reading the memory consumption statistics shows that the driver does not exceed ~60%. I thought that the driver should load only the distinct subset, not the entire dataframe.
Am I missing something? Is it possible to collect those few rows in a smarter way? I need them as a pushdown predicate to load a secondary dataframe.
Thank you very much!
EDIT / SOLUTION
After reading the comments and elaborating my personal needs, I cached the dataframe at every "join/elaborate" step, so that in a timeline I do the following:
Join with loaded table
Queue required transformations
Apply the cache transformation
Print the count to keep track of cardinality (mainly for tracking / debugging purposes) and thus apply all transformations + cache
Unpersist the cache of the previous sibiling step, if available (tick/tock paradigm)
This reduced some complex ETL jobs down to 20% of the original time (as previously it was applying the transformations of each previous step at each count).
Lesson learned :)
After reading the comments, I elaborated the solution for my use case.
As mentioned in the question, I join several tables one with each other in a "target dataframe", and at each iteration I do some transformations, like so:
# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)
print(f'Target after table "other": {target.count()}')
The problem of slowliness / OoM was that Spark was forced to do all the transformations from start to finish at each table due to the ending count, making it slower and slower at each table / iteration.
The solution I found is to cache the dataframe at each iteration, like so:
cache: DataFrame = null
# ...
# n-th table work
target = target.join(other, how='left')
target = target.filter(...)
target = target.withColumn('a', 'b')
target = target.select(...)
target = target.cache()
target_count = target.count() # actually do the cache
if cache:
cache.unpersist() # free the memory from the old cache version
cache = target
print(f'Target after table "other": {target_count}')
Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
max_rank = 500
ranking_order = Window.partitionBy(['category', 'id'])
.orderBy(F.col('primary').desc(), F.col('secondary'))
df_ranked = (df
.withColumn('rank', F.row_number().over(ranking_order))
.where(F.col('rank') <= max_rank)
)
df_ranked.explain()
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
For context, right now my query is taking 3.5 hours on a beefy 4 worker x 40 core cluster and shuffle write time surrounding this query (including some projections not listed above) appears to be my high-nail, so I'm trying to cut down the amount of data as soon as possible.
I am working on Spark SQL where I need to find out Diff between two large CSV's.
Diff should give:-
Inserted Rows or new Record // Comparing only Id's
Changed Rows (Not include inserted ones) - Comparing all column values
Deleted rows // Comparing only Id's
Spark 2.4.4 + Java
I am using Databricks to Read/Write CSV
Dataset<Row> insertedDf = newDf_temp.join(oldDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti");
Long insertedCount = insertedDf.count();
logger.info("Inserted File Count == "+insertedCount);
Dataset<Row> deletedDf = oldDf_temp.join(newDf_temp,oldDf_temp.col(key)
.equalTo(newDf_temp.col(key)),"left_anti")
.select(oldDf_temp.col(key));
Long deletedCount = deletedDf.count();
logger.info("deleted File Count == "+deletedCount);
Dataset<Row> changedDf = newDf_temp.exceptAll(oldDf_temp); // This gives rows (New +changed Records)
Dataset<Row> changedDfTemp = changedDf.join(insertedDf, changedDf.col(key)
.equalTo(insertedDf.col(key)),"left_anti"); // This gives only changed record
Long changedCount = changedDfTemp.count();
logger.info("Changed File Count == "+changedCount);
This works well for CSV with columns upto 50 or so.
The Above code fails for one row in CSV with 300+columns, so I am sure this is not file Size problem.
But if I have a CSV having 300+ Columns then it fails with Exception
Max iterations (100) reached for batch Resolution – Spark Error
If I set the below property in Spark, It Works!!!
sparkConf.set("spark.sql.optimizer.maxIterations", "500");
But my question is why do I have to set this?
Is there something wrong which I am doing?
Or this behaviour is expected for CSV's which have large columns.
Can I optimize it in any way to handle Large column CSV's.
The issue you are running into is related to how spark takes the instructions you tell it and transforms that into the actual things it's going to do. It first needs to understand your instructions by running Analyzer, then it tries to improve them by running its optimizer. The setting appears to apply to both.
Specifically your code is bombing out during a step in the Analyzer. The analyzer is responsible for figuring out when you refer to things what things you are actually referring to. For example, mapping function names to implementations or mapping column names across renames, and different transforms. It does this in multiple passes resolving additional things each pass, then checking again to see if it can resolve move.
I think what is happening for your case is each pass probably resolves one column, but 100 passes isn't enough to resolve all of the columns. By increasing it you are giving it enough passes to be able to get entirely through your plan. This is definitely a red flag for a potential performance issue, but if your code is working then you can probably just increase the value and not worry about it.
If it isn't working, then you will probably need to try to do something to reduce the number of columns used in your plan. Maybe combining all the columns into one encoded string column as the key. You might benefit from checkpointing the data before doing the join so you can shorten your plan.
EDIT:
Also, I would refactor your above code so you could do it all with only one join. This should be a lot faster, and might solve your other problem.
Each join leads to a shuffle (data being sent between compute nodes) which adds time to your job. Instead of computing adds, deletes and changes independently, you can just do them all at once. Something like the below code. It's in scala psuedo code because I'm more familiar with that than the Java APIs.
import org.apache.spark.sql.functions._
var oldDf = ..
var newDf = ..
val changeCols = newDf.columns.filter(_ != "id").map(col)
// Make the columns you want to compare into a single struct column for easier comparison
newDf = newDF.select($"id", struct(changeCols:_*) as "compare_new")
oldDf = oldDF.select($"id", struct(changeCols:_*) as "compare_old")
// Outer join on ID
val combined = oldDF.join(newDf, Seq("id"), "outer")
// Figure out status of each based upon presence of old/new
// IF old side is missing, must be an ADD
// IF new side is missing, must be a DELETE
// IF both sides present but different, it's a CHANGE
// ELSE it's NOCHANGE
val status = when($"compare_new".isNull, lit("add")).
when($"compare_old".isNull, lit("delete")).
when($"$compare_new" != $"compare_old", lit("change")).
otherwise(lit("nochange"))
val labeled = combined.select($"id", status)
At this point, we have every ID labeled ADD/DELETE/CHANGE/NOCHANGE so we can just a groupBy/count. This agg can be done almost entirely map side so it will be a lot faster than a join.
labeled.groupBy("status").count.show
I recently began to use Spark to process huge amount of data (~1TB). And have been able to get the job done too. However I am still trying to understand its working. Consider the following scenario:
Set reference time (say tref)
Do any one of the following two tasks:
a. Read large amount of data (~1TB) from tens of thousands of files using SciSpark into RDDs (OR)
b. Read data as above and do additional preprossing work and store the results in a DataFrame
Print the size of the RDD or DataFrame as applicable and time difference wrt to tref (ie, t0a/t0b)
Do some computation
Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = time required for preprocessing? Where can I find an reliable reference for the same?
Edit: Explanation added for the origin of question ...
My suspicion stems from Spark's lazy computation approach and its capability to perform asynchronous jobs. Can/does it initiate subsequent (preprocessing) tasks that can be computed while thousands of input files are being read? The origin of the suspicion is in the unbelievable performance (with results verified okay) I see that look too fantastic to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
val start = System.currentTimeMillis
val res = op
val end = System.currentTimeMillis
(end - start) / 1000f
}
def XYZ = {
val r00 = sc.parallelize(0 to 999999)
val r01 = r00.map(x => (x,(x,x,x,x,x,x,x)))
r01.join(r01).count()
}
val time1 = timeIt(XYZ)
// or like this on next line
//val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with Actions that cause actual execution. This has limitations thus. Lazy evaluation and such.
On the other hand, Spark Web UI records every Action, and records Stage duration for the Action.
In general: performance measuring in shared environments is difficult. Dynamic allocation in Spark in a noisy cluster means that you hold on to acquired resources during the Stage, but upon successive runs of the same or next Stage you may get less resources. But this is at least indicative and you can run in a less busy period.
I've a very basic question about spark. I usually run spark jobs using 50 cores. While viewing the job progress, most of the times it shows 50 processes running in parallel (as it is supposed to do), but sometimes it shows only 2 or 4 spark processes running in parallel. Like this:
[Stage 8:================================> (297 + 2) / 500]
The RDD's being processed are repartitioned on more than 100 partitions. So that shouldn't be an issue.
I have an observations though. I've seen the pattern that most of the time it happens, the data locality in SparkUI shows NODE_LOCAL, while other times when all 50 processes are running, some of the processes show RACK_LOCAL.
This makes me doubt that, maybe this happens because the data is cached before processing in the same node to avoid network overhead, and this slows down the further processing.
If this is the case, what's the way to avoid it. And if this isn't the case, what's going on here?
After a week or more of struggling with the issue, I think I've found what was causing the problem.
If you are struggling with the same issue, the good point to start would be to check if the Spark instance is configured fine. There is a great cloudera blog post about it.
However, if the problem isn't with configuration (as was the case with me), then the problem is somewhere within your code. The issue is that sometimes due to different reasons (skewed joins, uneven partitions in data sources etc) the RDD you are working on gets a lot of data on 2-3 partitions and the rest of the partitions have very few data.
In order to reduce the data shuffle across the network, Spark tries that each executor processes the data residing locally on that node. So, 2-3 executors are working for a long time, and the rest of the executors are just done with the data in few milliseconds. That's why I was experiencing the issue I described in the question above.
The way to debug this problem is to first of all check the partition sizes of your RDD. If one or few partitions are very big in comparison to others, then the next step would be to find the records in the large partitions, so that you could know, especially in the case of skewed joins, that what key is getting skewed. I've wrote a small function to debug this:
from itertools import islice
def check_skewness(df):
sampled_rdd = df.sample(False,0.01).rdd.cache() # Taking just 1% sample for fast processing
l = sampled_rdd.mapPartitionsWithIndex(lambda x,it: [(x,sum(1 for _ in it))]).collect()
max_part = max(l,key=lambda item:item[1])
min_part = min(l,key=lambda item:item[1])
if max_part[1]/min_part[1] > 5: #if difference is greater than 5 times
print 'Partitions Skewed: Largest Partition',max_part,'Smallest Partition',min_part,'\nSample Content of the largest Partition: \n'
print (sampled_rdd.mapPartitionsWithIndex(lambda i, it: islice(it, 0, 5) if i == max_part[0] else []).take(5))
else:
print 'No Skewness: Largest Partition',max_part,'Smallest Partition',min_part
It gives me the smallest and largest partition size, and if the difference between these two is more than 5 times, it prints 5 elements of the largest partition, to should give you a rough idea on what's going on.
Once you have figured out that the problem is skewed partition, you can find a way to get rid of that skewed key, or you can re-partition your dataframe, which will force it to get equally distributed, and you'll see now all the executors will be working for equal time and you'll see far less dreaded OOM errors and processing will be significantly fast too.
These are just my two cents as a Spark novice, I hope Spark experts can add some more to this issue, as I think a lot of newbies in Spark world face similar kind of problems far too often.