I have a very basic question about Spark. I usually run Spark jobs using 50 cores. While viewing the job progress, most of the time it shows 50 processes running in parallel (as it is supposed to), but sometimes it shows only 2 or 4 processes running in parallel, like this:
[Stage 8:================================> (297 + 2) / 500]
The RDDs being processed are repartitioned into more than 100 partitions, so that shouldn't be the issue.
I do have one observation though: I've seen a pattern where, most of the time this happens, the data locality in the Spark UI shows NODE_LOCAL, while at other times, when all 50 processes are running, some of the processes show RACK_LOCAL.
This makes me suspect that it happens because the data is cached on the same node before processing, to avoid network overhead, and that this slows down the further processing.
If this is the case, what's the way to avoid it? And if it isn't, what's going on here?
After a week or more of struggling with the issue, I think I've found what was causing the problem.
If you are struggling with the same issue, a good place to start is to check whether the Spark instance is configured properly. There is a great Cloudera blog post about it.
However, if the problem isn't with the configuration (as was the case for me), then it is somewhere in your code. The issue is that sometimes, for different reasons (skewed joins, uneven partitions in the data sources, etc.), the RDD you are working on ends up with a lot of data in 2-3 partitions while the rest of the partitions have very little data.
In order to reduce data shuffling across the network, Spark tries to have each executor process the data residing locally on its node. So 2-3 executors keep working for a long time, while the rest are done with their data in a few milliseconds. That's why I was experiencing the issue described in the question above.
The way to debug this problem is, first of all, to check the partition sizes of your RDD. If one or a few partitions are very big compared to the others, the next step is to look at the records in the large partitions, so that you can find out, especially in the case of skewed joins, which key is getting skewed. I've written a small function to debug this:
from itertools import islice

def check_skewness(df):
    sampled_rdd = df.sample(False, 0.01).rdd.cache()  # take just a 1% sample for fast processing
    # count the records in each partition: [(partition_index, record_count), ...]
    l = sampled_rdd.mapPartitionsWithIndex(lambda x, it: [(x, sum(1 for _ in it))]).collect()
    max_part = max(l, key=lambda item: item[1])
    min_part = min(l, key=lambda item: item[1])
    if max_part[1] / max(min_part[1], 1) > 5:  # if the difference is greater than 5x (guarding against empty partitions)
        print('Partitions Skewed: Largest Partition', max_part, 'Smallest Partition', min_part,
              '\nSample Content of the largest Partition:\n')
        print(sampled_rdd.mapPartitionsWithIndex(
            lambda i, it: islice(it, 0, 5) if i == max_part[0] else []).take(5))
    else:
        print('No Skewness: Largest Partition', max_part, 'Smallest Partition', min_part)
It gives me the smallest and largest partition sizes, and if the difference between the two is more than 5x, it prints 5 elements of the largest partition, which should give you a rough idea of what's going on.
Once you have figured out that the problem is a skewed partition, you can either find a way to get rid of the skewed key, or repartition your dataframe, which forces the data to be distributed equally. You'll then see that all the executors work for roughly equal time, you'll hit far fewer of the dreaded OOM errors, and processing will be significantly faster too.
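As an illustration only (a minimal sketch assuming a PySpark dataframe df with a hypothetical skewed join key named key; the partition counts are arbitrary):

import pyspark.sql.functions as F

# Option 1: round-robin repartition, which forces roughly equal-sized partitions
df_even = df.repartition(200)

# Option 2: add a random "salt" so rows sharing a hot key spread over many partitions
# (for a salted join, the other side of the join must be expanded with the same salt values)
df_salted = df.withColumn("salt", (F.rand() * 10).cast("int"))
df_even = df_salted.repartition(200, "key", "salt")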
These are just my two cents as a Spark novice. I hope Spark experts can add more to this, as I think a lot of newbies in the Spark world face this kind of problem far too often.
Related
I am querying a large (2 trillion records) Parquet file using PySpark, partitioned by two columns, month and day.
If I run a simple query as:
SELECT month, day, count(*) FROM mytable
WHERE month >= 201801 AND month < 202301 -- two years data
GROUP BY month, day
ORDER BY month, day
the query executes in 5 minutes or less. Super good performance!
If I remove the WHERE condition, it brings in the whole data lake (4 years of data), and the query takes 1.5 hours to execute.
This behaviour is far from normal. I guess it might be related to the large amount of data being queried on the worker nodes, leading to GC or shuffle pressure, but that's just a guess.
How can I debug the situation above?
My understanding is that Spark should be clever enough to compute per partition (since this is a distributed environment) and take roughly double the time for double the years (around 5 min * 2), not such a big difference.
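One way to check whether Spark is actually pruning partitions is to inspect the physical plan and look for the PartitionFilters entry on the Parquet scan. A sketch, assuming the table is registered as mytable:

spark.sql("""
    SELECT month, day, count(*) FROM mytable
    WHERE month >= 201801 AND month < 202301
    GROUP BY month, day
""").explain(True)
# in the output, the FileScan node should list the month predicate under PartitionFilters;
# if it doesn't, Spark is reading every partition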
Edit 1: Adding information from the Spark UI
I will put the screenshots of the two runs: 4 years of data (1.7 hours) and 3 years of data (7.5 min). The 4-year run always comes first.
Screenshots: general overview, job page, Stage 1 (the heavy stage), Stage 2, SQL tab.
Edit 2 - New findings - Scheduler delay
In the heavy task, I have found a scheduler delay.
If this is the cause, what is the approach to fix it?
Thanks a lot!
I have found what the problem was.
Increasing the memory and cores of the driver (the cores weren't really important) solved the problem.
How did I reach this conclusion?
First, I knew my data was not very skewed (as pointed out by @samkart and @Leonid Vasilev), but I checked again anyway.
Second, all metrics were very similar to each other, with no big differences between them, so it had to be something else.
Third and lastly, I opened the stage event timeline and found a very interesting issue; see Edit 2.
After investigating further why my scheduler was so delayed, I still didn't find the real reason, but this sentence gave me the hint: the problem was in the driver.
Scheduler delay (blue) is the time spent waiting. There is something
that the executors are waiting for - often this is waiting for the
driver that controls and coordinates the jobs.
source: linked blog post
In that post, the author also mentions something very important that I wish to add:
See all that red and blue? This is a sure sign that something is up.
What we really want to see is lots of green - the proportion of time
spent doing work - I mean real work - the part where Spark does the
number crunching.
TL;DR:
The biggest problem came from scheduler delay, which is closely related to the driver. Increasing the driver's memory (and vCPUs) solved the issue.
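For reference, a minimal sketch of how driver resources can be raised (the values are illustrative, not the ones from my cluster; note that in client mode spark.driver.memory must be set before the driver JVM starts, e.g. via spark-submit):

from pyspark.sql import SparkSession

# illustrative values; tune them to your workload
spark = (SparkSession.builder
         .config("spark.driver.memory", "8g")  # more heap for the driver
         .config("spark.driver.cores", "4")    # more vCPUs for scheduling/coordination
         .getOrCreate())

# equivalently, on the command line:
#   spark-submit --driver-memory 8g --driver-cores 4 app.py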
I am working with a large CSV (~60GB; ~250M rows) with Dask in Jupyter.
The first thing I want to do with the DF after loading it is to concatenate two string columns. I can do so successfully, but I noticed that cell execution time does not seem to decrease with higher worker counts (I tried 5, 10, and 20 on a machine with 64 logical cores). If anything, every five or so extra workers seem to add an extra minute to the execution time.
Meanwhile, the progress bar on Dask's dashboard suggests that the task scales well with worker count. At 5 workers the task finishes (according to the dashboard) in about 10-15 min. At 20 workers the task stream visualisation suggests completion in roughly 3-5 min. But cell execution time remains around 25 min; i.e., in the 5-worker case the cell appears to hang for an extra 10-15 min after the stream has finished, and in the 20-worker case for 20-22 more min, with no evidence of worker activity as far as I can see.
This is the code that I'm running:
import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=20)
client = Client(cluster)

df = dd.read_csv('df_name.csv', dtype={'col1': 'object', 'col2': 'object'})

with ProgressBar():
    df["col_merged"] = df["col3"] + df["col4"]
    df = df.compute()
Python version: 3.9.1
Dask version: 2021.06.2
What am I missing? Could this simply be overhead from having Dask coordinate several workers?
To add to @SultanOrazbayev's answer: the specific thing that takes time after the tasks have all finished is copying data from the workers into your client process to assemble the single in-memory dataframe you asked for. This is not a "task", as all the computing has already happened, and it does not parallelise well, because the client is a single thread pulling data from the workers.
As with the comment above: if you want to achieve parallelism, you need to load the data in the workers (which dd.read_csv does) and act on it in the workers to get your result, and you should only .compute() relatively small things. Conversely, if your data fits comfortably into memory, there is probably nothing to be gained by involving Dask at all; just use pandas.
Running
df = df.compute()
will attempt to load all 250M rows into memory. Even if this is feasible on your machine, you will still spend a lot of time, because each worker has to send over its chunk, so there is a lot of data transfer...
The core idea is to bring into memory only the results of reduced calculations, and to distribute the workload among the workers until then.
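A minimal sketch of that idea (the column names follow the question; the aggregation and output path are hypothetical):

import dask.dataframe as dd

df = dd.read_csv('df_name.csv', dtype={'col1': 'object', 'col2': 'object'})
df["col_merged"] = df["col3"] + df["col4"]  # stays lazy; runs in the workers

# bring back only a small, reduced result...
top10 = df["col_merged"].value_counts().nlargest(10).compute()

# ...or write the full result from the workers without ever collecting it locally
df.to_parquet("merged_output/")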
There are 4 major actions (JDBC writes) in the application, plus a few counts, which in total take around 4-5 minutes to complete.
But the total uptime of the application is around 12-13 minutes.
I see certain jobs named "run at ThreadPoolExecutor.java:1149". The long, invisible delays occur just before these jobs show up in the Spark UI.
I want to know what the possible causes of these delays are.
My application reads 8-10 CSV files and 5-6 views from tables. There are around 59 joins, a few groupBy with agg(sum), and 3 unions.
I am not able to reproduce the issue in the DEV/UAT environments since there isn't that much data there. It only shows up in production, where my manager runs the application for me.
If anyone has come across such delays in their jobs, please share your experience of what the potential cause could be. Currently I am working around the unions, i.e. caching the associated dataframes and calling count, so as to get the benefit of the cache in the subsequent union (yet to test whether the unions are the reason for the delays).
Similarly, I tried to break the long chain of transformations with cache and count calls in between, to cut the long lineage; a sketch of this workaround is below.
The runtime went down from the initial 18 minutes to 12 minutes, but the issue with the invisible delays still persists.
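A minimal sketch of the cache-and-count workaround (df_before_union, other_df, and the union itself are hypothetical stand-ins for my actual dataframes):

# materialize an intermediate dataframe so later stages reuse it
# instead of recomputing the whole lineage
df_mid = df_before_union.cache()
df_mid.count()  # an action forces the cache to actually fill

result = df_mid.union(other_df)  # downstream work now starts from the cached data
# alternatively, df_mid.checkpoint() truncates the lineage completely
# (requires spark.sparkContext.setCheckpointDir(...) to be set first)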
Thanks in advance
I assume you don't have CPU- or IO-heavy code running between your Spark jobs.
If so, then it really is Spark, and 99% of the time it is query planning delay.
You can use
spark.listenerManager.register(QueryExecutionListener)
to check different metrics of query planning performance.
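listenerManager is the JVM (Scala/Java) API. From PySpark, a cruder check, sketched below, is to time how long Spark takes just to produce the plan, since explain triggers analysis and optimization without running the job (df is assumed to be one of your dataframes):

import time

t0 = time.time()
df.explain(True)  # prints the logical and physical plans; forces planning
print(f"query planning took {time.time() - t0:.1f}s")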
I have obtained task streams using Dask distributed for different numbers of workers. I can observe that as the number of workers increases (from 16 to 32 to 64), the white space in the task stream also increases, which reduces the efficiency of the parallel computation. Even when I increase the workload per worker (that is, more computation per worker), I see the same trend. Can anyone suggest how to reduce the white space?
PS: I need to scale the computation to 1000s of workers, so reducing the number of workers is not an option for me.
(Task stream screenshots for 16, 32, and 64 workers.)
As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.
This can be caused by many reasons. I'll list a few below:
Very short tasks (sub-millisecond)
Algorithms that are not very parallelizable
Objects in the task graph that are expensive to serialize
...
Looking at your images I don't think that any of these apply to you.
Instead, I see that there are gaps of inactivity followed by bursts of activity. My guess is that this is caused by some code that you are running locally, and that your code looks something like the following:
for i in ...:
    results = dask.compute(...)  # do some dask work
    next_inputs = ...            # do some local work
So you're being blocked by local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph) or it might be your code's fault (maybe building the inputs for the next computation takes some time).
I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html
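If the local work turns out to be the bottleneck, one option is to overlap it with the cluster work by using non-blocking futures. A sketch, where make_graph, prepare_next_inputs, initial_inputs, and n_iterations are hypothetical stand-ins for your own code:

from dask.distributed import Client

client = Client()  # connect to the cluster

next_inputs = initial_inputs
for i in range(n_iterations):
    future = client.compute(make_graph(next_inputs))  # non-blocking; returns a Future
    next_inputs = prepare_next_inputs(i)              # local work overlaps with cluster work
    results = future.result()                         # block only when the result is needed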
I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~500 (500 features, in ML parlance). The total number of rows in the dataset is ~160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, as in Python pandas), where each empty feature value is set to the previously available value for that column, per ID and date.
I am trying to achieve this with the following spark 2.2.1 code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}

val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)

val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)

val columns = Array(...) // first 30 columns initially, just to see it working

val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
  originalDF.withColumn(columnToFill,
    coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2 hours without completing.
Am I doing something wrong at the code level? Given the size of the data and the instances, I would assume it should finish in a reasonable amount of time. And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those two lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently withColumn can be very slow for a large number of columns, and I was also getting StackOverflowErrors because of it.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes with query plan execution that makes repeated withColumn calls so slow. (A likely explanation: each withColumn call produces a new Dataset whose plan Catalyst analyzes again, so the planning cost grows with every added column.)
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)

val window = Window.partitionBy("ID").orderBy("DATE")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// a single select with one coalesce/window expression per column, instead of
// one withColumn call per column, so the plan is only analyzed once
rawDataset = rawDataset.select(rawDataset.columns.map(column =>
  coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)

rawDataset.write.option("header", "true").csv(outputLocation)