Why does Spark job fail with "too many open files"? - apache-spark

I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files, and what steps can I take to make it succeed?

This has been answered on the Spark user list:
The best way is definitely just to increase the ulimit if possible,
this is sort of an assumption we make in Spark that clusters will be
able to move it around.
You might be able to hack around this by decreasing the number of
reducers [or cores used by each node] but this could have some performance implications for your
job.
In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.
-Patrick Wendell
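The C*X rule above can be checked with quick arithmetic before touching any limits. A tiny sketch (shuffle_file_handles is a hypothetical helper for illustration, not a Spark API):

```python
def shuffle_file_handles(cores_per_node, num_reducers):
    """Estimate open file handles per node during a shuffle write.

    Follows the C*X rule quoted above: each of the C concurrently
    running tasks writes one output file per reducer partition X.
    """
    return cores_per_node * num_reducers

# A 16-core node shuffling to 200 reducers holds ~3200 handles open,
# already three times the common default ulimit of 1024.
print(shuffle_file_handles(16, 200))  # -> 3200
```

Dropping either the reducer count or the cores per node shrinks the handle count linearly, which is exactly the workaround the quoted answer mentions.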

The default ulimit is 1024, which is far too low for large-scale applications. HBase recommends raising it to 64K; modern Linux systems don't seem to have trouble with that many open files.
use
ulimit -a
to see all of your current limits, and
ulimit -n
to see (or, given a value, temporarily change) the maximum number of open files. To make the change permanent, you need to update the system configuration files and per-user limits. On CentOS and Red Hat systems, those can be found in
/etc/sysctl.conf
/etc/security/limits.conf
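A minimal session showing the commands above (65536 follows the HBase guidance mentioned earlier; the sparkuser entries in limits.conf are illustrative, adjust the user and value for your cluster):

```shell
# Show the current open-files limit for this shell
ulimit -n

# Raise it for this session only; fails if the hard limit is lower,
# hence the fallback message.
ulimit -n 65536 2>/dev/null || echo "hard limit too low - edit limits.conf first"

# Example /etc/security/limits.conf entries to make it permanent
# (log out and back in for them to take effect):
#   sparkuser  soft  nofile  65536
#   sparkuser  hard  nofile  65536
```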

Another fix for this error is to reduce the number of partitions.
Check whether you have a lot of partitions with:
someBigSDF.rdd.getNumPartitions()
Out[]: 200
# If you need to persist the repartition, assign the result back:
someBigSDF = someBigSDF.repartition(20)
# If you only need it for one transformation/action,
# you can repartition inline like this:
someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()
(count here is pyspark.sql.functions.count.)

Related

Monitor memory usage of each node in a slurm job

My slurm job uses several nodes, and I want to know the maximum memory usage of each node for a running job. What can I do?
Right now, I can ssh into each node and do free -h -s 30 > memory_usage, but I think there must be a better way to do this.
Slurm accounting will give you the maximum memory usage over time across all tasks directly. If that information is not sufficient, you can set up profiling following this documentation, and Slurm will record the full memory usage of each process as a time series for the duration of the job. You can then aggregate per node, find the maximum, and so on.
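As a sketch, the accounting route looks like this (these are standard sacct/sstat format fields; <jobid> is a placeholder for your job ID, and this can only run on a Slurm cluster):

```shell
# Per-step peak resident memory, plus the node and task where it occurred
sacct -j <jobid> --format=JobID,JobName,MaxRSS,MaxRSSNode,MaxRSSTask

# While the job is still running, sstat reports live statistics instead
sstat -j <jobid> --format=JobID,MaxRSS,MaxRSSNode
```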

How much load on a node is too much?

We have a 10-node Cassandra cluster and would like to monitor CPU load.
OpsCenter shows the load on the nodes ranging from 0 to the 20s, sometimes climbing into the 50s and 60s, but normally staying below 25.
We just want to understand how much should be considered alarming.
It's OK as long as you don't have performance issues or timeouts.
Going higher than 80% is not recommended, as it leaves you no time to react. Also consider investigating those spikes into the 60s.
It depends on your hardware and installation.
I recommend stress-testing your system, since we don't know how many writes and reads you are doing. Reserve one core for Linux and keep the load below 80%.

Spark cores & tasks concurrency

I have a very basic question about Spark. I usually run jobs with 50 cores. While viewing the job progress, it usually shows 50 tasks running in parallel (as it is supposed to), but sometimes only 2 or 4, like this:
[Stage 8:================================> (297 + 2) / 500]
The RDDs being processed are repartitioned into more than 100 partitions, so that shouldn't be the issue.
I have noticed a pattern, though: most of the time this happens, the data locality in the Spark UI shows NODE_LOCAL, whereas at other times, when all 50 tasks are running, some of them show RACK_LOCAL.
This makes me suspect that the data is cached on the same node before processing to avoid network overhead, and that this slows down further processing.
If that's the case, how do I avoid it? And if it isn't, what's going on here?
After a week or more of struggling with the issue, I think I've found what was causing the problem.
If you are struggling with the same issue, a good place to start is to check whether your Spark instance is configured well. There is a great Cloudera blog post about it.
However, if the problem isn't with the configuration (as was the case for me), then the problem is somewhere in your code. The issue is that sometimes, for various reasons (skewed joins, uneven partitions in data sources, etc.), the RDD you are working on ends up with a lot of data in 2-3 partitions while the rest of the partitions have very little.
To reduce data shuffling across the network, Spark tries to have each executor process the data residing locally on that node. So 2-3 executors work for a long time while the rest finish their data in a few milliseconds. That's why I was seeing only a handful of tasks running in parallel, as described in the question above.
The way to debug this is to first check the partition sizes of your RDD. If one or a few partitions are much bigger than the others, the next step is to inspect the records in the large partitions, so you can tell, especially in the case of a skewed join, which key is skewed. I wrote a small function to debug this:
from itertools import islice

def check_skewness(df):
    # Take just a 1% sample for fast processing
    sampled_rdd = df.sample(False, 0.01).rdd.cache()
    # Count the records in each partition
    counts = sampled_rdd.mapPartitionsWithIndex(
        lambda i, it: [(i, sum(1 for _ in it))]).collect()
    max_part = max(counts, key=lambda item: item[1])
    min_part = min(counts, key=lambda item: item[1])
    # Guard against an empty smallest partition, then flag a >5x spread
    if max_part[1] / max(min_part[1], 1) > 5:
        print('Partitions Skewed: Largest Partition', max_part,
              'Smallest Partition', min_part,
              '\nSample Content of the largest Partition:\n')
        print(sampled_rdd.mapPartitionsWithIndex(
            lambda i, it: islice(it, 0, 5) if i == max_part[0] else []).take(5))
    else:
        print('No Skewness: Largest Partition', max_part,
              'Smallest Partition', min_part)
It prints the smallest and largest partition sizes, and if the difference between the two is more than 5x, it also prints 5 elements of the largest partition, which should give you a rough idea of what's going on.
Once you have figured out that the problem is a skewed partition, you can find a way to get rid of the skewed key, or you can repartition your dataframe, which will force the data to be distributed equally. You'll then see all the executors working for roughly equal time, far fewer of the dreaded OOM errors, and significantly faster processing.
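One common way to get rid of a skewed key is salting: append a random suffix to the hot key before the shuffle so its records spread over several partitions, then strip the suffix after aggregating. A minimal plain-Python sketch of the idea (salt_key, unsalt_key, and SALT_BUCKETS are illustrative names, not Spark API):

```python
import random

SALT_BUCKETS = 8  # how many partitions the hot key gets spread over

def salt_key(key):
    """Append a random bucket id so one hot key becomes SALT_BUCKETS keys."""
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

def unsalt_key(salted):
    """Strip the bucket id to recover the original key after aggregation."""
    return salted.rsplit("#", 1)[0]

# All records for the hot key "user42" would normally hash to one
# partition; after salting they spread over up to SALT_BUCKETS of them.
records = ["user42"] * 1000
salted = [salt_key(k) for k in records]
print(len(set(salted)))        # up to 8 distinct salted keys
print(unsalt_key(salted[0]))   # -> user42
```

In Spark you would apply salt_key in a map before the join or groupBy (replicating the other side of a join across all bucket ids) and unsalt_key in a second aggregation step.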
These are just my two cents as a Spark novice; I hope Spark experts can add more on this issue, as I think a lot of newcomers to the Spark world face this kind of problem far too often.

When I run my program on a Spark cluster, how can I choose the faster machines to run it?

I have a Spark cluster of 8 machines, but two of them are old computers that run slowly, so I want the job to run on only the 6 newer ones. I know I can use --num-executors to limit the job to 6 executors, but that doesn't let me choose which 6 machines run my program. How can I do that?
If you're using YARN as your resource manager, you could specify an executor memory size larger than the amount available on the older computers (--executor-memory 10g). This would let you achieve what you're looking for.
Otherwise, you'll need a feature called YARN node labels: YARN-796
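The memory-based workaround looks like this on the command line (values and the application name are illustrative; pick an --executor-memory above what the old nodes can offer but within what the new nodes have):

```shell
spark-submit \
  --master yarn \
  --num-executors 6 \
  --executor-memory 10g \
  your_app.py
```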

ext performance handling millions of files

I have a filesystem with 40 million files in a 10-level tree structure (around 500 GB in total). The problem I have is the backup: an incremental backup (bacula) takes 9 hours (for around 10 GB) with very low performance, and a full backup of the same filesystem takes 4+ days at a 200k/s read speed. Some directories have 50k files, others 10k. The disks are hardware RAID, and I have the default Ubuntu LVM on top. I think the bottleneck is the number of files (the huge number of inodes), and I'm trying to improve the performance.
- Do you think that partitioning the FS into several smaller FS would help? I can have 1000 smaller FS...
- Do you think that moving from HD to SSD would help?
- Any advice?
Thanks!
Moving to SSD will improve the speed of the backup, but the SSDs will wear out very soon under this workload, and then you will really need that backup...
Can't you organise things so that you know where to look for changed/new files? That way you only need to incrementally back up those folders.
Is it necessary for all your files to be online? Could you keep tar archives of old trees more than 3 levels deep?
I suspect a find -mtime -1 will take hours as well.
I hope the backup is not using the same partition as the tree structure (everything under /tmp is a very bad plan); any temporary files the backup makes should be on a different partition.
Where are the new files coming from? If all files are changed by a process you control, that process can write a logfile with a list of changed files.
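The changed-file-list idea can also be approximated without touching the producing process, by comparing mtimes against a timestamp file. A sketch run in a scratch directory (in a real setup TREE would be the actual data tree and the stamp/list would live on another partition):

```shell
# Demo in a scratch directory; TREE stands in for e.g. /data
TREE=$(mktemp -d)
STAMP=$(mktemp)   # marker touched after every backup run
LIST=$(mktemp)    # list of files changed since the last run

sleep 1                       # ensure mtimes differ on coarse filesystems
echo hi > "$TREE/new-file"    # a file created after the last "backup"

# Everything strictly newer than the stamp goes into the list; feeding
# this list to the backup tool avoids re-walking all 40M inodes.
find "$TREE" -type f -newer "$STAMP" > "$LIST"
touch "$STAMP"                # next run only picks up newer changes

cat "$LIST"                   # prints the path of new-file
```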
