Stop Spark executor logs from getting gzipped - apache-spark

I have a Spark job with some very long running tasks. When the tasks start I can go to the executors tab, see all my executors and their tasks, and click the stderr link to view the logs for those tasks, which helps a lot for monitoring. However, after a few hours the stderr link stops working; clicking it gives java.lang.Exception: Cannot find this log on the local disk. I dug into it a bit, and the issue seems to be that something has decided to gzip the logs. That is, I can still find the log manually by ssh-ing to the worker node and looking in the right directory (e.g. /mnt/var/log/hadoop-yarn/containers/application_1486407288470_0005/container_1486407288470_0005_01_000002/stderr.gz). It's annoying that this happens, since I now can't monitor my job from the UI, and the files are tiny enough (about 40k uncompressed) that the compression doesn't seem helpful. There are a lot of things that could be causing it: YARN, a logroller cron job, the log4j config in my YARN/Spark distro, AWS (since EMR zips logs and saves them to S3), etc., so I'm hoping someone can point me in the right direction so I don't have to search a ton of docs.
I'm using AWS EMR at emr-5.3.0 without any custom bootstrap steps.

Just had a similar issue. I haven't looked into how to stop the gzipping, but you can still access the logs through the Hadoop web interface.
In the left menu, go to Tools > Local logs
Then browse to find the log you are interested in.
In my case, the gzipped log the GUI linked to was at /node/containerlogs/container_1498033803655_0037_01_000001/hadoop/stderr.gz/?start=-4096
and via the Local logs menu it was at
/logs/containers/application_1498033803655_0037/container_1498033803655_0037_01_000001/stderr.gz
Hope it helps.

Related

Spark SHC Core - Log region/regionserver

I'm using the SHC Spark connector by Hortonworks to read an HBase table:
https://github.com/hortonworks-spark/shc
I have some tasks that take a very long time to complete, and I suspect it's because of region size skew, but I would like to confirm that by logging which region/region server each task is reading.
I tried turning on debug logs by doing the following in the driver:
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
Logger.getLogger("org").setLevel(Level.DEBUG);
Logger.getLogger("akka").setLevel(Level.DEBUG);
But it didn't seem to have any effect.
Is it possible to log the above somehow?
it didn't seem to have any effect.
Yes; unfortunately, SHC itself does not log the region/region server information anywhere during execution, which is why enabling DEBUG logging does not help at all.
Is it possible to log the above somehow?
Yes, but only if you know where and how to customize SHC's source code. You would need to insert your own logging call, then rebuild, test, package, and ship it with your application.
Where exactly depends on your goal, e.g. you might want to call logDebug() or logInfo() with the region name during a table-scan task; the relevant source code is HBaseTableScan.
The build, test, and ship details are in SHC's repo documentation.
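If all you need is to confirm the region-size skew, a simpler alternative to patching SHC is to query the region layout from the driver with the plain HBase client API before the job runs. A minimal sketch, assuming the HBase client library is on the classpath and using a placeholder table name my_table:
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("my_table"))) {
            // One entry per region: which server hosts it and what key range it covers.
            List<HRegionLocation> locations = locator.getAllRegionLocations();
            for (HRegionLocation loc : locations) {
                System.out.println(loc.getRegionInfo().getRegionNameAsString()
                        + " on " + loc.getHostname()
                        + " keys [" + Bytes.toStringBinary(loc.getRegionInfo().getStartKey())
                        + ", " + Bytes.toStringBinary(loc.getRegionInfo().getEndKey()) + ")");
            }
        }
    }
}
If SHC creates roughly one Spark partition per region, comparing this listing against the slow task indices in the Spark UI should make any skew visible without touching the connector's code.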

How can I move catalina.out to Azure Blob Storage with log4j?

How can this be achieved? I have a catalina.out log on a prod server which is growing fast: 6.7 GB in a couple of days. My initial idea was to create a cron job, run two or three days a week, that copies the catalina log to Azure Blob Storage and then wipes it with a simple echo "" > file. But moving around 2 GB to Azure every time that cron job executes doesn't seem like the best idea either; the file is way too big.
Is there a way to keep the logs on another server or in Azure Storage? Where should I configure that?
I also read something about using log4j with Tomcat. Is it possible for log4j to move catalina.out to another server, and how can I achieve this? I know the development team should also check why this file is growing and logging so fast, but in the meantime I need a solution to implement.
thanks!!
I read something about implementing log4j with tomcat, is this possible also?
I think what you are describing is log rotation; if you want to go that route, here is a blog about how to configure it.
My initial idea was to create a cron job, run two or three days a week, that copies the catalina log to Azure Blob Storage
Yes, you could manage the logs that way, but I have a couple of notes. If you upload the log file to Azure Blob Storage, you may get an error for a very large file; you need to split a large file into multiple blocks. In this article, under the title Upload a file in blocks programmatically, there is a detailed description.
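For the copy step itself, here is a minimal sketch of what such a script's upload could look like in Java, assuming the legacy azure-storage SDK (com.microsoft.azure:azure-storage) and placeholder names for the container and log path; the connection string is read from an environment variable:
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;
import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;
import com.microsoft.azure.storage.blob.CloudBlockBlob;

public class CatalinaUploader {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: substitute your own storage account, container, and log path.
        String connectionString = System.getenv("AZURE_STORAGE_CONNECTION_STRING");
        String logPath = "/opt/tomcat/logs/catalina.out";

        CloudStorageAccount account = CloudStorageAccount.parse(connectionString);
        CloudBlobClient client = account.createCloudBlobClient();
        CloudBlobContainer container = client.getContainerReference("tomcat-logs");
        container.createIfNotExists();

        // One blob per run, named by timestamp, so earlier uploads are never overwritten.
        String blobName = "catalina-" + new SimpleDateFormat("yyyy-MM-dd-HHmm").format(new Date()) + ".out";
        CloudBlockBlob blob = container.getBlockBlobReference(blobName);
        blob.uploadFromFile(logPath);

        System.out.println("Uploaded " + new File(logPath).length() + " bytes as " + blobName);
    }
}
The cron job would run this and, only after a successful upload, truncate the file with echo "" > catalina.out as you described.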
From your description you are not using Azure Web Apps; if you did move to Azure Web Apps, you could also use Azure Functions or WebJobs to run the cron job.
If you still have other questions, please let me know.

Jenkins build logs are too large

I have some jobs in Jenkins that create logs that are 300 MB each.
The build's log gets created on my Solaris M6 machine.
Fact: I cannot modify the job because of a business process; it must stay as it is.
My question : How to maintain such huge logs?
Is there any way for Jenkins to zip the logs by itself and then unzip them when a user tries to read the Console Output (the logs themselves) via Jenkins?
Because if I zip the log manually on Solaris, it will no longer be readable via Jenkins.
I found out that Hudson (Jenkins) has support for the .gz format, which means you can gzip all the logs and Jenkins will still be able to read them, which is SUPER AWESOME! That way I saved tons of GBs of storage; it shrank my 600 MB logs to just 2 MB.

Weird "Stale file handle, errno=116" on remote cluster after dozens of hours running

I'm now running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to see the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange because I never modified the home directory, and /home/shangxin/ is definitely my permanent directory, where the code lives.
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
I initially thought this error was due to the job using up all the RAM, i.e. a memory overflow issue. However, when I logged into the compute node while the job was running and checked memory usage with the free -m and htop commands, I saw that both RAM and swap usage never exceeded 10%, so memory is not the problem.
Because I used tee to record the job's output to a log file, that file can grow to tens of thousands of lines and over 1 MB. To test whether this standard output was overwhelming the cluster, I ran the same job again without the standard output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the reason either.
I also tried running the job in parallel on multiple cores; it still failed with the same error after dozens of hours.
I'm confident the code itself is fine because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but has not found the specific cause yet.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need to make a work-around and cannot wait for sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS mounts directly under the root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node, and write all of its output files and logs to that directory. At the end of your job, you would need to make it copy everything to your home directory on the login node on the cluster. This could be difficult with ssh if your keys are in your home directory, and could require you to copy keys to the temporary directory which is generally a bad idea unless you restrict access to your keys with file permissions.
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are more flakey than others. Depending on local settings, you may be able to request certain nodes directly, or potentially request resources that are only available on stable nodes. If you can track which nodes are unstable, and you find your job assigned to an unstable node, you could resubmit your job, then cancel the job that is on an unstable node.
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.

How to raise or lower the log level in puppet master?

I am using puppet 3.2.3, Passenger and Apache on CentOS 6. I have 680 compute nodes in a cluster, along with 8 gateways that users log in to in order to submit jobs. All the nodes and gateways are under puppet control. I recently upgraded from 2.6. The master logs to syslog as desired, but how to change the log level for the master escapes me. I appear to have the choice of --debug or nothing: debug logs far too much detail, while not using that switch simply logs each time passenger/apache launches a new worker to handle incoming connections.
I find nothing in the online docs about doing this. What I want is to log each time a node hits the server, but I do not need to see the compiled catalog or resources in /var/log/messages.
How is this accomplished?
This is a hack, but here is how I solved the problem. In the file (config.ru) that Passenger uses to launch puppet via Rack middleware, which on my system lives in /usr/share/puppet/rack/puppetmasterd, I noticed these lines:
require 'puppet/util/command_line'
run Puppet::Util::CommandLine.new.execute
So I edited this to become:
require 'puppet/util/command_line'
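# set the desired log level for the master before handing control to Puppet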
Puppet::Util::Log.level = :info
run Puppet::Util::CommandLine.new.execute
I suppose other choices for Log.level could be :warn and others.

Resources