Stopping a Running Spark Application (Databricks Interactive Cluster) - apache-spark

I'm using Databricks with an interactive cluster. If I review their management user interface, there is only one "application" listed, and when I try to kill it, I always get this message:
HTTP ERROR 405
Problem accessing /app/kill/. Reason:
Method Not Allowed
The end result is that I'm forced to restart the entire cluster. I use their "cluster pool" feature, which makes the wait time a bit shorter, but it still means waiting about a minute before I'm able to get back to work.
The reason I need to restart the application is to swap fresh JARs into the Spark environment. Otherwise, when I repeatedly use addJar(), I run into annoying JAR-hell issues (class-not-found errors and such).
Why does Databricks only list one application at a time in their "interactive" cluster?
Why doesn't Databricks have a way to stop one application and start another in its place (without restarting the whole cluster)?
This affects development productivity when we are forced to sit around waiting an extra minute for no good reason. It is already pretty hard to be productive with Spark.

Related

Pacemaker/Corosync/PostgreSQL cluster failovers during heavy load

At our company we're running PostgreSQL on 4-node clusters using Pacemaker and Corosync.
During heavy batch loads we suffer from cluster failovers because the built-in resource monitoring times out when trying to access the database, due to, well, server overload...
On one hand it's understandable cluster behaviour that a 'self-induced denial of service' should trigger a master switchover; on the other hand we'd rather not see our batches and service (temporarily) aborted because of it. A standalone server would have just pulled through. Obviously we're looking into optimizing and spreading the batches, but that's like putting out one fire only for another to pop up elsewhere.
I looked into Linux cgroups, but this doesn't seem to be a viable solution, as all it does is CPU/IO-limit your PostgreSQL resource, which is part of the problem :-)
Any ideas or suggestions very much appreciated!

If multiple jobs exist in the event loop for one process, what happens to the remaining jobs if the current job crashes the process?

In Node.js cluster mode, if multiple jobs exist in the event loop for one process, should the current job crash the process, what happens to the remaining jobs?
I'm assuming the remaining jobs in the event loop would go unfulfilled or return a server error. My question is, why is this an acceptable risk? Why would someone opt to use Node.js cluster mode in production then, rather than use something like PHP in production, where there is no risk of this, because PHP handles each request in its own process.
Edit:
Obviously this doesn't just apply to Node.js cluster mode. It can happen on a single instance, in which case obviously the end user would just get a server error. Cluster mode just happens to be my personal use case.
I'm looking for a way to pick a job back up from the queue should a previous job cause the process to exit before the subsequent job gets a chance to be fulfilled. I am currently reading about how you can use a tool like RabbitMQ to handle your job queue outside of the Node.js cluster, with each cluster instance just pulling jobs from the RabbitMQ queue. If anyone has any input on that, it would also be greatly appreciated.
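For what it's worth, here is a minimal sketch of that RabbitMQ approach, assuming the amqplib client, a queue named "jobs", and a hypothetical handleJob(); the key point is that a message is only acknowledged after the job succeeds, so jobs in flight when a worker process dies get redelivered to a surviving worker.

```typescript
import * as amqp from "amqplib";

// Hypothetical job handler; replace with whatever the worker actually does.
async function handleJob(job: { url: string }): Promise<void> {
  console.log("processing", job.url);
}

async function startWorker(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue("jobs", { durable: true });

  // Only take one unacknowledged message at a time, so an un-acked job is
  // redelivered to another worker if this process crashes mid-job.
  await channel.prefetch(1);

  await channel.consume("jobs", async (msg) => {
    if (msg === null) return; // consumer was cancelled
    try {
      await handleJob(JSON.parse(msg.content.toString()));
      channel.ack(msg);
    } catch (err) {
      // Processing failed but the process survived: requeue so another
      // worker (or this one) can retry.
      channel.nack(msg, false, true);
    }
  });
}

startWorker().catch((err) => {
  console.error(err);
  process.exit(1);
});
```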
If multiple jobs exist in the event loop for one process, what happens to the remaining jobs if the current job crashes the process?
If a node.js process crashes, the same thing happens to it that happens to any other process. All open sockets get automatically disconnected and the client will receive an immediate close on their socket (socket connection dropped essentially).
If you were using a Java server that was in the middle of handling 10 requests (perhaps in threads) and it crashed, the consequences would be the same. All 10 socket connections would get dropped.
If process isolation from one request to another is your #1 criterion for selecting a server environment, then I guess you wouldn't pick any environment that ever serves multiple requests from the same process. But you would give up a lot to get that. One of the reasons for the node.js design is that it scales really, really well for a high number of concurrent connections that are all doing mostly I/O things (disk, networking, database stuff, etc...), which happens to be most web servers. Whereas a design that fires up a new process for every incoming connection does not scale as well for a large number of concurrent connections, because a process is a much more heavy-weight thing in the eyes of the operating system (memory usage, other system resource usage, task-switching overhead, etc...) than the way node.js does things.
And, there are obviously hundreds of other considerations too when choosing a server environment. So, you kind of have to look at the whole picture of what you're designing for and make the best set of tradeoffs.
In general, I wouldn't put this issue anywhere on the radar for why you should choose one over the other unless you expect to be running risky code (perhaps out of your control) that crashes a lot and this issue is therefore more important in your deployment than all the other differences. And, if that was the case, I'd probably isolate the risky code to its own process (even when using nodejs) to alleviate any pain from that crash. You could have a process pool waiting to process risky things. For example, if you were running code submitted by a user, I might run that code in its own isolated VM.
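As a rough illustration of that isolation idea (not the answer's exact setup), here is a sketch that forks a hypothetical risky-worker.js per task, so a crash only takes down that child process:

```typescript
import { fork } from "child_process";

// Hypothetical wrapper: run a risky task in a throwaway child process so a
// crash only takes down that child, not the main server. "./risky-worker.js"
// is assumed to read one task via process.on("message"), reply with
// process.send(result), and then exit on its own.
function runRiskyTask(task: Record<string, unknown>): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const child = fork("./risky-worker.js");
    child.once("message", (result) => resolve(result));
    child.once("exit", (code, signal) => {
      // If the worker died before replying, surface that as an error. Settling
      // an already-resolved promise is a no-op, so a normal exit is harmless.
      if (code !== 0) reject(new Error(`worker exited with ${code ?? signal}`));
    });
    child.send(task);
  });
}

// Usage: the caller awaits the result and treats a rejection like any other
// error, while the main process keeps serving other requests.
// const result = await runRiskyTask({ userCode: "..." });
```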
If you're just worried about your own code crashing a lot, then you probably have bigger problems and need more extensive unit testing, more robust error handling, and to take advantage of other tools such as a linter and other code-analysis tools to find potential problem areas. With proper design, implementation and error handling, you should be able to keep a single incoming request from harming anything other than itself. That's certainly the philosophy advocated by every server environment that serves multiple requests from the same process, and followed by the people/companies deploying those servers.
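A small sketch of that per-request containment, assuming a plain Node HTTP server and a hypothetical handleRequest(): errors inside one request are caught and answered with a 500, while everything else keeps running.

```typescript
import * as http from "http";

// Hypothetical per-request work; anything it throws only affects this request.
async function handleRequest(req: http.IncomingMessage): Promise<unknown> {
  return { path: req.url };
}

const server = http.createServer(async (req, res) => {
  try {
    const body = await handleRequest(req);
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(body));
  } catch (err) {
    // Contain the failure: this request gets a 500, other requests are untouched.
    res.writeHead(500);
    res.end("internal error");
  }
});

// Last-resort logging; at this point the process state is suspect, so exit and
// let a supervisor (pm2, systemd, ...) restart the server.
process.on("uncaughtException", (err) => {
  console.error("uncaught exception", err);
  process.exit(1);
});

server.listen(3000);
```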

Azure App Service - Running Solr on Jetty - LockObtainFailedException after Azure maintenance

I'm running a single (not scaled) solr instance on a Azure App Service.
The App Service runs Java 8 and a Jetty 9.3 container.
Everything works really well, but when Azure decides to swap to another VM, sometimes the JVM doesn't seem to shut down gracefully and we encounter issues.
One of the reasons for Azure to decide to swap to another VM is infrastructure maintenance. For example, Windows Updates are installed and your app is moved to another machine.
To prevent downtime Azure spins up the new app and, when it's ready, swaps over to it. Seems fine, but this does not seem to work well with Solr's locking mechanism.
We are using the default native lockType, which should be fine since we're only running a single instance. Solr should remove the write.lock file during shutdown, but this does not seem to happen all of the time.
The Azure diagnostics tools clearly show this event happening, and the memory usage graph shows both apps running at the same time.
During the start of the second instance Solr tries to lock the index, but this is not possible because the first one is still using it (it also holds the write.lock file). Sometimes the first one doesn't remove the write.lock file, and this is where the problems start.
The second solr instance will never work correctly without manual intervention (manually deleting the write.lock file).
The solr logs:
Caused by: org.apache.solr.common.SolrException: Index dir 'D:\home\site\wwwroot\server\solr\****\data\index/' of core '*****' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: native
and
org.apache.lucene.store.LockObtainFailedException
What can be done about this?
I was thinking of changing the lockType to a memory-based lock, but I'm not sure if that would work because both instances are alive at the same time for a short period.
You could try and set WEBSITE_DISABLE_OVERLAPPED_RECYCLING=1
Overlapped recycling makes it so that before the current instance on an app is shut down, a new instance starts. It can in some cases cause file locking issues, in which case you can try turning it off:
Reference
If you would like to run Solr without any locks at all, you could do this by specifying <lockType>none</lockType> in your solrconfig.xml instead of the usual <lockType>native</lockType>.
Obviously, you need to be careful with this mechanism, since different Solr instances could try to change the index at the same time, which could lead to corruption.
All available lock types are listed there

Is Apache Zeppelin stable enough to be used in Production

I am using an AWS EMR cluster. I have been experimenting with Spark drivers and the Apache Zeppelin REST APIs to run jobs. I have run several hundred ad-hoc jobs with Zeppelin and didn't have any concerns. Given that, I am considering using the Zeppelin REST APIs in production, submitting jobs through them.
Has anyone experienced stability issues with Zeppelin in Production?
I have Zeppelin running in production in a multiuser environment (+/- 15 users) and it hasn't been very stable. To make it more stable I run Zeppelin on its own node, no longer on the master node.
Anyway, I found the following problems:
In releases before 0.7.2, Zeppelin created a lot of zombie processes, which caused memory problems after heavy usage.
User libraries can break Zeppelin; this was the case in versions prior to 0.7.0. E.g. Jackson libraries made Zeppelin unable to communicate with the Spark interpreter. In 0.7.0 and up this problem has been mitigated.
There are random freezes when there are a lot of users. The only way to fix this is to restart the service. (All versions)
Sometimes when a user starts his interpreter and the local repo is empty, Zeppelin doesn't download all the libraries specified in the interpreter config. Then it won't download them again; the only way to mitigate this is to delete the contents of the interpreter's local repo. (All versions)
Sometimes changes to notebooks don't get saved, which causes users to lose code.
In version 0.6.0, Spark interpreters shared a context, which caused users to overwrite each other's variables.
Problems are difficult to debug, the logging is not that great yet. Some bugs seem to break the logging and sometimes running an interpreter in debug mode fixes the problem.
So, I wouldn't put it in a production setting yet, where people depend on it. But for testing and data discovery it would be fine. Zeppelin is clearly still in a beta stage.
Also, don't run it on the master node; set up your own instance and let it connect remotely to the cluster. This makes it much more stable. Put it on a beefy node and restart it overnight.
Most of the bugs I encountered are already on the Jira and the developers are working hard to make things better. The stability becomes better and better every release and I see the maintenance load going down every version, so it certainly has potential.
I have used Zeppelin now for more than a year. It gets you going quickly when you are just starting, but it is not a good candidate for production use cases, especially with more than 10 users, and it depends on your cluster resources. These were my overall concerns with Zeppelin:
By default you can't have more than one job running at a time; you will need to change the configuration to make that happen.
If you are loading additional libraries from S3 or external environments, you can do that only at startup, or you will have to restart Zeppelin.
The Spark context is pre-created and there are only a few settings you can change.
The editor itself doesn't resize well when your output is large.
I am moving on to Jupyter for my use cases, which is much stronger in my initial assessment.
As of the time of this answer, end of February 2019, my answer would be: NO. Plain and simple. Zeppelin keeps crashing, hanging and getting unresponsive, notebooks tend to become unloadable due to size errors, execution is very slow compared to Jupyter, plus there are so many limitations regarding third-party display engine integration (although much effort has been made towards this).
I experienced these issues on a decently sized and provisioned cluster, with a single user. I would never, ever, advise it as a production tool. Not as it is today, at least. Unless you have an admin at hand able to restart the whole thing regularly, track down/fix errors and be in charge of integration.
We moved back to Jupyter, and everything worked smoothly out-of-the box from day one, after struggling to stabilize Zeppelin for weeks.

Monitor node.js scripts running on ubuntu instance

I have a Node.js script that runs once a day on an Ubuntu EC2 instance. This script pulls data from a few hundred thousand remote APIs and saves it to our local database. Is there any way we can monitor this Node.js script on the remote server? There have been a few instances where the script crashed for some reason and we were unable to figure it out without SSHing into the instance and checking the logs. After the first few crashes I did, however, create a small system which sends us an email whenever the script crashes due to an uncaught exception, and also when it completes execution.
However, we need to develop a better system where we can monitor the progress of the script via the web interface of our admin application (deployed on another instance) and also trigger start/stop of the script via that interface. What are the possible options for achieving this?
If you'd like to stay in Node.js, there are several process monitoring tools:
PM2 comes with lots of other features besides monitoring processes. You can monitor your processes via the CLI or their official web interface: https://keymetrics.io/. A quick search on npm also gives a bunch of nice unofficial GUI tools: https://www.npmjs.com/search?q=pm2+web
Forever is not as feature-rich as PM2 but will do the basic process operations, and a couple of GUIs are also available on npm.
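As an illustration of how the admin app could drive this, here is a hedged sketch using pm2's programmatic API, assuming the daily script is registered under the (hypothetical) process name "daily-sync":

```typescript
import * as pm2 from "pm2";

// Assumes the daily script was started under pm2 with the (hypothetical)
// process name "daily-sync", e.g. `pm2 start sync.js --name daily-sync`.

// Return pm2's status info for the script, to be rendered by the admin app.
export function getScriptStatus(): Promise<unknown[]> {
  return new Promise((resolve, reject) => {
    pm2.connect((err) => {
      if (err) return reject(err);
      pm2.describe("daily-sync", (err2, description) => {
        pm2.disconnect();
        if (err2) reject(err2);
        else resolve(description);
      });
    });
  });
}

// Trigger a (re)start of the script from the admin interface.
export function triggerScript(): Promise<void> {
  return new Promise((resolve, reject) => {
    pm2.connect((err) => {
      if (err) return reject(err);
      pm2.restart("daily-sync", (err2) => {
        pm2.disconnect();
        if (err2) reject(err2);
        else resolve();
      });
    });
  });
}
```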
There are two problems here that you are trying to solve:
Scheduling work to be done
Monitoring a process for failure
At a simple level, this is easy: schedule a cron job and restart failed things so they keep trying.
However, when things don't go smoothly, it helps to have a lot more granularity over what you are scheduling and how it is executed. This also gives you visibility into each little piece of work.
Adding a little more complexity, you can end up with something like this:
Schedule the script that starts everything (via cron, if that's comfortable)
That script generates the jobs that need to be executed and pushes them into a queue
A worker process (or n worker processes) consume that queue and execute pending jobs
You can monitor both the progress of the jobs, as well as the state of each worker (# of crashes, failures, jobs completed, etc.). The other tools mentioned above are good candidates for this (forever, pm2, etc.)
When jobs fail, other workers can pick up the small piece of work that was in progress and restart it. This is much more efficient than restarting the entire process, and also lets you parallelize things across n workers based on how you can split up the workloads.
You could easily throw the status onto a web app so you can check in periodically rather than have to dig through server logs.
You can also get more intelligent with different types of failures. Network error? Retry 5 times. Rate limited? Gradual back-off. Crash? Don't retry and notify via email. Etc.
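A rough sketch of that failure-type-aware retry logic; the error classes and the notify() hook are placeholders, not part of the original answer:

```typescript
// Placeholder error classes; in practice you'd map whatever your HTTP client
// throws (e.g. status codes) onto categories like these.
class NetworkError extends Error {}
class RateLimitError extends Error {}

// Hypothetical notification hook (e.g. send an email or Slack message).
async function notify(err: unknown): Promise<void> {
  console.error("job failed permanently:", err);
}

async function runWithRetries<T>(job: () => Promise<T>, attempt = 1): Promise<T> {
  try {
    return await job();
  } catch (err) {
    if (err instanceof NetworkError && attempt < 5) {
      return runWithRetries(job, attempt + 1); // plain retry, up to 5 attempts
    }
    if (err instanceof RateLimitError && attempt < 8) {
      const delayMs = Math.min(60_000, 500 * 2 ** attempt); // gradual back-off
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      return runWithRetries(job, attempt + 1);
    }
    await notify(err); // crash or exhausted retries: give up and alert
    throw err;
  }
}
```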
I have tried this with pm2: you can get the info of the task, then cat out or grab the log files. Or you could have a logging server; see also: https://github.com/papertrail/remote_syslog2
