Submit & monitor Spark jobs via Java in cluster mode - apache-spark

I have a Java class which manages jobs and executes them via Spark (using 1.6).
I am using the API sparkLauncher.startApplication(SparkAppHandle.Listener... listeners) in order to monitor the state of the job.
The problem is that I have moved to a real cluster environment, and this approach can't work when the master and workers are not on the same machine, as the internal implementation uses localhost only (loopback) to open a port for the workers to bind to.
The API sparkLauncher.launch() works, but doesn't let me monitor the status.
What is the best practice for a cluster environment using Java code?
I also saw the option of the hidden REST API. Is it mature enough? Do I need to enable it in Spark somehow (I am getting access denied, even though the port is open from outside)?
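For reference, a minimal sketch of the listener-based launch being described, assuming Spark 1.6's launcher API (the jar path, main class, and master URL below are placeholders, not taken from the question):
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;
public class JobMonitor {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/path/to/my-job.jar")      // placeholder jar
                .setMainClass("com.example.MyJob")          // placeholder main class
                .setMaster("spark://master-host:7077")      // placeholder master URL
                .setDeployMode("cluster")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        System.out.println("State changed: " + h.getState());
                    }
                    @Override
                    public void infoChanged(SparkAppHandle h) {
                        System.out.println("App id: " + h.getAppId());
                    }
                });
        // Block until the application reaches a final state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
    }
}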

REST API
In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available both for running applications and in the history server. The endpoints are mounted at /api/v1. E.g., for the history server they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application at http://localhost:4040/api/v1.
You can find more details here.
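For example, a monitoring client can simply poll these endpoints over HTTP; a rough Java sketch (the host name is a placeholder, and the history server host on port 18080 would be used for completed applications):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
public class SparkRestClient {
    public static void main(String[] args) throws Exception {
        // Placeholder host: the driver host for a running application.
        URL url = new URL("http://driver-host:4040/api/v1/applications");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON describing the applications
            }
        }
    }
}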

Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information
Information about the running executors
You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
You can find more details here.

Related

Remotely access a qsub compute node

Using qsub, I have submitted a long-running job that spawns two Java processes, one of which is listening for Java RMI calls on some port. Say qsub assigns that job to node "compute-0-37". How can I communicate with compute-0-37 remotely (from a node other than the head node) over an RPC call (Java RMI in this case)?
I have not been able to find this from reading existing docs (e.g. http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html, http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qsub.htm). As far as I can tell, the only way to access a compute node directly is from the head node, but it seems like that would be pretty restrictive for use cases like mine.
The reason you can't find any documentation about this in the resource management software documentation is that this isn't a resource management software question. Accessing worker nodes is simply that: a function of network access. Whether you're hoping to do RPC communication between the nodes themselves, or with some machine on a different subnet, you should be able to do so (provided that site policies and the system administrator(s) allow it).
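As an illustration, once the network path is open, reaching that registry from another node is an ordinary remote lookup; a hedged Java sketch (the registry port and binding name are assumptions, only compute-0-37 comes from the question):
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
public class RmiClient {
    public static void main(String[] args) throws Exception {
        // Placeholder port: whatever port your job's RMI registry listens on.
        Registry registry = LocateRegistry.getRegistry("compute-0-37", 1099);
        // Placeholder binding name; cast the stub to your own Remote interface.
        Object stub = registry.lookup("MyService");
        System.out.println("Got remote stub: " + stub);
    }
}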

Connect Spark application with web server

I will just try to explain my simplified use case. There is:
A Spark application which counts words.
A web server which serves a web page with a form.
A user who can type a word into this form and submit it.
The server receives the word and sends it to the Spark application.
The Spark application takes this word as input and, based on some data and this word, launches a job with recalculations. Once Spark is done with the calculations, it sends the results to the web server, which shows them on a web page.
The question is: how can I establish communication between the Spark application and the web server?
I guess that spark-jobserver or spark-streaming can help me here, but I am not sure about it.
There are a few projects that will help you with this.
Generally you run a separate web server for managing the Spark jobs, as there is some messy system-exec work around the spark-submit CLI to accomplish this (see the sketch after the links below). Obviously this runs on a different port than your primary application and is only accessible by the server component of the primary web application.
There are a few open source projects that will handle this for you, most notably:
https://github.com/spark-jobserver/spark-jobserver
https://github.com/cloudera/livy
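To make the "messy system-exec" part concrete, here is a rough sketch of what the managing web server ends up doing if you don't use one of these projects (all paths, the class name, the master URL, and the argument are placeholders):
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class SparkSubmitRunner {
    public static void main(String[] args) throws Exception {
        // Placeholder paths, class name, and master URL.
        ProcessBuilder pb = new ProcessBuilder(
                "/opt/spark/bin/spark-submit",
                "--master", "spark://master-host:7077",
                "--class", "com.example.WordCount",
                "/path/to/wordcount.jar",
                "someWord");                    // the word submitted by the user
        pb.redirectErrorStream(true);
        Process process = pb.start();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);       // stream spark-submit's output
            }
        }
        System.out.println("spark-submit exited with code " + process.waitFor());
    }
}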

Open a port for Kafka communication to the outside-world

I have a VM (Linux OS) in Azure which has Hortonworks on it, which launches Kafka.
The Kafka service is running and I am able to create a producer and a consumer inside the VM.
I have the server IP and I'm also able to log into Ambari using 8080 port.
When I am trying to send a message to Kafka from my Java application, I get a TimeoutException after 60 seconds.
What do I need to do in order to set the right port for Kafka communication from outside the VM?
I think that the main issue here is that Kafka is listening on the local IP and not on the VM's IP (WAN).
Any help will be really appreciated...
If you have used the Azure Resource Manager workflow to create the VM, you have a Network Security Group that has been created automatically. You need to create rules in the NSG to make Kafka available. See: https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-nsg/
If you have used the Azure classic deployment workflow, you need to define an endpoint to expose Kafka. See: https://azure.microsoft.com/fr-fr/documentation/articles/virtual-machines-windows-classic-setup-endpoints/
Hope this helps,
Julien
Did you set the Kafka advertised.host.name and advertised.port properties? That's how you present yourself to the outside world.
(Copying and pasting my response to a similar post)
For recent versions of Kafka (0.10.0 as of this writing), you don't want to use advertised.host.name at all. In fact, even the documentation states that advertised.host.name is already deprecated. Moreover, Kafka will use this not only as the "advertised" host name for the producers/consumers, but for other brokers as well (in a multi-broker environment)...which is kind of a pain if you're using a different (perhaps internal) DNS for the brokers...and you really don't want to get into the business of adding entries to the individual /etc/hosts of the brokers (ew!)
So, basically, you would want the brokers to use the internal name, but use the external FQDNs for the producers and consumers only. To do this, you will update advertised.listeners instead.
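As a rough broker-side configuration sketch (host names and ports below are placeholders; fully separating internal broker traffic from external client traffic may require defining additional listeners):
# server.properties sketch; host names and ports are placeholders
# What the broker binds to locally:
listeners=PLAINTEXT://0.0.0.0:9092
# What the broker tells clients to connect to; for producers/consumers outside
# the VM this must be an address reachable from outside (e.g. the public FQDN):
advertised.listeners=PLAINTEXT://my-vm.cloudapp.azure.com:9092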

Is it possible to include remote JMX values on a dashboard?

I'm looking at using hawtio for our app as a support console. We're not currently using Camel or the like, but I am impressed by the ability to connect to remote JVMs via Jolokia/JMX and by the logging features, and was wondering:
Our use case would be that we have a WebLogic server hosting our web app, and my thought would be to include hawtio as a WAR alongside it. In addition to monitoring the web app, we have a number of other JVMs running on different servers.
Is it possible to create a dashboard using values from the local JVM, as well as some of the remote JVMs?
Or must one always manually connect to the instance to see the dashboard for that particular JVM?
The current dashboard and JMX plugin does not support that.
Though there is work planned to support gathering statistics from remote JVMs, etc. There is also work on Elasticsearch with a Kibana web UI.

How to raise or lower the log level in puppet master?

I am using Puppet 3.2.3, Passenger and Apache on CentOS 6. I have 680 compute nodes in a cluster, along with 8 gateways users use to log in to the cluster and submit jobs. All the nodes and gateways are under Puppet control. I recently upgraded from 2.6. The master logs to syslog as desired, but how to change the log level for the master escapes me. I appear to have the choice of --debug, or nothing. Debug logs far too much detail, while not using that switch simply logs each time Passenger/Apache launches a new worker to handle incoming connections.
I find nothing in the online docs about doing this. What I want is to log each time a node hits the server; but I do not need to see the compiled catalogue, or resources, in /var/log/messages.
How is this accomplished?
This is a hack, but here is how I solved the problem. In the file (config.ru) that Passenger uses to launch Puppet via Rack middleware, which on my system lives in /usr/share/puppet/rack/puppetmasterd, I noticed these lines:
require 'puppet/util/command_line'
run Puppet::Util::CommandLine.new.execute
So, I edited this to become:
require 'puppet/util/command_line'
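# raise the log level before the Puppet command line runs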
Puppet::Util::Log.level = :info
run Puppet::Util::CommandLine.new.execute
I suppose other choices for Log.level could be :warn and others.
