I will just try to explain my simplified use case. There is:
A Spark application which counts words.
A web server which serves a web page with a form.
A user who can type a word into this form and submit it.
The server receives the word and sends it to the Spark application.
The Spark application takes this word as input and, based on some data and this word, launches a job with recalculations. Once Spark is done with the calculations, it sends the results to the web server, which shows them on a web page.
The question is: how can I establish communication between the Spark application and the web server?
I guess that spark-jobserver or Spark Streaming can help me here, but I am not sure about it.
There are a few projects that will help you with this.
Generally you run a separate web server for managing the Spark jobs, as there is some messy systemExec work around the spark-submit CLI to accomplish this (a rough sketch follows). Obviously this runs on a different port than your primary application and is only accessible by the server component of the primary web application.
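For illustration, a minimal sketch of that systemExec wrapper, shelling out to spark-submit from the management server (the master URL, class name, and jar path are all placeholders):

import subprocess

# Shell out to spark-submit; this is the "messy systemExec work" mentioned above.
def submit_word_job(word):
    cmd = [
        "spark-submit",
        "--master", "spark://master:7077",       # placeholder master URL
        "--class", "com.example.WordCountJob",   # placeholder job class
        "/opt/jobs/wordcount.jar",               # placeholder jar path
        word,                                    # the user's word as a job argument
    ]
    # Returns a CompletedProcess; inspect .returncode and .stdout for the result.
    return subprocess.run(cmd, capture_output=True, text=True)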
There are a few open source projects that will handle this for you, most notably:
https://github.com/spark-jobserver/spark-jobserver
https://github.com/cloudera/livy
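With Livy, for instance, job submission and monitoring happen over plain HTTP. A minimal sketch, assuming a Livy server on its default port 8998 (the host, jar path, and class name are placeholders):

import requests

LIVY = "http://livy-host:8998"  # placeholder Livy server address

# Submit a batch job to the cluster.
batch = requests.post(LIVY + "/batches", json={
    "file": "/opt/jobs/wordcount.jar",        # placeholder jar, visible to the cluster
    "className": "com.example.WordCountJob",  # placeholder main class
}).json()

# Poll the job's state; it moves through e.g. "starting", "running", "success".
state = requests.get("%s/batches/%s" % (LIVY, batch["id"])).json()["state"]
print(state)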
I installed Nagios Core and NCPA on a Mac and implemented a few checks via custom plugins to understand how to use it. I am trying to understand the following:
What protocol does the Nagios server actually use to communicate with the NCPA agent, and how exactly does NCPA return the result back to Nagios? Does it SSH into the Nagios server and write a file that the server processes?
From an application monitoring standpoint, how can it be leveraged? Is it just to monitor that an application is up and running (I read it is not just for that and can do more, but couldn't find anywhere showing how it is actually implemented), or is there a RESTful API as well that we can invoke from within our application to send custom notifications to the Nagios server? I understand it might require some configuration on the Nagios server end as well.
I came across PagerDuty and Sematext articles, i.e. PagerDuty Integration and Sematext Nagios Alert Integration, where they have integrated their solutions with Nagios. I am trying to do something similar: adding integration support for Nagios so that a user can use our application's UI to configure alerts/notifications. For example, if a condition is met, alert or notify the Nagios server to show a notification on its dashboard.
Can we generate an alert from within a Spark Streaming application based on a variable, e.g. if its value is above a threshold or some condition is met, send an alert to the Nagios server to display as a notification on the Nagios dashboard? I came across a link describing how to monitor the status of a Spark application, but didn't find anything about alerting from within one.
I tried looking for answers to the above questions but couldn't find anything useful or complete online. I would really appreciate it if someone could help me understand the above.
Nagios is highly configurable, and can communicate across many protocols. NCPA can return JSON or XML data. The most common agentless protocol is probably SNMP. If you can read Python, look directly at the /usr/local/nagios/libexec/check_ncpa.py file to see what's up.
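To get a feel for what check_ncpa.py does under the hood, you can query the NCPA agent's HTTPS API directly. A minimal sketch, assuming the default agent port 5693 and a token of "mytoken" configured in ncpa.cfg:

import requests

# NCPA serves metrics as JSON over HTTPS on port 5693 by default.
resp = requests.get(
    "https://localhost:5693/api/cpu/percent",
    params={"token": "mytoken"},  # placeholder token from ncpa.cfg
    verify=False,                 # NCPA ships with a self-signed certificate
)
print(resp.json())  # JSON tree for the requested metric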
Nagios can check whether a system is running a service, how many resources it is consuming, etc. There is also a RESTful API.
Nagios offers an application with a more advanced graphical interface called Nagios XI. Perhaps that is what you are after.
I bet you probably could, yeah. It might take some development work to get the systems to communicate though.
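One hedged sketch of how that could look: hook into each micro-batch with foreachRDD and, when the threshold is crossed, push a passive check result to Nagios with the send_nsca utility (this assumes NSCA is configured on the Nagios side; the host and service names are placeholders):

import subprocess

NAGIOS_HOST = "nagios.example.com"  # placeholder Nagios server

def notify_nagios(value):
    # send_nsca reads "host<TAB>service<TAB>state<TAB>message" from stdin; state 2 = CRITICAL.
    line = "sparkhost\tspark_threshold\t2\tvalue %s above threshold\n" % value
    subprocess.run(["send_nsca", "-H", NAGIOS_HOST], input=line.encode(), check=True)

def check_batch(rdd):
    total = rdd.count()
    if total > 1000:  # illustrative threshold
        notify_nagios(total)

# In the streaming job: stream.foreachRDD(check_batch)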
I want to build a website which will take an input file, process it with Apache Spark in the backend, and then send the output back to the website.
I do not understand how to connect Spark, running in a Jupyter notebook, with my website.
Any ideas are highly welcome.
Spark won’t communicate directly with your web application servers, really. One possible way around this is to publish your results to a database (MongoDB or PostgreSQL, for instance) and then integrate it with your website.
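A minimal sketch of that handoff, assuming PostgreSQL and a word_counts table the website then reads (connection details are placeholders, and the PostgreSQL JDBC driver must be on Spark's classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Compute results as a DataFrame.
counts = (spark.read.text("input.txt")                      # placeholder input
          .selectExpr("explode(split(value, ' ')) AS word")
          .groupBy("word").count())

# Publish them to the database; the website queries this table.
(counts.write.format("jdbc")
 .option("url", "jdbc:postgresql://dbhost:5432/mydb")     # placeholder host/db
 .option("dbtable", "word_counts")
 .option("user", "appuser").option("password", "secret")  # placeholders
 .mode("overwrite")
 .save())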
I got my answer to the question.
I used Python Flask code in my PySpark program, and it is giving me the desired result on the website.
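For anyone after the same thing, a minimal sketch of that pattern, embedding a Flask server in the PySpark driver process (the route, port, and data source are illustrative):

from flask import Flask, request
from pyspark.sql import SparkSession

app = Flask(__name__)
spark = SparkSession.builder.appName("wordcount-web").getOrCreate()

@app.route("/count")
def count():
    # The submitted word arrives as a query parameter, e.g. /count?word=spark
    word = request.args.get("word", "")
    lines = spark.read.text("input.txt")  # placeholder data source
    n = lines.filter(lines.value.contains(word)).count()
    return {"word": word, "count": n}

if __name__ == "__main__":
    # The Flask server and the Spark driver share this one process.
    app.run(host="0.0.0.0", port=5000)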
I have a Java class which manages jobs and executes them via Spark (using 1.6).
I am using the API sparkLauncher.startApplication(SparkAppHandle.Listener... listeners) in order to monitor the state of the job.
The problem is that I have moved to a real cluster environment, and this approach can't work when the master and workers are not on the same machine, as the internal implementation uses only localhost (loopback) to open a port for the workers to bind to.
The API sparkLauncher.launch() works, but doesn't let me monitor the status.
What is the best practice for a cluster environment using Java code?
I also saw the option of the hidden REST API; is it mature enough? Should I enable it in Spark somehow (I am getting access denied, even though the port is open from outside)?
REST API
In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available for both running applications and the history server. The endpoints are mounted at /api/v1. E.g., for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
You can find more details here.
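A minimal sketch of polling those endpoints, assuming the application UI is reachable on the default port 4040 (this gives read-only status over HTTP rather than SparkAppHandle's callbacks):

import requests

BASE = "http://localhost:4040/api/v1"  # for the history server, use port 18080

# List applications known to this UI, then print each job's status.
for app in requests.get(BASE + "/applications").json():
    jobs = requests.get("%s/applications/%s/jobs" % (BASE, app["id"])).json()
    for job in jobs:
        print(job["jobId"], job["status"])  # e.g. RUNNING, SUCCEEDED, FAILED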
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information
Information about the running executors
You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
You can find more details here.
Currently I have two servers on which I have deployed a Node.js/Express.js-based web services API. I am using Redis for caching the JSON strings.
What will be the best option for deploying this setup to production? I see here that it advises going with a dedicated Redis server. OK, I take it and use a dedicated server for running the Redis master. Can I use the existing app servers as slave nodes? Note: these app servers are running a Node/Express application.
What other options do I have?
You can.
It all depends on the load those other servers have; it's a problem of resource sharing. To be honest, my main issue with your architecture is not the dedicated vs. non-dedicated servers; it's the fact that you are placing a Redis server (master or not) on a host that will most likely be facing the internet (the Express.js app), meaning it's quite exposed.
If you can simulate HTTP load on your Node/Express.js servers, see the difference by running some benchmark tests on your dedicated server vs. the non-dedicated ones:
On a running Redis server, type:
redis-benchmark -q -n 100000
If the app servers are being hammered and using all cores frequently you should see a substantial difference in the benchmarks.
My suggestion is: go ahead with your first setup, add monitoring for the Redis response times, and only act when you have to, which might be now if the benchmarks show very poor results.
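A minimal sketch of that response-time monitoring, assuming the redis-py client and a Redis instance on the default port (the threshold is illustrative; redis-cli --latency gives a similar view from the shell):

import time
import redis

r = redis.Redis(host="localhost", port=6379)

# Time a round trip and flag it when latency degrades.
start = time.time()
r.ping()
latency_ms = (time.time() - start) * 1000
if latency_ms > 5:  # illustrative threshold in milliseconds
    print("Redis latency high: %.2f ms" % latency_ms)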
As a side note, consider the option of not sharing hosts for services that you expose to the internet with services that perform internal functions to your application.
I'm looking at using hawtio for our app as a support console. We're not currently using Camel or the like, but I am impressed by the ability to connect to remote JVMs via Jolokia/JMX and by the logging features, and I was wondering:
Our use case would be that we have a WebLogic server hosting our web app, and my thought would be to include hawtio as a WAR alongside it. In addition to monitoring the web app, we have a number of other JVMs running on different servers.
Is it possible to create a dashboard using values from the local JVM, as well as some of the remote JVMs?
Or must one always manually connect to the instance to see the dashboard for that particular JVM?
The current dashboard and JMX plugin do not support that.
Though there is work planned to support gathering statistics from remote JVMs, etc. There is also work on Elasticsearch with a Kibana web UI.