I want to have a view of Spark metrics in Kibana, such as decommissioned nodes per application, shuffle read and write per application, and more.
I know I can get all the information about these metrics here.
But I don't know how to send them to Elasticsearch, or what the correct way to do it is. I know I can do it with Prometheus, but I don't think that helps me.
Is there a way of doing so?
Small question regarding an integration between Splunk and Apache Spark.
Currently, I am running a search query in Splunk. The result is quite big, and I am exporting it as a CSV to share with several teams for downstream work.
Each downstream team ends up loading the CSV as part of an Apache Spark job, converting it to a Dataset, and running map-reduce operations on it.
The Spark jobs of the various teams are all different, so simply plugging every team's computation into Splunk directly is not scalable.
This leads us to ask: instead of each team having to download a copy of the CSV, is there an API, or some other way, to connect to a Splunk search result from Apache Spark directly?
Thank you
Splunk does not have an API specifically for Spark. There is a REST API, a few SDKs, and (perhaps best for you) support for ODBC. With an ODBC/JDBC driver installed on your Spark server and a few saved searches defined on Splunk, you should be able to export results from Splunk to Spark for analysis. See https://www.cdata.com/kb/tech/splunk-jdbc-apache-spark.rst for more information.
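For example, with the CData driver mentioned in that link, the read from Spark might look roughly like the sketch below. This is a hedged sketch, not a tested recipe: the driver class name, JDBC URL format, and the table/view name exposed for a saved search are assumptions based on the CData documentation, so check the docs of whichever ODBC/JDBC driver you actually install.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Rough sketch of reading a Splunk saved search into Spark over JDBC.
// Driver class, URL format, and table name are placeholders taken from the
// CData driver docs linked above -- adjust them to your actual driver.
public class SplunkJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("splunk-jdbc-read")
                .getOrCreate();

        Dataset<Row> results = spark.read()
                .format("jdbc")
                .option("driver", "cdata.jdbc.splunk.SplunkDriver")   // assumed driver class
                .option("url", "jdbc:splunk:URL=https://splunk-host:8089;User=...;Password=...") // assumed URL format
                .option("dbtable", "SavedSearchResults")              // assumed exposed table/view
                .load();

        results.show();
        spark.stop();
    }
}
```

Each team could then run its own transformations on the resulting Dataset instead of re-downloading the CSV.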
Spark exposes many metrics for monitoring the work of the driver and the executors.
Let's say I use Prometheus. Can the metrics be used to see information about a specific Spark run? For example, to investigate the memory usage of a specific execution rather than the cluster in general? Not just to make big-picture graphs in Grafana, for instance. I do not see how I can do that with Prometheus or Graphite.
Is there a tool that is better suited to what I need?
We have many microservices (Java), and data is written to a Hazelcast cache for better performance. Now the same data needs to be made available to a Spark application for data analysis. I am not sure whether accessing an external cache from Apache Spark is the right design approach. I cannot make database calls to get the data, as the many database hits might affect the microservices (currently we don't have HTTP caching).
I thought about pushing the latest data into Kafka and reading it in Spark. However, each message might be big (sometimes > 1 MB), which is not ideal.
If it is OK to use an external cache in Apache Spark, is it better to use the Hazelcast client or to read the cached data over a REST service?
Also, please let me know if there is any other recommended way of sharing data between Apache Spark and microservices.
Please let me know your thoughts. Thanks in advance.
I have to benchmark Spark Streaming processing. My process pulls messages from Kafka, processes them, and loads them into Elasticsearch. The upstream generates 100k records per second, so I would like to calculate how many messages are processed per second and the latency. Are there any tools available to monitor this, or is there a procedure to calculate it?
The Spark UI can help you, providing the details you need.
By default, the Spark UI is available at http://<driver-node>:4040 in a web browser (for a single SparkContext).
For more details, see: http://spark.apache.org/docs/latest/monitoring.html
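If you prefer to pull the numbers programmatically rather than read them off the UI, the monitoring page linked above also documents a REST API under /api/v1 on the same port. Below is a minimal sketch that fetches the streaming statistics endpoint; the host, port, and application id are placeholders, the endpoint applies to DStream-based jobs (Structured Streaming reports progress through StreamingQuery.lastProgress instead), and the exact field names should be checked against the docs for your Spark version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: pull throughput/latency numbers from the Spark monitoring
// REST API. Host and application id below are placeholders.
public class StreamingStats {
    public static void main(String[] args) throws Exception {
        String driverHost = "localhost";             // driver node running the UI
        String appId = "app-20240101000000-0000";    // placeholder application id
        String url = "http://" + driverHost + ":4040/api/v1/applications/"
                + appId + "/streaming/statistics";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body includes fields such as avgInputRate, avgProcessingTime
        // and avgSchedulingDelay (per the monitoring docs).
        System.out.println(response.body());
    }
}
```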
Beyond the Spark UI, which is useful for determining how fast your data is being processed, you can also use third-party tools like spark-perf to load-test your cluster and obtain benchmark data that way.
You could also try Yahoo's streaming-benchmarks; I found that Databricks used that tool to benchmark Spark Streaming against Flink.
https://github.com/yahoo/streaming-benchmarks
https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
I want to get information about the workers being used by an application in a Spark cluster. I need their IP addresses, CPU cores, available memory, etc.
Is there any API in Spark for this purpose?
The screenshot above shows the same information in the Spark UI, but I cannot figure out how to get it from Java code.
My question is specific to Java.
I want information for all worker nodes.
Thanks.
There are multiple ways to do this:
Parse the output log messages and see what workers are started on each machine in your cluster. You can get the names/IPs of all the hosts, when tasks are started and where, how much memory each worker gets, etc. If you want to see the exact HW configuration, you will then need to log in to the worker nodes or use different tools.
The same information as in the web frontend is contained in the event logs of the Spark applications (this is actually where the data you see comes from). I prefer the event log over the log messages, as it is very easy to parse in Python (a Java sketch follows below as well).
If you want real-time monitoring of the cluster, you can use either Ganglia (which gives nice graphical displays of CPU/memory/network/disks) or colmux, which gives you the same data in a text format. I personally prefer colmux (easier to set up, you get immediate stats, etc.).
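For the event-log approach, here is a minimal Java sketch, since the original question asked for Java. It assumes an uncompressed event-log file and the JSON field names used by recent Spark versions (SparkListenerExecutorAdded events carrying the executor's host and core count); memory settings such as spark.executor.memory can usually be recovered from the environment-update event in the same log.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: extract executor host/core info from a Spark event log.
// Assumes an uncompressed event-log file and the field names used by recent
// Spark versions ("SparkListenerExecutorAdded", "Executor Info", ...).
public class EventLogWorkers {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (var lines = Files.lines(Paths.get(args[0]))) {
            lines.forEach(line -> {
                try {
                    JsonNode event = mapper.readTree(line);
                    if ("SparkListenerExecutorAdded".equals(event.path("Event").asText())) {
                        JsonNode info = event.path("Executor Info");
                        System.out.printf("executor=%s host=%s cores=%d%n",
                                event.path("Executor ID").asText(),
                                info.path("Host").asText(),
                                info.path("Total Cores").asInt());
                    }
                } catch (Exception e) {
                    // skip lines that are not valid JSON
                }
            });
        }
    }
}
```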