Spark UI available on Dataproc Cluster? - apache-spark

I'm looking to interact with the traditional Spark web UI on default clusters in Dataproc.

This can be done by creating an SSH tunnel to the Dataproc master node. By using a SOCKS proxy you can then reach all of the applications running on YARN, including your Spark sessions.
This guide walks you through it in detail:
Dataproc Cluster web interfaces
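In outline (the cluster name, zone, and local port below are placeholders), you open a dynamic SSH tunnel to the master node, then start a browser that routes through the resulting SOCKS proxy; the YARN ResourceManager UI listens on port 8088 and the Spark history server on 18080:

    # 1. Open a SOCKS tunnel to the master node (local port 1080 is arbitrary)
    gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N

    # 2. Launch a browser through the tunnel, using a throwaway profile
    google-chrome --proxy-server="socks5://localhost:1080" \
        --user-data-dir=/tmp/my-cluster-m http://my-cluster-m:8088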

Related

How to connect to an existing Kafka MSK Cluster step by step?

I would like to know more about how to connect to a Kafka MSK Cluster.
You need a Kafka client.
Since you have Node.js, start with kafkajs or node-rdkafka. I'd recommend kafkajs, since it supports SASL authentication with AWS IAM.
Beyond that, MSK is an implementation detail: the same steps work for any Kafka cluster (subject to whatever other security is in place). Note, however, that cloud providers will require you to allow your client's IP address space within the VPC, and you can follow the official getting-started guide to verify that a basic client works.
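For example, a minimal kafkajs consumer sketch in TypeScript (the broker address, credentials, group, and topic are placeholders; get the real bootstrap brokers from the AWS console or the aws kafka get-bootstrap-brokers CLI):

    import { Kafka } from 'kafkajs'

    const kafka = new Kafka({
      clientId: 'my-app',                                // hypothetical client id
      brokers: ['b-1.mymsk.example.amazonaws.com:9098'], // placeholder MSK broker
      ssl: true,                                         // MSK IAM listeners require TLS
      sasl: {
        mechanism: 'aws',                                // kafkajs's built-in AWS IAM mechanism
        authorizationIdentity: 'AIDAEXAMPLE',            // IAM user/role id
        accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
        secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
      },
    })

    async function main() {
      const consumer = kafka.consumer({ groupId: 'my-group' })
      await consumer.connect()
      await consumer.subscribe({ topics: ['my-topic'], fromBeginning: true })
      await consumer.run({
        eachMessage: async ({ topic, partition, message }) => {
          console.log(`${topic}[${partition}] ${message.value?.toString()}`)
        },
      })
    }

    main().catch(console.error)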

spark history server shutting down

I installed Spark and am able to run Spark jobs,
but I can't access the Spark history server.
When I start it, I can see the service is running:
[screenshot: services running]
After a few seconds, though, the service is gone,
and trying to access the web UI on port 18080 returns "site can't be reached".

How to expose metrics to Prometheus from springboot application CF cluster

I am developing an application from scratch using Spring Boot (JHipster) and intend to deploy it to Cloud Foundry with multiple nodes. I plan to set up a Prometheus server that can pull metrics from the cluster.
I am trying to understand how I can set up Prometheus to query the individual nodes of the cluster. As the application is deployed in Cloud Foundry, it may not be possible to obtain the IP addresses of individual nodes.
As I am a newbie with Prometheus, I want to make sure I am approaching this correctly.
I would recommend using Micrometer (micrometer.io).
For reaching the individual nodes, you can either use service discovery (e.g. Netflix Eureka) or use the Pushgateway in some form:
https://prometheus.io/docs/instrumenting/pushing/
https://prometheus.github.io/client_java/io/prometheus/client/exporter/PushGateway.html
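The second link is the Java client's PushGateway; purely to illustrate the push model, here is a minimal TypeScript sketch using the prom-client package (the gateway address and the metric/job names are placeholders). Each app instance pushes its own metrics, so Prometheus never has to discover the Cloud Foundry node IPs:

    import { Pushgateway, Registry, Gauge } from 'prom-client'

    const registry = new Registry()
    const gateway = new Pushgateway('http://pushgateway.example.com:9091', {}, registry)

    const lastRun = new Gauge({
      name: 'app_batch_last_run_seconds',
      help: 'Unix time of the last completed run',
      registers: [registry],
    })
    lastRun.setToCurrentTime()

    // jobName (plus optional grouping labels) identifies this instance's metric group
    gateway
      .pushAdd({ jobName: 'my-cf-app', groupings: { instance: 'node-1' } })
      .catch((err) => console.error('push failed', err))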

Is there any general rule when assigning role to gateway in Cloudera Spark2?

I am planning to upgrade the existing Spark 1.6 to 2.1 in Cloudera, and I was advised to assign the gateway role to all NodeManager and ResourceManager nodes. The current gateway role is assigned to a proxy node, which is not included in the planned Spark2 setup; the reason is that the proxy node already has too many (20+) roles. I wonder if anyone can give any suggestions here? I checked the Cloudera docs and I don't see a guideline on this (or maybe I missed it?).
Thanks a lot.
I have a slight disagreement with the other answer, which says:

    By default any host running a service will have the config files
    included so you don't need to add a gateway role to your Node Manager
    and Resource Manager roles
Just having NodeManager and ResourceManager running on a host only gives you the configuration files for YARN, not Spark2. That said, you only need to deploy the Spark gateway role to your edge node, the one where end users are allowed to log in and run command-line tools such as beeline, the hdfs command, and spark-shell/spark-submit. As a security policy, no one should be allowed to log in to your NodeManager/DataNode hosts.
In your case, the edge node appears to be what you call the proxy node. A gateway is just configuration files, not a running process, so I don't think you need to be concerned about the number of roles already on it.
A gateway role just holds the config files, such as /etc/hadoop/conf/*. It allows clients running on that host (the hdfs, hadoop, yarn, and spark CLIs) to submit commands to the cluster. By default, any host running a service will have the config files included, so you don't need to add a gateway role to your NodeManager and ResourceManager roles.
The official documentation describes it as such:
Managing Roles: Gateway Roles
A gateway is a special type of role whose sole purpose is to designate a host that should receive a client configuration for a specific service, when the host does not have any roles running on it. Gateway roles enable Cloudera Manager to install and manage client configurations on that host. There is no process associated with a gateway role, and its status will always be Stopped. You can configure gateway roles for HBase, HDFS, Hive, Kafka, MapReduce, Solr, Spark, Sqoop 1 Client, and YARN.
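As a concrete sketch (the paths are CDH defaults; the host is hypothetical): once you assign the Spark2 gateway role to your edge node and deploy client configuration, that host gets only config files, and users submit work to the cluster from it:

    # on the edge (gateway) host: no daemons, just deployed client configs
    ls /etc/spark2/conf
    # the configs point at the cluster, so the CLI tools work from here
    spark2-submit --master yarn --deploy-mode cluster my_etl_job.jar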

what approach I should use for external access to Cassandra running inside kubernetes

I have a StatefulSet Cassandra deployment that works great for services deployed to Kubernetes with namespace access, but I also have an ETL job that runs in EMR and needs to load data into that Cassandra cluster.
What would be the main approach/Kubernetes way of doing this?
I can think of two options.
The simple one is to create the Service with type: NodePort; with this you can connect to the server at any node's IP address and the allocated port number.
The second option is to create a load balancer (a Service of type LoadBalancer) in front of the Cassandra cluster and connect through it.
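For the NodePort option, a minimal client-side sketch in TypeScript with the DataStax cassandra-driver (the node IP, port, data center, and keyspace are all placeholders; the driver for your EMR job's language takes the same parameters):

    import { Client } from 'cassandra-driver'

    const client = new Client({
      // Any worker node's IP plus the NodePort allocated to the Cassandra Service
      contactPoints: ['10.0.0.12'],
      protocolOptions: { port: 30042 },  // hypothetical NodePort
      localDataCenter: 'datacenter1',    // must match the cluster's DC name
      keyspace: 'etl',                   // placeholder keyspace
    })

    async function main() {
      await client.connect()
      const rs = await client.execute('SELECT release_version FROM system.local')
      console.log(rs.first()['release_version'])
      await client.shutdown()
    }

    main().catch(console.error)

One caveat with either option: the driver also discovers peer addresses from the cluster's system tables, and those are pod IPs that are unreachable from outside, so a real deployment may additionally need an address translator or host networking.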
