How to set up Spark with a multi-node Cassandra cluster? - apache-spark

First of all, I am not using the DSE Cassandra. I am building this on my own and using Microsoft Azure to host the servers.
I have a 2-node Cassandra cluster. I've managed to set up Spark on a single node, but I couldn't find any online resources about setting it up on a multi-node cluster.
This is not a duplicate of how to setup spark Cassandra multi node cluster?
To set it up on a single node, I've followed this tutorial "Setup Spark with Cassandra Connector".

You have two high level tasks here:
set up Spark (single node or cluster);
set up Cassandra (single node or cluster).
These tasks are distinct and not related (unless we are talking about data locality).
How to set up Spark as a cluster is described here: Architecture overview.
Generally there are two types (standalone, where you set up Spark on the hosts directly, or using a task scheduler such as YARN or Mesos); you should choose based on your requirements.
Since you built everything yourself, I assume you will use a standalone installation. The difference from a single node is network communication. By default Spark runs on localhost; more commonly it uses FQDN names, so you should configure them in /etc/hosts and verify with hostname -f, or try IPs.
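As a sketch, assuming two hosts with the placeholder names spark-master and spark-worker1 (the names and IPs are illustrative, not from the question), the /etc/hosts entries and the master's conf/spark-env.sh might look like:

```
# /etc/hosts on every node (names and IPs are placeholders)
10.0.0.4  spark-master
10.0.0.5  spark-worker1

# conf/spark-env.sh on the master host
SPARK_MASTER_HOST=spark-master
```

Each worker then registers against spark://spark-master:7077 instead of localhost.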
Take a look at this page, which lists all ports necessary for node communication. All of these ports should be open and reachable between nodes.
Be aware that by default Spark uses TorrentBroadcastFactory with random ports.
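If you need fixed ports, e.g. for firewall rules between the Azure hosts, one sketch (the port numbers are arbitrary choices, not defaults) is to pin them in conf/spark-defaults.conf:

```
spark.driver.port               40000
spark.blockManager.port         40010
spark.driver.blockManager.port  40020
spark.port.maxRetries           32
```

With spark.port.maxRetries set, Spark will probe upward from the base port, so the firewall range has to cover base port + maxRetries.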
For Cassandra see these docs: 1, 2, tutorial 3, etc.
You will most likely need 4. You could also run Cassandra inside Mesos using Docker containers.
p.s. If data locality is your concern, you will have to come up with something yourself, because neither Mesos nor YARN schedules Spark jobs for partitioned data close to the corresponding Cassandra partitions.

Related

Does Spark streaming needs HDFS with Kafka

I have to design a setup to read incoming streaming data from Twitter. I have decided to use Apache Kafka with Spark Streaming for real-time processing, and the analytics must be shown in a dashboard.
Being a newbie in this domain, I assume a maximum data rate of 10 Mb/sec. I have decided to use one machine with 12 cores and 16 GB of memory for Kafka; ZooKeeper will also be on the same machine. I am now confused about Spark: it only has to perform the streaming analysis, after which the analyzed output is pushed to a DB and the dashboard.
My points of confusion:
Should I run Spark on a Hadoop cluster or on a local file system?
Can Spark's standalone mode fulfill my requirements?
Is my approach appropriate, and what would be best in this case?
An attempt at an answer:
Should I run Spark on Hadoop cluster or local file system ?
I recommend using HDFS: it can store more data and ensures high availability.
Is standalone mode of Spark can fulfill my requirements ?
Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN doesn’t need to run a separate ZooKeeper Failover Controller.
YARN will likely be preinstalled in many Hadoop distributions, such as CDH.
So I recommend YARN.
Useful links:
spark yarn doc
spark standalone doc
other wonderful answer
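With YARN in place, pointing Spark at it is just a matter of the --master flag. A hedged sketch (the class and jar names are placeholders, and the executor sizing is an arbitrary illustration for a 12-core box):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  --class com.example.TwitterStreamJob \
  twitter-stream-job.jar
```

In yarn-cluster deploy mode the driver itself runs inside the YARN application master, so the submitting machine can disconnect after submission.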
Is my approach is appropriate or what should be best in this case ?
If your data is not more than 10 million records, I think you can use a local cluster to do it.
Local mode avoids shuffling across many nodes; shuffles between processes are faster than shuffles between nodes.
Otherwise, I recommend at least 3 nodes, that is, a real Hadoop cluster.
As a Spark beginner, this is my understanding; I hope an expert corrects me.

hadoop cluster + any way to disable spark application to run on specific data nodes

We have a Hadoop cluster (HDP 2.6.5 with Ambari, with 25 datanode machines).
We are using a Spark streaming application (Spark 2.1 running on Hortonworks 2.6.x).
Currently the Spark streaming application runs on all datanode machines,
but now we want it to run only on the first 10 datanode machines,
so the last 15 datanode machines will be restricted, and the Spark application will run only on the first 10 datanode machines.
Can this scenario be done with Ambari features or some other approach?
for example we found the - https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/configuring_node_labels.html ,
and
http://crazyadmins.com/configure-node-labels-on-yarn/
but we are not sure whether Node Labels can help us
#Jessica Yes, you are absolutely on the right path. YARN Node Labels and YARN queues are how Ambari administrators control team-level access to portions of the entire YARN cluster. You can start very basic with just non-default queues, or get very in-depth with many queues for many different teams. Node labels take it to another level, allowing you to map queues and teams to specific nodes.
Here is a post with the syntax for spark to use the yarn queue:
How to choose the queue for Spark job using spark-submit?
I tried to find the 2.6 version of these docs but was not able to... they have really mixed up the docs since the merger...
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ch_node_labels.html
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/data-operating-system/content/configuring_node_labels.html
The actual steps you have to take may be a combination of items from both; that is my typical experience when working with Ambari HDP/HDF.
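A sketch of the node-label route (the label, queue, node, and jar names here are hypothetical): label the 10 permitted datanodes, map a capacity-scheduler queue to that label, then pin the Spark application master and executors to it.

```shell
# create the label and attach it to the permitted datanodes
yarn rmadmin -addToClusterNodeLabels "spark_nodes"
yarn rmadmin -replaceLabelsOnNode "datanode01=spark_nodes datanode02=spark_nodes"

# submit to a queue that the capacity scheduler maps to that label
spark-submit \
  --master yarn \
  --queue spark_queue \
  --conf spark.yarn.am.nodeLabelExpression=spark_nodes \
  --conf spark.yarn.executor.nodeLabelExpression=spark_nodes \
  streaming-app.jar
```

Containers for this application should then be placed only on nodes carrying the spark_nodes label; the remaining 15 datanodes keep serving HDFS but get no Spark containers from this queue.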

Run parallel jobs on-prem dynamic spark clusters

I am new to Spark, and we have a requirement to set up a dynamic Spark cluster to run multiple jobs. Referring to some articles, we could achieve this with Amazon's EMR service.
Is there any way the same setup can be done on-premises?
Once the Spark clusters are available, with services running on different ports on different servers, how do I point Mist to a new Spark cluster for each job?
Thanks in advance.
Yes, you can use the standalone cluster manager that Spark provides, with which you can set up a Spark cluster (master and worker nodes). There are also Docker containers that can be used to achieve that; take a look here.
Another option is to deploy a Hadoop ecosystem locally, such as MapR, Hortonworks, or Cloudera.
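For the standalone route, a cluster can be brought up with the scripts shipped in Spark's sbin directory (the master hostname below is a placeholder; start-slave.sh is the Spark 2.x script name, renamed start-worker.sh in 3.x):

```shell
# on the master host
$SPARK_HOME/sbin/start-master.sh

# on each worker host, pointing at the master's URL
$SPARK_HOME/sbin/start-slave.sh spark://spark-master:7077
```

Each such cluster exposes its own master URL and web UI port, which is what you would hand to Mist per job.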

Datastax Spark Sql Thriftserver with Spark Application

I have an analytics node running, with the Spark SQL Thrift server on it. Now I can't run another Spark application with spark-submit;
it says it doesn't have resources. How do I configure the DSE node to be able to run both?
The Spark SQL Thrift server is a Spark application like any other, which means that by default it requests and reserves all resources in the cluster.
There are two options if you want to run multiple applications at the same time:
Allocate only part of your resources to each application.
This is done by setting spark.cores.max to a smaller value than the total resources available in your cluster.
See Spark Docs
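For example (the cap of 4 cores is an arbitrary illustration), in conf/spark-defaults.conf:

```
spark.cores.max  4
```

The same cap can be set per application at submit time with --conf spark.cores.max=4; whatever cores the Thrift server does not reserve remain available for other applications.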
Dynamic Allocation
Which allows applications to change the amount of resources they use depending on how much work they are trying to do.
See Spark Docs
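A minimal sketch of such a configuration (the executor bounds are illustrative; dynamic allocation also requires the external shuffle service to be enabled):

```
spark.dynamicAllocation.enabled       true
spark.shuffle.service.enabled         true
spark.dynamicAllocation.minExecutors  1
spark.dynamicAllocation.maxExecutors  8
```

With this in place, an idle application releases executors back to the cluster instead of holding them, so the Thrift server and batch jobs can coexist.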

Cassandra clusters separation

I have one single-node cluster and just added a multi-node cluster (on 4 separate nodes; let's call them node1, node2, ..., node4). The single-node cluster uses localhost as its seed provider; the multi-node cluster uses the node1 and node2 hosts as seeds (SimpleSeedProvider).
To my surprise, when I started the multi-node cluster, its nodes started talking to the single-node Cassandra and downloaded data from it.
How can I prevent the new cluster from talking to the existing cluster? Am I missing anything else?
They will "gossip" on the network and detect each other if they are not separated.
Did you make sure the cluster_name value in your cassandra.yaml file is not the same for both of your clusters? That is how they differentiate each other, as stated in the sample configuration file:
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
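For example, giving the two clusters distinct names in their cassandra.yaml files (the names here are illustrative):

```yaml
# cassandra.yaml on the single-node cluster
cluster_name: 'Cluster One'

# cassandra.yaml on node1..node4
cluster_name: 'Cluster Two'
```

A node that gossips with a peer whose cluster_name differs will refuse to join it, which is exactly the separation you want here.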
