Setup and configuration of Titan for a Spark cluster and Cassandra - apache-spark

There are already several questions on the aurelius mailing list as well as here on stackoverflow about specific problems with configuring Titan to get it working with Spark. But what is missing in my opinion is a high-level description of a simple setup that uses Titan and Spark.
What I am looking for is a somewhat minimal setup that uses recommended settings. For example for Cassandra, the replication factor should be 3 and a dedicated datacenter should be used for analytics.
From the information I found in the documentation of Spark, Titan, and Cassandra, such a minimal setup could look like this:
Real-time processing DC: 3 Nodes with Titan + Cassandra (RF: 3)
Analytics DC: 1 Spark master + 3 Spark slaves with Cassandra (RF: 3)
Some questions I have about that setup and Titan + Spark in general:
Is that setup correct?
Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?
Is there another setup that you would use instead?
Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?
Maybe someone can even share a config file that supports such a setup (or a better one).

So I just tried it out and set up a simple Spark cluster to work with Titan (and Cassandra as the storage backend) and here is what I came up with:
High-Level Overview
I just concentrate on the analytics side of the cluster here, so I let out the real-time processing nodes.
Spark consists of one (or more) master and multiple slaves (workers). Since the slaves do the actual processing, they need to access the data they work on. Therefore Cassandra is installed on the workers and holds the graph data from Titan.
Jobs are sent from Titan nodes to the spark master who distributes them to his workers. Therefore, Titan basically only communicates with the Spark master.
The HDFS is only needed because TinkerPop stores intermediate results in it. Note, that this changed in TinkerPop 3.2.0.
Installation
HDFS
I just followed a tutorial I found here. There are only two things to keep in mind here for Titan:
Choose a compatible version, for Titan 1.0.0, this is 1.2.1.
TaskTrackers and JobTrackers from Hadoop are not needed, as we only want the HDFS and not MapReduce.
Spark
Again, the version has to be compatible, which is also 1.2.1 for Titan 1.0.0. Installation basically means extracting the archive with a compiled version. In the end, you can configure Spark to use your HDFS by exporting the HADOOP_CONF_DIR which should point to the conf directory of Hadoop.
Configuration of Titan
You also need a HADOOP_CONF_DIR on the Titan node from which you want to start OLAP jobs. It needs to contain a core-site.xml file that specifies the NameNode:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://COORDINATOR:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Add the HADOOP_CONF_DIR to your CLASSPATH and TinkerPop should be able to access the HDFS. The TinkerPop documentation contains more information about that and how to check whether HDFS is configured correctly.
Finally, a config file that worked for me:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
#
# Titan Cassandra InputFormat configuration
#
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.hostname=WORKER1,WORKER2,WORKER3
titanmr.ioformat.conf.storage.port=9160
titanmr.ioformat.conf.storage.keyspace=titan
titanmr.ioformat.cf-name=edgestore
#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=titan
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
#
# SparkGraphComputer Configuration
#
spark.master=spark://COORDINATOR:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
Answers
This leads to the following answers:
Is that setup correct?
It seems to be. At least it works with this setup.
Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?
Since it isn't required, I wouldn't do that as I prefer a separation of Spark and Titan servers which the user can access.
Is there another setup that you would use instead?
I would be happy to hear from someone else who has a different setup.
Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?
Since the Cassandra nodes (from the analytics DC) are explicitly configured, the Spark slaves shouldn't be able to pull data from completely different nodes. But I am still not sure about the second part. Maybe someone else can provide more insight here?

Related

Spark JobServer can use Cassandra as SharedDb

I have been doing a research about Configuring Spark JobServer Backend (SharedDb) with Cassandra.
And I saw in the SJS documentation that they cited Cassandra as one of the Shared DBs that can be used.
Here is the documentation part:
Spark Jobserver offers a variety of options for backend storage such as:
H2/PostreSQL or other SQL Databases
Cassandra
Combination of SQL DB or Zookeeper with HDFS
But I didn't find any configuration example for this.
Would anyone have an example? Or can help me to configure it?
Edited:
I want to use Cassandra to store metadata and jobs from Spark JobServer. So, I can hit any servers through a proxy behind of these servers.
Cassandra was supported in the previous versions of Jobserver. You just needed to have Cassandra running, add correct settings to your configuration file for Jobserver: https://github.com/spark-jobserver/spark-jobserver/blob/0.8.0/job-server/src/main/resources/application.conf#L60 and specify spark.jobserver.io.JobCassandraDAO as DAO.
But Cassandra DAO was recently deprecated and removed from the project, because it was not really used and maintained by the community.

How do I set up a HDFS file system to run a Spark job with HDFS?

I am interested in running Spark in standalone mode with Minio/HDFS.
This question asked exactly what I want: "I require a HDFS, is it thus enough to just use the file-system part of Hadoop?" -- but the accepted answer was not helpful, as it did not mention how to use HDFS with Spark.
I have downloaded Spark 2.4.3 pre-built for Apache Hadoop 2.7 and later.
I have followed the Apache Spark tutorials and successfully deployed one master (my local machine) and one worker (my RPi4 on the same local network). I was able to run a simple word count (counting words in /opt/spark/README.md).
Now I want to count words of a file that exists only on the master. I understand that I will need to use HDFS for this to share files across the local network. However, I don't have any idea how to do this, despite perusing both the Apache Spark and Hadoop documentation.
I am confused about the interplay between Spark and Hadoop. I don't know if I should be setting up a Hadoop cluster in addition to a Spark cluster. This tutorial on hadoop.apache.org doesn't seem to help, as it says that "you will need to start both the HDFS and YARN cluster". I want to run Spark in standalone mode, not YARN.
What do I need to do in order for me to run
val textFile = spark.read.textFile("file_that_exists_only_on_my_master")
and have the file be propagated to the worker nodes, i.e. not get a "File does not exist" error on the worker nodes?
I set up MinIO instead, and wrote the following Github Gist on instructions.
The trick is to set up core_site.xml to point to the MinIO server.
Github Gist here
<script src="https://gist.github.com/lieuzhenghong/c062aa2c5544d6b1a0fa5139e10441ad.js"></script>

How to setup Spark with a multi node Cassandra cluster?

First of all, I am not using the DSE Cassandra. I am building this on my own and using Microsoft Azure to host the servers.
I have a 2-node Cassandra cluster, I've managed to set up Spark on a single node but I couldn't find any online resources about setting it up on a multi-node cluster.
This is not a duplicate of how to setup spark Cassandra multi node cluster?
To set it up on a single node, I've followed this tutorial "Setup Spark with Cassandra Connector".
You have two high level tasks here:
setup Spark (single node or cluster);
setup Cassandra (single node or cluster);
This tasks are different and not related (if we are not talking about data locality).
How to setup Spark in Cluster you can find here Architecture overview.
Generally there are two types (standalone, where you setup Spark on hosts directly, or using tasks schedulers (Yarn, Mesos)), you should draw upon your requirements.
As you built all by yourself, I suppose you will use Standalone installation. The difference between one node is network communication. By default Spark runs on localhost, more commonly it uses FQDNS name, so you should configure it in /etc/hosts and hostname -f or try IPs.
Take a look at this page, which contains all necessary ports for nodes communication. All ports should be open and available between nodes.
Be attentive that by default Spark uses TorrentBroadcastFactory with random ports.
For Cassandra see this docs: 1, 2, tutorials 3, etc.
You will need 4 likely. You also could use Cassandra inside Mesos using docker containers.
p.s. If data locality it is your case you should come up with something yours, because nor Mesos, nor Yarn don't handle running spark jobs for partitioned data closer to Cassandra partitions.

How to set Cassandra as my Distributed Storage(File System) for my Spark Cluster

I am new to big data and Spark(pyspark).
Recently I just setup a spark cluster and wanted to use Cassandra File System (CFS) on my spark cluster to help upload files.
Can any one tell me how to set it up and briefly introduce how to use CFS system? (like how to upload files / from where)
BTW I don't even know how to use HDFS(I downloaded pre-built spark-bin-hadoop but I can't find hadoop in my system tho.)
Thanks in advance!
CFS only exists in DataStax Enterprise and isn't appropriate for most Distributed File applications. It's primary focused as a substitute for HDFS for map/reduce jobs and small temporary but distributed files.
To use it you just use the CFS:// uri and make sure you are using dse spark-submit from your application.

Connecting to Remote Spark Cluster for TitanDB SparkGraphComputer

I am attempting to leverage a Hadoop Spark Cluster in order to batch load a graph into Titan using the SparkGraphComputer and BulkLoaderVertex program, as specified here. This requires setting the spark configuration in a properties file, telling Titan where Spark is located, where to read the graph input from, where to store its output, etc.
The problem is that all of the examples seem to specify a local spark cluster through the option:
spark.master=local[*]
I, however, want to run this job on a remote Spark cluster which is on the same VNet as the VM where the titan instance is hosted. From what I have read, it seems that this can be accomplished by setting
spark.master=<spark_master_IP>:7077
This is giving me the error that all Spark masters are unresponsive, which disallows me from sending the job to the spark cluster to distribute the batch loading computations.
For reference, I am using Titan 1.0.0 and a Spark 1.6.4 cluster, which are both hosted on the same VNet. Spark is being managed by yarn, which also may be contributing to this difficulty.
Any sort of help/reference would be appreciated. I am sure that I have the correct IP for the spark master, and that I am using the right gremlin commands to accomplish bulk loading through the SparkGraphComputer. What I am not sure about is how to properly configure the Hadoop properties file in order to get Titan to communicate with a remote Spark cluster over a VNet.

Resources