Can Spark JobServer use Cassandra as its SharedDb? - apache-spark

I have been researching how to configure the Spark JobServer backend (SharedDb) with Cassandra.
I saw in the SJS documentation that Cassandra is cited as one of the shared DBs that can be used.
Here is the relevant part of the documentation:
Spark Jobserver offers a variety of options for backend storage such as:
H2/PostgreSQL or other SQL Databases
Cassandra
Combination of SQL DB or Zookeeper with HDFS
But I didn't find any configuration example for this.
Does anyone have an example, or can anyone help me configure it?
Edit:
I want to use Cassandra to store metadata and jobs from Spark JobServer, so that I can hit any of the servers through a proxy sitting in front of them.

Cassandra was supported in previous versions of Jobserver. You just needed to have Cassandra running, add the correct settings to your Jobserver configuration file (https://github.com/spark-jobserver/spark-jobserver/blob/0.8.0/job-server/src/main/resources/application.conf#L60), and specify spark.jobserver.io.JobCassandraDAO as the DAO.
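For reference, the relevant block in those older versions looked roughly like the sketch below; the key names are my recollection and should be checked against the linked application.conf, and the hosts and credentials are placeholders:
spark {
  jobserver {
    # Switch the DAO from the default SQL implementation to the Cassandra one
    jobdao = spark.jobserver.io.JobCassandraDAO

    # Connection settings for the Cassandra-backed DAO (placeholder values)
    cassandra {
      hosts = ["127.0.0.1:9042"]
      user = ""
      password = ""
      consistency = "ONE"
    }
  }
}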
However, the Cassandra DAO was recently deprecated and removed from the project, because it was not really used or maintained by the community.

Related

'Hive on Spark' in DataStax Enterprise (DSE)?

DSE 6 comes pre-bundled with Cassandra and Spark SQL. Has anyone also set up 'Hive on Spark' there? I wonder whether Spark version conflicts would be an issue. The reason I want this is that Hive seems to allow masking/authorization with Ranger, but Spark SQL doesn't.
This answer is not directly related to setting up Hive, but DSE has security (authentication/authorization/...) built in (see the FAQ), and it is supported by all components, including Spark SQL. If you want more granular permissions, you can set up row-level access control.

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
With DSE 5.0.8 (Spark 1.6.3) this works fine, but with DSE 5.1.0 it now fails with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the use-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within Spark, it seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
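For example (the class name and jar path are placeholders), something along the lines of:
dse spark-submit --class com.example.MyJob /path/to/my-job-assembly.jar
dse spark-submit picks up the DSE master itself, which is why no --master option is needed.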
Resource Manager Changes
The DSE Spark resource manager is different from the OSS Spark Standalone resource manager. The DSE method uses a different URI, "dse://", because under the hood it is actually performing a CQL-based request. This has a number of benefits over the Spark RPC but, as noted, does not match some of the submission mechanisms possible in OSS Spark.
There are several articles on this on the DataStax blog, as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using setJars, you must also add the DSE-specific jars and config options to talk to the resource manager. In DSE 5.1.3+ there is a provided class,
DseConfiguration
which can be applied to your SparkConf via DseConfiguration.enableDseSupport(conf) (or invoked via an implicit) and which will set these options for you.
Example
Docs
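Roughly, that programmatic route might look like the Scala sketch below. The import path for DseConfiguration, the exact form of the dse:// master URI, and whether enableDseSupport returns an enriched conf or mutates it in place are assumptions here (they come from the DSE jars and the docs linked above), so verify them for your DSE version:
import org.apache.spark.{SparkConf, SparkContext}
// Assumed import path; DseConfiguration ships in the DSE jars (DSE 5.1.3+),
// so check the actual package in your DSE distribution.
import com.datastax.bdp.spark.DseConfiguration

val conf = new SparkConf(true)
  .setAppName("my-app")
  .setMaster("dse://10.0.0.1")                  // dse:// resource manager URI instead of spark://host:7077
  .setJars(Seq("/path/to/my-app-assembly.jar")) // ship your fat jar plus the DSE-specific jars

// Per the answer above: applies the DSE-specific options needed to talk to the dse:// resource manager
val dseConf = DseConfiguration.enableDseSupport(conf)
val sc = new SparkContext(dseConf)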
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application in DSE 5.1; it has to be sent with dse spark-submit.
Once sent, it works perfectly. To communicate with the job I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.

Setup and configuration of Titan for a Spark cluster and Cassandra

There are already several questions on the aurelius mailing list as well as here on stackoverflow about specific problems with configuring Titan to get it working with Spark. But what is missing in my opinion is a high-level description of a simple setup that uses Titan and Spark.
What I am looking for is a somewhat minimal setup that uses recommended settings. For example for Cassandra, the replication factor should be 3 and a dedicated datacenter should be used for analytics.
From the information I found in the documentation of Spark, Titan, and Cassandra, such a minimal setup could look like this:
Real-time processing DC: 3 Nodes with Titan + Cassandra (RF: 3)
Analytics DC: 1 Spark master + 3 Spark slaves with Cassandra (RF: 3)
Some questions I have about that setup and Titan + Spark in general:
Is that setup correct?
Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?
Is there another setup that you would use instead?
Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?
Maybe someone can even share a config file that supports such a setup (or a better one).
So I just tried it out and set up a simple Spark cluster to work with Titan (and Cassandra as the storage backend) and here is what I came up with:
High-Level Overview
I concentrate only on the analytics side of the cluster here, so I leave out the real-time processing nodes.
Spark consists of one (or more) master and multiple slaves (workers). Since the slaves do the actual processing, they need to access the data they work on. Therefore Cassandra is installed on the workers and holds the graph data from Titan.
Jobs are sent from the Titan nodes to the Spark master, which distributes them to its workers. Therefore, Titan basically only communicates with the Spark master.
HDFS is only needed because TinkerPop stores intermediate results in it. Note that this changed in TinkerPop 3.2.0.
Installation
HDFS
I just followed a tutorial I found here. There are only two things to keep in mind here for Titan:
Choose a compatible Hadoop version; for Titan 1.0.0, this is 1.2.1.
TaskTrackers and JobTrackers from Hadoop are not needed, as we only want the HDFS and not MapReduce.
Spark
Again, the version has to be compatible, which is also 1.2.1 for Titan 1.0.0. Installation basically means extracting the archive of a pre-built version. Finally, you can configure Spark to use your HDFS by exporting HADOOP_CONF_DIR, which should point to the conf directory of Hadoop.
Configuration of Titan
You also need a HADOOP_CONF_DIR on the Titan node from which you want to start OLAP jobs. It needs to contain a core-site.xml file that specifies the NameNode:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://COORDINATOR:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Add the HADOOP_CONF_DIR to your CLASSPATH and TinkerPop should be able to access the HDFS. The TinkerPop documentation contains more information about that and how to check whether HDFS is configured correctly.
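For example, on the Titan node (the Hadoop installation path is a placeholder for wherever your Hadoop configuration lives):
export HADOOP_CONF_DIR=/usr/local/hadoop/conf
export CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR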
Finally, a config file that worked for me:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
#
# Titan Cassandra InputFormat configuration
#
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.hostname=WORKER1,WORKER2,WORKER3
titanmr.ioformat.conf.storage.port=9160
titanmr.ioformat.conf.storage.keyspace=titan
titanmr.ioformat.cf-name=edgestore
#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=titan
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
#
# SparkGraphComputer Configuration
#
spark.master=spark://COORDINATOR:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
Answers
This leads to the following answers:
Is that setup correct?
It seems to be. At least it works with this setup.
Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?
Since it isn't required, I wouldn't do it, as I prefer to keep the Spark servers separate from the Titan servers that users can access.
Is there another setup that you would use instead?
I would be happy to hear from someone else who has a different setup.
Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?
Since the Cassandra nodes (from the analytics DC) are explicitly configured, the Spark slaves shouldn't be able to pull data from completely different nodes. But I am still not sure about the second part. Maybe someone else can provide more insight here?

DataStax Enterprise: Submitting a Spark 0.9.1 app to a DSE cluster the right way

I have a running analytics (Spark-enabled) DSE cluster of 8 nodes. The Spark shell is working fine.
Now I would like to build a Spark app and deploy it on the cluster using the command "dse spark-class", which I guess is the right tool for the job, according to the DSE documentation.
I built the app with sbt assembly and I got the fat jar of my app.
Then, after a lot of digging, I figured out that I have to export the env var $SPARK_CLIENT_CLASSPATH, because it is referenced by the spark-class command:
export SPARK_CLIENT_CLASSPATH=<fat jar full path>
Now I'm able to invoke:
dse spark-class <main Class>
The app crashes immediately with a ClassNotFoundException: it doesn't recognize internal classes of my app.
The only way I have been able to make it work has been to initialize the SparkConf as follows:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "cassandrahost")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
.setJars(Seq("fat-jar-full-path"))
val sc = new SparkContext("spark://masterurl:7077", "DataGenerator", conf)
The setJars method makes it possible to ship my jar to the cluster workers.
Is that the only way to accomplish this? I think it's pretty ugly and not portable.
Is it possible to use an external configuration to set the master URL, the Cassandra host, and the app jar path?
I have seen that starting from Spark 1.0 there is the spark-submit command, which allows specifying the app jar externally. Is it possible to update Spark to version 1.1 in DSE 4.5.3?
Thanks a lot
You can use spark-submit with DSE 4.6, which just dropped today (Dec 3rd, 2014) and includes Spark 1.1.
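With dse spark-submit the master URL is set for you, and settings such as the Cassandra host can be passed externally instead of being hard-coded in the SparkConf. A rough example (class name, jar path, and host are placeholders):
dse spark-submit --class com.example.DataGenerator \
  --conf spark.cassandra.connection.host=cassandrahost \
  /path/to/fat-jar-of-your-app.jar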
Here are the new features:
LDAP authentication
Enhanced audit logging:
-Audit logging configuration is decoupled from log4j
-Logging to a Cassandra table
-Configurable consistency levels for table logging
-Optional asynchronous logging for better performance when logging to a table
Spark enhancements:
-Spark 1.1 integration
-Spark Java API support
-Spark Python API (PySpark) support
-Spark SQL support
-Spark Streaming
-Kerberos support for connecting Spark components to Cassandra
DSE Search enhancements:
-Simplified, automatic resource generation
-New dsetool commands for creating, reloading, and managing Solr core resources
-Redesigned implementation of CQL Solr queries for production usage
-Solr performance objects
-Tuning index size and range query speed
-Restricted query routing for experts
-Ability to use virtual nodes (vnodes) in Solr nodes. Recommended range: 64 to 256 (overhead increases by approximately 30%)
Check out the docs here:
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/newFeatures.html
As usual you can download here with your credentials:
http://downloads.datastax.com/enterprise/opscenter.tar.gz
http://downloads.datastax.com/enterprise/dse-4.6-bin.tar.gz

Performing Analytics over Cassandra DB

I work for a small company and am very new to Apache Cassandra. I am studying Cassandra and performing some small analytics, like a sum function, on the Cassandra DB for creating reports. For that, Hive and Accunu could be choices.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such integration? Is there any other way to achieve the Hive and Cassandra integration? If so, can I get links or documents about it? Is it possible to do the same on the Windows platform?
Is there any other solution to perform analytics on a Cassandra DB?
Thanks in advance.
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.
Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster colocated with the Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read/write Cassandra data from/to your Hadoop jobs.
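As a rough illustration of that approach, a Hadoop job driver could be configured along the lines of the Scala sketch below. This is only a sketch: the keyspace, table, and host values are placeholders, and the exact ConfigHelper/Thrift calls should be checked against your Cassandra version.
import java.nio.ByteBuffer
import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}
import org.apache.cassandra.thrift.{SlicePredicate, SliceRange}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object CassandraSumJob {
  def main(args: Array[String]): Unit = {
    // Hadoop 2 API; on older Hadoop 1.x use: new Job(new Configuration(), "cassandra-sum")
    val job = Job.getInstance(new Configuration(), "cassandra-sum")

    // Where Cassandra lives and which data to read (placeholder values)
    ConfigHelper.setInputInitialAddress(job.getConfiguration, "127.0.0.1")
    ConfigHelper.setInputRpcPort(job.getConfiguration, "9160")
    ConfigHelper.setInputPartitioner(job.getConfiguration, "Murmur3Partitioner")
    ConfigHelper.setInputColumnFamily(job.getConfiguration, "my_keyspace", "my_table")

    // Read every column of each row
    val allColumns = new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, Int.MaxValue)
    ConfigHelper.setInputSlicePredicate(job.getConfiguration, new SlicePredicate().setSlice_range(allColumns))

    job.setInputFormatClass(classOf[ColumnFamilyInputFormat])
    // ... set a mapper/reducer here that computes the sum, and an output format
    //     (e.g. ColumnFamilyOutputFormat to write the report back to Cassandra) ...

    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}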
