Connecting to Cassandra with Spark

Connecting to Cassandra with Spark - cassandra

First, I have bought the new O'Reilly Spark book and tried those Cassandra setup instructions. I've also found other stackoverflow posts and various posts and guides over the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 Virtual Box VM provided by plasetcassandra.org linked from the main Cassandra project page.
I downloaded Spark 1.2.1 source and got the latest Cassandra Connector code from github and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 setup on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it is getting errors trying to find Cassandra at 127.0.0.1, but it also recognizes the host name that I configured which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a spark context or pass in the --driver-clas-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
Any ideas?

Check the rpc_address config in your cassandra.yaml file on your cassandra node. It's likely that the spark connector is using that value from the system.local/system.peers tables and it may be set to 127.0.0.1 in your cassandra.yaml.
The spark connector uses thrift to get token range splits from cassandra. Eventually I'm betting this will be replaced as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it's getting the host metadata to find the nearest host and then making the query using thrift on port 9160.

Related

How to print out Spark connection of Spark session ?

Suppose I run pyspark command and got global variable spark of type SparkSession. As I understand, this spark holds a connection to the Spark master. Can I print out the details of this connection including the hostname of this Spark master ?

For basic information you can use master property:
spark.sparkContext.master
To get details on YARN you might have to dig through hadoopConfiguration:
hadoopConfiguration = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConfiguration.get("yarn.resourcemanager.hostname")
or
hadoopConfiguration.get("yarn.resourcemanager.address")
When submitted to YARN Spark uses Hadoop configuration to determine the resource manger so these values should match ones present in configuration placed in HADOOP_CONF_DIR or YARN_CONF_DIR.

Cassandra Connection with Groovy Script In SoapUI

thanks for the time. I am trying to access a remote Cassandra DB in order to complete my assertions. I see that the Server is running:
Cassandra V 3.0.8.1293
Driver Type: Cassandra CQL
Datastax Java Driver for Apache Cassandra - Core [3.0.5]
So, I am trying with the following simple code to access the DB
import com.datastax.driver.core.*
Cluster cluster = null;
try {
cluster = Cluster.builder().addContactPoint("x.x.x.x").withCredentials("xxxxxxx", "xxxxxx").withPort(9042).build()
Session session = cluster.connect();
ResultSet rs = session.execute("select * from TABLE");
Row row = rs.one();
} finally {
if (cluster != null) cluster.close();
}
when I use the cassandra-driver-core-2.0.1.jar I am getting the error :
ERROR:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /x.x.x.x(null))
Read the documentation and a lot of posts here and on other blogs and I saw that there may be an incompatibility with the driver version so I tried to upgrade the driver to many versions (cassandra-driver-core-2.5,cassandra-driver-core-3,cassandra-driver-core-3.2), but on that I am getting the following:
ERROR:java.lang.ExceptionInInitializerError
Have also tried to connect using JDBC, but to no avail, using the configuration proposed at this thread
SoapUI JDBC connection with Apache Cassandra
Actually I am running out of ideas. Can anyone propose or point to some direction on how to actually achieve this, either by pointing me to some tutorial or any idea.
Thank you very much

I think you haven't enable remote access to cassandra.
Try enabling remote access using below configuration -
File Path /etc/cassandra/default.conf/cassandra.yaml
rpc_address: 0.0.0.0
broadcast_rpc_address: <serverIPAddress>
After that, restart cassandra service.

How to configure SSL between Spark and Cassandra?

I'm trying to configure SSL for the Cassandra Spark connector, but I couldn't find an example of how to do it.
I'm trying to configure it like this:
SparkConf conf = new SparkConf().setAppName("someApp")
.set("spark.cassandra.connection.host", "111.111.111.111")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.connection.ssl.trustStore.path", "/some/tfile.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "apassword")
.set("spark.cassandra.connection.ssl.trustStore.type", "JKS")
.set("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.set("spark.cassandra.connection.ssl.keyStore.path", "/some/kfile.jks")
.set("spark.cassandra.connection.ssl.keyStore.password", "anotherpassword")
.set("spark.cassandra.connection.ssl.keyStore.type", "JKS")
.set("spark.cassandra.connection.ssl.protocol", "TLS");
When I try to submit the spark job, I get these errors:
Exception in thread "main" com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.ssl.keyStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabled is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.protocol is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabledAlgorithms is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
So I'm not sure if this is supported or I'm just using the wrong property names.
I saw this ticket for release 1.2.3 of the connector, but I couldn't find an example of how to use it and it sounded like it may not support keystores. I'm using version 1.4.0-M1 of the connector.
Can anyone show me an example of how to configure SSL for the Spark Cassandra connector? Thanks.

Though I don't see any keystore configurations, I can see below config variables and they are working fine for me.
Note: I am using 1.5.0-M1 version. Not sure if there is any other bug in the version you are using.
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "jks file path");

Accessing Spark RDDs from a web browser via thrift server - java

We have processed our data using Spark 1.2.1 with Java and stored in Hive tables. We want to access this data as RDDs from an web browser.
I read documentation and I understood the steps to do the task.
I am unable to find the way to interact with Spark SQL RDDs via thrift server. Examples I found have belw line in the code and I am not find the class for this in Spark 1.2.1 java API docs.
HiveThriftServer2.startWithContext
In github i saw scala examples using
import org.apache.spark.sql.hive.thriftserver , but I dont see this in Java API docs. Not sure if I am missing something.
Did anybody had luck with accessing Spark SQL RDDs from a browser via thrift? Can you post the code snippet. We are using Java.

I've got most of this working. Lets dissect each part of it: (References at bottom of post)
HiveThriftServer2.startWithContext is defined in Scala. I was never able to access it from Java or from Python using Py4j, and am no JVM expert, but I ended up switching to Scala. This may have something to do with the annotation #DeveloperApi . This is how I imported it Scala in Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For anyone reading this and not using Hive, a Spark SQL context won't do, and you need a hive context. However, the HiveContext constructor requires a Java spark context, not a scala one.
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
var hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the thrift server
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we have to convert them into Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then, we need to turn them into Spark SQL tables. You do this by persisting them to Hive, or making the RDD available as a temporary table.
Persist to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing
someDF.write().saveAsTable("someTable")
Or, use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note - temporary tables are isolated to an SQL session.
Spark's hive thrift server is multi-session by default
in version 1.6 (one session per connection). Therefore,
for clients to access temporary tables you've registered,
you'll need to set the option spark.sql.hive.thriftServer.singleSession to true
You can test this by querying the tables in beeline, a command line utility for interacting with the hive thrift server. It ships with Spark.
Finally, you need a way of accessing the hive thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can use the thrift protocol over AJAX requests from the browser. A simpler strategy might be to create an IPython notebook, and use pyhive to connect to the thrift server.
Data Frame Reference:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession option pull request:
https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31#git.apache.org%3E
HTTP mode and beeline howto:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive:
https://github.com/dropbox/PyHive
HiveThriftServer2 startWithContext definition:
https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56-73

Thrift is JDBC/ODBC server.
You can connect to it via JDBC/ODBC connections and access content through the HiveDriver.
You can not get RDDs back from it, because HiveContext is not available.
What you refered to is an experimental feature not available for Java.
As a workaround, you could re-parse the results and create your structures for your client.
For example:
private static String driverName = "org.apache.hive.jdbc.HiveDriver";
private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
private static String tableName = "SOME_TABLE";
Class c = Class.forName(driverName);
Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
Statement stmt = con.createStatement();
String sql = "select * from "+tableName;
ResultSet res = stmt.executeQuery(sql);
parseResultsToObjects(res);

how to use presto to query hive data

I just installed presto and when I use the presto-cli to query hive data, I get the following error:
$ ./presto --server node6:8080 --catalog hive --schema default
presto:default> show tables;
Query 20131113_150006_00002_u8uyp failed: Table hive.information_schema.tables does not exist
The config.properties is:
coordinator=true
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/root/h2
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=`http://node6:8080`
And the hive.properties is:
connector.name=hive-cdh4
hive.metastore.uri=thrift://node6:9083
The hadoop distribution I used is CDH 4.4. I believe it's properly installed and hive can process queries successfully on its own.
Can anyone help me work it out? Any ideas will be appreciated.

As recommended by the Getting Started, I created a controller (jmx only) and a separate worker (jmx,hive), each on separate machines.
What finally solved this for me was to specify the worker's hostname and http-server.http.port as the --server argument to presto. When specifying the controller, it didn't work.
This all makes sense, but I am still wondering what will happen when I have two Presto-Hive workers...

Add more line to etc/catalog/hive.properties
"hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml"
ofcourse check values of path before do it.
presto-metastore.db.filename= <- is this the value for Hive Warehouse
Directory ?
=> this presto's metastore,not hive.

I just figured out what was wrong in my case:
you also have to add following line to $HIVE_HOME/conf/hive-env.sh for informing hive to open thrift port(same as mentioned under hive.metastore.uris property in hive-site.xml file). This port is used by hive client to connect to Metastore through RPC.
export METASTORE_PORT=9084
in the hive-env.sh file in the conf folder.
This should sync your hive with presto.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string