How to configure SSL between Spark and Cassandra?

I'm trying to configure SSL for the Cassandra Spark connector, but I couldn't find an example of how to do it.
I'm trying to configure it like this:
SparkConf conf = new SparkConf().setAppName("someApp")
    .set("spark.cassandra.connection.host", "111.111.111.111")
    .set("spark.cassandra.connection.ssl.enabled", "true")
    .set("spark.cassandra.connection.ssl.trustStore.path", "/some/tfile.jks")
    .set("spark.cassandra.connection.ssl.trustStore.password", "apassword")
    .set("spark.cassandra.connection.ssl.trustStore.type", "JKS")
    .set("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
    .set("spark.cassandra.connection.ssl.keyStore.path", "/some/kfile.jks")
    .set("spark.cassandra.connection.ssl.keyStore.password", "anotherpassword")
    .set("spark.cassandra.connection.ssl.keyStore.type", "JKS")
    .set("spark.cassandra.connection.ssl.protocol", "TLS");
When I try to submit the spark job, I get these errors:
Exception in thread "main" com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.ssl.keyStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabled is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.protocol is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabledAlgorithms is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
So I'm not sure if this is supported or I'm just using the wrong property names.
I saw this ticket for release 1.2.3 of the connector, but I couldn't find an example of how to use it and it sounded like it may not support keystores. I'm using version 1.4.0-M1 of the connector.
Can anyone show me an example of how to configure SSL for the Spark Cassandra connector? Thanks.

I don't see any keystore settings in the connector, but the config variables below do exist and work fine for me.
Note: I am using version 1.5.0-M1. There may be a bug in the version you are using.
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "jks file path");
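A minimal sketch of the same settings collected in one place, written in Python (an assumption; the question uses Java). The helper name cassandra_ssl_settings is hypothetical. The SSL property names are only the three reported as working above with connector 1.5.0-M1; keystore options are left out because the question shows them being rejected.

```python
def cassandra_ssl_settings(host, truststore_path, truststore_password):
    """Return the connector SSL properties reported as working above.

    Hypothetical helper: only the property names come from the thread;
    the function itself is illustrative.
    """
    if not truststore_path:
        raise ValueError("truststore_path must not be empty")
    return {
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.ssl.enabled": "true",
        "spark.cassandra.connection.ssl.trustStore.path": truststore_path,
        "spark.cassandra.connection.ssl.trustStore.password": truststore_password,
    }


# Placeholder values taken from the question.
settings = cassandra_ssl_settings("111.111.111.111", "/some/tfile.jks", "apassword")
for key, value in sorted(settings.items()):
    # Mask the password so it does not end up in logs.
    shown = "****" if key.endswith(".password") else value
    print(f"{key}={shown}")
```

With PySpark available, the dict could be applied with SparkConf().setAll(list(settings.items())).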

Related

How can Spark read / write from Azurite?

I am trying to read (and eventually write) from Azurite (version 3.18.0) using Spark (3.1.1).
I can't work out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that I'm running; the uri val is one of the values below:
val uri = ...
val spark = SparkSession.builder()
.appName(appName)
.master("local")
.config("spark.driver.host", "127.0.0.1").getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set(s"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
Which uri value is the correct one?
http://127.0.0.1:10000/container1/file1.avro
With this one I get an UnsupportedOperationException when I call spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which doesn't support listStatus.
wasb://container1@devstoreaccount1.blob.core.windows.net/file1.avro
With this one Spark tries to authenticate against the Azure servers (and fails for obvious reasons).
I have tried to follow this Stack Overflow post without success.
I have also tried removing the blob.core.windows.net suffix from the configuration keys, but then I don't know how to give Spark the endpoint for the Azurite container.
So my question is: what are the correct configurations to give Spark so it can read from Azurite, and what is the correct file path format to pass as the URI?

How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

I am trying to read data from a secured Kafka cluster using spark structured streaming.
Also I am using the below library to read the data - "spark-sql-kafka-0-10_2.12":"3.0.0-preview" since it has the feature to specify our custom group id (instead of spark setting its own custom group id)
Dependency used in code:
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.12</artifactId>
<version>3.0.0-preview</version>
I am getting the below error - even after specifying the required JAAS configuration in spark options.
Caused by: java.lang.IllegalArgumentException: requirement failed: Delegation token must exist for this connector.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
The following document says that obtaining delegation tokens can be disabled: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html
I tried setting this property spark.security.credentials.kafka.enabled to false in spark config, but it is still failing with the same error.
Apparently this is a bug in the preview release that has been fixed in the GA Spark 3.x release.
Reference:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-30495
With the fix, we can specify a custom consumer group name when fetching data from Kafka (even though it's not recommended, and a warning message is shown when it is specified).
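As a rough sketch of that last point, the options for a Kafka source with a custom consumer group could be assembled like this in Python. The helper name, broker address, topic, and group name are placeholders, not values from the question; kafka.group.id is the source option documented for Spark 3.0 and later.

```python
def kafka_source_options(bootstrap_servers, topic, group_id):
    """Build options for Spark's Kafka source with a custom consumer group.

    Hypothetical helper; all argument values are supplied by the caller.
    """
    if not group_id:
        raise ValueError("group_id must not be empty")
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Custom consumer group; per the answer above, a warning is shown when this is set.
        "kafka.group.id": group_id,
    }


# Placeholder values for illustration only.
options = kafka_source_options("broker1:9093", "events", "my-consumer-group")
for key in sorted(options):
    print(f"{key}={options[key]}")
```

With a SparkSession at hand, these could be passed as spark.readStream.format("kafka").options(**options).load(). Security settings such as the JAAS configuration are omitted here.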

How to print out the Spark connection of a Spark session?

Suppose I run the pyspark command and get the global variable spark of type SparkSession. As I understand it, this spark object holds a connection to the Spark master. Can I print out the details of this connection, including the hostname of the Spark master?
For basic information you can use the master property:
spark.sparkContext.master
To get details on YARN you might have to dig through hadoopConfiguration:
hadoopConfiguration = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConfiguration.get("yarn.resourcemanager.hostname")
or
hadoopConfiguration.get("yarn.resourcemanager.address")
When submitted to YARN, Spark uses the Hadoop configuration to determine the resource manager, so these values should match the ones in the configuration placed in HADOOP_CONF_DIR or YARN_CONF_DIR.
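The lookups above can be gathered into one helper. This is a minimal sketch: the function name describe_spark_connection is hypothetical, and it relies on the private _jsc attribute used in the answer, which may change between PySpark versions.

```python
def describe_spark_connection(spark):
    """Collect master and YARN ResourceManager details from a SparkSession.

    Uses the same attributes as the answer above; `_jsc` is private to
    PySpark, so treat this as a debugging aid rather than a stable API.
    """
    sc = spark.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    return {
        "master": sc.master,
        # Either key may be unset (None), e.g. when not running on YARN.
        "yarn.resourcemanager.hostname": hadoop_conf.get("yarn.resourcemanager.hostname"),
        "yarn.resourcemanager.address": hadoop_conf.get("yarn.resourcemanager.address"),
    }
```

In a pyspark shell, print(describe_spark_connection(spark)) would show all three values at once.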

Connecting to Cassandra with Spark

First, I bought the new O'Reilly Spark book and tried its Cassandra setup instructions. I've also found other Stack Overflow posts and various posts and guides around the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 VirtualBox VM provided by planetcassandra.org, linked from the main Cassandra project page.
I downloaded the Spark 1.2.1 source, got the latest Cassandra Connector code from GitHub, and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 set up on Mac OS 10.10.2.
I run the spark shell with the cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row count type test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it is getting errors trying to find Cassandra at 127.0.0.1, but it also recognizes the host name that I configured which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close and recreate the Spark context or pass in the --driver-class-path argument. I ultimately hit the same error though, and the above steps seem easier to communicate in this post.
Any ideas?
Check the rpc_address config in your cassandra.yaml file on your cassandra node. It's likely that the spark connector is using that value from the system.local/system.peers tables and it may be set to 127.0.0.1 in your cassandra.yaml.
The spark connector uses thrift to get token range splits from cassandra. Eventually I'm betting this will be replaced as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it's getting the host metadata to find the nearest host and then making the query using thrift on port 9160.
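A quick way to run the rpc_address check is a small script like the Python sketch below. It is only illustrative: the helper names are hypothetical, the parser is a naive line scan rather than a real YAML parser, and the sample text stands in for the contents of cassandra.yaml.

```python
import ipaddress


def find_rpc_address(yaml_text):
    """Return the value of the first uncommented `rpc_address:` entry, or None."""
    for line in yaml_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop comments
        if stripped.startswith("rpc_address:"):
            return stripped.split(":", 1)[1].strip().strip("'\"") or None
    return None


def is_loopback(address):
    """True if the address is 'localhost' or a loopback IP such as 127.0.0.1."""
    if address == "localhost":
        return True
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        return False  # a hostname we cannot classify without DNS


# Stand-in for the contents of cassandra.yaml on the Cassandra node.
sample = """
cluster_name: 'Cluster on a Stick'
rpc_address: 127.0.0.1
rpc_port: 9160
"""
address = find_rpc_address(sample)
if address is not None and is_loopback(address):
    print(f"rpc_address is {address}: the connector may try to reach Cassandra on loopback")
```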

How to use Presto to query Hive data

I just installed presto and when I use the presto-cli to query hive data, I get the following error:
$ ./presto --server node6:8080 --catalog hive --schema default
presto:default> show tables;
Query 20131113_150006_00002_u8uyp failed: Table hive.information_schema.tables does not exist
The config.properties is:
coordinator=true
datasources=jmx,hive
http-server.http.port=8080
presto-metastore.db.type=h2
presto-metastore.db.filename=/root/h2
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://node6:8080
And the hive.properties is:
connector.name=hive-cdh4
hive.metastore.uri=thrift://node6:9083
The hadoop distribution I used is CDH 4.4. I believe it's properly installed and hive can process queries successfully on its own.
Can anyone help me work it out? Any ideas will be appreciated.
As recommended by the Getting Started guide, I created a coordinator (jmx only) and a separate worker (jmx,hive), each on a separate machine.
What finally solved this for me was to specify the worker's hostname and http-server.http.port as the --server argument to presto. When I specified the coordinator, it didn't work.
This all makes sense, but I am still wondering what will happen when I have two Presto-Hive workers...
Add one more line to etc/catalog/hive.properties:
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Of course, check that the paths are correct before doing so.
presto-metastore.db.filename= <- is this the value for the Hive warehouse directory?
=> No, this is Presto's own metastore, not Hive's.
I just figured out what was wrong in my case:
You also have to add the following line to $HIVE_HOME/conf/hive-env.sh so that Hive opens the Thrift port (the same port given in the hive.metastore.uris property in hive-site.xml). Hive clients use this port to connect to the metastore through RPC.
export METASTORE_PORT=9084
This should sync your Hive with Presto.
