Getting Tableau to talk to Spark and Cassandra

The DataStax Spark Cassandra Connector is great for interacting with Cassandra through Apache Spark. With Spark SQL 1.1, we can use the Thrift server to query Spark from Tableau. Since Tableau can talk to Spark, and Spark can talk to Cassandra, there is surely some way to get Tableau talking to Cassandra through Spark (or rather Spark SQL), but I can't figure out how to get this running. Ideally, I'd like to do this with a Spark Standalone cluster plus a Cassandra cluster (i.e. without any additional Hadoop setup). Is this possible? Any pointers are appreciated.

HiveThriftServer2 has a startWithContext(sqlContext) method, so you can create a sqlContext that references C* and the appropriate table / CF, then pass that context to the Thrift server.
So something like this:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver._
val sparkContext = sc  // the SparkContext provided by spark-shell
import sparkContext._  // brings makeRDD into scope
val sqlContext = new HiveContext(sparkContext)
import sqlContext._  // brings the implicit RDD-to-SchemaRDD conversion into scope
// Register a small example table; replace this with your C*-backed table
makeRDD((1,"hello") :: (2,"world") :: Nil).toSchemaRDD.cache().registerTempTable("t")
// Start a Thrift server that serves the tables registered in this context
HiveThriftServer2.startWithContext(sqlContext)
So instead of starting the default Thrift server that ships with Spark, you can just launch your custom one.

Related

How to execute SQL scripts with Spark

I want to create a database in Spark, and for this purpose, I have written a few SQL scripts which create the SQL tables.
My question is, how to integrate the SQL tables (the database) into Spark for later processing?
Could that be done using a Scala script or through the Spark console?
Thank you.
Using Scala:
import scala.io.Source
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .appName("execute-query-files")
  .master("local[*]") // since the jar will be executed locally
  .getOrCreate()
// Read the whole script into a string and close the file handle afterwards
val source = Source.fromFile("path/to/data.sql")
val sqlQuery = try source.mkString finally source.close()
spark.sql(sqlQuery) // execute the query
Here spark is your SparkSession; if one has already been created (for example in spark-shell), getOrCreate() returns it.
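One caveat: as far as I know, spark.sql() parses a single statement per call, so a script that holds several statements separated by semicolons has to be split first. Below is a minimal sketch of such a splitter, written in plain Python so it can be tried without a cluster. split_sql_statements is a hypothetical helper, not part of Spark, and the people table in the sample script is made up for illustration. It only understands quoted literals and "--" line comments; block comments and backslash escapes inside strings are not handled.

```python
def split_sql_statements(script):
    """Split a SQL script into statements on semicolons outside quotes.

    Hypothetical helper (not a Spark API). Handles '...', "..." and `...`
    quoting plus "--" line comments. Does NOT handle /* block comments */
    or backslash escapes inside string literals.
    """
    statements = []
    current = []
    quote = None  # the quote character we are currently inside, if any
    i = 0
    while i < len(script):
        ch = script[i]
        if quote:
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in ("'", '"', "`"):
            quote = ch
            current.append(ch)
        elif script.startswith("--", i):
            # Drop the comment text up to (not including) the newline
            end = script.find("\n", i)
            i = len(script) if end == -1 else end
            continue
        elif ch == ";":
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


# Made-up sample script: a comment and a literal that both contain ';'
script = """
CREATE TABLE people (id INT, name STRING); -- first; statement
INSERT INTO people VALUES (1, 'a;b');
SELECT * FROM people
"""
statements = split_sql_statements(script)
for stmt in statements:
    print(stmt)  # with a real session this would be: spark.sql(stmt)
```

With a real session, each element of statements would be passed to spark.sql() in order, so that tables created by earlier statements exist when later ones run.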

Difference between various sparkcontexts in Spark 1.x and 2.x

Can anyone explain the difference between the SparkContext, SQLContext, HiveContext and SparkSession entry points, and each one's use cases?
SparkContext is used for the basic RDD API in both Spark 1.x and Spark 2.x.
SparkSession is used for the DataFrame API and the Structured Streaming API in Spark 2.x.
SQLContext and HiveContext are used for the DataFrame API in Spark 1.x and are deprecated as of Spark 2.x.
SparkContext is the class in the Spark API that every Spark application starts from. It lives in the driver, represents the connection to the cluster, and requests executors and cores from the cluster manager; in short, it is all about cluster management. SparkContext can be used to create RDDs and shared variables. Since it is a class, we need to create an object of it to use it.
This is how we can create a SparkContext: val sc = new SparkContext()
SparkSession is a new object added in Spark 2.x as a replacement for SQLContext and HiveContext.
Earlier we had two options: SQLContext, which is the way to run SQL operations on DataFrames, and HiveContext, which manages Hive connectivity and reads/writes data from/to Hive tables.
Since 2.x, we can create a SparkSession for SQL operations on DataFrames, and if you have any Hive-related work, just call the method enableHiveSupport(); then you can use the SparkSession for both DataFrame and Hive-related SQL operations.
This is how we can create a SparkSession for SQL operations on DataFrames:
val sparkSession = SparkSession.builder().getOrCreate()
The second way creates a SparkSession for SQL operations on DataFrames as well as Hive operations:
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

pyspark, how to read Hive tables with SQLContext?

I am new to the Hadoop ecosystem and I am still confused about a few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0).
I have some existing Hive tables, and I can run SQL queries against them using the HUE web interface with Hive (MapReduce) and Impala (MPP).
I am now using pySpark (I think pyspark-shell is behind this) and I want to understand and test HiveContext and SQLContext. There are many threads that discuss the differences between the two for various versions of Spark.
With HiveContext, I have no trouble querying the Hive tables:
from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc)
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320
So far so good. Since SQLContext is a subset of HiveContext, I was thinking that a basic SQL select should work:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
FromSQL = sqlSparkContext.sql("select * from table.mytable")
FromSQL.count()
Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;
I added the hive-site.xml to pyspark-shell. When running
sc._conf.getAll()
I see:
('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),
My questions are:
Can I access Hive tables with SQLContext for simple queries? (I know
HiveContext is more powerful, but for me this is just to understand
things.)
If this is possible, what is missing? I couldn't find any info apart
from the hive-site.xml, which I tried but which doesn't seem to work.
Thanks a lot
Cheers
Fabien
As mentioned in the other answer, you can't use SQLContext to access Hive tables; Spark 1.x.x provides a separate HiveContext for that, which is basically an extension of SQLContext.
Reason:
Hive uses an external metastore to keep all the metadata, for example the information about databases and tables. This metastore can be configured to be kept in MySQL etc. The default is Derby.
This is done so that all the users accessing Hive can see all the contents the metastore exposes.
Derby creates a private metastore as a directory metastore_db in the directory from which the Spark app is executed. Since this metastore is private, whatever you create or edit in this session will not be accessible to anyone else. SQLContext basically only sees such private, session-local metadata, not the shared Hive metastore.
In Spark 2.x.x the two have been merged into SparkSession, which acts as a single entry point to Spark. You can enable Hive support while creating a SparkSession with .enableHiveSupport().
You cannot use a standard SQLContext to access Hive directly. To work with Hive you need Spark binaries built with Hive support, and a HiveContext.
You could use the JDBC data source, but it won't be acceptable performance-wise for large-scale processing.
To query data through SQLContext, you need to register it as a temporary table first. Then you can easily run SQL queries on it. Suppose you have some data in the form of JSON. You can load it into a DataFrame.
Like below:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
df = sqlSparkContext.read.json("your json data")
df.registerTempTable("mytable")  # returns None, so there is nothing to assign
FromSQL = sqlSparkContext.sql("select * from mytable")
FromSQL.show()
You can also collect the SQL result as a list of Row objects, as below:
rows = FromSQL.collect()
for row in rows:
    print(row.column_Name)
Try without passing sc into a new SQLContext. I think that when we create the sqlContext object from sc ourselves, Spark is trying to call HiveContext but we have a plain SQLContext instead.
>>> df = sqlContext.sql("select * from <db-name>.<table-name>")
Use the superset of SQLContext, i.e. HiveContext, to connect and load the Hive tables into Spark DataFrames:
>>> df = HiveContext(sc).sql("select * from <db-name>.<table-name>")
(or)
>>> df = HiveContext(sc).table("default.text_Table")
(or)
>>> hc = HiveContext(sc)
>>> df = hc.sql("select * from default.text_Table")

Is SparkEnv created after the creation of SparkSession in Spark 2?

In Spark 1.6, a SparkEnv is automatically created after creating a new SparkContext object.
In Spark 2.0, SparkSession was introduced as the entry point to Spark SQL.
Is SparkEnv created automatically after the creation of SparkSession in Spark 2?
Yes, SparkEnv, SparkConf and SparkContext are all automatically created when SparkSession is created (and that's why corresponding code in Spark SQL is more high-level and hopefully less error-prone).
SparkEnv is a part of Spark runtime infrastructure and is required to have all the Spark Core's low-level services up and running before you can use the high-level APIs in Spark SQL (or Spark MLlib). Nothing has changed here.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.sparkContext
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1e86506c

how to connect spark streaming with cassandra?

I'm using
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
and cassandra is listening on
rpc_address:127.0.1.1
rpc_port:9160
For example, to connect Kafka and Spark Streaming, listening to Kafka every 4 seconds, I have the following Spark job:
sc = SparkContext(conf=conf)
stream=StreamingContext(sc,4)
map1={'topic_name':1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
Spark Streaming then keeps listening to the Kafka broker every 4 seconds and outputs the contents.
In the same way, I want Spark Streaming to listen to Cassandra and output the contents of the specified table every, say, 4 seconds.
How do I convert the above streaming code to make it work with Cassandra instead of Kafka?
The non-streaming solution
I can obviously keep running the query in an infinite loop but that's not true streaming right?
spark job:
from __future__ import print_function
import time
import sys
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.sql import SQLContext
from pyspark.streaming import *
sc = SparkContext(appName="sparkcassandra")
while True:
    time.sleep(5)
    sqlContext = SQLContext(sc)
    stream=StreamingContext(sc,4)
    lines = stream.socketTextStream("127.0.1.1", 9160)
    sqlContext.read.format("org.apache.spark.sql.cassandra")\
        .options(table="users", keyspace="keyspace2")\
        .load()\
        .show()
run like this
sudo ./bin/spark-submit --packages \
datastax:spark-cassandra-connector:1.4.1-s_2.10 \
examples/src/main/python/sparkstreaming-cassandra2.py
and I get the table values, which roughly look like
lastname|age|city|email|firstname
So what's the correct way of "streaming" the data from cassandra?
Currently the "Right Way" to stream data from C* is not to stream data from C* :) Instead it usually makes much more sense to have your message queue (like Kafka) in front of C* and stream off of that. C* doesn't easily support incremental table reads, although this can be done if the clustering key is based on insert time.
If you are interested in using C* as a streaming source, be sure to check out and comment on CASSANDRA-8844 (Change Data Capture), https://issues.apache.org/jira/browse/CASSANDRA-8844, which is most likely what you are looking for.
If you are actually just trying to read the full table periodically and do something you may be best off with just a cron job launching a batch operation as you really have no way of recovering state anyway.
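The incremental read mentioned above (a clustering key based on insert time) boils down to remembering a high-water mark between polls. Here is a minimal sketch in plain Python, assuming a hypothetical inserted_at column and a hypothetical fetch_since callable that stands in for the real query (for example a connector read filtered on inserted_at greater than the mark). The in-memory table is only there to make the sketch runnable; it is not how the connector works.

```python
def poll_new_rows(fetch_since, last_seen):
    """Return (rows, new_last_seen) for rows inserted after last_seen.

    fetch_since is a hypothetical callable: given a timestamp, it returns
    rows (dicts with an 'inserted_at' key) whose inserted_at is strictly
    greater than that timestamp.
    """
    rows = sorted(fetch_since(last_seen), key=lambda r: r["inserted_at"])
    if rows:
        # Advance the high-water mark to the newest row we have seen
        last_seen = rows[-1]["inserted_at"]
    return rows, last_seen


# In-memory stand-in for a table clustered by insert time (made-up data)
table = [
    {"inserted_at": 1, "firstname": "ann"},
    {"inserted_at": 2, "firstname": "bob"},
]


def fetch_since(ts):
    return [r for r in table if r["inserted_at"] > ts]


batch1, mark = poll_new_rows(fetch_since, 0)      # first poll sees both rows
table.append({"inserted_at": 3, "firstname": "cy"})
batch2, mark = poll_new_rows(fetch_since, mark)   # second poll sees only the new row
batch3, mark = poll_new_rows(fetch_since, mark)   # nothing new, mark is unchanged
```

Note that a row arriving late with an inserted_at at or below the current mark would be missed, and the mark itself has to be stored somewhere to survive a restart. Both are reasons why a queue in front of C* is usually the simpler option.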
Currently Cassandra is not natively supported as a streaming source in Spark 1.6; you must implement a custom receiver for your own case (listen to Cassandra and output the contents of the specified table every, say, 4 seconds).
Please refer to the implementation guide:
Spark Streaming Custom Receivers
