I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument and submits it to Spark as Spark SQL; Spark runs that SQL against Cassandra and writes the output to a CSV file.
Now I feel like I'm going round in circles trying to figure out if it's possible to query Cassandra via Spark SQL directly over a JDBC connection (e.g. from SQuirreL SQL). The Spark SQL documentation says:
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for
business intelligence tools.
The Spark SQL Programming Guide says:
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or
command-line interface. In this mode, end-users or applications can interact
with Spark SQL directly to run SQL queries, without the need to write any
code.
So I can run the Thrift Server and submit SQL to it. What I can't figure out is how to get the Thrift Server to connect to Cassandra. Do I simply put the DataStax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and port of my Cassandra cluster? Has anyone done this already and can give me some pointers?
Configure these properties in the spark-defaults.conf file:
spark.cassandra.connection.host 192.168.1.17,192.168.1.19,192.168.1.21
# if you configured authentication in your Cassandra cluster
spark.cassandra.auth.username smb
spark.cassandra.auth.password bigdata#123
Start the Thrift Server with the spark-cassandra-connector and mysql-connector dependencies on its classpath, bound to a port that you can then connect to via JDBC or SQuirreL:
sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=192.168.1.17 --hiveconf hive.server2.thrift.port=10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
To expose a Cassandra table, run a Spark SQL statement like:
CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
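For example (a sketch: it assumes the Thrift Server was started on 192.168.1.17:10003 as above, and note that a TEMPORARY table is session-scoped, so it must be created in the same JDBC session that queries it), you can then connect from beeline or any other JDBC client and query it:
bin/beeline -u jdbc:hive2://192.168.1.17:10003
-- inside the beeline session:
SELECT * FROM mytable LIMIT 10;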
Why don't you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your Spark context, and then you can read/write to Cassandra using SQL.
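For illustration, a minimal Scala sketch along those lines (the host, credentials, keyspace and table reuse the values from the answer above; it assumes the spark-cassandra-connector is on the classpath and uses the SparkSession API):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-sql-example")
  .config("spark.cassandra.connection.host", "192.168.1.17")
  .config("spark.cassandra.auth.username", "smb")        // only if authentication is enabled
  .config("spark.cassandra.auth.password", "bigdata#123")
  .getOrCreate()

// Expose the Cassandra table to Spark SQL, then query it with plain SQL
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "testks", "table" -> "testtable"))
  .load()
  .createOrReplaceTempView("testtable")

spark.sql("SELECT count(*) FROM testtable").show()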
HIVE has a metastore, and HIVESERVER2 listens for SQL requests; with the help of the metastore, the query is executed and the result is passed back.
The Thrift framework is actually customised as HIVESERVER2, so HIVE is acting as a service. Via a programming language, we can use HIVE as a database.
The relationship between Spark-SQL and HIVE is that:
Spark-SQL just utilises the HIVE setup (HDFS file system, HIVE metastore, HiveServer2). When we invoke sbin/start-thriftserver.sh (present in the Spark installation), we are supposed to give the hiveserver2 port number and the hostname. Then, via Spark's beeline, we can actually create, drop and manipulate tables in HIVE. The API can be either Spark SQL or HiveQL.
If we create or drop a table, it will be clearly visible if we log into HIVE and check (say via HIVE beeline or the HIVE CLI). In other words, changes made via Spark can be seen in HIVE tables.
My understanding is that Spark does not have its own metastore setup like HIVE. Spark just utilises the HIVE setup, and the SQL execution simply happens via the Spark SQL API.
Is my understanding correct here?
Then I am a little confused about the usage of bin/spark-sql (which is also present in the Spark installation). The documentation says that via this SQL shell we can create tables like we do above (via Thrift Server/Beeline). Now my question is: how is the metadata information maintained by Spark then?
Or, like the first approach, can we make the spark-sql CLI communicate with HIVE (to be specific: HIVE's hiveserver2)?
If yes, how can we do that ?
Thanks in advance!
My understanding is that Spark does not have its own meta store setup like HIVE
Spark will create its own embedded Derby metastore if a Hive metastore is not provided.
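A rough illustration (run from an empty working directory, with no Hive configuration present):
$ bin/spark-sql -e "CREATE TABLE IF NOT EXISTS t (id INT)"
$ ls
derby.log  metastore_db  spark-warehouse
The metastore_db directory (and derby.log) is that embedded Derby metastore; spark-warehouse holds the table data.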
can we make spark-sql CLI to communicate to HIVE
Start an external metastore process, then point Spark at it: add a hive-site.xml file to $SPARK_CONF_DIR with hive.metastore.uris set, or use SET SQL statements for the same.
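A minimal hive-site.xml sketch (the host is a placeholder; 9083 is the default metastore port):
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://your-metastore-host:9083</value>
  </property>
</configuration>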
Then the spark-sql CLI should be able to query Hive tables. From code, you need to use the enableHiveSupport() method on the SparkSession builder.
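For example, a minimal Scala sketch (the metastore URI is a placeholder; alternatively, rely on hive-site.xml in $SPARK_CONF_DIR and drop the config call):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-example")
  .config("hive.metastore.uris", "thrift://your-metastore-host:9083")
  .enableHiveSupport()   // use the Hive metastore as Spark's catalog
  .getOrCreate()

spark.sql("SHOW TABLES").show()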
With the Spark SQL Cassandra connector, can a JDBC client tool (i.e. DBVisualizer, Tableau, Alteryx, etc.) join 2 Cassandra tables with Spark SQL?
All the documentation I see refers to joinWithCassandraTable (which I assume only works in Scala/Java code or the spark-shell, but not from a standard SQL client).
https://github.com/datastax/spark-cassandra-connector
DSE should support this if you're using the JDBC driver that is available from the DataStax Academy Downloads page. You'll need to run the Spark SQL Thrift Server (via the dse spark-sql-thriftserver command)... If you're just starting, DSE 6 has more improvements around this part (the so-called Always On SQL Service, AOSS).
Here is the old blog post that talks about the ODBC driver + Spark SQL and joins; the same should apply for JDBC drivers.
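To illustrate the join itself, a sketch in plain Spark SQL (the keyspace, table and column names are made up; both tables are registered through the connector the same way as in the first answer above):
CREATE TEMPORARY TABLE users USING org.apache.spark.sql.cassandra OPTIONS (keyspace 'testks', table 'users');
CREATE TEMPORARY TABLE orders USING org.apache.spark.sql.cassandra OPTIONS (keyspace 'testks', table 'orders');
SELECT u.user_id, o.order_id, o.total FROM users u JOIN orders o ON o.user_id = u.user_id;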
I have Spark 2.2 installed, but not Hive, and I would like to expose Spark tables through ODBC. I am able to start the Thrift Server with apparently no errors, and my ODBC driver application is able to connect to the Thrift Server, but it can't see any Spark tables. Do I need to have Hive installed and running in order for my ODBC applications to access the Spark tables that I create?
Thanks
Spark uses the Hive metastore.
You need to set up hiveserver as well to get access to the Hive tables.
I read a post on Quora which says that Spark Thrift Server is related to Apache Thrift, which is a binary communication protocol. Spark Thrift Server is the interface to Hive, but how does Spark Thrift Server use Apache Thrift for communication with Hive via a binary protocol/RPC?
Spark Thrift Server is a Hive-compatible interface for Spark.
That means it provides an implementation of HiveServer2: you can connect to it with beeline, but almost all of the computation will be performed by Spark, not Hive.
In previous versions, the query parser came from Hive. Currently, Spark Thrift Server works with Spark's own query parser.
Apache Thrift is a framework for developing RPCs (Remote Procedure Calls), so there are many implementations using Thrift. Cassandra also used Thrift; it has since been replaced by the Cassandra native protocol.
So, Apache Thrift is a framework for developing RPCs, while Spark Thrift Server is an implementation of the Hive protocol that uses Spark as the computation framework.
For more details, please see this link from #RussS
You can bring up the Spark Thrift Server on AWS EMR using the following command:
sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client
On EMR, the default port for the Spark Thrift Server is 10001.
To use beeline with Spark on EMR, use the following command:
/usr/lib/spark/bin/beeline -u 'jdbc:hive2://:10001/default' -e "show databases;"
By default, the Hive Thrift Server is always up and running on EMR, but the Spark Thrift Server is not.
You can also connect any application to the Spark Thrift Server using ODBC/JDBC, and you can monitor queries on the EMR cluster by clicking the ApplicationMaster link for the "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2" job in the YARN ResourceManager UI (port 8088) on EMR.
Can someone spell out the differences between using the Spark SQL CLI vs. the Thrift Server/Beeline to query/modify data in Hive? The Spark SQL documentation mentions both of them, but when would you use one or the other, or are they equivalent alternatives from a functional point of view?
For clarification:
spark-sql is a program that runs a single instance of Spark; you interact with it as if it were a MySQL-like shell prompt, and it makes use of the spark-warehouse and those types of features.
Spark with the Thrift Server is an application that exposes a running instance of Spark over a JDBC connection.
https://community.hortonworks.com/questions/33715/why-do-we-need-to-setup-spark-thrift-server.html
Beeline is a query/consumer tool that one uses to connect to a running HiveServer2 JDBC endpoint (and thus, in the Spark documentation, beeline is used to test that the JDBC connection is in fact working). Note: query/connector programs like SQL Workbench can be made to connect to Spark with the Thrift Server if they import the proper Hive2 JDBC drivers & jars.
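A rough sketch of the two modes side by side (paths are relative to the Spark installation; 10000 is the default Thrift Server port):
# Mode 1: spark-sql CLI -- a single local Spark driver with a shell prompt
bin/spark-sql
# Mode 2: Thrift Server -- a long-running JDBC/ODBC service shared by many clients
sbin/start-thriftserver.sh
bin/beeline -u jdbc:hive2://localhost:10000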