Spark context for the thrift server - apache-spark

Does every JDBC connection to the Apache Spark Thrift Server create a separate Spark context? If the answer is "No", how can I create a separate Spark context for every JDBC connection to the Thrift Server?

The Thrift Server has just one Spark context.
Spark Thrift Server is not suitable for highly concurrent application access.
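One related detail: on recent Spark versions each JDBC connection does get its own SQL session (separate temporary views and current database), but all sessions share that single Spark context, and there is no supported way to get a separate context per connection. Session sharing is controlled by the spark.sql.hive.thriftServer.singleSession option; a minimal sketch of the launch command (the master setting is illustrative):
sbin/start-thriftserver.sh --master yarn --conf spark.sql.hive.thriftServer.singleSession=false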

Related

Use Spark SQL JDBC Server/Beeline or spark-sql

In Spark SQL, there are two options for submitting SQL.
spark-sql: for each SQL statement, it kicks off a new Spark application.
Spark JDBC Server and Beeline: the JDBC server is actually a long-running standalone Spark application, and the SQL statements submitted to it share its resources.
We have about 30 big SQL queries, each of which needs roughly 200 cores and 800 GB of memory to finish in a reasonable time (30 minutes).
Between spark-sql and the JDBC server/Beeline, which option is better for my case?
My inclination is to use spark-sql, but I have no idea how many resources the JDBC server would need for my queries to finish in a reasonable time.
If I submit the 30 queries to the JDBC server, how many resources (cores/memory) should it be given (5000+ cores and 10 TB+ of memory?)?
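If you do go with the JDBC server, the resources are just ordinary spark-submit options passed to start-thriftserver.sh, usually combined with the fair scheduler so that concurrent queries share executors instead of queueing behind each other. A minimal sketch of the launch command; the numbers are illustrative placeholders, not a sizing recommendation:
sbin/start-thriftserver.sh --master yarn --num-executors 100 --executor-cores 4 --executor-memory 16G --conf spark.scheduler.mode=FAIR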

How Spark Thrift Server is related to Apache Thrift

I read a post on Quora saying that Spark Thrift Server is related to Apache Thrift, which is a binary communication protocol. Spark Thrift Server is the interface to Hive, but how does Spark Thrift Server use Apache Thrift for communication with Hive via the binary protocol/RPC?
Spark Thrift Server is a Hive-compatible interface for Spark.
That means it provides an implementation of HiveServer2: you can connect to it with beeline, but almost all of the computation is done by Spark, not Hive.
In previous versions the query parser came from Hive; currently Spark Thrift Server uses Spark's own query parser.
Apache Thrift is a framework for developing RPCs (Remote Procedure Calls), so there are many implementations that use Thrift. Cassandra also used Thrift; it has since been replaced by the Cassandra native protocol.
So, Apache Thrift is a framework for developing RPCs, while Spark Thrift Server is an implementation of the Hive (HiveServer2) protocol that uses Spark as its computation framework.
For more details, please see this link from #RussS
You can bring up the Spark Thrift Server on AWS EMR using the following command:
sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client
On EMR, the default port for the Spark Thrift Server is 10001.
To use beeline against Spark on EMR, run:
/usr/lib/spark/bin/beeline -u 'jdbc:hive2://:10001/default' -e "show databases;"
By default, the Hive Thrift Server is always up and running on EMR, but the Spark Thrift Server is not.
You can also connect any application to the Spark Thrift Server using ODBC/JDBC, and you can monitor a query on the EMR cluster by clicking the Application Master link for the "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2" job in the YARN Resource Manager UI (port 8088).
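If you would rather connect from code than from beeline, the standard Hive JDBC driver works against the Spark Thrift Server as well. A minimal Scala sketch, assuming a hive-jdbc dependency on the classpath; the host name and user are placeholders, and the port is EMR's default 10001 mentioned above:
import java.sql.DriverManager

object ThriftJdbcExample {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (older driver versions need this explicit call).
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // <emr-master> is a placeholder for the EMR master node's hostname.
    val conn = DriverManager.getConnection("jdbc:hive2://<emr-master>:10001/default", "hadoop", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("show databases")
    while (rs.next()) println(rs.getString(1))
    rs.close(); stmt.close(); conn.close()
  }
}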

Second Spark thrift Server without Kerberos

I have a kerberized HDP 2.5 Cluster with a running Spark Thrift Server (Spark 1.6.2).
A colleague would like to connect to the thrift server with a client that doesn't support passing Kerberos parameters.
Is it possible to start, on another node, a second Thrift Server that doesn't require Kerberos for authentication?

thrift server - hive contexts - load/update data from spark code

Does the ThriftServer create its own HiveContext?
My aim is to create tables/load data from Spark code (spark-submit) via a HiveContext, such that clients of the Thrift Server will be able to see them.
Yes, of course it creates a context:
Thrift Code
But I have seen a strange issue: it looks like the Hive context is cached when the Thrift Server starts. If I run some other app which creates or changes a Hive table, the Thrift Server doesn't see the changes; only restarting the service helps.
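One way to avoid that, and to make tables created from your own spark-submit code visible to JDBC clients, is to start the Thrift Server inside your application so that it shares your HiveContext instead of creating its own. A minimal Spark 1.6-style sketch, assuming the spark-hive-thriftserver module is on the classpath; the table is just an example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object EmbeddedThriftServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("embedded-thrift-server"))
    val hiveContext = new HiveContext(sc)

    // Anything registered on this context is visible to JDBC clients of the embedded server.
    hiveContext.sql("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)")

    // Start a HiveServer2 endpoint backed by this HiveContext
    // (it listens on hive.server2.thrift.port, 10000 by default).
    HiveThriftServer2.startWithContext(hiveContext)

    // Keep the application, and therefore the server, alive.
    Thread.currentThread().join()
  }
}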

Use JDBC (eg Squirrel SQL) to query Cassandra with Spark SQL

I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument, submits it to Spark as Spark SQL, Spark runs that SQL against Cassandra and writes the output to a csv file.
Now I feel like I'm going round in circles trying to figure out whether it's possible to query Cassandra via Spark SQL directly over a JDBC connection (e.g. from Squirrel SQL). The Spark SQL documentation says:
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for
business intelligence tools.
The Spark SQL Programming Guide says
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or
command-line interface. In this mode, end-users or applications can interact
with Spark SQL directly to run SQL queries, without the need to write any
code.
So I can run the Thrift Server and submit SQL to it. But what I can't figure out is how to get the Thrift Server to connect to Cassandra. Do I simply put the DataStax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and port of my Cassandra cluster? Has anyone done this already who can give me some pointers?
Configure these properties in the spark-defaults.conf file:
spark.cassandra.connection.host 192.168.1.17,192.168.1.19,192.168.1.21
# if you configured security in you cassandra cluster
spark.cassandra.auth.username smb
spark.cassandra.auth.password bigdata#123
Start your Thrift Server with the spark-cassandra-connector and mysql-connector dependencies, on a port that you will connect to via JDBC or Squirrel:
sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=192.168.1.17 --hiveconf hive.server2.thrift.port=10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
To expose a Cassandra table, run a Spark SQL query like:
CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
Why don't you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your Spark context, and then you can read from and write to Cassandra using SQL.
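For reference, a minimal sketch of that approach with the DataFrame API (Spark 1.6 style), assuming the spark-cassandra-connector is on the classpath; the keyspace and table reuse the example above, and the connection property is the same one shown for spark-defaults.conf:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CassandraSqlExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-sql-example")
      .set("spark.cassandra.connection.host", "192.168.1.17") // same property as in spark-defaults.conf
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read a Cassandra table through the spark-cassandra-connector data source.
    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testks", "table" -> "testtable"))
      .load()

    df.registerTempTable("testtable")
    sqlContext.sql("SELECT * FROM testtable LIMIT 10").show()
  }
}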
