Second Spark thrift Server without Kerberos - apache-spark

I have a kerberized HDP 2.5 Cluster with a running Spark Thrift Server (Spark 1.6.2).
A colleague would like to connect to the thrift server with a client that doesn't support passing Kerberos parameters.
Is it possible to start a second Thrift server on another node that doesn't require Kerberos for authentication?

Related

Running spark thrift server to access remote hive/HDFS remote cluster

We have two clusters: a Spark cluster with Spark/YARN/HDFS (cluster A) and a Hive/HDFS cluster (cluster B). Both clusters are kerberized. Is it possible to run the Spark Thrift Server on cluster A to provide a SQL interface for querying the Hive tables on cluster B, using the compute resources of cluster A? I know that the remote reads will hurt query performance, but that is not a concern at the moment.
The problem I have is running the Thrift server in cluster mode on cluster A, and knowing which hdfs-site.xml to use. When I run the Thrift server in local mode, it works.
Thanks,
Suri

Can't use Tableau on an EMR Spark cluster

I have a client that wants to use Tableau on their EMR Spark cluster.
The documentation seems straightforward but I'm getting errors when I try to connect.
Here is the setup:
The EMR cluster's master doesn't have a public IP, but from the Tableau Desktop EC2 instance I can ping it and telnet to port 10001, where Thrift is running
I can test Thrift with beeline and it connects fine
I am not using SSL or authentication, given the limited access to the cluster
I have installed both the DataDirect 8.0 and Simba ODBC drivers
I'm using emr-5.13.0, the Hadoop distribution is Amazon 2.8.3 and the Spark version is 2.3.0.
The error is
Unable to connect to the ODBC Data Source. Check that the necessary drivers are installed and that the connection properties are valid.
[Simba][ThriftExtension] (5) Error occurred while contacting server: No more data to read.. This could be because you are trying to establish a non-SSL connection to an SSL-enabled server.
Unable to connect to the server "IP". Check that the server is running and that you have access privileges to the requested database.
I simply followed the documentation provided by Tableau, which says to install the driver only (not to touch the ODBC configuration) and then use it in Tableau. I verified that SSL and authentication were both disabled before trying to connect. I also verified connectivity by running a query with DataGrip from the Tableau EC2 instance, which works as expected.
I resolved the issue by ignoring the documentation and setting up the ODBC driver myself, then choosing it as the source instead of Spark SQL.

How Spark Thrift server is related to Apache Thrift

I read a post on Quora which says that Spark Thrift server is related to Apache Thrift, which is a binary communication protocol. Spark Thrift server is the interface to Hive, but how does Spark Thrift server use Apache Thrift to communicate with Hive via a binary protocol/RPC?
Spark Thrift Server is a Hive-compatible interface for Spark.
That means it provides an implementation of HiveServer2 that you can connect to with beeline; however, almost all of the computation is done by Spark, not Hive.
In earlier versions the query parser came from Hive. Spark Thrift Server now uses Spark's own query parser.
Apache Thrift is a framework for developing RPCs (Remote Procedure Calls), so there are many implementations that use Thrift. Cassandra, for example, used Thrift before replacing it with the Cassandra native protocol.
So, Apache Thrift is a framework for developing RPCs, and Spark Thrift Server is an implementation of the Hive protocol that uses Spark as its computation framework.
For more details, please see this link from @RussS
You can bring up the Spark Thrift Server on AWS EMR using the following command:
sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client
On EMR, the default port for the Spark Thrift Server is 10001.
To use Spark's beeline on EMR, run the following command:
/usr/lib/spark/bin/beeline -u 'jdbc:hive2://:10001/default' -e "show databases;"
By default, the Hive Thrift server is always up and running on EMR, but the Spark Thrift Server is not.
You can also connect any application to the Spark Thrift Server using ODBC/JDBC, and you can monitor queries on the EMR cluster by clicking the Application Master link for the "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2" job in the YARN Resource Manager UI (port 8088).
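The connection string in the beeline command above follows the pattern jdbc:hive2://host:port/database. As a minimal sketch, the shell function below assembles that URL from its parts. The function name thrift_jdbc_url and the localhost default are assumptions made for illustration (the original command leaves the host blank); port 10001 and the database name default come from the answer itself.

```shell
# Hypothetical helper (not part of Spark or EMR): build a HiveServer2-style JDBC URL.
# Arguments: host, port, database.
# Defaults: localhost is an assumption (running on the master node);
# 10001 and "default" are taken from the answer above.
thrift_jdbc_url() {
  printf 'jdbc:hive2://%s:%s/%s\n' "${1:-localhost}" "${2:-10001}" "${3:-default}"
}

# Print the beeline command instead of executing it, so the sketch is safe to try anywhere.
echo "/usr/lib/spark/bin/beeline -u '$(thrift_jdbc_url)' -e 'show databases;'"
```

Passing arguments, for example thrift_jdbc_url 10.0.0.5 10001 default, would point the same command at a remote master; the address here is only a placeholder.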

Issue: connect rest application with Cassandra

I am using the Cassandra 3.11 Docker image for my application, but I am getting the error below. Can someone suggest a fix?
Enable the Thrift server in Cassandra.
The error means that JanusGraph can't connect to Cassandra over Thrift, so you have to enable Thrift on Cassandra.
Use the command below to enable Thrift:
nodetool enablethrift

Spark context for the thrift server

Does every JDBC connection to the Apache Spark Thrift server create a separate Spark context? If the answer is "No", how can I create a separate Spark context for every JDBC connection to the Thrift server?
There is just one Spark context in the Thrift server.
Spark Thrift server is not suitable for highly concurrent application access.

Resources