Free SparkSQL ODBC Driver - apache-spark

Is there a free ODBC driver that can connect to a Spark SQL instance? Many companies seem to offer one, but you can either use it only with their products or have to pay for a license.
I need a driver to connect my service to a local Spark cluster, but I am struggling to find one.

Related

How to connect multiple Cassandra instances using a single ODBC driver (from SAS ETL)

We are having trouble connecting to multiple Cassandra instances using a single ODBC driver. We have a SAS ETL server from which we want to connect to multiple Cassandra instances, but we cannot figure out how to do this.
If you have the ODBC driver installed, you can connect to different Cassandra clusters as long as you configure the appropriate ODBC URL/DSN connection for each cluster.
If, for example, you want to configure the driver to use multiple contact points, you can only do so when connecting to a DataStax Enterprise cluster, since that is an enterprise-only feature of the Simba Spark ODBC driver, which connects to the AlwaysOn SQL Service in DSE. Cheers!
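As a rough illustration of the "one DSN per cluster" approach described above, here is a minimal Python sketch. It assumes the third-party pyodbc package is installed and that a separate DSN has already been defined for each cluster; the DSN names and the helper functions are hypothetical and used only for illustration. A SAS ETL server would reference the same DSNs through its own ODBC configuration instead.

```python
def build_connection_strings(clusters):
    """Map each cluster name to an ODBC connection string that references its own DSN."""
    return {name: f"DSN={dsn}" for name, dsn in clusters.items()}


def connect_all(clusters):
    """Open one ODBC connection per cluster (requires the third-party pyodbc package)."""
    import pyodbc  # imported lazily so the helper above works without it

    return {
        name: pyodbc.connect(conn_str, autocommit=True)
        for name, conn_str in build_connection_strings(clusters).items()
    }


# Hypothetical DSN names: each must already exist in the ODBC Data Source
# Administrator (Windows) or in odbc.ini (Linux/macOS), pointing at one cluster.
CLUSTERS = {
    "cluster_a": "CassandraClusterA",
    "cluster_b": "CassandraClusterB",
}

if __name__ == "__main__":
    for name, conn_str in build_connection_strings(CLUSTERS).items():
        print(name, "->", conn_str)
```

The same installed driver serves every connection; only the DSN, and therefore the host and credentials behind it, changes per cluster.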

Is there a Spark SQL jdbc driver?

I'm looking for a client jdbc driver that supports Spark SQL.
I have been using Jupyter so far to run SQL statements on Spark (running on HDInsight) and I'd like to be able to connect using JDBC so I can use third-party SQL clients (e.g. SQuirreL, SQL Explorer, etc.) instead of the notebook interface.
I found an ODBC driver from Microsoft, but this doesn't help me with Java-based SQL clients. I also tried downloading the Hive JDBC driver from my cluster, but the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does. For example, the Hive driver complains about not supporting join statements that are not equi-joins, whereas I know this is a supported feature of Spark because I've executed the same SQL in Jupyter successfully.
"the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does"
Regardless of the support that it provides, the Spark Thrift Server is fully compatible with Hive/Beeline's JDBC connection.
Therefore, the Hive JDBC driver JAR is the one you need to use. I have verified this works in DBVisualizer.
The alternative solution would be to run Spark code in your Java clients (non-third party tools) directly and skip the need for the JDBC connection.
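For readers configuring a SQL client, the sketch below shows how a Hive-style JDBC URL for a Spark Thrift Server is typically assembled. It is only a sketch under assumptions: the host name is a placeholder, port 10000 is the usual Thrift Server default rather than a value confirmed by the question, and the session parameters in the second call are examples of standard Hive JDBC options that a managed cluster such as HDInsight may or may not require. The helper function itself is hypothetical.

```python
def hive2_jdbc_url(host, port=10000, database="default", session_params=None):
    """Build a jdbc:hive2:// URL; session parameters are appended as ;key=value pairs."""
    base = f"jdbc:hive2://{host}:{port}/{database}"
    if not session_params:
        return base
    suffix = ";".join(f"{key}={value}" for key, value in session_params.items())
    return f"{base};{suffix}"


if __name__ == "__main__":
    # Placeholder host; binary transport on the default port.
    print(hive2_jdbc_url("spark-host.example.com"))

    # Assumed example of HTTP transport over TLS; check your cluster's
    # documentation for the actual port and httpPath values.
    print(
        hive2_jdbc_url(
            "spark-host.example.com",
            port=443,
            session_params={"transportMode": "http", "ssl": "true", "httpPath": "cliservice"},
        )
    )
```

The resulting string is what goes into the URL field of a client such as SQuirreL or DBVisualizer, with the Hive JDBC driver JAR registered as the driver.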

Connecting open-source Cassandra to Tableau Desktop

I am trying to use the Business Intelligence (BI) software Tableau Desktop to see into a local Cassandra cluster. The Cassandra cluster is the open-source version and not the proprietary version that one pays to use. The version of Cassandra I am using is 2.2.x.
I can successfully connect Cassandra and Tableau after configuring the 64-bit ODBC driver. However, actually querying the tables in the NoSQL database throws errors. For instance, in the 'Data Source' view, selecting 'Update Now' results in an error from a SQL statement that starts with SELECT 1... I do not think Cassandra can understand or process SELECT 1 statements.
Errors are also thrown when trying to build graphs of the data as this also results in failed queries.
In the 'Advanced Options' for the ODBC driver I selected to use CQL as the 'Query Mode' and still there were problems with the queries Tableau was sending to Cassandra.
Does anyone know how to get these two technologies to work together? I found this tutorial, but it was made almost a year ago, as of this writing, and does not work in my experience. Please see the link to what I am talking about here: http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise In this article they say to download the driver from here: https://academy.datastax.com/downloads/download-drivers?dxt=DX I am wondering whether this specific version of the ODBC driver is the problem.
I also read a previous post on this, and it was not helpful as it is also obsolete in my experience. The post I am referring to is: Connecting cassandra to Tableau Software. The first answer is probably the obsolete one, but the second one recommends using the Simba driver, which is some type of proprietary driver. My current hypothesis is that the Simba driver may be needed to use Tableau and Cassandra together.
Thank you for reading this.
DataStax licenses Simba Technologies' ODBC driver, but the version on their website may be behind the latest version available from Simba. Please download a free evaluation version of the driver and see if you have the same issue: http://www.simba.com/drivers/cassandra-odbc-jdbc/
'SELECT 1' is not a valid CQL query (http://docs.datastax.com/en/cql/3.1/cql/cql_reference/select_r.html).

Microsoft PowerBI with Hortonworks Hive/HBase/Spark Integration

I'm thrilled with Microsoft's Power BI offering, but I still cannot find a direct way to integrate it with my Hortonworks Hadoop cluster.
I went through the tutorials and found two things:
Power BI can fetch data from an Azure HDInsight cluster using Thrift. If that's possible, can it connect to any other Hadoop distro as well?
We can connect using the ODBC driver offered by Simba Technologies, but I was wondering whether it's possible to connect using Apache Phoenix, which offers JDBC drivers for HBase.
Appreciate your thoughts/suggestions/help!
Splice Machine has an ODBC driver for retrieving data that is stored ultimately in HBase.
Check out the ODBC driver on this page:
http://community.splicemachine.com/
Yes, it is possible to connect Power BI to other Hadoop distros via an ODBC driver, e.g. http://www.simba.com/webinar/powerbi-demo/
Power BI doesn't support JDBC drivers, but if you are interested in testing an ODBC driver for Phoenix, please contact Simba Technologies.

Possibilities of Hadoop with MSSQL Reporting

I have been evaluating Hadoop on Azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with MSSQL Reporting Services, as that is what our application already uses. We are very short on developer resources, so the more I can make this into an engineering exercise the better. What I have tried so far:
Use an ODBC connection from MSSQL mapped to Hive on HDInsight.
Use an ODBC connection from MSSQL using HBase on HDInsight.
Use Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower to use with our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds. I ran the query on the Hive query console and over the ODBC connection and found that it took over a minute to execute. Spark was faster (30 seconds), but there is no way to connect to it externally since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is, am I looking for Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but it sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure that is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at the ORC storage format and the Tez execution engine. I would create a cluster that has Tez on by default and then store your data in ORC format. Your queries should be much more performant then.
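As a rough sketch of what that advice looks like in practice, the Python helper below generates the two HiveQL statements involved: one that selects Tez as the execution engine for the session and one that copies an existing table into a new ORC-backed table. The table names and the helper function are hypothetical, and the statements would still need to be run through your own Hive client or query console.

```python
def orc_copy_statements(source_table, target_table):
    """Return HiveQL that enables Tez and copies source_table into an ORC table."""
    return [
        # Use Tez instead of MapReduce for this session.
        "SET hive.execution.engine=tez;",
        # Create-table-as-select into the columnar ORC format.
        f"CREATE TABLE {target_table} STORED AS ORC AS SELECT * FROM {source_table};",
    ]


if __name__ == "__main__":
    # Hypothetical table names for illustration only.
    for statement in orc_copy_statements("report_data", "report_data_orc"):
        print(statement)
```

Reports would then query the ORC table rather than the original one.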
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it and the performance is not comparable to what you'll get with MSSQL, other RDBMS or something like ElasticSearch but it does work pretty reliably.
