Microsoft PowerBI with Hortonworks Hive/HBase/Spark Integration - apache-spark

I'm thrilled with Microsoft's Power BI offering, but I still can't find a direct way to integrate it with my Hortonworks Hadoop cluster.
I went through the tutorials and found two things:
Power BI can fetch data from an Azure HDInsight cluster using Thrift. If that works, is it possible to connect to any other Hadoop distro the same way?
We can connect using the ODBC driver offered by Simba Technologies, but I was wondering whether it's possible to connect using Apache Phoenix, which offers a JDBC driver for HBase.
Appreciate your thoughts/suggestions/help!

Splice Machine has an ODBC driver for retrieving data that is stored ultimately in HBase.
Check out the ODBC driver on this page:
http://community.splicemachine.com/

Yes, it is possible to connect Power BI to other Hadoop distros via an ODBC driver, e.g. http://www.simba.com/webinar/powerbi-demo/
Power BI doesn't support JDBC drivers, but if you are interested in testing an ODBC driver for Phoenix, please contact Simba Technologies.

Related

Free SparkSQL ODBC Driver

Is there any free ODBC driver that you could use to connect to the Spark SQL instance? It seems many companies offer the driver but you can either use it with their products only or pay for a license.
I need a driver to connect my service to a local Spark cluster, but I am struggling to find one.

How to connect multiple Cassandra instances using a single ODBC driver (from SAS ETL)

We are having trouble connecting to multiple Cassandra instances using a single ODBC driver. We have a SAS ETL server from which we want to connect to multiple Cassandra instances, but we have not been able to figure out how to do this.
If you have the ODBC driver installed, you can connect to different Cassandra clusters as long as you configure the appropriate ODBC URL/DSN connection for each cluster.
If, for example, you want to configure the driver to use multiple contact points, you can only do so when connecting to a DataStax Enterprise cluster, since that is an enterprise-only feature of the Simba Spark ODBC driver, which connects to the AlwaysOn SQL Service in DSE. Cheers!
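As a rough illustration of the "one DSN per cluster" idea above, here is a minimal Python sketch. The question is about SAS, so Python is used only to show the pattern. The DSN names `CassandraClusterA` and `CassandraClusterB` are hypothetical and would have to exist in your ODBC configuration, and `pyodbc` is a third-party package that is assumed to be installed.

```python
def build_conn_str(dsn, user=None, password=None):
    """Build a DSN-based ODBC connection string (standard DSN/UID/PWD keywords)."""
    parts = ["DSN=" + dsn]
    if user is not None:
        parts.append("UID=" + user)
    if password is not None:
        parts.append("PWD=" + password)
    return ";".join(parts)


# Hypothetical DSN names: one DSN per Cassandra cluster, all using the same driver.
CLUSTER_DSNS = {
    "cluster_a": "CassandraClusterA",
    "cluster_b": "CassandraClusterB",
}


def connect_all(dsns, user=None, password=None):
    """Open one ODBC connection per configured DSN and return them by name."""
    import pyodbc  # third-party; imported lazily so the helper above works without it

    return {
        name: pyodbc.connect(build_conn_str(dsn, user, password), autocommit=True)
        for name, dsn in dsns.items()
    }
```

The driver is installed once; only the DSN entries differ, each pointing at the contact point of a different cluster.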

How to link Virtuoso distributed version to Hadoop

I have a cluster of 4 nodes on which I installed Hadoop and Spark (GraphX).
Now I have to process a big RDF dataset.
My question is: can I install Virtuoso on the cluster to store this RDF dataset and execute distributed SPARQL queries?
As far as I know, I need a web endpoint so that users can submit their SPARQL queries.
In other words: is Virtuoso a good solution that works on a Hadoop cluster and can use Spark to execute the distributed queries?
The Apache Spark website indicates that Spark SQL can be used to query across JDBC and JSON data sources:
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
Virtuoso (both Open Source and Enterprise Edition) can deliver SPARQL results as JSON serializations, so that is an option.
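To make the JSON option a little more concrete, here is a minimal Python sketch that requests SPARQL results in the standard SPARQL JSON results format and flattens them into plain rows. The endpoint URL used in the usage note (`http://localhost:8890/sparql`) and the `format` request parameter are assumptions about a typical Virtuoso setup, so check them against your own installation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def sparql_url(endpoint, query):
    """Build a GET URL asking the endpoint for SPARQL JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def bindings_to_rows(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    names = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding, so skip them.
        rows.append({n: binding[n]["value"] for n in names if n in binding})
    return rows


def run_query(endpoint, query):
    """Send the query over HTTP and return the flattened rows."""
    with urlopen(sparql_url(endpoint, query)) as resp:
        return bindings_to_rows(resp.read().decode("utf-8"))
```

For example, `run_query("http://localhost:8890/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 10")` would return a list of dicts that can be written out as JSON lines for Spark to read.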
We (OpenLink Software) also provide JDBC drivers for Virtuoso (again, both Open Source and Enterprise Edition), so that is also an option.
We are not Apache Spark experts, so we cannot provide much guidance for getting these working beyond assisting with Virtuoso JDBC URLs and/or retrieving SPARQL query results in JSON serialization.
In the other direction, Virtuoso (Enterprise Edition; not Open Source Edition) can be used to query against external ODBC data sources, and there are ODBC drivers available for Hadoop/SPARK data sources, so this is also an option.
We are not Apache Spark experts, so we cannot provide much guidance for getting their drivers working, but once you have a functional ODBC DSN on the Virtuoso host, we can assist in getting Virtuoso connected to and querying against it.
Are you seeking to upload RDF datasets from your Hadoop cluster using Spark jobs? If so, you can use JDBC and the connection to Virtuoso.
I stumbled upon a DZone article that covers Spark and JDBC; once you understand it, you can apply it to Virtuoso via its ability to process SPARQL queries over SQL connections.
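A minimal sketch of how a Spark JDBC read against Virtuoso might be wired up from PySpark is shown here. The driver class `virtuoso.jdbc4.Driver`, the `jdbc:virtuoso://host:port` URL form, and the default SQL port 1111 are assumptions based on the Virtuoso JDBC driver as I know it, and the table name in the usage note is hypothetical. Whether a SPARQL query wrapped as a derived table works as the `dbtable` value through Spark has not been tested here.

```python
def virtuoso_jdbc_options(host, port, user, password, table_or_subquery):
    """Collect the options Spark's generic JDBC data source expects."""
    return {
        "url": "jdbc:virtuoso://%s:%d" % (host, port),
        "driver": "virtuoso.jdbc4.Driver",  # assumed class name from the Virtuoso JDBC 4 jar
        "dbtable": table_or_subquery,       # a table name or a parenthesised subquery with an alias
        "user": user,
        "password": password,
    }


def read_from_virtuoso(spark, options):
    """Load a DataFrame through the JDBC source of an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()
```

Usage would look like `read_from_virtuoso(spark, virtuoso_jdbc_options("localhost", 1111, "dba", "secret", "DB.DBA.MY_TABLE"))`, with the Virtuoso JDBC jar on the Spark classpath.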
I hope that helps, if not, we can discuss further.

Is there a Spark SQL jdbc driver?

I'm looking for a client JDBC driver that supports Spark SQL.
I have been using Jupyter so far to run SQL statements on Spark (running on HDInsight) and I'd like to be able to connect using JDBC so I can use third-party SQL clients (e.g. SQuirreL, SQL Explorer, etc.) instead of the notebook interface.
I found an ODBC driver from Microsoft, but this doesn't help me with Java-based SQL clients. I also tried downloading the Hive JDBC driver from my cluster, but the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does. For example, the Hive driver complains about not supporting join statements that are not equi-joins, whereas I know this is a supported feature of Spark because I've executed the same SQL in Jupyter successfully.
"the Hive JDBC driver does not appear to support the more advanced SQL features that Spark does"
Regardless of the support that it provides, the Spark Thrift Server is fully compatible with Hive/Beeline's JDBC connection.
Therefore, that is the JAR you need to use. I have verified this works in DBVisualizer.
The alternative solution would be to run Spark code in your Java clients (non-third party tools) directly and skip the need for the JDBC connection.
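As a sketch of what the connection details for a SQL client could look like: the Hive JDBC driver (class `org.apache.hive.jdbc.HiveDriver`) takes URLs of the form `jdbc:hive2://host:port/database`, optionally followed by `;key=value` session parameters. The helper below only assembles that string. The default port 10000 and the HDInsight-style parameters in the usage note are assumptions, so take the real values from your cluster configuration.

```python
def thrift_jdbc_url(host, port=10000, database="default", session_params=None):
    """Assemble a jdbc:hive2 URL pointing at a Spark Thrift Server."""
    url = "jdbc:hive2://%s:%d/%s" % (host, port, database)
    # Session parameters are appended as ;key=value pairs in insertion order.
    for key, value in (session_params or {}).items():
        url += ";%s=%s" % (key, value)
    return url
```

For a plain binary-transport server, `thrift_jdbc_url("localhost")` gives `jdbc:hive2://localhost:10000/default`. For a cluster that exposes the server over HTTPS, something like `thrift_jdbc_url("example.azurehdinsight.net", 443, session_params={"transportMode": "http", "ssl": "true", "httpPath": "/sparkhive2"})` may be closer, where the host name and `httpPath` are hypothetical values.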

Performing Analytics over Cassandra DB

I work for a small company and am very new to Apache Cassandra. I am studying Cassandra and want to perform some simple analytics, such as a sum function, on a Cassandra database for creating reports. For this, Hive and Acunu appear to be options.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such integration? Is there any other way to integrate Hive and Cassandra? If so, can I get links or documents about it? Is it possible to do the same on the Windows platform?
Is there any other solution for performing analytics on a Cassandra database?
Thanks in advance.
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.
Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster colocated with the Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read/write data from/to your Hadoop cluster.
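For a simple report such as a sum, a lighter alternative to a Hadoop job may be CQL's built-in aggregate functions, which newer Cassandra releases (2.2 and later) provide. This is a different technique from the Hadoop input/output formats described above, shown only as a sketch. The keyspace, table and column names in the usage note are hypothetical, the identifiers are assumed to come from trusted code rather than user input, and `cassandra-driver` is a third-party package that is assumed to be installed.

```python
def sum_query(keyspace, table, column):
    """Build a CQL statement that sums one numeric column."""
    return "SELECT sum(%s) FROM %s.%s" % (column, keyspace, table)


def run_sum(contact_points, keyspace, table, column):
    """Execute the aggregate on a cluster and return the single result value."""
    from cassandra.cluster import Cluster  # third-party; imported lazily

    cluster = Cluster(contact_points)
    try:
        session = cluster.connect()
        row = session.execute(sum_query(keyspace, table, column)).one()
        return row[0]
    finally:
        cluster.shutdown()
```

For example, `run_sum(["127.0.0.1"], "shop", "orders", "amount")` would issue `SELECT sum(amount) FROM shop.orders`. An aggregate over a whole table scans every partition, so on large tables it may time out, and that is where a Hadoop or Spark job becomes the better fit.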
