Tableau connection to Spark SQL

I am trying to connect Tableau Desktop 10 (Mac) to Spark SQL 2.1 (on a CentOS 7 server). I am connecting via the Simba ODBC driver with Authentication = Username and Username = . The connection doesn't report any error, but I don't see the tables that are available in Hive. After searching for and choosing the 'default' schema and then searching for tables, I see only a single table named default (default.default). However, when I use beeline on the server to connect to Spark SQL, the Hive tables are visible.
If I use the custom SQL feature, I can query the tables and use the data, but I still have no way to list the tables in Tableau.
I am not sure whether the issue is on the Tableau side or the Spark side. I'd greatly appreciate any help with troubleshooting this issue.

The reason for this behaviour is the following.
In Spark 2.0, the output columns of SHOW TABLES are 'tableName', 'isTemporary'.
In Spark 2.1, the output columns of SHOW TABLES are 'database', 'tableName', 'isTemporary'.
Tableau 10.2.3 and later can parse the Spark 2.1 output, but 10.2.1 and earlier cannot parse this new format.
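To make the mismatch concrete, here is a minimal Python sketch of a client that reads the table name from a fixed column position. It is not Tableau's or the ODBC driver's actual code; the function names and the sample table name ztest7 are purely illustrative, and only the two row layouts come from the description above.

```python
# Hypothetical illustration: why a client written for the Spark 2.0
# SHOW TABLES layout misreads Spark 2.1 output.

def table_name_spark20_layout(row):
    """Assumes rows look like (tableName, isTemporary)."""
    return row[0]

def table_name_spark21_layout(row):
    """Assumes rows look like (database, tableName, isTemporary)."""
    return row[1]

row_20 = ("ztest7", False)             # Spark 2.0 style row
row_21 = ("default", "ztest7", False)  # Spark 2.1 style row

# A parser that expects the 2.0 layout picks up the database name
# from a 2.1 row instead of the table name.
misread = table_name_spark20_layout(row_21)
correct = table_name_spark21_layout(row_21)
print(misread, correct)
```

A positional misread of this kind would be consistent with the symptom in the question, where the only table listed is named default.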

Related

How to prevent Spark SQL + Power BI OOM

I am testing Spark SQL as a query engine for Microsoft Power BI.
What I have:
A huge Cassandra table with data I need to analyze.
An Amazon server with 8 cores and 16 GB of RAM.
A Spark Thrift Server running on this server. The Spark version is 1.6.1.
A Hive table mapped to the huge Cassandra table:
create table data using org.apache.spark.sql.cassandra options (cluster 'Cluster', keyspace 'myspace', table 'data');
Everything was fine until I tried to connect Power BI to Spark. The problem is that Power BI tries to fetch all the data from the huge Cassandra table, and the Spark Thrift Server then crashes with an OOM error. I can't just add RAM to the Spark Thrift Server, because the Cassandra table with the raw data is really huge. I also can't rely on a custom initial query on the BI side, because the server would crash every time a user forgot to set that query.
The best approach I see is to automatically wrap every query from the BI tool in something like
SELECT * FROM (... BI select ...) LIMIT 1000000
That would be fine for the current use cases.
So, is this possible on the server side? How can I do it?
If not, how can I prevent the Spark Thrift Server from crashing? Is it possible to drop or cancel huge queries before they cause an OOM?
Thanks.

OK, I found a configuration option that solves my problem:
spark.sql.thriftServer.incrementalCollect=true
When this option is set, Spark splits the result of a large query into chunks as it is fetched, instead of loading it all at once.
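As far as I understand it, with this option the Thrift Server hands rows to the client through an iterator that pulls one partition at a time to the driver, instead of collecting the entire result first. The pure-Python sketch below only models that difference in peak memory; partitions, collect_all and incremental are made-up names, and nothing here calls Spark.

```python
# Conceptual model only; does not use Spark.
from typing import Iterator, List

def collect_all(partitions: List[List[int]]) -> List[int]:
    """Materialise every row on the driver before returning anything."""
    rows: List[int] = []
    for part in partitions:
        rows.extend(part)
    return rows

def incremental(partitions: List[List[int]]) -> Iterator[int]:
    """Yield rows one partition at a time; only one partition is held."""
    for part in partitions:
        yield from part

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Rows held in memory at the worst moment under each strategy.
peak_all = len(collect_all(partitions))              # every row at once
peak_incremental = max(len(p) for p in partitions)   # largest single partition

print(peak_all, peak_incremental)
```

Both strategies return the same rows in the same order; the trade-off is that the incremental path bounds driver memory by the largest partition, usually at some cost in speed. The option can be passed with --conf when starting the Thrift Server or set in spark-defaults.conf.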

Hive tables Not Visible in Tableau

I have created a table, ztest7, in the default database in Hive. I can query it using beeline. In Tableau, I can query it using custom SQL.
However, the table does NOT show up when I search for it.
Am I missing something here?
Tableau Desktop Version = v10.1.1
Hive = v2.0.1
Spark = v2.1.0
Best Regards

I have the same issue with Tableau Desktop 10 (Mac) connecting to Hive (2.1.1) via Spark SQL 2.1 (on a CentOS 7 server).
This is what I got from Tableau Support:
In Tableau Desktop, the ability to connect to Spark SQL without
defining a default schema is not currently built into the product.
As a preliminary step, to define a default schema, configure the Spark
SQL hivemetastore to utilize a SchemaRDD or DataFrame. This must be
defined in the Hive Metastore for Tableau Desktop to be able to access
it. Pure schema-less Spark RDDs cannot be queried by Spark SQL
because of the lack of a schema. RDDs can be converted into
SchemaRDDs, which have additional schema metadata as Spark SQL
provides access to SchemaRDDs. When a SchemaRDD is created, it is only
available in the local namespace or context, and is unavailable to
external services accessing Spark through ODBC and the Spark Thrift
Server. For Tableau to have access, the SchemaRDD needs to be
registered in a catalog that is available outside of just the local
context; the Hive Metastore is currently the only supported service.
I don't know how to check/implement this.
PS: I would have posted this as a comment, but I am not allowed to because I am new to Stack Overflow.

In the field labeled Table on the left side of the screen, try selecting Contains, entering part of your table name, and pressing Enter.

I ran into a similar issue. In my case, I had loaded the tables using Hive, but the Tableau connection to the data source was made using Impala, as shown in the image below.
To fix the issue of not seeing the tables in the Tableau dropdown, try running INVALIDATE METADATA database.table_name in the Impala interface. This fixed the problem for me. Impala caches table metadata, so tables created through Hive are not visible to it until that metadata is refreshed.
For more on why this fixes the issue, refer to this link.

Unable to Connect Tableau with Cassandra

I was trying to connect Tableau with Cassandra.
Tableau version: 10.0 (I also tried 8.3)
Cassandra version: 3.0.8
DataStax Enterprise Server 5.0.2
I installed the DataStax ODBC driver 2.4 (64-bit) and configured a DSN (Data Source Name). The connection to Cassandra was successful when I tested it from the ODBC Data Source Administrator.
But when I tried to connect from Tableau, I got this error:
I was able to connect to Cassandra from DataStax DevCenter, so I think the problem is either on the Tableau end or in the driver itself.
I tried both Tableau 10.0 and 8.3; neither works.
Here are the error logs from the DataStax ODBC driver:
Oct 14 14:25:04.869 ERROR 5376 Statement::SQLPrepareW: [DataStax][CassandraODBC] (10) Error while executing a query in Cassandra: [33562624] : line 1:7 no viable alternative at input '1' (SELECT 1)
Oct 14 14:39:56.491 ERROR 6112 Statement::SQLPrepareW: [DataStax][CassandraODBC] (10) Error while executing a query in Cassandra: [33562624] : line 1:7 no viable alternative at input '1' (SELECT 1)
It seems the ODBC driver was not able to compose a valid CQL query.
Can someone help me? Thanks.
I followed these instructions:
http://www.datastax.com/dev/blog/datastax-odbc-cql-connector-apache-cassandra-datastax-enterprise

When you select "Other Databases (ODBC)" in Tableau and choose the DSN you created, make sure to fill out the Server, Port, and Database fields.
As the error says, the server's permission settings could also be a factor.
Also, once you connect to the server, in the Data Source tab manually type in the schema name (which is your keyspace) and click + to add it to Tableau; then, in the Table field, type in your table name and click +.

The ODBC Driver isn't supported for DSE 5.0.
https://docs.datastax.com/en/developer/driver-matrix/doc/common/driverMatrix.html

Simba Spark ODBC Driver is not working with MS Excel

I am using Spark (1.5.0) for its Spark SQL feature through the Spark ThriftServer application, and I am using the Simba Spark ODBC Driver for the connection.
Using Tableau, I am able to connect and run Spark SQL operations.
But when I tried to connect MS Excel to Spark SQL, it connected but did not list database and table names. I also tried the Microsoft Query option of MS Excel, as described in the documentation, to execute custom SQL queries (select * from default.airline), but it throws an error because the query is sent as (select * from SPARK.default.airline), with the catalog name SPARK.
The problem is how to remove that catalog name from the query. I have tried all the available options.

I work as a Sales Engineer with Simba. The Simba Spark driver should work in Excel with both MS Query and through the Connection Wizard.
Can you please provide more information on this problem? You can enable driver logging through the configuration options in ODBC Administrator. Choose your DSN, go to logging options, and set it to TRACE.
Then restart Excel and try the query again.
Send the logs and a screenshot of your DSN to sales#simba.com
Thanks,
Jeff

Spark Sql JDBC Support

We are currently building a reporting platform and have been using Shark as the data store. Since development of Shark has stopped, we are evaluating Spark SQL. Based on our use cases, we have a few questions.
1) We have data from various sources (MySQL, Oracle, Cassandra, Mongo). How can we get this data into Spark SQL? Is there a utility we can use? Does this utility support continuous refresh of data (syncing new adds/updates/deletes on the data store to Spark SQL)?
2) Is there a way to create multiple databases in Spark SQL?
3) For the reporting UI we use Jasper, and we would like to connect from Jasper to Spark SQL. Our initial search suggested that there is currently no support for consumers to connect to Spark SQL through JDBC, but that it is planned for future releases. When will Spark SQL have a stable release with JDBC support? Meanwhile, we took the source code from https://github.com/amplab/shark/tree/sparkSql but had some difficulty setting it up locally and evaluating it. It would be great if you could help us with setup instructions. (I can share the issue we are facing; please let me know where I can post the error logs.)
4) We also require a SQL prompt where we can execute queries. Currently the Spark shell provides a Scala prompt where Scala code can be executed, and SQL queries can be fired from that Scala code. As with Shark, we would like to have a SQL prompt in Spark SQL. Our search suggested that this will be added in a future release of Spark. It would be great if you could tell us which release will address this.

As for:
3) Spark 1.1 provides better support for the Spark SQL ThriftServer interface, which you may want to use for JDBC interfacing. Hive JDBC clients that support v0.12.0 are able to connect to and interface with such a server.
4) Spark 1.1 also provides a Spark SQL CLI that can be used for entering queries, in the same fashion as the Hive CLI or Impala Shell.
Please provide more details about what you are trying to achieve for 1 and 2.

I can answer (1):
Apache Sqoop was made specifically to solve this problem for the relational databases. The tool was made for HDFS, HBase, and Hive -- as such it can be used to make data available to Spark, via HDFS and the Hive metastore.
http://sqoop.apache.org/
I believe Cassandra is available to SparkContext via this connector from DataStax: https://github.com/datastax/spark-cassandra-connector -- which I have never used.
I'm not aware of any connector for MongoDB.

1) We have data from various sources( MySQL, Oracle, Cassandra, Mongo)
You have to use a different driver for each case. For Cassandra there is the DataStax driver (but I encountered some compatibility problems with Spark SQL). For any SQL system you can use JdbcRDD. The usage is straightforward; look at this Scala example:
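    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    // Assumes an enclosing ScalaTest suite that declares `var sc: SparkContext`.
    test("basic functionality") {
      sc = new SparkContext("local", "test")
      val rdd = new JdbcRDD(
        sc,
        // Each partition opens its own connection.
        () => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") },
        // The two ? placeholders are bound to each partition's lower and upper ID.
        "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
        1, 100, 3, // lowerBound, upperBound, numPartitions
        (r: ResultSet) => { r.getInt(1) }).cache()
      assert(rdd.count === 100)
      assert(rdd.reduce(_ + _) === 10100)
    }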
But note that it's just an RDD, so you should work with this data through the map-reduce API, not in SQLContext.
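The arguments 1, 100, 3 in the example are the lower bound, upper bound, and number of partitions. The Python sketch below shows one way such an inclusive ID range can be split evenly so that each partition runs the query with its own pair of bounds. The split_bounds helper is hypothetical, and the arithmetic is my assumption of an even split rather than a copy of Spark's implementation.

```python
# Sketch of an even split of [lower, upper] into num_partitions inclusive
# sub-ranges, one per partition. Assumed to resemble what JdbcRDD does.

def split_bounds(lower, upper, num_partitions):
    length = upper - lower + 1
    ranges = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        ranges.append((start, end))
    return ranges

query = "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?"
for start, end in split_bounds(1, 100, 3):
    # Each partition would run the query with its own pair of bounds.
    print(query.replace("?", "{}").format(start, end))
```

Because every partition issues its own bounded query, the ID column should be spread reasonably evenly across the range; otherwise some partitions end up with most of the rows.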
Does there exist any utility which we can use?
There is the Apache Sqoop project, but it's still in active development. The current stable version doesn't even save files in Parquet format.

Spark SQL is a capability of the Spark framework. It shouldn't be compared to Shark because Shark is a service. (Recall that with Shark, you run a ThriftServer that you can then connect to from your Thrift app or even ODBC.)
Can you elaborate on what you mean by "get this data into Spark SQL"?

There are a couple of Spark-MongoDB connectors:
- the MongoDB connector for Hadoop (which doesn't actually need Hadoop at all!) https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
- the Stratio MongoDB connector https://github.com/Stratio/spark-mongodb

If your data is huge and you need to perform a lot of transformations, Spark SQL can be used for ETL; otherwise Presto could solve all your problems. Addressing your queries one by one:
As your data is in MySQL, Oracle, Cassandra, and Mongo, all of these can be integrated into Presto, as it has connectors (https://prestodb.github.io/docs/current/connector.html) for all these databases.
Once you install Presto in cluster mode, you can query all these databases together on one platform, which also lets you join a table from Cassandra with tables from Mongo; this flexibility is unparalleled.
Presto can be connected to Apache Superset (https://superset.incubator.apache.org/), which is open source and provides a full set of dashboarding features. Presto can also be connected to Tableau.
You can install MySQL Workbench with the Presto connection details, which gives you a UI for all your databases in one place.
