Enable inserts into cassandra database in presto on spark - apache-spark

I am working on presto on spark , I have noticed that with this package inserting data into databases only work with hive-hadoop connector. All the other connectors return "connector does not support page sink commit" when trying insert queries. I want to use presto on spark to execute my queries and insert the result into cassandra. Any idea how i can acheive this?? Thanks in advance

Related

How do I drop a Cassandra table with Spark JDBC?

I'm writing a Spark-based app and have to drop some tables in Cassandra DB.
I know how to read from tables with spark.read.format("jdbc"). I know how to save dataframe with df.write.format("jbdc").
But how can I drop a table that I don't need anymore?
To drop a Cassandra table, you can simply use the Spark SQL DROP TABLE command:
spark.sql("DROP TABLE table_name")
Note that the JDBC API is limited so our general recommendation is to use the Spark Cassandra connector. It is fully open-source so is free to use.
The Spark Cassandra connector is a library specifically designed for connecting to Cassandra clusters from Spark applications. Cassandra tables are exposed as DataFrames or RDDs and the connector also allows execution of the full CQL API. Cheers!

How to submit hive query to spark thrift server?

Here is the short story:
A BI tool (PowerBI) connects to Spark cluster and uses HiveThriftServer2 application to get aggregated data via hive queries.
However, each query takes a lot of time since every time it reads data from files. I would like to cache my table in this application and looking for the way to send query "cache table myTable" through same channel, so next queries will run quick.
What would be a solution to send hive query to specific application? If it matters, the application is a thrift service of Spark.
Thanks a lot!
Looks like I succeed to do it, by installing Spark Odbc driver and using it to connect to thift server and send the sql query "cache table xxx". I wonder if there is more elegant way

Spark SQL push down of Cassandra UDF?

Following this questions on Spark SQL I'm wondering if Spark SQL with the Cassandra connector is able to push down the UDF's present in the SQL query to Cassandra UDF (if it exists).
I tried to have a look at the source but I wasn't able to get a clear answer.
No, there is currently no support for pushing down udfs.

connecting to spark data frames in tableau

We are trying to generate reports in tableau by spark SQL connectivity, But i found out that we are ultimately connecting to hive meta-store.
If this is the case what are the advantages of this new spark SQL connection. Is there a way to connect to spark data frames that are persisted, from tableau using spark SQL.
The problem here is a Tableau problem more than a Spark problem. Spark SQL Connector launches a Spark job each time you connect to a database. Part of that Spark job loads the underlying Hive table into the distributed memory that Spark manages, and each time you make a change or select on a graph, the refresh has to go a level deeper to Hive metastore to get the data, through Spark. That is how Tableau is designed. The only option here is to change Tableau for Spotfire (or some other tool) where by pre-caching the underlying Hive table, the Spark SQL Connector can query it directly from Spark distributed memory, skipping the load step.
Disclosure: I am in no way associated with Spotfire makers

Spark Sql JDBC Support

Currently we are building a reporting platform as a data store we used Shark. Since the development of Shark is stopped so we are in the phase of evaluating Spark SQL. Based on the use cases we have we had few questions.
1) We have data from various sources( MySQL, Oracle, Cassandra, Mongo). We would like to know how can we get this data into Spark SQL? Does there exist any utility which we can use? Does this utility support continuous refresh of data (sync of new add/update/delete on data store to Spark SQL?
2) Is the a way to create multiple database in Spark SQL?
3) For Reporting UI we use Jasper, we would like to connect from Jasper to Spark SQL. When we did our initial search we got to know currently there is no support for consumer to connect Spark SQL through JDBC, but in future releases you would like the add the same. We would like to know by when Spark SQL would have a stable release which would have JDBC Support? Meanwhile we took the source code from https://github.com/amplab/shark/tree/sparkSql but we had some difficulty in setting it up locally and evaluating it . It would be great if you can help us with setup instructions.(I can share the issue we are facing please let me know where can I post the error logs)
4) We would also require a SQL prompt where we can execute queries, currently Spark Shell provides SCALA prompt where SCALA code can be executed, from SCALA code we can fire SQL queries. Like Shark we would like to have SQL prompt in Spark SQL. When we did our search we found that in future release of Spark this would be added. It would be great if you can tell us which release of Spark would address the same.
as for
3) Spark 1.1 provides better support for SparkSQL ThriftServer interface, which you may want to use for JDBC interfacing. Hive JDBC clients that support v. 0.12.0 are able to connect and interface with such server.
4) Spark 1.1 also provides a SparkSQL CLI interface that can be used for entering queries. In the same fashion that Hive CLI or Impala Shell.
Please, provide more details about what you are trying to achieve for 1 and 2.
I can answer (1):
Apache Sqoop was made specifically to solve this problem for the relational databases. The tool was made for HDFS, HBase, and Hive -- as such it can be used to make data available to Spark, via HDFS and the Hive metastore.
http://sqoop.apache.org/
I believe Cassandra is available to SparkContext via this connector from DataStax: https://github.com/datastax/spark-cassandra-connector -- which I have never used.
I'm not aware of any connector for MongoDB.
1) We have data from various sources( MySQL, Oracle, Cassandra, Mongo)
You have to use different driver for each case. For cassandra there is datastax driver (but i encountered some compatibility problems with SparkSQL). For any SQL system you can use JdbcRDD. The usage is straightforward, look at the scala example:
test("basic functionality") {
sc = new SparkContext("local", "test")
val rdd = new JdbcRDD(
sc,
() => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") },
"SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
1, 100, 3,
(r: ResultSet) => { r.getInt(1) } ).cache()
assert(rdd.count === 100)
assert(rdd.reduce(_+_) === 10100)
}
But notion that it's just an RDD, so you should work with this data through map-reduce api, not in SQLContext.
Does there exist any utility which we can use?
There is Apache Sqoop project but it's in active development state. The current stable version even doesn't save files in parquet format.
Spark SQL is a capability of the Spark framework. It shouldn't be compared to Shark because Shark is a service. (Recall that with Shark, you run a ThriftServer that you can then connect to from your Thrift app or even ODBC.)
Can you elaborate on what you mean by "get this data into Spark SQL"?
There are a couple of Spark - MongoDB connectors:
- the mongodb connector for hadoop (which doesn't actually need Hadoop at all!) https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
the Stratio mongodb connector https://github.com/Stratio/spark-mongodb
If your data is huge and need to perform a lot of transformations then Spark SQL can be used for ETL purpose, else presto could solve all your problems. Addressing your queries one by one:
As your data is in MySQL, Oracle, Cassandra, Mongo all these can be integrated in Presto as it has connectors https://prestodb.github.io/docs/current/connector.html for all these databases.
Once you install Presto in cluster mode you can query all these databases together in one platform, which also provides to join a table from Cassandra and other tables from Mongo, this flexibility is unparalleled.
Presto can be used to connect to Apache Superset https://superset.incubator.apache.org/ which is open source and provides all sets Dashboarding. Also Presto can be connected to Tableau.
You can install MySQL workbench with presto connecting details which helps in providing a UI for all your databases at one place.

Resources