How to prevent Spark SQL + Power BI OOM

I'm currently testing Spark SQL as a query engine for Microsoft Power BI.
What I have:
A huge Cassandra table with data I need to analyze.
An Amazon server with 8 cores and 16 GB of RAM.
A Spark Thrift Server running on that machine (Spark version 1.6.1).
A Hive table mapped to a huge Cassandra table.
create table data using org.apache.spark.sql.cassandra options (cluster 'Cluster', keyspace 'myspace', table 'data');
All was OK until I tried to connect Power BI to Spark. The problem is that Power BI tries to fetch all the data from the huge Cassandra table, and the Spark Thrift Server predictably crashes with an OOM error. I can't just add RAM to the Spark Thrift Server, because the Cassandra table with raw data is really huge. I also can't rely on a custom initial query on the BI side, because every time a user forgot to set that query the server would crash.
The best approach I can see is to automatically wrap every query from BI in something like
SELECT * FROM (... BI select ...) LIMIT 1000000
That would be fine for our current use cases.
So, is this possible on the server side? How can I do it?
If not, how can I prevent the Spark Thrift Server from crashing? Is there a way to drop or cancel huge queries before they hit an OOM?
Thanks.

OK, I found a configuration option that solves my problem:
spark.sql.thriftServer.incrementalCollect=true
When this option is set, Spark splits the data fetched by a volume-consuming query into chunks instead of collecting the whole result set on the driver.
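For reference, a minimal way to apply it, assuming the standard Spark distribution layout, is to pass the option when launching the Thrift Server:
./sbin/start-thriftserver.sh --conf spark.sql.thriftServer.incrementalCollect=true
It can also go into conf/spark-defaults.conf so that it survives restarts.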

Related

Possible memory leak on Hadoop cluster? (Hive, HiveServer2, Zeppelin, Spark)

The heap usage of HiveServer2 is constantly increasing.
Several applications are connected to Hive: NiFi (via the PutHiveQL processor), Zeppelin (via the Hive JDBC interpreter), and Spark (via spark-sql). I couldn't find any clue to the cause.
Hive requires a lot of resources to establish a connection, so the first likely cause is the large number of queries issued by your PutHiveQL processor: Hive has to open a connection for every one of them. Keep an eye on your Hive job browser (you can use Hue for this purpose).
Possible resolutions: if you use INSERT queries, load the data from ORC files instead; if you use UPDATE queries, stage the changes in a temporary table and apply them with a MERGE query, as sketched below.
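A rough sketch of that staging-and-merge pattern (the table and column names here are hypothetical, and MERGE requires a transactional, ACID-enabled Hive table):
CREATE TEMPORARY TABLE staging_events (id BIGINT, status STRING) STORED AS ORC;
-- bulk-load the changed rows once, instead of issuing one UPDATE per row
INSERT INTO staging_events VALUES (1, 'done'), (2, 'failed');
MERGE INTO events AS t USING staging_events AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET status = s.status
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.status);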

Caching DataFrame in Spark Thrift Server

I have a Spark Thrift Server. I connect to the Thrift Server and fetch data from a Hive table. If I query the same table again, it loads the files into memory again and re-executes the query.
Is there any way I can cache the table data using the Spark Thrift Server? If yes, please let me know how to do it.
Two things:
use CACHE LAZY TABLE as in this answer: Spark SQL: how to cache sql query result without using rdd.cache() and cache tables in apache spark sql (a lazy variant is sketched at the end of this answer)
use spark.sql.hive.thriftServer.singleSession=true so that other clients can use this cached table.
Remember that caching is lazy, so the table is only materialized during the first computation.
Note that the memory may be consumed by the driver rather than the executors (depending on your setup: local vs. cluster mode), so don't forget to allocate more memory to your driver.
To load the data:
CACHE TABLE today AS
SELECT * FROM datahub WHERE year=2017 AND fullname IN ("api.search.search") LIMIT 40000
Start by limiting the data, then watch how memory is consumed, to avoid an OOM exception.
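For completeness, a minimal sketch of the lazy variant mentioned above, reusing the same table names (LAZY defers materialization until the first query that reads the cached table):
CACHE LAZY TABLE today AS
SELECT * FROM datahub WHERE year=2017 AND fullname IN ("api.search.search") LIMIT 40000;
-- free the memory once the cached table is no longer needed
UNCACHE TABLE today;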

How to submit hive query to spark thrift server?

Here is the short story:
A BI tool (Power BI) connects to a Spark cluster and uses the HiveThriftServer2 application to get aggregated data via Hive queries.
However, each query takes a lot of time, because the data is read from files every time. I would like to cache my table in this application, and I am looking for a way to send the query "cache table myTable" through the same channel, so that subsequent queries run quickly.
What would be a solution for sending a Hive query to a specific application? If it matters, the application is a Thrift service of Spark.
Thanks a lot!
It looks like I succeeded by installing the Spark ODBC driver, using it to connect to the Thrift Server, and sending the SQL query "cache table xxx". I wonder if there is a more elegant way.
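A possibly more scriptable alternative, assuming the Thrift Server listens on the default port 10000, is to send the same statement through the beeline JDBC client that ships with Spark:
beeline -u jdbc:hive2://localhost:10000 -e "CACHE TABLE myTable"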

connecting to spark data frames in tableau

We are trying to generate reports in Tableau via Spark SQL connectivity, but I found out that we are ultimately connecting to the Hive metastore.
If this is the case, what are the advantages of this new Spark SQL connection? Is there a way to connect from Tableau, using Spark SQL, to Spark DataFrames that are persisted?
The problem here is a Tableau problem more than a Spark problem. The Spark SQL Connector launches a Spark job each time you connect to a database. Part of that Spark job loads the underlying Hive table into the distributed memory that Spark manages, and each time you make a change or a selection on a graph, the refresh has to go a level deeper, to the Hive metastore, to get the data through Spark. That is how Tableau is designed. The only option here is to swap Tableau for Spotfire (or some other tool) where, by pre-caching the underlying Hive table, the Spark SQL Connector can query it directly from Spark distributed memory, skipping the load step.
Disclosure: I am in no way associated with Spotfire makers

Possibilities of Hadoop with MSSQL Reporting

I have been evaluating Hadoop on Azure HDInsight to find a big-data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with MSSQL Reporting Services, as that is what our application already uses. We are very short on developer resources, so the more I can make this into an engineering exercise, the better. What I have tried so far:
Use an ODBC connection from MSSQL mapped to Hive on HDInsight.
Use an ODBC connection from MSSQL using HBase on HDInsight.
Use Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower to use with our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds. I ran the query in the Hive query console and over the ODBC connection and found that it took over a minute to execute. Spark was faster (30 seconds), but there is no way to connect to it externally, since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is: am I asking Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but it sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure that is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at ORC and at using Tez. I would create a cluster that has Tez on by default and then store your data in ORC format. Your queries should then be much more performant.
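As a rough sketch of those two changes (the table names are hypothetical), enable Tez for the session and rewrite the data as ORC:
set hive.execution.engine=tez;
CREATE TABLE reports_orc STORED AS ORC AS SELECT * FROM reports_staging;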
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it, and while the performance is not comparable to what you'll get with MSSQL, another RDBMS, or something like Elasticsearch, it does work pretty reliably.
