Caching DataFrame in Spark Thrift Server

I have a Spark Thrift Server. I connect to the Thrift Server and get data from a Hive table. If I query the same table again, it loads the file into memory again and executes the query.
Is there any way I can cache the table data using the Spark Thrift Server? If yes, please let me know how to do it.

Two things:
Use CACHE LAZY TABLE as in these answers: Spark SQL: how to cache sql query result without using rdd.cache() and cache tables in apache spark sql.
Use spark.sql.hive.thriftServer.singleSession=true so that other clients can use the cached table.
Remember that caching is lazy, so the table will only be cached during the first computation.

Note that the memory may be consumed by the driver rather than the executors (depending on your settings, local/cluster ...), so don't forget to allocate more memory to your driver.
To load the data:
CACHE TABLE today AS
SELECT * FROM datahub WHERE year=2017 AND fullname IN ("api.search.search") LIMIT 40000
Start by limiting the data, then watch how memory is consumed to avoid an OOM exception.
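In practice the CACHE LAZY TABLE statement can simply be sent through a beeline/JDBC session against the Thrift Server. As a minimal sketch (not part of the answer), the Scala below shows the same pieces programmatically, reusing the placeholder table and filter from the example above; note that the singleSession flag only has effect when it is set on the Thrift Server itself.

import org.apache.spark.sql.SparkSession

object CacheForThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-for-thrift-server")
      .config("spark.sql.hive.thriftServer.singleSession", "true") // share the cached table with JDBC clients
      .enableHiveSupport()
      .getOrCreate()

    // Lazy cache: nothing is materialized until the first query touches the table.
    spark.sql(
      """CACHE LAZY TABLE today AS
        |SELECT * FROM datahub WHERE year = 2017 AND fullname IN ("api.search.search") LIMIT 40000""".stripMargin)

    // Verify that the cache entry was registered.
    println(spark.catalog.isCached("today"))
  }
}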

Related

Spark local rdd Write to local Cassandra DB

I have a DSE cluster where every node in the cluster has both Spark and Cassandra running.
When I load data from Cassandra into a Spark RDD and perform some action on the RDD, I know the data is distributed across multiple nodes. In my case, I want to write these RDDs from every node directly to its local Cassandra DB table. Is there any way to do it?
If I do a normal RDD collect, all data from the Spark nodes is merged and goes back to the node running the driver.
I do not want this to happen, as the data flow from the nodes back to the driver node may take a long time; I want the data saved to the local node directly to avoid data movement across the Spark nodes.
When a Spark executor reads data from Cassandra, it sends the request to the "best node", which is selected based on different factors:
When Spark is collocated with Cassandra, Spark tries to pull data from the same node.
When Spark is on a different node, it uses token-aware routing and reads data from multiple nodes in parallel, as defined by the partition ranges.
When it comes to writing, and you have multiple executors, each executor opens multiple connections to each node and writes the data using token-aware routing, meaning that data is sent directly to one of the replicas. Also, Spark tries to batch multiple rows belonging to the same partition into an UNLOGGED BATCH, as this is more performant. Even if the Spark partition is colocated with the Cassandra partition, writing could involve additional network overhead, as the SCC writes using consistency level TWO.
You can get colocated data if you re-partition the data to match Cassandra's partitioning, but such a re-partition may induce a Spark shuffle that could be much more heavyweight than simply writing the data from an executor to another node.
P.S. You can find a lot of additional information about the Spark Cassandra Connector in Russell Spitzer's blog.
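Below is a minimal sketch (not part of the answer) of a locality-aware read and write with the open-source Spark Cassandra Connector; the keyspace, tables and case-class fields are placeholders, and the target table is assumed to share the partition key column with the row type.

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Placeholder row type; fields must match the Cassandra columns, with "id" as the partition key.
case class Record(id: Int, value: String)

object LocalCassandraWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-local-write")
      .config("spark.cassandra.connection.host", "127.0.0.1") // assumed local contact point
      .getOrCreate()
    val sc = spark.sparkContext

    // Read: the connector picks the "best" replica per token range, as described above.
    val rows = sc.cassandraTable[Record]("myspace", "source_table")

    // Optionally co-locate Spark partitions with the replicas of the target table before writing,
    // so that most writes land on a local replica (replication traffic still crosses the network).
    val colocated = rows.repartitionByCassandraReplica("myspace", "target_table")

    // Token-aware write; rows of the same partition are grouped into unlogged batches.
    colocated.saveToCassandra("myspace", "target_table")
  }
}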
A word of warning: I only use Cassandra and Spark as separate open source projects; I do not have expertise with DSE.
I am afraid the data needs to hit the network to replicate, even when every Spark node talks to its local Cassandra node.
Without replication, and with a Spark job that makes sure all data is hashed and preshuffled to the corresponding Cassandra node, it should be possible to use 127.0.0.1:9042 and avoid the network.

Possible memory leak on hadoop cluster? (hive, hiveserver2, zeppelin, spark)

The heap usage of HiveServer2 is constantly increasing (first pic).
There are applications such as NiFi, Zeppelin, and Spark related to Hive. NiFi uses PutHiveQL, Zeppelin uses JDBC (Hive), and Spark uses Spark SQL. I couldn't find any clue to this.
Hive requires a lot of resources to establish a connection, so the first suspect is the large number of queries from your PutHiveQL processor, because Hive needs to open a connection for every one of them. Keep an eye on your Hive job browser (you can use Hue for this purpose).
Possible resolution: e.g. if you use insert queries, use ORC files to insert the data; if you use update queries, use a temporary table and a merge query.
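As a rough, hedged illustration of that advice (not from the answer), the Scala sketch below reuses a single HiveServer2 JDBC connection, stages rows in a temporary table and moves them into an ORC-backed table in one statement; the host, tables and columns are invented for illustration.

import java.sql.DriverManager

object HiveBatchInsert {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // One long-lived connection instead of one connection per statement.
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver-host:10000/default", "etl_user", "")
    try {
      val stmt = conn.createStatement()
      // ORC-backed target table.
      stmt.execute("CREATE TABLE IF NOT EXISTS events_orc (id INT, payload STRING) STORED AS ORC")
      // Stage the incoming rows in a temporary table, then move them over in a single statement.
      stmt.execute("CREATE TEMPORARY TABLE events_stage (id INT, payload STRING)")
      stmt.execute("INSERT INTO TABLE events_stage VALUES (1, 'a'), (2, 'b')")
      stmt.execute("INSERT INTO TABLE events_orc SELECT id, payload FROM events_stage")
      stmt.close()
    } finally conn.close()
  }
}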

How to prevent Spark SQL + Power BI OOM

I'm testing Spark SQL as a query engine for Microsoft Power BI.
What I have:
A huge Cassandra table with data I need to analyze.
An Amazon server with 8 cores and 16 GB of RAM.
A Spark Thrift Server on this server. Spark version: 1.6.1.
A Hive table mapped to a huge Cassandra table.
create table data using org.apache.spark.sql.cassandra options (cluster 'Cluster', keyspace 'myspace', table 'data');
Everything was OK until I tried to connect Power BI to Spark. The problem is that Power BI tries to fetch all the data from the huge Cassandra table. Obviously, the Spark Thrift Server crashes with an OOM error. In this case I can't just add RAM to the Spark Thrift Server, because the Cassandra table with raw data is really huge. I also can't rely on a custom initial query on the BI side, because every time a user forgets to set this query, the server would crash.
The best approach I see is to automatically wrap all queries from BI in something like
SELECT * FROM (... BI select ...) LIMIT 1000000
That would be okay for the current use cases.
So, is it possible on the server side? How can I do it?
If not, how can I prevent Spark Thrift Server crashes? Is there a way to drop or cancel huge queries before getting an OOM?
Thanks.
OK, I found a magic configuration option that solves my problem:
spark.sql.thriftServer.incrementalCollect=true
When this option is set, Spark splits the data fetched by a volume-consuming query into chunks, retrieving the result one partition at a time instead of collecting everything on the driver at once.
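A sketch (not from the answer) of launching the Thrift Server with this option from a Spark application via HiveThriftServer2.startWithContext; it assumes a Spark 2.x-style SparkSession and the spark-hive-thriftserver module on the classpath, while on Spark 1.6 the same idea applies with a HiveContext. In a standalone deployment the property can equally be passed as --conf when starting the server.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerWithIncrementalCollect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("thrift-server-incremental")
      .config("spark.sql.thriftServer.incrementalCollect", "true") // stream results in chunks
      .enableHiveSupport()
      .getOrCreate()

    // Expose this session to JDBC/ODBC clients such as Power BI.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}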

Spark JDBC data fetch Optimization from relational database

a) Is there a way in which Spark can optimize the data fetch from a relational database compared to a traditional Java JDBC call?
b) How can we reduce the load on the database while running Spark queries, given that we will be hitting the production database directly for all queries? Assume 30 million order records and 150 million order line records in production for the Spark reporting case.
Re a)
You can of course .cache() the data frame in your Spark app to avoid repeated executions of the JDBC query for that data frame during the lifetime of your Spark app.
You can read the data frame in via range-partitioned parallel JDBC calls using the partitionColumn, lowerBound, upperBound and numPartitions properties; this makes sense for distributed (partitioned) database backends. (See the sketch after this answer.)
You can use an integrated Spark cluster with a distributed database engine such as IBM dashDB, which runs Spark executors co-located with the database partitions and exercises local IPC data exchange mechanisms between Spark and the database: https://ibmdatawarehousing.wordpress.com/category/theme-ibm-data-warehouse/
Re b) The above-mentioned Spark-side caching can help, if applicable. In addition, the JDBC data source in Spark does try to push down projections and filter predicates from your Spark SQL / data frame operations to the underlying SQL database. Check the resulting SQL that hits the database.
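A minimal sketch of the range-partitioned JDBC read and Spark-side cache mentioned under a); the JDBC URL, credentials, table and bounds are placeholders only.

import org.apache.spark.sql.SparkSession

object JdbcPartitionedRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/prod") // placeholder connection string
      .option("dbtable", "orders")
      .option("user", "report_user")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("partitionColumn", "order_id") // numeric column used to split the range
      .option("lowerBound", "1")
      .option("upperBound", "30000000")      // ~30 million order records per the question
      .option("numPartitions", "12")         // 12 parallel JDBC connections
      .load()
      .cache()                               // avoid re-reading from the database within this app

    // Projections and filters are pushed down to the database where possible;
    // inspect the physical plan (and the database's query log) to confirm.
    orders.select("order_id", "status").where("status = 'OPEN'").explain()
  }
}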

GridGain - Load data from HBase

I have an application that queries HBase and persists that data to an Oracle database. The data is used for report generation. I would like to know whether the same approach is possible when using GridGain in place of the Oracle database. Is it possible to get the HBase data, load it into the GridGain in-memory cache, and use it for generating reports?
I think the approach of loading the data into a GridGain cache is the right one. Once loaded, you will be able to run either Parallel Computations or Distributed SQL Queries directly over the GridGain cache.
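The answer is high-level, so here is a hedged sketch of what such a load could look like, assuming the Apache Ignite API that recent GridGain versions are built on; the HBase scan itself is omitted, and the row type, cache name and values are invented for illustration.

import org.apache.ignite.{Ignite, Ignition}

object HBaseToGridGain {
  // Placeholder for rows already fetched from HBase (the HBase scan itself is omitted).
  case class ReportRow(rowKey: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val ignite: Ignite = Ignition.start()                          // start/join a GridGain (Ignite) node
    val cache = ignite.getOrCreateCache[String, Double]("reports") // simple key-value cache

    val hbaseRows = Seq(ReportRow("r1", 10.5), ReportRow("r2", 7.25))

    // Bulk-load with a data streamer, the idiomatic way to push large volumes into a cache.
    val streamer = ignite.dataStreamer[String, Double]("reports")
    try hbaseRows.foreach(r => streamer.addData(r.rowKey, r.amount))
    finally streamer.close()

    // Reports can then read from the in-memory cache instead of Oracle.
    println(cache.get("r1"))
  }
}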
