Possible memory leak on Hadoop cluster? (Hive, HiveServer2, Zeppelin, Spark) - apache-spark

The heap usage of HiveServer2 is constantly increasing (first pic).
There are applications such as NiFi, Zeppelin, and Spark connected to Hive: NiFi uses the PutHiveQL processor, Zeppelin uses JDBC (Hive), and Spark uses Spark SQL. I couldn't find any clue to what is causing this.

Hive requires a lot of resources to establish a connection, so the first likely cause is the large number of queries going through your PutHiveQL processor, because Hive needs to open a connection for every one of them. Keep an eye on your Hive job browser (you can use Hue for this purpose).
Possible resolution: if you use INSERT queries, write the data as ORC files and load them in bulk instead; if you use UPDATE queries, use a temporary table and a MERGE query.
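A minimal sketch of the batching idea (an assumption, expressed with Spark rather than NiFi; the table and path names are made up): write each batch once as ORC instead of issuing one INSERT, and therefore one Hive connection, per record.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Hypothetical staging input; in practice this is whatever feeds the flow.
val incoming = spark.read.json("/staging/events.json")

// One bulk ORC write into the Hive table instead of one INSERT per record.
incoming.write.format("orc").mode("append").saveAsTable("events")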

Related

Caching DataFrame in Spark Thrift Server

I have a Spark Thrift Server. I connect to the Thrift Server and get the data of a Hive table. If I query the same table again, it loads the file into memory again and executes the query.
Is there any way I can cache the table data using the Spark Thrift Server? If yes, please let me know how to do it.
Two things:
use CACHE LAZY TABLE as in this answer: Spark SQL: how to cache sql query result without using rdd.cache() and cache tables in apache spark sql
use spark.sql.hive.thriftServer.singleSession=true so that other clients can use this cached table.
Remember that caching is lazy, so the table will only be cached during the first computation.
Note that the memory may be consumed by the driver, not the executors (depending on your settings: local/cluster, ...), so don't forget to allocate more memory to your driver.
To put data in:
CACHE TABLE today AS
SELECT * FROM datahub WHERE year=2017 AND fullname IN ("api.search.search") LIMIT 40000
Start by limiting the data, then watch how memory is consumed, to avoid an OOM exception.
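A hedged sketch of how the two pieces fit together, assuming the Thrift Server is started from code with HiveThriftServer2.startWithContext (the cached table and filter are the ones from the statement above; everything else is an assumption):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("cached-thrift-server")
  .config("spark.sql.hive.thriftServer.singleSession", "true")  // share one session (and its cache) across JDBC clients
  .enableHiveSupport()
  .getOrCreate()

// Lazy: nothing is materialized until the first query touches the cached table.
spark.sql("CACHE LAZY TABLE today AS SELECT * FROM datahub WHERE year=2017 AND fullname IN ('api.search.search') LIMIT 40000")

// JDBC clients (e.g. beeline) can now query `today` and hit the cache.
HiveThriftServer2.startWithContext(spark.sqlContext)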

Running Spark App: Persist Metastore

I work on a Spark 2.1 application that also uses SparkSQL and saves data with dataframe.write.saveAsTable(tbl). My understanding is that an in-memory Derby DB is used for the Hive metastore (right?). This means that a table that I create in the first execution is not available in any subsequent executions. In many cases that might be the intended behaviour - but I would like to persist the metastore across executions (since this is also the behavior I have in my production system).
So, a simple question: How can I change the configuration to persist the metastore on disc?
One remark: I am not starting the Spark job with spark-shell or spark-submit, but as a standalone Scala application.
It is already persisted on disk. As long as both sessions use the same working directory or the same explicit metastore configuration, the permanent table will persist between sessions.
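A minimal sketch of such a standalone application (an assumption, not the asker's code; the warehouse path is hypothetical). By default the embedded Derby metastore lives in a metastore_db directory under the working directory, so running from the same directory, and optionally pinning spark.sql.warehouse.dir, keeps the catalog stable across runs.
import org.apache.spark.sql.SparkSession

object PersistMetastoreApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("persist-metastore")
      .config("spark.sql.warehouse.dir", "/data/spark-warehouse")  // hypothetical; any stable absolute path
      .enableHiveSupport()                                         // needs the spark-hive dependency
      .getOrCreate()

    import spark.implicits._
    Seq((1, "a"), (2, "b")).toDF("id", "value")
      .write.mode("overwrite").saveAsTable("tbl")

    // A later run started from the same working directory (where Derby keeps
    // its metastore_db) sees the same table:
    spark.table("tbl").show()
    spark.stop()
  }
}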

Does Spark SQL use Hive Metastore?

I am developing a Spark SQL application and I've got a few questions:
I read that Spark SQL uses the Hive metastore under the covers. Is this true? I'm talking about a pure Spark SQL application that does not explicitly connect to any Hive installation.
I am starting a Spark SQL application, and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark SQL is much faster than Hive, so I don't see any reason to use Hive. But am I correct?
I read that Spark SQL uses the Hive metastore under the covers. Is this true? I'm talking about a pure Spark SQL application that does not explicitly connect to any Hive installation.
Spark SQL does not use a Hive metastore under the covers (it defaults to an in-memory, non-Hive catalog, unless you're in spark-shell, which does the opposite).
The default external catalog implementation is controlled by the internal spark.sql.catalogImplementation property and can be one of two values: hive and in-memory.
Use the SparkSession to find out which catalog is in use.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.version
res0: String = 2.4.0
scala> :type spark.sharedState.externalCatalog
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener
scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.hive.HiveExternalCatalog@49d5b651
Please note that I used spark-shell, which starts a Hive-aware SparkSession, so I had to start it with --conf spark.sql.catalogImplementation=in-memory to turn Hive support off.
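A minimal sketch of the same check in a standalone application (an assumption, not part of the original answer; it assumes Spark 2.4 as in the transcript above and the spark-hive dependency on the classpath):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("catalog-demo")
  .enableHiveSupport()   // drop this line (and keep the default) to get the in-memory catalog
  .getOrCreate()

// Prints the concrete ExternalCatalog in use, e.g. HiveExternalCatalog or InMemoryCatalog.
println(spark.sharedState.externalCatalog.unwrapped.getClass.getName)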
I am starting a Spark SQL application, and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark SQL is much faster than Hive, so I don't see any reason to use Hive.
That's a very interesting question and can have different answers (some even primarily opinion-based so we have to be extra careful and follow the StackOverflow rules).
Is there any reason to use Hive?
No.
But... if you want to use the cost-based optimizer introduced in Spark 2.2, you may want to consider Hive, because running ANALYZE TABLE for cost statistics can be fairly expensive, and doing it once for tables that are used over and over again across different Spark application runs can give a performance boost.
Please note that Spark SQL without Hive can do this too, but with some limitations, as the local default metastore only supports single-user access and reusing the metadata across Spark applications submitted at the same time won't work.
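A hedged illustration of that statistics workflow (the table and column names are made up; ANALYZE TABLE and spark.sql.cbo.enabled are standard Spark SQL, the rest is an assumption):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.cbo.enabled", "true")   // cost-based optimizer, available since Spark 2.2
  .enableHiveSupport()                       // a Hive metastore keeps the statistics across application runs
  .getOrCreate()

// Compute statistics once; later applications reuse them without re-running ANALYZE.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")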
I don't see any reason to use Hive.
I wrote a blog post, Why is Spark SQL so obsessed with Hive?! (after just a single day with Hive), where I asked a similar question, and to my surprise it's only now (almost a year after I posted it on Apr 9, 2016) that I think I may have understood why the concept of a Hive metastore is so important, especially in multi-user Spark notebook environments.
Hive itself is just a data warehouse on HDFS so not much use if you've got Spark SQL, but there are still some concepts Hive has done fairly well that are of much use in Spark SQL (until it fully stands on its own legs with a Hive-like metastore).
Spark will connect to a Hive metastore, or instantiate one if none is found, when you initialize a HiveContext() object or a spark-shell.
The main reason to use Hive is if you are reading HDFS data in from Hive's managed tables or if you want the convenience of selecting from external tables.
Remember that Hive is simply a lens for reading and writing HDFS files and not an execution engine in and of itself.
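For reference, a minimal sketch of the legacy (pre-Spark-2.0) HiveContext initialization the answer refers to; in Spark 2.x this is superseded by SparkSession.builder().enableHiveSupport(). The table name is made up.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-context-demo"))
// Connects to an existing Hive metastore, or instantiates a local Derby one if none is found.
val hiveContext = new HiveContext(sc)

// Reads a Hive managed (or external) table through the metastore.
hiveContext.sql("SELECT * FROM managed_table LIMIT 10").show()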

Improve reading speed of Cassandra in Spark (Parallel reads implementation)

I am new to Spark and trying to combine Cassandra and Spark to do some analytical tasks.
From the Spark web UI I found that most of the time is consumed by the reading process.
When I dug into this particular task, I found that only a single executor was working on it.
Is it possible to improve the performance of this task via some tricks like parallelization?
p.s. I am using the pyspark cassandra connector (https://github.com/TargetHolding/pyspark-cassandra).
UPDATE: I am using a 3-node Spark cluster running Spark 1.6 and a 3-node Cassandra cluster running Cassandra 2.2.4.
And I am selecting data in the form of
"select * from tbl where partitionKey IN [pk_1,pk_2,....,pk_N] where
clusteringKey > ck_1 and clusteringKey < ck_2"
UPDATE2: I've read an article suggesting that the IN clause be replaced with parallel reads (https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/). How can this be achieved in Spark?
I would be able to answer more precisely if you provided more details about the cluster, the Spark and Cassandra versions, and related configuration; still, I will try to answer as per my understanding.
Make sure your RDD (parallelized collection) is properly partitioned.
If your Spark job is running on only a single executor, please verify your spark-submit command. You can get more details about spark-submit options for your cluster manager here.
To speed up Cassandra read operations, make use of proper indexing. I would recommend Solr, which helps with fast data retrieval from Cassandra.
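The parallel-read pattern from the linked article can be sketched as follows (an assumption, using the DataStax Scala connector rather than pyspark-cassandra; keyspace, table, and column names are made up): replace the IN clause with a join against an RDD of partition keys, so each executor fetches its own keys.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallel-cassandra-read")
  .set("spark.cassandra.connection.host", "cassandra-host")   // hypothetical host
val sc = new SparkContext(conf)

// Distribute the partition keys so every executor fetches its own slice,
// instead of one task issuing a single big IN (...) query.
val keys = sc.parallelize(Seq("pk_1", "pk_2", "pk_3"))         // made-up partition keys
val rows = keys
  .map(Tuple1(_))                                              // tuple shape must match the partition key column(s)
  .joinWithCassandraTable("my_keyspace", "tbl")
  .where("clusteringKey > ? and clusteringKey < ?", "ck_1", "ck_2")

println(rows.count())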

Write a LIKE query using Cassandra

How do I write a LIKE query in Cassandra?
select * from user where user_name like '%abcd%'
How can I write this in CQL (Cassandra Query Language)?
I have to search some content based on a keyword.
If it doesn't need to be real-time, you could use Hive or Shark. This lets you run exactly the query you're talking about. If you use DSE, it works out of the box with Hive. If not, you'll want to check out this Hive driver. One way to run such a query through Spark SQL is sketched after the list below.
To get this working with open source Cassandra, you'll need:
HDFS running co-located with your Cassandra nodes
If you use Spark, you'll need Spark workers (ideally co-located as well, though this isn't a hard requirement)
Hive or Shark running on a machine that can access the cluster
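One way to express the query above through Spark SQL and the Cassandra connector, sketched under the assumption that the connector is on the classpath (the keyspace and host names are hypothetical); like the Hive route, this scans the table with a Spark-side filter rather than using an index.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-like-query")
  .config("spark.cassandra.connection.host", "cassandra-host")   // hypothetical host
  .getOrCreate()

val users = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "user"))  // made-up keyspace
  .load()

// Equivalent of: select * from user where user_name like '%abcd%'
users.filter("user_name LIKE '%abcd%'").show()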
