Use Spark SQL JDBC Server/Beeline or spark-sql - apache-spark

In Spark SQL, there are two options for submitting SQL.
spark-sql: each SQL statement kicks off a new Spark application.
Spark JDBC Server and Beeline: the JDBC server is a long-running standalone Spark application, and the SQL statements submitted to it share its resources.
We have about 30 big SQL queries; each would like to occupy 200 cores and 800 GB of memory to finish in a reasonable time (30 minutes).
Between spark-sql and the JDBC server/Beeline, which option is better for my case?
My inclination is to use spark-sql, because I have no idea how many resources the JDBC server should be given to make my queries finish in a reasonable time.
If I submit the 30 queries to the JDBC server, how many resources (cores/memory) should it be given (5000+ cores and 10 TB+ of memory?)?
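As a minimal sketch of the second option (the host, port, table name, and credentials are hypothetical), a client submits SQL to the running JDBC/Thrift server over a plain JDBC connection, just as Beeline does, and every such connection shares the server's long-running executors:
import java.sql.DriverManager
// The Hive JDBC driver (org.apache.hive:hive-jdbc) must be on the classpath.
Class.forName("org.apache.hive.jdbc.HiveDriver")
// Hypothetical host/port of the Spark Thrift (JDBC) server.
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
// The query runs inside the long-running Thrift server application,
// sharing its executors with every other connected client.
val rs = stmt.executeQuery("SELECT count(*) FROM big_table")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()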

Related

Why is Spark SQL preferred over Hive?

I am evaluating both Spark SQL and Hive with Spark as the processing engine. Most people prefer to use Spark SQL over Hive on Spark. I feel Hive on Spark is the same as Spark SQL, or am I missing something here? Are there any advantages to using Spark SQL over Hive running on the Spark processing engine?
Any clue would be helpful.
One point is the difference in how the queries are executed.
With Hive on the Spark execution engine, each query spins up a new set of executors, whereas with Spark SQL you have a Spark session with a set of long-living executors where you can cache data (create temporary tables), which can speed up your queries substantially.
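As a minimal sketch of that caching point (the table name, source path, and columns are hypothetical), a long-running Spark session can register and cache a temporary view once and reuse it across many queries:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("long-running-sql-session")
  .enableHiveSupport()
  .getOrCreate()
// Register the data once as a temporary view and pin it in memory.
val orders = spark.read.parquet("/data/orders")   // hypothetical path
orders.createOrReplaceTempView("orders")
spark.sql("CACHE TABLE orders")
// Subsequent queries in the same session hit the cached data instead of re-reading the source.
spark.sql("SELECT status, count(*) FROM orders GROUP BY status").show()
spark.sql("SELECT max(amount) FROM orders WHERE status = 'OPEN'").show()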

Possible memory leak on hadoop cluster ? (hive, hiveserver2, zeppelin, spark)

The heap usage of HiveServer2 is constantly increasing (see the first chart).
There are applications such as NiFi, Zeppelin, and Spark connected to Hive. NiFi uses PutHiveQL, Zeppelin uses JDBC (Hive), and Spark uses spark-sql. I couldn't find any clue to this.
Hive requires a lot of resources to establish a connection. So the first suspect is the large number of queries from your PutHiveQL processor, because Hive needs to open a connection for every one of them. Keep an eye on your Hive job browser (you can use Hue for this purpose).
Possible resolution: for example, if you use insert queries, write the data as ORC files instead, as sketched below; if you use update queries, use a temporary table and a merge query.
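A minimal sketch of the ORC approach, assuming the rows already sit in a Spark DataFrame and the target Hive table events already exists (both names are hypothetical); the idea is to append the data in one bulk job instead of opening a HiveServer2 connection per insert statement:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("bulk-orc-insert")
  .enableHiveSupport()
  .getOrCreate()
// Hypothetical source: the rows that would otherwise be inserted one by one via PutHiveQL.
val newRows = spark.read.json("/staging/new-events")
// Append to the ORC-backed Hive table in a single job.
newRows.write.mode("append").insertInto("events")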

Spark SQL query history

Is there any way to get a list of Spark SQL queries executed by various users in a Hadoop cluster?
For example, is there any log file where a Spark application stores the query in string format?
There is a Spark History Server (port 18080 by default). If you have spark.eventLog.enabled and spark.eventLog.dir configured and the Spark History Server is running, you can check which Spark applications have been executed on your cluster. Each job there may have an SQL tab in the UI where you can see the SQL queries, but there is no single place or log file that stores them all.
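A minimal sketch of enabling event logging so that queries show up in the History Server (the log directory is hypothetical; the same properties can also go into spark-defaults.conf):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("audited-sql-app")
  // Write event logs that the Spark History Server (port 18080) can replay.
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-history")   // hypothetical directory
  .getOrCreate()
// Queries run here will appear under this application's SQL tab in the History Server.
spark.sql("SELECT current_date()").show()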

Use JDBC (eg Squirrel SQL) to query Cassandra with Spark SQL

I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument, submits it to Spark as Spark SQL, Spark runs that SQL against Cassandra and writes the output to a csv file.
Now I feel like I'm going round in circles trying to figure out whether it's possible to query Cassandra via Spark SQL directly over a JDBC connection (e.g. from Squirrel SQL). The Spark SQL documentation says
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for business intelligence tools.
The Spark SQL Programming Guide says
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.
So I can run the Thrift Server and submit SQL to it. But what I can't figure out is how to get the Thrift Server to connect to Cassandra. Do I simply put the DataStax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and port of my Cassandra cluster? Has anyone done this already and can give me some pointers?
Configure these properties in the spark-defaults.conf file:
spark.cassandra.connection.host 192.168.1.17,192.168.1.19,192.168.1.21
# if you configured security in your cassandra cluster
spark.cassandra.auth.username smb
spark.cassandra.auth.password bigdata#123
Start your Thrift Server with the spark-cassandra-connector and mysql-connector dependencies, on a port that you will connect to via JDBC or Squirrel:
sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=192.168.1.17 --hiveconf hive.server2.thrift.port=10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
To expose a Cassandra table, run a Spark SQL query like:
CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
Why don't you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your Spark context, and then you can read/write to Cassandra using SQL, as sketched below.
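A minimal sketch of that direct-connector approach, reusing the keyspace, table, and connection settings from the answer above (all of them placeholders for your own cluster):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("cassandra-sql")
  // Point the spark-cassandra-connector at the Cassandra cluster.
  .config("spark.cassandra.connection.host", "192.168.1.17")
  .config("spark.cassandra.auth.username", "smb")
  .config("spark.cassandra.auth.password", "bigdata#123")
  .getOrCreate()
// Load a Cassandra table as a DataFrame and query it with SQL.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "testks", "table" -> "testtable"))
  .load()
df.createOrReplaceTempView("testtable")
spark.sql("SELECT count(*) FROM testtable").show()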

Spark JDBC data fetch optimization from a relational database

a) Is there a way in which Spark can optimize the data fetch from a relational database compared to a traditional Java JDBC call?
b) How can we reduce the load on the database while running Spark queries, given that we will be hitting the production database directly for all queries? Assume 30 million order records and 150 million order line records in production for the Spark reporting case.
Re a)
You can of course .cache() the data frame in your Spark app to avoid repeated JDBC executions for that data frame during the lifetime of your Spark app.
You can read the data frame in via range-partitioned parallel JDBC calls using the partitionColumn, lowerBound, upperBound and numPartitions properties, as shown in the sketch after this answer. This makes sense for distributed (partitioned) database backends.
You can use an integrated Spark cluster with a distributed database engine such as IBM dashDB, which runs Spark executors co-located with the database partitions and exercises local IPC data exchange mechanisms between Spark and the database: https://ibmdatawarehousing.wordpress.com/category/theme-ibm-data-warehouse/
Re b) The Spark-side caching mentioned above can help where applicable. In addition, the JDBC data source in Spark does try to push down projections and filter predicates from your Spark SQL / DataFrame operations to the underlying SQL database. Check the resulting SQL statements that hit the database.
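A minimal sketch of the range-partitioned JDBC read from the second point under a) (the connection URL, table, column names, and bounds are hypothetical); the select and filter at the end illustrate the projection and predicate pushdown mentioned under b):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("jdbc-partitioned-read")
  .getOrCreate()
// Read the orders table in 16 parallel range-partitioned JDBC queries,
// split on the numeric order_id column.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://prod-db:5432/sales")   // hypothetical URL
  .option("dbtable", "orders")
  .option("user", "report_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "30000000")
  .option("numPartitions", "16")
  .load()
// The column selection and filter are pushed down to the database where possible,
// so only the selected columns and matching rows travel over JDBC.
val openOrders = orders
  .select("order_id", "status", "amount")
  .filter("status = 'OPEN'")
  .cache()   // avoid re-running the JDBC read on repeated use
openOrders.count()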

Resources