Spark SQL query history - apache-spark

Is there any way to get a list of Spark SQL queries executed by various users in a Hadoop cluster?
For example, is there a log file where a Spark application stores the queries as strings?

There is the Spark History Server (port 18080 by default). If you have spark.eventLog.enabled and spark.eventLog.dir configured and the Spark History Server is running, you can check which Spark applications have been executed on your cluster. Each application there may have an SQL tab in the UI where you can see its SQL queries. But there is no single place or log file that stores them all.
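For reference, a minimal spark-defaults.conf sketch to turn on event logging (the HDFS path is just a placeholder; point both the applications and the History Server at the same directory):
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs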

Related

How to use datahub to get spark transformation lineage?

I have set up DataHub and Spark in Kubernetes in different namespaces, and I can run Spark with the DataHub configuration following this guide: https://datahubproject.io/docs/metadata-integration/java/spark-lineage/. My Spark application reads data from MinIO, performs some transformations (groupBy, pivot, rename, and a Spark SQL query), then writes the result to a Cassandra database.
After the Spark execution finished, I can see my Spark application in DataHub, but there is no information under "Tasks" or "Lineage". What should I do to get this data? There is very limited information in the DataHub documentation. Thanks!

Spark SQL taking different time to execute the same query

I am running Spark SQL queries on a Hive table stored on a remote HDFS, but I am observing that the same SQL query takes a different amount of time on each execution.
Now I want to do a POC comparing our old configuration with our new configuration, but I can't figure out how to do that when the execution times vary this much.

Can Presto query data from multiple Hadoop clusters at once?

I want to deploy multiple Hadoop clusters that differ only in the data they hold.
Can Presto query data from all of them at once?
Assuming you mean that you have multiple Hive installations (HDFS + Hive Metastore), yes, you can access all of them from a single Presto query. Simply add a Hive catalog file (with a different name) for each cluster. See https://prestodb.io/docs/current/connector/hive.html for more information on setting up connections to Hive.
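A rough sketch of what that could look like, with one catalog properties file per cluster under etc/catalog/ (catalog names and metastore hosts are placeholders):
# etc/catalog/hive_cluster1.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-cluster1:9083
# etc/catalog/hive_cluster2.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-cluster2:9083
A single query can then reference tables from both catalogs, e.g. hive_cluster1.mydb.mytable and hive_cluster2.mydb.mytable.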

Registering temp tables in ThriftServer

I am new to Spark and am trying to understand how (if at all) it is possible to register dataframes as temp tables in the Spark Thrift Server.
To clarify, this is what I am trying to do:
Submit an application that generates a dataframe and registers it as a temporary table
Connect from a JDBC client to the Spark ThriftServer (running on the master) and query the temporary table, even after the application that registered it has completed.
So far I've had no success with this - the Spark ThriftServer is running on the Spark master, but I'm unable to actually register any temp table with it.
Is this possible? I know I can use HiveThriftServer2.startWithContext to serve a dataframe via JDBC, but that requires the application to keep running forever, and it requires me to launch additional applications.
The key idea is to register all the temp tables in the Spark job and then start the Spark Thrift Server from that same job. This keeps your job running until you terminate the Thrift Server, and you will be able to query the Thrift Server for all the temp tables via JDBC.
Here it is described with an example.
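A minimal Scala sketch of that pattern (the input path and view name are hypothetical):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object TempTableServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TempTableServer")
      .enableHiveSupport()
      .getOrCreate()

    // Build the dataframe and register it as a temp view (path/name are placeholders)
    spark.read.parquet("/data/my_input").createOrReplaceTempView("my_temp_table")

    // Start the Thrift Server inside this application, so the temp view
    // stays queryable over JDBC for as long as the job is alive
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Block forever; killing the application stops the Thrift Server
    Thread.currentThread().join()
  }
}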

Use JDBC (eg Squirrel SQL) to query Cassandra with Spark SQL

I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument, submits it to Spark as Spark SQL, Spark runs that SQL against Cassandra and writes the output to a csv file.
Now I feel like I'm going round in circles trying to figure out if it's possible to query Cassandra via Spark SQL directly over a JDBC connection (e.g. from Squirrel SQL). The Spark SQL documentation says:
"Connect through JDBC or ODBC. A server mode provides industry standard JDBC and ODBC connectivity for business intelligence tools."
The Spark SQL Programming Guide says:
"Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code."
So I can run the Thrift Server, and submit SQL to it. But what I can't figure out, is how do I get the Thrift Server to connect to Cassandra? Do I simply pop the Datastax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and Port of my Cassandra cluster? Has anyone done this already and can give me some pointers?
Configure these properties in the spark-defaults.conf file:
spark.cassandra.connection.host 192.168.1.17,192.168.1.19,192.168.1.21
# if you configured security in your cassandra cluster
spark.cassandra.auth.username smb
spark.cassandra.auth.password bigdata#123
Start your Thrift Server with the spark-cassandra-connector and mysql-connector dependencies, on a port that you will connect to via JDBC or Squirrel:
sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=192.168.1.17 --hiveconf hive.server2.thrift.port=10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
To expose a Cassandra table, run a Spark SQL query like:
CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
Why don't you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address and login in your Spark context, and then you can read/write to Cassandra using SQL.
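A rough Scala sketch of that approach, reusing the host, credentials, and keyspace/table names from the answer above (they are just examples):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CassandraSparkSql")
  .config("spark.cassandra.connection.host", "192.168.1.17")
  .config("spark.cassandra.auth.username", "smb")
  .config("spark.cassandra.auth.password", "bigdata#123")
  .getOrCreate()

// Expose the Cassandra table as a temp view, then query it with plain SQL
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "testks", "table" -> "testtable"))
  .load()
  .createOrReplaceTempView("testtable")

spark.sql("SELECT * FROM testtable LIMIT 10").show()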
