I'm using Presto to query a Hive warehouse, and I can see the query history in the Presto web interface.
Question:
In Hive, query history logs are available in the Hadoop file system under the Hive path. Is the Presto query history stored there as well?
No, the queries Presto executes are not available in HDFS (or anywhere, really). They are temporarily available in-memory on the Coordinator to show in the UI. What you can do is create an implementation of an EventListener in order to receive the queries Presto executes. You can then do whatever you like with that information, e.g. log it or write it to a database.
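The EventListener itself is a Java plugin, so it is not shown here. As a stopgap, and as a different technique from the one described above, the in-memory list that the web UI displays can be polled and appended to a file. This is only a minimal sketch: it assumes the coordinator serves that list at `/v1/query`, that each entry carries `queryId`, `state`, `query` and `session.user`, and that sending an `X-Presto-User` header is enough to be let in. The function names and the log path are made up for illustration.

```python
import json
import urllib.request


def fetch_queries(coordinator_url, user="history-logger"):
    """Fetch the coordinator's in-memory query list (assumed endpoint: /v1/query)."""
    request = urllib.request.Request(
        coordinator_url.rstrip("/") + "/v1/query",
        headers={"X-Presto-User": user},  # assumption: no further auth is required
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def to_history_records(queries):
    """Reduce each query entry to the fields worth persisting."""
    records = []
    for entry in queries:
        records.append({
            "query_id": entry.get("queryId"),
            "user": entry.get("session", {}).get("user"),
            "state": entry.get("state"),
            "sql": entry.get("query"),
        })
    return records


def append_to_log(records, path):
    """Append one JSON object per line so the file is easy to grep or load later."""
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Something like `append_to_log(to_history_records(fetch_queries("http://coordinator:8080")), "presto_history.jsonl")` run on a schedule would do it (host and file name are placeholders). Repeated polls see the same queries again, so you would want to de-duplicate on `query_id`, and anything that expires from the coordinator between polls is lost, which is why the EventListener remains the more reliable route.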
HIVE has a metastore and HIVESERVER2 listens for SQL requests; with the help of metastore, the query is executed and the result is passed back.
The Thrift framework is actually customised as HIVESERVER2. In this way, HIVE acts as a service. From a programming language, we can use HIVE as a database.
The relationship between Spark-SQL and HIVE is that:
Spark-SQL just utilises the HIVE setup (HDFS file system, HIVE Metastore, Hiveserver2). When we invoke sbin/start-thriftserver.sh (present in the Spark installation), we are supposed to give the hiveserver2 port number and the hostname. Then, via Spark's beeline, we can actually create, drop and manipulate tables in HIVE. The API can be either Spark-SQL or HIVE QL.
If we create or drop a table, the change is clearly visible if we log in to HIVE and check (say, via HIVE beeline or the HIVE CLI). To put it another way, changes made via Spark can be seen in HIVE tables.
My understanding is that Spark does not have its own meta store setup like HIVE. Spark just utilises the HIVE setup and simply the SQL execution happens via Spark SQL API.
Is my understanding correct here?
Then I am a little confused about the usage of bin/spark-sql (which is also present in the Spark installation). The documentation says that via this SQL shell we can create tables like we do above (via Thrift Server/Beeline). Now my question is: how is the metadata information maintained by Spark then?
Or, like the first approach, can we make the spark-sql CLI communicate with HIVE (to be specific: the hiveserver2 of HIVE)?
If yes, how can we do that?
Thanks in advance!
My understanding is that Spark does not have its own meta store setup like HIVE
Spark will create an embedded Derby metastore on its own if a Hive metastore is not provided.
can we make spark-sql CLI to communicate to HIVE
Start an external metastore process, add a hive-site.xml file to $SPARK_CONF_DIR with hive.metastore.uris, or use SET SQL statements for the same.
Then the spark-sql CLI should be able to query Hive tables. From code, you need to call the enableHiveSupport() method on the SparkSession builder.
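To make the two steps concrete, here is a rough sketch that writes a minimal hive-site.xml containing only hive.metastore.uris and, separately, builds a Hive-enabled session from code. The helper names are invented for this example and the URI `thrift://metastore-host:9083` is a placeholder, so substitute the host and port of your own metastore. The session helper needs pyspark installed, so its import is kept inside the function.

```python
import os
import xml.etree.ElementTree as ET


def build_hive_site(metastore_uri):
    """Return hive-site.xml content holding a single hive.metastore.uris property."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uri
    return ET.tostring(configuration, encoding="unicode")


def write_hive_site(conf_dir, metastore_uri):
    """Write hive-site.xml into conf_dir (for example, the value of $SPARK_CONF_DIR)."""
    path = os.path.join(conf_dir, "hive-site.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_hive_site(metastore_uri))
    return path


def hive_enabled_session(app_name="hive-example"):
    """Build a SparkSession that uses the Hive metastore configured above."""
    from pyspark.sql import SparkSession  # requires pyspark to be installed

    return SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()
```

For example, `write_hive_site(os.environ["SPARK_CONF_DIR"], "thrift://metastore-host:9083")` followed by starting spark-sql, or calling `hive_enabled_session().sql("SHOW TABLES").show()` from code, should list the tables registered in that metastore, provided the metastore process is reachable.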
Hive has its own metastore and stores table, column and partition information there.
If I do not want to use Hive, can we create a metadata store for Spark, the same as Hive has?
I want to query with Spark SQL (not using DataFrames) the way I would in Hive (select, from and where). Can we do that? If yes, which relational DB can we use for metadata storage?
Can we create a metadata for spark same as hive.
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
if yes which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, PostgreSQL. The configuration is pretty much as you would do with a separate Hive installation (which is usually the case in such enterprisey installations).
You may want to read Hive Metastore.
Spark is essentially a distributed computation system rather than a distributed storage system. Therefore, we mostly use Spark to do the computation work, which needs the metadata from different storage systems.
However, Spark internally provides an InMemoryCatalog to store the metadata if it's not configured with Hive.
You can take a look at this for more information.
Is there any way to get a list of Spark SQL queries executed by various users in a Hadoop cluster?
For example, is there any log file where a Spark application stores the query in string format ?
There is a Spark History Server (port 18080 by default). If you have spark.eventLog.enabled and spark.eventLog.dir configured and the Spark History Server is running, you can check which Spark applications have been executed on your cluster. Each application there might have a SQL tab in its UI where you can see the SQL queries. But there is no single place or log file that stores them all.
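If you want the queries as text rather than in the UI, the event log files under spark.eventLog.dir can be scanned directly. The sketch below assumes an uncompressed event log with one JSON event per line, and that SQL executions show up as `SparkListenerSQLExecutionStart` events with `executionId` and `description` fields. Note that `description` is not always the SQL string: for queries issued from code it can be a call site instead, so treat the output as a best effort. The function name is made up for this example.

```python
import json

SQL_START_EVENT = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"


def extract_sql_executions(event_log_lines):
    """Collect SQL execution start events from the lines of one application's event log."""
    app_id = None
    user = None
    executions = []
    for line in event_log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated lines, e.g. from a log still being written
        kind = event.get("Event")
        if kind == "SparkListenerApplicationStart":
            # The application start event names the app and the submitting user.
            app_id = event.get("App ID")
            user = event.get("User")
        elif kind == SQL_START_EVENT:
            executions.append({
                "app_id": app_id,
                "user": user,
                "execution_id": event.get("executionId"),
                "description": event.get("description"),
            })
    return executions
```

Calling it with `open(path, encoding="utf-8")` for each file in the event log directory and concatenating the results gives one list across applications and users. Compressed logs would need to be decompressed first.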
Here is the short story:
A BI tool (PowerBI) connects to a Spark cluster and uses the HiveThriftServer2 application to get aggregated data via Hive queries.
However, each query takes a lot of time, since it reads the data from files every time. I would like to cache my table in this application, and I am looking for a way to send the query "cache table myTable" through the same channel so that subsequent queries run quickly.
What would be a way to send a Hive query to a specific application? If it matters, the application is a Thrift service of Spark.
Thanks a lot!
It looks like I succeeded in doing it by installing the Spark ODBC driver and using it to connect to the Thrift server and send the SQL query "cache table xxx". I wonder if there is a more elegant way.
How do I write a LIKE query in Cassandra?
select * from user where user_name like '%abcd%'
How do I write it in CQL (Cassandra Query Language)?
I need this because I have to search some content based on a keyword.
If it doesn't need to be real-time, you could use Hive or Shark. This enables you to run exactly the query you're speaking about. If you use DSE it works out of the box with Hive. If not, you'll want to check out this Hive driver.
To get this working with open source Cassandra, you'll need:
HDFS running co-located with your Cassandra nodes
If you use Spark, you'll need Spark workers (ideally co-located as well, though this isn't a hard requirement)
Hive or Shark running on a machine that can access the cluster