Can Presto query data from multiple Hadoop clusters at once? - presto

I want to deploy multiple Hadoop clusters; the only difference between them is the data they hold.
Can Presto query data from all of them at once?

Assuming you mean that you have multiple Hive installations (HDFS + Hive Metastore), yes, you can access all of them from a single Presto query. Simply add a Hive catalog file (with a different name) for each cluster. See https://prestodb.io/docs/current/connector/hive.html for more information on setting up connections to Hive.
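As a sketch, each cluster gets its own file under etc/catalog/ on the Presto coordinator and workers; the file names and metastore hosts below are placeholders:

# etc/catalog/cluster1.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore1.example.com:9083

# etc/catalog/cluster2.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore2.example.com:9083

A single query can then reference (and even join) tables from both catalogs, e.g. select * from cluster1.web.logs union all select * from cluster2.web.logs.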

Related

Need use case or example for Spark’s Relationship to Hive

I am reading Spark: The Definitive Guide.
In the "Spark's Relationship to Hive" section, the following lines are given:
"With Spark SQL, you can connect to your Hive metastore (if you already have one) and access table metadata to reduce file listing when accessing information. This is popular for users who are migrating from a legacy Hadoop environment and beginning to run all their workloads using Spark."
I am not able to understand what this means. Could someone please help me with examples for the above use case?
Spark, being one of the newer tools in the Hadoop ecosystem, has connectivity with the earlier Hadoop tools. Hive was the most popular of these until recently. Most Hadoop platforms have data stored in Hive tables, which can be accessed using Hive as a SQL engine; Spark can do the same things.
So the quoted passage means that you can connect to the Hive metastore (which contains information about existing tables, databases, their locations, schemas, file types, etc.) and then run Hive-style queries on those tables just as you would with Hive.
Below are two examples of what you can do with Spark once you are connected to the Hive metastore:
spark.sql("show databases")
spark.sql("select * from test_db.test_table")
I hope this answers your question.

How to set up metadata database for Spark SQL?

Hive has its own metastore, where it stores information about tables, columns, and partitions.
If I do not want to use Hive, can we create a metastore for Spark the same way Hive does?
I want to query Spark SQL (not using DataFrames) like Hive (select, from, and where). Can we do that? If yes, which relational DB can we use for metadata storage?
Can we create a metastore for Spark the same way Hive does?
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
If yes, which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, PostgreSQL. The configuration is pretty much what you would do for a separate Hive installation (which is usually the case in such enterprisey setups).
You may want to read Hive Metastore.
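For example, pointing Spark's built-in metastore at MySQL is typically done with a hive-site.xml on Spark's classpath (conf/). A minimal sketch; the host, database name, and credentials are placeholders:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db-host:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>
</configuration>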
Spark is essentially a distributed computation system rather than a distributed storage system. Therefore, we mostly use Spark to do the computation work, which needs metadata from different storage systems.
However, Spark internally provides an InMemoryCatalog to store the metadata if it's not configured with Hive.
You can take a look at this for more information.
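A minimal sketch of the in-memory case: a SparkSession built without Hive support keeps table metadata in the InMemoryCatalog, so it lasts only for the session (the table name below is made up):

import org.apache.spark.sql.SparkSession

// No enableHiveSupport(), so Spark falls back to its InMemoryCatalog
val spark = SparkSession.builder()
  .appName("in-memory-catalog-example")
  .getOrCreate()

// saveAsTable records the metadata in the in-memory catalog; the
// metadata is gone when the session ends (the data files remain)
spark.range(10).write.saveAsTable("numbers")
spark.sql("select count(*) from numbers").show()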

Spark SQL query history

Is there any way to get a list of Spark SQL queries executed by various users in a Hadoop cluster?
For example, is there a log file where a Spark application stores its queries in string format?
There is the Spark History Server (port 18080 by default). If you have spark.eventLog.enabled and spark.eventLog.dir configured and the Spark History Server is running, you can check which Spark applications have been executed on your cluster. Each application there may have a SQL tab in the UI where you can see its SQL queries. But there is no single place or log file that stores them all.
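The relevant settings usually go in spark-defaults.conf; the HDFS path below is a placeholder:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs

The history server itself is started with sbin/start-history-server.sh and then serves its UI on port 18080.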

Can Spark SQL or HiveServer2 connect to two different metastores simultaneously?

The use case is:
A DataStax DSE cluster running Cassandra, a Hive metastore (Cassandra-based), and Spark SQL; this metastore holds the DDL of external tables pointing to DSEFS.
An EC2 cluster running a Hive metastore (MySQL-based), HiveServer2, and Spark SQL; this metastore holds the DDL of external tables pointing to S3.
Is it possible to have a single Spark SQL connection that can read data from both metastores (i.e. the DSEFS tables from the Cassandra-based HMS and the S3 tables from the MySQL-based HMS)? From what I've seen, a single HMS does not handle both S3 and DSEFS external tables.

How to write a LIKE query in Cassandra

How do I write a LIKE query in Cassandra?
select * from user where user_name like '%abcd%'
How do I write this in CQL (Cassandra Query Language)? I need to search some content based on a keyword.
If it doesn't need to be real-time, you could use Hive or Shark. This enables you to run exactly the query you're asking about (see the sketch after the list below). If you use DSE, it works out of the box with Hive. If not, you'll want to check out this Hive driver.
To get this working with open source Cassandra, you'll need:
HDFS running co-located with your Cassandra nodes
If you use Spark, you'll need Spark workers (ideally co-located as well, though this isn't a hard requirement)
Hive or Shark running on a machine that can access the cluster
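Once the Cassandra column family is mapped into Hive as an external table (how to do that depends on the driver above), the query is plain HiveQL; the table and column names are taken from the question:

-- Runs as a batch job over the Cassandra data, not in real time
select * from user where user_name like '%abcd%';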
