We have two clusters: a Spark cluster with Spark/YARN/HDFS (cluster A) and a Hive/HDFS cluster (cluster B). Both clusters are kerberized. Is it possible to run the Spark Thrift server on cluster A to provide a SQL interface for querying the Hive tables on cluster B, using the compute resources of cluster A? I know that the remote reads will impact query performance, but that is not a concern at the moment.
The problem I have is running the Thrift server in cluster mode on cluster A, and deciding which hdfs-site.xml to use. When I run the Thrift server in local mode it works.
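For context, a rough sketch of the kind of YARN launch involved (the keytab, principal and config directory are placeholders; it assumes cluster B's hive-site.xml and hdfs-site.xml have been copied onto cluster A and are shipped to the executors):

kinit -kt /path/to/hive.keytab hive/host@REALM
sbin/start-thriftserver.sh \
  --master yarn \
  --files /etc/cluster-b-conf/hive-site.xml,/etc/cluster-b-conf/hdfs-site.xml \
  --driver-class-path /etc/cluster-b-conf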
Thanks,
Suri
We have an HDP 2.6.4 Spark cluster with 10 Linux worker machines.
The cluster runs Spark applications over HDFS. HDFS is installed on all the workers.
We wish to install Presto to query the cluster's HDFS, but due to the lack of CPU resources on the worker machines (only 32 cores per machine) the plan is to install Presto outside of the cluster.
For that purpose we have several ESX hosts; each ESX will have 2 VMs, and each VM will run a single Presto server.
All the ESX machines will be connected to the Spark cluster via 10G network cards, so the two clusters will be on the same network.
My question is - can we install Presto on the VM cluster even though the HDFS is not on the ESX cluster (but on the Spark cluster instead)?
EDIT:
From the answer we got it seems that installing Presto on VMs is standard, so I'd like to clarify my question:
Presto has a configuration file named hive.properties under presto/etc.
Inside that file there’s a parameter named hive.config.resources with the following value:
/etc/hadoop/conf/presto-hdfs-site.xml,/etc/hadoop/conf/presto-core-site.xml
These files are HDFS config files, but since the VM cluster and the Spark cluster (which contains the HDFS) are separate (the Presto servers on the VM cluster need to access the HDFS that resides on the Spark cluster), the question is -
should these files be copied from the Spark cluster to the VM cluster?
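For illustration, the copy from the Spark cluster to each Presto VM might look roughly like this (the worker host name and target paths are placeholders):

# sketch: pull the HDFS client configs from a Spark-cluster node onto the Presto VM
scp hdp-worker01:/etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/presto-hdfs-site.xml
scp hdp-worker01:/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/presto-core-site.xml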
Regarding your question - can we install Presto on the VM cluster even though the HDFS is not on the ESX cluster (but on the Spark cluster instead)?
The answer is YES.
On the cluster that isn't co-hosted with HDFS, don't forget to set the following parameter in hive.properties:
hive.force-local-scheduling=false
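For example, a minimal hive.properties for the Hive connector could look like this (the metastore host is a placeholder, and the two XML files are the ones copied over from the Spark cluster):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://<metastore-host-on-spark-cluster>:9083
hive.config.resources=/etc/hadoop/conf/presto-core-site.xml,/etc/hadoop/conf/presto-hdfs-site.xml
hive.force-local-scheduling=false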
As long as the Presto VMs are configured as edge nodes (aka gateway nodes) and have all the necessary config files and tools, you shouldn't have any problem. For details on edge nodes see:
Do we need to Install Hadoop on Edge Node
How to create an Edge Node when creating a cloudera cluster
Previously we had a three-node cluster: a two-node Cassandra datacenter in one DC and one Spark-enabled node in a different DC.
Spark was running smoothly in that configuration.
Then we tried adding another Spark-enabled node in the analytics DC. We had configured GossipingPropertyFileSnitch as well as added the seeds.
But now when we start the cluster, a Spark master is assigned to each of the two nodes separately, so Spark jobs still run on a single node. What configuration are we missing for running Spark jobs across the cluster?
Most probably you didn't adjust the replication of the Analytics keyspaces, or didn't run a repair after you added the node. Please refer to the instructions in the official documentation.
Also, please check that you configured the same DC for both Analytics nodes, because the Spark master is elected per DC.
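As a sketch of the usual adjustment (the exact keyspace names depend on your DSE version, and the DC name 'Analytics' and replication factor are assumptions to adapt):

-- make the analytics/leases keyspace replicate to both nodes in the Analytics DC
ALTER KEYSPACE dse_leases WITH replication =
  {'class': 'NetworkTopologyStrategy', 'Analytics': 2};
-- then run a repair of that keyspace on each Analytics node:
-- nodetool repair dse_leases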
In client deploy mode a Spark driver needs to be able to receive incoming TCP connections from Spark executors. However, if the Spark driver is behind a NAT, it cannot receive incoming connections. Will running the Spark driver in YARN cluster deploy mode overcome this limitation of being behind a NAT, because the Spark driver is then apparently executed on the Spark master?
Will running the Spark driver in YARN cluster deploy mode overcome this limitation of being behind a NAT, because the Spark driver is then apparently executed on the Spark master?
Yes, it will. Another possible approach is to configure:
spark.driver.port
spark.driver.bindAddress
and create an SSH tunnel to one of the nodes.
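A sketch of that second approach, assuming you can SSH to one of the cluster nodes (the gateway host, ports and jar name are placeholders, and spark.driver.host plus the block manager port are additional settings that may be needed):

# reverse tunnels so executors can reach the driver and its block manager via the gateway node
# (the gateway's sshd needs GatewayPorts enabled for other hosts to use the tunnel)
ssh -R 40000:localhost:40000 -R 40001:localhost:40001 user@gateway.example.com

spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.port=40000 \
  --conf spark.driver.blockManager.port=40001 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.host=gateway.example.com \
  myapp.jar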
I am attempting to leverage a Hadoop Spark Cluster in order to batch load a graph into Titan using the SparkGraphComputer and BulkLoaderVertex program, as specified here. This requires setting the spark configuration in a properties file, telling Titan where Spark is located, where to read the graph input from, where to store its output, etc.
The problem is that all of the examples seem to specify a local spark cluster through the option:
spark.master=local[*]
However, I want to run this job on a remote Spark cluster which is on the same VNet as the VM where the Titan instance is hosted. From what I have read, it seems that this can be accomplished by setting
spark.master=spark://<spark_master_IP>:7077
This gives me an error that all Spark masters are unresponsive, which prevents me from sending the job to the Spark cluster to distribute the batch-loading computations.
For reference, I am using Titan 1.0.0 and a Spark 1.6.4 cluster, both hosted on the same VNet. Spark is managed by YARN, which may also be contributing to this difficulty.
Any sort of help/reference would be appreciated. I am sure that I have the correct IP for the spark master, and that I am using the right gremlin commands to accomplish bulk loading through the SparkGraphComputer. What I am not sure about is how to properly configure the Hadoop properties file in order to get Titan to communicate with a remote Spark cluster over a VNet.
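For reference, the kind of Hadoop/Spark properties file involved looks roughly like this; since the cluster is managed by YARN, spark.master may need to be a YARN master (e.g. yarn-client on Spark 1.6) rather than a spark:// URL. All paths, formats and memory settings below are placeholders, not a verified configuration:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.inputLocation=hdfs://namenode:8020/user/titan/graph-input.txt
gremlin.hadoop.outputLocation=hdfs://namenode:8020/user/titan/output
# SparkGraphComputer settings
spark.master=yarn-client
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer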
I have a Cassandra cluster with a co-located Spark cluster, and I can run the usual Spark jobs by compiling them, copying them over, and using the ./spark-submit script. I wrote a small job that accepts SQL as a command-line argument, submits it to Spark as Spark SQL, Spark runs that SQL against Cassandra and writes the output to a csv file.
Now I feel like I'm going round in circles trying to figure out if it's possible to query Cassandra via Spark SQL directly over a JDBC connection (e.g. from Squirrel SQL). The Spark SQL documentation says
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for
business intelligence tools.
The Spark SQL Programming Guide says
Spark SQL can also act as a distributed query engine using its JDBC/ODBC or
command-line interface. In this mode, end-users or applications can interact
with Spark SQL directly to run SQL queries, without the need to write any
code.
So I can run the Thrift Server and submit SQL to it. But what I can't figure out is how to get the Thrift Server to connect to Cassandra. Do I simply put the DataStax Cassandra Connector on the Thrift Server classpath? How do I tell the Thrift Server the IP and port of my Cassandra cluster? Has anyone done this already and can give me some pointers?
Configure these properties in the spark-defaults.conf file:
spark.cassandra.connection.host 192.168.1.17,192.168.1.19,192.168.1.21
# if you configured security in you cassandra cluster
spark.cassandra.auth.username smb
spark.cassandra.auth.password bigdata#123
Start your Thrift server with the spark-cassandra-connector and mysql-connector dependencies, on a port that you will connect to via JDBC or Squirrel:
sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host=192.168.1.17 --hiveconf hive.server2.thrift.port=10003 --jars <shade-jar>-0.0.1.jar --driver-class-path <shade-jar>-0.0.1.jar
To query a Cassandra table, run Spark SQL statements like:
CREATE TEMPORARY TABLE mytable USING org.apache.spark.sql.cassandra OPTIONS (cluster 'BDI Cassandra', keyspace 'testks', table 'testtable');
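Once the server is up, you can connect over JDBC and query the registered table; for example with beeline (the user name and query are illustrative), and the same jdbc:hive2:// URL works from Squirrel with the Hive JDBC driver on its classpath:

bin/beeline -u jdbc:hive2://192.168.1.17:10003 -n hiveuser
-- then, at the beeline prompt:
select * from mytable limit 10;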
Why don't you use the spark-cassandra-connector and cassandra-driver-core? Just add the dependencies, specify the host address/login in your Spark context, and then you can read/write to Cassandra using SQL.
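A minimal sketch of that approach (Spark 1.x style, assuming the spark-cassandra-connector is on the classpath; the host, credentials and keyspace/table names are taken from the earlier answer and are placeholders for your own):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// configure the Cassandra connection on the Spark context
val conf = new SparkConf()
  .setAppName("cassandra-sql")
  .set("spark.cassandra.connection.host", "192.168.1.17")
  .set("spark.cassandra.auth.username", "smb")
  .set("spark.cassandra.auth.password", "bigdata#123")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// load a Cassandra table as a DataFrame and query it with SQL
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "testks", "table" -> "testtable"))
  .load()
df.registerTempTable("testtable")
sqlContext.sql("select * from testtable limit 10").show()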