Connecting Spark from my local machine to a remote HiveServer - apache-spark

How can I connect Spark from my local machine in Eclipse to a remote HiveServer?

Get a copy of the hive-site.xml file from the remote server and add it to $SPARK_HOME/conf.
Then, assuming Spark 2.x, call enableHiveSupport() on the SparkSession builder, and any spark.sql() queries should be able to communicate with Hive.
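For illustration, here is a minimal Java sketch of that setup; it assumes hive-site.xml has already been copied into $SPARK_HOME/conf, and the SHOW DATABASES query is just a placeholder:

import org.apache.spark.sql.SparkSession;

public class RemoteHiveExample {
    public static void main(String[] args) {
        // enableHiveSupport() makes Spark use the metastore defined in hive-site.xml
        SparkSession spark = SparkSession.builder()
                .appName("remote-hive-example")
                .master("local[*]")   // running locally, e.g. from Eclipse
                .enableHiveSupport()
                .getOrCreate();

        // Any spark.sql() query now goes through the remote Hive metastore
        spark.sql("SHOW DATABASES").show();

        spark.stop();
    }
}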

Related

How to load data into spark from a remote HDFS?

Our data is stored in a remote Hadoop cluster, but for a PoC I need to run a Spark application locally on my machine. How can I load data from that remote HDFS?
You can configure Spark to access any Hadoop instance you have access to (ports open, nodes reachable).
Custom Hadoop/Hive Configuration
If your Spark application is interacting with Hadoop, Hive, or both, there are probably Hadoop/Hive configuration files in Spark’s classpath. Multiple running applications might require different Hadoop/Hive client side configurations. You can copy and modify hdfs-site.xml, core-site.xml, yarn-site.xml, hive-site.xml in Spark’s classpath for each application. In a Spark cluster running on YARN, these configuration files are set cluster-wide, and cannot safely be changed by the application.
Since you want to access HDFS, you need the hdfs-site.xml and core-site.xml from the cluster you are trying to access.
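As a rough sketch, once those files are on Spark's classpath (or you spell out the NameNode address yourself), a locally running Spark job can read the remote files directly; the host remote-nn, port 8020, and file path below are placeholders for your cluster:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RemoteHdfsRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("remote-hdfs-poc")
                .master("local[*]")   // the PoC runs locally; only the data is remote
                .getOrCreate();

        // With hdfs-site.xml/core-site.xml on the classpath a bare hdfs:///path also works;
        // otherwise use the fully qualified NameNode URI.
        Dataset<Row> lines = spark.read().text("hdfs://remote-nn:8020/data/sample.txt");
        lines.show(10);

        spark.stop();
    }
}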
For anyone who wants to access a remote HDFS from a Spark Java app, here are the steps.
First, you need to add a --conf key to your run command. The key depends on the Spark version:
(Spark 1.x-2.1) spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
(Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB
Second, when you create Spark’s Java context, add the following:
javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));
If you are facing this exception:
java.net.UnknownHostException: clusterB
then try putting the full NameNode address of your remote HDFS, including the port, into the --conf value of your run command (instead of the short hdfs/cluster name).
More details in my article: https://mchesnavsky.tech/spark-java-access-remote-hdfs.
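Putting the two steps together, a minimal sketch of the Java side could look like the following; the resource file names match the snippet above, while nn-clusterB:8020 and the input path are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteHdfsJavaApp {
    public static void main(String[] args) {
        // spark.yarn.access.* is normally passed via --conf on the run command, as above
        SparkConf conf = new SparkConf().setAppName("remote-hdfs-java-app");
        JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

        // Point the Hadoop configuration at the remote cluster's client configs
        javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
        javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));

        // Use the full NameNode address with port if the short cluster name cannot be resolved
        JavaRDD<String> lines = javaSparkContext.textFile("hdfs://nn-clusterB:8020/data/input.txt");
        System.out.println("line count: " + lines.count());

        javaSparkContext.close();
    }
}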

Connect PowerBI Desktop with Apache Spark local machine installation

Can someone guide me how to connect PBI Desktop to APACHE SPARK installed on a local windows machine? What should be the server details I should pass?
I have read thrift connections are very slow so would want to avoid them unless they are the only choice.
Edit -
Based on the suggestion, I tried to set up a Thrift connection following this link: medium.com/@waqasrafiq327/… . Mine is a Windows installation, and the given paths seem to be for Linux. I can't see a hive-site.xml file under the /spark/conf folder, and I also don't see an /apachehive/conf folder in my Spark installation. My Spark installation is the latest available Spark release. Please guide.
You have to use the Thrift server: it is what allows connections via ODBC or JDBC, and that is the only way to connect Power BI to Apache Spark.
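If you want to sanity-check the Thrift server before pointing Power BI at it, a small JDBC probe is one way to do it. This is only a sketch: it assumes the Spark Thrift Server is running on localhost on the default port 10000 with no authentication, and that the Hive JDBC driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThriftServerCheck {
    public static void main(String[] args) throws Exception {
        // Default Spark Thrift Server endpoint; adjust host and port to your setup
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

If the probe connects, Power BI's Spark or ODBC connector should be able to reach the same host and port.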

Can't use Tableau on a EMR Spark cluster

I have a client that wants to use Tableau on their EMR Spark cluster.
The documentation seems straightforward but I'm getting errors when I try to connect.
Here is the setup:
The EMR cluster's master doesn't have a public IP, but from the Tableau desktop EC2 instance I am able to ping it and telnet to port 10001, where Thrift is running
I am able to test Thrift with beeline and it connects fine
I am not using SSL or authentication, given the limited access the cluster has
I have installed both DataDirect 8.0 and the Simba ODBC driver
I'm using emr-5.13.0, the Hadoop distribution is Amazon 2.8.3 and the Spark version is 2.3.0.
The error is
Unable to connect to the ODBC Data Source. Check that the necessary drivers are installed and that the connection properties are valid.
[Simba][ThriftExtension] (5) Error occurred while contacting server: No more data to read.. This could be because you are trying to establish a non-SSL connection to an SSL-enabled server.
Unable to connect to the server "IP". Check that the server is running and that you have access privileges to the requested database."
I simply followed the documentation provided by Tableau, which says to install the driver only (not mess with ODBC), then use it in Tableau. I have verified that I set no SSL and no authentication before trying to connect. I also verified by running DataGrip and doing a query from the Tableau EC2 instance, which works as expected.
Resolved the issue by ignoring the documentation and just setting up the ODBC driver, then choosing it instead of SparkSQL as a source.

How to connect to spark (CDH-5.8 docker vms at remote)? Do I need to map port 7077 at container?

Currently, I can access HDFS from inside my application, but instead of running my local Spark I'd like to use Cloudera's Spark, as it is enabled in Cloudera Manager.
Right now I have HDFS defined in core-site.xml, and I run my app with --master YARN, so I don't need to set the machine address for my HDFS files. This way, my Spark job runs locally and not in the "cluster," which I don't want for now. When I try to set --master to [namenode]:[port] it does not connect. I wonder whether I'm pointing to the correct port, whether I have to map this port on the Docker container, or whether I'm missing something about the YARN setup.
Additionally, I've been testing the SnappyData (Inc.) solution as a Spark SQL in-memory database. My goal is to run the Snappy JVMs locally while redirecting Spark jobs to the VM cluster. The whole idea is to test performance against a Hadoop implementation. This setup is not a final product (if Snappy is local and Spark is "really" remote, I believe it won't be efficient, but in that scenario I would bring the Snappy JVMs to the same cluster).
Thanks in advance!
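One note on the --master question above: with YARN there is no spark://host:port endpoint to connect to, so nothing on port 7077 needs to be mapped; the driver finds the cluster through the Hadoop/YARN client configuration. A minimal sketch, assuming HADOOP_CONF_DIR points at the cluster's core-site.xml/yarn-site.xml and the spark-yarn module is on the classpath:

import org.apache.spark.sql.SparkSession;

public class YarnClientExample {
    public static void main(String[] args) {
        // Requires HADOOP_CONF_DIR (or YARN_CONF_DIR) so the driver can locate the
        // ResourceManager; there is no standalone master URL in this mode.
        SparkSession spark = SparkSession.builder()
                .appName("yarn-client-example")
                .master("yarn")   // client deploy mode by default
                .getOrCreate();

        spark.sql("SELECT 1").show();
        spark.stop();
    }
}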

How can I connect PySpark (local machine) to my EMR cluster?

I have deployed a 3-node AWS ElasticMapReduce cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH:
ssh -i <key> hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com
Once ssh'd into the master node, I can access PySpark via pyspark.
Additionally, (although insecure) I have configured my master node's security group to accept TCP traffic from my local machine's IP address specifically on port 7077.
However, I am still unable to connect my local PySpark instance to my cluster:
MASTER=spark://ec2-master-node-public-address:7077 ./bin/pyspark
The above command results in a number of exceptions and leaves PySpark unable to initialize a SparkContext object.
Does anyone know how to successfully create a remote connection like the one I am describing above?
Unless your local machine is the master node for your cluster, you cannot do that. EMR runs Spark on YARN rather than as a standalone cluster, so there is no spark:// master listening on port 7077; you won't be able to connect that way with AWS EMR.
