Not able to connect to Snowflake from EMR Cluster using Pyspark - apache-spark

I am trying to connect to Snowflake from EMR cluster using pyspark.
I am using these two jars in spark-submit.
snowflake-jdbc-3.5.2.jar
spark-snowflake_2.11-2.7.0-spark_2.4.jar
But it is failing with a connection timeout error.
I have the correct proxy configured for the EMR cluster. From the same EC2 instance (the EMR master),
I am able to connect to Snowflake using snowsql and the Python connector.
I am not sure why it times out for pyspark.

You can use our SnowCD tool to check the connectivity diagnostics. This is related to network issues.
https://docs.snowflake.com/en/user-guide/snowcd.html
Below are the commands I used when running from the EMR shell.
pyspark --packages net.snowflake:snowflake-jdbc:3.6.27,net.snowflake:spark-snowflake_2.12:2.4.14-spark_2.4
spark-submit --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4 SparkConnPythonWithCert.py
spark-shell --packages net.snowflake:snowflake-jdbc:3.8.0,net.snowflake:spark-snowflake_2.11:2.4.14-spark_2.4
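Once the shell starts with those packages, a read from Snowflake could look roughly like the sketch below. The account name, credentials, warehouse and table are placeholders I made up for illustration, and the helper names (snowflake_options, read_table) are hypothetical; only the option keys and the connector's format name come from the Snowflake Spark connector itself.

```python
def snowflake_options(account, user, password, database, schema, warehouse):
    """Build the option dict expected by the Snowflake Spark connector."""
    return {
        "sfURL": f"{account}.snowflakecomputing.com",
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }


def read_table(spark, options, table):
    """Load a Snowflake table into a DataFrame using an existing SparkSession."""
    return (
        spark.read.format("net.snowflake.spark.snowflake")
        .options(**options)
        .option("dbtable", table)
        .load()
    )


# Placeholder values for illustration only.
opts = snowflake_options("myaccount", "me", "secret", "MYDB", "PUBLIC", "MYWH")
print(opts["sfURL"])
```

Inside pyspark you would then call read_table(spark, opts, "MY_TABLE"). If that call still times out while snowsql works, the difference is most likely in how the JVM picks up the proxy, which is what SnowCD helps narrow down.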

Related

How to connect documentdb to a spark application in an emr instance

I'm getting an error while trying to configure Spark with MongoDB on my EMR instance. Below is the command:
spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!#docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" "spark.mongodb.output.collection="ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?
DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch. You therefore first need to install the CA certs on each instance. AWS has a guide under the Java section that provides two bash scripts to make this easier.
Once these certs are installed, your spark command needs to reference the truststores and its passwords using the configuration parameters you can pass to Spark. Here is an example that I ran and this worked fine.
spark-submit \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
  pytest.py
You can provide the same configuration options to spark-shell as well.
One thing I did find tricky was that the Mongo Spark connector doesn't appear to recognize the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark, as the Spark executors reference the truststore in the configuration anyway.
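A related pitfall when assembling the connection string by hand is that special characters in the username or password need to be percent-encoded. The sketch below shows one way to build the URI in Python. The helper name docdb_uri, the host and the credentials are made-up placeholders; the query parameters are the ones from the question.

```python
from urllib.parse import quote_plus, urlencode


def docdb_uri(user, password, host, port=27017, **params):
    """Return a MongoDB URI with percent-encoded credentials."""
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    query = urlencode(params)
    return f"mongodb://{credentials}@{host}:{port}/?{query}"


# Placeholder host and credentials for illustration only.
uri = docdb_uri(
    "admin",
    "p@ss!",
    "example.cluster-xxxx.us-east-1.docdb.amazonaws.com",
    replicaSet="rs0",
    readPreference="secondaryPreferred",
    retryWrites="false",
)
print(uri)
```

The resulting string can be passed as --conf "spark.mongodb.output.uri=<uri>" alongside the truststore options above.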

Databricks-connect Java connects to local instead of remote

I have a Java application that connects to an Apache Spark cluster and performs some operations. I'm trying to connect to a Databricks cluster on Azure, using databricks-connect 7.3. If I run from the terminal databricks-connect test, everything works perfectly. I'm following their documentation, I included the jars in IntelliJ, added spark.databricks.service.server.enabled true to the cluster in Databricks and used the following to create the SparkSession:
SparkSession spark = SparkSession
.builder()
.master("local")
.getOrCreate();
The problem is that this command connects to a local cluster that is instantiated at runtime, and does not connect to the remote Databricks cluster. Am I missing something?

Apache livy error - "Unable to connect to server" spark://[hostname]:7077

I'm trying to submit a Spark batch job using Apache Livy.
I am running my Spark cluster and Livy service on Openshift.
The Spark cluster is in standalone cluster mode, and Livy runs Spark in cluster deploy mode.
I've configured Livy to point to the Spark cluster master service:
livy.spark.master = spark://spark.spark.svc:7077
When I try to run a Scala Spark job using the POST /batches method, I get the exception:
Unable to connect to server spark://spark.spark.svc:7077
Has anyone experienced this issue?
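For reference, a batch submission of the kind described can be sketched in Python as below. This only builds the request and does not send it. The Livy URL, jar path and class name are placeholders I invented, and build_batch_request is a hypothetical helper; the /batches endpoint and the file, className and args fields are part of Livy's REST API.

```python
import json
from urllib.request import Request


def build_batch_request(livy_url, file, class_name, args=()):
    """Build (but do not send) a POST /batches request for Livy."""
    payload = {"file": file, "className": class_name, "args": list(args)}
    return Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint, jar and class for illustration only.
req = build_batch_request(
    "http://livy.example:8998",
    "local:///opt/jobs/my-job.jar",
    "com.example.MyJob",
)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

Running the same request from inside the Livy pod is a quick way to tell whether the failure is in the submission itself or in Livy reaching the master service.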

Submitting pyspark script to a remote Spark server?

This is probably a really silly question, but I can't find the answer with Google. I've written a simple pyspark ETL script that reads in a CSV and writes it to Parquet, something like this:
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
df = sqlContext.read.csv(input_filename)
df.write.parquet(output_path)
To run it, I start up a local Spark cluster in Docker:
$ docker run --network=host jupyter/pyspark-notebook
I run the Python script and it connects to this local Spark cluster and all works as expected.
Now I'd like to run the same script on a remote Spark cluster (AWS EMR). Can I just specify a remote IP address somewhere when initialising the Spark context? Or am I misunderstanding how Spark works?
You can create a spark session by specifying the IP address of the remote master.
spark = SparkSession.builder.master("spark://<ip>:<port>").getOrCreate()
In the case of AWS EMR, standalone mode is not supported. You need to use YARN in either client or cluster mode, and point HADOOP_CONF_DIR to a location on your local server that contains all the files from /etc/hadoop/conf on the cluster. Then set up dynamic port forwarding to connect to the EMR cluster. Create a Spark session like:
spark = SparkSession.builder.master('yarn').config('spark.submit.deployMode', 'cluster').getOrCreate()
Refer to https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/
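The HADOOP_CONF_DIR part can be sketched from Python as follows, assuming you have already copied /etc/hadoop/conf from the EMR master to a local directory. The function names and the app name are hypothetical, and this sketch leaves the deploy mode at Spark's default (client) rather than setting it explicitly, which is my assumption and not part of the answer above.

```python
import os


def use_hadoop_conf(conf_dir):
    """Point Spark at a local copy of the cluster's /etc/hadoop/conf."""
    required = ("core-site.xml", "yarn-site.xml")
    missing = [name for name in required
               if not os.path.isfile(os.path.join(conf_dir, name))]
    if missing:
        raise FileNotFoundError(f"{conf_dir} is missing {', '.join(missing)}")
    # These must be set before the SparkSession (and its JVM) is created.
    os.environ["HADOOP_CONF_DIR"] = conf_dir
    os.environ["YARN_CONF_DIR"] = conf_dir


def create_yarn_session(app_name="remote-etl"):
    """Create a SparkSession on YARN; call use_hadoop_conf() first."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return SparkSession.builder.master("yarn").appName(app_name).getOrCreate()
```

Typical use would be use_hadoop_conf("/path/to/copied/conf") followed by spark = create_yarn_session(). Failing early on a missing yarn-site.xml tends to be easier to debug than a session that hangs while looking for a ResourceManager.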

How to connect to Spark EMR from the locally running Spark Shell

I have created a Spark EMR cluster. I would like to execute jobs either on my localhost or EMR cluster.
Assuming I run spark-shell on my local computer, how can I tell it to connect to the Spark EMR cluster? What would be the exact configuration options and/or commands to run?
It looks like others have also failed at this and ended up running the Spark driver on EMR, but then making use of e.g. Zeppelin or Jupyter running on EMR.
Setting up our own machines as spark drivers that connected to the core nodes on EMR would have been ideal. Unfortunately, this was impossible to do and we forfeited after trying many configuration changes. The driver would start up and then keep waiting unsuccessfully, trying to connect to the slaves.
Most of our Spark development is on pyspark using Jupyter Notebook as our IDE. Since we had to run Jupyter from the master node, we couldn’t risk losing our work if the cluster were to go down. So, we created an EBS volume and attached it to the master node and placed all of our work on this volume. [...]
source
Note: If you go down this route, I would consider using S3 for storing notebooks, then you don't have to manage EBS volumes.
One way of doing this is to add your Spark job as an EMR step to your EMR cluster. For this, you need the AWS CLI installed on your local computer (see the AWS CLI installation guide) and your jar file on S3.
Once you have aws cli, assuming your spark class to run is com.company.my.MySparkJob and your jar file is located on s3 at s3://hadi/my-project-0.1.jar, you can run the following command from your terminal:
aws emr add-steps --cluster-id j-************* --steps Type=spark,Name=My_Spark_Job,Args=[--class,com.company.my.MySparkJob,s3://hadi/my-project-0.1.jar],ActionOnFailure=CONTINUE
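If you would rather submit the step from Python than from the CLI, a rough equivalent using boto3 might look like the sketch below. The helper names spark_step and add_step are hypothetical, and running spark-submit through command-runner.jar is my assumption about how to express the same step; the class name and jar path are the ones from the command above.

```python
def spark_step(name, main_class, jar_s3_path, action_on_failure="CONTINUE"):
    """Build a step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", main_class, jar_s3_path],
        },
    }


def add_step(cluster_id, step):
    """Submit the step to a running cluster and return the new step ID."""
    import boto3  # imported lazily; requires boto3 and AWS credentials

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = spark_step("My_Spark_Job", "com.company.my.MySparkJob",
                  "s3://hadi/my-project-0.1.jar")
print(step["HadoopJarStep"]["Args"])
```

Calling add_step with your own cluster ID and this step then queues the job on the cluster.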
