How to connect documentdb to a spark application in an emr instance - apache-spark

I'm getting error while I'm trying to configure spark with mongodb in my EMR instance. Below is the command -
spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!#docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" "spark.mongodb.output.collection="ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?

DocumentDB requires a CA bundle to be installed on each node where your spark executors will launch. As such you firstly need to install the CA certs on each instance, AWS has a guide under the JAVA section for this in two bash scripts which makes things easier.1
Once these certs are installed, your spark command needs to reference the truststores and its passwords using the configuration parameters you can pass to Spark. Here is an example that I ran and this worked fine.
spark-submit
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
--conf "spark.executor.extraJavaOptions=
-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks
-Djavax.net.ssl.trustStorePassword=<yourpassword>" pytest.py
you can provide those same configuration options in both spark-shell as well.
One thing i did find tricky, was that the mongo spark connector doesnt appear to know the ssl_ca_certs parameter in the connection string, so i removed this to avoid warnings from Spark as the Spark executors would reference the keystore in the configuration anyway.

Related

Airflow and Spark/Hadoop - Unique cluster or one for Airflow and other for Spark/Hadoop

I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? Looks like it's a little complicated to deploy spark remotely from another cluster and that will create some file configuration duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver being managed by Airflow isn't a bad idea)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers could have a Hive client installed on them)
I prefer submitting Spark Jobs using SSHOperator and running spark-submit command which would save you from copy/pasting yarn-site.xml. Also, I would not create a cluster for Airflow if the only task that I perform is running Spark jobs, a single VM with LocalExecutor should be fine.
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.

spark-submit on ibm bluemix

i've just registrated a free instance of "Analitycs for Apache Spark" and followed this tutorial to use spark submit ibm ad hoc designed script to run an app from my local machine on bluemix cloud cluster. The issue is the following: i've made everithing that was described in the tutorial and lunched this script
./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster --master
https://spark.eu-gb.bluemix.net --files /home/vito/vinorosso2.csv
--conf spark.service.spark_version=2.2.0
/home/vito/workspace_2/sbt-esempi/target/scala-2.11/isolationF3.jar
--class progettoSisDis.MasterNode
everything proceed fine (dataset vinorosso2.csv and my fatJar are correctly uploaded) until the terminal sais :" submission complete" at this point when i go to the log file created by the script there was this error message :
Submit job result: Invalid plan and spark version combination in HTTP request (ibm.SparkService.PayGoPersonal, 2.0.0)
Submission ID:
ERROR: Problem submitting job. Exit
So, it wasn't enough to register a free instance of Analitycs for apache spark to submit a spark job? Hope someone can help. By the way, if it helps, on my local machine i'm using spark with intellij idea (scala). Byyye
From https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/index-gentopic2.html#using_spark-submit you need to be using Spark version 1.6.x or 2.0.x. Your submit job is set to version 2.2.0. Try submitting using spark.service.spark_version=2.0.0 (assuming your code will work with this version of Spark).

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I am trying to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
Error throws
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only
start-dfs.sh
in Hadoop and does not really config anything in Spark. Do I need to run Spark using YARN cluster manager instead so that Spark and Hadoop are using the same cluster manager, hence can get access to HDFS files?
I have tried to config yarn-site.xml in Hadoop following tutorialspoint https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error throws. Am I missing some other configurations?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 with Hadoop 2.7. Tried to download hadoop-2.7.4 but the same error still exists.
The question here suggests this as a java syntax issue rather than spark hdfs issue. I will try this approach and see if this solves the error here.
Inspired by the post here, solved the problem by myself.
This map-reduce job depends on a Serializable class, so when running in Spark local mode, this serializable class can be found and the map-reduce job can be executed dependently.
When running in Spark standalone cluster mode, the best is to submit the application through spark-submit, rather than running in an IDE. Packaged everything in jar and spark-submit the jar, works as a charm!

How to submit job(jar) to the Azure Spark cluster through commandline interface?

I am new to HDInsight Spark, I am trying to run a use-case to learn how things work in Azure Spark cluster. This is what I have done so far.
Able to create azure spark cluster.
Create jar by following steps as described in the link: create standalone scala application to run on HDInsight Spark cluster. I have used the same scala code as given in the link.
ssh into head node
upload jar to the blob storage using link: using azure CLI with azure storage
copy zip to machine
hadoop fs -copyToLocal
I have checked that the jar gets uploaded to the headnode(machine).
I want to run that jar and get the results as stated in the link given in
point 2 above.
What will be the next step? How can I submit spark job and get results using command line interface?
For example considering you are created jar for program submit.jar. In order to submit this to your cluster with dependency you can use below syntax.
spark-submit --master yarn --deploy-mode cluster --packages "com.microsoft.azure:azure-eventhubs-spark_2.11:2.2.5" --class com.ex.abc.MainMethod "wasb://space-hdfs#yourblob.blob.core.windows.net/xx/xx/submit.jar" "param1.json" "param2"
Here --packages :is to include dependency to you program, you can use --jars option and then followed by jar path. --jars "path/to/dependency/abc.jar"
--class : Main method of your program
after that specify path for your program jar.
you can pass parameters with if you needed as shown above
A couple of options on submitting spark jars:
1) If you want to submit the job on the headnode already, you can use spark-submit
See Apache submit jar documentation
2) An easier alternative is to submit spark jar via livy after uploading the jar to wasb storage.
See submit via livy doc. You can skip step 5 if you do it this way.

Spark submit does automatically upload the jar to cluster?

I'm trying to submit a Spark app from local machine Terminal to my Cluster.
I'm using --master yarn-cluster. I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine
When I provide the path to application jar which is in my local machine, would spark-submit automatically upload it to my Cluster?
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist
In Documentation ,http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
Advanced Dependency Management When using spark-submit, the
application jar along with any jars included with the --jars option
will be automatically transferred to the cluster.
But seems like it does not !
I see you are quoting the spark-submit page from Spark Docs but I would spend a lot more time on the Running Spark on YARN page. Bottom-line, look at:
There are two deploy modes that can be used to launch Spark
applications on YARN. In yarn-cluster mode, the Spark driver runs
inside an application master process which is managed by YARN on the
cluster, and the client can go away after initiating the application.
In yarn-client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.
Further you note, "I need to run the driver program on my Cluster too, not on the machine I do submit the application i.e my local machine"
So I agree with you that you are right to run --master yarn-cluster instead of --master yarn-client
(and one comment notes what might just be a syntax error where you dropped "assembly.jar" but I think this will apply as well...)
Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to Classpaths and the need to push jars to the workers.
From an email on the Apache Spark User list:
YARN cluster mode. Spark submit does upload your jars to the cluster.
In particular, it puts the jars in HDFS so your driver can just read
from there. As in other deployments, the executors pull the jars from
the driver.
So finally, from the Apache Spark YARN doc:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to HDFS and connect to the
YARN ResourceManager.
NOTE: I only see you adding a single JAR, if there's a need to add other JARs there's a special note about doing that with YARN:
In yarn-cluster mode, the driver runs on a different machine than the
client, so SparkContext.addJar won’t work out of the box with files
that are local to the client. To make files on the client available to
SparkContext.addJar, include them with the --jars option in the launch
command.
That page in the link has some examples.
And of course you downloaded or built the YARN-specific version of Spark.
Background, in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes you do need to make sure every worker node has access to all the dependencies, Spark will not push them to the cluster. This is because in "standalone cluster" mode with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or "globally available" via HDFS, for another reason from the docs:
From Advanced Dependency Management, it seems to present the best of both worlds, but also a great reason for manually pushing your jars out to all nodes:
local: - a URI starting with local:/ is expected to exist as a local
file on each worker node. This means that no network IO will be
incurred, and works well for large files/JARs that are pushed to each
worker, or shared via NFS, GlusterFS, etc.
But I assume that local:/... would change to hdfs:/ ... not sure on that one.
Yes and no. It depends on what you mean. Spark deploys the .jar to the nodes in the cluster. However, it won't upload your .jar file from your local machine to the cluster.
You can find more info in the Submitting Applications page. As you can see, in the arguments you pass to spark-submit, there is one that needs to be globally visible: the application-jar.
application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your
cluster, for instance, an hdfs:// path or a file:// path that is
present on all nodes.
As far as I understand, what you want is to use yarn-client, not yarn-cluster. This will run the driver in the client (e.g., the machine which you are trying to call spark-submit on, for example your laptop), without the need of copying the .jar file on the cluster. More about this here.
Try adding --jars option before your /path/to/jar/file
spark-submit --jars /tmp/test.jar

Resources