AWS EMR - Upload file into the application master - apache-spark

I'm using the AWS CLI and launch a cluster with the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --ec2-attributes KeyName=ChiaveEMR --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
After that, I copy a file onto the master node:
aws emr put --cluster-id j-NSGFSP57255P --key-pair-file "ChiaveEMR.pem" --src "./configS3.txt"
The file is located in /home/hadoop/configS3.txt.
Then I launch a step:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Type=Spark,Name=SparkSubmit,Args=[--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/traccia-22-ottobre_2.11-1.0Ale.jar,/home/hadoop/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
But I get this error:
17/02/23 14:49:51 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
This is probably because 'configS3.txt' is located on the master and not on the slaves.
How can I pass 'configS3.txt' to the spark-submit script? I've also tried reading it from S3, but that doesn't work either. Any solutions? Thanks in advance.

Since you are using "--deploy-mode cluster", the driver runs on a CORE/TASK instance rather than the MASTER instance. So yes: you uploaded the file to the MASTER instance, but the code that tries to open it is not running there.
Given that the error is a FileNotFoundException, it sounds like your application code opens the file directly through the local file API, which means you can't simply pass the S3 path instead. (You can't do something like new File("s3://bucket/key") because Java has no idea how to handle this.) My assumption could be wrong, though, because you have not included your application code or explained what you are doing with this configS3.txt file.
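To make the distinction concrete, here is a minimal Python sketch (not the asker's code, which is not shown) of telling a local path apart from an object store URI. Splitting the URI into bucket and key is the first thing any S3 client needs. The function name and the scheme list are illustrative, and the S3 location of configS3.txt in the demo is hypothetical; only the bucket name comes from the question.

```python
from urllib.parse import urlparse

# Schemes that name an object store rather than a local disk (illustrative list).
OBJECT_STORE_SCHEMES = {"s3", "s3a", "s3n"}


def split_object_uri(uri):
    """Return (bucket, key) for an S3-style URI, or None for a plain local path."""
    parsed = urlparse(uri)
    if parsed.scheme not in OBJECT_STORE_SCHEMES:
        # A local path is only valid on the machine that actually holds the file.
        return None
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # The second argument is a hypothetical S3 location for the config file.
    for arg in ["/home/hadoop/configS3.txt", "s3://tracceale/params/configS3.txt"]:
        print(arg, "->", split_object_uri(arg))
```

A local file API only ever sees the first kind of argument. The second kind has to go through an S3 client or a Hadoop-compatible filesystem API.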

Maurizio: you're still trying to fix your previous problem.
On a distributed system, you need files which are visible on all machines (which the s3:// filestore delivers) and an API which works with data from the distributed filesystem (which SparkContext.hadoopRDD() delivers). You aren't going to get anywhere by trying to work out how to get a file onto the local disk of every VM, because that's not the problem you need to fix: the problem is how to get your code to read data from the shared object store.
Sorry

Related

403 Error while accessing s3a using Spark/hadoop

I have configured Hadoop and Spark in Docker, in the k8s agent container that we use to run our Jenkins job, and we are on AWS EKS. While running the spark-submit job we get the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o40.exists.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxxxxxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxxxxxxxxxxxxxx/xxxxxxxx
We have created a service account in k8s and annotated it with an IAM role (a role created in AWS with access to S3).
We can see that it is able to copy files from S3, but the job still fails with this error and we cannot find the root cause.
note : Spark version 2.2.1
hadoop version : 2.7.4
Thanks
This is a five-year-old version of Spark built on an eight-year-old set of Hadoop binaries, including the s3a connector. Some of the binding logic to pick up IAM roles simply isn't there.
Upgrade to spark 3.3.x with a full set of the hadoop-3.3.4 jars and try again.
(Note that "use a recent release" is step one of any problem with an open source application, it'd be the first action required if you ever file a bug report)
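As a rough illustration of the "check your versions first" step, the sketch below compares installed versions against the ones recommended in the answer. The helper names are made up for this example, and treating "3.3.x" as a floor of 3.3.0 is an assumption. Comparing parsed tuples rather than raw strings matters, because "3.10.0" sorts before "3.9.0" as text.

```python
def parse_version(text):
    """Turn '2.7.4' into (2, 7, 4) so versions compare numerically, not as text."""
    return tuple(int(part) for part in text.split("."))


def needs_upgrade(installed, recommended):
    """True when the installed version is older than the recommended one."""
    return parse_version(installed) < parse_version(recommended)


if __name__ == "__main__":
    # Installed versions are from the question; recommended ones from the answer.
    # "3.3.0" stands in for "3.3.x" and is an assumption.
    for name, installed, recommended in [
        ("spark", "2.2.1", "3.3.0"),
        ("hadoop", "2.7.4", "3.3.4"),
    ]:
        print(name, installed, "needs upgrade:", needs_upgrade(installed, recommended))
```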

Running Spark2.3 on Kubernetes with remote dependency on S3

I am running spark-submit to run on Kubernetes (Spark 2.3). My problem is that the InitContainer does not download my jar file if it's specified as an s3a:// path but does work if I put my jar on an HTTP server and use http://. The spark driver fails, of course, because it can't find my Class (and the jar file in fact is not in the image).
I have tried two approaches:
specifying the s3a path to jar as the argument to spark-submit and
using --jars to specify the jar file's location on s3a, but both fail in the same way.
edit: also, using local:///home/myuser/app.jar does not work with the same symptoms.
On a failed run (dependency on s3a), I logged into the container and found the directory /var/spark-data/spark-jars/ to be empty. The init-container logs don't indicate any type of error.
Questions:
What is the correct way to specify remote dependencies on S3A?
Is S3A not supported yet? Only http(s)?
Any suggestions on how to further debug the InitContainer to determine why the download doesn't happen?

Error while running Zeppelin paragraphs in Spark on Linux cluster in Azure HdInsight

I have been following this tutorial in order to set up Zeppelin on a Spark cluster (version 1.5.2) in HDInsight, on Linux. Everything worked fine, I have managed to successfully connect to the Zeppelin notebook through the SSH tunnel. However, when I try to run any kind of paragraph, the first time I get the following error:
java.io.IOException: No FileSystem for scheme: wasb
After getting this error, if I try to rerun the paragraph, I get another error:
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
These errors occur regardless of the code I enter, even if there is no reference to the hdfs. What I'm saying is that I get the "No FileSystem" error even for a trivial scala expression, such as parallelize.
Is there a missing configuration step?
I am downloading the tarball used by the script you pointed to as I type. My guess is that your Zeppelin and Spark installs are not fully set up to work with wasb. To get Spark to work with wasb you need to add some jars to the classpath. To do this, add something like the following to your spark-defaults.conf (the paths might be different in HDInsight; this is from HDP on IaaS):
spark.driver.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
spark.executor.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
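Since the two settings must list the same jars and the paths are long, a small script can generate them and check that the jars exist on the node. This is only a sketch: the jar paths are the HDP ones quoted above and will likely differ on HDInsight, and the helper names are invented for illustration.

```python
import os

# HDP paths quoted in the answer above; adjust for your own distribution.
HDP_ROOT = "/usr/hdp/2.3.0.0-2557/hadoop"
WASB_JARS = [
    HDP_ROOT + "/lib/azure-storage-2.2.0.jar",
    HDP_ROOT + "/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar",
    HDP_ROOT + "/hadoop-azure-2.7.1.2.3.0.0-2557.jar",
]


def classpath_lines(jars):
    """Build the driver and executor extraClassPath lines for spark-defaults.conf."""
    joined = ":".join(jars)
    return [
        "spark.driver.extraClassPath " + joined,
        "spark.executor.extraClassPath " + joined,
    ]


def missing_jars(jars):
    """Return the jars that are not present on this machine."""
    return [jar for jar in jars if not os.path.isfile(jar)]


if __name__ == "__main__":
    for jar in missing_jars(WASB_JARS):
        print("missing:", jar)
    print("\n".join(classpath_lines(WASB_JARS)))
```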
Once you have Spark working with wasb, the next step is to put those same jars on the Zeppelin classpath. A good way to test your setup is to make a notebook that prints your env vars and classpath:
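// Print every environment variable visible to the interpreter
sys.env.foreach(println(_))
// Print the jars on the system classpath (this cast assumes Java 8 or earlier,
// where the system class loader is a URLClassLoader)
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)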
Also, looking at the install script, it is trying to pull the Zeppelin jar from wasb; you might want to point that config somewhere else while you try some of these changes out (zeppelin.sh):
export SPARK_YARN_JAR=wasb:///apps/zeppelin/zeppelin-spark-0.5.5-SNAPSHOT.jar
I hope this helps. If you are still having problems I have some other ideas, but I would start with these first.

Where is Spark writing SaveAsTextFile in cluster?

I'm a bit at loss here (Spark newbie). I spun up an EC2 cluster, and submitted a Spark job which saves as text file in the last step. The code reads
reduce_tuples.saveAsTextFile('september_2015')
and the working directory of the Python file I'm submitting is /root. I cannot find a directory called september_2015, and if I try to run the job again I get the error:
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000/user/root/september_2015 already exists
The ec2 address is the master node where I'm ssh'ing to, but I don't have a folder /user/root.
It seems Spark is creating the september_2015 directory somewhere, but find doesn't locate it. Where does Spark write the resulting directory? Why is it pointing me to a directory that doesn't exist in the master node filesystem?
You're not saving it in the local file system, you're saving it in the hdfs cluster. Try eph*-hdfs/bin/hadoop fs -ls /, then you should see your file. See eph*-hdfs/bin/hadoop help for more commands, eg. -copyToLocal.
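The path in the error message follows from how a relative output path gets qualified: it is resolved against the default filesystem and the user's HDFS home directory (/user/<name>), not against the local working directory. The sketch below only mimics that rule in simplified form and is not Hadoop's actual code; the function name is invented for illustration.

```python
import posixpath


def resolve_output_path(path, default_fs, user):
    """Mimic (in simplified form) how a path passed to saveAsTextFile is qualified."""
    if "://" in path:
        return path  # already carries its own scheme, e.g. file:// or s3a://
    if not path.startswith("/"):
        # Relative paths land under the user's home directory on the default FS.
        path = posixpath.join("/user", user, path)
    return default_fs.rstrip("/") + path


if __name__ == "__main__":
    print(resolve_output_path(
        "september_2015",
        "hdfs://ec2-54-172-88-52.compute-1.amazonaws.com:9000",
        "root",
    ))
```

With the values from the question this yields the same hdfs://.../user/root/september_2015 location that the FileAlreadyExistsException reports.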

Bad SSL Key When Trying to Use spark-ec2 script to launch cluster on EC2?

Version of Apache Spark: spark-1.2.1-bin-hadoop2.4
Platform: Ubuntu
I have been using the spark-1.2.1-bin-hadoop2.4/ec2/spark-ec2 script to create temporary clusters on ec2 for testing. All was working well.
Then I started to get the following error when trying to launch the cluster:
[Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
I have traced this back to the following line in the spark_ec2.py script:
conn = ec2.connect_to_region(opts.region)
Thus, the first time the script interacts with ec2, it is throwing this error. Spark is using the Python boto library (included with the Spark download) to make this call.
I assume the error I am getting is because of a bad cacert.pem file somewhere.
My question: which cacert.pem file gets used when I try to invoke the spark-ec2 script, and why is it not working?
I also had this error with spark-1.2.0-bin-hadoop2.4
SOLVED: the embedded boto library that comes with Spark found a ~/.boto config file I had for another non-Spark project (actually it was for the Google Cloud Services...GCS installed it, I had forgotten about it). That was screwing everything up.
As soon as I deleted the ~/.boto config file GCS installed, everything started working again for Spark!
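For anyone hitting the same thing, a quick check is to see whether any boto config file on the machine overrides the CA bundle. The sketch below assumes the usual boto 2 config locations (/etc/boto.cfg and ~/.boto) and the ca_certificates_file option in the [Boto] section. Whether that particular option was the culprit in the case above is a guess, since the post does not say what the stray file contained.

```python
import configparser
import os

# Locations boto 2 normally reads (an assumption for this sketch).
DEFAULT_BOTO_CONFIGS = ["/etc/boto.cfg", os.path.expanduser("~/.boto")]


def ca_overrides(paths):
    """Return (path, value) for each config file that sets [Boto] ca_certificates_file."""
    found = []
    for path in paths:
        # RawConfigParser avoids '%' interpolation errors on arbitrary config values.
        parser = configparser.RawConfigParser(strict=False)
        try:
            if not parser.read(path):
                continue  # file missing or unreadable
        except configparser.Error:
            continue  # not parseable as an INI file; skip it
        if parser.has_option("Boto", "ca_certificates_file"):
            found.append((path, parser.get("Boto", "ca_certificates_file")))
    return found


if __name__ == "__main__":
    for path, value in ca_overrides(DEFAULT_BOTO_CONFIGS):
        print(path, "sets ca_certificates_file =", value)
```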
