I have configured Hadoop and spark in docker through k8s agent container which we are using to run the Jenkins job and we are using AWS EKS. but while running the spark-submit job we are getting the below error
py4j.protocol.Py4JJavaError: An error occurred while calling o40.exists.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxxxxxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxxxxxxxxxxxxxx/xxxxxxxx
we have created a service account in k8s and added annotation as IAM role.(IAM role to access s3 which created in aws )
we see it can copy files from s3 but getting this error in job and not able to find out root cause .
note : Spark version 2.2.1
hadoop version : 2.7.4
Thanks
this is a five year old version of spark built on an eight year old set of hadoop binaries, including the s3a connector. "uch some of the binding logic to pick up iam roles simply isn't there.
Upgrade to spark 3.3.x with a full set of the hadoop-3.3.4 jars and try again.
(Note that "use a recent release" is step one of any problem with an open source application, it'd be the first action required if you ever file a bug report)
Related
I am using circleCI for deployments, with AKS version 1.11 , the pipelines were working fine but after the AKS upgradation to 1.14.6, failure is seen while applying the deployment and service object files.
I deployed manually at kubernetes cluster, there didn't appear any error but while deploying through circleCI, I am getting following kind of errors while using version 2 of circleCI
error: SchemaError(io.k8s.api.extensions.v1beta1.DeploymentRollback):
invalid object doesn't have additional properties
or the other kind of error appears like -
error: SchemaError(io.k8s.api.core.v1.StorageOSVolumeSource): invalid
object doesn't have additional properties
It's most likely that the version of kubectl used in CircleCI isn't supported by 1.14.6. Note that kubectl version must be either 1.n, 1.(n+1) or 1.(n-1) where n is the minor version of the cluster. In this case your kubectl must be at least 1.13.x or at most 1.15.x
Checkout Kubernetes version and version skew support policy for more details.
We get an exception when setting up flink on Azure HDInsights cluster.
./bin/yarn-session.sh -n 4 -jm 1024m -tm 4096m
Throws:
org.apache.flink.client.deployment.ClusterDeploymentException:
Couldn't deploy Yarn session cluster
Caused by:
Failing this attempt.Diagnostics:
[2018-10-24 00:41:17.703]Resource wasb://../.flink/application_1539730571763_0057/
application_1539730571763_0057-flink-conf.yaml8158650202504017094.tmp
changed on src filesystem (expected 1540341676000, was 1540341677000
java.io.IOException:
at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
It seems to be because wasb blob storage is not keeping original timestamps for copied files breaking HDFS API abstraction on top of wasb.
Any workarounds for this?
Only other thread I could find was Oozie/yarn: resource changed on src filesystem.
We worked with MS Azure support and they confirmed that it only works with Flink 1.4.x and not 1.5.x or 1.6.x.
We downgraded it to 1.4.x and it is working fine now in Azure cluster.
In order to access my S3 bucket i have exported my creds
export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESSS_ACCESS_KEY=
I can verify that everything works by doing
aws s3 ls mybucket
I can also verify with boto3 that it works in python
resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
.put(Body=open("text.py", "rb"),ContentType="text/x-py")
This works and I can see the file in the bucket.
However when I do this with spark:
spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*)
I get a nice
> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0"
> encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>XXXXX</RequestId><HostId>xxxxxxx</HostId></Error>
this is how I submit the job locally
spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py
Why does it works with command line + boto3 but spark is chocking ?
EDIT:
Same issue using s3a:// with
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "xxxx")
hadoopConf.set("fs.s3a.secret.key", "xxxxxxx")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
and same issue using aws-sdk 1.7.4 and hadoop 2.7.2
Spark will automatically copy your AWS Credentials to the s3n and s3a secrets. Apache Spark releases don't touch the s3:// URLs, as in Apache Hadoop, the s3:// schema is associated with the original, now deprecated s3 client, one which is incompatible with everything else.
On Amazon EMR, s3:// is bound to amazon EMR S3; EC2 VMs will provide the secrets for executors automatically. So I don't think it bothers with the env var propagation mechanism. It might also be that how it sets up the authentication chain, you can't override the EC2/IAM data.
If you are trying to talk to S3 and you are not running in an EMR VM, then presumably you are using Apache Spark with the Apache Hadoop JARs, not the EMR versions. In that world use URLs with s3a:// to get the latest S3 client library
If that doesn't work, look at the troubleshooting section of the apache docs. There's a section on "403" there including recommended steps for troubleshooting. It can be due to classpath/JVM version problems as well as credential issues, even clock-skew between client and AWS.
I'm using aws cli and I launch a Cluster with the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --ec2-attributes KeyName=ChiaveEMR --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
after that, I put a file into the master node:
aws emr put --cluster-id j-NSGFSP57255P --key-pair-file "ChiaveEMR.pem" --src "./configS3.txt"
The file is located in /home/hadoop/configS3.txt.
Then I launch a step:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Type=Spark,Name=SparkSubmit,Args=[--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/traccia-22-ottobre_2.11-1.0Ale.jar,/home/hadoop/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
But I get this error:
17/02/23 14:49:51 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
probably due to the fact that 'configS3.txt' is located on the master and not on the slaves.
How could I pass 'configS3.txt' to spark-submit script? I've tried from S3 too but it doesn't work. Any solutions? Thanks in advance
Since you are using "--deploy-mode cluster", the driver runs on a CORE/TASK instance rather than the MASTER instance, so yes, it's because you uploaded the file to the MASTER instance but then the code that's trying to access the file is not running on the MASTER instance.
Given that the error you are encountering is a FileNotFoundException, it sounds like your application code is trying to open it directly, meaning that of course you can't simply use the S3 path directly. (You can't do something like new File("s3://bucket/key") because Java has no idea how to handle this.) My assumption could be wrong though because you have not included your application code or explained what you are doing with this configS3.txt file.
Maurizio: you're still trying to fix your previous problem.
On a distributed system, you need files which are visible on all machines (which the s3:// filestore delivers) and to use an API which works with data from the distributed filesystem. which SparkContext.hadoopRDD() delivers. You aren't going to get anywhere by trying to work out how to get a file onto the local disk of every VM, because that's not the problem you need to fix: it's how to get your code to read data from the shared object store.
Sorry
Version of Apache Spark: spark-1.2.1-bin-hadoop2.4
Platform: Ubuntu
I have been using the spark-1.2.1-bin-hadoop2.4/ec2/spark-ec2 script to create temporary clusters on ec2 for testing. All was working well.
Then I started to get the following error when trying to launch the cluster:
[Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
I have traced this back to the following line in the spark_ec2.py script:
conn = ec2.connect_to_region(opts.region)
Thus, the first time the script interacts with ec2, it is throwing this error. Spark is using the Python boto library (included with the Spark download) to make this call.
I assume the error I am getting is because of a bad cacert.pem file somewhere.
My question: which cacert.pem file gets used when I try to invoke the spark-ec2 script, and why is it not working?
I also had this error with spark-1.2.0-bin-hadoop2.4
SOLVED: the embedded boto library that comes with Spark found a ~/.boto config file I had for another non-Spark project (actually it was for the Google Cloud Services...GCS installed it, I had forgotten about it). That was screwing everything up.
As soon as I deleted the ~/.boto config file GCS installed, everything started working again for Spark!