java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found - apache-spark

I am new to Spark and Delta Lake and am trying out a POC with PySpark, using MinIO as Delta Lake's storage backend. However, I am getting the error
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
I have added the JAR coordinates in the Python code, assuming the required JAR would be downloaded at runtime. I am not able to see where I am going wrong.
Can someone please help me out?
Thanks
ENV
OS: Windows 11
Spark: Apache Spark 3.3.1
Java Version: openjdk version "11.0.13" 2021-10-19
Python Version: 3.9.13
Python packages: pyspark 3.2.3, delta-spark 2.0.2
CODE
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.1") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.hadoop.fs.s3a.access.key", <my key>) \
    .config("spark.hadoop.fs.s3a.secret.key", <my secret>) \
    .config("spark.hadoop.fs.s3a.endpoint", <my endpoint>) \
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

data = spark.range(0, 5)
data.write.format("delta").save("s3a://<my bucket>/delta-lake/demo")

df = spark.read.format("delta").load("tmp/delta-table")
df.show()
OUTPUT
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Spark/spark-3.2.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.2.3.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
:: loading settings :: url = jar:file:/C:/Spark/spark-3.2.3-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: C:\Users\shari\.ivy2\cache
The jars for the packages stored in: C:\Users\shari\.ivy2\jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-050502ed-ca85-4e4f-b7a3-ff69c12689d3;1.0
confs: [default]
found io.delta#delta-core_2.12;2.0.2 in central
found io.delta#delta-storage;2.0.2 in central
found org.antlr#antlr4-runtime;4.8 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
:: resolution report :: resolve 173ms :: artifacts dl 0ms
:: modules in use:
io.delta#delta-core_2.12;2.0.2 from central in [default]
io.delta#delta-storage;2.0.2 from central in [default]
org.antlr#antlr4-runtime;4.8 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 4 | 0 | 0 | 0 || 4 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-050502ed-ca85-4e4f-b7a3-ff69c12689d3
confs: [default]
0 artifacts copied, 4 already retrieved (0kB/0ms)
23/02/16 16:19:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "D:\DLT\quin\quin-experian-elt\el\delta_test.py", line 19, in <module>
data.write.format("delta").save("s3a://quin-third-party-data-dev-1/delta-lake/demo")
File "C:\Users\shari\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pyspark\sql\readwriter.py", line 740, in save
self._jwrite.save(path)
File "C:\Users\shari\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\py4j\java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "C:\Users\shari\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "C:\Users\shari\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o61.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:620)
at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:530)
at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:153)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:115)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:349)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
... 51 more
SUCCESS: The process with PID 6172 (child process of PID 24096) has been terminated.
SUCCESS: The process with PID 24096 (child process of PID 27152) has been terminated.
SUCCESS: The process with PID 27152 (child process of PID 25700) has been terminated.

The problem is most probably caused by the use of the configure_spark_with_delta_pip function, which overrides the spark.jars.packages option that you used to specify the hadoop-aws dependency.
To fix it, pass that dependency as an additional parameter to the same function:
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.hadoop.fs.s3a.access.key", <my key>) \
    .config("spark.hadoop.fs.s3a.secret.key", <my secret>) \
    .config("spark.hadoop.fs.s3a.endpoint", <my endpoint>) \
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")

my_packages = ["org.apache.hadoop:hadoop-aws:3.3.1"]
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()
P.S. Check the compatibility of your Spark version with Delta; see this compatibility matrix to find a correct version.
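As a quick sanity check (a rough sketch on my part, not part of the original answer), you can print the resolved package list once the session is up; with the original builder the hadoop-aws coordinate is missing because it was overridden, while with the fix above both the Delta and hadoop-aws coordinates should appear:
# Sanity-check sketch: shows which Maven coordinates were actually submitted
print(spark.sparkContext.getConf().get("spark.jars.packages"))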

It seems that your error is caused by missing AWS JARs in the Spark image/jars directory that you're using.
Spark needs hadoop-aws in order to interface with S3A.
You'll need to add the hadoop-aws JAR to your classpath; it contains the class that's currently missing: org.apache.hadoop.fs.s3a.S3AFileSystem.
Once you're able to write data to S3A from Spark, you'll next need to verify that you are using the S3A committers to write your data, since the default committers are not optimised for S3A.
Read this for more information on the S3A committers:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/committers.html
Link to maven repository for hadoop-aws:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
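If you go down the committer route, the sketch below shows one way to enable the "magic" committer from the builder in the question. It assumes Spark's spark-hadoop-cloud module (which provides PathOutputCommitProtocol) is on the classpath in addition to hadoop-aws; treat the exact keys as a starting point and verify them against the documentation linked above:
# Sketch: enable the S3A "magic" committer (assumes spark-hadoop-cloud is available)
builder = builder \
    .config("spark.hadoop.fs.s3a.committer.name", "magic") \
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true") \
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol") \
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")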

Related

How to resolve spark error when reading from s3

I am getting an error (java.io.IOException: No FileSystem for scheme: S3a) when running a Spark application. I have looked through various other questions regarding this type of error, but I'm not able to determine the solution. Spark is version 3.1.2.
Updated details below to reflect the current state.
pyspark script:
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.4 pyspark-shell'
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("s3reader") \
.getOrCreate()\
sc = spark.sparkContext
#sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint","xxx.x.xxx.x.com", "us-1-east")
#sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
print(df)
here are my jar versions:
cloud#spark-dev-master:/usr/local/spark/jars$ ls -ltr *aws*
-rw-rw-r-- 1 cloud cloud 126287 Aug 18 2016 hadoop-aws-2.7.4.jar
-rw-rw-r-- 1 cloud cloud 4479 Sep 17 02:36 aws-java-sdk-1.7.4.jar
stack trace:
Traceback (most recent call last):
File "/home/cloud/sparks3test.py", line 18, in <module>
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 372, in json
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.json.
: java.io.IOException: No FileSystem for scheme: S3a
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
You need to use hadoop-aws version 3.2.0.
You can refer to my previous answer here.

I am getting an error (java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities)

This is what you see when you mix hadoop-aws and hadoop-common JAR versions. They must match point for point (as Spark JARs also require).
Do not attempt to work around this except by syncing up the JARs; you will only be moving stack traces around.
See Hadoop troubleshooting s3a.

As there still appeared to be JAR dependency issues, I did a fresh install of Spark 3.1.2 with Hadoop 3.2.0 and aligned the hadoop-aws and aws-java-sdk JAR versions with the hadoop-common JAR version on the master and worker nodes. This corrected the file system issue. Upgrading to 3.2.0 also corrected the endpoint issue we were running into, since path.style.access=true is not supported in any Hadoop version older than 2.8.0; that issue is documented at https://issues.apache.org/jira/browse/HADOOP-12963 for reference.
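For reference, a version-aligned setup can also be expressed with spark.jars.packages instead of copying JARs by hand. The sketch below is illustrative, not the exact setup from the answer above; it assumes a Spark build shipped with Hadoop 3.2.0, so hadoop-aws:3.2.0 (and its matching aws-java-sdk-bundle, pulled in transitively) lines up with hadoop-common:
# Illustrative sketch: let Spark resolve a hadoop-aws that matches a Hadoop 3.2.0 build
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("s3reader") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

df = spark.read.json("s3a://silver/testfolder/")  # lowercase s3a scheme; path is a placeholder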

Spark on Kubernetes from jupyterhub not working with IRSA

I am trying to run Spark from JupyterHub against an EKS cluster that uses IRSA. I followed the examples presented in
AWS EKS Spark 3.0, Hadoop 3.2 Error - NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
and https://medium.com/swlh/how-to-perform-a-spark-submit-to-amazon-eks-cluster-with-irsa-50af9b26cae for code and IRSA role examples. However, I am getting an "unable to find a region via the region provider chain" error. I have tried different Spark versions and AWS SDK versions and hard-coded AWS_DEFAULT_REGION values, but it did not resolve the issue. I would appreciate any advice on resolving this.
SPARK_HADOOP_VERSION="3.2"
HADOOP_VERSION="3.2.0"
SPARK_VERSION="3.0.1"
AWS_VERSION="1.11.874"
TINI_VERSION="0.18.0"
I am adding the below JARs to my Spark jars folder:
"https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar"
"https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_VERSION}/aws-java-sdk-bundle-${AWS_VERSION}.jar"
"https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/${AWS_VERSION}/aws-java-sdk-${AWS_VERSION}.jar"
"https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.51.1078/RedshiftJDBC42-no-awssdk-1.2.51.1078.jar"
Sample spark session builder code
SPARK_DRIVER_PACKAGES = ['org.apache.spark:spark-core_2.12:3.0.1',
                         'org.apache.spark:spark-avro_2.12:3.0.1',
                         'org.apache.spark:spark-sql_2.12:3.0.1',
                         'io.github.spark-redshift-community:spark-redshift_2.12:4.2.0',
                         'org.postgresql:postgresql:42.2.14',
                         'mysql:mysql-connector-java:8.0.22',
                         'org.apache.hadoop:hadoop-aws:3.2.0',
                         'com.amazonaws:aws-java-sdk-bundle:1.11.874']

spark_session = SparkSession.builder.master(master_host)\
    .appName("pyspark_session_app_1")\
    .config('spark.driver.host', local_ip)\
    .config('spark.kubernetes.authenticate.driver.serviceAccountName', 'spark')\
    .config('spark.kubernetes.authenticate.executor.serviceAccountName', 'spark')\
    .config("spark.kubernetes.executor.annotation.eks.amazonaws.com/role-arn", "arn:aws:iam::xxxxxxxxx:role/spark-irsa") \
    .config('spark.kubernetes.driver.limit.cores', 0.2)\
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.WebIdentityTokenCredentialsProvider')\
    .config("spark.kubernetes.authenticate.submission.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt") \
    .config("spark.kubernetes.authenticate.submission.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token")\
    .config('spark.kubernetes.executor.request.cores', executor_cores)\
    .config('spark.executor.instances', executor_instances)\
    .config('spark.executor.memory', executor_memory)\
    .config('spark.driver.memory', driver_memory)\
    .config('spark.kubernetes.executor.limit.cores', 1)\
    .config('spark.scheduler.mode', 'FAIR')\
    .config('spark.submit.deployMode', 'client')\
    .config('spark.kubernetes.container.image', SPARK_IMAGE)\
    .config('spark.kubernetes.container.image.pullPolicy', 'Always')\
    .config('spark.kubernetes.namespace', 'prod-data-science')\
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true')\
    .config('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')\
    .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', '1')\
    .config('spark.jars.packages', ','.join(SPARK_DRIVER_PACKAGES))\
    .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false") \
    .config("spark.hadoop.fs.s3a.fast.upload", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config('spark.eventLog.enabled', 'true')\
    .config('spark.eventLog.dir', 's3a://spark-logs-xxxx/')\
    .getOrCreate()
Error message received:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.nio.file.AccessDeniedException: spark-logs-xxxx: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by WebIdentityTokenCredentialsProvider : com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:187)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:375)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:311)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1853)
at org.apache.spark.deploy.history.EventLogFileWriter.<init>(EventLogFileWriters.scala:60)
at org.apache.spark.deploy.history.SingleEventLogFileWriter.<init>(EventLogFileWriters.scala:213)
at org.apache.spark.deploy.history.EventLogFileWriter$.apply(EventLogFileWriters.scala:181)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:64)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:576)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by WebIdentityTokenCredentialsProvider : com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:159)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1257)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:833)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:783)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5212)
at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:6013)
at com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:5986)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5196)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5158)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1421)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1357)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:376)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
... 29 more
Caused by: com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
at com.amazonaws.client.builder.AwsClientBuilder.setRegion(AwsClientBuilder.java:462)
at com.amazonaws.client.builder.AwsClientBuilder.configureMutableProperties(AwsClientBuilder.java:424)
at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
at com.amazonaws.auth.STSAssumeRoleWithWebIdentitySessionCredentialsProvider.buildStsClient(STSAssumeRoleWithWebIdentitySessionCredentialsProvider.java:125)
at com.amazonaws.auth.STSAssumeRoleWithWebIdentitySessionCredentialsProvider.<init>(STSAssumeRoleWithWebIdentitySessionCredentialsProvider.java:97)
at com.amazonaws.auth.STSAssumeRoleWithWebIdentitySessionCredentialsProvider.<init>(STSAssumeRoleWithWebIdentitySessionCredentialsProvider.java:40)
at com.amazonaws.auth.STSAssumeRoleWithWebIdentitySessionCredentialsProvider$Builder.build(STSAssumeRoleWithWebIdentitySessionCredentialsProvider.java:226)
at com.amazonaws.services.securitytoken.internal.STSProfileCredentialsService.getAssumeRoleCredentialsProvider(STSProfileCredentialsService.java:40)
at com.amazonaws.auth.profile.internal.securitytoken.STSProfileCredentialsServiceProvider.getProfileCredentialsProvider(STSProfileCredentialsServiceProvider.java:39)
at com.amazonaws.auth.profile.internal.securitytoken.STSProfileCredentialsServiceProvider.getCredentials(STSProfileCredentialsServiceProvider.java:71)
at com.amazonaws.auth.WebIdentityTokenCredentialsProvider.getCredentials(WebIdentityTokenCredentialsProvider.java:76)
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:137)
... 47 more
I identified that the issue was the AWS region associated with the Spark pods. I included a region variable in the base Spark Docker image and the issue was resolved.
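A configuration-only variant of the same idea (a sketch with placeholder values, not the exact change described above) is to make the region visible to both the driver process and the executor pods; note that the v1 Java SDK's region provider chain reads the AWS_REGION environment variable rather than AWS_DEFAULT_REGION:
# Sketch: expose a region to the notebook driver process and to the executor pods
import os
from pyspark.sql import SparkSession

os.environ["AWS_REGION"] = "us-east-1"  # driver side; placeholder region

spark = SparkSession.builder \
    .config("spark.executorEnv.AWS_REGION", "us-east-1") \
    .getOrCreate()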

Why I am getting following logs while running my spark code via spark-submit

I am running the script via this command
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 direct_kafka_wordcount.py localhost 9092
I am unable to connect to my Kafka topic and retrieve information. I have tried everything but no luck. I am running this simple word-count code against my live Kafka stream.
Ivy Default Cache set to: /home/sagar/.ivy2/cache
The jars for the packages stored in: /home/sagar/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.4.3-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka-0-10_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-be411cc2-fb3f-4049-b222-e3eca55e020b;1.0
confs: [default]
found org.apache.spark#spark-streaming-kafka-0-10_2.11;2.2.0 in central
found org.apache.kafka#kafka_2.11;0.10.0.1 in central
found com.101tec#zkclient;0.8 in central
found org.slf4j#slf4j-api;1.7.16 in central
found org.slf4j#slf4j-log4j12;1.7.16 in central
found log4j#log4j;1.2.17 in central
found com.yammer.metrics#metrics-core;2.2.0 in central
found org.scala-lang.modules#scala-parser-combinators_2.11;1.0.4 in central
found org.apache.kafka#kafka-clients;0.10.0.1 in central
found net.jpountz.lz4#lz4;1.3.0 in central
found org.xerial.snappy#snappy-java;1.1.2.6 in central
found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 1491ms :: artifacts dl 9ms
:: modules in use:
com.101tec#zkclient;0.8 from central in [default]
com.yammer.metrics#metrics-core;2.2.0 from central in [default]
log4j#log4j;1.2.17 from central in [default]
net.jpountz.lz4#lz4;1.3.0 from central in [default]
org.apache.kafka#kafka-clients;0.10.0.1 from central in [default]
org.apache.kafka#kafka_2.11;0.10.0.1 from central in [default]
org.apache.spark#spark-streaming-kafka-0-10_2.11;2.2.0 from central in [default]
org.scala-lang.modules#scala-parser-combinators_2.11;1.0.4 from central in [default]
org.slf4j#slf4j-api;1.7.16 from central in [default]
org.slf4j#slf4j-log4j12;1.7.16 from central in [default]
org.spark-project.spark#unused;1.0.0 from central in [default]
org.xerial.snappy#snappy-java;1.1.2.6 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 12 | 1 | 1 | 0 || 12 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-be411cc2-fb3f-4049-b222-e3eca55e020b
confs: [default]
0 artifacts copied, 12 already retrieved (0kB/8ms)
19/07/09 14:28:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes
where applicable
Traceback (most recent call last):
File "/usr/local/spark-2.4.3-bin-hadoop2.7/examples/src/main/python/streaming/direct_kafka_wordcount.py",
line 48, in
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py",
line 146, in createDirectStream
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
line 1257, in call
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.createDirectStreamWithoutMessageHandler.
: org.apache.spark.SparkException: Broker not in the correct format of : [localhost]
at org.apache.spark.streaming.kafka.KafkaCluster$SimpleConsumerConfig$$anonfun$7.apply(KafkaCluster.scala:390)
at org.apache.spark.streaming.kafka.KafkaCluster$SimpleConsumerConfig$$anonfun$7.apply(KafkaCluster.scala:387)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.streaming.kafka.KafkaCluster$SimpleConsumerConfig.(KafkaCluster.scala:387)
at org.apache.spark.streaming.kafka.KafkaCluster$SimpleConsumerConfig$.apply(KafkaCluster.scala:422)
at org.apache.spark.streaming.kafka.KafkaCluster.config(KafkaCluster.scala:53)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:130)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:119)
at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createDirectStream(KafkaUtils.scala:720)
at org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper.createDirectStreamWithoutMessageHandler(KafkaUtils.scala:688)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Bad syntax; try this (check the Kafka broker host part):
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 direct_kafka_wordcount.py localhost:9092
In general terms, connecting to Kafka's bootstrap servers always requires host:port syntax.
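The same host:port rule applies when the broker list is built inside the script itself. A minimal sketch against the Spark 2.x streaming-kafka-0-8 API used above (topic name and batch interval are placeholders):
# Sketch: brokers must be "host:port", comma-separated if there are several
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="direct_kafka_wordcount")
ssc = StreamingContext(sc, 2)

brokers = "localhost:9092"  # not "localhost 9092"
topic = "my_topic"          # placeholder topic name
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})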

how to write spark dataframe into avro file format in jupyter notebook?

I have configured an Amazon EMR cluster with 1 master node and 2 core nodes. The following software is installed on EMR:
Hive 2.3.4, Pig 0.17.0, Hue 4.3.0, Ganglia 3.7.2, Spark 2.4.0, TensorFlow 1.12.0.
I have not configured any bootstrap action. The cluster is now up and waiting for a step. I have started a notebook from EMR, and below are the details of the code.
sdf = spark.read.csv('hdfs://i....:8020/user/root/temp.csv')
This executes perfectly, and I am able to see my dataframe through sdf.show()
However, when I try to write to an Avro file, it fails:
sdf.write.format("avro").save("avro_file.avro")
ERR:
u'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 736, in save
self._jwrite.save(path)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
I tried:
sdf.write.format("org.apache.spark.sql.avro").save("avro_file.avro")
This gave the same error:
u'Failed to find data source: org.apache.spark.sql.avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 736, in save
self._jwrite.save(path)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Failed to find data source: org.apache.spark.sql.avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
I also tried through a Spark interactive session:
[ec2-user@ip-xxxx conf]$ sudo pyspark --packages org.apache.spark:spark-avro_2.12:2.4.2
Python 2.7.16 (default, Mar 18 2019, 18:38:44)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e8c82e1e-629a-4d83-844d-a86057fc5ae7;1.0
confs: [default]
found org.apache.spark#spark-avro_2.12;2.4.2 in central
found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 209ms :: artifacts dl 6ms
:: modules in use:
org.apache.spark#spark-avro_2.12;2.4.2 from central in [default]
org.spark-project.spark#unused;1.0.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e8c82e1e-629a-4d83-844d-a86057fc5ae7
confs: [default]
0 artifacts copied, 2 already retrieved (0kB/6ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/02 07:23:00 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/05/02 07:23:03 WARN Client: Same path resource file:///root/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.2.jar added multiple times to distributed cache.
19/05/02 07:23:03 WARN Client: Same path resource file:///root/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added multiple times to distributed cache.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Python version 2.7.16 (default, Mar 18 2019 18:38:44)
SparkSession available as 'spark'.
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
... ("id", "v"))
>>> df.write.format("avro").save("avro_file.avro")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 736, in save
self._jwrite.save(path)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o83.save.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:244)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 24 more
>>>
I have also tried updating the /etc/spark/conf/spark-defaults.conf to have
spark.jars.packages org.apache.spark:spark-avro_2.12:2.4.2, com.databricks:spark-csv_2.11:1.5.0
However, after this configuration the Jupyter notebook could not start Spark and gave the below error:
The code failed because of a fatal error:
Session 4 did not start up in 60 seconds..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
On Spark 2.4.3:
Going back a spark-avro version, to org.apache.spark:spark-avro_2.11:2.4.3, fixed this issue for me.
Also, in your Jupyter notebook, add the following lines before initiating the Spark context:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
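Putting the two pieces together, a minimal end-to-end sketch for a fresh notebook kernel (the version string comes from the answer above, while the app name and output path are placeholders; adjust the versions to your Spark build):
# Sketch: set PYSPARK_SUBMIT_ARGS before any SparkSession exists, then write Avro
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-demo").getOrCreate()
df = spark.createDataFrame([(1, 1.0), (2, 3.0)], ("id", "v"))
df.write.format("avro").save("avro_file_out")  # placeholder output path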

sparkR on YARN cluster

At http://ec2-54-186-47-36.us-west-2.compute.amazonaws.com:8080/ I can see that I have two worker nodes and one master node, and it shows the Spark cluster. By running the command jps on my 2 worker nodes and 1 master node I can see that all services are up.
I am using the following script to initialise the SparkR session.
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/home/ubuntu/spark")
}
But whenever I try to use RStudio to initialize the session, it fails and shows the following ERROR. Please advise me; I cannot get the real benefit of the cluster.
sparkR.session(master = "yarn", deployMode = "cluster",
               sparkConfig = list(spark.driver.memory = "2g"),
               sparkPackages = "com.databricks:spark-csv_2.11:1.1.0")

Launching java with spark-submit command /home/ubuntu/spark/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.1.0 --driver-memory "2g" "--packages" "com.databricks:spark-csv_2.11:1.1.0" "sparkr-shell" /tmp/RtmpkSWHWX/backend_port29310cbc7c6
Ivy Default Cache set to: /home/rstudio/.ivy2/cache
The jars for the packages stored in: /home/rstudio/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.11;1.1.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 441ms :: artifacts dl 24ms
:: modules in use:
com.databricks#spark-csv_2.11;1.1.0 from central in [default]
com.univocity#univocity-parsers;1.5.1 from central in [default]
org.apache.commons#commons-csv;1.1 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/18ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/09/24 23:15:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 23:15:42 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at org.apache.spark.api.r.RRDD$.createSparkContext(RRDD.scala:129)
at org.apache.spark.api.r.RRDD.createSparkContext(RRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:748)
17/09/24 23:15:42 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/09/24 23:15:42 WARN MetricsSystem: Stopping a MetricsSystem that is not running
17/09/24 23:15:42 ERROR RBackendHandler: createSparkContext on org.apache.spark.api.r.RRDD failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at org.apache.spark.api.r.RRDD$.createSparkContext(RRDD.scala:129)
at org.apache.spark.api.r.RRDD.createSparkContext(RRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Metho
Interactive Spark shells and sessions, such as from RStudio (for R) or from Jupyter notebooks, cannot be run in cluster mode; you should change to deployMode = "client".
Here is what happens when trying to run a SparkR shell with --deploy-mode cluster (the situation is practically the same with RStudio):
$ ./sparkR --master yarn --deploy-mode cluster
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
[...]
Error: Cluster deploy mode is not applicable to Spark shells.
See also this answer for the PySpark case.
This does not mean that you lose the distributed benefits of Spark (i.e. cluster computations) in such sessions; from the docs:
There are two deploy modes that can be used to launch Spark
applications on YARN. In cluster mode, the Spark driver runs inside an
application master process which is managed by YARN on the cluster,
and the client can go away after initiating the application. In client
mode, the driver runs in the client process, and the application
master is only used for requesting resources from YARN.
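For the PySpark case referenced above, the equivalent fix looks like the following sketch (app name is illustrative): keep the driver in the notebook/shell process by using client deploy mode.
# Sketch: interactive shells and notebooks can only use client deploy mode on YARN
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "client") \
    .appName("interactive-session") \
    .getOrCreate()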
