Spark-submit with Stocator failing with Class com.ibm.stocator.fs.ObjectStoreFileSystem not found error - apache-spark

I am trying to run the Python wordcount example via spark-submit on a Kubernetes cluster, pulling a text file stored in COS.
For the config, I followed the Stocator README.md:
./bin/spark-submit \
  --master k8s://https://c111.us-south.containers.cloud.ibm.com:32206 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --packages com.ibm.stocator:stocator:1.1.3 \
  --conf spark.executor.instances=5 \
  --conf spark.hadoop.fs.cos.myobjectstorage.access.key= \
  --conf spark.hadoop.fs.cos.myobjectstorage.secret.key= \
  --conf spark.hadoop.fs.stocator.scheme.list=cos \
  --conf spark.hadoop.fs.cos.impl=com.ibm.stocator.fs.ObjectStoreFileSystem \
  --conf spark.hadoop.fs.stocator.cos.impl=com.ibm.stocator.fs.cos.COSAPIClient \
  --conf spark.hadoop.fs.stocator.cos.scheme=cos \
  --conf spark.jars.ivy=/tmp/.ivy \
  --conf spark.kubernetes.container.image=us.icr.io/mods15/spark-py:v1 \
  --conf spark.hadoop.fs.cos.myobjectstorage.endpoint=http://s3.us.cloud-object-storage.appdomain.cloud \
  --conf spark.hadoop.fs.cos.myobjectstorage.v2.signer.type=false \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/src/main/python/wordcount.py \
  cos://vmac-code-engine-bucket.myobjectstorage/book.txt
I can see the driver and executor pods spinning up, and after a couple of minutes the driver errors out with the log below.
Driver stacktrace:
21/01/12 11:52:55 INFO DAGScheduler: Job 0 failed: collect at /opt/spark/examples/src/main/python/wordcount.py:40, took 7.839348 s
Traceback (most recent call last):
File "/opt/spark/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 889, in collect
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.30.43.123, executor 4): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.ibm.stocator.fs.ObjectStoreFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:84)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.$anonfun$readToUnsafeMem$1(TextFileFormat.scala:119)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
Caused by: java.lang.ClassNotFoundException: Class com.ibm.stocator.fs.ObjectStoreFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
... 26 more
Any idea on how I can make this work? I want to pass the text file stored in COS to the wordcount Python example that comes with the Spark download (examples folder).
I am using Spark 3.0.1 built for Hadoop 2.7, and for the container images I followed the documentation here.

The part that is failing here is:
local:///opt/spark/examples/src/main/python/wordcount.py cos://vmac-code-engine-bucket.myobjectstorage/book.txt
For some reason, wordcount.py is not able to pick up the book.txt file from COS.
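For context, the bundled wordcount.py is roughly the following (a from-memory sketch, not the exact file shipped with Spark 3.0.1). It simply reads whatever path it is given as its first argument, so the cos:// URI is resolved through Hadoop's FileSystem machinery when the job runs on the executors, which is where the stack trace above shows the Stocator class lookup failing.
import sys
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()

    # The input path (here the cos:// URI) is only opened when the job runs,
    # i.e. on the executors, per the "Lost task ... executor 4" failure above.
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(" ")) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    for (word, count) in counts.collect():
        print("%s: %i" % (word, count))

    spark.stop()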
Moving the COS file call inside the Python file, as mentioned in the link here, solved the issue:
from pyspark import SparkContext

sc = SparkContext("local", "count app")
# Read the object directly from COS via the Stocator cos:// scheme
sonnets = sc.textFile("cos://COS_BUCKET_NAME.COS_SERVICE_NAME/files")
# Split each line into words, pair each word with 1, and sum the counts per word
counts = sonnets.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
# Write the result back to the same bucket
counts.saveAsTextFile("cos://COS_BUCKET_NAME.COS_SERVICE_NAME/files/wordcount-result")
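For completeness, the same spark.hadoop.* keys passed on the spark-submit command line can also be set on the Hadoop configuration from inside the script before the read. Below is a minimal sketch of that, with placeholder credentials; sc._jsc.hadoopConfiguration() is the usual (though internal) PySpark handle for this, and setting these keys does not by itself put the Stocator jar on the executor classpath.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-cos")

# Mirror the Stocator-related spark.hadoop.* settings from the spark-submit
# command line. ACCESS_KEY and SECRET_KEY are placeholders.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.stocator.scheme.list", "cos")
hconf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
hconf.set("fs.stocator.cos.scheme", "cos")
hconf.set("fs.cos.myobjectstorage.access.key", "ACCESS_KEY")
hconf.set("fs.cos.myobjectstorage.secret.key", "SECRET_KEY")
hconf.set("fs.cos.myobjectstorage.endpoint", "http://s3.us.cloud-object-storage.appdomain.cloud")
hconf.set("fs.cos.myobjectstorage.v2.signer.type", "false")

# Read the same object as in the original spark-submit invocation
lines = sc.textFile("cos://vmac-code-engine-bucket.myobjectstorage/book.txt")
print(lines.count())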

Related

spark.yarn.jars - py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:

I am trying to run a Spark job using the spark2-submit command. The version of Spark installed on the cluster is Cloudera's Spark 2.1.0, and I am specifying my jars for version 2.4.0 using the conf spark.yarn.jars as shown below:
spark2-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/virtualenv/path/bin/python \
  --conf spark.yarn.jars=hdfs:///some/path/spark24/* \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.task.cpus=2 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 4g \
  --archives /virtualenv/path \
  --files /etc/hive/conf/hive-site.xml \
  --name my_app \
  test.py
This is the code I have in test.py -
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("Spark Session created")
When I run the submit command, I see messages like the one below:
yarn.Client: Source and destination file systems are the same. Not copying hdfs:///some/path/spark24/some.jar
And then I get this error on the line where the Spark session is created:
spark = SparkSession.builder.getOrCreate()
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 310, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 259, in _ensure_initialized
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/java_gateway.py", line 117, in launch_gateway
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 175, in java_import
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
Authentication error: unexpected command.
The py4j in the error is coming from the existing Spark installation, not from the versions in my jars. Were my spark24 jars not picked up? The same code runs OK if I remove the conf for the jars, but presumably against the existing Spark 2.1.0. Any clues on how to fix this?
Thanks.
The problem turned out to be that Python was running from the wrong place. I had to submit from the correct place, like this:
PYTHONPATH=./${virtualenv}/venv/lib/python3.6/site-packages/ spark2-submit
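To confirm which Spark and py4j libraries the driver Python process is actually importing, a quick check like the following can be added to test.py (a diagnostic sketch, not part of the original fix):
import sys

import py4j
import pyspark

print(sys.executable)       # which Python interpreter is running the driver
print(pyspark.__file__)     # virtualenv site-packages vs. the cluster's Spark 2.1.0 parcel
print(pyspark.__version__)  # the pyspark version actually imported
print(py4j.__file__)        # which py4j backs the gateway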

Spark job fails due to stackoverflow error

My Spark job uses MLlib to train a LogisticRegression on some data, but it fails with a StackOverflowError. Here is the error message shown in the spark-shell:
java.lang.StackOverflowError
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:48)
...
When I check the Spark UI, there is no failed stage or job! This is how I run my spark-shell:
spark-shell --num-executors 100 --driver-memory 20g --conf spark.driver.maxResultSize=5g --executor-memory 8g --executor-cores 3
I even tried to increase the stack size by adding the following option when running spark-shell, but it didn't help:
--conf "spark.driver.extraJavaOptions='-XX:ThreadStackSize=81920'"
What is the issue?

Spark Hive reporting ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
test_pokes.py
However, I encounter the error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0485/container_e09_1477084339086_0485_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
...
File /container_e09_1477084339086_0485_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at
...
...
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at
...
... 27 more
Caused by: MetaException(message:Failed to instantiate listener named: com.ibm.biginsights.bigsql.sync.BIEventListener, reason: java.lang.ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getMetaStoreListeners(MetaStoreUtils.java:1478)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:481)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
... 32 more
See also previous errors related to this issue:
hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
The solution was to use the hive-site.xml from the spark-client folder:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/current/spark-client/conf/hive-site.xml \
test_pokes.py
This is captured in the docs: http://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_spark.html

Java Out-of-Memory Error in pyspark

My problem is fairly simple: the JVM runs out of memory when I run RandomForest.trainRegressor in pyspark. I'm using about 3 GB of training data with 77 features. If I set numTrees to 15 or more, training fails with an out-of-memory error like the one below:
error info:
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},numTrees=15, featureSubsetStrategy="sqrt",impurity='variance', maxDepth=20, maxBins=14) #variance
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 412, in trainRegressor
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, in _train
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.trainRandomForestModel.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
at org.apache.spark.mllib.tree.RandomForest$.trainRegressor(RandomForest.scala:380)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:744)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
I'm using Spark 1.5 and my parameters are:
spark-submit --master yarn-client --conf spark.cassandra.connection.host=x.x.x.x \
--jars /home/retail/packages/spark_cassandra_tool-1.0.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hive/lib/HiveAuthHook.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-server.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/lib/spark-examples.jar \
--num-executors 30 --executor-cores 4 \
--executor-memory 20000M --driver-memory 200g \
--conf spark.yarn.executor.memoryOverhead=5000 \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.kryoserializer.buffer=40m \
--conf spark.driver.extraJavaOptions=\"-Xms2048m -Xmx2048m -XX:+DisableExplicitGC -Dcom.sun.management.jmxremote -XX:PermSize=512m -XX:MaxPermSize=2048m -XX:MaxDirectMemorySize=5g\" \
--conf spark.driver.maxResultSize=10g \
--conf spark.port.maxRetries=100
In my opinion, the trees should be trained in series, so why does increasing their number create a higher memory load?
If I want to train 300 trees successfully, how should I set the Spark parameters to use the RandomForest.trainRegressor function properly?

Cannot run PySpark (not using interactive shell) on Cloudera VM

When I follow this example and try to use the spark-submit command within the Cloudera VM environment, I constantly get the following error:
ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/user/spark/applicationHistory":spark:supergroup:drwxr-xr-x
....
Traceback (most recent call last):
File "/home/cloudera/wordcount.py", line 9, in <module>
sc = SparkContext(conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 172, in _do_init
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 235, in _initialize_context
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/user/spark/applicationHistory":spark:supergroup:drwxr-xr-x
I have tried these two commands:
1. $ spark-submit --master yarn --deploy-mode client --executor-memory 1g \
   --name wordcount --conf "spark.app.id=wordcount" wordcount.py hdfs://namenode_host:8020/path/to/inputfile.txt
2. $ spark-submit --master yarn --deploy-mode client --executor-memory 1g \
   --name wordcount --conf "spark.app.id=wordcount" wordcount.py inputfile.txt
Can somebody help?
Try running with the following environment variable:
HADOOP_USER_NAME=hdfs spark-submit <your command>
