Cannot run PySpark (not using interactive shell) on Cloudera VM - apache-spark

When I follow this example and try to use the spark-submit command within the Cloudera VM environment, I constantly get the following error:
ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/user/spark/applicationHistory":spark:supergroup:drwxr-xr-x
....
Traceback (most recent call last):
File "/home/cloudera/wordcount.py", line 9, in <module>
sc = SparkContext(conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 172, in _do_init
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 235, in _initialize_context
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/user/spark/applicationHistory":spark:supergroup:drwxr-xr-x
I have tried these two commands:
1. $ spark-submit --master yarn --deploy-mode client --executor-memory 1g \
   --name wordcount --conf "spark.app.id=wordcount" \
   wordcount.py hdfs://namenode_host:8020/path/to/inputfile.txt
2. $ spark-submit --master yarn --deploy-mode client --executor-memory 1g \
   --name wordcount --conf "spark.app.id=wordcount" \
   wordcount.py inputfile.txt
Can somebody help?

Try running with the following environment variable:
HADOOP_USER_NAME=hdfs spark-submit <your command>
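The permission error comes from the driver trying to write the Spark event log under /user/spark/applicationHistory while running as user cloudera. If you prefer not to prefix the command, a minimal sketch of the same workaround applied from inside wordcount.py (assuming simple, non-Kerberos authentication; the variable must be set before the SparkContext starts the JVM):
import os

# Assumption: the cluster uses simple authentication, so Hadoop trusts this value.
# It must be set before the SparkContext launches the JVM gateway.
os.environ["HADOOP_USER_NAME"] = "hdfs"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount")
sc = SparkContext(conf=conf)
Alternatively, an HDFS administrator can grant the cloudera user write access to /user/spark/applicationHistory (for example with hdfs dfs -chown or hdfs dfs -chmod run as the hdfs superuser).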

Related

Spark-submit with Stocator failing with Class com.ibm.stocator.fs.ObjectStoreFileSystem not found error

I am trying to run the wordcount Python example with spark-submit on a Kubernetes cluster, pulling a text file stored in COS (IBM Cloud Object Storage).
For the config, I followed the Stocator README.md:
./bin/spark-submit \
--master k8s://https://c111.us-south.containers.cloud.ibm.com:32206 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi --packages com.ibm.stocator:stocator:1.1.3 \
--conf spark.executor.instances=5 \
--conf spark.hadoop.fs.cos.myobjectstorage.access.key= \
--conf spark.hadoop.fs.cos.myobjectstorage.secret.key= \
--conf spark.hadoop.fs.stocator.scheme.list=cos \
--conf spark.hadoop.fs.cos.impl=com.ibm.stocator.fs.ObjectStoreFileSystem \
--conf spark.hadoop.fs.stocator.cos.impl=com.ibm.stocator.fs.cos.COSAPIClient \
--conf spark.hadoop.fs.stocator.cos.scheme=cos \
--conf spark.jars.ivy=/tmp/.ivy \
--conf spark.kubernetes.container.image=us.icr.io/mods15/spark-py:v1 \
--conf spark.hadoop.fs.cos.myobjectstorage.endpoint=http://s3.us.cloud-object-storage.appdomain.cloud \
--conf spark.hadoop.fs.cos.myobjectstorage.v2.signer.type=false \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
local:///opt/spark/examples/src/main/python/wordcount.py cos://vmac-code-engine-bucket.myobjectstorage/book.txt
I can see the driver and executor pods spinning up, and after a couple of minutes the driver errors out with the log below.
Driver stacktrace:
21/01/12 11:52:55 INFO DAGScheduler: Job 0 failed: collect at /opt/spark/examples/src/main/python/wordcount.py:40, took 7.839348 s
Traceback (most recent call last):
File "/opt/spark/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 889, in collect
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.30.43.123, executor 4): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.ibm.stocator.fs.ObjectStoreFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:84)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat.$anonfun$readToUnsafeMem$1(TextFileFormat.scala:119)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
Caused by: java.lang.ClassNotFoundException: Class com.ibm.stocator.fs.ObjectStoreFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
... 26 more
Any idea how I can make this work? I want to pass the text file stored in COS to the wordcount Python example that comes with the Spark download (examples folder).
I am using spark-3.0.1-hadoop2.7, and for the container images I followed the documentation here.
The part that is failing here is:
local:///opt/spark/examples/src/main/python/wordcount.py cos://vmac-code-engine-bucket.myobjectstorage/book.txt
For some reason, wordcount.py is not able to pick up the book.txt file in COS.
Moving the COS file call inside the Python file, as mentioned in the link here, solved the issue:
from pyspark import SparkContext
sc = SparkContext("local", "count app")
sonnets = sc.textFile("cos://COS_BUCKET_NAME.COS_SERVICE_NAME/files")
counts = sonnets.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1,v2: v1 + v2)
counts.saveAsTextFile("cos://COS_BUCKET_NAME.COS_SERVICE_NAME/files/wordcount-result")
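For reference, a minimal sketch of the same job written against SparkSession (Spark 3.x style), with the Stocator settings mirrored from the submit command's --conf flags above; the bucket/service names and credentials are placeholders, and the service name in the cos:// URI must match the fs.cos.<service>.* keys:
from operator import add
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cos-wordcount")
         # These mirror the spark.hadoop.fs.* flags from the spark-submit command;
         # ACCESS_KEY, SECRET_KEY and the service name "myobjectstorage" are placeholders.
         .config("spark.hadoop.fs.stocator.scheme.list", "cos")
         .config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
         .config("spark.hadoop.fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
         .config("spark.hadoop.fs.stocator.cos.scheme", "cos")
         .config("spark.hadoop.fs.cos.myobjectstorage.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.cos.myobjectstorage.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.cos.myobjectstorage.endpoint", "http://s3.us.cloud-object-storage.appdomain.cloud")
         .getOrCreate())

lines = spark.read.text("cos://COS_BUCKET_NAME.myobjectstorage/book.txt").rdd.map(lambda r: r[0])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(add))
for word, count in counts.collect():
    print(word, count)
spark.stop()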

spark.yarn.jars - py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:

I am trying to run a Spark job using spark2-submit on the command line. The version of Spark installed on the cluster is Cloudera's Spark 2.1.0, and I am specifying my jars for version 2.4.0 using the conf spark.yarn.jars, as shown below:
spark2-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/virtualenv/path/bin/python \
--conf spark.yarn.jars=hdfs:///some/path/spark24/* \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.task.cpus=2 \
--executor-cores 2 \
--executor-memory 4g \
--driver-memory 4g \
--archives /virtualenv/path \
--files /etc/hive/conf/hive-site.xml \
--name my_app \
test.py
This is the code I have in test.py:
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("Spark Session created")
On running the submit command, I see messages like the one below:
yarn.Client: Source and destination file systems are the same. Not copying hdfs:///some/path/spark24/some.jar
And then I get this error on the line where the Spark session is created:
spark = SparkSession.builder.getOrCreate()
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 310, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 259, in _ensure_initialized
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/java_gateway.py", line 117, in launch_gateway
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 175, in java_import
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
Authentication error: unexpected command.
The py4j in the error is coming from the existing Spark installation and not from the versions in my jars. Were my Spark 2.4 jars not picked up? The same code runs fine if I remove the jars conf, but presumably against the existing Spark 2.1.0. Any clues on how to fix this?
Thanks.
The problem turned out to be that Python was running from the wrong place. I had to submit from the correct place, like this:
PYTHONPATH=./${virtualenv}/venv/lib/python3.6/site-packages/ spark2-submit
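Not from the original answer, but a quick diagnostic that helps here: print where the driver's pyspark and py4j modules are actually loaded from, so you can tell whether the Spark 2.4 libraries or the cluster's bundled Spark 2.1 ones are on the Python path:
import pyspark
import py4j

# If these paths point at /opt/cloudera/parcels/SPARK2-2.1.0..., the driver is
# still importing the cluster's bundled PySpark rather than the 2.4 libraries.
print("pyspark loaded from:", pyspark.__file__)
print("py4j loaded from:", py4j.__file__)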

Can't restart Spark2 Thrift Server, spark-shell, spark-sql after changing CapacityScheduler from default to DominantResourceCalculator on Ambari

I changed the Spark-on-YARN CapacityScheduler from the default to DominantResourceCalculator on Ambari and restarted YARN.
Then I found that the Spark2 Thrift Server had stopped. I tried to restart it from Ambari and with start-thriftserver.sh; both attempts failed.
Traceback (most recent call last):
File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/check_process_status.py", line 57, in check_process_status sudo.kill(pid, 0)
File "/usr/lib/ambari-agent/lib/resource_management/core/sudo.py", line 180, in kill
os.kill(pid, signal)
OSError: [Errno 3] No such process
The above exception was the cause of the following exception:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/SPARK2/package/scripts/spark_thrift_server.py", line 85, in <module>
SparkThriftServer().execute()
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 352, in execute
method(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/SPARK2/package/scripts/spark_thrift_server.py", line 53, in start
spark_service('sparkthriftserver', upgrade_type=upgrade_type, action='start')
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/SPARK2/package/scripts/spark_service.py", line 165, in spark_service
check_process_status(status_params.spark_thrift_server_pid_file)
File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/check_process_status.py", line 61, in check_process_status
raise ComponentIsNotRunning()
resource_management.core.exceptions.ComponentIsNotRunning
When I use spark-submit, spark-shell, or spark-sql to submit a job, it also fails:
spark-sql --master yarn --driver-memory 2g --executor-cores 2 --num-executors 5 --executor-memory 4g
The error message looks like:
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925)
at com.im30.idmapping.idmapping_etl.userlog2hive$.main(userlog2hive.scala:21)
at com.im30.idmapping.idmapping_etl.userlog2hive.main(userlog2hive.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

SparkFiles.get() is not able to fetch file uploaded using --files option of spark-submit

The spark-submit --files option says that the files can be accessed using SparkFiles.get('files.txt').
So I wrote a simple program:
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
print("testfile path : "+SparkFiles.get('testfile.txt'))
df=spark.read.text(SparkFiles.get('testfile.txt'))
df.show()
And then ran it using the command below:
spark-submit --master yarn --deploy-mode client --files testfile.txt testsubmit.py
From the logs I can see that "testfile.txt" has been copied to hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/testfile.txt
20/03/05 04:41:08 INFO Client: Source and destination file systems are the same. Not copying hdfs://dummyIP.com:8020/hdp/apps/2.6.5.0-292/spark2/spark2-hdp-yarn-archive.tar.gz
20/03/05 04:41:08 INFO Client: Uploading resource file:/root/sumit/test/testfile.txt -> hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/testfile.txt
20/03/05 04:41:09 INFO Client: Uploading resource file:/usr/hdp/current/spark2-client/python/lib/pyspark.zip -> hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/pyspark.zip
But SparkFiles.get('testfile.txt') is trying to fetch 'testfile.txt' from hdfs://dummyIP.com:8020/tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt
testfile path : /tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt
Traceback (most recent call last):
File "/root/sumit/test/testsubmit.py", line 7, in <module>
df=spark.read.text(SparkFiles.get('testfile.txt'))
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 328, in text
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://dummyIP.com:8020/tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt;'
Also note that the text() function will be executed on executor nodes, as mentioned in the Spark documentation. So it seems like SparkFiles.get('files.txt') is not reading from the same location to which --files is uploading the file.
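Not part of the original post, but for illustration: SparkFiles.get() resolves to a node-local path, so the usual pattern is to open the distributed copy inside executor-side code (or on the driver with plain Python I/O) rather than passing the path to spark.read, which expects a path visible to every executor on a shared filesystem such as HDFS. A minimal sketch of that pattern, assuming the same --files testfile.txt submit:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc = spark.sparkContext

def read_local_copy(_):
    # Each executor resolves its own local copy of the file shipped with --files.
    with open(SparkFiles.get("testfile.txt")) as f:
        return f.read().splitlines()

# A single dummy element is enough to run the read on one executor.
lines = sc.parallelize([0], 1).flatMap(read_local_copy)
print(lines.take(5))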

hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
spark-submit --master yarn-cluster test_pokes.py
However, I encounter the error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 683, in _ssql_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 692, in _get_hive_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
...
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
...
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
I have seen a number of similar posts for other Hadoop distributions, but not for BigInsights on Cloud.
The solution to this error was to add the jars:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
However, I then get a different error:
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
I've added the other question here: Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
The final solution is captured here: https://stackoverflow.com/a/41272260/1033422
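For newer Spark versions, the same check can be written against SparkSession with Hive support enabled; a minimal sketch (not from the original post), which still relies on the Datanucleus jars and hive-site.xml being available to the job:
from pyspark.sql import SparkSession

# enableHiveSupport() replaces the old HiveContext; the job still needs the
# Hive client jars and a hive-site.xml pointing at the real metastore.
spark = (SparkSession.builder
         .appName("test_pokes")
         .enableHiveSupport()
         .getOrCreate())

pokes = spark.sql("select * from pokes")
print(pokes.collect())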
