PredictionIO train failed in HDInsight Yarn Cluster - azure

I have tried to run pio train command in HDInsight Spark cluster using following command
pio train -- --deploy-mode cluster --master yarn
But there is following error have been provided
2018-11-05 11:40:05 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: No FileSystem for scheme: wasb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
at org.apache.spark.deploy.yarn.Client$$anonfun$5.apply(Client.scala:121)
at org.apache.spark.deploy.yarn.Client$$anonfun$5.apply(Client.scala:121)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:121)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-11-05 11:40:07 INFO ShutdownHookManager:54 - Shutdown hook called
I use following script to test connection and there is no issue, script successful connect and return available items from Azure Storage
hadoop fs -ls wasb://my_container_name#my_blob_account_name.blob.core.windows.net
Has anyone ideas with solution of the issue?

Had the same issues, where hadoop would support the wasb:// protocole, but not pio According to https://github.com/hning86/articles/blob/master/hadoopAndWasb.md
You have to use the hadoop-azure-2.7.1.jar and azure-storage-2.0.0.jar in your CLASSPATH
To solve this issue, you need to add the two jars to the CLASSPATH of pio itself.
With PredictionIO 0.13.1, according to /usr/local/pio/bin/compute-classpath.sh, this can be acheived by adding the jar to the subdirectory plugins
ls /usr/local/pio/plugins/azure-storage-2.0.0.jar
ls /usr/local/pio/plugins/hadoop-azure-2.7.1.jar

Related

How to specify --jars in spark-submit?

I want to add a couple of gcs connector jars to spark-submit. These will allow my spark application to read files from google storage.
I can do the following in python:
from pyspark import pandas as pd
spark = SparkSession.builder.config("spark.jars","/usr/local/share/google/dataproc/lib/gcs-connector.jar,/usr/l
ocal/share/google/dataproc/lib/gcs-connector-hadoop3-2.2.3.jar").getOrCreate()
pd.read_csv("gs://monsoon-credittech.appspot.com/mar19/training_data.csv",nrows=2)
But, I cannot achieve the same with spark-submit utility
spark-submit --master yarn --jars /usr/local/share/google/dataproc/lib/gcs-connector-hadoop3-2.2.3.jar, /usr/local/share/google/dataproc/lib/gcs-connector.jar testing_dep.py --num 2 --path gs://monsoon-credittech.appsp
ot.com/mar19/training_data.csv
output:
2022-01-08 16:34:43,605 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: No main class set in JAR; please specify one with --class.
at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:972)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:492)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Why is the spark-submit asking me to provide --class. My entry point is testing_dep.py. I do not have --class to be specified. How can I achieve adding those jars?

Exception in thread "main" java.lang.IllegalStateException: Cannot retrieve files with 'spark' scheme without an active SparkEnv

I'm very new to spark and cassandra, got one sample from github and tried to run the application from the below link
spark-on-cassandra-quickstart
After jar file generated, Tried executing with the below syntax
C:\Users\user\Desktop\softwares\spark-2.4.3-bin-hadoop2.7\spark-2.4.3-bin-hadoop2.7\bin>spark-submit --class com.github.boneill42.JavaDemo --master spark://localhost:7077
C:\Users\user\git\spark-on-cassandra-quickstart\target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar spark://localhost:7077 localhost
Below is the issue I'm facing
19/06/08 22:59:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.IllegalStateException: Cannot retrieve files with 'spark' scheme without an active SparkEnv.
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:690)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:367)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:366)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Please help me in resolving the issue
In your case, It seems you want to start in standalone mode
spark://HOST:PORT Connect to the given Spark standalone cluster master.
The port must be whichever one your master is configured to use, which is 7077 by default.
Do you start spark master and worker first ?
launch master
./sbin/start-master.sh
launch worker
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 -c 1 -m 512M
After start master and worker, then you can submit your job again.

Ignite Yarn and Hortonworks

I'm trying to deploy ignite so that I can use the shared RDD/Dataframe cache for my spark cluster. I've followed the spark install instructions and choose to deploy into my existing yarn cluster running spark. I'm using HDP to deploy spark.
I've already verified that Resource Manager and History server are listening on the ports below and I can telnet to each port. What am I doing wrong? Am I not deploying this the way it is intended?
I'm running:
yarn jar ignite-yarn-2.6.0.jar ./ignite-yarn-2.6.0.jar ../../../cluster.properties
Error below:
18/09/24 22:13:38 INFO client.RMProxy: Connecting to ResourceManager at dev01clus02.dna.local/172.31.31.5:8050
18/09/24 22:13:38 INFO client.AHSProxy: Connecting to Application History server at dev01clus02.dna.local/172.31.31.5:10200
Exception in thread "main" java.lang.RuntimeException: Failed update ignite.
at org.apache.ignite.yarn.IgniteProvider.updateIgnite(IgniteProvider.java:243)
at org.apache.ignite.yarn.IgniteProvider.getIgnite(IgniteProvider.java:93)
at org.apache.ignite.yarn.IgniteYarnClient.getIgnite(IgniteYarnClient.java:194)
at org.apache.ignite.yarn.IgniteYarnClient.main(IgniteYarnClient.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:675)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at org.apache.ignite.yarn.IgniteProvider.updateIgnite(IgniteProvider.java:220)
... 9 more
It looks like a piece of insfrastructure, provided by GridGain for Apache Ignite project, does not work currently. I'll raise the issue.
In the meantime, you can provide IGNITE_PATH property (in config, system properties or env) pointed to unzipped Apache Ignite 2.6 distribution directory to avoid downloading attempts altogether.

AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using AWS Cli. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step failed. This is the LOG file:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31- 31-190.us-west-2.compute.internal/172.31.31.190:8032
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on SandBox Hortonworks HDP 2.5) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read something related to my problem, but I can't figure it out.
UPDATE
Checked into Application Master, I get this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the path mentioned "s3://tracceale/params/configS3.txt" from S3 to the function 'fromFile' like this:
for(line <- scala.io.Source.fromFile(logFile).getLines())
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
What you have there is a URL to an object store, reachable from the Hadoop filesystem APIs, and a stack trace coming from java.io.File, which can't read it because it doesn't refer to anything in the local disk.
Use SparkContext.hadoopRDD() as the operation to convert the path into an RDD
There is a probability of file missing in the location, may be you can see it after ssh into EMR cluster but still the steps command wouldn't be able to figure out by itself and starts throwing that file not found exception.
In this scenario what I did is :
Step 1: Checked for the file existence in the project directory which we copied to EMR.
for example mine was in `//usr/local/project_folder/`
Step 2: Copy the script which you're expecting to run on the EMR.
for example I copied from `//usr/local/project_folder/script_name.sh` to `/home/hadoop/`
Step 3: Then executed the script from /home/hadoop/ by passing the absolute path to the command-runner.jar
command-runner.jar bash /home/hadoop/script_name.sh
Thus I found my script running. Hope this may be helpful to someone

running spark on yarn as client

I'm trying to run a spark job with yarn using:
./bin/spark-submit --class "KafkaToMaprfs" --master yarn --deploy-mode client /home/mapr/kafkaToMaprfs/target/scala-2.10/KafkaToMaprfs.jar
But facing this error:
/opt/mapr/hadoop/hadoop-2.7.0 17/01/03 11:19:26 WARN NativeCodeLoader:
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable 17/01/03 11:19:38 ERROR
SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended!
It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:530)
at KafkaToMaprfs$.main(KafkaToMaprfs.scala:61)
at KafkaToMaprfs.main(KafkaToMaprfs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:752)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 17/01/03 11:19:39 WARN MetricsSystem: Stopping a MetricsSystem that is
not running Exception in thread "main"
org.apache.spark.SparkException: Yarn application has already ended!
It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.(SparkContext.scala:530)
at KafkaToMaprfs$.main(KafkaToMaprfs.scala:61)
at KafkaToMaprfs.main(KafkaToMaprfs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:752)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have a multi node cluster, i'm deploying the application from a remote node.
I'm using spark 1.6.1 and hadoop 2.7.x versions.
I didn't set the cluster, so I couldn't find where the mistake lies.
Can anyone please help me fix this?
In my case i'm using MapR distribution.And i didn't configure the environment.
So, when i dug down to the all the conf folders.I made some changes in the below files,
1. In Spark-env.sh,Make sure these values are set right.
export SPARK_LOG_DIR=
export SPARK_PID_DIR=
export HADOOP_HOME=
export HADOOP_CONF_DIR=
export JAVA_HOME=
export SPARK_SUBMIT_OPTIONS=
2. yarn-env.sh.
Make sure the yarn_conf_dir, and java_home are set with correct values.
3. In spark-defaults.conf
1.spark.driver.extraClassPath
2.set value for HADOOP_CONF_DIR
4. HADOOP_CONF_DIR and JAVA_HOME in $SPARK_HOME/conf/spark-env.sh
1.export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop
2.export JAVA_HOME =
5.spark assembly jar
1.Copy the following JAR file from the local file system to a world-readable location on MapR-FS: Substitute your Spark version and
specific JAR file name in the command.
/opt/mapr/spark/spark-/lib/spark-assembly--hadoop-mapr-.jar
Now i'm able to run my spark application as YARN-CLIENT smoothly using spark-submit.
These are basic essentials to make spark connect with yarn.
Correct me if i missed any other things.

Resources