How to run spark yarn jobs on a Hadoop cluster that is external to the K8S cluster - apache-spark

I am running an on-prem K8s cluster with JupyterHub deployed on it.
I can successfully submit a job to a YARN queue, but the job fails because the user's notebook pod IP is not resolvable, so YARN can't talk back to the Spark driver running on that pod, and I get an error like:
Caused by: java.io.IOException: Failed to connect to podIP:33630
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:287)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: pod-ip
I believe I'm missing something in my setup that will allow YARN to talk back to the spawned notebook pods on the Kubernetes cluster.
Any help or hints are greatly appreciated.
For now I am passing the Spark driver the internal Kubernetes pod IP of JupyterHub by setting:
"spark.driver.host" to str(socket.gethostbyname(socket.gethostname())). Could I change this to something else in the notebook I am running? I am not sure what to change it to.
Thanks!
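A common approach (a sketch with assumed hostnames and ports, not a verified fix for this exact setup) is to bind the driver to all interfaces inside the pod, set "spark.driver.host" to an address the YARN nodes can actually resolve (for example a node IP, or a DNS name backed by a NodePort or LoadBalancer Service), and pin the driver ports so that Service can expose them:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# EXTERNAL_HOST and the port numbers are assumptions for illustration: the
# host must resolve and route from the YARN NodeManagers, and the pinned
# ports must be exposed by whatever Service fronts the notebook pod.
EXTERNAL_HOST = "jupyter.example.com"

conf = (
    SparkConf()
    .setMaster("yarn")                            # needs HADOOP_CONF_DIR/YARN_CONF_DIR set
    .setAppName("notebook-job")
    .set("spark.driver.bindAddress", "0.0.0.0")   # listen on all pod interfaces
    .set("spark.driver.host", EXTERNAL_HOST)      # what YARN connects back to
    .set("spark.driver.port", "29413")            # pinned instead of random
    .set("spark.blockManager.port", "29414")      # pinned for the same reason
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

With that split, spark.driver.bindAddress controls where the driver listens inside the pod, while spark.driver.host is only the address advertised to the executors.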

Related

Loading data to Scylla DB on GKE with sstableloader, getting "Error creating netty channel to /10.110.117.172:9042"

In ScyllaDB, while using sstableloader to load data, we are facing the following error:
root@scylla-chronicle-s-1:~# sstableloader -d 10.110.68.9 /var/lib/scylla/data/connections/by_date-4ce6c340f5dd11e980df000000000002/snapshots/1658173631835
Using /etc/scylla/scylla.yaml as the config file
===== Using optimized driver!!! =====
WARN 19:51:22,937 Error creating netty channel to /10.110.117.172:9042
com.datastax.shaded.netty.channel.ConnectTimeoutException: connection timed out: /10.110.117.172:9042
at com.datastax.shaded.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:218) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) [scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) [scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399) [scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464) [scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
ERROR 19:51:22,942 Unexpected error while executing task
java.lang.NullPointerException: null
at com.datastax.driver.core.HostConnectionPool.closeAsync(HostConnectionPool.java:838) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.driver.core.SessionManager.removePool(SessionManager.java:437) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.driver.core.SessionManager.onDown(SessionManager.java:525) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.driver.core.Cluster$Manager.onDown(Cluster.java:2033) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.driver.core.Cluster$Manager.access$1200(Cluster.java:1393) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.driver.core.Cluster$Manager$5.runMayThrow(Cluster.java:1988) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at com.datastax.driver.core.ExceptionCatchingRunnable.run(ExceptionCatchingRunnable.java:32) ~[scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_292]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_292]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_292]
at com.datastax.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [scylla-driver-core-3.7.1-scylla-2-shaded.jar:na]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
WARN 19:51:22,948 Error creating pool to /10.110.117.172:9042
The error you posted indicates that the machine where you are running sstableloader doesn't have network connectivity to the IP.
You need to make sure you are connecting to an IP that is reachable by clients. If the cluster is running on GKE, there's a good chance you need to set up host networking so the cluster is accessible from outside GKE.
Usually, the pods are accessible via a Kubernetes service configured on GKE. See Accessing Scylla on a Kubernetes cluster for details. Cheers!
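Before re-running sstableloader, it can help to confirm the connectivity problem directly; a minimal reachability check (Python standard library only; the IP and port are the ones from the error above) might look like:

import socket

def can_reach(host, port, timeout=5.0):
    # Returns True if a TCP connection to host:port succeeds within timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 10.110.117.172:9042 is the address from the error; swap in the
# client-facing Service address once it is set up.
print(can_reach("10.110.117.172", 9042))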

pySpark job failing on yarn

I am trying to submit a PySpark job from a YARN client and am getting the error below from the ResourceManager, without any further logs.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:231)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:773)
    at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
    at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:266)
    at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1008)
    at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1004)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1011)
    at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:483)
    at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:481)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:481)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:419)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
For more detailed output, check the application tracking page: https://.com:8090/cluster/app/application_1638972290118_64750 Then click on links to logs of each attempt. Failing the application.
cluster is fine and other pyspark jobs running fine.
Please help
Thanks in advance
What do you mean by "cluster is fine and other pyspark jobs running fine"?
Did you run them on YARN or just in standalone mode?
However, I think it's better to check your YARN cluster first to see if it works (without Spark).
You can do that using the Hadoop MapReduce examples:
yarn jar $HadoopDir/share/hadoop/mapreduce/hadoop-mapreduce-examples-$version.jar wordcount inputFilePath OutputDir
Check link 1 and link 2 too. They may help.

Spark executor driver unknown host issue

I currently have an issue when using Spark. Here are the details.
I set up a Spark cluster on VMs with three nodes: one master and two workers.
The Spark version is 3.0.0.
I created an application and deployed it into a K8S pod.
Then I triggered the application in my K8S pod.
The code where I create my Spark context is like this.
Code:
val conf = new SparkConf().setMaster(sparkMaster).setAppName(appName)
.set("spark.cassandra.connection.host", hosts)
.set("spark.cassandra.connection.port", port.toString())
.set("spark.cores.max", sparkCore)
.set("spark.executor.memory", executorMemory)
.set("spark.network.timeout", "50000")
Then I got this error:
Spark Executor Command:
"/usr/lib/jvm/java-11-openjdk-11.0.7.10-4.el7_8.x86_64/bin/java" "-cp"
"/opt/spark3/conf/:/opt/spark3/jars/*:/etc/hadoop/conf" "-Xmx1024M"
"-Dspark.network.timeout=50000" "-Dspark.driver.port=34471"
"-Dspark.cassandra.connection.port=9042"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"--driver-url"
"spark://CoarseGrainedScheduler#my-spark-application-gjwd7:34471"
"--executor-id" "281" "--hostname" "100.84.162.9" "--cores" "1"
"--app-id" "app-20200722115716-0000" "--worker-url"
"spark://Worker#100.84.162.9:33670"
The error is like this:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Failed to connect to my-spark-application-gjwd7:34471
Caused by: java.net.UnknownHostException: my-spark-application-gjwd7
I think this issue is caused by the worker nodes not being able to call back to the driver, since my driver is in a K8S pod.
Do you know how I can resolve this issue? I have two concerns here:
1. How can I set the driver host and port in my code for the Spark context? The port is random according to Spark.
2. How can a Spark executor call back to a K8S pod where the driver is running?
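On both points, a sketch (shown in PySpark for brevity; the same keys apply to the Scala SparkConf above, and the hostnames and ports are assumptions): Spark does not force a random port, so you can pin spark.driver.port, advertise a reachable address via spark.driver.host, and expose the pinned ports from the pod, for example with a NodePort Service.

from pyspark import SparkConf

# Assumed values for illustration: the advertised host must be resolvable
# and routable from the standalone workers on the VMs (e.g. a K8s node IP
# or a DNS name backed by a Service exposing the pinned ports).
SPARK_MASTER = "spark://master-vm:7077"
DRIVER_ADVERTISE_HOST = "driver.example.com"

conf = (
    SparkConf().setMaster(SPARK_MASTER).setAppName("my-app")
    .set("spark.driver.bindAddress", "0.0.0.0")       # bind inside the pod
    .set("spark.driver.host", DRIVER_ADVERTISE_HOST)  # point 1: fixed host
    .set("spark.driver.port", "34471")                # point 1: fixed port
    .set("spark.blockManager.port", "34472")          # expose this one too
)

Point 2 then reduces to plain networking: once the workers connect to the advertised host on the pinned ports, the Service forwards that traffic into the pod where the driver is listening.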

Ignite Yarn and Hortonworks

I'm trying to deploy Ignite so that I can use the shared RDD/DataFrame cache for my Spark cluster. I've followed the Spark install instructions and chose to deploy into my existing YARN cluster running Spark. I'm using HDP to deploy Spark.
I've already verified that Resource Manager and History server are listening on the ports below and I can telnet to each port. What am I doing wrong? Am I not deploying this the way it is intended?
I'm running:
yarn jar ignite-yarn-2.6.0.jar ./ignite-yarn-2.6.0.jar ../../../cluster.properties
Error below:
18/09/24 22:13:38 INFO client.RMProxy: Connecting to ResourceManager at dev01clus02.dna.local/172.31.31.5:8050
18/09/24 22:13:38 INFO client.AHSProxy: Connecting to Application History server at dev01clus02.dna.local/172.31.31.5:10200
Exception in thread "main" java.lang.RuntimeException: Failed update ignite.
at org.apache.ignite.yarn.IgniteProvider.updateIgnite(IgniteProvider.java:243)
at org.apache.ignite.yarn.IgniteProvider.getIgnite(IgniteProvider.java:93)
at org.apache.ignite.yarn.IgniteYarnClient.getIgnite(IgniteYarnClient.java:194)
at org.apache.ignite.yarn.IgniteYarnClient.main(IgniteYarnClient.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:675)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at org.apache.ignite.yarn.IgniteProvider.updateIgnite(IgniteProvider.java:220)
... 9 more
It looks like a piece of infrastructure, provided by GridGain for the Apache Ignite project, does not work currently. I'll raise the issue.
In the meantime, you can provide the IGNITE_PATH property (in config, system properties, or env) pointing to an unzipped Apache Ignite 2.6 distribution directory to avoid the download attempts altogether.

AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using the AWS CLI. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step failed. This is the log file:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31-31-190.us-west-2.compute.internal/172.31.31.190:8032
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on SandBox Hortonworks HDP 2.5) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read something related to my problem, but I can't figure it out.
UPDATE
Checking in the Application Master logs, I get this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the path mentioned "s3://tracceale/params/configS3.txt" from S3 to the function 'fromFile' like this:
for(line <- scala.io.Source.fromFile(logFile).getLines())
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
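If you want to script that lookup, a small boto3 sketch (assuming boto3 and AWS credentials are available; the cluster ID is left as a placeholder) lists the container logs under that prefix:

import boto3

# <CLUSTER ID> is a placeholder -- substitute your own. The application ID
# comes from the logs quoted above.
bucket = "aws-logs-813591802533-us-west-2"
prefix = ("elasticmapreduce/<CLUSTER ID>/containers/"
          "application_1487760984275_0001/")

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"])  # look for container_..._01_000001, the Application Master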
What you have there is a URL to an object store, reachable from the Hadoop filesystem APIs, and a stack trace coming from java.io.File, which can't read it because it doesn't refer to anything on the local disk.
Use SparkContext.hadoopRDD() as the operation to convert the path into an RDD
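For a small config file like this one, the usual convenience wrapper over those Hadoop input APIs is SparkContext.textFile, which accepts object-store URLs directly; a PySpark sketch (the asker's job is Scala, where the same call exists):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-s3-config").getOrCreate()
sc = spark.sparkContext

# textFile goes through the Hadoop FileSystem layer, so s3:// (EMRFS on EMR)
# and s3a:// URLs work, unlike java.io.File / scala.io.Source.fromFile.
for line in sc.textFile("s3://tracceale/params/configS3.txt").collect():
    print(line)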
The file may also simply be missing from that location; you might see it after SSHing into the EMR cluster, but the step command still can't find it by itself and throws that file-not-found exception.
In this scenario, what I did was:
Step 1: Checked that the file exists in the project directory we copied to EMR.
For example, mine was in `/usr/local/project_folder/`
Step 2: Copied the script you're expecting to run on the EMR cluster.
For example, I copied `/usr/local/project_folder/script_name.sh` to `/home/hadoop/`
Step 3: Executed the script from /home/hadoop/ by passing the absolute path to command-runner.jar:
command-runner.jar bash /home/hadoop/script_name.sh
That got my script running. Hope this is helpful to someone.
