I am attempting to run a job on a Spark cluster set up on Mesos. I can run a job if I copy the jar to the server and use a file: URL, but I cannot get Spark to download a jar over https:. Every time I try, I get the error below in the stderr file.
I0226 00:11:05.618361 22652 logging.cpp:172] INFO level logging started!
I0226 00:11:05.618552 22652 fetcher.cpp:409] Fetcher Info: ...
I0226 00:11:05.619721 22652 fetcher.cpp:364] Fetching URI 'https://jenkins.company.com/nexus/...
I0226 00:11:05.619738 22652 fetcher.cpp:238] Fetching directly into the sandbox directory
I0226 00:11:05.619751 22652 fetcher.cpp:176] Fetching URI 'https://jenkins.company.com/nexus/...
I0226 00:11:05.619762 22652 fetcher.cpp:126] Downloading resource from 'https://jenkins.company.com/nexus/...
Failed to fetch 'https://jenkins.company.com/nexus/... ': Error downloading resource: SSL connect error
Failed to synchronize with slave (it's probably exited)
I am able to use wget to download the jar from the specified URL. I have also verified that the JDK on the server has the correct certificate for the Nexus server from which I am attempting to download the jar.
I am new to Spark and Mesos and any help resolving this issue would be greatly appreciated.
Did you specify your private Nexus repository with the --repositories flag when submitting the application?
I personally never use encryption together with Spark, but from the docs it seems to be possible/necessary to configure it within Spark as well. I guess configuring the certificate just for the JDK is not enough.
See
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
http://spark.apache.org/docs/latest/configuration.html#encryption
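For reference, a minimal sketch of a submit command along those lines; the main class, master URL, and repository/jar paths are made-up placeholders modeled on the question:

spark-submit \
  --class com.example.MyJob \
  --master mesos://zk://mesos-master:2181/mesos \
  --repositories https://jenkins.company.com/nexus/content/repositories/releases \
  https://jenkins.company.com/nexus/content/repositories/releases/com/example/my-job.jar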
When running an application in client mode, the driver logs are filled with the INFO message below. Any idea how to resolve this? Are there any Spark configs that need to be updated or are missing?
[INFO ][dispatcher-event-loop-29][SparkRackResolver:54] Got an error when resolving hostNames. Falling back to /default-rack for all
The job runs fine, and this message does not appear in the executor logs.
Check this bug:
https://issues.apache.org/jira/browse/SPARK-28005
If you want to suppress this in the logs, you can try adding this to your log4j.properties:
log4j.logger.org.apache.spark.deploy.yarn.SparkRackResolver=ERROR
This can happen when using spark-submit with --master yarn in client deploy mode (i.e. the driver runs locally, not with --deploy-mode cluster) and the path to the topology.py script in your core-site.xml is not correct.
The path to core-site.xml can be set via the HADOOP_CONF_DIR (or YARN_CONF_DIR) environment variable.
Check the path set as the value of the net.topology.script.file.name parameter in core-site.xml.
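For reference, a hedged sketch of the relevant entry in core-site.xml; the script path here is the one from the warning below, yours will differ:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>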
If the path is incorrect, running the driver locally will lead to an error executing the script, with the following warning:
23/01/15 18:39:43 WARN ScriptBasedMapping: Exception running /home/alexander/xxx/.conf/topology.py 10.15.21.199
java.io.IOException: Cannot run program "/etc/hadoop/conf.cloudera.yarn/topology.py" (in directory "/home/john"): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
...
23/01/15 18:39:43 INFO SparkRackResolver: Got an error when resolving hostNames. Falling back to /default-rack for all
I'm trying to submit a Spark job to YARN (without HDFS) in HA mode.
For submitting I'm using org.apache.spark.deploy.SparkSubmit.
When I send the request from the machine with the active Resource Manager, it works well. But if I try to send it from the machine with the standby Resource Manager, the job fails with this error:
DEBUG org.apache.hadoop.ipc.Client - Connecting to spark2-node-dev/10.10.10.167:8032
DEBUG org.apache.hadoop.ipc.Client - Connecting to /0.0.0.0:8032
org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep
However, when I send the request via the command line (spark-submit), it works well from both the active and the standby machines.
What could be causing the problem?
P.S. I use the same parameters for both ways of submitting the job (org.apache.spark.deploy.SparkSubmit and the spark-submit command line), and the yarn.resourcemanager.hostname.rm_id properties are defined for all RM hosts.
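For reference, a hedged sketch of the HA entries the submitter needs to see in yarn-site.xml; the rm-ids and hostnames are examples (spark2-node-dev is taken from the log above, spark1-node-dev is assumed):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>spark1-node-dev</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>spark2-node-dev</value>
</property>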
The problem was the absence of yarn-site.xml from the classpath of the jar that calls SparkSubmit. The submitting jar does not take the YARN_CONF_DIR or HADOOP_CONF_DIR environment variables into account, so it cannot see yarn-site.xml.
One solution I found was to put yarn-site.xml on the jar's classpath.
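A minimal sketch of what that can look like when launching the submitter with java directly; the launcher class, jar names, and conf directory are hypothetical, and the point is only that the directory containing yarn-site.xml is on the classpath:

java -cp "/etc/hadoop/conf:my-submitter.jar:$SPARK_HOME/lib/*" com.example.YarnJobLauncher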
I was using Spark 1.6.0 to access data on Kerberos-enabled HDFS via the API DataFrame.read.parquet($path).
My application is deployed as Spark on YARN in client mode.
By default, the Kerberos ticket expires every 24 hours. Everything works fine in the first 24 hours, but reading files fails after 24 hours (or more, e.g. 27 hours).
I have tried several ways to log in and renew the ticket; none of them work:
Set spark.yarn.keytab and spark.yarn.principal in spark-defaults.conf
Set --keytab and --principal in the spark-submit command line (see the sketch after this list)
Start a timer in code to call UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab() every 2 hours.
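For what it's worth, the second attempt looks roughly like this; the application class, jar, and keytab path are placeholders, and the principal is modeled on the one in the error below (written with @ as a normal Kerberos principal):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --principal adam/cluster1@DEV.COM \
  --keytab /path/to/adam.keytab \
  --class com.example.MyApp \
  my-app.jar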
Error details are:
WARN [org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:671)] - Couldn't setup connection for adam/cluster1#DEV.COM to cdh01/192.168.1.51:8032
DEBUG [org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1632)] - PrivilegedActionException as:adam/cluster1#DEV.COM (auth:KERBEROS) cause:java.io.IOException: Couldn't setup connection for adam/cluster1#DEV.COM to cdh01/192.168.1.51:8032
ERROR [org.apache.spark.Logging$class.logError(Logging.scala:95)] - Failed to contact YARN for application application_1490607689611_0002.
java.io.IOException: Failed on local exception: java.io.IOException: Couldn't setup connection for adam/cluster1#DEV.COM to cdh01/192.168.1.51:8032; Host Details : local host is: "cdh05/192.168.1.41"; destination host is: "cdh01":8032;
The problem was solved.
It was caused by the wrong version of the Hadoop libraries.
The Spark 1.6 assembly jar refers to an old version of the Hadoop libraries, so I downloaded Spark again without the built-in Hadoop libraries and pointed it at a third-party Hadoop 2.8 library instead.
Then it just works.
I use Spark 1.6.1, Hadoop 2.6.4 and Mesos 0.28 on Debian 8.
While trying to submit a job via spark-submit to a Mesos cluster, a slave fails with the following in its stderr log:
I0427 22:35:39.626055 48258 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ad642fcf-9951-42ad-8f86-cc4f5a5cb408-S0\/hduser","items":[{"action":"BYP$
I0427 22:35:39.628031 48258 fetcher.cpp:379] Fetching URI 'hdfs://xxxxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
I0427 22:35:39.628057 48258 fetcher.cpp:250] Fetching directly into the sandbox directory
I0427 22:35:39.628078 48258 fetcher.cpp:187] Fetching URI 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
E0427 22:35:39.629243 48258 shell.hpp:93] Command 'hadoop version 2>&1' failed; this is the output:
sh: 1: hadoop: not found
Failed to fetch 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was e$
Failed to synchronize with slave (it's probably exited)
My jar file contains the Hadoop 2.6 binaries.
The path to the Spark executor/binary is an hdfs:// link.
My jobs don't appear in the framework tab, but they do appear in the driver list with the status 'queued', and they just sit there until I shut down the spark-mesos-dispatcher.sh service.
I was seeing a very similar error, and I figured out that my problem was that hadoop_home wasn't set on the Mesos agent.
I added the following line to /etc/default/mesos-slave (the path may be different on your install) on each mesos-slave: MESOS_hadoop_home="/path/to/my/hadoop/install/folder/"
EDIT: Hadoop has to be installed on each slave; /path/to/my/hadoop/install/folder is a local path.
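After editing /etc/default/mesos-slave, a quick sanity check on each slave (the service name may differ on your install); the second command is the same one the Mesos fetcher runs in the log above:

sudo service mesos-slave restart
hadoop version 2>&1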
We untarred spark-0.9.0-incubating.tgz and are trying to build it for use with YARN.
SPARK_HADOOP_VERSION=2.0.0-cdh4.6.0 SPARK_YARN=true sbt/sbt assembly
...
[info] Resolving io.netty#netty-all;4.0.13.Final ...
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
[error] Server access Error: Connection timed out url=https://oss.sonatype.org/service/local/staging/deploy/maven2/io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom
...
If I just cut and paste the URL into a browser, I get:
404 - ItemNotFoundException
Retrieval of /io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom from M2Repository(id=snapshots) is forbidden by repository policy SNAPSHOT.
org.sonatype.nexus.proxy.ItemNotFoundException: Retrieval of /io/netty/netty-all/4.0.13.Final/netty-all-4.0.13.Final.pom from M2Repository(id=snapshots) is forbidden by repository policy SNAPSHOT.
at org.sonatype.nexus.proxy.maven.AbstractMavenRepository.doRetrieveItem(AbstractMavenRepository.java:380)
at org.sonatype.nexus.proxy.maven.maven2.M2Repository.doRetrieveItem(M2Repository.java:396)
at org.sonatype.nexus.proxy.repository.AbstractRepository.retrieveItem(AbstractRepository.java:765)
at org.sonatype.nexus.proxy.repository.AbstractRepository.retrieveItem(AbstractRepository.java:608)
at org.sonatype.nexus.proxy.router.DefaultRepositoryRouter.retrieveItem(DefaultRepositoryRouter.java:155)
at org.sonatype.nexus.web.content.NexusContentServlet.doGet(NexusContentServlet.java:359)
at org.sonatype.nexus.web.content.NexusContentServlet.service(NexusContentServlet.java:331)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
I have seen this reported in a number of places but have found no solution. Is this error because we are behind a corporate firewall, or is it due to something else? Please advise.
I had the proxy set as environment variables, but it appears they are not being picked up. Adding them to sbt directly worked for me.
Edit $SPARK_HOME/sbt/sbt
For example,
EXTRA_ARGS="-Dhttp.proxySet=true -Dhttp.proxyHost=myproxy.mycompany.com -Dhttp.proxyPort=80 -Dhttps.proxySet=true -Dhttps.proxyHost=myproxy.mycompany.com -Dhttps.proxyPort=80 -Dftp.proxySet=true -Dftp.proxyHost=myproxy.mycompany.com -Dftp.proxyPort=80 -Dhttp.nonProxyHosts=mydomain -Dhttps.nonProxyHosts=mydomain -Dftp.nonProxyHosts=mydomain"
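With those JVM proxy properties in the sbt launcher, the assembly build from the question can be rerun and should now be able to reach the repositories:

SPARK_HADOOP_VERSION=2.0.0-cdh4.6.0 SPARK_YARN=true sbt/sbt assembly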