Submit Spark Job to Google Cloud Platform - apache-spark

Has anyone tried deploying Spark using https://console.developers.google.com/project/_/mc/template/hadoop?
Spark installed correctly for me; I can SSH into the Hadoop worker or master, and Spark is installed at /home/hadoop/spark-install/.
I can use the Spark Python shell to read a file from Cloud Storage:
lines = sc.textFile("hello.txt")
lines.count()
lines.first()
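As a side note, if the deployment configured the GCS connector (which the click-to-deploy Hadoop setup normally does), a fully qualified Cloud Storage path can also be used. A minimal sketch, with a hypothetical bucket name that is not from the original post:
# hypothetical bucket; gs:// URIs resolve through the GCS connector when it is on the classpath
lines = sc.textFile("gs://my-bucket/hello.txt")
lines.count()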
However, I cannot successfully submit the Python example to the Spark cluster. When I run
bin/spark-submit --master spark://hadoop-m-XXX:7077 examples/src/main/python/pi.py 10
I always get:
Traceback (most recent call last):
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/pi.py", line 38, in <module>
count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 759, in reduce
vals = self.mapPartitions(func).collect()
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 723, in collect
bytesInJava = self._jrdd.collect().iterator()
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/Users/yuanwang/programming/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I am pretty sure I am not connecting to the Spark cluster correctly. Has anyone successfully connected to a Spark cluster on Compute Engine?

You can run jobs from the master:
ssh to the master node:
gcloud compute ssh --zone <zone> hadoop-m-<hash>
and then:
$ cd /home/hadoop/spark-install
$ spark-submit examples/src/main/python/pi.py 10
and somewhere in the output you should see something like:
Pi is roughly 3.140100
It looks like you are trying to submit jobs remotely. I'm not sure how to get that to work, but you can submit jobs from the master.
BTW, as a routine operation, you can validate your Spark installation with:
cd /usr/local/share/google/bdutil-0.35.2/extensions/spark
sudo chmod 755 spark-validate-setup.sh
./spark-validate-setup.sh

Related

Spark on Fargate can't find local IP

I have a build job I'm trying to set up in an AWS Fargate cluster of 1 node. When I try to run Spark to build my data, I get an error that seems to be about Java not being able to find "localHost".
I set up the config by running a script that adds the spark-env.sh file, updates the /etc/hosts file, and updates the spark-defaults.conf file (see the sketch after the config summary below).
In the $SPARK_HOME/conf/spark-env.sh file, I add:
SPARK_LOCAL_IP
SPARK_MASTER_HOST
In the $SPARK_HOME/conf/spark-defaults.conf
spark.jars.packages <comma separated jars>
spark.master <ip or URL>
spark.driver.bindAddress <IP or URL>
spark.driver.host <IP or URL>
In the /etc/hosts file, I append:
<IP I get from http://169.254.170.2/v2/metadata> master
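For concreteness, a minimal sketch of what such a setup script could look like; the IP value and SPARK_HOME default are placeholders taken from the logs below, not the author's actual script, and in practice the address would come from the Fargate metadata endpoint:
# sketch of a config-writing step; values are placeholders
import os

spark_home = os.environ.get("SPARK_HOME", "/usr/spark")
node_ip = "10.0.41.190"  # hypothetical: the IP read from http://169.254.170.2/v2/metadata

# spark-env.sh: bind addresses for this node
with open(os.path.join(spark_home, "conf", "spark-env.sh"), "a") as env_file:
    env_file.write("SPARK_LOCAL_IP={}\n".format(node_ip))
    env_file.write("SPARK_MASTER_HOST={}\n".format(node_ip))

# spark-defaults.conf: master URL and driver addresses
with open(os.path.join(spark_home, "conf", "spark-defaults.conf"), "a") as defaults:
    defaults.write("spark.master spark://{}:7077\n".format(node_ip))
    defaults.write("spark.driver.bindAddress {}\n".format(node_ip))
    defaults.write("spark.driver.host {}\n".format(node_ip))

# /etc/hosts: make the name "master" resolve to this node
with open("/etc/hosts", "a") as hosts:
    hosts.write("{} master\n".format(node_ip))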
Invoking the spark-submit script by passing in the --master <IP or URL> argument doesn't seem to help.
I've tried using local[*], spark://<ip from metadata>:<port from metadata>, <ip> and <ip>:<port> variations, to no avail.
Using 127.0.0.1 and localhost doesn't seem to make a difference compared to using things like master and the IP returned from the metadata endpoint.
On the AWS side, the Fargate cluster is running in a private subnet with a NAT gateway attached, so it has egress and ingress network routes, as far as I can tell. I've also tried using a public network and setting ECS to automatically attach a public IP to the container (ENABLED).
All the standard ports from the Spark docs are opened up on the container too.
It seems to run fine up until the point at which it tries to gather its own IP.
The error that I get back has this in the stack:
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
spark.master spark://10.0.41.190:7077
Spark Command: /docker-java-home/bin/java -cp /usr/spark/conf/:/usr/spark/jars/* -Xmx1gg org.apache.spark.deploy.SparkSubmit --master spark://10.0.41.190:7077 --verbose --jars lib/RedshiftJDBC42-1.2.12.1017.jar --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,com.upplication:s3fs:2.2.1 ./build_phase.py
========================================
Using properties file: /usr/spark/conf/spark-defaults.conf
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.util.Utils$.redact(Utils.scala:2653)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$defaultSparkProperties$1.apply(SparkSubmitArguments.scala:93)
at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$defaultSparkProperties$1.apply(SparkSubmitArguments.scala:86)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.deploy.SparkSubmitArguments.defaultSparkProperties$lzycompute(SparkSubmitArguments.scala:86)
at org.apache.spark.deploy.SparkSubmitArguments.defaultSparkProperties(SparkSubmitArguments.scala:82)
at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:126)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:110)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: d4771b650361: d4771b650361: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:891)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:884)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:884)
at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.localHostName(Utils.scala:941)
at org.apache.spark.internal.config.package$.<init>(package.scala:204)
at org.apache.spark.internal.config.package$.<clinit>(package.scala)
... 10 more
The container has no problems when trying to run locally so I think it has something to do with the nature of Fargate.
Any help or pointers would be much appreciated!
Edit
Since the post I've tried a few different things. I am using images that run with Spark 2.3, Hadoop 2.7 and Python 3 and I've tried adding OS packages and different variations of the config I mentioned already.
It all smells like I'm doing the spark-defaults.conf and friends wrong but I'm so new to this stuff that it could just be a bad alignment of Jupiter and Mars...
The current stack trace:
Getting Spark Context...
2018-06-08 22:39:40 INFO SparkContext:54 - Running Spark version 2.3.0
2018-06-08 22:39:40 INFO SparkContext:54 - Submitted application: SmashPlanner
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing view acls to: root
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing modify acls to: root
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing view acls groups to:
2018-06-08 22:39:41 INFO SecurityManager:54 - Changing modify acls groups to:
2018-06-08 22:39:41 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-06-08 22:39:41 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
2018-06-08 22:39:41 INFO SparkContext:54 - Successfully stopped SparkContext
Traceback (most recent call last):
File "/usr/local/smash_planner/build_phase.py", line 13, in <module>
main()
File "/usr/local/smash_planner/build_phase.py", line 9, in main
build_all_data(pred_date)
File "/usr/local/smash_planner/DataPiping/build_data.py", line 25, in build_all_data
save_keyword(pred_date)
File "/usr/local/smash_planner/DataPiping/build_data.py", line 52, in save_keyword
df = get_dataframe(query)
File "/usr/local/smash_planner/SparkUtil/data_piping.py", line 15, in get_dataframe
sc = SparkCtx.get_sparkCtx()
File "/usr/local/smash_planner/SparkUtil/context.py", line 20, in get_sparkCtx
sc = SparkContext(conf=conf).getOrCreate()
File "/usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
File "/usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/context.py", line 180, in _do_init
File "/usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/context.py", line 270, in _initialize_context
File "/usr/local/lib/python3.4/dist-packages/py4j-0.10.6-py3.4.egg/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/usr/local/lib/python3.4/dist-packages/py4j-0.10.6-py3.4.egg/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
2018-06-08 22:39:41 INFO ShutdownHookManager:54 - Shutdown hook called
2018-06-08 22:39:41 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-80488ba8-2367-4fa6-8bb7-194b5ebf08cc
Traceback (most recent call last):
File "bin/smash_planner.py", line 76, in <module>
raise RuntimeError("Spark hated your config and/or invocation...")
RuntimeError: Spark hated your config and/or invocation...
SparkConf runtime configuration:
def get_dataframe(query):
    ...
    sc = SparkCtx.get_sparkCtx()
    sql_context = SQLContext(sc)
    df = sql_context.read \
        .format("jdbc") \
        .option("driver", "com.amazon.redshift.jdbc42.Driver") \
        .option("url", os.getenv('JDBC_URL')) \
        .option("user", os.getenv('REDSHIFT_USER')) \
        .option("password", os.getenv('REDSHIFT_PASSWORD')) \
        .option("dbtable", "( " + query + " ) tmp ") \
        .load()
    return df
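The SparkCtx helper referenced above isn't shown in the post. A minimal sketch of what such a wrapper might look like, wiring in the bind/host settings the post mentions; the app name and master URL are taken from the logs above, everything else (environment variable names, defaults) is a placeholder rather than the author's actual code:
# hypothetical reconstruction of SparkUtil/context.py; not the author's actual code
import os

from pyspark import SparkConf, SparkContext


class SparkCtx(object):
    @staticmethod
    def get_sparkCtx():
        conf = (SparkConf()
                .setAppName("SmashPlanner")
                .setMaster(os.getenv("SPARK_MASTER_URL", "spark://10.0.41.190:7077"))
                # the driver must bind to an address that resolves inside the container
                .set("spark.driver.bindAddress", os.getenv("SPARK_LOCAL_IP", "0.0.0.0"))
                .set("spark.driver.host", os.getenv("SPARK_LOCAL_IP", "127.0.0.1")))
        return SparkContext.getOrCreate(conf=conf)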
Edit 2
Using only the spark-env configuration and running with the defaults from the gettyimages/docker-spark image gives this error in the browser.
java.util.NoSuchElementException
at java.util.Collections$EmptyIterator.next(Collections.java:4189)
at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281)
at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38)
at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:273)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:534)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
The solution is to avoid user error...
This was a total face-palm situation but I hope my misunderstanding of the Spark system can help some poor fool, like myself, who has spent too much time stuck on the same type of problem.
The answer for the last iteration (gettyimages/docker-spark Docker image) was that I was trying to run the spark-submit command without having a master or worker(s) started.
In the gettyimages/docker-spark repo, you can find a docker-compose file that shows you that it creates the master and the worker nodes before any spark work is done. The way that image creates a master or a worker is by using the spark-class script and passing in the org.apache.spark.deploy.<master|worker>.<Master|Worker> class, respectively.
So, putting it all together, I can use the configuration I was using but I have to create the master and worker(s) first, then execute the spark-submit command the same as I was already doing.
This is a quick-and-dirty sketch of one implementation, although I guarantee there are better ones, done by folks who actually know what they're doing:
The first 3 steps happen in a cluster boot script. I do this in an AWS Lambda, triggered by API Gateway.
create a cluster and a queue or some sort of message brokerage system, like zookeeper/kafka. (I'm using API-Gateway -> lambda for this)
pick a master node (logic in the lambda)
create a message with some basic information, like the master's IP or domain and put it in the queue from step 1 (happens in the lambda)
Everything below this happens in the startup script on the Spark nodes
create a step in the startup script that has the nodes check the queue for the message from step 3
add SPARK_MASTER_HOST and SPARK_LOCAL_IP to the $SPARK_HOME/conf/spark-env.sh file, using the information from the message you picked up in step 4
add spark.driver.bindAddress to the $SPARK_HOME/conf/spark-defaults.conf file, using the information from the message you picked up in step 4
use some logic in your startup script to decide whether "this" node is a master or a worker
start the master or worker. In the gettyimages/docker-spark image, you can start a master with $SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master -h <the master's IP or domain> and you can start a worker with $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master's domain or IP>:7077
Now you can run the spark-submit command, which will deploy the work to the cluster.
Edit: (some code for reference)
This is the addition to the lambda
def handler(event, context):
    config = BuildConfig(event)
    res = create_job(config)
    return build_response(res)
and after the edit
import json

import boto3


def handler(event, context):
    config = BuildConfig(event)
    coordination_queue = config.cluster + '-coordination'

    sqs = boto3.client('sqs')
    message_for_master_node = {'type': 'master', 'count': config.count}

    # list_queues omits 'QueueUrls' entirely when nothing matches the prefix
    queue_urls = sqs.list_queues(QueueNamePrefix=coordination_queue).get('QueueUrls', [])
    if not queue_urls:
        queue_url = sqs.create_queue(QueueName=coordination_queue)['QueueUrl']
    else:
        queue_url = queue_urls[0]

    # SQS message bodies must be strings, so serialize the dict
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=json.dumps(message_for_master_node))

    res = create_job(config)
    return build_response(res)
and then I added a little to the script that the nodes in the Spark cluster run on startup:
# addition to the "main" in the Spark node's startup script
sqs = boto3.client('sqs')
boot_info_message = sqs.receive_message(
    QueueUrl=os.getenv('COORDINATIN_QUEUE_URL'),
    MaxNumberOfMessages=1)['Messages'][0]
# the message body comes back as a JSON string, so parse it into a dict
boot_info = json.loads(boot_info_message['Body'])
# self_url is this node's own address, set earlier in the startup script
message_for_worker = {'type': 'worker', 'master': self_url}

if boot_info['type'] == 'master':
    # the master fans out one message per expected worker
    for i in range(int(boot_info['count'])):
        sqs.send_message(QueueUrl=os.getenv('COORDINATIN_QUEUE_URL'),
                         MessageBody=json.dumps(message_for_worker))

sqs.delete_message(QueueUrl=os.getenv('COORDINATIN_QUEUE_URL'),
                   ReceiptHandle=boot_info_message['ReceiptHandle'])
...
# starts a master or worker node via spark-class
# (role-specific arguments such as -h or the master URL are omitted here)
startup_class = "org.apache.spark.deploy.{}.{}".format(
    boot_info['type'], boot_info['type'].title())
subprocess.call([os.path.join(os.environ['SPARK_HOME'], 'bin', 'spark-class'),
                 startup_class])
Go to the AWS console and, under your security group configuration, allow all inbound traffic to the instance.
https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_SecurityGroups.html

Zeppelin Pyspark on HDP 2.3 giving error

I am trying to configure zeppelin to work with HDP 2.3 (Spark 1.3). I have successfully installed zeppelin via Ambari and the zeppelin service is running.
But when I try to run any %pyspark command I get the error below.
I read a few blogs, but it seems like there is some issue with jars compiled on Java 6 vs. Java 7 being shared between Python and Spark.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, sandbox.hortonworks.com): org.apache.spark.SparkException:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.\n', JavaObject id=o68), <traceback object at 0x2618bd8>)
Took 0 seconds
Can you check whether your zeppelin-env.sh has the line below?
export PYTHONPATH=${SPARK_HOME}/python
If missing, this can be added via Ambari under Zeppelin > Configs > Advanced zeppelin-env > zeppelin-env template
Although, if you installed using the latest version of the Ambari service for Zeppelin, it should have done this for you:
https://github.com/hortonworks-gallery/ambari-zeppelin-service/blob/master/configuration/zeppelin-env.xml#L63
I just set up a fresh HDP 2.3 install (2.3.0.0-2557) on CentOS 6.5 using Ambari 2.1 and installed Zeppelin using the Ambari Zeppelin service (with default configs). PySpark seems to work fine for me.
Based on your error it sounds like PYTHONPATH is not getting set to the correct value:
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
In Zeppelin, can you enter the following in a cell, run it, and provide the output?
System.getenv().get("MASTER")
System.getenv().get("SPARK_YARN_JAR")
System.getenv().get("HADOOP_CONF_DIR")
System.getenv().get("JAVA_HOME")
System.getenv().get("SPARK_HOME")
System.getenv().get("PYSPARK_PYTHON")
System.getenv().get("PYTHONPATH")
System.getenv().get("ZEPPELIN_JAVA_OPTS")
Here is the output on my setup:
res41: String = yarn-client
res42: String = hdfs:///apps/zeppelin/zeppelin-spark-0.6.0-SNAPSHOT.jar
res43: String = /etc/hadoop/conf
res44: String = /usr/java/default
res45: String = /usr/hdp/current/spark-client/
res46: String = null
res47: String = /usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/lib/pyspark.zip:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip
res48: String = -Dhdp.version=2.3.0.0-2557 -Dspark.executor.memory=512m -Dspark.yarn.queue=default
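If you would rather check from the %pyspark interpreter itself, printing the environment from a pyspark cell should show the same information; a quick check, not part of the original answer:
%pyspark
# prints what the driver-side Python process sees for these variables
import os
print(os.environ.get("PYTHONPATH"))
print(os.environ.get("SPARK_HOME"))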

Spark Standalone Mode not working in a cluster

My installation of Spark is not working correctly in my local cluster. I downloaded spark-1.4.0-bin-hadoop2.6.tgz and untarred it in a directory visible to all nodes (the nodes are all accessible by ssh without a password). In addition, I edited conf/slaves so that it contains the names of the nodes. Then I ran sbin/start-all.sh. The Web UI on the master became available and the nodes appeared in the workers section. However, if I start a pyspark session (connecting to the master using the URL that appeared in the Web UI) and try to run this simple example:
a=sc.parallelize([0,1,2,3],2)
a.collect()
I get this error:
15/07/12 19:52:58 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/myuser/spark-1.4.0-bin-hadoop2.6/python/pyspark/rdd.py", line 745, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/home/myuser/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/myuser/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.16.1.1): java.io.InvalidClassException: scala.reflect.ClassTag$$anon$1; local class incompatible: stream classdesc serialVersionUID = -4937928798201944954, local class serialVersionUID = -8102093212602380348
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1514)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1750)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Has anyone experienced this issue? Thanks in advance.
It seems like a type cast exception.
Can you try the input as sc.parallelize(List(1,2,3,4,5,6), 2) and re-run?
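Note that List(1,2,3,4,5,6) is Scala syntax; since the question uses the PySpark shell, the equivalent check there would presumably be a plain Python list:
# PySpark equivalent of the suggested check
a = sc.parallelize([1, 2, 3, 4, 5, 6], 2)
a.collect()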
Please check that you are using the proper JAVA_HOME.
You should set it before launching the Spark job.
For example:
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH

submit .py script on Spark without Hadoop installation

I have the following simple wordcount Python script.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf = conf)
from operator import add
f=sc.textFile("C:/Spark/spark-1.2.0/README.md")
wc=f.flatMap(lambda x: x.split(" ")).map(lambda x: (x,1)).reduceByKey(add)
print wc
wc.saveAsTextFile("wc_out.txt")
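As a side note, print wc in the script above only prints the RDD object (its id and lineage), not the word counts; to inspect a few results on the driver you would typically take or collect, for example (not part of the original script):
# show a handful of (word, count) pairs on the driver; Python 2 print to match the script
for word, count in wc.take(10):
    print word, count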
I am launching this script using this command line:
spark-submit "C:/Users/Alexis/Desktop/SparkTest.py"
I am getting the following error:
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
15/04/20 18:58:01 WARN Utils: Your hostname, AE-LenovoUltra resolves to a loopback address: 127.0.1.2; using 192.168.1.63 instead (on interface net0)
15/04/20 18:58:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/04/20 18:58:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/20 18:58:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:867)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
File "C:/Users/Alexis/Desktop/SparkTest.py", line 3, in <module>
sc = SparkContext(conf = conf)
File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 105, in __init__
conf, jsc)
File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 153, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "C:\Spark\spark-1.2.0\python\pyspark\context.py", line 201, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "C:\Spark\spark-1.2.0\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 701, in __call__
File "C:\Spark\spark-1.2.0\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:411)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:969)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:280)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:280)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
To a Spark beginner like me, it seems that this is the problem: "ERROR Shell: Failed to locate the winutils binary in the hadoop binary path". However, the Spark documentation clearly states that a Hadoop installation is not necessary for Spark to run in standalone mode.
What am I doing wrong?
The good news is you're not doing anything wrong, and your code will run after the error is mitigated.
Despite the statement that Spark will run on Windows without Hadoop, it still looks for some Hadoop components. The bug has a JIRA ticket (SPARK-2356), and a patch is available. As of Spark 1.3.1, the patch hasn't been committed to the main branch yet.
Fortunately, there's a fairly easy workaround.
Create a bin directory for winutils under your Spark installation directory. In my case, Spark is installed in D:\Languages\Spark, so I created the following path: D:\Languages\Spark\winutils\bin
Download the winutils.exe from Hortonworks and put it into the bin directory created in the first step. Download link for Win64: http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
Create a "HADOOP_HOME" environment variable that points to the winutils directory (not the bin subdirectory). You can do this in a couple of ways:
a. Establish a permanent environment variable via the Control Panel -> System -> Advanced System Settings -> Advanced Tab -> Environment variables. You can create either a user variable or a system variable with the following parameters:
Variable Name=HADOOP_HOME
Variable Value=D:\Languages\Spark\winutils\
b. Set a temporary environment variable inside your command shell
before executing your script
set HADOOP_HOME=d:\\Languages\\Spark\\winutils
Run your code. It should work without error now.
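If you want a quick sanity check before re-running spark-submit, a small snippet like the following (not part of the original answer; HADOOP_HOME is whatever you set above) confirms the variable is visible and winutils.exe is where Spark will look for it:
# quick check that HADOOP_HOME is set and %HADOOP_HOME%\bin\winutils.exe exists
import os

hadoop_home = os.environ.get("HADOOP_HOME")
winutils = os.path.join(hadoop_home, "bin", "winutils.exe") if hadoop_home else None
print("HADOOP_HOME = " + str(hadoop_home))
print("winutils.exe present: " + str(bool(winutils and os.path.isfile(winutils))))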

Spark-Submit exception SparkException: Job aborted due to stage failure

Whenever I try to run a spark-submit command like the one below, I get an exception. Could someone please suggest what's going wrong here?
My command:
spark-submit --class com.xyz.MyTestClass --master spark://<spark-master-IP>:7077 SparkTest.jar
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: TID 7 on host <hostname> failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I am not sure whether the parameter
--master spark://<spark-master-IP>:7077
is literally what you have written, rather than the actual IP of the master node. If so, you should change it and use the IP or public DNS of the master, such as:
--master spark://ec2-XX-XX-XX-XX.eu-west-1.compute.amazonaws.com:7077
If that's not the case, I would appreciate it if you could provide more information about the application error, as pointed out in the comments above. Also make sure that the --class parameter points to the actual main class of the application.
