Livy session stuck on starting after successful spark context creation - apache-spark

I've been trying to create a new Spark session with a Livy 0.7 server that runs on Ubuntu 18.04.
On the same machine I have a running Spark cluster with 2 workers, and I'm able to create a normal Spark session.
My problem is that after running the following request against the Livy server, the session stays stuck in the starting state:
import json, pprint, requests, textwrap
host = 'http://localhost:8998'
data = {'kind': 'spark'}
headers = {'Content-Type': 'application/json'}
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
r.json()
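For reference, this is roughly how I poll the session afterwards (a minimal sketch; the id comes from the POST response above and the poll interval is arbitrary):
import time

session_id = r.json()['id']
while True:
    state = requests.get(host + '/sessions/%d/state' % session_id).json()['state']
    print(state)  # stays on 'starting' and never reaches 'idle'
    if state != 'starting':
        break
    time.sleep(10)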
I can see from the session log that the session is starting and that the Spark session was created:
20/06/03 13:52:31 INFO SparkEntries: Spark context finished initialization in 5197ms
20/06/03 13:52:31 INFO SparkEntries: Created Spark session.
20/06/03 13:52:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xx.xx.xxx:1828) with ID 0
20/06/03 13:52:47 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xx.xx.xxx:1830 with 434.4 MB RAM, BlockManagerId(0, xxx.xx.xx.xxx, 1830, None)
and also from the Spark master UI.
After the livy.rsc.server.idle-timeout is reached, the session log then outputs:
20/06/03 14:28:04 WARN RSCDriver: Shutting down RSC due to idle timeout (10m).
20/06/03 14:28:04 INFO SparkUI: Stopped Spark web UI at http://172.17.52.209:4040
20/06/03 14:28:04 INFO StandaloneSchedulerBackend: Shutting down all executors
20/06/03 14:28:04 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/06/03 14:28:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/06/03 14:28:04 INFO MemoryStore: MemoryStore cleared
20/06/03 14:28:04 INFO BlockManager: BlockManager stopped
20/06/03 14:28:04 INFO BlockManagerMaster: BlockManagerMaster stopped
20/06/03 14:28:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/06/03 14:28:04 INFO SparkContext: Successfully stopped SparkContext
20/06/03 14:28:04 INFO SparkContext: SparkContext already stopped.
and after that the session dies :(
I already tried increasing the driver timeout with no luck, and didn't find any known issues like this.
My guess is that it has something to do with the Spark driver's connectivity to the RSC, but I have no idea where to configure that.
Does anyone know the reason/solution for this?

We faced a similar problem in one of our environments. The only difference between the working and non-working environments was the Spark master setting in the livy.conf file.
I removed the config livy.spark.master=yarn from livy.conf and set this value from the code itself.
// pass master as yarn
public static JavaSparkContext getSparkContext(final String master, final String appName) {
    LOGGER.info("Creating spark context");
    SparkConf conf = new SparkConf().setAppName(appName);
    if (Strings.isNullOrEmpty(master)) {
        LOGGER.warn("No spark master found setting local!!");
        conf.setMaster("local");
    } else {
        conf.setMaster(master);
    }
    conf.set("spark.submit.deployMode", "client");
    return new JavaSparkContext(conf);
}
This worked for me.
It would help if anyone could point out why this worked for me.
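For anyone doing the same from PySpark rather than Java, here is a rough equivalent of the snippet above (a sketch only; the function name and the local fallback default are mine, not from the original code):
from pyspark.sql import SparkSession

def get_spark_session(master=None, app_name="livy-app"):
    builder = SparkSession.builder.appName(app_name)
    # Fall back to local mode when no master is supplied, as in the Java version
    builder = builder.master(master if master else "local")
    builder = builder.config("spark.submit.deployMode", "client")
    return builder.getOrCreate()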

Related

Read spark stdout from driverLogUrl through livy batch API

Livy has a batch log endpoint: GET /batches/{batchId}/log, pointed out in How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow
As far as I can tell, these logs are the livy logs and not the spark driver logs. I have a print statement in a pyspark job which prints to driver log stdout.
I am able to find the driver log URL via the describe batch endpoint https://livy.incubator.apache.org/docs/latest/rest-api.html#batch: by visiting the json response['appInfo']['driverLogUrl'] URL and clicking through to the logs
The json response url looks like : http://ip-some-ip.emr.masternode:8042/node/containerlogs/container_1578061839438_0019_01_000001/livy/ and I can click through to an html page with the added url leaf: stdout/?start=-4096 to see the logs...
As it is, I can only get an HTML page of the stdout. Does a JSON-API-like version of this stdout (and preferably stderr too) exist in the YARN/EMR/Hadoop resource manager? Otherwise, is Livy able to retrieve these driver logs somehow?
Or is this an issue because I am using cluster mode instead of client mode? When I try to use client mode, I've been unable to use python3 and PYSPARK_PYTHON, which is maybe a different question, but if I can get the stdout of the driver using a different deployMode, that would work too.
If it matters, I'm running the cluster on EMR.
I met the same problem.
The short answer is that it will only work in client mode, not in cluster mode.
This is because we try to get all the logs from the master node, but the print output is local and comes from the driver node.
When Spark is running in client mode, the driver runs on your master node, so we get both the log info and the print output, as they are on the same physical machine.
However, things are different when Spark is running in cluster mode. In this case, the driver runs on one of your worker nodes, not on your master node, so we lose the print output since Livy only collects output from the master node.
You can fetch all the logs, including stdout, stderr and YARN diagnostics, via GET /batches/{batchId} (as you can also see through the batch log endpoint).
Here is a code example:
import requests

# self.job is the batch URL returned by `POST /batches`
job_response = requests.get(self.job, headers=self.headers).json()
self.job_status = job_response['state']
print(f"Job status: {self.job_status}")
for log in job_response['log']:
    print(log)
The printed logs look like this (note that these are the Spark job logs, not the Livy logs):
20/01/10 05:28:57 INFO Client: Application report for application_1578623516978_0024 (state: ACCEPTED)
20/01/10 05:28:58 INFO Client: Application report for application_1578623516978_0024 (state: ACCEPTED)
20/01/10 05:28:59 INFO Client: Application report for application_1578623516978_0024 (state: RUNNING)
20/01/10 05:28:59 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.2.100.6
ApplicationMaster RPC port: -1
queue: default
start time: 1578634135032
final status: UNDEFINED
tracking URL: http://ip-10-2-100-176.ap-northeast-2.compute.internal:20888/proxy/application_1578623516978_0024/
user: livy
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: Application application_1578623516978_0024 has started running.
20/01/10 05:28:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38087.
20/01/10 05:28:59 INFO NettyBlockTransferService: Server created on ip-10-2-100-176.ap-northeast-2.compute.internal:38087
20/01/10 05:28:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/01/10 05:28:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-2-100-176.ap-northeast-2.compute.internal:38087 with 5.4 GB RAM, BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO BlockManager: external shuffle service port = 7337
20/01/10 05:28:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, ip-10-2-100-176.ap-northeast-2.compute.internal, 38087, None)
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-2-100-176.ap-northeast-2.compute.internal, PROXY_URI_BASES -> http://ip-10-2-100-176.ap-northeast-2.compute.internal:20888/proxy/application_1578623516978_0024), /proxy/application_1578623516978_0024
20/01/10 05:28:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/01/10 05:28:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/01/10 05:28:59 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/01/10 05:28:59 INFO EventLoggingListener: Logging events to hdfs:/var/log/spark/apps/application_1578623516978_0024
20/01/10 05:28:59 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/01/10 05:28:59 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
...
Please check the Livy REST API docs for further information.
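The batch log endpoint mentioned in the question can be paged the same way if you only want the log lines (a sketch, reusing the self.job URL and headers from the snippet above; from and size are the documented query parameters):
# Fetch up to 100 log lines starting from offset 0
log_response = requests.get(self.job + '/log',
                            params={'from': 0, 'size': 100},
                            headers=self.headers).json()
for log in log_response['log']:
    print(log)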

Using RStudio-sparklyr to connect to local Spark provided by IntelliJ

Good morning,
it may sound like a stupid question, but I would like to access a temporary table in Spark from RStudio. I don't have a Spark cluster, and I run everything locally on my PC.
When I start Spark through IntelliJ, the instance runs fine:
17/11/11 10:11:33 INFO Utils: Successfully started service 'sparkDriver' on port 59505.
17/11/11 10:11:33 INFO SparkEnv: Registering MapOutputTracker
17/11/11 10:11:33 INFO SparkEnv: Registering BlockManagerMaster
17/11/11 10:11:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/11 10:11:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/11 10:11:33 INFO DiskBlockManager: Created local directory at C:\Users\stephan\AppData\Local\Temp\blockmgr-7ca4e8fb-9456-4063-bc6d-39324d7dad4c
17/11/11 10:11:33 INFO MemoryStore: MemoryStore started with capacity 898.5 MB
17/11/11 10:11:33 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/11 10:11:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/11 10:11:34 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.25.240.1:4040
17/11/11 10:11:34 INFO Executor: Starting executor ID driver on host localhost
17/11/11 10:11:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59516.
17/11/11 10:11:34 INFO NettyBlockTransferService: Server created on 172.25.240.1:59516
But I am not sure which port I have to choose in RStudio/sparklyr:
sc <- spark_connect(master = "spark://localhost:7077", spark_home = "C://Users//stephan//Downloads//spark//spark-2.2.0-bin-hadoop2.7", version = "2.2.0")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\stephan\AppData\Local\Temp\Rtmp61Ejow\file2fa024ce51af_spark.log': Permission denied
I tried different ports, like 59516, 4040, ... but all led to the same result. I guess the permission denied message can be ignored, since the log file is actually written fine:
17/11/11 01:07:30 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master localhost:7077
Can anyone please assist me with how I can establish a connection between a locally running Spark and RStudio, without RStudio starting another Spark instance?
Thanks
Stephan
Running a standalone Spark cluster is not the same thing as running Spark in local mode in your IDE, which is likely the case here. Local mode doesn't create any persistent services.
To run your own "pseudodistributed" cluster:
Download the Spark binaries.
Start the Spark master using the $SPARK_HOME/sbin/start-master.sh script.
Start a Spark worker using the $SPARK_HOME/sbin/start-slave.sh script, passing the master URL.
To be able to share tables you'll also need a proper metastore (not Derby).

mongo-spark error while loading data from mongo through spark using spark stratio connector

I am trying to load Mongo data in Spark using the Spark Stratio connector (version: spark-mongodb_2.11-0.12.0). For that I have added all the necessary dependencies. I am trying to create an RDD by loading Mongo data from my local MongoDB. Below is my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern}
import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._
import org.apache.spark.sql.SparkSession
import akka.actor.ActorSystem
import org.apache.spark.SparkConf

object newtest {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:\\winutil\\");
    import org.apache.spark.sql.functions._

    val sparkSession = SparkSession.builder().master("local").getOrCreate()
    //spark.conf.set("spark.executor.memory", "2g")

    val builder = MongodbConfigBuilder(Map(Host -> List("localhost:27017"), Database -> "test", Collection -> "SurvyAnswer", SamplingRatio -> 1.0, WriteConcern -> "normal"))
    val readConfig = builder.build()
    val columns = Array("GroupId", "_Id", "hgId")
    val mongoRDD = sparkSession.sqlContext.fromMongoDB(readConfig)
    mongoRDD.take(2).foreach(println)
  }
}
I am getting the error below; the connection is failing and I don't understand why this error is showing:
17/02/21 14:45:45 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
17/02/21 14:45:45 INFO SharedState: Warehouse path is 'file:/C:/Users/gbhog/Desktop/BDG/example/mongospark/spark-warehouse'.
17/02/21 14:45:48 INFO cluster: Cluster created with settings {hosts=[localhost:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
17/02/21 14:45:48 INFO cluster: Adding discovered server localhost:27017 to client view of cluster
Exception in thread "main" java.lang.NoSuchFieldError: NONE
at com.mongodb.casbah.WriteConcern$.<init>(WriteConcern.scala:40)
at com.mongodb.casbah.WriteConcern$.<clinit>(WriteConcern.scala)
at com.mongodb.casbah.BaseImports$class.$init$(Implicits.scala:162)
at com.mongodb.casbah.Imports$.<init>(Implicits.scala:142)
at com.mongodb.casbah.Imports$.<clinit>(Implicits.scala)
at com.mongodb.casbah.MongoClient.apply(MongoClient.scala:217)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.isShardedCollection(MongodbPartitioner.scala:78)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner$$anonfun$computePartitions$1.apply(MongodbPartitioner.scala:67)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner$$anonfun$computePartitions$1.apply(MongodbPartitioner.scala:66)
at com.stratio.datasource.mongodb.util.usingMongoClient$.apply(usingMongoClient.scala:27)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.computePartitions(MongodbPartitioner.scala:66)
17/02/21 14:45:48 INFO SparkContext: Invoking stop() from shutdown hook
17/02/21 14:45:48 INFO SparkUI: Stopped Spark web UI at http://192.168.242.1:4040
17/02/21 14:45:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/02/21 14:45:49 INFO MemoryStore: MemoryStore cleared
17/02/21 14:45:49 INFO BlockManager: BlockManager stopped
17/02/21 14:45:49 INFO BlockManagerMaster: BlockManagerMaster stopped
17/02/21 14:45:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/02/21 14:45:49 INFO SparkContext: Successfully stopped SparkContext
17/02/21 14:45:49 INFO ShutdownHookManager: Shutdown hook called

Why does my Spark Streaming application shut down immediately (and not process any Kafka records)?

I've created a Spark application in Python following the example described in Spark Streaming + Kafka Integration Guide (Kafka broker version 0.8.2.1 or higher) to stream Kafka messages using Apache Spark, but it's shutting down before I get the chance to send any messages.
This is where the shutdown section begins in the output.
16/11/26 17:11:06 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 1********6, 58045)
16/11/26 17:11:06 INFO VerifiableProperties: Verifying properties
16/11/26 17:11:06 INFO VerifiableProperties: Property group.id is overridden to
16/11/26 17:11:06 INFO VerifiableProperties: Property zookeeper.connect is overridden to
16/11/26 17:11:07 INFO SparkContext: Invoking stop() from shutdown hook
16/11/26 17:11:07 INFO SparkUI: Stopped Spark web UI at http://192.168.1.16:4040
16/11/26 17:11:07 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/26 17:11:07 INFO MemoryStore: MemoryStore cleared
16/11/26 17:11:07 INFO BlockManager: BlockManager stopped
16/11/26 17:11:07 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/26 17:11:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/26 17:11:07 INFO SparkContext: Successfully stopped SparkContext
16/11/26 17:11:07 INFO ShutdownHookManager: Shutdown hook called
16/11/26 17:11:07 INFO ShutdownHookManager: Deleting directory /private/var/folders/yn/t3pvrk7s231_11ff2lqr4jhr0000gn/T/spark-1876feee-9b71-413e-a505-99c414aafabf/pyspark-1d97c3dd-0889-42ed-b559-d0fd473faa22
16/11/26 17:11:07 INFO ShutdownHookManager: Deleting directory /private/var/folders/yn/t3pvrk7s231_11ff2lqr4jhr0000gn/T/spark-1876feee-9b71-413e-a505-99c414aafabf
Is there a way I should tell it to wait or am I missing something?
Full code:
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "TwitterWordCount")
ssc = StreamingContext(sc, 1)

directKafkaStream = KafkaUtils.createDirectStream(ssc, ["next"], {"metadata.broker.list": "localhost:9092"})

offsetRanges = []

def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("Printing! %s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

directKafkaStream\
    .transform(storeOffsetRanges)\
    .foreachRDD(printOffsetRanges)
And here's the command to run it in case that's helpful.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 producer.py
You will also need to start the streaming context. Take a look at this example.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
For Scala, when submitting to YARN in cluster mode, I had to use awaitAnyTermination:
query.start()
sparkSession.streams.awaitAnyTermination()
as (kind of) per the docs here: Structured Streaming Guide, halfway through the Quick Example.
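If you hit the same thing from PySpark with Structured Streaming, the counterpart would look roughly like this (a sketch; the query itself is only indicated as a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-app").getOrCreate()

# ... define and start your streaming query/queries here, e.g.
# query = some_stream_df.writeStream.format("console").start()

# Block until any of the active streaming queries terminates
spark.streams.awaitAnyTermination()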

Spark UI's kill is not killing Driver

I am trying to kill my Spark-Kafka streaming job from the Spark UI. It is able to kill the application, but the driver is still running.
Can anyone help me with this? My other streaming jobs are fine; only this one streaming job gives me this problem every time.
I can't kill the driver through the command line or the Spark UI. The Spark master is alive.
The output I collected from the logs is:
16/10/25 03:14:25 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
16/10/25 03:14:25 INFO SparkUI: Stopped Spark web UI at http://***:4040
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/10/25 03:14:35 INFO AppClient: Stop request to Master timed out; it may already be shut down.
16/10/25 03:14:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/25 03:14:35 INFO MemoryStore: MemoryStore cleared
16/10/25 03:14:35 INFO BlockManager: BlockManager stopped
16/10/25 03:14:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/25 03:14:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/25 03:14:35 INFO SparkContext: Successfully stopped SparkContext
16/10/25 03:14:35 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438)
at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:124)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:172)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/25 03:14:35 WARN NettyRpcEnv: Ignored message: true
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
Get the running driverId from the Spark UI, and hit the POST REST call (on the Spark master's REST port, e.g. 6066) to kill the pipeline. I have tested it with Spark 1.6.1.
curl -X POST http://localhost:6066/v1/submissions/kill/driverId
Hope it helps...
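The same call from Python with requests, if that is more convenient (a sketch; the driverId below is a made-up example, copy the real one from the Spark UI):
import requests

driver_id = "driver-20161025031425-0000"  # hypothetical id, take the real one from the Spark UI
resp = requests.post("http://localhost:6066/v1/submissions/kill/" + driver_id)
print(resp.status_code, resp.text)  # the master's REST server responds with a JSON status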
