How to print out the Spark connection of a Spark session?

Suppose I run the pyspark command and get a global variable spark of type SparkSession. As I understand it, this spark object holds a connection to the Spark master. Can I print out the details of this connection, including the hostname of the Spark master?

For basic information you can use the master property:
spark.sparkContext.master
To get details on YARN you might have to dig through hadoopConfiguration:
hadoopConfiguration = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConfiguration.get("yarn.resourcemanager.hostname")
or
hadoopConfiguration.get("yarn.resourcemanager.address")
When submitted to YARN, Spark uses the Hadoop configuration to determine the resource manager, so these values should match the ones present in the configuration placed in HADOOP_CONF_DIR or YARN_CONF_DIR.
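If you are in the Scala spark-shell rather than pyspark, a roughly equivalent lookup (just a sketch; the keys are the same YARN properties as above) is:
// Scala sketch: read the same YARN properties through the SparkContext's Hadoop configuration
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(spark.sparkContext.master)
println(hadoopConf.get("yarn.resourcemanager.hostname"))
println(hadoopConf.get("yarn.resourcemanager.address"))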

Related

Spark RDD.pipe run bash script as a specific user

I notice that RDD.pipe(Seq("/tmp/test.sh")) runs the shell script as the yarn user. That is problematic because it allows the Spark user to access files that should only be accessible to the yarn user.
What is the best way to address this?
Calling sudo -u sparkuser is not a clean solution; I would hate to even consider that.
I am not sure whether Spark is at fault for treating pipe() differently, but I opened a similar issue in JIRA: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-26101
Now on to the problem. Apparently, on a YARN cluster Spark's pipe() asks YARN for a container, and whether your Hadoop is non-secure or secured by Kerberos determines whether the container runs as the yarn/nobody user or as the user who launched the container (your actual user).
Either use Kerberos to secure your Hadoop, or, if you don't want to go through securing it, set two configs in YARN so that it uses Linux users/groups to launch the container. Note that you must share the same users/groups across all the nodes in your cluster; otherwise this won't work (perhaps use LDAP/AD to sync your users/groups).
Set these:
yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users = false
yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
Source: https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html
(this is the same even in Hadoop 3.0)
This fix worked on the latest Cloudera CDH 5.15.1 (yarn-site.xml):
http://community.cloudera.com/t5/Batch-Processing-and-Workflow/YARN-force-nobody-user-on-all-jobs-and-so-they-fail/m-p/82572/highlight/true#M3882
Example:
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at repartition at <console>:25
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[5] at pipe at <console>:25
c: Array[String] = Array(maziyar)
After setting those configs in yarn-site.xml and syncing all the users/groups across all the nodes, this will return the username of the user who started the Spark session.

Apache Spark: resulting file being created at worker node instead of master node

I configured one master on my local PC and a worker node inside VirtualBox, and the result file is being created on the worker node instead of being sent back to the master node. I wonder why that is.
Is it because my worker node cannot send the result back to the master node? How can I verify that?
I use Spark 2.2.
I use the same username on the master and worker nodes.
I also configured passwordless SSH.
I tried --deploy-mode client and --deploy-mode cluster.
I tried once, then switched the master/worker nodes, and got the same result.
val result = joined.distinct()
result.write.mode("overwrite").format("csv")
.option("header", "true").option("delimiter", ";")
.save("file:///home/data/KPI/KpiDensite.csv")
Also, I load the input file like this:
val commerce = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true")
.option("delimiter", "|").load("file:///home/data/equip-serv-commerce-infra-2016.csv").distinct()
But why must I pre-place the file on both the master and the worker node at the same path? I don't use YARN or Mesos right now.
You are exporting to a local file system path, which tells Spark to write to the file system of the machine running the code. On a worker, that will be the worker machine's file system.
If you want the data to be stored on the file system of the driver (note: the driver, not the master; you'll need to know where the driver is running on your cluster), then you need to collect the RDD or DataFrame and use normal IO code to write the data to a file.
The easiest option, however, is to use a distributed storage system such as HDFS (.save("hdfs://master:port/data/KPI/KpiDensite.csv")) or to export to a database (via JDBC or a NoSQL store), especially if you're running your application in cluster mode.
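For illustration, a minimal sketch of the collect-and-write approach, assuming the result is small enough to fit in driver memory (the path and delimiter are the ones from the question):
// Sketch only: materialize the result on the driver and write it with plain Java IO.
// Add a header line first if you need one.
import java.io.PrintWriter
val rows = result.collect() // Array[Row], now living on the driver
val writer = new PrintWriter("/home/data/KPI/KpiDensite.csv")
try {
  rows.foreach(row => writer.println(row.mkString(";")))
} finally {
  writer.close()
}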

Spark: Exception in thread "main" akka.actor.ActorNotFound:

I am submitting my Spark jobs from a local laptop to a remote standalone Spark cluster (spark://IP:7077). The job submits successfully. However, I do not get any output, and it fails after some time. When I check the workers on my cluster, I find the following exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver@localhost:54561/]/user/CoarseGrainedScheduler]
When I run the same code on my local system (local[*]), it runs successfully and gives the output.
Note that I run it in a spark notebook. The same application runs successfully on the remote standalone cluster when I submit it via the terminal using spark-submit.
Am I missing something in the configuration of notebook? Any other possible causes?
The code is very simple.
Detailed exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver@localhost:54561/]/user/CoarseGrainedScheduler]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
at akka.actor.ActorRef.tell(ActorRef.scala:125)
at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
at akka.actor.ActorCell.terminate(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.run(Mailbox.scala:218)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Sample code
val logFile = "hdfs://hostname/path/to/file"
val conf = new SparkConf()
.setMaster("spark://hostname:7077") // as appears on hostname:8080
.setAppName("myapp")
.set("spark.executor.memory", "20G")
.set("spark.cores.max", "40")
.set("spark.executor.cores","20")
.set("spark.driver.allowMultipleContexts","true")
val sc2 = new SparkContext(conf)
val logData = sc2.textFile(logFile)
val numAs = logData.filter(line => line.contains("hello")).count()
val numBs = logData.filter(line => line.contains("hi")).count()
println("Lines with hello: %s, Lines with hi: %s".format(numAs, numBs))
Update:
The above issue can be avoided by including the IP address of the driver (i.e., the local laptop's public IP) in the application code. This can be done by adding the following line when building the Spark conf:
.set("spark.driver.host",YourSystemIPAddress)
However, there can be an issue if the driver's IP address is behind NAT; in that case the workers will not be able to reach it.
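For concreteness, a minimal sketch of the conf from the sample code above with the driver host pinned (the IP below is a placeholder; use an address the workers can actually route to):
import org.apache.spark.SparkConf
// Sketch: same conf as in the sample code, plus a pinned, routable driver address
val conf = new SparkConf()
  .setMaster("spark://hostname:7077")
  .setAppName("myapp")
  .set("spark.driver.host", "203.0.113.5") // placeholder for your laptop's reachable IP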
When you say "spark notebook" I am assuming you mean the github project https://github.com/andypetrella/spark-notebook?
I would have to look into the specifics of the notebook, but I notice from the exception that your worker is trying to reach the driver at "localhost".
For a normal Spark configuration, on the worker set SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh and see if that helps. Even if you are running on a single machine in standalone mode, set this. In my experience Spark doesn't always resolve hostnames properly, so starting from a baseline of all IPs is a good idea.
The rest is general info, see if it helps with your specific issue:
If you are submitting to a cluster from your laptop, you use --deploy-mode cluster to tell your driver to run on one of the worker nodes. This creates an extra consideration of how you set up your classpath, because you don't know which worker the driver will run on.
Here's some general info in the interest of completeness: there is a known Spark bug around hostnames resolving to IP addresses. I am not presenting this as the complete answer in all cases, but I suggest trying with a baseline of just using IPs everywhere and only the single config SPARK_MASTER_IP. With just those two practices I get my clusters to work, and all the other configs, or using hostnames, just seem to muck things up.
So in your spark-env.sh get rid of SPARK_LOCAL_IP and change SPARK_MASTER_IP to an IP address, not a hostname.
I have treated this more at length in this answer.
For more completeness here's part of that answer:
Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you password-less ssh to the worker from the master box? Per 1.5.2 docs you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.
I think problems with the slaves file on the master node, or with password-less ssh, can lead to errors similar to what you are seeing.
Per the answer I crosslinked, there's also an old bug but it's not clear how that bug was resolved.

How to configure SSL between Spark and Cassandra?

I'm trying to configure SSL for the Cassandra Spark connector, but I couldn't find an example of how to do it.
I'm trying to configure it like this:
SparkConf conf = new SparkConf().setAppName("someApp")
.set("spark.cassandra.connection.host", "111.111.111.111")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.connection.ssl.trustStore.path", "/some/tfile.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "apassword")
.set("spark.cassandra.connection.ssl.trustStore.type", "JKS")
.set("spark.cassandra.connection.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA")
.set("spark.cassandra.connection.ssl.keyStore.path", "/some/kfile.jks")
.set("spark.cassandra.connection.ssl.keyStore.password", "anotherpassword")
.set("spark.cassandra.connection.ssl.keyStore.type", "JKS")
.set("spark.cassandra.connection.ssl.protocol", "TLS");
When I try to submit the spark job, I get these errors:
Exception in thread "main" com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.ssl.keyStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabled is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.protocol is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.enabledAlgorithms is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.keyStore.path is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.password is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.connection.ssl.trustStore.type is not a valid Spark Cassandra Connector variable.
No likely matches found.
So I'm not sure if this is supported or I'm just using the wrong property names.
I saw this ticket for release 1.2.3 of the connector, but I couldn't find an example of how to use it and it sounded like it may not support keystores. I'm using version 1.4.0-M1 of the connector.
Can anyone show me an example of how to configure SSL for the Spark Cassandra connector? Thanks.
Though I don't see any keystore configurations, the config variables below are working fine for me.
Note: I am using version 1.5.0-M1. I am not sure whether there is some other bug in the version you are using.
sparkConf.set("spark.cassandra.connection.ssl.enabled", "true");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.password", "password");
sparkConf.set("spark.cassandra.connection.ssl.trustStore.path", "jks file path");

Connecting to Cassandra with Spark

First, I bought the new O'Reilly Spark book and tried its Cassandra setup instructions. I've also found other Stack Overflow posts and various guides around the web. None of them work as-is. Below is as far as I could get.
This is a test with only a handful of records of dummy test data. I am running the most recent Cassandra 2.0.7 VirtualBox VM provided by plasetcassandra.org, linked from the main Cassandra project page.
I downloaded the Spark 1.2.1 source, got the latest Cassandra connector code from GitHub, and built both against Scala 2.11. I have JDK 1.8.0_40 and Scala 2.11.6 set up on Mac OS 10.10.2.
I run the Spark shell with the Cassandra connector loaded:
bin/spark-shell --driver-class-path ../spark-cassandra-connector/spark-cassandra-connector/target/scala-2.11/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
Then I do what should be a simple row-count test on a test table of four records:
import com.datastax.spark.connector._
sc.stop
val conf = new org.apache.spark.SparkConf(true).set("spark.cassandra.connection.host", "192.168.56.101")
val sc = new org.apache.spark.SparkContext(conf)
val table = sc.cassandraTable("mykeyspace", "playlists")
table.count
I get the following error. What is confusing is that it gets errors trying to find Cassandra at 127.0.0.1, yet it also recognizes the hostname that I configured, which is 192.168.56.101.
15/03/16 15:56:54 INFO Cluster: New Cassandra host /192.168.56.101:9042 added
15/03/16 15:56:54 INFO CassandraConnector: Connected to Cassandra cluster: Cluster on a Stick
15/03/16 15:56:54 ERROR ServerSideTokenRangeSplitter: Failure while fetching splits from Cassandra
java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160
<snip>
java.io.IOException: Failed to fetch splits of TokenRange(0,0,Set(CassandraNode(/127.0.0.1,/127.0.0.1)),None) from all endpoints: CassandraNode(/127.0.0.1,/127.0.0.1)
BTW, I can also use a configuration file at conf/spark-defaults.conf to do the above without having to close/recreate a Spark context or pass in the --driver-class-path argument. I ultimately hit the same error, though, and the above steps seem easier to communicate in this post.
Any ideas?
Check the rpc_address config in the cassandra.yaml file on your Cassandra node. It's likely that the Spark connector is picking up that value from the system.local/system.peers tables, and it may be set to 127.0.0.1 in your cassandra.yaml.
The Spark connector uses Thrift to get token range splits from Cassandra. I'm betting this will eventually be replaced, as C* 2.1.4 has a new table called system.size_estimates (CASSANDRA-7688). It looks like it gets the host metadata to find the nearest host and then makes the query over Thrift on port 9160.
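If you want to see for yourself which addresses the node advertises, a quick check is to query system.local directly with the DataStax Java driver (a sketch; it assumes the driver bundled inside the connector assembly jar is on the shell classpath, and which address columns exist varies by Cassandra version):
// Sketch: dump system.local to inspect the addresses the node advertises to clients
import com.datastax.driver.core.Cluster
val cluster = Cluster.builder().addContactPoint("192.168.56.101").build()
val session = cluster.connect()
println(session.execute("SELECT * FROM system.local").one())
cluster.close()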
