Spark RDD.pipe run bash script as a specific user - apache-spark

I noticed that RDD.pipe(Seq("/tmp/test.sh")) runs the shell script as the yarn user. That is problematic because it allows the Spark user to access files that should only be accessible to the yarn user.
What is the best way to address this?
Calling sudo -u sparkuser is not a clean solution; I would hate to even consider that.

I am not sure whether Spark is at fault for treating pipe() differently, but I opened a similar issue on JIRA: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-26101
Now on to the problem. Apparently, on a YARN cluster Spark's pipe() asks for a container, and whether your Hadoop is nonsecure or secured by Kerberos determines whether that container runs as yarn/nobody or as the user who launched it (your actual user).
Either use Kerberos to secure your Hadoop, or, if you don't want to go through securing it, set two configs in YARN so that containers are launched using the Linux users/groups. Note that the same users/groups must exist on all the nodes in your cluster; otherwise this won't work (perhaps use LDAP/AD to sync your users/groups).
Set these:
yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users = false
yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
Source: https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html
(this is the same even in Hadoop 3.0)
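In yarn-site.xml form, those two settings would look roughly like this (a sketch using only the property names and values given above):
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>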
This fix worked on Cloudera's latest CDH 5.15.1 (yarn-site.xml):
http://community.cloudera.com/t5/Batch-Processing-and-Workflow/YARN-force-nobody-user-on-all-jobs-and-so-they-fail/m-p/82572/highlight/true#M3882
Example:
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami"))
val c = piped.collect()
test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at repartition at <console>:25
piped: org.apache.spark.rdd.RDD[String] = PipedRDD[5] at pipe at <console>:25
c: Array[String] = Array(maziyar)
After setting those configs in yarn-site.xml and syncing the users/groups across all the nodes, this returns the username that started the Spark session.

Related

How can Spark read / write from Azurite

I am trying to read from (and eventually write to) Azurite (version 3.18.0) using Spark (3.1.1).
I can't work out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that I'm running; the uri val is one of the values below:
val uri = ...
val spark = SparkSession.builder()
.appName(appName)
.master("local")
.config("spark.driver.host", "127.0.0.1").getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set(s"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
uri value - which is the correct one?
http://127.0.0.1:10000/container1/file1.avro
I get an UnsupportedOperationException when I call spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which doesn't support listStatus.
wasb://container1#devstoreaccount1.blob.core.windows.net/file1.avro
Spark tries to authenticate against the Azure servers (and fails, for obvious reasons).
I have tried to follow this stackoverflow post without success.
I have also tried removing the blob.core.windows.net configuration suffix, but then I don't know how to give Spark the endpoint for the Azurite container.
So my question is: what are the correct configurations to give Spark so that it can read from Azurite, and what is the correct file path format to pass as the URI?
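One thing that may be worth checking (a hedged sketch, not a confirmed fix): calling spark.conf.set with a spark.hadoop.* key after the session already exists may not end up in the underlying Hadoop configuration, since that prefix is normally applied when the session is built. Setting the same keys from the question directly on the SparkContext's Hadoop configuration sidesteps that question:
// Hedged sketch: set the Hadoop filesystem keys directly (no "spark.hadoop." prefix)
// so the WASB filesystem can see them regardless of when the session was created.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoopConf.set("fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
hadoopConf.set("fs.azure.account.key.devstoreaccount1.blob.core.windows.net", "<azurite account key>")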

How to print out the Spark connection of a Spark session?

Suppose I run the pyspark command and get a global variable spark of type SparkSession. As I understand it, this spark holds a connection to the Spark master. Can I print out the details of this connection, including the hostname of the Spark master?
For basic information you can use the master property:
spark.sparkContext.master
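Depending on how the session was started, this returns a master URL such as 'local[*]', 'yarn', or 'spark://<host>:7077' (illustrative values).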
To get details on YARN you might have to dig through hadoopConfiguration:
hadoopConfiguration = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConfiguration.get("yarn.resourcemanager.hostname")
or
hadoopConfiguration.get("yarn.resourcemanager.address")
When submitted to YARN, Spark uses the Hadoop configuration to determine the resource manager, so these values should match the ones in the configuration placed in HADOOP_CONF_DIR or YARN_CONF_DIR.

Apache Spark: resulting file being created at worker node instead of master node

I configured one master on my local PC and a worker node inside VirtualBox, and the result file is created on the worker node instead of being sent back to the master node. I wonder why that is.
Is it because my worker node cannot send the result back to the master node? How can I verify that?
I use Spark 2.2.
I use the same username on the master and worker nodes.
I also configured passwordless SSH.
I tried --deploy-mode client and --deploy-mode cluster.
I tried once, then switched the master/worker nodes and got the same result.
val result = joined.distinct()
result.write.mode("overwrite").format("csv")
.option("header", "true").option("delimiter", ";")
.save("file:///home/data/KPI/KpiDensite.csv")
Also, I load the input file like this:
val commerce = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true")
.option("delimiter", "|").load("file:///home/data/equip-serv-commerce-infra-2016.csv").distinct()
But why must I pre-place the file at the same path on both the master and worker nodes? I don't use YARN or Mesos right now.
You are exporting to a local file system path, which tells Spark to write to the file system of the machine running the code. On a worker, that is the worker machine's file system.
If you want the data to be stored on the driver's file system (not the master's; you'll need to know where the driver is running on your YARN cluster), then you need to collect the RDD or DataFrame and use normal IO code to write the data to a file, as sketched below.
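A minimal sketch of that collect-and-write approach, assuming the result DataFrame from the question and a hypothetical driver-local output path (only sensible when the result fits in driver memory):
import java.io.PrintWriter

// Bring all rows back to the driver, then write them with plain Java IO.
val rows = result.collect()
val writer = new PrintWriter("/home/data/KPI/KpiDensite.csv") // driver-local path (hypothetical)
try {
  writer.println(result.columns.mkString(";"))        // header row
  rows.foreach(r => writer.println(r.mkString(";")))  // one CSV line per row
} finally {
  writer.close()
}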
The easiest option, however, especially if you are running your application in cluster mode, is to use a distributed storage system such as HDFS (.save("hdfs://master:port/data/KPI/KpiDensite.csv")) or to export to a database (via JDBC or a NoSQL store).

Write spark event log to local filesystem instead of hdfs

I want to redirect the event log of my Spark applications to a local directory like "/tmp/spark-events" instead of "hdfs://user/spark/applicationHistory".
I set the "spark.eventLog.dir" variable to "file:///tmp/spark-events" in Cloudera Manager (Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf).
But when I restart Spark, spark-conf contains (spark.eventLog.dir=hdfs://nameservice1file:///tmp/spark-eventstmp/spark) and it does not work.
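For reference, the end state the safety-valve entry is presumably meant to produce in spark-defaults.conf is simply (a sketch of the intended configuration, not a Cloudera-specific fix):
spark.eventLog.enabled=true
spark.eventLog.dir=file:///tmp/spark-events
The mangled value above looks as though the override was concatenated onto the existing HDFS default rather than replacing it.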

Spark: Exception in thread "main" akka.actor.ActorNotFound:

I am submitting my Spark jobs from a local laptop to a remote standalone Spark cluster (spark://IP:7077). The submission succeeds, but I do not get any output, and it fails after some time. When I check the workers on my cluster, I find the following exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver@localhost:54561/]/user/CoarseGrainedScheduler]
When I run the same code on my local system (local[*]), it runs successfully and gives the output.
Note that I run it in Spark Notebook. The same application runs successfully on the remote standalone cluster when I submit it via the terminal using spark-submit.
Am I missing something in the configuration of notebook? Any other possible causes?
The code is very simple.
Detailed exception:
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Actor[akka.tcp://sparkDriver@localhost:54561/]/user/CoarseGrainedScheduler]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:269)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:512)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:545)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:535)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:91)
at akka.actor.ActorRef.tell(ActorRef.scala:125)
at akka.dispatch.Mailboxes$$anon$1$$anon$2.enqueue(Mailboxes.scala:44)
at akka.dispatch.QueueBasedMessageQueue$class.cleanUp(Mailbox.scala:438)
at akka.dispatch.UnboundedDequeBasedMailbox$MessageQueue.cleanUp(Mailbox.scala:650)
at akka.dispatch.Mailbox.cleanUp(Mailbox.scala:309)
at akka.dispatch.MessageDispatcher.unregister(AbstractDispatcher.scala:204)
at akka.dispatch.MessageDispatcher.detach(AbstractDispatcher.scala:140)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:203)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
at akka.actor.ActorCell.terminate(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.run(Mailbox.scala:218)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Sample code
val logFile = "hdfs://hostname/path/to/file"
val conf = new SparkConf()
.setMaster("spark://hostname:7077") // as appears on hostname:8080
.setAppName("myapp")
.set("spark.executor.memory", "20G")
.set("spark.cores.max", "40")
.set("spark.executor.cores","20")
.set("spark.driver.allowMultipleContexts","true")
val sc2 = new SparkContext(conf)
val logData = sc2.textFile(logFile)
val numAs = logData.filter(line => line.contains("hello")).count()
val numBs = logData.filter(line => line.contains("hi")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
Update:
The above issue can be avoided by including the driver's IP address (i.e., the local laptop's public IP) in the application code. This can be done by adding the following line when building the Spark conf:
.set("spark.driver.host",YourSystemIPAddress)
However, there can be an issue if the driver's IP address is behind NAT; in that case the workers will not be able to reach it.
When you say "spark notebook" I am assuming you mean the github project https://github.com/andypetrella/spark-notebook?
I would have to look into specifics of notebook but I notice your worker is trying to connect to a master on "localhost".
For normal Spark configuration, on the worker set SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh and see if that helps, Even if you are running on a single machine in standalone mode, set this. In my experience Spark doesn't always resolve hostnames properly so starting from a baseline of all IPs is a good idea.
The rest is general info; see if it helps with your specific issue:
If you are submitting to a cluster from your laptop, you use --deploy-mode cluster to tell your driver to run on one of the worker nodes (see the sketch below). This creates an extra consideration of how you set up your classpath, because you don't know which worker the driver will run on.
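A hedged sketch of such a submission (the master address, class name, and jar name here are hypothetical placeholders):
spark-submit \
  --master spark://192.168.1.10:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  myapp.jar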
Here's some general info in the interest of completeness: there is a known Spark bug around hostnames resolving to IP addresses. I am not presenting this as the complete answer in all cases, but I suggest starting from a baseline of using only IP addresses and only the single config SPARK_MASTER_IP. With just those two practices my clusters work; all the other configs, or using hostnames, just seem to muck things up.
So in your spark-env.sh get rid of SPARK_LOCAL_IP and change SPARK_MASTER_IP to an IP address, not a hostname.
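For example, a minimal conf/spark-env.sh along those lines (the IP is a hypothetical placeholder; SPARK_LOCAL_IP is intentionally absent):
SPARK_MASTER_IP=192.168.1.10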
I have treated this more at length in this answer.
For more completeness here's part of that answer:
Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you password-less SSH to the worker from the master box? Per the 1.5.2 docs you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
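For reference, conf/slaves on the master is just a list of worker addresses, one per line; in keeping with the all-IPs approach it might look like (hypothetical IPs):
192.168.1.11
192.168.1.12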
You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.
I think problems with the slaves file on the master node and with password-less SSH can lead to errors similar to what you are seeing.
Per the answer I cross-linked, there is also an old bug, but it's not clear how that bug was resolved.
