SparkR::dapply library not recognized - apache-spark

Introduction:
I've installed some packages on a Databricks cluster using install.packages on Databricks Runtime 9.1 LTS, and I want to run a UDF using R & Spark. My use case is to score some data in batch using Spark, with either SparkR or sparklyr; I've currently chosen SparkR::dapply. The main issue is that the installed packages don't appear to be available on the workers when using SparkR::dapply.
Code (info reduced and some revised for privacy):
library(SparkR)

install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")

# Read the data locally, then convert it to a Spark DataFrame
my_data <- read.csv('/dbfs/mnt/container/my_data.csv')
my_data_sdf <- as.DataFrame(my_data)

schema <- structType(structField("Var1", "integer"),
                     structField("Var2", "integer"),
                     structField("Var3", "integer"))

df1 <- SparkR::dapply(my_data_sdf, function(my_data) {
  # lda #
  # install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
  library(lda)
  return(my_data)  # return the local R data.frame the UDF receives, not the Spark DataFrame
}, schema)
display(df1)
Error message (some info redacted with 'X'):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9) (X0.XXX.X.X executor 0): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in library(lda) : there is no package called ‘lda’
Calls: compute -> computeFunc -> library
Execution halted
System / Hardware:
Azure Databricks
Databricks Runtime 9.1 LTS (min 2 workers max 10)
Worker hardware = Standard_DS5_v2
Driver hardware = Standard_D32s_v2
Notes:
If I use 'require' instead of 'library', no error is returned, but that's expected: 'require' just returns FALSE with a warning when a package can't be loaded rather than throwing an error.
I'm able to run SparkR::dapply and perform operations, but as soon as I add library(lda) I get an error, even though I've installed 'lda' and I'm on Databricks Runtime 9.1 LTS.
I'm using the recommended CRAN snapshot to install - https://learn.microsoft.com/en-us/azure/databricks/kb/r/pin-r-packages
I'm using Databricks Runtime 9.1 LTS, which (to my understanding) makes installed packages available to the workers - "Starting with Databricks Runtime 9.0, R packages are accessible to worker nodes as well as the driver node." - https://learn.microsoft.com/en-us/azure/databricks/libraries/notebooks-r-libraries
If I include install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/") inside dapply, it works without error (see the sketch after these notes), but the documentation doesn't suggest this is best practice.
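For reference, a minimal sketch of the install-inside-dapply workaround mentioned in the last note, with a guard so a worker only installs 'lda' when it isn't already available (the snapshot repo is the one from the question; the guard is my addition, not something from the original post):

df1 <- SparkR::dapply(my_data_sdf, function(my_data) {
  # Only install on this worker if 'lda' cannot already be loaded there
  if (!requireNamespace("lda", quietly = TRUE)) {
    install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
  }
  library(lda)
  my_data
}, schema)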
Questions:
How do I install R packages on Databricks clusters so they're available on all the nodes? What is the proper approach?
How do I make sure that my packages are available to SparkR::dapply?
Thoughts on including install.packages in the dapply function itself?
Should I try something other than SparkR::dapply?
Thanks everyone :)

After working with the Azure support team, the work-around / alternative option is to use an init script. The init script approach works well overall and plays nicely with Data Factory.
Example
From Notebook:
dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
dbutils.fs.put("/databricks/scripts/r-installs.sh","""R -e 'install.packages("caesar", repos="https://cran.microsoft.com/snapshot/2021-08-02/")'""", True)
display(dbutils.fs.ls("dbfs:/databricks/scripts/r-installs.sh"))
From Cluster UI:
Add init script from 'Init Scripts' tab by following prompts.
References
https://docs.databricks.com/clusters/init-scripts.html#cluster-scoped-init-script-locations
https://nadirsidi.github.io/Databricks-Custom-R-Packages/
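As a quick sanity check (my own addition, not part of the original answer): once the cluster has restarted with the init script attached, a trivial SparkR::dapply that loads the package should confirm it is visible on the workers. A sketch, assuming the 'caesar' package installed by the script above:

library(SparkR)
sdf <- as.DataFrame(data.frame(x = 1:3))
chk <- dapply(sdf, function(df) {
  library(caesar)  # should now load on the worker without error
  df$ok <- 1L
  df
}, structType(structField("x", "integer"), structField("ok", "integer")))
display(chk)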

An addition to the init script approach (which, by the way, works best) is to persist the installed binaries of the R packages in DBFS, where they can be accessed by the worker nodes as well. This approach is easier for interactive workloads, and also useful if you do not have the rights to modify the cluster config to add init scripts.
Please refer to this page for more details: https://github.com/marygracemoesta/R-User-Guide/blob/master/Developing_on_Databricks/package_management.md
The code below can be run inside a Databricks notebook; this step only needs to be done once. Afterwards you won't have to reinstall the packages, even if you restart your cluster.
%python
# Create a location in DBFS where we will finally store the installed packages.
# dbutils.fs paths are DBFS paths, so dbfs:/persist-loc is what later shows up as /dbfs/persist-loc via the FUSE mount.
dbutils.fs.mkdirs("dbfs:/persist-loc")
%sh
mkdir /usr/lib/R/persist-libs
%r
install.packages(c("caesar", "dplyr", "rlang"),
repos="https://cran.microsoft.com/snapshot/2021-08-02", lib="/usr/lib/R/persist-libs")
# Can even persist custom packages
# install.packages("/dbfs/path/to/package", repos=NULL, type="source", lib="/usr/lib/R/persist-libs")
%r
system("cp -R /usr/lib/R/persist-libs /dbfs/persist-loc", intern=TRUE)
Now just append the final persist location to .libPaths() in the R script where you use dapply - this can be done in the very first cell, and it will work just fine even with worker nodes. You won't have to install the packages again, which also saves time.
%r
.libPaths(c("/dbfs/persist-loc/persist-libs", .libPaths()))
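Tying this back to the dapply call from the question, a sketch of how it might look with the persisted location on the library path (this assumes 'lda' was also installed into /usr/lib/R/persist-libs and copied to /dbfs/persist-loc as above; the extra .libPaths() call inside the UDF is my own defensive addition, in case a worker's R session does not inherit the driver's setting):

.libPaths(c("/dbfs/persist-loc/persist-libs", .libPaths()))

df1 <- SparkR::dapply(my_data_sdf, function(my_data) {
  # Defensive: point this worker's R session at the persisted library location too
  .libPaths(c("/dbfs/persist-loc/persist-libs", .libPaths()))
  library(lda)
  my_data
}, schema)
display(df1)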

Related

Windows Job Runner from Linux Cluster (Enterprise)

From what I have found, the documentation is a bit sparse on the setup of job runner nodes. I am wondering if anyone has set up this config: a Rundeck Linux cluster with a Windows job runner. I was able to install the .jar and all that on the Windows node, and it appears in Runner Management and is able to communicate.
Where I am stuck, and where it gets ambiguous, is how to properly specify that the job runner should be used. This is my current setup:
Job runner installed, showing green in Runner Management, and assigned to my project
In the project config I have Runner selected as the Default Node Executor
Default File Copier is also set to runner-file-copier
Under Project nodes
I set up a node via the Node Wizard - here is the edited YAML:
mydomain:
  nodename: nodename
  hostname: jobrunnerhost.domain
  osFamily: windows
  node-executor: runner-node-exec
  file-copier: runner-file-copier
Under the jobs I have it set to the appropriate node.
I am getting this error when I try to run anything, whether a simple DIR command or a basic PowerShell script:
Execution failed: 28 in project Server.Validation.mynode: [Workflow result: , step failures: {1=Dispatch failed on 1 nodes: [mynode: COPY_ERROR: Reason: FILE_COPIER_NOT_FOUNDUnable to find file copier: runner-file-copier]}, Node failures: {mynode=[COPY_ERROR: Reason: FILE_COPIER_NOT_FOUNDUnable to find file copier: runner-file-copier]}, status: failed]
I have tried setting this up multiple ways. I feel as though I am missing a config option or another step somewhere. Any help is appreciated.

Spark Job fails connecting to oracle in first attempt

We are running a Spark job which connects to Oracle and fetches some data. Attempt 0 or 1 of the JDBCRDD task always fails with the error below; the task then completes on a subsequent attempt. As suggested on a few portals, we even tried the -Djava.security.egd=file:///dev/urandom Java option, but it didn't solve the problem. Can someone please help us fix this issue?
java.sql.SQLRecoverableException: IO Error: Connection reset by peer, Authentication lapse 59937 ms.
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:794)
at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:688)
The issue was indeed with java.security.egd. Setting it through the command line, i.e. -Djava.security.egd=file:///dev/urandom, was not working, so I set it through System.setProperty within the job. After that, the job no longer throws SQLRecoverableException.
This exception has nothing to do with Apache Spark. "SQLRecoverableException: IO Error:" is simply the Oracle JDBC driver reporting that its connection to the DBMS was closed out from under it while in use. The real problem is at the DBMS, for example if the session died abruptly. Please check the DBMS error log and share it with the question.
A similar problem is described here:
https://access.redhat.com/solutions/28436
The fastest way is to export the Spark system variable SPARK_SUBMIT_OPTS before running your job, like this:
export SPARK_SUBMIT_OPTS=-Djava.security.egd=file:dev/urandom
I'm using Docker, so for me the full command is:
docker exec -it spark-master \
  bash -c "export SPARK_SUBMIT_OPTS=-Djava.security.egd=file:dev/urandom && \
  /spark/bin/spark-submit --verbose --master spark://172.16.9.213:7077 /scala/sparkjob/target/scala-2.11/sparkjob-assembly-0.1.jar"
This exports the variable and then submits the job.

Error while running Zeppelin paragraphs in Spark on Linux cluster in Azure HdInsight

I have been following this tutorial in order to set up Zeppelin on a Spark cluster (version 1.5.2) in HDInsight, on Linux. Everything worked fine, and I managed to successfully connect to the Zeppelin notebook through the SSH tunnel. However, when I try to run any kind of paragraph, the first time I get the following error:
java.io.IOException: No FileSystem for scheme: wasb
After getting this error, if I try to rerun the paragraph, I get another error:
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
These errors occur regardless of the code I enter, even if there is no reference to HDFS. In other words, I get the "No FileSystem" error even for a trivial Scala expression, such as parallelize.
Is there a missing configuration step?
I am downloading the tarball from the script you pointed to as I type. But what I am guessing is that your Zeppelin and Spark installs are not set up to work with wasb. In order to get Spark to work with wasb, you need to add some JARs to the classpath. To do this, add something like the following to your spark-defaults.conf (the paths might be different in HDInsight; this is from HDP on IaaS):
spark.driver.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
spark.executor.extraClassPath /usr/hdp/2.3.0.0-2557/hadoop/lib/azure-storage-2.2.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/lib/microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.3.0.0-2557/hadoop/hadoop-azure-2.7.1.2.3.0.0-2557.jar
Once you have Spark working with wasb, the next step is to put those same JARs on the Zeppelin classpath. A good way to test your setup is to make a notebook that prints your env vars and classpath:
sys.env.foreach(println(_))
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
Also, looking at the install script, it is trying to pull the Zeppelin JAR from wasb; you might want to point that config somewhere else while you try some of these changes out (zeppelin.sh):
export SPARK_YARN_JAR=wasb:///apps/zeppelin/zeppelin-spark-0.5.5-SNAPSHOT.jar
I hope this helps. If you are still having problems I have some other ideas, but I would start with these first.

How to enable Spark mesos docker executor?

I'm working on integration between Mesos and Spark. For now, I can start SlaveMesosDispatcher in a Docker container, and I would also like to run the Spark executor in a Mesos Docker container. I did the following configuration for it, but I got an error; any suggestions?
Configuration:
Spark: conf/spark-defaults.conf
spark.mesos.executor.docker.image ubuntu
spark.mesos.executor.docker.volumes /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
spark.mesos.executor.home /root/spark
#spark.executorEnv.SPARK_HOME /root/spark
spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib
NOTE: Spark is installed in /home/test/workshop/spark, and all dependencies are installed.
After submitting SparkPi to the dispatcher, the driver job starts but fails. The error message is:
I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave b7e24114-7585-40bc-879b-6a1188cb65b6-S1
WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
/bin/sh: 1: ./bin/spark-submit: not found
Does anyone know how to map/set the Spark home in Docker for this case?
I think the issue you're seeing here is that the current working directory of the container isn't where Spark is installed. When you specify a Docker image for Spark to use with Mesos, it expects the default working directory of the container to be inside $SPARK_HOME, where it can find ./bin/spark-submit.
You can see that logic here.
It doesn't look like you're able to configure the working directory through Spark configuration itself, which means you'll need to build a custom image on top of ubuntu that simply does a WORKDIR /root/spark.

Spark is not started automatically on the AWS cluster - how to launch it?

A Spark cluster has been launched using the ec2/spark-ec2 script from within the branch-1.4 codebase.
I can log in to it, and it reports 1 master and 2 slaves:
11:35:10/sparkup2 $ec2/spark-ec2 -i ~/.ssh/hwspark14.pem login hwspark14
Searching for existing cluster hwspark14 in region us-east-1...
Found 1 master, 2 slaves.
Logging into master ec2-54-83-81-165.compute-1.amazonaws.com...
Warning: Permanently added 'ec2-54-83-81-165.compute-1.amazonaws.com,54.83.81.165' (RSA) to the list of known hosts.
Last login: Tue Jun 23 20:44:05 2015 from c-73-222-32-165.hsd1.ca.comcast.net
__| __|_ )
_| ( / Amazon Linux AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
Amazon Linux version 2015.03 is available.
But .. where are they?? The only java processes running are:
Hadoop: NameNode and SecondaryNode
Tachyon: Master and Worker
It is a surprise to me that the Spark Master and Workers are not started. When looking for the scripts to start them manually, it is not at all obvious where they are located.
Hints on why Spark did not start automatically, and where the launch scripts live, would be appreciated. (In the meantime I will do an exhaustive find / -name start-all.sh.)
And .. survey says:
[root@ip-10-151-25-94 etc]$ find / -name start-all.sh
/root/persistent-hdfs/bin/start-all.sh
/root/ephemeral-hdfs/bin/start-all.sh
Which suggests to me that Spark was not even installed??
Update: I wonder, is this a bug in 1.4.0? I ran the same set of commands with 1.3.1 and the Spark cluster came up.
There was a bug in the Spark 1.4.0 provisioning script (which spark-ec2 clones from the GitHub repository https://github.com/mesos/spark-ec2/) with similar symptoms: Apache Spark hadn't started. The reason was that the provisioning script failed to download the Spark archive.
Check whether Spark was downloaded and uncompressed on the master host with ls -altr /root/spark; there should be several directories there. From your description it looks like the /root/spark/sbin/start-all.sh script is missing.
Also check the contents of /tmp/spark-ec2_spark.log (cat /tmp/spark-ec2_spark.log); it should have information about the uncompressing step.
Another thing to try is to run spark-ec2 with a different provisioning script branch by adding --spark-ec2-git-branch branch-1.4 to the spark-ec2 command line arguments.
Also, when you run spark-ec2, save all the output and check whether there is anything suspicious:
spark-ec2 <...args...> 2>&1 | tee start.log
