Getting Started with Mobius SparkClr (on Linux) - apache-spark

I am looking to try the C# driver with an existing (standalone) Spark cluster (on Ubuntu Linux) which I happily interact with via Python or Scala.
I am unclear as to how to run a simple C# example, having downloaded the latest Mobius release to the Linux box. Specifically, I am unclear about the two extra parameters required for the CLR spark-submit (over and above the ones that are normally required). I am encountering various errors when I try to follow the submit args as documented (or I have misunderstood the instructions).
Firstly, for --exe, does one simply point to the .exe file, or is it required to pass: --exe [mono] [my_app.exe] [params]?
Secondly, remote-spark-clr seems to insist on an HDFS path, but I am running Spark without HDFS. Is HDFS actually necessary?
Thirdly, and related to question two: if distributing exe/packages for workers, must these also be in an HDFS path, or can I put them somewhere sensible on the "regular" file system?
In short, I am looking for confirmation that HDFS is not required, and for a simple one-liner submit example that can run an exe in some location. The combinations I have tried are not working for me, sadly.

Running Mobius on Linux requires a small trick:
Create shell scripts that launch your executables using mono.
Add the extension .exe to your shell scripts so that they are accepted by sparkclr-submit.
Make sure your shell scripts have Unix (LF) line endings - we had issues when they had CRLF line endings.
If your application is called Driver.exe, I recommend creating a file driver.sh.exe with the following content:
#!/bin/sh
exec mono ./Driver.exe "$@"
Similarly, create a file CSharpWorker.sh.exe with the following content:
#!/bin/sh
exec mono ./CSharpWorker.exe "$@"
In your App.config set the following value in appSettings:
<add key="CSharpWorkerPath" value="CSharpWorker.sh.exe"/>
Finally, when submitting your application, use the following arguments:
$SPARKCLR_HOME/scripts/sparkclr-submit.sh \
--master yarn \
--deploy-mode client \
--exe driver.sh.exe \
/path/to/driver
Note that the --exe argument only takes the name of the file; the path is the next argument.
You can place your applications on the regular file system (no need to use HDFS), but in my experience, Mobius will internally use HDFS to distribute the application to the workers. I don't know if you can avoid it.
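For a simple one-liner against a standalone (non-YARN) master, a sketch along the lines below should be close - the master URL and the application directory are placeholders you will need to adjust, and the flag set simply mirrors the yarn example above:
$SPARKCLR_HOME/scripts/sparkclr-submit.sh \
--master spark://your-master-host:7077 \
--deploy-mode client \
--exe driver.sh.exe \
/path/to/driver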

Related

Can run pyspark.cmd but not pyspark from command prompt

I am trying to get pyspark set up for Windows. I have Java, Python, Hadoop, and Spark all set up, and the environment variables are, I believe, set as I've been instructed elsewhere. In fact, I am able to run this from the command prompt:
pyspark.cmd
And it will load up the pyspark interpreter. However, I should be able to run pyspark unqualified (without the .cmd), and python importing won't work otherwise. It does not matter whether I navigate directly to spark\bin or not, because I do have spark\bin added to the PATH already.
.cmd is listed in my PATHEXT variable, so I don't get why the pyspark command by itself doesn't work.
Thanks for any help.
While I still don't know exactly why, I think the issue somehow stemmed from how I unzipped the Spark tar file. Within the spark\bin folder, I was unable to run any .cmd programs without the .cmd extension included, but I could do that in basically any other folder. I redid the unzip and the problem no longer existed.

Running Matlab code on a cluster

I have a university account for the university's cluster, but I don't know how I can use it to run my Matlab code. Could anyone help? I connect to the cluster by typing the command below in the terminal of my laptop:
ssh myusername@192.168.194.222
Then it asks me to type my password. After that, the text below appears:
Welcome to gav 9.1.1 (3.12.60-ql-generic-9.1-74) based on Ubuntu 14.04.5 LTS
Last login: Sun Apr 16 10:45:49 2017 from 192.168.41.213
gav:~ >
How can I run my code after these processes? Could anyone help me?
It looks like you have a Linux shell, so you can run your script (for instance yourScript.m)
> matlab -nojvm -nodisplay -nosplash < yourScript.m
(see also https://uk.mathworks.com/help/matlab/ref/matlablinux.html)
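If you need to run the script unattended and keep a log of its output, a variant like the one below should also work; the script name and log file are just placeholders, and the try/catch is only there so that an error does not leave Matlab hanging:
matlab -nodisplay -nosplash -nojvm -r "try, yourScript; catch e, disp(getReport(e)); end; exit" > yourScript.log 2>&1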
As far as I know, there are two possibilities:
Conventional Matlab is installed on the Cluster
The Matlab Distributed Computing server is installed on the cluster
Conventional Matlab is installed on the Cluster
You execute Matlab on the cluster as you would on your local computer. I guess that you work on Windows on your local computer, given that you quote a simple shell prompt in your question ;) All right, all right, bad psychic skillz ;) see edit below.
What you see is the cluster awaiting a program name to execute. This is called the "Shell". Google "Linux shell tutorial" or start with this tutorial to get information about how to operate a Linux system without a graphical desktop.
Try to start matlab by simply typing matlab after the text you've seen. If it works, you see Matlab's welcome message and the Matlab prompt as you would see it in Matlab's command window on your local PC.
Bonus: you can try to execute Matlab on the cluster but see a graphical interface by replacing your ssh call with ssh -X myusername@192.168.194.222, so add an additional -X.
Upload your Matlab scripts to the cluster, for example by using WinSCP (tutorial)
Execute your Matlab functions like you would locally by navigating into the correct folder and typing the function name.
EDIT: As you use Linux, you may use gio mount ssh://myusername@192.168.194.222 to access your home folder on the cluster via your file manager. If that fails, try gvfs-mount ssh://myusername@192.168.194.222 (the old name of the tool). The packages gvfs-backends and gvfs-fuse (I assume that you use Ubuntu; other distributions may have different package names) must be installed for this; use your package manager to install them if you get an error like "command not found".
Distributed Computing Server
This provides a set of Matlab "Workers" which are sent tasks from your local Computer. You use your local Matlab installation to connect to the Distributed computing server. Start with the Matlab Help Pages for the Distributed Computing Server

How to launch programs in Apache Spark?

I have a “myprogram.py” and a “myprogram.scala” that I need to run on my Spark machine. How can I upload and launch them?
I have been using the shell to do my transformations and call actions, but now I want to launch a complete program on the Spark machine instead of entering single commands every time. I also believe that will make it easier for me to make changes to my program instead of starting to enter commands in the shell.
I did a standalone installation on Ubuntu 14.04, on a single machine (not a cluster), using Spark 1.4.1.
I went through the Spark docs online, but I only found instructions on how to do that on a cluster. Please help me with that.
Thank you.
The documentation to do this (as commented above) is available here: http://spark.apache.org/docs/latest/submitting-applications.html
However, the code you need is here:
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
You'll need to compile the scala file using sbt (documentation here: http://www.scala-sbt.org/0.13/tutorial/index.html)
Here's some information on the build.sbt file you'll need in order to grab the right dependencies: http://spark.apache.org/docs/latest/quick-start.html
Once the scala file is compiled, you'll send the resulting jar using the above submit command.
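Roughly, the end-to-end workflow looks like the sketch below; the class name, project path and jar filename are assumptions that depend on your build.sbt, so adjust them to whatever sbt actually produces:
# compile and package from the root of your sbt project
sbt package
# then, from the Spark directory, submit the jar that sbt produced
./bin/spark-submit \
--class "MyProgram" \
--master local[*] \
/path/to/myproject/target/scala-2.10/myprogram_2.10-1.0.jar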
To put it simply:
In a Linux terminal, cd to the directory where Spark is unpacked/installed.
Note, this folder normally contains subfolders like “bin”, “conf”, “lib”, “logs” and so on.
To run the Python program locally with simple/default settings, type command
./bin/spark-submit --master local[*] myprogram.py
More complete descriptions can be found in the answers above, as zero323 and ApolloFortyNine described.

How to run a Scala program via cron?

I wrote a small Scala application. I have 2 classes in one source file including the App trait runner to start the program. It works just fine when I run it in the terminal:
scalac update.scala // compiling
scala update // run it
Now I want to run it with a cron job. For this I ran sudo crontab -e and added this:
*/2 * * * * scala /usr/bin/local/update
and made the script executable, but nothing has happened so far. I'm not sure how to do it:
Do I have to make a jar file for this?
Do I have to add this before my classes or not?
#!/bin/sh
exec scala -savecompiled "$0" "$@"
!#
Does anyone have some experience with this?
Thanks in advance.
I suspect scala isn't in $PATH where cron can see it.
Try the following in a shell session:
$ which scala
Which should output something like "/opt/scala/2.9.1/bin/scala". It could be in /usr/local, or any number of places - Java and the Unix filesystem don't really play together nicely.
So now you have two options:
Put the folder where scala lives in the system path (This will usually involve editing /etc/profile, but you don't specify the OS so I can't say for sure)
(Easier) Just change the cron entry to call /full/path/to/scala rather than just "scala" (see the sketch below)
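A minimal sketch of such a cron entry, assuming the scala binary lives where which scala reported it and that the compiled classes sit in /path/to/app - both paths, and the log file, are placeholders:
*/2 * * * * cd /path/to/app && /opt/scala/2.9.1/bin/scala -classpath . update >> /tmp/update.log 2>&1
Redirecting output to a log file is worth doing here, because cron swallows stdout/stderr and otherwise you will not see why the job fails.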
The scala command expects the name of a compiled runnable object or a file containing a scala script source (or a runnable jar file) as the thing to run.
If update.scala contains object update extends App (and no package declaration), then after scalac update.scala (which should have produced a bunch of *.class files), scala update is the right thing to run.
If the produced class files are not in the current directory then the -classpath option should be used to tell scala where to find them, as in eg. scala -classpath /usr/bin/local update, if the class files are indeed in /usr/bin/local.
Saying scala /usr/bin/local/update would make sense if the file /usr/bin/local/update (this exact name) contained scala script source (that is more or less a sequence of scala expressions not wrapped in a class or object).

Why doesn't SBT 0.7.7 work correctly on my Linux system? (case details inside)

First of all, I'd like to ask to correct my question title if something better comes into your mind.
Let's take a Lift REST web service example from the Simply Lift book by David Pollak here.
If I open a Windows (Windows XP SP3, all the updates, Oracle JDK 7) console inside the directory and run "sbt" (sbt.bat), everything works just fine. But when I try to do the same (using "./sbt") on Linux (XUbuntu 11.10, OpenJDK 6, OpenJDK 7, Oracle JDK 7 - tried all of them), SBT returns immediately (instead of going into SBT console mode), as if it has done its job. This means that if the command is just ./sbt, it returns almost immediately (after finishing the automatic project maintenance), and if it is ./sbt jetty-run, it just starts the web server and shuts it down immediately.
Moreover, a web service I've developed for a project of mine compiles and works ok on Windows, but can't be compiled (using ./sbt compile) on Linux (by the same version of SBT). The error is "source file '/.../src/main/scala/code/lib/FooBar.scala;src/main/scala/bootstrap/liftweb/Boot.scala' could not be found", where "FooBar.scala" is an object where I do all the serves (directly called from Boot.scala).
Any ideas of what can be the reason and how to fix it?
UPDATE: The reason for the first problem (SBT returning to the shell instead of offering the SBT console) seems to be that the file was checked out on Windows and had CR+LF instead of just LF line endings. The solution for the source files not being found was simply to run the clean command and recompile from scratch.
First what happens when you simply type:
java -jar sbt-launch.jar
directly from the command line in the folder where the sbt-launch.jar is placed? If the sbt-launch.jar is in the same folder as the sbt script, then you can edit the script to look like this:
#!/bin/sh
test -f ~/.sbtconfig && . ~/.sbtconfig
java -Xmx512M ${SBT_OPTS} -jar `dirname $0`/sbt-launch.jar "$@"
The dirname $0 construct returns the full path of the sbt script folder without the filename of the script. Using the $SBT_OPTS variable allows you to experiment with the various JVM options, like:
SBT_OPTS="-Xss2M -XX:+CMSClassUnloadingEnabled"
Although I would wait with these options as they are likely not the problem here (however, be sure to add CMSClassUnloadingEnabled later once SBT is working, as it ensures that Scala class definitions generated dynamically when running SBT get unloaded when they are unused, thus preventing memory errors - see more info here).
Also consider using one of
-Djline.terminal=scala.tools.jline.UnixTerminal
or even
-Djline.terminal=jline.UnsupportedTerminal
in your SBT_OPTS.
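Assuming you use the modified script above (which reads $SBT_OPTS), you can try each option by exporting it before starting sbt; which of the two options helps, if any, is something you will have to verify:
export SBT_OPTS="-Djline.terminal=jline.UnsupportedTerminal"
./sbt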
Finally, what happens if you try a newer version of SBT? (You could try running the SBT 0.11 version of the Lift example found here: https://github.com/lacy/lift-quickstart).
Replace your Linux script by:
#!/bin/bash
java -Xmx512M -jar `dirname $0`/sbt-launch.jar "$@"
On your settings:
Your script sets Xss (the thread stack size). On Linux you sometimes need to change (via ulimit) the setting for stack per thread (ulimit -s), as you may have conflicts at the OS level which may be triggering the "kill" on your threads (see the quick check below). Unless you have a very important reason to set this flag, just remove it and let the JVM manage this.
It may also be that you wanted to use Xms instead of Xss, although then 2M would make this flag irrelevant (a heap too small to be practical).
The flag -XX:+CMSClassUnloadingEnabled allows the GC to sweep the Perm space. It shouldn't be necessary for sbt. As you can read here, the PermGen options will only postpone PermGen issues, so if you have problems with PermGen when running Jetty, just add a bigger PermGen via -XX:MaxPermSize.
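As a quick check of the per-thread stack limit mentioned above (the 8192 value is only an example, not a recommendation):
# show the current per-thread stack size limit (in kB)
ulimit -s
# raise it for the current shell session if needed
ulimit -s 8192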
