Create a Local PySpark Cluster Using Kubernetes - apache-spark

Hope you are doing great.
I have multiple laptops at home, and I want to use one of them as a master and the rest as workers.
How do I connect them?
Should I install Spark on all machines?
Should I install Kubernetes on all machines?
Basically, how does this work? I am lost and would love to get a roadmap!
Note: don't worry about giving me the full technical roadmap (commands, etc.); I only need the steps and relevant articles in the right order, because I feel lost.
Thank you in advance.
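For a rough sense of what the end state looks like once Kubernetes is running across the laptops, a PySpark session pointed at the cluster's API server might look like the sketch below. The API server address, namespace, executor count, and image tag are placeholders, not part of the original question.

```python
# A minimal smoke test, assuming Kubernetes is already running across the
# laptops and the submitting machine can reach the API server.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("laptop-cluster-smoke-test")
    .master("k8s://https://192.168.1.10:6443")                 # Kubernetes API server (placeholder)
    .config("spark.executor.instances", "2")                   # e.g. one executor per worker laptop
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.kubernetes.container.image", "apache/spark-py:v3.4.1")  # example image tag
    .getOrCreate()
)

# If the executors come up as pods on the worker laptops, this sum runs there.
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()
```

In broad strokes, Kubernetes runs on every laptop (one control plane, the rest joining as nodes), while Spark itself only needs to be installed on the machine you submit from, because the executors run inside containers built from a Spark image.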

Related

What is the best way to set this working environment for my research group?

We recently got a supercomputer (I will call it the "cluster"; it has 4 GPUs and a 12-core processor with decent storage and RAM) for our lab for machine learning research. A Linux distro (most likely CentOS or Ubuntu, depending on your suggestions, of course) will be installed on the machine. We want to design the remote access in such a way that we have the following user hierarchy:
Admin (1 person, the professor): This will be the only superuser of the cluster.
Privileged User (~3 people, PhD students): These will be the more tech-savvy or long-term researchers of the lab, who will have their own user accounts on the cluster. They should be able to set up their own environments (through Docker or conda), develop their projects remotely, and transfer files in and out of the cluster freely.
Regular User (~3 people, Master's students): We expect these users to only interact with the cluster for its computing capabilities and the data it stores. They should not have their own user accounts on the cluster; it is OK if they can only use Jupyter notebooks. They should be able to access the read-only data on the cluster, as the data we are working on is too large for them to download locally. However, they should not be able to change anything within the cluster, and should only have their notebooks and a number of output files there, which they should be able to download to their local systems whenever necessary for reporting purposes.
We also want to allocate only a certain portion of our computing capabilities to type-3 users. The others should be able to access all the capabilities when they need to.
For all users, it should be easy to access the cluster from whatever OS they have on their personal computers. For types 1 and 2, I think PyCharm for remote development of .py files and tunneling for Jupyter notebooks is the best option.
I did a lot of research on this, but since I don't have an IT background, I cannot be sure whether the following approach would work.
Set up JupyterHub for type-3 users. This way we don't have to give these users their own accounts on the cluster. However, I am not sure about its GPU support. According to here, we can only limit CPU per user. Also, will they be able to access the data under the admin's home directory when we set up the hub, or do we have to duplicate the data for that? We only want them to be able to access specific portions of the data (the ones related to whatever project they are working on, since they sign a confidentiality agreement for only that project). Is this possible with JupyterHub? (See the config sketch at the end of this question.)
The rest (type 1 and type 2) will have their (sudo or not) user accounts on the cluster. For this case, is there a UI workaround so that users can transfer files to and from the cluster more easily (i.e. without having to use scp)? Is FileZilla an option, for example?
Finally, it would help if the type-2 users could resolve the issues type-3 users have, so that they don't have to refer to the professor each time there is a problem. But as far as I know, you have to be a superuser to manage things in JupyterHub.
If anyone has had to set up this kind of environment in their own lab and can share their experiences, I would be grateful.
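A minimal sketch of the kind of JupyterHub setup described above, assuming DockerSpawner is used for the type-3 users; the image, the limits, and the /data path are placeholders:

```python
# jupyterhub_config.py - a minimal sketch, assuming DockerSpawner for type-3 users.
c = get_config()  # provided by JupyterHub when it loads this file

c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/scipy-notebook:latest"   # example single-user image

# Cap what each type-3 user's server can consume.
c.Spawner.cpu_limit = 2        # CPU cores per single-user server
c.Spawner.mem_limit = "8G"     # RAM per single-user server

# Shared dataset mounted read-only; each user gets a private writable workspace.
c.DockerSpawner.read_only_volumes = {"/data/project-x": "/home/jovyan/data"}
c.DockerSpawner.volumes = {"jupyterhub-user-{username}": "/home/jovyan/work"}
```

Per-user GPU limits are not covered by this sketch; giving the spawned containers GPU access needs extra spawner configuration on top of it.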

Cannot start multiple nodes for RabbitMQ cluster on Windows

I am trying to set up multiple RabbitMQ nodes in a Windows environment. Based on the official guide, I am setting up 2 nodes, but that's where the problem starts.
My first node is successfully created and is up and running, but I cannot start the 2nd node.
See the output below. (All the commands are executed from an Admin cmd. Erlang and Python are also present. All precautionary steps were taken as per the guide, along with the management plugin.)
You can see above that my "hare" node is running, but the second node, "rabbit", fails to start.
I also replaced the cookie, as suggested in a similar Stack Overflow question. Still the problem persists.
Any help is appreciated. Thanks.
For anyone facing a similar problem: I changed my approach and was able to run a RabbitMQ cluster successfully.
I moved my cluster to Linux and faced no problems. Although this satisfied my current needs, any solution to the above problem is welcome.
Cheers.

Set cassandra.yaml settings like seeds through a script

What is the best way to set cassandra.yaml settings? I am using Docker containers and want to automate the process of setting cassandra.yaml settings like seeds, listen_address, and rpc_address.
I have seen something like this in other yaml tools: <%= ENV['envsomething'] %>
Thanks in advance
I don't know about the "best" way, but when I set up a scripted cluster of Cassandra servers on a few Vagrant VMs, I used Puppet to set the seeds and so on in cassandra.yaml.
I did write some scripting that used PuppetDB to keep track of the addresses of the hosts, but this wasn't terrifically successful. The trouble was that the node that came up first only had itself in the list of seeds, and so tended to form a cluster on its own. Then the rest would come up as a separate cluster, so I had to take down the solo node, clear it out, and restart it with the correct config.
If I did it now, I would assign static IP addresses and then use them to fill in the templates for the cassandra.yaml files on all the nodes (see the sketch below for one way to do that). That way the nodes would hopefully come up with the right idea about the other cluster members.
I don't have any experience with Docker, but they do say the way to use Puppet with Docker is to run Puppet on the Docker container before starting it up.
Please note that you need a lot of memory to make this work. I had a machine with 16GB and that was a bit dubious.
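As an illustration of that templating idea, a small script run before Cassandra starts (for example from the container's entrypoint) could fill in the settings from environment variables. The variable names and the config path below are assumptions for illustration, not a standard interface:

```python
#!/usr/bin/env python3
# Sketch: rewrite cassandra.yaml from environment variables before starting Cassandra.
import os
import yaml  # PyYAML

CONF_PATH = "/etc/cassandra/cassandra.yaml"  # assumed location

with open(CONF_PATH) as f:
    conf = yaml.safe_load(f)

conf["listen_address"] = os.environ.get("CASSANDRA_LISTEN_ADDRESS", "127.0.0.1")
conf["rpc_address"] = os.environ.get("CASSANDRA_RPC_ADDRESS", "0.0.0.0")
# The seed list lives inside the seed_provider block of cassandra.yaml.
conf["seed_provider"][0]["parameters"][0]["seeds"] = os.environ.get(
    "CASSANDRA_SEEDS", "127.0.0.1"
)

with open(CONF_PATH, "w") as f:
    yaml.safe_dump(conf, f, default_flow_style=False)
```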
Thank you for the information.
I was considering using https://github.com/go-yaml/yaml
But this guy did the trick: https://github.com/abh1nav/docker-cassandra
Thanks
If you're running Cassandra in Docker, use this as an example: https://github.com/tobert/cassandra-docker. You can override the cluster name/seeds when launching, so whatever config management tool you use for deploying your containers could do something similar.

Which Cassandra node to use?

I'm new to Cassandra.
I've deployed a Cassandra 2.0 cluster and everything works as expected.
There's one thing I don't understand, though.
From within a web app that uses the database, to which node should I connect? I know they're all the same, but how do I know that the node I pick isn't down?
I read that you're not supposed to use a load balancer, so I'm a little confused.
Any help appreciated. Thanks!
Depending on which driver you are using to connect, you can typically provide more than one node to connect to, usually in the form "node1,node2" (e.g. "192.168.1.1,192.168.1.2").
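For example, with the DataStax Python driver you can pass several contact points, and the driver discovers the rest of the ring and fails over if one of them is down; the IP addresses and keyspace name below are placeholders:

```python
# Sketch: connect with multiple contact points using the DataStax Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["192.168.1.1", "192.168.1.2"], port=9042)
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Simple check that the connection works.
row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)

cluster.shutdown()
```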

Accessing Matlab MDCS Cluster over SSH

I just installed MATLAB Distributed Computing Server on a bunch of machines and it works, but only for those physically connected to the cluster's network. For remote access, those machines are 2 SSH hops away. How is this problem usually solved? I thought of setting up a VPN, but to me this seems like a last resort.
What I want is for everybody in the lab, using their own versions of MATLAB with the correct toolbox, to just run their code on the cluster somewhat effortlessly. I guess I could ask everybody to just tarball their files and access a remote installation of MATLAB, somehow forwarding the GUI session (VNC or X forwarding), but that seems ugly.
Any help?
It is possible to set up "remote access" to a cluster running MDCS so that clients without direct access can submit jobs there. The documentation for this starts here:
http://www.mathworks.com/help/mdce/configure-parallel-computing-products-for-a-generic-scheduler.html
I'm not quite sure how to configure things so that submission can work across two SSH connections - the example integration scripts shipped with MDCS all presume only one. However, it should be possible (a rough sketch of the two-hop piece follows this answer), provided that:
The client can put the job and task files somewhere the execution nodes can see them
The client can trigger the appropriate qsub or whatever on the cluster headnode
You might also consider simply contacting MathWorks installation support.
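To make the two-hop part concrete, the client-side step that triggers the scheduler could look roughly like the sketch below, which tunnels through a gateway machine with paramiko. The hostnames, the username, and the job script path are placeholders, and the real MDCS integration scripts would wrap something like this rather than replace it:

```python
# Sketch: run a scheduler command on the headnode across two SSH hops.
import paramiko

JUMP_HOST = "gateway.example.edu"     # first hop, reachable from the lab (placeholder)
HEAD_NODE = "cluster-head.internal"   # second hop, only visible from the gateway (placeholder)
USER = "labuser"

# First hop: connect to the gateway.
jump = paramiko.SSHClient()
jump.set_missing_host_key_policy(paramiko.AutoAddPolicy())
jump.connect(JUMP_HOST, username=USER)

# Open a channel from the gateway to the headnode's SSH port.
channel = jump.get_transport().open_channel(
    "direct-tcpip", (HEAD_NODE, 22), ("127.0.0.1", 0)
)

# Second hop: connect to the headnode over that channel and submit the job.
head = paramiko.SSHClient()
head.set_missing_host_key_policy(paramiko.AutoAddPolicy())
head.connect(HEAD_NODE, username=USER, sock=channel)

_, stdout, _ = head.exec_command("qsub /shared/jobs/matlab_job.sh")
print(stdout.read().decode())

head.close()
jump.close()
```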
