I am very impressed with IPython Notebook, and I'd like to use it more extensively. My question has to do with secure data. I know only very little about networking. If I use IPython Notebook, is the data sent out over the web to a remote server? Or is it all contained locally? I am not talking about setting up a common resource for multiple access points, just using the data on my machine as I would with SAS or R.
Thanks
If you run the notebook on your machine, then no, it doesn't send anything externally. There are sites like Wakari where you can use the IPython notebook that's running on a server, and obviously that will send your code and data to their servers.
If you did want to expose your notebook server on the internet, then there are security measures that you should take, but that's not necessary if you're just running ipython notebook locally, which is the default way it starts up.
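For a bit of extra reassurance, you can look at (or pin down) the relevant settings in the notebook config file, which is itself Python. Here is a minimal sketch, assuming the classic ipython_notebook_config.py (jupyter_notebook_config.py in later versions); the values shown are, as far as I know, the defaults, so you normally don't need to set them at all:

```python
# ipython_notebook_config.py (jupyter_notebook_config.py in newer releases)
# Sketch only -- these are the default values, shown to illustrate that the
# server listens on your own machine unless you deliberately change them.
c = get_config()

# Bind to the loopback interface: reachable from this machine only.
c.NotebookApp.ip = 'localhost'
c.NotebookApp.port = 8888

# Open a local browser tab rather than expecting remote connections.
c.NotebookApp.open_browser = True
```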
I am learning to use Azure Machine Learning. It has its own Notebooks (which are OK!) and it also allows me to use Jupyter Notebook and VSCode.
However, I am wondering if there is a way to efficiently use Spyder with Azure Machine Learning.
e.g. I was able to install RStudio as a custom application from a Docker image by following the steps provided here: Stackoverflow link
Spyder supports connecting to a remote Python kernel; it does, however, require SSH.
You can enable SSH on your Compute Instance (see below), but only at the time you create it. Also, many companies have policies against enabling SSH, so this might not work for you. If it doesn't, I can highly recommend VSCode.
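If you do control how the Compute Instance gets created, the SSH flag can be set at provisioning time. Below is a rough sketch using the v1 Python SDK (azureml-core); the instance name, VM size, and key path are placeholders, and parameter names may differ between SDK versions, so treat it as an outline rather than a recipe:

```python
# Sketch: provision an Azure ML Compute Instance with SSH enabled.
# Assumes the v1 SDK (azureml-core); names, VM size, and key path are placeholders.
from pathlib import Path

from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.from_config()  # reads the config.json downloaded from the portal

config = ComputeInstance.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",
    ssh_public_access=True,  # SSH can only be chosen at creation time
    admin_user_ssh_public_key=(Path.home() / ".ssh" / "id_rsa.pub").read_text(),
)

instance = ComputeTarget.create(ws, "spyder-ci", config)  # "spyder-ci" is a placeholder name
instance.wait_for_completion(show_output=True)
```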
I am looking to use Databricks Connect for developing a pyspark pipeline. DBConnect is really awesome because I am able to run my code on the cluster where the actual data resides, so it's perfect for integration testing, but during development and unit testing (pytest with pytest-spark) I also want to be able to simply use a local Spark environment.
Is there any way to configure DBConnect so for one use-case I simply use a local Spark environment, but for another it uses DBConnect?
My 2 cents, since I've been doing this type of development for some months now:
Work with two Python environments: one with databricks-connect (and thus, no pyspark installed), and another one with only pyspark installed. When you want to execute the tests, just activate the "local" virtual environment and run pytest as usual. Make sure, as some commenters pointed out, that you are initializing the pyspark session using SparkConf().setMaster("local").
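As a concrete illustration, a pytest fixture along these lines keeps the unit tests completely independent of databricks-connect. This is only a sketch: pytest-spark already ships a ready-made spark_session fixture, and the fixture, app, and test names here are placeholders:

```python
# conftest.py in the "local" environment (plain pyspark, no databricks-connect)
import pytest
from pyspark import SparkConf
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Local Spark session for unit tests."""
    conf = SparkConf().setMaster("local").setAppName("unit-tests")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    yield session
    session.stop()


# Example test using the fixture:
def test_row_count(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2
```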
PyCharm helps immensely when switching between environments during development. I am always on the "local" venv by default, but whenever I want to execute something using databricks-connect, I just create a new Run configuration from the menu. Easy peasy.
Also, be aware of some of databricks-connect's limitations:
It is not officially supported anymore, and Databricks recommends moving towards dbx whenever possible.
UDFs just won't work in databricks-connect.
Mlflow integration is not reliable. In my use case, I am able to download and use models, but unable to log a new experiment or track models using the Databricks tracking URI. This might depend on your Databricks Runtime, mlflow, and local Python versions.
It is my understanding that anyone with project editor permissions can access the AI Platform Jupyter notebooks, which is great but not very practical, since this could cause several issues. I would like to use this environment as an "always-on" machine with GPU enabled and allow different people on my team to access it. Right now everyone is logged in as the default "jupyter" user when logging in with the OPEN JUPYTERLAB button. Is there a way to log in with different credentials?
Any tips would be greatly appreciated!
Today it's not possible to use different credentials with the OPEN JUPYTERLAB button.
However, you have SSH access to the underlying VM. You could have each person SSH into the VM with port forwarding. Then everyone would be directly using JupyterLab without the Proxy intermediary.
It's a little less user-friendly, but it will get you distinct users.
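For illustration, the tunnel could be opened with something like the following (a sketch that simply wraps gcloud compute ssh in Python; the instance name, zone, and port are placeholders, and running the same command directly in a terminal works just as well):

```python
# Sketch: open an SSH tunnel so JupyterLab on the notebook VM is reachable
# at http://localhost:8080 for this user. Instance name, zone, and port
# are placeholders for your project.
import subprocess

INSTANCE = "my-notebook-vm"   # placeholder
ZONE = "us-central1-a"        # placeholder

subprocess.run([
    "gcloud", "compute", "ssh", INSTANCE,
    "--zone", ZONE,
    "--",                          # everything after this is passed to ssh
    "-L", "8080:localhost:8080",   # forward local port 8080 to the VM's JupyterLab
    "-N",                          # no remote command, just keep the tunnel open
])
```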
We recently got a supercomputer (I will call it the "cluster"; it has 4 GPUs and a 12-core processor with some decent storage and RAM) for our lab for machine learning research. A Linux distro (most likely CentOS or Ubuntu, depending on your suggestions of course) will be installed on the machine. We want to design the remote access in such a way that we have the following user hierarchy:
Admin (1 person, the professor): This will be the only superuser of the cluster.
Privileged User (~3 people, PhD students): These guys will be the more tech-savvy or long-term researchers of the lab who will have a user account of their own on the cluster. They should be able to set up their own environments (through Docker or conda), remote-develop their projects, and transfer files in and out of the cluster freely.
Regular User (~3 people, Master's students): We expect this kind of user to only interact with the cluster for its computing capabilities and the data it stores. They should not have their own user accounts on the cluster. It is OK if they can only use Jupyter Notebooks. They should be able to access the read-only data on the cluster, as the data we are working on will be too large for them to download locally. However, they should not be able to change anything within the cluster; they should only be able to keep their notebooks and a number of output files there, which they should be able to download to their local systems whenever necessary for reporting purposes.
We also want to allocate only a certain portion of our computing capabilities to type-3 users. The others should be able to access all of the capabilities when they need to.
For all users, it should be easy to access the cluster from whatever OS they have on their personal computers. For types 1 and 2, I think PyCharm for remote development of .py files and tunneling for Jupyter notebooks is the best option.
I did a lot of research on this but since I don't have an IT background I cannot be sure if the following approach would work.
Set up JupyterHub for type-3 users. This way we don't have to give these guys a user account on the cluster. However, I am not sure about GPU support for this. According to here, we can only limit CPU per user. Also, will they be able to access the data under the Admin's home directory when we set up the hub, or do we have to duplicate the data for that? We only want them to be able to access specific portions of the data (the ones related to whatever project they are working on, since they sign a confidentiality agreement covering only that project). Is this possible with JupyterHub?
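To make the question more concrete, here is a rough sketch of the kind of jupyterhub_config.py I have in mind, assuming DockerSpawner; the image name, limit values, and data paths are placeholders, and whether something like this can also constrain GPU usage is part of what I am asking:

```python
# jupyterhub_config.py -- rough sketch only; assumes DockerSpawner is installed.
# The image, limits, and paths below are placeholders for our setup.
c = get_config()

c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/scipy-notebook"   # placeholder image

# Cap what a single type-3 user can consume.
c.DockerSpawner.cpu_limit = 2        # CPU cores per user
c.DockerSpawner.mem_limit = "8G"     # RAM per user

# Mount the shared project data read-only inside each user's container,
# and give each user a private named volume for their own notebooks/outputs.
c.DockerSpawner.volumes = {
    "/data/project-x": {"bind": "/home/jovyan/data", "mode": "ro"},
    "jupyterhub-user-{username}": "/home/jovyan/work",
}
```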
The rest (type-1 and type-2) will have their (sudo or not) user accounts on the cluster. For this case, is there a UI workaround so that users can more easily transfer files to and from the cluster (so that they don't have to use scp)? Is FileZilla an option, for example?
Finally, it would help if the type-2 users could resolve the issues the type-3 users have, so that they don't have to refer to the professor each time they have a problem. But AFAIK, you have to be a superuser to control things in JupyterHub.
If anyone has had to set up this kind of environment at their own lab and can share their experiences, I would be grateful.
I just installed MATLAB Distributed Computing Server on a bunch of machines and it works, but only for those physically connected to the cluster's network. For remote access, those machines are two SSH hops away. How is this problem usually solved? I thought of setting up a VPN, but to me this seems like a last resort.
What I want is for everybody in the lab, using their own versions of MATLAB with the correct toolbox, to just run their code on the cluster somewhat effortlessly. I guess I could ask everybody to just tar-ball their files and access a remote installation of MATLAB, somehow forwarding the GUI session (VNC or X forwarding), but that seems ugly.
Any help?
It is possible to set up "remote access" to a cluster running MDCS so that clients without direct access can submit jobs there. The documentation for this starts here:
http://www.mathworks.com/help/mdce/configure-parallel-computing-products-for-a-generic-scheduler.html
I'm not quite sure how to configure things so that the submission can work across two SSH connections - the example integration scripts shipped with MDCS all presume only one. However, it should be possible provided that:
The client can put the job and task files somewhere the execution nodes can see them
The client can trigger the appropriate qsub or whatever on the cluster headnode
You might also consider simply contacting MathWorks installation support.