Basic question on downloading data for a Kubeflow pipeline

I'm a newbie to Kubeflow and have just started exploring. I've set up a MicroK8s cluster and Charmed Kubeflow, and I have executed a few examples trying to understand the different components. Now I'm trying to set up a pipeline from scratch for a classification problem. The problem I'm facing is handling the download of data.
Could anyone please point me to an example where data (preferably images) is downloaded from an external source?
All the examples I can find are based on small datasets from sklearn, MNIST, etc. I'm looking instead for an example that uses real-world (or close to real-world) data, for example:
https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip
Thanks in advance for any direction.
I've explored multiple Kubeflow examples, blogs, etc. to find one that uses real data rather than a toy dataset, but couldn't find any.
I've found some Jupyter notebook examples that use !wget to download data in the notebook kernel, but I couldn't figure out how that can be converted into a Kubeflow op step. I presumed func_to_container_op wouldn't work for such a scenario. As a next step I'm going to try using specs.AppDef from TorchX to do the download. As I'm a total newbie, I wanted to make sure I'm heading in the right direction.

I was able to download using wget for direct links, and I was also able to configure Kubernetes secrets and patch the service account with an imagePullSecret so that downloads work from newly created containers.
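For anyone landing on the same question, here is a minimal sketch of a download step built with the KFP v1 SDK. The base image, output-path handling, and pipeline name are assumptions for illustration, not taken from any official example:

```python
# Hedged sketch: a lightweight KFP (v1 SDK) component that downloads and
# unpacks a zip archive into the step's output artifact.
from kfp import dsl
from kfp.components import create_component_from_func, OutputPath


def download_and_extract(url: str, data_path: OutputPath()):
    """Download a zip archive from `url` and extract it under `data_path`."""
    # Imports live inside the function because it runs in its own container.
    import os
    import urllib.request
    import zipfile

    os.makedirs(data_path, exist_ok=True)
    archive = os.path.join(data_path, "dataset.zip")
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_path)
    os.remove(archive)


download_op = create_component_from_func(
    download_and_extract,
    base_image="python:3.9",  # assumed base image
)


@dsl.pipeline(name="cats-and-dogs-download")  # illustrative pipeline name
def pipeline(
    url: str = "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip",
):
    download_task = download_op(url=url)
```

Compiling this with kfp.compiler.Compiler and uploading it to the Pipelines UI should give a run whose first step pulls the archive into an output artifact, which a downstream training step could then consume via InputPath.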

Related

Deploying PySpark as a service

I have code in PySpark that parallelizes heavy computation. It takes two files and several parameters, performs the computation, and generates information that is currently stored in a CSV file (ideally it would eventually be stored in a Postgres database).
Now I want to consume this functionality as a service from a system built in Django, from which users will set the parameters of the Spark service, select the two files, and then query the results of the process.
I can think of several ways to cover this, but I don't know which one is the most convenient in terms of simplicity of implementation:
Use Spark's REST API: this would allow a request to be made from Django to the Spark cluster. The problem is that I found no official documentation; everything I have comes from blogs whose parameters and experiences correspond to one particular situation or solution. None of them mention, for example, how I could send files to be consumed by Spark through the API, or how to get the result back.
Develop a simple API on the Spark master that receives all parameters and executes the spark-submit command at the OS level (see the sketch after this question). The awkward part of this solution is that everything must be handled at the OS level: the parameter files and the final result must be saved on disk, accessible to the Django server, which then reads them to save the information in the DB later.
Integrate the Django app on the master server, writing the PySpark code inside it, so Spark connects to the master and runs the code that manipulates the RDDs. This scheme does not convince me because it sacrifices the independence between Spark and the Django application, which is already huge.
If someone could enlighten me about this: perhaps, due to lack of experience, I am overlooking a cleaner, more robust, or more idiomatic solution.
Thanks in advance
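To make option 2 concrete, here is a hedged sketch of a thin HTTP wrapper around spark-submit. The master URL, script path, and parameter names are illustrative assumptions, not part of the actual setup:

```python
# Hedged sketch of option 2: a tiny Flask endpoint on the Spark master that
# accepts parameters and shells out to spark-submit.
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/jobs", methods=["POST"])
def submit_job():
    params = request.get_json()
    cmd = [
        "spark-submit",
        "--master", "spark://master:7077",    # assumed cluster URL
        "/opt/jobs/heavy_computation.py",     # assumed PySpark script
        params["file_a"], params["file_b"],   # file paths already on shared storage
        str(params.get("threshold", 0.5)),    # example numeric parameter
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return jsonify({"returncode": proc.returncode, "stderr": proc.stderr[-2000:]})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Whether something like this is acceptable depends mostly on how the input files and the resulting CSV are shared between the Django host and the Spark master (shared volume, object storage, or copying them as part of the request).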

Can Azure ML notebooks be run automatically to create alerts?

I'm developing a time series model to analyze the download traffic inside my organization. Now I'm trying to find a way to run this code automatically every day and create alerts whenever I find anomalies (high download volumes), so that it's not necessary to do it manually. I'd also like to create a dashboard or some easy way to visualize the plots I'm getting in this case.
It'd be something similar to workbooks but with a deeper analysis.
Thanks!
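One common approach is to wrap the notebook's logic in a script, publish it as an Azure ML pipeline, and attach a daily schedule. Below is a minimal sketch assuming the v1 azureml-sdk; the script name, compute target, and schedule time are assumptions:

```python
# Hedged sketch: publish the anomaly-detection script as an Azure ML pipeline
# and run it on a daily schedule (v1 azureml-sdk).
from azureml.core import Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline, Schedule, ScheduleRecurrence
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()  # expects a config.json for your workspace

step = PythonScriptStep(
    name="detect-download-anomalies",
    script_name="detect_anomalies.py",   # assumed script holding the model + alerting
    source_directory=".",
    compute_target="cpu-cluster",        # assumed existing compute target
    runconfig=RunConfiguration(),
)

pipeline = Pipeline(workspace=ws, steps=[step])
published = pipeline.publish(name="daily-download-anomaly-check")

recurrence = ScheduleRecurrence(frequency="Day", interval=1, hours=[6], minutes=[0])
Schedule.create(
    ws,
    name="daily-anomaly-schedule",
    pipeline_id=published.id,
    experiment_name="download-anomalies",
    recurrence=recurrence,
)
```

The alerting itself could live inside detect_anomalies.py (for example posting to a webhook or sending mail when an anomaly is found), and the plots or metrics could be logged from the run so they show up in the studio for dashboarding.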

How to migrate Google App Engine from Python 2.7 and Datastore to Python 3

My website was built using Google App Engine, Datastore, and Python 2.7. It's no longer working ("This site can't be reached"). I need to migrate to Python 3, but I cannot identify which migration guide is best suited for me. Can anyone point me to the correct set? I would like to get it running as quickly as possible (I only have one hour a day to try to correct it -- I have an unrelated full-time job).
Migration guide
Google provides a step-by-step migration guide specifically for App Engine, which you should follow.
Additionally, you will find lots of useful links there where you can read about the differences between Python 2 and Python 3 and the various migration tools available. Depending on your application those tools might even be able to do the migration (more or less) automatically for you.
Please note: This is the migration guide for the AppEngine standard environment. If you don't know what you're using, you're most likely using the standard environment. While some steps will differ when using the flexible environment, migration of the code base as described in the guide will always be required.
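As an illustration of the kind of code change the guide walks you through, here is a hedged sketch of moving Datastore access from the Python 2-only google.appengine.ext.ndb to the google-cloud-ndb client that runs on Python 3; the model and field names are made up:

```python
# Hedged sketch: Datastore access on Python 3 via the google-cloud-ndb library.
from google.cloud import ndb


class Page(ndb.Model):           # illustrative model
    title = ndb.StringProperty()
    views = ndb.IntegerProperty(default=0)


client = ndb.Client()


def bump_views(page_id: str) -> int:
    # Unlike the old Python 2 runtime, google-cloud-ndb requires an explicit
    # context around every Datastore operation.
    with client.context():
        page = Page.get_by_id(page_id) or Page(id=page_id, title=page_id)
        page.views += 1
        page.put()
        return page.views
```

On the old runtime no explicit context was needed; with google-cloud-ndb every Datastore call has to run inside client.context(), which is one of the mechanical changes the guide covers.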
Video: Python 2 to 3: Migration Patterns & Motivators (Cloud Next '19)
There also is a recording of a talk by the Google Cloud Team on migration from Python 2 to 3 on YouTube.
Still having issues?
Migrating from Python 2 to 3 is a well-known problem and there is tons of information available on the internet. Most likely the problems you face have already been solved by someone, so a Google search for a specific problem will likely give you a working solution.

How do I use Node.js to read server file within a Django application?

I have a Django application that I use to visualize data. The data itself is in an Amazon S3 bucket, and the plotting of the data is done using Bokeh. One of the applications, however, uses d3 instead. While I managed to configure the Bokeh apps to work properly, I don't know how to do the same with the d3 app. It has to read data either directly from S3 or locally (if I download it within Django views before rendering the web page) and plot it, but whatever I try I get a 404 Not Found error.
I'm learning as I go, so I don't know what I'm doing wrong. Going through SO, I found this question, which gives an example of downloading a file from S3 using Node.js, but I am already running a Django server, so I don't know if that would work. I should also mention that the files to be read are quite large (several megabytes). So, to summarize, my question is:
Is using Node.js my only solution here, and can it be done without having both Node.js and Django running at the same time? I'm worried that this might be too complex for me to set up. Or better yet, what would be a recommended solution in my case? I am almost done with the whole project but, unfortunately, I've gotten stuck pretty badly here.
Thank you to anyone willing to help or offer advice.
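Node.js is not the only option here: the same Django process can fetch the object from S3 and hand it to d3. A hedged sketch, assuming boto3 and illustrative bucket, key, and URL names:

```python
# Hedged sketch: serve the S3 object straight from a Django view so the d3
# front end can fetch it, without running a separate Node.js server.
import boto3
from django.http import HttpResponse

s3 = boto3.client("s3")


def dataset(request):
    # Bucket and key are illustrative assumptions.
    obj = s3.get_object(Bucket="my-data-bucket", Key="plots/dataset.csv")
    return HttpResponse(obj["Body"].read(), content_type="text/csv")

# urls.py (assumed):
#   path("data/dataset.csv", views.dataset, name="dataset"),
# and in the d3 code:
#   d3.csv("/data/dataset.csv").then(draw);
```

For multi-megabyte files, StreamingHttpResponse or a pre-signed S3 URL that d3 fetches directly would avoid loading everything into memory in the view.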

How to set up a local DBpedia server and use it

I am wondering if there is a tutorial on how I could get a local copy of the DBpedia knowledge base up and running.
I understand that I'll have to download all the .nt files that I would like to include from the DBpedia website, but I don't really get how to make use of them.
From other questions I found out that there are interfaces in different languages for this, but I don't know how to find the right one for Node.js or Java. I found a lot of libraries that work with RDF data and the SPARQL query language, but I don't understand how to connect them with the .nt files.
Can anyone give a short introduction on where to start setting up the knowledge base for, e.g., Node.js?
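The connection between the .nt dumps and those libraries is that the dumps are N-Triples files, which any RDF library can parse into a graph (or which can be loaded into a triple store such as Virtuoso or Fuseki and queried over HTTP). As a small illustration in Python with rdflib, assuming illustrative dump file names; the same pattern exists for Java (Apache Jena) and Node.js RDF libraries:

```python
# Hedged sketch: load downloaded DBpedia .nt dumps into an in-memory graph
# with rdflib and run a SPARQL query against it.
from rdflib import Graph

g = Graph()
g.parse("labels_en.nt", format="nt")           # illustrative dump file names
g.parse("instance_types_en.nt", format="nt")

query = """
SELECT ?s ?label WHERE {
    ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
"""
for row in g.query(query):
    print(row.s, row.label)
```

For the full DBpedia dumps, loading into a dedicated triple store and querying its SPARQL endpoint would scale far better than holding the whole graph in memory.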
