PyMC3/Edward/Pyro on Spark? - apache-spark

Has anyone tried using a python probabilistic programming library with Spark? Or does anyone have a good idea of what it would take?
I have a feeling Edward would be simplest because there are already tools connecting TensorFlow and Spark, but I'm still hazy about what low-level code changes would be required.
I know distributed MCMC is still an area of active research (see MC-Stan on Spark?), so is this even reasonable to implement? Thanks!

Since Edward is built on TensorFlow, you can use the existing TensorFlow connectors with it. One of the main drawbacks of MCMC is that it is very computationally intensive, so you may want to try variational inference for your Bayesian models instead; it approximates the target distribution and is usually much cheaper (this also applies to Pyro and PyMC3, I believe). You can also work with distributed TensorFlow.
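As a rough illustration of the variational-inference suggestion, here is a minimal sketch of Bayesian linear regression fitted with Edward's KLqp (Edward 1.x API; the toy data and dimensions are made up for the example):

```
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

# Toy data (hypothetical): 500 points, 5 features
N, D = 500, 5
X_train = np.random.randn(N, D).astype(np.float32)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0], dtype=np.float32)
y_train = X_train.dot(true_w) + 0.1 * np.random.randn(N).astype(np.float32)

# Model: Bayesian linear regression
X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, w) + b, scale=tf.ones(N))

# Variational approximation of the posterior
qw = Normal(loc=tf.Variable(tf.random_normal([D])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D]))))
qb = Normal(loc=tf.Variable(tf.random_normal([1])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))

# KL(q || p) variational inference instead of MCMC
inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run(n_iter=1000)
```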
I also recommend trying a library called Dask (https://dask.pydata.org/en/latest/). It lets you scale your model from your workstation to a cluster, and it also has TensorFlow connectors.
Hope this helps

I've seen people run Pyro+PyTorch in PySpark, but the use case was CPU-only and did not involve distributed training.
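For what it's worth, the simplest pattern I have seen for that is embarrassingly parallel inference: run one independent Pyro chain per Spark task and collect the samples on the driver. A rough CPU-only sketch (the model, data, and chain count are hypothetical, and Pyro/PyTorch must be installed on the workers):

```
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyro-chains").getOrCreate()
data = torch.randn(100) + 3.0  # toy observed data

def model(obs):
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))
    with pyro.plate("data", len(obs)):
        pyro.sample("obs", dist.Normal(mu, 1.0), obs=obs)

def run_chain(seed):
    # One independent NUTS chain per Spark task (CPU only).
    pyro.set_rng_seed(seed)
    mcmc = MCMC(NUTS(model), num_samples=500, warmup_steps=200)
    mcmc.run(data)
    return mcmc.get_samples()["mu"].numpy().tolist()

# Four independent chains, one per task; samples are gathered on the driver.
samples = spark.sparkContext.parallelize(range(4), 4).map(run_chain).collect()
```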

Related

categorize non-functional requirements

I am developing a machine learning project which analyzes requirement specifications and categorizes the non-functional requirements into categories like database, web socket, backend technology, etc. From my research, Naive Bayes seems the better way to categorize, but due to the lack of a dataset I have planned to go with seeded LDA for topic modeling. Would it be okay to use LDA, or should I use something else?
You can try either LDA or clustering.
In my experience, k-means clustering can give you a better visualization of what you are doing and what is happening.
LDA could also work well. You could try it first, since k-means takes much more time.
I implemented an issue tracking system using k-means; you might like to take a look: issue tracker
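As a rough illustration of both options, here is a minimal scikit-learn sketch on toy requirement sentences (the documents and the number of topics/clusters are made up):

```
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "the system shall store user records in a relational database",
    "the backend shall expose a web socket for live notifications",
    "responses from the backend service must return within 200 ms",
]

# Option 1: LDA on raw term counts
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))   # per-document topic mixture

# Option 2: k-means on TF-IDF vectors
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(tfidf)
print(km.labels_)              # hard cluster assignment per document
```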

I want to customise the last layer of the VGG 19 architecture for classification. Which will be more useful, Keras or PyTorch?

I want to customise the last layer of the VGG 19 architecture for a classification problem. Which will be more useful, Keras or PyTorch?
It heavily depends on what you want to do with it.
While Keras offers different backends, such as TensorFlow or Theano (which in turn can offer you a little more flexibility), and transfers better to production systems,
PyTorch is definitely also easy to work with. Additionally, it offers great scaling on (multi-)GPU systems, since it is trivial to move the computations of a PyTorch model onto the GPU(s). I do not know how easy that is in Keras (I have never done it, so I genuinely cannot judge).
If you just want to play around with one of the frameworks, it usually boils down to personal preference. I personally prefer PyTorch, due to its more "python-esque" approach to things, but I know many people that prefer Keras because of its clear and simple layout and documentation.
Providing a little more information, or your context, can also potentially increase the quality of the answers you receive.
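If you do go with PyTorch, swapping the last layer is only a couple of lines; a minimal sketch with torchvision (the number of classes is a placeholder):

```
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: your number of target classes

# Load VGG-19 pretrained on ImageNet
vgg = models.vgg19(pretrained=True)

# Optionally freeze the convolutional feature extractor
for param in vgg.features.parameters():
    param.requires_grad = False

# The classifier is an nn.Sequential whose last layer (index 6) is
# Linear(4096, 1000); replace it with your own output layer.
vgg.classifier[6] = nn.Linear(4096, num_classes)
```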

PySpark with scikit-learn

I have seen around that we can use scikit-learn with PySpark to work on a single partition on a single worker.
But what if we want to train on a dataset that is distributed, and say the regression algorithm should consider the entire dataset? Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on that particular partition. Please correct me if I'm wrong.
And how good is spark-sklearn at solving this problem?
As described in the documentation, spark-sklearn does address your requirements:
train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default
in scikit-learn.
convert Spark's Dataframes seamlessly into numpy ndarrays or sparse matrices.
so, to specifically answer your questions:
But what if we want to train on a dataset that is distributed,
and say the regression algorithm should consider the entire dataset?
Since scikit-learn is not integrated with RDDs, I assume it doesn't allow running the algorithm on the entire dataset but only on that particular partition
In spark-sklearn, Spark is used as a replacement for the joblib library as a multithreading framework. So, going from an execution on a single machine to an execution on multiple machines is handled seamlessly by Spark for you. In other terms, as stated in the Auto-scaling scikit-learn with Spark article:
no change is required in the code between the single-machine case and the cluster case.
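Concretely, the drop-in pattern looks roughly like this (a sketch assuming the spark-sklearn package is installed; the estimator and parameter grid are placeholders):

```
from pyspark.sql import SparkSession
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from spark_sklearn import GridSearchCV  # distributed analog of sklearn's GridSearchCV

spark = SparkSession.builder.appName("spark-sklearn-demo").getOrCreate()
sc = spark.sparkContext

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None]}

# Same interface as sklearn's GridSearchCV, except the SparkContext comes
# first; each parameter combination is fitted as a separate Spark task.
search = GridSearchCV(sc, RandomForestRegressor(random_state=0), param_grid)
search.fit(X, y)
print(search.best_params_)
```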

Convolutional Neural Network in Spark

I'm trying to implement a Convolutional Neural Network algorithm on Spark and I wanted to ask two questions before moving forward.
I need to implement my code such that it is highly integrated with Spark and also follows the principles of machine learning algorithms in Spark. I found that Spark ML is the established ground for machine learning code, and it has a specific foundation which all the implemented algorithms follow. Also, the implemented algorithms offload their heavy mathematical operations to third-party libraries such as BLAS, to do the calculations fast.
Now I wanted to ask:
1) Is ML the right place to start? By following the ML structure, is my code going to be highly integrable with the rest of the Spark ML ecosystem?
2) Am I right about the bottom layer of the ML code, where it offloads the processing to another mathematical library? Does that mean I can change that layer to do the heavy processing in a customized fashion?
Would appreciate any suggestions.
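For reference, "following the ML structure" in question 1 essentially means implementing the Estimator/Model pair from pyspark.ml so your algorithm plugs into Pipelines. A minimal sketch; the class names and the trivial fit logic are purely illustrative:

```
from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F

class MeanShiftModel(Model, HasInputCol, HasOutputCol):
    """Fitted model returned by MeanShift.fit(): subtracts the learned mean."""

    def __init__(self, mean=0.0, inputCol=None, outputCol=None):
        super().__init__()
        self.mean = mean
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        return dataset.withColumn(
            self.getOutputCol(),
            F.col(self.getInputCol()) - F.lit(self.mean))

class MeanShift(Estimator, HasInputCol, HasOutputCol):
    """Toy estimator following the Spark ML Estimator/Model convention."""

    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _fit(self, dataset):
        mean = dataset.agg(F.avg(self.getInputCol())).first()[0]
        return MeanShiftModel(mean=mean,
                              inputCol=self.getInputCol(),
                              outputCol=self.getOutputCol())

# Usage: MeanShift(inputCol="x", outputCol="x_centered").fit(df).transform(df)
```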

Is there a Doc2vec model in tensorflow?

I know I am not supposed to ask for a tool, resource, etc. on Stack Overflow, but I think this is an important question and people will benefit from it. Here comes the question: I have found word2vec but failed to find a doc2vec implementation in the tensorflow package, and I will be surprised if it is not supported in TensorFlow.
I guess that will be very slow. TensorFlow does not support so-called "inline" matrix operations, but forces you to copy a matrix in order to perform an operation on it. Copying very large matrices is costly in every sense, and TF takes 4x as long as the state-of-the-art deep learning tools. Google says it's working on the problem. (Source)
You can go ahead and implement it on your own, which is not hard since there are many word2vec implementations to start from, but the question remains: is it useful and fast?
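If you do roll your own, the core change over word2vec is one extra embedding matrix for document IDs. A rough PV-DM-style sketch in the TensorFlow 1.x style (vocabulary size, corpus size, and dimensions are made up):

```
import tensorflow as tf

vocab_size, num_docs = 50000, 10000   # hypothetical corpus sizes
embed_dim, window, num_sampled = 128, 4, 64
batch_size = 256

# One embedding matrix for words, plus an extra matrix for document vectors;
# that extra matrix is essentially what turns word2vec into doc2vec (PV-DM).
word_emb = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
doc_emb = tf.Variable(tf.random_uniform([num_docs, embed_dim], -1.0, 1.0))

context_words = tf.placeholder(tf.int32, [batch_size, window])
doc_ids = tf.placeholder(tf.int32, [batch_size])
target_words = tf.placeholder(tf.int32, [batch_size, 1])

# Combine the averaged context word vectors with the document vector.
context_vec = tf.reduce_mean(tf.nn.embedding_lookup(word_emb, context_words), axis=1)
doc_vec = tf.nn.embedding_lookup(doc_emb, doc_ids)
hidden = (context_vec + doc_vec) / 2.0

# NCE output layer, as in the standard word2vec tutorial.
nce_w = tf.Variable(tf.truncated_normal([vocab_size, embed_dim], stddev=0.05))
nce_b = tf.Variable(tf.zeros([vocab_size]))
loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=nce_w, biases=nce_b, labels=target_words,
    inputs=hidden, num_sampled=num_sampled, num_classes=vocab_size))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
```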
