I want to compute mutual information gain between a feature and a target variable in Pyspark. So are there any ways to approach this? I can't seem to find an inbuilt function for it. Or if there are some series of steps that could be done.
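As far as I know there is no built-in mutual information function in PySpark (ChiSquareTest and ChiSqSelector are the closest built-ins), but you can compute it yourself from grouped counts. A minimal sketch, assuming a DataFrame df with two categorical columns; the column names are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def mutual_information(df, feature_col, target_col):
    # Empirical mutual information in nats:
    # MI = sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
    n = df.count()
    joint = df.groupBy(feature_col, target_col).count().withColumnRenamed("count", "n_xy")
    px = df.groupBy(feature_col).count().withColumnRenamed("count", "n_x")
    py = df.groupBy(target_col).count().withColumnRenamed("count", "n_y")
    terms = (joint.join(px, feature_col)
                  .join(py, target_col)
                  .select(((F.col("n_xy") / n) *
                           F.log((F.col("n_xy") * n) / (F.col("n_x") * F.col("n_y")))).alias("term")))
    return terms.agg(F.sum("term")).first()[0]

mi = mutual_information(df, "feature", "target")

Continuous features would need to be discretized first, for example with Bucketizer or QuantileDiscretizer.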
I was wondering if LOESS (locally estimated scatterplot smoothing) regression is a built-in function in Spark/PySpark (I'm more interested in the PySpark answer, but both would be interesting).
I did some research and couldn't find one, so I decided to try to code it myself using pandas UDFs. While doing that, when I displayed a scatter plot of the synthetic data I had created to start testing my algorithm, Azure Databricks (on which I'm coding) offered to automatically compute and display the LOESS of my dataset:
So maybe there is indeed a built-in LOESS that I just couldn't find? If not (and Databricks alone is responsible for this), is there any way to access the result of Databricks's LOESS computation, or the function Databricks is using to do that?
Thank you in advance :)
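As far as I know there is no built-in LOESS in Spark MLlib, and the Databricks chart smoothing appears to be a feature of its plotting UI rather than an exposed function. If it helps, here is a rough sketch of the pandas-UDF route you describe, using statsmodels' LOWESS per group (the DataFrame df, its column names, and frac are placeholders; Spark 3.x is assumed):

import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

def loess_smooth(pdf: pd.DataFrame) -> pd.DataFrame:
    # lowess returns (x, smoothed y) pairs sorted by x
    smoothed = lowess(pdf["y"], pdf["x"], frac=0.3, return_sorted=True)
    return pd.DataFrame({"group": pdf["group"].iloc[0],
                         "x": smoothed[:, 0],
                         "y_smooth": smoothed[:, 1]})

result = df.groupBy("group").applyInPandas(
    loess_smooth, schema="group string, x double, y_smooth double")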
So I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question but referencing Spark's ML algorithms gives me the following error:
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
Obviously I cannot reference SparkContext inside the apply_classifier function.
My code is similar to what was suggested in the previous question I asked, but I still haven't found a solution to what I am looking for:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

def apply_classifier(clf):
    dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    if clf == 0:
        clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    elif clf == 1:
        clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)

classifiers = [0, 1]

# Fails with the AttributeError above: pyspark.ml estimators cannot be
# constructed inside a task, since workers have no access to the SparkContext/JVM.
sc.parallelize(classifiers).map(lambda x: apply_classifier(x)).collect()
I have tried using flatMap instead of map, but then I get "NoneType object is not iterable".
I would also like to pass a broadcasted dataset (which is a DataFrame) as parameter inside the apply_classifier function.
Finally, is it possible to do what I am trying to do? What are the alternatives?
is it possible to do what I am trying to do?
It is not. Apache Spark doesn't support any form of nesting and distributed operations can be initialized only by the driver. This includes access to distributed data structures, like Spark DataFrame.
What are the alternatives?
This depends on many factors like the size of the data, amount of available resources, and choice of algorithms. In general you have three options:
Use Spark only as a task management tool to train local, non-distributed models. It looks like you have explored this path to some extent already. For a more advanced implementation of this approach you can check spark-sklearn.
In general this approach is particularly useful when data is relatively small. Its advantage is that there is no competition between multiple jobs.
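For example, a rough sketch of this first option (the table name, column names, and metric are placeholders): broadcast a small local dataset to the workers and fit non-distributed scikit-learn models, using Spark only to schedule the training tasks.

from pyspark.sql import SparkSession
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Small training set collected to the driver (pandas, not a Spark DataFrame)
local_df = spark.table("training_data").toPandas()
X, y = local_df.drop(columns=["label"]), local_df["label"]
data_bc = sc.broadcast((X, y))

def train_local(clf_id):
    X, y = data_bc.value
    clf = DecisionTreeClassifier(max_depth=3) if clf_id == 0 else RandomForestClassifier(n_estimators=5)
    clf.fit(X, y)
    return clf_id, clf.score(X, y)

# Spark only distributes the training tasks; each model is fitted locally on a worker
results = sc.parallelize([0, 1], numSlices=2).map(train_local).collect()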
Use standard multithreading tools to submit multiple independent jobs from a single context. You can use, for example, threading or joblib.
While this approach is possible, I wouldn't recommend it in practice. Not all Spark components are thread-safe, and you have to be pretty careful to avoid unexpected behaviors. It also gives you very little control over resource allocation.
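A minimal sketch of this second option, submitting two independent MLlib jobs from driver threads (train_df and test_df are assumed to be existing DataFrames, and the count is a stand-in for a real evaluation metric):

from concurrent.futures import ThreadPoolExecutor
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

def fit_and_eval(clf):
    model = clf.fit(train_df)
    return model.transform(test_df).count()

classifiers = [
    DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3),
    RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5),
]

# Each thread triggers its own Spark jobs through the one shared SparkContext
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fit_and_eval, classifiers))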
Parametrize your Spark application and use an external pipeline manager (Apache Airflow, Luigi, Toil) to submit your jobs.
While this approach has some drawbacks (it requires saving data to persistent storage), it is also the most universal and robust, and it gives you a lot of control over resource allocation.
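A sketch of the third option: make the classifier a command-line parameter so an external scheduler can launch one run per configuration, e.g. spark-submit train.py --classifier decision_tree (all paths and names here are hypothetical).

import argparse
from pyspark.sql import SparkSession
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--classifier", choices=["decision_tree", "random_forest"], required=True)
args = parser.parse_args()

spark = SparkSession.builder.getOrCreate()
train_df = spark.read.parquet("/data/train.parquet")

clf = (DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
       if args.classifier == "decision_tree"
       else RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5))

model = clf.fit(train_df)
model.write().overwrite().save("/models/" + args.classifier)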
I am using Spark MLlib's ALS collaborative filtering algorithm to build a recommender system for an e-commerce website.
The owner of the website requires me to sort, for each individual user, all 4000 items in the catalog according to that user's likelihood of buying them.
Spark's CF algorithm allows me to do that; however, I suspect that after the first few recommended items (say 30, for example), the order in which the algorithm sorts the items becomes pretty meaningless, and past that "magical" point I am better off sorting the items by their general global popularity.
My question:
How can I find that "magical point"? Should it be different for each user?
I know this question might be a bit theoretical, but I would appreciate any thoughts on the matter.
Spark uses in-memory computing and caching to decrease latency on complex analytics; however, this is mainly for "iterative algorithms".
If I needed to perform a more basic analytic, say each element was a group of numbers and I wanted to find the elements with a standard deviation less than 'x', would Spark still decrease latency compared to regular cluster computing (without in-memory computing)? Assume I used the same commodity hardware in each case.
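For reference, the analytic you describe is a single pass over the data in Spark, with or without caching; a minimal sketch with made-up data and threshold:

import statistics
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

x = 5.0
groups = sc.parallelize([[1.0, 2.0, 3.0], [10.0, 50.0, 90.0]])

# Keep only the elements (groups of numbers) whose population standard deviation is below x
low_spread = groups.filter(lambda nums: statistics.pstdev(nums) < x).collect()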
It tied for the top sorting framework while using none of those extra mechanisms, so I would argue that is reason enough. But you can also run streaming, graph processing, or machine learning without having to switch gears. Add in that you should use DataFrames wherever possible, and you get query optimizations beyond any other framework that I know of. So yes, Spark is the clear choice in almost every instance.
One good thing about Spark is its Data Source API: combining it with Spark SQL gives you the ability to query and join different data sources together. Spark SQL now includes a decent optimizer, Catalyst. As mentioned in one of the answers, along with the core (RDD) API you can also work with streaming data, apply machine learning models, and run graph algorithms. So yes.
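For instance (the paths and schemas are placeholders), reading two different sources and joining them through Spark SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.read.json("/data/orders.json")
customers = spark.read.parquet("/data/customers.parquet")

orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

joined = spark.sql("""
    SELECT c.name, o.amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
""")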
Given a MatrixFactorizationModel what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)?
Via the current API, one could pass a Cartesian product of users and products to the predict function, but it seems to me that this will do a lot of extra processing.
Would accessing the private userFeatures, productFeatures be the correct approach, and if so, is there a good way to take advantage of other aspects of the framework to distribute this computation in an efficient way? Specifically, is there an easy way to do better than multiplying all pairs of userFeature, productFeature "by hand"?
Spark 1.1 has a recommendProducts method that can be mapped to each user ID. This is better than nothing but not really optimized for recommending to all users.
I would double-check that you really mean to make recommendations for everyone; at scale, this is inherently a big, slow operation. Consider predicting only for users that have been recently active.
Otherwise, yes, your best bet is to create your own method. The cartesian join of the feature RDDs is probably too slow, as it shuffles so many copies of the feature vectors. Choose the larger of the user/product feature sets and map over that; in each worker, hold the other product/user feature set in memory. If this isn't feasible, you can make this more complex and map several times against subsets of the smaller RDD in memory.
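A rough sketch of that approach, assuming a trained MatrixFactorizationModel named model, an existing SparkContext sc, and that the product side is the smaller one (the threshold is a placeholder):

import numpy as np

threshold = 0.5
# Collect the smaller feature set to the driver and broadcast it to every worker
products_bc = sc.broadcast([(pid, np.array(feats)) for pid, feats in model.productFeatures().collect()])

def score_user(user_row):
    user_id, user_feats = user_row
    u = np.array(user_feats)
    # Dot product of latent factors, keeping only scores above the threshold to stay sparse
    return [(user_id, pid, float(u.dot(p)))
            for pid, p in products_bc.value
            if float(u.dot(p)) > threshold]

predictions = model.userFeatures().flatMap(score_user)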
As of Spark 2.2, recommendProductsForUsers(num) would be the method.
Recommends the top "num" number of products for all users. The number of recommendations returned per user may be less than "num".
https://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html
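Usage sketch, assuming a trained pyspark.mllib MatrixFactorizationModel named model:

# RDD of (userId, list of Rating(user, product, rating)), at most 30 per user
top_30 = model.recommendProductsForUsers(30)
ratings = top_30.flatMap(lambda kv: kv[1])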