I want to run a reinforcement learning (RL) algorithm on Apache Spark. However, RL does not exist in Spark's MLlib.
Is it possible to implement it? Any links would help.
Thank you in advance.
Since MLlib does not ship an RL algorithm, you can wrap your own implementation in a UDF or a Pandas UDF instead and let Spark distribute the work.
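For example, here is a minimal sketch of distributing many independent policy rollouts with a grouped Pandas UDF (assuming Spark 3.0+ for applyInPandas); simulate_episode and the column names are hypothetical placeholders, not anything Spark provides.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rl-rollouts").getOrCreate()

    # One row per (agent, seed) pair; each group is rolled out independently.
    tasks = spark.createDataFrame(
        [(agent, seed) for agent in range(4) for seed in range(100)],
        ["agent_id", "seed"],
    )

    def simulate_episode(pdf: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical placeholder for your environment/rollout logic.
        rewards = (pdf["seed"] % 7).astype(float)
        return pd.DataFrame({"agent_id": pdf["agent_id"], "reward": rewards})

    # Grouped Pandas UDF: each agent's episodes are simulated on the executors.
    results = tasks.groupBy("agent_id").applyInPandas(
        simulate_episode, schema="agent_id long, reward double"
    )
    results.groupBy("agent_id").avg("reward").show()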
Are Hive on Spark and Spark SQL interchangeable? Or is one a subset of the other? Thanks.
It may depend on which programs you are using, but I have found a source:
Hive on Spark is similar to Spark SQL: it is a pure SQL interface that uses Spark as its execution engine. Spark SQL uses Hive's syntax, so as a language, I would say they are almost the same.
So to answer your question, the syntax should be mostly interchangeable, but the two seem to be used for different purposes.
See this post for more information about Hive and Spark.
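To illustrate the overlap, the same HiveQL-style statement can be issued from Spark SQL when Hive support is enabled; this is only a sketch, and web_logs is a made-up table name.

    from pyspark.sql import SparkSession

    # Spark SQL with Hive support enabled accepts HiveQL-style statements and
    # can read tables from the Hive metastore; web_logs is a hypothetical table.
    spark = (
        SparkSession.builder
        .appName("hiveql-on-spark-sql")
        .enableHiveSupport()
        .getOrCreate()
    )

    # The same statement would also run under Hive itself (with Spark or
    # MapReduce as the execution engine).
    spark.sql("""
        SELECT host, COUNT(*) AS hits
        FROM web_logs
        GROUP BY host
        ORDER BY hits DESC
        LIMIT 10
    """).show()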
I already have a Nutch/Solr application running in single-node mode. I'm supposed to try integrating Mahout or Spark to achieve some kind of personalized results, but I'm still a long way from that point.
Given my lack of knowledge, time, and resources, is there a fast and effective way to use one of these tools with Nutch's crawldb or the Solr-indexed data to demonstrate personalization as a proof of concept?
I'm open to any ideas.
Regards
Considering you are framing this as Spark vs. Mahout, I think you are thinking of the "old" MapReduce-based Mahout, which has been deprecated and moved to "community support".
I would recommend you use Mahout Samsara, which runs on Spark as a library; in other words, my answer is that you should use Mahout and Spark together. For local mode, though, you can just use Mahout Vectors/Matrices.
The question is vague, but based on the tags, I think this tutorial might be a good place to start, as it uses Mahout and Solr for a recommendation engine.
http://mahout.apache.org/docs/latest/tutorials/cco-lastfm/
Disclaimer: I'm a PMC member of the Apache Mahout project.
This question already has answers here: Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?
Can someone explain to me the difference between spark.createDataFrame() and sqlContext.createDataFrame()? I have seen both used but do not understand the exact difference or when to use which.
I'm going to assume you are using Spark 2.x or later, because the first method refers to a SparkSession, which is only available from version 2.0 onward.
spark.createDataFrame(...) is the preferred way to create a DataFrame in Spark 2.x. Refer to the linked documentation to see the possible usages, as it is an overloaded method.
sqlContext.createDataFrame(...) (Spark 1.6) was the way to create a DataFrame in Spark 1.x. As you can read in the linked documentation, it is deprecated in Spark 2.x and kept only for backwards compatibility:
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
So, to answer your question: you can use both in Spark 2.x (although the second is deprecated, so the first is strongly recommended), and if you are stuck on Spark 1.x, the second one is your only option.
Edit: SparkSession implementation (i.e. the source code) and SQLContext implementation
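For illustration, here is a minimal sketch of both entry points on Spark 2.x; the sample data and app name are arbitrary.

    from pyspark.sql import SparkSession, SQLContext

    spark = SparkSession.builder.appName("createDataFrame-demo").getOrCreate()
    data = [("alice", 1), ("bob", 2)]

    # Preferred in Spark 2.x+: go through the SparkSession.
    df1 = spark.createDataFrame(data, ["name", "id"])

    # Legacy Spark 1.x style: still works in 2.x, but kept only for
    # backwards compatibility and backed by the same SparkSession.
    sqlContext = SQLContext(spark.sparkContext)
    df2 = sqlContext.createDataFrame(data, ["name", "id"])

    df1.show()
    df2.show()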
I've tried to read through and understand exactly where the speed-up in Spark comes from when I run Python libraries like pandas or scikit-learn, but I don't see anything particularly informative. If I can get the same speed-up without using PySpark DataFrames, can I just deploy code using pandas and expect it to perform roughly the same?
I suppose my question is:
If I have working pandas code, should I translate it to PySpark for efficiency or not?
If you are asking whether you get any speed-up just by running arbitrary Python code on the driver node, the answer is no. The driver is a plain Python interpreter; it doesn't affect your code in any "magic" way.
If I have working pandas code, should I translate it to PySpark for efficiency or not?
If you want the benefits of distributed computing, then you have to rewrite your code using distributed primitives. However, it is not a free lunch:
Your problem might not distribute well.
Even if it does, the amount of data might not justify distribution.
In other words: if your code works just fine with pandas or scikit-learn, there is little chance you'll gain anything by rewriting it for Spark.
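To make the contrast concrete, here is a rough sketch of the same aggregation written for pandas and for PySpark; the file path and column names are placeholders.

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    path = "events.csv"  # placeholder file with columns user_id, amount

    # Single-machine pandas version: limited by local memory, but with no
    # cluster overhead (scheduling, serialization, shuffles).
    pdf = pd.read_csv(path)
    pandas_result = pdf.groupby("user_id")["amount"].sum()

    # Distributed PySpark version of the same aggregation: only pays off when
    # the data is too large or too slow to process on one machine.
    spark = SparkSession.builder.appName("pandas-vs-pyspark").getOrCreate()
    sdf = spark.read.csv(path, header=True, inferSchema=True)
    spark_result = sdf.groupBy("user_id").agg(F.sum("amount").alias("amount"))
    spark_result.show()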
Thanks in advance for any help on this. I am working on a project to do system log anomaly detection on some very large data sets (we aggregate ~100 GB of syslogs per day). The approach we have chosen requires computing a singular value decomposition (SVD) on a matrix of identifiers for each log message. As we progressed we found that Spark 2.2 provides a computeSVD function (we are using the Python API; we are aware that it is available in Scala and Java, but our target is Python), yet we are running Spark 2.1.1 (Hortonworks HDP 2.6.2 distribution). I asked about upgrading our 2.1.1 installation in place, but the 2.2 release has not been tested against HDP yet.
We toyed with the idea of using NumPy directly from Python for this, but we are afraid we would break the distributed nature of Spark and possibly overload worker nodes by going outside of the Spark API. Are there any alternatives in the Spark 2.1.1 Python API for SVD? Any suggestions or pointers would be greatly appreciated. Thanks!
Another thought I forgot about in the initial posting: is there a way we can write our machine learning primarily in the Python API, but call the Scala function we need, return the result, and continue in Python? I don't know if that is a thing or not....
To bring this to a close, we ended up writing our own SVD function based on the example at:
Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
There were some minor tweaks, and I will post them as soon as we have finalized them, but overall it was the same approach. That post targeted Spark 1.5 and we are using Spark 2.1.1. However, as noted, Spark 2.2 includes a computeSVD() function; unfortunately, at the time of this posting, the HDP distribution we are using did not support 2.2. Yesterday (11.1.2017), HDP 2.6.3 was announced with support for Spark 2.2. Once we upgrade, we'll convert the code to take advantage of the built-in computeSVD() function that Spark 2.2 provides. Thanks for all the help and the pointer to the link above; it helped greatly!
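For anyone landing here later, this is a rough sketch of the two routes discussed in this thread, not the exact code we finalized: a Spark 2.1.x workaround that eigendecomposes the (small) Gramian of a RowMatrix with NumPy to recover singular values and right singular vectors, similar in spirit to the linked PCA answer, and the built-in computeSVD() that the PySpark API exposes from 2.2 on. The toy rows are placeholders.

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import RowMatrix

    spark = SparkSession.builder.appName("svd-sketch").getOrCreate()

    # Toy rows standing in for the per-message identifier matrix.
    rows = spark.sparkContext.parallelize([
        [1.0, 0.0, 2.0],
        [0.0, 3.0, 4.0],
        [5.0, 6.0, 0.0],
    ])
    mat = RowMatrix(rows)

    # Spark 2.1.x workaround: the Gramian A^T A is only d x d, so it can be
    # eigendecomposed locally; singular values are sqrt(eigenvalues) and the
    # eigenvectors are the right singular vectors V.
    gram = np.array(mat.computeGramianMatrix().toArray())
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    singular_values = np.sqrt(np.clip(eigvals[order], 0.0, None))
    V = eigvecs[:, order]

    # Spark 2.2+: the built-in call this thread was waiting for.
    # svd = mat.computeSVD(k=2, computeU=True)
    # print(svd.s)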