What are the mapper and reducer used in the computeSVD() function? - apache-spark

I am new to MapReduce and I want to do some research on computing the SVD using MapReduce.
The code side: I have found computeSVD, a PySpark function, and it uses MapReduce as said in this discussion.
The theory side: what are the mapper and reducer used in the computeSVD() function?
My code:
import findspark
findspark.init(r'C:\spark\spark-3.0.3-bin-hadoop2.7')

import numpy as np
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import RowMatrix

conf = SparkConf()
conf.setMaster("local[*]")
conf.setAppName('firstapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

rows = np.loadtxt('data.txt', dtype=float)  # data.txt is an (m rows x n cols) matrix, m > n
rows = sc.parallelize(rows)
mat = RowMatrix(rows)
svd = mat.computeSVD(5, computeU=True)
I would highly appreciate any help.

You can see in this function the usage of mapPartitions and reduceByKey on the RDD object. They do something similar to MapReduce, but they belong to Spark's own RDD API; it is not the same library as Hadoop MapReduce.
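For intuition only, here is a minimal PySpark sketch (not the actual computeSVD internals) of how those two primitives map onto the classic MapReduce roles: mapPartitions plays the mapper, emitting key/value pairs partition by partition, and reduceByKey plays the reducer, combining all values that share a key.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lines = sc.parallelize(["a b a", "b c"])

# "Mapper": process one whole partition at a time and emit (key, value) pairs.
def emit_pairs(partition):
    for line in partition:
        for word in line.split():
            yield (word, 1)

# "Reducer": merge all values belonging to the same key.
counts = lines.mapPartitions(emit_pairs).reduceByKey(lambda a, b: a + b)

print(counts.collect())  # e.g. [('a', 2), ('b', 2), ('c', 1)]
In computeSVD the same pattern is applied to numeric linear-algebra steps (as far as I understand, accumulating partial matrix products across partitions) rather than to words, but the mapper/reducer division of labour is the same idea.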

Related

Higher Order functions in Spark SQL

Can anyone please explain transform() and filter() in Spark SQL 2.4 with some advanced real-world use-case examples?
In a SQL query, are these only to be used with array columns, or can they also be applied to any column type in general? It would be great if anyone could demonstrate with a SQL query for an advanced application.
Thanks in advance.
I am not going down the .filter road as I cannot see the focus there.
For .transform there are three flavours:
dataframe transform at the DF level
transform on an array column of a DF in v2.4
transform on an array column of a DF in v3
The following:
dataframe transform (DF-level)
From the official docs https://kb.databricks.com/data/chained-transformations.html, transform on a DF can end up like spaghetti. Opinions can differ here.
This, they say, is messy:
...
def inc(i: Int) = i + 1
val tmp0 = func0(inc, 3)(testDf)
val tmp1 = func1(1)(tmp0)
val tmp2 = func2(2)(tmp1)
val res = tmp2.withColumn("col3", expr("col2 + 3"))
compared to:
val res = testDf.transform(func0(inc, 4))
  .transform(func1(1))
  .transform(func2(2))
  .withColumn("col3", expr("col2 + 3"))
transform with a lambda function on an array of a DF in v2.4, which needs the select and expr combination:
import org.apache.spark.sql.functions._
val df = Seq(Seq(Array(1,999), Array(2,9999)),
             Seq(Array(10,888), Array(20,8888))).toDF("c1")
val df2 = df.select(expr("transform(c1, x -> x[1])").as("last_vals"))
transform with a lambda function (the new array function) on a DF in v3, using withColumn:
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df = Seq(
  Array("New York", "Seattle"),
  Array("Barcelona", "Bangalore")
).toDF("cities")
val df2 = df.withColumn("fun_cities",
  transform(col("cities"), (city: Column) => concat(city, lit(" is fun!"))))
Try them.
Final note and excellent point raised (from https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/):
transform works similar to the map function in Scala. I’m not sure why
they chose to name this function transform… I think array_map would
have been a better name, especially because the Dataset#transform
function is commonly used to chain DataFrame transformations.
Update
If you want to use the %sql or display approach for Higher Order Functions, consult this: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html
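Since the main question in this thread uses PySpark, here is a rough sketch of the same array transform written as a plain SQL query; the temp view name cities_tbl is made up for illustration, and transform with a lambda is available in Spark SQL from 2.4 onwards.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(["New York", "Seattle"],), (["Barcelona", "Bangalore"],)], ["cities"])
df.createOrReplaceTempView("cities_tbl")

# The same higher-order transform as the withColumn version, expressed in SQL.
spark.sql("""
    SELECT cities,
           transform(cities, x -> concat(x, ' is fun!')) AS fun_cities
    FROM cities_tbl
""").show(truncate=False)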

How will Spark react if an RDD gets bigger?

We have code running in Apache Spark. After a detailed examination of the code, I've determined that one of our mappers is modifying an object that is in an RDD, rather than making a copy of the object for the output. That is, we have an RDD of dicts, and the map function is adding things to the dictionary, rather than returning new dictionaries.
RDDs are supposed to be immutable. Ours are being mutated.
We are also having memory errors.
Question: Will Spark be confused if the size of an RDD suddenly increases?
While it probably does not crash, it can cause unspecified behaviour. For example, in this snippet:
import scala.collection.mutable

val rdd = sc.parallelize({
  val m = new mutable.HashMap[Int, Int]
  m.put(1, 2)
  m
} :: Nil)
rdd.cache() // comment out to change the behaviour!
rdd.map { m =>
  m.put(2, 3)
  m
}.collect().foreach(println) // "Map(2 -> 3, 1 -> 2)"
rdd.collect().foreach(println) // either "Map(1 -> 2)" or "Map(2 -> 3, 1 -> 2)", depending on whether caching is used
the behaviour changes depending on whether the RDD is cached or not. In the Spark API there are a few functions that are allowed to mutate the data, and that is clearly pointed out in the documentation; see for example https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/rdd/PairRDDFunctions.html#aggregateByKey-U-scala.Function2-scala.Function2-scala.reflect.ClassTag-
Consider having an RDD[(K, V)] of map entries instead of an RDD of maps, i.e. RDD[Map[K, V]]. This would enable adding new entries in a standard way using flatMap or mapPartitions. If needed, the map representation can eventually be generated by grouping, etc., as in the sketch below.
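Since the question's pipeline is in Python, here is a minimal PySpark sketch of that entries-instead-of-maps pattern; the record ids and field names are made up for illustration.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# One (record_id, (field, value)) pair per entry, instead of an RDD of dicts.
entries = sc.parallelize([
    ("rec1", ("i", 1)),
    ("rec2", ("i", 2)),
])

# "Adding to the dict" becomes emitting additional immutable pairs via flatMap;
# nothing already stored in the RDD is mutated in place.
with_gen = entries.flatMap(lambda kv: [kv, (kv[0], ("gen", 0))])

# Rebuild the dict view only when it is actually needed.
as_dicts = with_gen.groupByKey().mapValues(dict)
print(as_dicts.collect())  # e.g. [('rec1', {'i': 1, 'gen': 0}), ('rec2', {'i': 2, 'gen': 0})]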
Okay, I developed some code to test out what happens if an object referred to in an RDD is mutated by the mapper, and I am happy to report that it is not possible if you are programming from Python.
Here is my test program:
from pyspark.sql import SparkSession
import time

COUNT = 5

def funnydir(i):
    """Return a dictionary for i"""
    return {"i": i,
            "gen": 0}

def funnymap(d):
    """Take a dictionary and perform a funnymap"""
    d['gen'] = d.get('gen', 0) + 1
    d['id'] = id(d)
    return d
if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    dfroot = sc.parallelize(range(COUNT)).map(funnydir)
    dfroot.persist()
    df1 = dfroot.map(funnymap)
    df2 = df1.map(funnymap)
    df3 = df2.map(funnymap)
    df4 = df3.map(funnymap)
    print("===========================================")
    print("*** df1:", df1.collect())
    print("*** df2:", df2.collect())
    print("*** df3:", df3.collect())
    print("*** df4:", df4.collect())
    print("===========================================")
    ef1 = dfroot.map(funnymap)
    ef2 = ef1.map(funnymap)
    ef3 = ef2.map(funnymap)
    ef4 = ef3.map(funnymap)
    print("*** ef1:", ef1.collect())
    print("*** ef2:", ef2.collect())
    print("*** ef3:", ef3.collect())
    print("*** ef4:", ef4.collect())
If you run this, you'll see that the id of the dictionary d is different in each of the resulting RDDs. Apparently Spark is serializing and deserializing the objects as they are passed from mapper to mapper, so each gets its own version.
If this were not true, then the first call to funnymap to make df1 would also change the generation in the dfroot RDD, and as a result ef4 would have different generation numbers than df4.

How to apply FP-Growth to dataset after groupBy?

I'd like to use FP-Growth from Spark MLlib in Spark 2.1.
My data has only two columns item_group and item.
I have tried the following but it does not work:
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import collect_list
from pyspark.mllib.fpm import FPGrowth

sc = SparkSession.builder.appName("Assoziationsanalyse").getOrCreate()
hiveCtx = SQLContext(sc)
input = (hiveCtx.sql("""select * from bosch.input_view""")
         .groupBy("item_group")
         .agg(collect_list("item"))
         .alias("items")
         .rdd
         .map(lambda x: x.items))
model = FPGrowth.train(input, minSupport=0.2, numPartitions=10)
I have now solved the problem using another approach which I found in a discussion here.
data = hiveCtx.sql("""select * from bosch.input_view""")
datardd = data.rdd.map(lambda x: (x[0], x[1])).groupByKey().mapValues(list).values()
model = FPGrowth.train(datardd, minSupport=0.1, numPartitions=10)
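For reference, the original groupBy/collect_list attempt can also be made to work by aliasing the aggregated column itself rather than the whole DataFrame. A sketch (the table and column names are taken from the question, and hiveCtx is the SQLContext created above):
from pyspark.sql.functions import collect_list
from pyspark.mllib.fpm import FPGrowth

# Alias the aggregated column so it can be accessed by name afterwards.
grouped = (hiveCtx.sql("""select * from bosch.input_view""")
           .groupBy("item_group")
           .agg(collect_list("item").alias("items")))

# One list of items per item_group; note that MLlib FP-Growth expects the items
# within each transaction to be unique.
transactions = grouped.rdd.map(lambda row: row["items"])

model = FPGrowth.train(transactions, minSupport=0.1, numPartitions=10)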

K means clustering ml library using apache spark

I am trying to implement k-means clustering using the Apache Spark ML version in 2.0.2. After finding the cluster centers, I am facing the issue of how to identify which cluster each data point belongs to. Please help me. Thanks in advance. Please find my code:
val tokenFrameprocess = TokenizerProcessor.process("value", "tokenized")
val stopwordRemover = StopWordsProcessor.process("tokenized", "stopwords")
val stemmingProcess = StemmingProcessor.process("value", "stemmed")
val HashingProcess = HashingTFProcessor.process("stopwords" ,"features")
val pipeline = new Pipeline().setStages(Array(tokenFrameprocess,stopwordRemover,stemmingProcess,HashingProcess))
val finalFrameProcess = pipeline.fit(df)
val finalFramedata = finalFrameProcess.transform(df);
val kmeans = new KMeans().setK(4).setSeed(1L)
val model = kmeans.fit(finalFramedata)
val WSSSE = model.computeCost(finalFramedata)
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
You have to set the feature column which will be used for prediction while instantiating the KMeans, like in the following example:
val kmeans = new KMeans().setK(4).setSeed(1L).setFeaturesCol("features").setPredictionCol("prediction")
Once you call .fit() on the KMeans, it returns a Transformer (in your example, the variable "model" is a transformer). You can call .transform() to get the desired prediction on the given data. For example, the lines below will give you the dataframe with a prediction column.
val model = kmeans.fit(finalFramedata)
val transformedDF = model.transform(finalFramedata)
transformedDF.show(false)
The prediction column denotes the cluster assignments. If you give a k value of 3, the prediction column will have values like 0, 1, 2.

How to use L1 penalty in pyspark.ml.regression.LinearRegressionModel for features selection?

Firstly, I use Spark 1.6.0. I want to use the L1 penalty in pyspark.ml.regression.LinearRegressionModel for feature selection.
But I cannot get the detailed coefficients when calling the function:
lr = LogisticRegression(elasticNetParam=1.0, regParam=0.01, maxIter=100, fitIntercept=False, standardization=False)
model = lr.fit(df_one_hot_train)
print model.coefficients.toArray().astype(float).tolist()
I only get a sparse list like:
[0,0,0,0,0,..,-0.0871650387514,..,]
While when I use the sklearn.linear_model.LogisticRegression model, I can get a detailed list without zero values in the coef_ list, like:
[0.03098372361467529,-0.13709075166114365,-0.15069548597557908,-0.017968044053830862]
With the better performance of Spark, I could finish my work faster. I just want to use the L1 penalty for feature selection.
I think I should use more detailed values of the coefficients for my feature selection work, just as sklearn does. How can I solve my problem?
Below is a working code snippet in Spark 2.1.
The key to extracting the values is:
stages(4).asInstanceOf[LinearRegressionModel]
Spark 1.6 may have something similar.
val holIndIndexer = new StringIndexer().setInputCol("holInd").setOutputCol("holIndIndexer")
val holIndEncoder = new OneHotEncoder().setInputCol("holIndIndexer").setOutputCol("holIndVec")
val time_intervaLEncoder = new OneHotEncoder().setInputCol("time_interval").setOutputCol("time_intervaLVec")
val assemblerL1 = new VectorAssembler()
  .setInputCols(Array("time_intervaLVec", "holIndVec", "length"))
  .setOutputCol("features")
val lrL1 = new LinearRegression().setFeaturesCol("features").setLabelCol("travel_time")
val pipelineL1 = new Pipeline().setStages(Array(holIndIndexer, holIndEncoder, time_intervaLEncoder, assemblerL1, lrL1))
val modelL1 = pipelineL1.fit(dfTimeMlFull)
val l1Coeff = modelL1.stages(4).asInstanceOf[LinearRegressionModel].coefficients
println(l1Coeff)
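Back in PySpark terms (which the question uses): once the L1 penalty is actually applied (elasticNetParam=1.0 together with a non-zero regParam, as in the question's own snippet), the zeros in the coefficient vector are the point of the lasso; they mark the dropped features. A sketch of pulling out the surviving features by index, assuming a hypothetical train_df with "features" and "label" columns:
from pyspark.ml.regression import LinearRegression

# Pure L1 (lasso) penalty: elasticNetParam=1.0, strength controlled by regParam.
lr = LinearRegression(featuresCol="features", labelCol="label",
                      elasticNetParam=1.0, regParam=0.01)
model = lr.fit(train_df)  # train_df is a hypothetical training DataFrame

# L1 drives the coefficients of unselected features to exactly zero, so the
# selected features are simply the indices with non-zero coefficients.
coefs = model.coefficients.toArray()
selected = [(i, c) for i, c in enumerate(coefs) if c != 0.0]
print(selected)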
