Spark PCA top components - apache-spark

In the Spark MLlib documentation for Dimensionality Reduction there is a section about PCA that describes how to use PCA in Spark. The computePrincipalComponents method requires a parameter that determines the number of top components we want.
The problem is that I don't know how many components I want, other than as few as possible. In some other tools PCA gives us a table showing that if, for example, we choose the top 3 components we'll cover 95 percent of the variance in the data. Does Spark have this functionality in its libraries, and if it doesn't, how can I implement it in Spark?

Spark 2.0+:
This should be available out-of-the box. See SPARK-11530 for details.
Spark <= 1.6
Spark doesn't provide this functionality yet, but it is not hard to implement using existing Spark code and the definition of explained variance. Let's say we want to explain 75 percent of the total variance:
val targetVar = 0.75
First let's reuse Spark code to compute the SVD:
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, svd => brzSvd}
import breeze.linalg.accumulate
import java.util.Arrays
// Compute covariance matrix
val cov = mat.computeCovariance()
// Compute SVD
val brzSvd.SVD(u: BDM[Double], e: BDV[Double], _) = brzSvd(
new BDM(cov.numRows, cov.numCols, cov.toArray))
Next we can find fraction of explained variance:
val varExplained = accumulate(e).map(x => x / e.toArray.sum).toArray
and number of components we have to get
val (v, k) = varExplained.zipWithIndex.filter{
case (v, _) => v >= targetVar
}.head
Finally we can subset U once again reusing Spark code:
val n = mat.numCols.toInt
Matrices.dense(n, k + 1, Arrays.copyOfRange(u.data, 0, n * (k + 1)))
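The selection rule used above does not depend on Spark: given the eigenvalues of the covariance matrix in descending order, it keeps the smallest number of leading components whose cumulative share of the variance reaches the target. A minimal Python sketch of just that step follows; the function name components_for_variance and the sample eigenvalues are made up for illustration. Note that it returns a count, while the Scala code above finds a zero-based index k and therefore keeps k + 1 columns.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target_var):
    """Smallest number of leading components whose cumulative
    explained-variance ratio is at least target_var.

    Assumes the eigenvalues are non-negative and sorted in descending
    order, as the singular values of a covariance matrix are.
    """
    total = sum(eigenvalues)
    for count, cumulative in enumerate(accumulate(eigenvalues), start=1):
        if cumulative / total >= target_var:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues whose variance shares are 0.5, 0.3, 0.15 and 0.05.
num_components = components_for_variance([5.0, 3.0, 1.5, 0.5], 0.75)
print(num_components)
```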

Related

how to get some element from each group after use groupby in spark

I have a Spark RDD. Let's suppose it has 1000 elements that can be grouped into 10 groups. What I want to do is select the 2 elements in each group that meet my special requirement, and then get a new RDD with 20 elements.
Suppose the RDD data looks like this:
((1,a1),
(1,a2),
(1,a3),
...
(1,a100),
(2,b1),
(2,b2),
(2,b3)
...
(2,b100))
What I want is
((1,a1),
(1,a99),
(2,b1),
(2,b99)
)
and I select a1, a99, b1 and b99 with a function called my_func.
I think the code may be something like:
myrdd.groupBy(x => x._1)....(my_func)...
Not convinced you need groupBy. Not sure of the structure of your RDD.
This uses my own contrived data, so you will need to adapt it:
// Gen some data. My data. Adapt to yours.
val rdd = spark.sparkContext.parallelize(Seq((1, "x"), (2, "y"), (3, "z"), (4, "z"), (5, "bbb") ))
// Compare list.
val l = List("x", "y", "z")
// Function to filter, could be inline or via mapPartitions.
def my_function(l: List[String], r: RDD[(Int, String)]) = {
r.map(x => (x._1, x._2)).filter(x => l.contains(x._2))
}
// Run it all.
val rdd2 = my_function(l,rdd)
rdd2.collect
returns:
res24: Array[(Int, String)] = Array((1,x), (2,y), (3,z), (4,z))
I strongly discourage you from using groupBy() or even mapPartitions() on a big dataset when you subsequently need to aggregate the data. The purpose of the RDD and MapReduce programming model is to distribute computation: computing the max/min/sum etc. in the driver or on a single node means you are effectively using Spark only as a storage layer (the HDFS part).
Besides, there are many ways to perform your task, but trying to find a single pattern that fits every type of aggregation you need is the wrong approach and will inevitably make your code inefficient.
Here is a possible PySpark solution for the problem you have:
rdd.reduceByKey(lambda x, y: x if x < y else y)\
.union(rdd.reduceByKey(lambda x, y: x if x > y else y)).sortByKey().collect()
In the first reduceByKey I find the smallest value for each key and in the second one the biggest value for each key. Then I can union them and, if necessary, sort the resulting RDD to obtain the result you showed us.
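To see what the two reduceByKey calls produce without a cluster, here is a small local sketch. The helper reduce_by_key is a hypothetical stand-in written for this example, not part of PySpark, and the sample pairs are invented. Keep in mind that strings compare lexicographically, so with keys like a99 and a100 a real my_func would need its own ordering.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: fold the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


# Invented sample data in the same (group, element) shape as the question.
pairs = [(1, "a1"), (1, "a3"), (1, "a2"), (2, "b2"), (2, "b1"), (2, "b3")]

smallest = reduce_by_key(pairs, lambda x, y: x if x < y else y)
largest = reduce_by_key(pairs, lambda x, y: x if x > y else y)

# Equivalent of union(...).sortByKey().collect() on this tiny dataset.
result = sorted(smallest + largest)
print(result)
```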

Computing PageRank on a digraph with edge weights using GraphFrames

Assume I use GraphFrames to construct a digraph g whose edge weights are positive real numbers. I would then like to compute PageRank taking the edge weights into account. I don't see how this can be achieved from the documentation for graphframes.GraphFrame.pageRank. Calling results = g.pageRank(resetProbability=0.15, maxIter=10) will compute the PageRank, but as far as I can tell it assumes edge weights of 1. Am I correct?
Compare this to networkx.algorithms.link_analysis.pagerank_alg.pagerank which allows for computing PageRank on a digraph with edge weights, see documentation.
Thanks for reading and any help is appreciated.
I think we can probably 'flatten' the data first, repeating each edge as many times as its (integer) weight:
import org.apache.spark.sql.functions.{col, explode, udf}
val df = Seq((1,2,3),(2,3,4),(3,4,1)).toDF("src", "dst", "weight")
val getArray = udf[Seq[Int], Int] { x => (1 to x).toList }
val flatDf = df
  .withColumn("dummy1", getArray(col("weight")))
  .withColumn("dummy2", explode(col("dummy1")))
  .select("src", "dst")
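The flattening idea can be checked locally with plain Python. The sketch below uses a made-up helper, flatten_weighted_edges, and the same three edges as the DataFrame above. It only makes sense for positive integer weights, so the positive real weights from the question would first have to be scaled and rounded, and it assumes the PageRank implementation counts parallel edges separately.

```python
from collections import Counter


def flatten_weighted_edges(edges):
    """Repeat each (src, dst) pair `weight` times.

    Hypothetical helper for illustration; weights must be positive integers.
    """
    flat = []
    for src, dst, weight in edges:
        if not isinstance(weight, int) or weight < 1:
            raise ValueError("this trick only works for positive integer weights")
        flat.extend([(src, dst)] * weight)
    return flat


edges = [(1, 2, 3), (2, 3, 4), (3, 4, 1)]
flat_edges = flatten_weighted_edges(edges)

# Each edge now appears as many times as its original weight.
edge_counts = Counter(flat_edges)
print(edge_counts)
```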

How to find the median salary of employees in a department [PySPark]? [duplicate]

How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median.
This question is similar to this question. However, the answer to the question is using Scala, which I do not know.
How can I calculate exact median with Apache Spark?
Using the thinking for the Scala answer, I am trying to write a similar answer in Python.
I know I first want to sort the RDD. I do not know how. I see the sortBy (Sorts this RDD by the given keyfunc) and sortByKey (Sorts this RDD, which is assumed to consist of (key, value) pairs.) methods. I think both use key value and my RDD only has integer elements.
First, I was thinking of doing myrdd.sortBy(lambda x: x)?
Next I will find the length of the rdd (rdd.count()).
Finally, I want to find the element or 2 elements at the center of the rdd. I need help with this method too.
EDIT:
I had an idea. Maybe I can index my RDD and then key = index and value = element. And then I can try to sort by value? I don't know if this is possible because there is only a sortByKey method.
Ongoing work
SPARK-30569 - Add DSL functions invoking percentile_approx
Spark 2.0+:
You can use approxQuantile method which implements Greenwald-Khanna algorithm:
Python:
df.approxQuantile("x", [0.5], 0.25)
Scala:
df.stat.approxQuantile("x", Array(0.5), 0.25)
where the last parameter is a relative error. The lower the number the more accurate results and more expensive computation.
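To get a feel for what the relative error means, the Spark documentation states the guarantee in terms of ranks: for N rows, probability p and error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below just evaluates that bound in plain Python; the helper name acceptable_rank_range is made up for this example.

```python
import math


def acceptable_rank_range(n, p, relative_error):
    """Range of ranks an approximate quantile may have, following the bound
    floor((p - err) * n) <= rank <= ceil((p + err) * n), clamped to [0, n]."""
    low = max(math.floor((p - relative_error) * n), 0)
    high = min(math.ceil((p + relative_error) * n), n)
    return low, high


# With 700,000 rows and error 0.25, the "median" may come from anywhere
# between roughly the 25th and the 75th percentile.
print(acceptable_rank_range(700000, 0.5, 0.25))
# With error 0 the rank is pinned down exactly.
print(acceptable_rank_range(700000, 0.5, 0.0))
```

So an error of 0.25 is fine for a demonstration but very loose for a median; lower it when accuracy matters.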
Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:
df.approxQuantile(["x", "y", "z"], [0.5], 0.25)
and
df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)
Underlying methods can also be used in SQL aggregation (both global and grouped) using the approx_percentile function:
> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
[10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
10.0
Spark < 2.0
Python
As I've mentioned in the comments it is most likely not worth all the fuss. If data is relatively small like in your case then simply collect and compute median locally:
import numpy as np
np.random.seed(323)
rdd = sc.parallelize(np.random.randint(1000000, size=700000))
%time np.median(rdd.collect())
np.array(rdd.collect()).nbytes
It takes around 0.01 second on my few years old computer and around 5.5MB of memory.
If data is much larger, sorting will be a limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn't mess up anything):
from numpy import floor
import time
def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1]
    :rdd a numeric rdd
    :p quantile (between 0 and 1)
    :sample fraction of an rdd to use. If not provided we use the whole dataset
    :seed random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1
    seed = seed if seed is not None else int(time.time())
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)
    # Python 3 has no tuple-unpacking lambdas, so swap (x, i) by index
    rddSortedWithIndex = (rdd.
        sortBy(lambda x: x).
        zipWithIndex().
        map(lambda xi: (xi[1], xi[0])).
        cache())
    n = rddSortedWithIndex.count()
    h = (n - 1) * p
    lo = int(floor(h))
    hi = min(lo + 1, n - 1)  # clamp so that p = 1 does not look past the last element
    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(i)[0]
        for i in (lo, hi))
    return rddX + (h - floor(h)) * (rddXPlusOne - rddX)
And some tests:
np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)
Finally let's define median:
from functools import partial
median = partial(quantile, p=0.5)
So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?
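The interpolation formula used by quantile can be checked on a plain list, without Spark. This is only a local sketch of the same arithmetic (sort, take the two neighbouring order statistics, interpolate linearly); the name quantile_local is made up for this example.

```python
import math


def quantile_local(values, p):
    """Quantile of order p with linear interpolation between the two
    nearest order statistics, mirroring the formula of the RDD version."""
    if not 0 <= p <= 1:
        raise ValueError("p must be in [0, 1]")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("cannot compute a quantile of an empty sequence")
    h = (n - 1) * p
    lower = math.floor(h)
    upper = min(lower + 1, n - 1)  # stay inside the list when p = 1
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])


print(quantile_local([1, 2, 3, 4], 0.5))  # even count: average of 2 and 3
print(quantile_local([3, 1, 2], 0.5))     # odd count: the middle element
```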
Language independent (Hive UDAF):
If you use HiveContext you can also use Hive UDAFs. With integral values:
rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")
sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")
With continuous values:
sqlContext.sql("SELECT percentile(x, 0.5) FROM df")
In percentile_approx you can pass an additional argument which determines a number of records to use.
Here is the method I used using window functions (with pyspark 2.2.0).
from pyspark.sql import DataFrame
class median():
    """ Create median class with over method to pass partition """
    def __init__(self, df, col, name):
        assert col
        self.column=col
        self.df = df
        self.name = name
    def over(self, window):
        from pyspark.sql.functions import percent_rank, pow, first
        first_window = window.orderBy(self.column) # first, order by column we want to compute the median for
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window)) # add percent_rank column, percent_rank = 0.5 corresponds to median
        second_window = window.orderBy(pow(df.percent_rank-0.5, 2)) # order by (percent_rank - 0.5)^2 ascending
        return df.withColumn(self.name, first(self.column).over(second_window)) # the first row of the window corresponds to median
def addMedian(self, col, median_name):
    """ Method to be added to spark native DataFrame class """
    return median(self, col, median_name)
# Add method to DataFrame class
DataFrame.addMedian = addMedian
Then call the addMedian method to calculate the median of col2:
from pyspark.sql import Window
median_window = Window.partitionBy("col1")
df = df.addMedian("col2", "median").over(median_window)
Finally you can group by if needed.
df.groupby("col1", "median")
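One property of the percent_rank approach is easier to see on a plain list: it picks the existing value whose percent_rank, (rank - 1) / (n - 1), is closest to 0.5, so for an even number of rows it returns one of the two middle values rather than their average. A local sketch, with a made-up function name and no Spark involved:

```python
def median_by_percent_rank(values):
    """Value whose percent_rank, (rank - 1) / (n - 1), is closest to 0.5."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("empty input")
    if n == 1:
        return ordered[0]

    def percent_rank(position):
        # Ties share the lowest rank, as with SQL rank().
        rank = ordered.index(ordered[position]) + 1
        return (rank - 1) / (n - 1)

    best = min(range(n), key=lambda i: (percent_rank(i) - 0.5) ** 2)
    return ordered[best]


print(median_by_percent_rank([5, 1, 3]))     # odd count: the true median
print(median_by_percent_rank([1, 2, 3, 4]))  # even count: 2 or 3, never 2.5
```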
Adding a solution if you want an RDD-only method and don't want to move to a DataFrame.
This snippet can get you a percentile for an RDD of double.
If you input percentile as 50, you should obtain your required median.
Let me know if there are any corner cases not accounted for.
/**
 * Gets the nth percentile entry for an RDD of doubles
 *
 * @param inputScore : Input scores consisting of an RDD of doubles
 * @param percentile : The percentile cutoff required (between 0 and 100), e.g. 90%ile of [1,4,5,9,19,23,44] = ~23.
 * It prefers the higher value when the desired quantile lies between two data points
 * @return : The number best representing the percentile in the RDD of doubles
 */
def getRddPercentile(inputScore: RDD[Double], percentile: Double): Double = {
  val numEntries = inputScore.count().toDouble
  // Clamp to the last valid index so that percentile = 100 does not fall off the end
  val retrievedEntry = (percentile * numEntries / 100.0).min(numEntries - 1).max(0).toInt
  inputScore
    .sortBy { case (score) => score }
    .zipWithIndex()
    .filter { case (_, index) => index == retrievedEntry }
    .map { case (score, _) => score }
    .collect()(0)
}
There are two ways to do this: the approxQuantile method and the percentile_approx function. However, both may return an inexact result when the number of records is even.
import pyspark.sql.functions as F
# df.select(F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5).alias("MEDIAN"))  # might not give proper results when there are even number of records
df.select(
    ((F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5)
      + F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.51)) * 0.5).alias("MEDIAN")
)
I have written a function that takes a DataFrame as input and returns a DataFrame with the median computed over a partition. order_col is the column whose median we want, and part_col is the list of columns defining the level at which the median is calculated:
from pyspark.sql import Window
import pyspark.sql.functions as F
def calculate_median(dataframe, part_col, order_col):
    win = Window.partitionBy(*part_col).orderBy(order_col)
    # count_row = dataframe.groupby(*part_col).distinct().count()
    dataframe.persist()
    dataframe.count()
    temp = dataframe.withColumn("rank", F.row_number().over(win))
    temp = temp.withColumn(
        "count_row_part",
        F.count(order_col).over(Window.partitionBy(part_col))
    )
    temp = temp.withColumn(
        "even_flag",
        F.when(
            F.col("count_row_part") %2 == 0,
            F.lit(1)
        ).otherwise(
            F.lit(0)
        )
    ).withColumn(
        "mid_value",
        F.floor(F.col("count_row_part")/2)
    )
    temp = temp.withColumn(
        "avg_flag",
        F.when(
            (F.col("even_flag")==1) &
            (F.col("rank") == F.col("mid_value"))|
            ((F.col("rank")-1) == F.col("mid_value")),
            F.lit(1)
        ).otherwise(
            F.when(
                F.col("rank") == F.col("mid_value")+1,
                F.lit(1)
            )
        )
    )
    temp.show(10)
    return temp.filter(
        F.col("avg_flag") == 1
    ).groupby(
        part_col + ["avg_flag"]
    ).agg(
        F.avg(F.col(order_col)).alias("median")
    ).drop("avg_flag")
For exact median computation you can use the following function and use it with PySpark DataFrame API:
from typing import Union
from pyspark.sql import Column
import pyspark.sql.functions as F
def median_exact(col: Union[Column, str]) -> Column:
    """
    For grouped aggregations, Spark provides a way via pyspark.sql.functions.percentile_approx("col", .5) function,
    since for large datasets, computing the median is computationally expensive.
    This function manually computes the median and should only be used for small to mid sized datasets / groupings.
    :param col: Column to compute the median for.
    :return: A pyspark `Column` containing the median calculation expression
    """
    list_expr = F.filter(F.collect_list(col), lambda x: x.isNotNull())
    sorted_list_expr = F.sort_array(list_expr)
    size_expr = F.size(sorted_list_expr)
    even_num_elements = (size_expr % 2) == 0
    odd_num_elements = ~even_num_elements
    return F.when(size_expr == 0, None).otherwise(
        F.when(odd_num_elements, sorted_list_expr[F.floor(size_expr / 2)]).otherwise(
            (
                sorted_list_expr[(size_expr / 2 - 1).cast("long")]
                + sorted_list_expr[(size_expr / 2).cast("long")]
            )
            / 2
        )
    )
Apply it like this:
output_df = input_spark_df.groupby("group").agg(
    median_exact("elems").alias("elems_median")
)
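The column expression built by median_exact follows the usual exact-median rule: drop nulls, sort, then take the middle element for an odd count or the mean of the two middle elements for an even count. A plain-Python sketch of the same rule, applied per group, may make the indexing easier to follow. The names median_exact_local and groups are invented for this example.

```python
def median_exact_local(values):
    """Exact median of a list, ignoring None, or None if nothing is left."""
    non_null = sorted(v for v in values if v is not None)
    size = len(non_null)
    if size == 0:
        return None
    middle = size // 2
    if size % 2 == 1:
        return non_null[middle]
    return (non_null[middle - 1] + non_null[middle]) / 2


# Invented groups standing in for groupby("group") on a small dataset.
groups = {"a": [3, None, 1, 2], "b": [4, 1, 3, 2], "c": [None]}
medians = {name: median_exact_local(elems) for name, elems in groups.items()}
print(medians)
```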
We can calculate the median and quantiles in spark using the following code:
df.stat.approxQuantile(col,[quantiles],error)
For example, finding the median in the following dataframe [1,2,3,4,5]:
df.stat.approxQuantile(col,[0.5],0)
The smaller the error, the more accurate the results.
From version 3.4+ (and also already in 3.3.1) the median function is directly available
https://github.com/apache/spark/blob/e170a2eb236a376b036730b5d63371e753f1d947/python/pyspark/sql/functions.py#L633
import pyspark.sql.functions as f
df.groupBy("grp").agg(f.median("val"))
I guess the respective documentation will be added once the version is finally released.

How to compute the dot product of two distributed RowMatrix in Apache Spark?

Let Q be a distributed RowMatrix in Spark. I want to calculate the cross product of Q with its transpose Q'.
However, although RowMatrix does have a multiply() method, it only accepts a local Matrix as an argument.
Code illustration ( Scala ):
val phi = new RowMatrix(phiRDD) // phiRDD is an instance of RDD[Vector]
val phiTranspose = transposeRowMatrix(phi) // transposeRowMatrix()
// returns the transpose of a RowMatrix
val crossMat = ? // phi * phiTranspose
Note that I want to perform the dot product of 2 Distributed RowMatrix not a distributed one with a local one.
One solution is to use an IndexedRowMatrix as following:
val phi = new IndexedRowMatrix(phiRDD) // phiRDD is an instance of RDD[IndexedRow]
val phiTranspose = transposeMatrix(phi) // transposeMatrix()
// returns the transpose of a Matrix
val crossMat = phi.toBlockMatrix().multiply( phiTranspose.toBlockMatrix()
).toIndexedRowMatrix()
However, I want to use RowMatrix methods such as tallSkinnyQR(), and this means that I should transform crossMat to a RowMatrix using the .toRowMatrix() method:
val crossRowMat = crossMat.toRowMatrix()
and finally I can apply
crossRowMat.tallSkinnyQR()
but this process includes many transformations between the types of the Distributed Matrices and according to what I understood from MLlib Programming Guide this is expensive:
It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive.
Would someone elaborate, please.
The only distributed matrices that support matrix-matrix multiplication are BlockMatrices. You have to convert your data accordingly; artificial indices are good enough:
new IndexedRowMatrix(
rowMatrix.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))
).toBlockMatrix match { case m => m.multiply(m.transpose) }
I used the algorithm listed on this page which moves the multiplication problem from dot product to distributed scalar product problem by using vectors outer product:
The outer product between two vectors is the scalar product of the
second vector with all the elements in the first vector, resulting in
a matrix
The multiplication function I wrote for RowMatrices (it could be optimized further) ended up like this:
def multiplyRowMatrices(m1: RowMatrix, m2: RowMatrix)(implicit ctx: SparkSession): RowMatrix = {
  // Zip m1 columns with m2 rows
  val m1Cm2R = transposeRowMatrix(m1).rows.zip(m2.rows)
  // Apply scalar product between each entry in m1 vector with m2 row
  val scalar = m1Cm2R.map{
    case(column:DenseVector,row:DenseVector) => column.toArray.map{
      columnValue => row.toArray.map{
        rowValue => columnValue*rowValue
      }
    }
  }
  // Add all the resulting matrices pointwise
  val sum = scalar.reduce{
    case(matrix1,matrix2) => matrix1.zip(matrix2).map{
      case(array1,array2)=> array1.zip(array2).map{
        case(value1,value2)=> value1+value2
      }
    }
  }
  new RowMatrix(ctx.sparkContext.parallelize(sum.map(array=> Vectors.dense(array))))
}
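The identity behind this function is that a matrix product equals the sum, over the shared dimension, of the outer products of each column of the first matrix with the matching row of the second. A small plain-Python sketch of that identity follows, with matrices as lists of rows; the function name is made up, and nothing here is distributed.

```python
def multiply_by_outer_products(a, b):
    """Compute a * b as a sum of outer products.

    a is n x k and b is k x m, both given as lists of rows.
    """
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    result = [[0.0] * m for _ in range(n)]
    for j in range(k):
        column = [row[j] for row in a]  # j-th column of a
        row_b = b[j]                    # j-th row of b
        # Add the outer product column x row_b to the running sum.
        for i in range(n):
            for c in range(m):
                result[i][c] += column[i] * row_b[c]
    return result


product = multiply_by_outer_products([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

In the Scala version above, each outer product is computed in a map over the zipped columns and rows, and the sum is the reduce step.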
After that I tested both approaches (my own function and the block matrix conversion) using a 300*10 matrix on a single machine.
Using my own function:
val PhiMat = new RowMatrix(phi)
val TphiMat = transposeRowMatrix(PhiMat)
val product = multiplyRowMatrices(PhiMat,TphiMat)
Using matrix transformation:
val MatRow = new RowMatrix(phi)
val MatBlock = new IndexedRowMatrix(MatRow.rows.zipWithIndex.map(x => IndexedRow(x._2, x._1))).toBlockMatrix()
val TMatBlock = MatBlock.transpose
val productMatBlock = MatBlock.multiply(TMatBlock)
val productMatRow = productMatBlock.toIndexedRowMatrix().toRowMatrix()
The first approach spawned 1 job with 5 stages and took 2s to finish in total, while the second approach spawned 4 jobs, three with one stage and one with two stages, and took 0.323s in total. The second approach also outperformed the first with respect to Shuffle Read/Write size.
Yet I am still confused by the MLlib Programming Guide statement:
It is very important to choose the right format to store large and
distributed matrices. Converting a distributed matrix to a different
format may require a global shuffle, which is quite expensive.

How does SGD works in Spark MLlib with Streaming

I am new to Spark and wonder what will happen in the background in terms of RDDs on the cluster if the trainOn method is invoked on a StreamingLinearAlgorithm e.g.
val regression = new StreamingLinearRegressionWithSGD()
regression.trainOn(someStream)
StreamingLinearRegressionWithSGD contains the actual model (LinearRegressionModel) as well as the learning algorithm (LinearRegressionWithSGD) as a protected member. In the trainOn method, the following happens:
data.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty) {
    model = Some(algorithm.run(rdd, model.get.weights))
  }
}
From my understanding, the streaming data is chunked every x seconds and the chunked data is put in an RDD. So every x seconds the learning algorithm is applied to a new RDD. The actual learning magic happens somewhere in GradientDescent, where the RDD data and the previous model weight vector is the input and the updated weight vector is the output.
val bcWeights = data.context.broadcast(weights)
// Sample a subset (fraction miniBatchFraction) of the total data
// compute and sum up the subgradients on this subset (this is one map-reduce)
val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
.treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
seqOp = (c, v) => {
// c: (grad, loss, count), v: (label, features)
val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
(c._1, c._2 + l, c._3 + 1)
},
combOp = (c1, c2) => {
// c: (grad, loss, count)
(c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
})
...
val update = updater.compute(
weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
stepSize, i, regParam)
weights = update._1
regVal = update._2
How is the weight vector updated? Does it work in parallel? I thought an RDD is split up into partitions and then distributed. I can see that the input weights are broadcast in the Spark context and that treeAggregate is used to sum up the gradients, but I still cannot grasp which actual map-reduce task is taking place, which part happens in the driver, and which part runs in parallel. Can anyone explain what is happening in detail?
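As a rough local illustration of the structure of the quoted code (not Spark's implementation), the sketch below models partitions as plain lists. In Spark, the seqOp of treeAggregate folds the points of each partition on the executors, the combOp merges the partial results, and the final sum comes back to the driver, where the weight update is computed. Everything here is an assumption made for the example: the names seq_op, comb_op and sgd_step, the squared-error loss, and the constant step size. Sampling, regularization and step-size decay are left out.

```python
from functools import reduce


def seq_op(acc, point, weights):
    """Fold one (label, features) point into a partition-local
    (gradient_sum, loss_sum, count) accumulator."""
    grad_sum, loss_sum, count = acc
    label, features = point
    error = sum(w * x for w, x in zip(weights, features)) - label
    grad_sum = [g + error * x for g, x in zip(grad_sum, features)]
    return grad_sum, loss_sum + 0.5 * error * error, count + 1


def comb_op(left, right):
    """Merge two partial accumulators."""
    return (
        [a + b for a, b in zip(left[0], right[0])],
        left[1] + right[1],
        left[2] + right[2],
    )


def sgd_step(partitions, weights, step_size):
    """One gradient step: per-partition folds, a merge, then the update."""
    zero = ([0.0] * len(weights), 0.0, 0)
    # In Spark this part would run in parallel, one task per partition.
    partials = [
        reduce(lambda acc, point: seq_op(acc, point, weights), part, zero)
        for part in partitions
    ]
    grad_sum, loss_sum, count = reduce(comb_op, partials, zero)
    # In Spark this part would run on the driver.
    new_weights = [w - step_size * g / count for w, g in zip(weights, grad_sum)]
    return new_weights, loss_sum / count


# Two invented partitions with one (label, features) point each.
partitions = [[(2.0, [1.0])], [(4.0, [2.0])]]
new_weights, mean_loss = sgd_step(partitions, [0.0], 0.1)
print(new_weights, mean_loss)
```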
