Spark RDD: lookup from other RDD - apache-spark

I am trying to perform the quickest lookup possible in Spark, as part of some practice rolling-my-own association rules module. Please note that I know the metric below, confidence, is supported in PySpark. This is just an example -- another metric, lift, is not supported, yet I intend to use the results from this discussion to develop that.
As part of calculating the confidence of a rule, I need to look at how often the antecedent and consequent occur together, as well as how often the antecedent occurs across the whole transaction set (in this case, rdd).
from itertools import combinations, chain
def powerset(iterable, no_empty=True):
''' Produce the powerset for a given iterable '''
s = list(iterable)
combos = (combinations(s, r) for r in range(len(s)+1))
powerset = chain.from_iterable(combos)
return (el for el in powerset if el) if no_empty else powerset
# Set-up transaction set
rdd = sc.parallelize(
[
('a',),
('a', 'b'),
('a', 'b'),
('b', 'c'),
('a', 'c'),
('a', 'b'),
('b', 'c'),
('c',),
('b'),
]
)
# Create an RDD with the counts of each
# possible itemset
counts = (
rdd
.flatMap(lambda x: powerset(x))
.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x + y)
.map(lambda x: (frozenset(x[0]), x[1]))
)
# Function to calculate confidence of a rule
confidence = lambda x: counts.lookup(frozenset(x)) / counts.lookup((frozenset(x[1]),))
confidence_result = (
rdd
# Must be applied to length-two and greater itemsets
.filter(lambda x: len(x) > 1)
.map(confidence)
)
For those familiar with this type of lookup problem, you'll know that this type of Exception is raised:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
One way to get around this exception is to convert counts to a dictionary:
counts = dict(counts.collect())
confidence = lambda x: (x, counts[frozenset(x)] / counts[frozenset(x[1])])
confidence_result = (
rdd
# Must be applied to length-two and greater itemsets
.filter(lambda x: len(x) > 1)
.map(confidence)
)
Which gives me my result. But the process of running counts.collect is very expensive, since in reality I have a dataset with 50m+ records. Is there a better option for performing this type of lookup?

If your target metric can be independently calculated on each RDD partition and then combined to achieve the target result, you can use mapPartitions instead of map when calculating your metric.
The generic flow should be something like:
metric_result = (
rdd
# apply your metric calculation independently on each partition
.mapPartitions(confidence_partial)
# collect results from the partitions into a single list of results
.collect()
# reduce the list to combine the metrics calculated on each partition
.reduce(confidence_combine)
)
Both confidence_partial and confidence_combine are regular python function that take an iterator/list input.
As an aside, you would probably get a huge performance boost by using dataframe API and native expression functions to calculate your metric.

Related

Operate along a dimension without writing back data in dask array on Xarray

I have a dataset with 3 dimensions ('time', 'x' and 'y'). I want to apply this function foo along the time dimension:
def foo(arr):
lo, hi = np.percentile(arr, (1,99))
arr = np.clip(arr, lo, hi)
arr = (arr - lo) / (hi - lo)
return arr
Basically, I want to scale each "image" in the time dimension with a function like foo
In numpy, I could just do something like:
for i in range(data.shape[0]):
data[i] = foo(data[i])
but since the data is stored in dask arrays, I am unable to write back the modified data. I hit this error:
TypeError: this variable's data is stored in a dask array, which does not support item assignment. To assign to this variable, you must first load it into memory explicitly using the .load() method or accessing its .values attribute.
How would one go about doing this in xarray/dask?
There is no need to loop over the time dimension, you can do this in a vectorized way:
da = xr.tutorial.open_dataset(
"air_temperature", chunks={"lat": -1, "lon": -1, "time": 10}
)["air"]
def scale_image(da, quantiles):
quantiles = da.quantile(quantiles, dim=("lat", "lon"))
lower = quantiles.isel(quantile=0, drop=True)
upper = quantiles.isel(quantile=1, drop=True)
clipped = xr.apply_ufunc(np.clip, da, lower, upper, dask="allowed")
return (clipped - lower) / (upper - lower)
scaled = scale_image(da, quantiles=[0.01, 0.99])
Like this it is not necessary to load the whole array into memory.
I realized that you can use xarray's apply_ufunc directly with your foo function as well, if you provide the axis argument to np.percentile and take care of making the array shapes consistent.
It seems, that the dask-version of the percentile function is not implemented for multi-dimensional arrays, but you can use the parallelized option for apply_ufunc to make it work with the numpy function:
def foo(arr):
lo, hi = np.percentile(arr, (1, 99), axis=[1, 2])
arr = np.clip(arr, lo[:, None, None], hi[:, None, None])
return (arr - lo[:, None, None]) / (hi[:, None, None] - lo[:, None, None])
scaled2 = xr.apply_ufunc(foo, da, dask="parallelized")

Can I chain groupByKey calls on pair_rdd in Pyspark?

Is it possible in Pyspark to chain a groupByKey() call on a pair_rdd twice?
I have two levels of keys I want to group by before I aggregate by creating a special list of all values.
Here's my code. First groupByKey() call groups by the outer key and is then given to a map function in which I hope to turn the resultIterable object into a pair_rdd again so I can do the second groupByKey() and map my function to it.
(Since I'm reducing I guess I could also use reduceByKey() there?)
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.appName("test")\
.master("local")\
.config('spark.sql.shuffle.partitions', '4')\
.getOrCreate()
sc = spark.sparkContext
def group_by(ws):
L = ws[0]
E = ...ws[1]... <-- Do something here to turn this from resultIterable to Pair_RDD
rr = E.groupByKey().map(output_lists)
return (L, rr)
def output_lists(ws):
el = [e[0] for e in ws[1]]
res = [ws[0]] + el
return (ws[0], res)
input_data = (('A', ('G', ('xyz',))),
('A', ('G', ('xys',))),
('A', ('H', ('asd',))),
('B', ('K', ('qwe',))),
('B', ('K', ('wer',))))
data = sc.parallelize(input_data)
data = data.groupByKey().map(group_by)
print(data.take(5))
Now, is this even doable or do I need a different approach.
I know two other ways around:
Concatenate both keys into one.
Use a SparkSQL dataframe.
But I'm curious if there is a way with the above approach as I'm still learning Spark.
I found out I can use tuples as keys in pair RDDs. Remapping my input data like this means only one groupByKey() is needed and the problem can be solved:
input_data = ((('A', 'G'), 'xyz'),
(('A', 'G'), 'xys'),
(('A', 'H'), 'asd'),
(('B', 'K'), 'qwe'),
(('B', 'K'), 'wer'))

Python Spark combineByKey Average

I'm trying to learn Spark in Python, and am stuck with combineByKey for averaging the values in key-value pairs. In fact, my confusion is not with the combineByKey syntax, but what comes afterward. The typical example (from the O'Rielly 2015 Learning Spark Book) can be seen on the web in many places; here's one.
The problem is with the sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count)).collectAsMap() statement. Using spark 2.0.1 and iPython 3.5.2, this throws a syntax error exception. Simplifying it to something that should work (and is what's in the O'Reilly book): sumCount.map(lambda key,vals: (key, vals[0]/vals[1])).collectAsMap() causes Spark to go bats**t crazy with java exceptions, but I do note a TypeError: <lambda>() missing 1 required positional argument: 'v' error.
Can anyone point me to an example of this functionality that actually works with a recent version of Spark & Python? For completeness, I've included my own minimum working (or rather, non-working) example:
In: pRDD = sc.parallelize([("s",5),("g",3),("g",10),("c",2),("s",10),("s",3),("g",-1),("c",20),("c",2)])
In: cbk = pRDD.combineByKey(lambda x:(x,1), lambda x,y:(x[0]+y,x[1]+1),lambda x,y:(x[0]+y[0],x[1]+y[1]))
In: cbk.collect()
Out: [('s', (18, 3)), ('g', (12, 3)), ('c', (24, 3))]
In: cbk.map(lambda key,val:(k,val[0]/val[1])).collectAsMap() <-- errors
It's easy enough to compute [(e[0],e[1][0]/e[1][1]) for e in cbk.collect()], but I'd rather get the "Sparkic" way working.
Step by step:
lambda (key, (totalSum, count)): ... is so-called Tuple Parameter Unpacking which has been removed in Python.
RDD.map takes a function which expect as single argument. Function you try to use:
lambda key, vals: ...
Is a function which expects two arguments, not a one. A valid translation from 2.x syntax would be
lambda key_vals: (key_vals[0], key_vals[1][0] / key_vals[1][1])
or:
def get_mean(key_vals):
key, (total, cnt) = key_vals
return key, total / cnt
cbk.map(get_mean)
You can also make this much simpler with mapValues:
cbk.mapValues(lambda x: x[0] / x[1])
Finally a numerically stable solution would be:
from pyspark.statcounter import StatCounter
(pRDD
.combineByKey(
lambda x: StatCounter([x]),
StatCounter.merge,
StatCounter.mergeStats)
.mapValues(StatCounter.mean))
Averaging over a specific column value can be done by using the Window concept. Consider the following code:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([('a', 2), ('b', 3), ('a', 6), ('b', 5)],
['a', 'i'])
win = Window.partitionBy('a')
df.withColumn('avg', F.avg('i').over(win)).show()
Would yield:
+---+---+---+
| a| i|avg|
+---+---+---+
| b| 3|4.0|
| b| 5|4.0|
| a| 2|4.0|
| a| 6|4.0|
+---+---+---+
The average aggregation is done on each worker separately, requires no round trip to the host, and therefore efficient.

Find sum of second values in key/value pair

just started with PySpark
I have a key/value pair like following (key,(value1,value2))
I'd like to find a sum of value2 for each key
example of input data
(22, (33, 17.0)),(22, (34, 15.0)),(20, (3, 5.5)),(20, (11, 0.0))
Thanks !
At the end I created a new RDD contains key,value2 only , then just sum values of the new RDD
sumRdd = rdd.map(lambda (x, (a, b)): (x, b))\
.groupByKey().mapValues(sum).collect()
If you would like to benefit from combiner this would be a better choice.
from operator import add
sumRdd = rdd.map(lambda (x, (a, b)): (x, b)).reduceByKey(add)

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows:
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with one column labeled features which is a DenseVector of 3 dimensions:
data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark - see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2; original answer below is still applicable for older Spark versions.]
Well, it seems incredible, but indeed there is not a way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints" - see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.
Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline 'by hand' as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-), so as to work with dataframes as inputs (instead of RDD's), of the same format as yours (i.e. Rows of DenseVectors containing the numerical features).
We first need to define an intermediate function, estimatedCovariance, as follows:
import numpy as np
def estimateCovariance(df):
"""Compute the covariance matrix for a given dataframe.
Note:
The multi-dimensional covariance array should be calculated using outer products. Don't
forget to normalize the data by first subtracting the mean.
Args:
df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.
Returns:
np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
length of the arrays in the input dataframe.
"""
m = df.select(df['features']).map(lambda x: x[0]).mean()
dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x-m) # subtract the mean
return dfZeroMean.map(lambda x: np.outer(x,x)).sum()/df.count()
Then, we can write a main pca function as follows:
from numpy.linalg import eigh
def pca(df, k=2):
"""Computes the top `k` principal components, corresponding scores, and all eigenvalues.
Note:
All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
each eigenvectors as a column. This function should also return eigenvectors as columns.
Args:
df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
k (int): The number of principal components to return.
Returns:
tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
rows equals the length of the arrays in the input `RDD` and the number of columns equals
`k`. The `RDD` of scores has the same number of rows as `data` and consists of arrays
of length `k`. Eigenvalues is an array of length d (the number of features).
"""
cov = estimateCovariance(df)
col = cov.shape[1]
eigVals, eigVecs = eigh(cov)
inds = np.argsort(eigVals)
eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]
components = eigVecs[0:k]
eigVals = eigVals[inds[-1:-(col+1):-1]] # sort eigenvals
score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
# Return the `k` principal components, `k` scores, and all eigenvalues
return components.T, score, eigVals
Test
Let's see first the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()
[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]
Then, with our method:
comp, score, eigVals = pca(df)
score.collect()
[array([ 1.64857282, 4.0132827 ]),
array([-4.64510433, 1.11679727]),
array([-6.42888054, 5.33795143])]
Let me stress that we don't use any collect() methods in the functions we have defined - score is an RDD, as it should be.
Notice that the signs of our second column are all opposite from the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382
Each principal component loading vector is unique, up to a sign flip. This
means that two different software packages will yield the same principal
component loading vectors, although the signs of those loading vectors
may differ. The signs may differ because each principal component loading
vector specifies a direction in p-dimensional space: flipping the sign has no
effect as the direction does not change. [...] Similarly, the score vectors are unique
up to a sign flip, since the variance of Z is the same as the variance of −Z.
Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of the variance explained:
def varianceExplained(df, k=1):
"""Calculate the fraction of variance explained by the top `k` eigenvectors.
Args:
df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
k: The number of principal components to consider.
Returns:
float: A number between 0 and 1 representing the percentage of variance explained
by the top `k` eigenvectors.
"""
components, scores, eigenvalues = pca(df, k)
return sum(eigenvalues[0:k])/sum(eigenvalues)
varianceExplained(df,1)
# 0.79439325322305299
As a test, we also check if the variance explained in our example data is 1.0, for k=5 (since the original data are 5-dimensional):
varianceExplained(df,5)
# 1.0
[Developed & tested with Spark 1.5.0 & 1.5.1]
EDIT :
PCA and SVD are finally both available in pyspark starting spark 2.2.0 according to this resolved JIRA ticket SPARK-6227.
Original answer:
The answer given by #desertnaut is actually excellent from a theoretical perspective, but I wanted to present another approach on how to compute the SVD and to extract then eigenvectors.
from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix
class SVD(JavaModelWrapper):
"""Wrapper around the SVD scala case class"""
#property
def U(self):
""" Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
u = self.call("U")
if u is not None:
return RowMatrix(u)
#property
def s(self):
"""Returns a DenseVector with singular values in descending order."""
return self.call("s")
#property
def V(self):
""" Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
return self.call("V")
This defines our SVD object. We can define now our computeSVD method using the Java Wrapper.
def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
"""
Computes the singular value decomposition of the RowMatrix.
The given row matrix A of dimension (m X n) is decomposed into U * s * V'T where
* s: DenseVector consisting of square root of the eigenvalues (singular values) in descending order.
* U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A')
* v: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
:param k: number of singular values to keep. We might return less than k if there are numerically zero singular values.
:param computeU: Whether of not to compute U. If set to be True, then U is computed by A * V * sigma^-1
:param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
:returns: SVD object
"""
java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
return SVD(java_model)
Now, let's apply that to an example :
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
features = model.transform(df) # this create a DataFrame with the regular features and pca_features
# We can now extract the pca_features to prepare our RowMatrix.
pca_features = features.select("pca_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
# Once the RowMatrix is ready we can compute our Singular Value Decomposition
svd = computeSVD(mat,2,True)
svd.s
# DenseVector([9.491, 4.6253])
svd.U.rows.collect()
# [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
svd.V
# DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)
In spark 2.2+ you can now easily get the explained variance as:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=<columns of your original dataframe>, outputCol="features")
df = assembler.transform(<your original dataframe>).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
sum(model.explainedVariance)
The easiest answer to your question is to input an identity matrix to your model.
identity_input = [(Vectors.dense([1.0, .0, 0.0, .0, 0.0]),),(Vectors.dense([.0, 1.0, .0, .0, .0]),), \
(Vectors.dense([.0, 0.0, 1.0, .0, .0]),),(Vectors.dense([.0, 0.0, .0, 1.0, .0]),),
(Vectors.dense([.0, 0.0, .0, .0, 1.0]),)]
df_identity = sqlContext.createDataFrame(identity_input,["features"])
identity_features = model.transform(df_identity)
This should give you principle components.
I think eliasah's answer is better in terms of Spark framework because desertnaut is solving the problem by using numpy's functions instead of Spark's actions. However, eliasah's answer is missing normalizing the data. So, I'd add the following lines to eliasah's answer:
from pyspark.ml.feature import StandardScaler
standardizer = StandardScaler(withMean=True, withStd=False,
inputCol='features',
outputCol='std_features')
model = standardizer.fit(df)
output = model.transform(df)
pca_features = output.select("std_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
svd = computeSVD(mat,5,True)
Evantually, svd.V and identity_features.select("pca_features").collect() should have identical values.
I have summarized PCA and its use in Spark and sklearn in this blog post.

Resources