How to count all values in one key of a pyspark RDD?

In a PySpark RDD, 'predicted_value' is the key holding the results of a logistic regression, so it contains only 0 and 1.
I want to count the number of 0's and 1's in that field.
I tried:
Counter(rdd.groupByKey()['predicted_value'])
which gives
TypeError: 'PipelinedRDD' object is not subscriptable
What is the best way to do this?

You could also use countByValue():
sorted(rdd.map(lambda x: x['predicted_value']).countByValue().items())
#[(0, 580), (1, 420)]

It appears that this can be done as follows (using the Counter class from collections):
>>> Counter([i['predicted_value'] for i in rdd.collect()])
Counter({0: 580, 1: 420})
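For what it's worth, the original TypeError arises because groupByKey() returns another RDD, and an RDD cannot be indexed with ['predicted_value']. Also note that the collect() approach above pulls every record to the driver before counting. A minimal sketch of a variant that aggregates on the executors with map and reduceByKey is below; the function names count_field and count_field_locally are made up for illustration, and the records are assumed to be dict-like as in the question.

```python
from collections import Counter
from operator import add


def count_field(rdd, field="predicted_value"):
    """Count how often each value of `field` occurs in an RDD of dict-like records.

    The counting happens on the executors; only the small
    {value: count} result is brought back to the driver.
    """
    pairs = rdd.map(lambda record: (record[field], 1))
    return dict(pairs.reduceByKey(add).collect())


def count_field_locally(records, field="predicted_value"):
    """Same result for a plain Python iterable, useful for checking the logic without Spark."""
    return dict(Counter(record[field] for record in records))
```

With an RDD like the one in the question, count_field(rdd) should give the same counts as the Counter version, just without collecting the full data set first.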

Related

Calculate cosine similarity between two columns of Spark DataFrame in PySpark [duplicate]

I have a DataFrame in Spark in which one of the columns contains an array. I have written a separate UDF that converts the array to another array containing only its distinct values. See the example below:
Ex: [24,23,27,23] should get converted to [24, 23, 27]
Code:
def uniq_array(col_array):
    x = np.unique(col_array)
    return x
uniq_array_udf = udf(uniq_array,ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique",uniq_array_udf(Df2.age_array))
In the above code, Df2.age_array is the array on which I am applying the UDF to get a different column "age_array_unique" which should contain only unique values in the array.
However, as soon as I run the command Df3.show(), I get the error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can anyone please let me know why this is happening?
Thanks!

The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to keep order)
udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
    ArrayType(IntegerType()))
instead.
If you really want np.unique you have to convert the output:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
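As a sketch of the order-preserving variant written as a named function: the helper below removes duplicates, keeps first-seen order, and converts every element to a plain Python int so the result matches ArrayType(IntegerType()). The names unique_in_order and make_unique_udf are hypothetical, and pyspark is imported only inside the wrapper so the helper can be checked on its own.

```python
def unique_in_order(values):
    """Drop duplicates, keep first-seen order, and return plain Python ints."""
    if values is None:  # a null array column reaches the UDF as None
        return None
    # dict.fromkeys keeps insertion order (Python 3.7+); int() also
    # turns NumPy integer scalars into ordinary Python ints.
    return [int(v) for v in dict.fromkeys(values)]


def make_unique_udf():
    """Wrap the helper as a Spark UDF; pyspark is imported only when this is called."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    return udf(unique_in_order, ArrayType(IntegerType()))
```

Usage would then look like Df2.withColumn("age_array_unique", make_unique_udf()(Df2.age_array)), assuming the same Df2 as in the question.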

You need to convert the final value to a Python list. You can implement the function as follows:
def uniq_array(col_array):
    x = np.unique(col_array)
    return x.tolist()
This is because Spark doesn't understand the NumPy array format. To return a Python object that Spark DataFrames understand as an ArrayType, you need to convert the output to a Python list before returning it. Note that tolist() also converts the elements to plain Python ints, whereas list(x) would leave them as NumPy scalars.

I also got this error when my UDF returned a float but I forgot to cast it to a Python float. I needed to do this:
retval = 0.5
return float(retval)

As of PySpark 2.4, you can use the array_distinct function.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct

The following works fine for me:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))

[x.item() for x in <any numpy array>]
converts it to plain python.

Multiple if elif conditions to be evaluated for each row of pyspark dataframe

I need help with PySpark DataFrames.
I have a DataFrame with 1000+ columns and 100000+ rows. I also have 10000+ if/elif conditions, and under each condition a few global variables are incremented by some values.
My question is how I can achieve this in PySpark only.
I read about the filter and where functions, which return rows based on a condition, but I need to check those 10000+ if/else conditions and perform some manipulations.
Any help would be appreciated.
If you could give an example with a small dataset, that would be a great help.
Thank you

You can define a function that contains all of your if/elif conditions, then apply this function to each row of the DataFrame.
Just use .rdd to convert the DataFrame to a normal RDD, then use the map() function,
e.g. DF.rdd.map(lambda row: func(row))
Hope this helps.

As I understand it, you just want to update some global counters while iterating over your DataFrame. For this, you need to:
1) Define one or more accumulators:
ac_0 = sc.accumulator(0)
ac_1 = sc.accumulator(0)
2) Define a function to update your accumulators for a given row, e.g:
def accumulate(row):
    if row.foo:
        ac_0.add(1)
    elif row.bar:
        ac_1.add(row.baz)
3) Call foreach on your DataFrame:
df.foreach(accumulate)
4) Inspect the accumulator values
>>> ac_0.value
123
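If the counters only ever grow by per-row amounts, one alternative to accumulators is to compute each row's increments with a pure function and sum them with fold. The following is only a sketch under that assumption: it reuses the made-up columns foo, bar and baz from the answer above, the function names are hypothetical, and the namedtuple Row is a local stand-in for a Spark Row so the logic can be tried without a cluster.

```python
from collections import namedtuple

# Local stand-in for a Spark Row with the hypothetical columns foo, bar, baz.
Row = namedtuple("Row", ["foo", "bar", "baz"])


def increments(row):
    """Return (increment for counter 0, increment for counter 1) for one row."""
    if row.foo:
        return (1, 0)
    elif row.bar:
        return (0, row.baz)
    return (0, 0)


def add_pairs(a, b):
    """Element-wise sum of two (counter 0, counter 1) tuples."""
    return (a[0] + b[0], a[1] + b[1])


def total_increments(df):
    """Distributed version: df is a Spark DataFrame with columns foo, bar, baz."""
    return df.rdd.map(increments).fold((0, 0), add_pairs)
```

Because increments has no side effects, it can be unit-tested row by row, and the if/elif chain can grow without touching any shared state.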

How to apply function to each row of specified column of PySpark DataFrame

I have a PySpark DataFrame consisting of three columns, whose structure is shown below.
In[1]: df.take(1)
Out[1]:
[Row(angle_est=-0.006815859163590619, rwsep_est=0.00019571401752467945, cost_est=34.33651951754235)]
What I want to do is to retrieve each value of the first column (angle_est), and pass it as parameter xMisallignment to a defined function to set a particular property of a class object. The defined function is:
def setMisAllignment(self, xMisallignment):
    if np.abs(xMisallignment) > 0.8:
        warnings.warn('You might set misallignment angle too large.')
    self.MisAllignment = xMisallignment
I tried selecting the first column, converting it to an RDD, and applying the above function inside map(), but it does not seem to work; MisAllignment did not change.
df.select(df.angle_est).rdd.map(lambda row: model0.setMisAllignment(row))
In[2]: model0.MisAllignment
Out[2]: 0.00111511718224
Does anyone have ideas on how to make that function work? Thanks in advance!

You can register your function as a Spark UDF, along these lines:
spark.udf.register("misallign", setMisAllignment)
You can find many examples of creating and registering UDFs in this test suite:
https://github.com/apache/spark/blob/master/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDFSuite.java
Hope this answers your question.
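A note on why the map() call in the question has no visible effect: map() is a lazy transformation, so nothing runs until an action is called, and even then the lambda runs on executors against a serialized copy of model0, so changes never reach the object on the driver. The lambda also passes the whole Row rather than the angle_est value. If the object has to be updated on the driver, one option is to stream the column values back and call the setter there. The sketch below assumes exactly that; the Model class is a hypothetical stand-in for the asker's class, and apply_on_driver and angles_from are made-up helper names.

```python
import warnings


class Model:
    """Hypothetical stand-in for the asker's class."""

    def __init__(self):
        self.MisAllignment = 0.0

    def setMisAllignment(self, xMisallignment):
        if abs(xMisallignment) > 0.8:
            warnings.warn('You might set misallignment angle too large.')
        self.MisAllignment = xMisallignment


def apply_on_driver(model, angles):
    """Call the setter once per value, on the driver, and return the model."""
    for angle in angles:
        model.setMisAllignment(float(angle))
    return model


def angles_from(df, column="angle_est"):
    """Stream one column of a Spark DataFrame to the driver as plain values."""
    return (row[column] for row in df.select(column).toLocalIterator())
```

With the DataFrame from the question this would be called as apply_on_driver(model0, angles_from(df)); any per-value work that depends on the updated property would go inside the loop.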

Clustering data using DBSCAN and spark_sklearn

I want to cluster my input data using DBSCAN and spark_sklearn. I'd like to get the labels of each input instance after clustering. Is it possible?
Reading the documentation on http://pythonhosted.org/spark-sklearn, I tried the following:
temp_data = Spark DataFrame containing 'key' and 'features' columns,
where 'features' is a Vector.
ke = KeyedEstimator(sklearnEstimator=DBSCAN(), estimatorType="clusterer")
print ke.getOrDefault("estimatorType") --> "clusterer"
ke.fit_predict(temp_data) --> ERROR: 'KeyedEstimator' object has no attribute 'fit_predict'
k_model = ke.fit(temp_data)
print k_model.getOrDefault("estimatorType") --> "clusterer"
k_model.fit_predict(temp_data) --> ERROR: 'KeyedModel' object has no attribute 'fit_predict'
k_model.predict(temp_data) --> ERROR: 'KeyedModel' object has no attribute 'predict'
k_model.transform(temp_data) --> ERROR: estimatorType assumed to be a clusterer, but sklearnEstimator is missing fit_predict()
(NOTE: sklearn.cluster.DBSCAN actually has a fit_predict() method.)
What I normally do using sklearn (without Spark) is fit the model (dbscan_model.fit(temp_data-features)) and get the labels from it (labels = dbscan_model.labels_). It is also fine if I can get the 'labels_' attribute using spark-sklearn.
If the above-mentioned calls ('transform' or 'predict') don't work, is it possible to get 'labels_' after fitting the data using spark-sklearn? How can I do that? Assuming that we obtain 'labels_', how can I map the input instances to the labels? Do they have the same order?

This is only possible for clusterers such as KMeans, where cluster labels can be predicted because the scikit-learn estimator provides that functionality.
Unfortunately, this is not the case for some other clusterers, such as DBSCAN.

I've managed to get the 'labels_' attribute; however, I still don't know whether the order of the resulting labels is the same as that of the input instances.
temp_data = Spark DataFrame containing 'key' and 'features' columns,
where 'features' is a Vector.
ke = KeyedEstimator(sklearnEstimator=DBSCAN())
k_model = ke.fit(temp_data)
def getLabels(model):
    return model.estimator.labels_
labels_udf = udf(lambda x: getLabels(x).tolist(), ArrayType(IntegerType()))("estimator").alias("labels")
res_df = k_model.keyedModels.select("key", labels_udf)

dot product of a combination of elements of an RDD using pySpark

I have an RDD where each element is a tuple of the form
[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]
I would like to take the dot product of every pair of values in this RDD using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python's itertools module has a combinations function that can generate the pairs for which dot products need to be calculated. Could someone provide a code snippet to achieve this? I can only think of calling RDD.collect() to get a list of all elements in the RDD and then running itertools.combinations on that list, but as I understand it this would perform all the calculations on the driver node and would not be distributed. Could someone please suggest a more distributed way of achieving this?

def computeDot(sparseVectorA, sparseVectorB):
    """
    Function to compute dot product of two SparseVectors
    """
    return sparseVectorA.dot(sparseVectorB)
# Use the cartesian function on the RDD to create tuples containing
# all ordered pairs of rows in the original RDD
combinationRDD = originalRDD.cartesian(originalRDD)
# The records in combinationRDD include pairs of a row with itself,
# of the form ((Index1, SV1), (Index1, SV1)). Therefore, keep only
# the records where the two indices differ, giving records of the
# form ((Index1, SV1), (Index2, SV2)) and so on, then use map to
# apply the SparseVector's dot function
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())
The solution to this question should be along these lines.
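One thing to be aware of with the != filter is that every pair is produced twice, once as (i, j) and once as (j, i), and the result no longer says which pair a value belongs to. A small variation, sketched below, keeps only pairs with i < j (which matches what itertools.combinations would give) and carries the index pair along with each dot product. It assumes the indices are unique and comparable; the function names are made up for illustration.

```python
def is_upper_pair(pair):
    """Keep ((i, v_i), (j, v_j)) only when i < j, so each unordered pair appears once."""
    (i, _), (j, _) = pair
    return i < j


def dot_pair(pair):
    """Map ((i, v_i), (j, v_j)) to ((i, j), v_i . v_j)."""
    (i, vec_i), (j, vec_j) = pair
    return (i, j), vec_i.dot(vec_j)


def pairwise_dots(rdd):
    """rdd holds (index, SparseVector) tuples; the result is an RDD of ((i, j), dot)."""
    return rdd.cartesian(rdd).filter(is_upper_pair).map(dot_pair)
```

Calling pairwise_dots(originalRDD).collect() would then return a list of ((index1, index2), value) tuples, with all the dot products computed on the executors.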
