How do I use rdd.map/groupByKey with multiple values? - apache-spark

I'm currently learning PySpark and I don't quite understand the syntax for using multiple values with groupByKey(). Assuming I wanted to find the category of the product with the highest quantity in a schema of (Category, Quantity), it seems straightforward enough. I could do:
itemQuantity = df.rdd.map(lambda x: (x[0], x[1]))
highestQuantity = (itemQuantity.groupByKey()
                   .map(lambda x: (x[0], sum(x[1])))
                   .top(1, key=lambda x: x[1]))
But assuming I had a schema of (Category, Price, Quantity) and had to find the average item cost in each category, how would I do it? Conceptually I understand that I would first have to groupByKey() to achieve something like (Category, [Price, Quantity]), because groupByKey() doesn't appear to work with three values (ValueError: too many values to unpack (expected 2)). When I do:
averageCost = df.rdd.map(lambda x: (x[0], (x[1], x[2])))
aC = averageCost.groupByKey()
I end up with an RDD that I'm unsure how to manipulate, now that each key maps to a ResultIterable of (Price, Quantity) tuples:
[('Category_1', <pyspark.resultiterable.ResultIterable object at 0x7fb812b5ef70>), ... ,('Category_n', <pyspark.resultiterable.ResultIterable object at 0x7fb812b5fb50>)]
Ideally I want to end up with something like (Category, sum(Price), sum(Quantity)). How would I achieve this? I have been using reduceByKey() on two separate schemas (Category, Price) and (Category, Quantity) instead, although that's far from ideal. Also, are there any cheat sheets you would recommend for a beginner to understand the syntax? I've looked through a lot of examples and documentation, but I often find them too abstract to understand.

Is there any specific reason you want to use the RDD functions instead of the DataFrame ones? In your case it will be easier to write something like:
from pyspark.sql.functions import sum
df.groupBy("Category").agg(sum("Price"),sum("Quantity"))

It seems like
priceAndQuantity = df.rdd.map(lambda x: (x[0], (x[1], x[2])))
PAQ1 = priceAndQuantity.reduceByKey(lambda v1,v2: (v1[0] + v2[0], v1[1] + v2[1]))
PAQ2 = PAQ1.map(lambda x: (x[0], x[1][0], x[1][1]))
does what I was hoping to do.
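If the end goal is still the average item cost per category, a hedged follow-up on the same reduced RDD (again assuming average cost = total price / total quantity):
averageCost = PAQ1.map(lambda x: (x[0], x[1][0] / x[1][1]))
averageCost.collect()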

Related

Palantir Foundry spark.sql query

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
    Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
    df = spark.sql('select * from Merchant_Segments_View limit 5')
    return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: F.percentile_approx(x, 0.5) for x in df.columns if x not in exclusion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)
Try createGlobalTempView. It worked for me.
e.g.:
df.createGlobalTempView("people")
(I don't know the root cause of why the local temp view does not work.)
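For reference, global temporary views are registered under the global_temp database, so querying one looks something like this (a sketch reusing the "people" name from above):
df.createGlobalTempView("people")
spark.sql("select * from global_temp.people limit 5")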
I managed to avoid using dynamic SQL to calculate the median across columns by using the following code:
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.
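As a minimal, self-contained sketch of the same F.expr trick (the column and group names here are made up for illustration, and a running SparkSession named spark is assumed):
import pyspark.sql.functions as F

toy = spark.createDataFrame(
    [("A", 10.0), ("A", 0.0), ("A", 30.0), ("B", 5.0)],
    ["BANK_NAME", "REVENUE"],
)
# nullif(..., 0) turns zeros into nulls so they are ignored by the aggregate
toy.groupBy("BANK_NAME").agg(
    F.expr("percentile_approx(nullif(REVENUE, 0), 0.5)").alias("REVENUE_median")
).show()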

how to use lambda function to select larger values from two python dataframes whilst comparing them by date?

I want to map through the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, keeping only the values in df1 which are larger than those in df2, and adding them into a new column, 'New'. df1 and df2 are the same size, and both are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
A good way to handle your problem is by using NumPy as a tool. NumPy has a function called np.where that helps a lot in cases like this.
This is how it works:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you (keep the df1 value where it is larger than the corresponding df2 value):
df2['New'] = np.where(df2['Values'] < df1['Values'], df1['Values'], np.nan)
So, I think that solves your problem... You can change the value passed for the False condition to anything you want; np.nan is only an example.
Tell us if it worked!
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may raise some kind of error, so if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
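Putting the two answers together, a minimal sketch under the assumption that both frames share the same ('Month', 'Day') index:
import numpy as np

# align the two frames on their shared index, then compare row by row
df1 = df1.sort_index()
df2 = df2.sort_index()
df2['New'] = np.where(df2['Values'] < df1['Values'], df1['Values'], np.nan)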

Calculate cosine similarity between two columns of Spark DataFrame in PySpark [duplicate]

I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array with only distinct values in it. See the example below:
Ex: [24,23,27,23] should get converted to [24, 23, 27]
Code:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def uniq_array(col_array):
    x = np.unique(col_array)
    return x

uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))
In the above code, Df2.age_array is the array on which I am applying the UDF to get a different column "age_array_unique" which should contain only unique values in the array.
However, as soon as I run the command Df3.show(), I get the error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can anyone please let me know why this is happening?
Thanks!
The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to keep order)
from collections import OrderedDict

udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
    ArrayType(IntegerType()))
instead.
If you really want np.unique you have to convert the output:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
You need to convert the final value to a Python list. Implement the function as follows:
def uniq_array(col_array):
    x = np.unique(col_array)
    return list(x)
This is because Spark doesn't understand the numpy array format. In order to feed a python object that Spark DataFrames understand as an ArrayType, you need to convert the output to a python list before returning it.
I also got this error when my UDF returned a float but I forgot to cast it to a Python float. I needed to do this:
retval = 0.5
return float(retval)
As of PySpark version 2.4, you can use the array_distinct transformation.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
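Applied to the example from the question, a sketch (assuming the Df2 and age_array names from above) would look something like this:
from pyspark.sql.functions import array_distinct

Df3 = Df2.withColumn("age_array_unique", array_distinct(Df2.age_array))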
The below works fine for me:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
[x.item() for x in <any numpy array>]
converts it to plain python.

How to group by multiple columns and collect in list in PySpark?

Here is my problem:
I've got this RDD:
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd= sc.parallelize (a)
Then I try :
(rdd.map(lambda x: (x[0], x[1], x[2], list(x[3:])))
    .toDF(["col1","col2","col3","col4"])
    .groupBy("col1","col2","col3")
    .agg(collect_list("col4"))
    .show())
Finally I should find this:
[col1,col2,col3,col4]=[u'PNR1',u'TKT1',u'TEST',[[u'a2',u'a3'][u'a5',u'a6'][u'a8',u'a9']]]
But the problem is that I can't collect a list.
If anyone can help me, I would appreciate it.
I finally found a solution; it is not the best way, but I can continue working...
from pyspark.sql.functions import udf
from pyspark.sql.functions import *
from pyspark.sql.types import LongType

def example(lista):
    d = [[] for x in range(len(lista))]
    for index, elem in enumerate(lista):
        d[index] = elem.split("#")
    return d

example_udf = udf(example, LongType())
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd = sc.parallelize(a)
df = rdd.toDF(["col1","col2","col3","col4","col5"])
df2 = (df.withColumn('col6', concat(col('col4'), lit('#'), col('col5')))
         .drop(col("col4")).drop(col("col5"))
         .groupBy([col("col1"), col("col2"), col("col3")])
         .agg(collect_set(col("col6")).alias("col6")))
# df2.map(...) works on Spark 1.x DataFrames; on 2.x use df2.rdd.map(...)
df2.map(lambda x: (x[0], x[1], x[2], example(x[3]))).collect()
And it gives:
[(u'PNR1', u'TKT1', u'TEST', [[u'a2', u'a3'], [u'a5', u'a6'], [u'a8', u'a9']])]
Hope this solution can help someone else.
Thanks for all your answers.
This might do your job (or give you some ideas to proceed further)...
One idea is to convert your col4 to a primitive data type, i.e. a string:
from pyspark.sql.functions import collect_list
import pandas as pd
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd = sc.parallelize(a)
df = rdd.map(lambda x: (x[0],x[1],x[2], '(' + ' '.join(str(e) for e in x[3:]) + ')')).toDF(["col1","col2","col3","col4"])
df.groupBy("col1","col2","col3").agg(collect_list("col4")).toPandas().values.tolist()[0]
#[u'PNR1', u'TKT1', u'TEST', [u'(a2 a3)', u'(a5 a6)', u'(a8 a9)']]
UPDATE (after your own answer):
I really thought that the point I had reached above was enough for you to adapt it further to your needs, and I didn't have time at the moment to do it myself; so, here it is (after modifying my df definition to get rid of the parentheses, it is just a matter of a single list comprehension):
df = rdd.map(lambda x: (x[0],x[1],x[2], ' '.join(str(e) for e in x[3:]))).toDF(["col1","col2","col3","col4"])
# temp list:
ff = df.groupBy("col1","col2","col3").agg(collect_list("col4")).toPandas().values.tolist()[0]
ff
# [u'PNR1', u'TKT1', u'TEST', [u'a2 a3', u'a5 a6', u'a8 a9']]
# final list of lists:
ll = ff[:-1] + [[x.split(' ') for x in ff[-1]]]
ll
which gives your initially requested result:
[u'PNR1', u'TKT1', u'TEST', [[u'a2', u'a3'], [u'a5', u'a6'], [u'a8', u'a9']]] # requested output
This approach has certain advantages compared with the one provided in your own answer:
It avoids Pyspark UDFs, which are known to be slow
All the processing is done in the final (and hopefully much smaller) aggregated data, instead of adding and removing columns and performing map functions and UDFs in the initial (presumably much bigger) data
Since you cannot update to 2.x, your only option is the RDD API. Replace your current code with:
rdd.map(lambda x: ((x[0], x[1], x[2]), list(x[3:]))).groupByKey().toDF()
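A hedged sketch of what that produces once the grouped values are materialised as plain lists (groupByKey() otherwise returns a ResultIterable per key):
grouped = (rdd.map(lambda x: ((x[0], x[1], x[2]), list(x[3:])))
              .groupByKey()
              .mapValues(list))
grouped.collect()
# [((u'PNR1', u'TKT1', u'TEST'), [[u'a2', u'a3'], [u'a5', u'a6'], [u'a8', u'a9']])]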

dot product of a combination of elements of an RDD using pySpark

I have an RDD where each element is a tuple of the form
[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]
I would like to take the dot product of each pair of values in this RDD by using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python has an itertools.combinations module that can be used to generate the combinations of dot products to be calculated. Could someone provide a code snippet to achieve the same? I can only think of doing an RDD.collect() so I receive a list of all elements in the RDD and then running itertools.combinations on this list, but as per my understanding this would perform all the calculations on the driver and wouldn't be distributed per se. Could someone please suggest a more distributed way of achieving this?
def computeDot(sparseVectorA, sparseVectorB):
    """
    Function to compute the dot product of two SparseVectors
    """
    return sparseVectorA.dot(sparseVectorB)

# Use the cartesian function on the RDD to create tuples containing
# 2-combinations of all the rows in the original RDD
combinationRDD = originalRDD.cartesian(originalRDD)

# The records in combinationRDD will be of the form
# [(Index, SV1), (Index, SV1)], therefore you need to
# filter out all the records where the indexes are equal, giving an
# RDD of the form [(Index1, SV1), (Index2, SV2)] and so on,
# then use the map function to apply the SparseVector's dot function
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())
The solution to your question should be along these lines.
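A hedged refinement (my own assumption, not part of the original answer): since cartesian() yields both (A, B) and (B, A), keeping only one ordering of each pair avoids computing every dot product twice, provided the indexes are orderable:
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] < x[1][0])
             .map(lambda x: ((x[0][0], x[1][0]), computeDot(x[0][1], x[1][1])))
             .cache())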
