PySpark: groupby and then count true values

My data structure is in JSON format:
"header"{"studentId":"1234","time":"2016-06-23","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-24","homeworkSubmitted":True}
"header"{"studentId":"1234","time":"2016-06-25","homeworkSubmitted":True}
"header"{"studentId":"1236","time":"2016-06-23","homeworkSubmitted":False}
"header"{"studentId":"1236","time":"2016-06-24","homeworkSubmitted":True}
....
I need to plot a histogram that shows the number of homeworkSubmitted: True records across all studentIds. I wrote code that flattens the data structure, so my keys are header.studentId, header.time and header.homeworkSubmitted.
I used keyBy to group by studentId:
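from collections import Counter

# mapTF (defined elsewhere, not shown) is assumed to convert True/False to 1/0
(initialRDD.keyBy(lambda row: row['header.studentId'])
    # Python 3 has no tuple-unpacking lambdas, so index into the pair instead
    .map(lambda kv: (kv[0], kv[1]['header.homeworkSubmitted']))
    .map(mapTF).groupByKey().mapValues(lambda x: Counter(x)).collect())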
This gives me result like this:
("1234", Counter({0: 0, 1: 3})),
("1236", Counter({0: 1, 1: 1}))
I only need the counts of 1, ideally as a list, so that I can plot a histogram using matplotlib. I am not sure how to proceed and filter out everything else.
Edit: in the end I iterated through the dictionary, added the counts to a list, and plotted a histogram of that list. I am wondering if there is a more elegant way to do the whole process I outlined in my code.

df = sqlContext.read.json('/path/to/your/dataset/')
df.filter(df.homeworkSubmitted == True).groupby(df.studentId).count()
Note that the data as posted is not valid JSON: the leading "header" is not part of a JSON object, and booleans must be written as true/false rather than True/False.

I don't have Spark in front of me right now, though I can edit this tomorrow when I do.
But if I'm understanding this you have three key-value RDDs, and need to filter by homeworkSubmitted=True. I would think you turn this into a dataframe, then use:
df.where(df.homeworkSubmitted==True).count()
You could then use group by operations if you wanted to explore subsets based on the other columns.

You can filter out the False rows while keeping the data in an RDD, then count the remaining True rows with Counter:
initialRDD.filter(lambda row : row['header.homeworkSubmitted'])
Another solution would be to sum the booleans
data = sc.parallelize([('id1', True), ('id1', True),
                       ('id2', False), ('id2', False),
                       ('id3', False), ('id3', True)])
data.reduceByKey(lambda x, y: x + y).collect()
Outputs
[('id2', 0), ('id3', 1), ('id1', 2)]
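To get from these per-student sums to the list the question wants for plotting, the values can simply be pulled out. The following is a minimal local sketch in plain Python (no Spark needed) that mirrors the reduceByKey logic; the names `pairs`, `counts` and `submitted_counts` are illustrative and not part of the original answer.

```python
from collections import defaultdict

# Same sample pairs as above: (studentId, homeworkSubmitted).
pairs = [('id1', True), ('id1', True),
         ('id2', False), ('id2', False),
         ('id3', False), ('id3', True)]

# Local stand-in for reduceByKey(lambda x, y: x + y): booleans add as 0/1.
counts = defaultdict(int)
for student_id, submitted in pairs:
    counts[student_id] += submitted

# One number per student, ready to pass to e.g. matplotlib's plt.hist().
submitted_counts = sorted(counts.values())
print(dict(counts))
print(submitted_counts)
```

One caveat with the Spark version: for a key that has only a single record, reduceByKey never calls the lambda, so the value stays a bool rather than an int. Mapping the values to int before reducing avoids that.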

Related

PySpark filter a list of element and then merge back

My RDD contains pairs of an ID and a list of items; for example, each element looks like (1, [a, b, c]). I need to apply a filter to the items. Let's say I don't want a in the list.
My current approach is to use flatMapValues to break the items into key-value pairs, filter them, and use groupByKey to merge them back into (1, [b, c]).
After some research, it seems that groupByKey performs terribly when the data is huge. Also, breaking the list down and then merging it back together seems redundant. Is there a way to accomplish this without breaking the list down and merging it back?
You can use a list comprehension with mapValues:
rdd2 = rdd1.mapValues(lambda l: [i for i in l if i != 'a'])
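As a quick illustration of what that lambda does to each value, here is a local sketch in plain Python. The sample `records` and the helper name `drop_a` are made up for the example; in Spark the same function would be passed to mapValues.

```python
# Hypothetical sample data in the same (id, [items]) shape as the question.
records = [(1, ['a', 'b', 'c']), (2, ['a']), (3, ['d'])]

def drop_a(items):
    """Return the list without any 'a' entries (same logic as the lambda above)."""
    return [i for i in items if i != 'a']

# Local equivalent of rdd1.mapValues(drop_a): keys are untouched.
filtered = [(key, drop_a(items)) for key, items in records]
print(filtered)
```

Note one behavioural difference from the flatMapValues/filter/groupByKey approach: here a key whose items are all removed is kept with an empty list, whereas the groupByKey route drops that key entirely.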

how to use lambda function to select larger values from two python dataframes whilst comparing them by date?

I want to map through the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, keeping only the values in df1 which are larger than those in df2 and adding them to a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df2.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df1 and df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use NumPy. NumPy has a function called "where" that helps a lot in cases like this.
This is how the call is structured:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
So, I think that solves your problem. You can change the value passed for the False case to anything you want; this is only an example.
Tell us if it worked!
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may still raise an error; if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.

GroupBy of a dataframe by comparison against a set of items

So I have a dataframe of movies with about 10K rows. It has a column that captures each movie's genres as a comma-separated string. Since a movie can be classified in multiple genres, I needed to create a set containing all possible genres in the 10K rows. So I went about it as follows:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
This gets me a set of 24 genres from the 27K items in simplist, which is awesome. But here's the pinch:
I want to group by genre by comparing each movie's genres to the set and then do aggregation and other operations, AND
I want the output to be 24 distinct groups such that if a movie has more than one of the genres in the set, it shows up in each of those groups (this removes sorting or tagging bias from the data gathering phase).
Is groupby even the right way to go about this?
Thanks for your input/thoughts/options/approach in advance.
OK, so I made some headway but am still unable to put the puzzle pieces together.
Started off by making a list and set (don't know which I will end up using) of unique values:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
g_list = list(gset)
Then, separately use df.pivot to structure the analysis:
table7 = df.pivot_table(index=['release_year'], values=['runtime'],
                        aggfunc={'runtime': [np.median], 'popularity': [np.mean]},
                        fill_value=0, dropna=True)
But here's the thing:
It would be awesome if I could index by g_list or check 'genres' against the gset of 24 distinct items, but df.pivot_table does not support that. Leaving the index at genres creates ~2000 rows, which is not meaningful.
Got it!! Wanted to thank a bunch of offline folks and Pythonistas who helped me in the right direction. Turns out I'd been spinning my wheels with sets and lists when one single Pandas command (well 3 to be precise) does the trick!!
df2 = pd.DataFrame(df.genres.str.split(', ').tolist(), index=[df.col1, df.col2, df.coln]).stack()
df2 = df2.reset_index()[[0, 'col1', 'col2', 'coln',]]
df2.columns = ['Genre', 'col1', "col2", 'coln']
This should create a second dataframe (df2) that has the key columns for analysis from the original dataframe, with the rows duplicated and attributed to each genre. You see the true value of this when you turn around and do something like:
revenue_table = df2.pivot_table(index=['Release Year', 'Genre'], values=['Profit'], aggfunc={'Profit': np.sum}, fill_value=0, dropna=True)
or anything to similar effect for your use case.
Closing this but would appreciate any notes on more efficient ways to do this.
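The core idea above (one row per movie-genre pair, then aggregate per genre) can be sketched without pandas to make the duplication explicit. The sample `movies` rows and the names `titles_by_genre` and `profit_by_genre` are hypothetical, chosen only for illustration. In pandas 0.25 and later, `Series.str.split` followed by `DataFrame.explode` should give the same one-row-per-genre shape more directly.

```python
from collections import defaultdict

# Hypothetical rows: (title, comma-separated genres, profit).
movies = [
    ('Movie A', 'Action, Comedy', 10),
    ('Movie B', 'Comedy', 5),
    ('Movie C', 'Action, Drama', 7),
]

titles_by_genre = defaultdict(list)
profit_by_genre = defaultdict(int)
for title, genres, profit in movies:
    # A movie with several genres contributes to every one of its groups.
    for genre in genres.split(', '):
        titles_by_genre[genre].append(title)
        profit_by_genre[genre] += profit

print(dict(titles_by_genre))
print(dict(profit_by_genre))
```

Because each movie is counted once per genre, the per-genre totals add up to more than the overall total; that is the intended effect here, not double-counting by mistake.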

Multiple if elif conditions to be evaluated for each row of pyspark dataframe

I need help with a PySpark DataFrame topic.
I have a dataframe with, say, 1000+ columns and 100000+ rows. I also have 10000+ if/elif conditions, and under each condition a few global variables are incremented by some values.
Now my question is how I can achieve this in PySpark only.
I read about the filter and where functions, which return rows based on a condition, but I need to check those 10000+ if/elif conditions and perform some manipulations.
Any help would be appreciated.
If you could give an example with a small dataset, that would be a great help.
Thank you
You can define a function that contains all of your if/elif conditions, then apply this function to each row of the DataFrame.
Just use .rdd to convert the DataFrame to a normal RDD, then use the map() function,
e.g. DF.rdd.map(lambda row: func(row))
Hope this helps.
As I understand it, you just want to update some global counters while iterating over your DataFrame. For this, you need to:
1) Define one or more accumulators:
ac_0 = sc.accumulator(0)
ac_1 = sc.accumulator(0)
2) Define a function to update your accumulators for a given row, e.g:
def accumulate(row):
    if row.foo:
        ac_0.add(1)
    elif row.bar:
        ac_1.add(row.baz)
3) Call foreach on your DataFrame:
df.foreach(accumulate)
4) Inspect the accumulator values
>>> ac_0.value
123
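To see the branching logic of the steps above in isolation, here is a small local sketch in plain Python. The `Row` fields (foo, bar, baz) follow the placeholder names used in the answer, while the sample rows and the `totals` dict standing in for the two accumulators are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical row type with the placeholder columns used above.
Row = namedtuple('Row', ['foo', 'bar', 'baz'])
rows = [Row(True, False, 5), Row(False, True, 3),
        Row(False, False, 9), Row(False, True, 4)]

# Plain dict standing in for the accumulators ac_0 and ac_1.
totals = {'ac_0': 0, 'ac_1': 0}

def accumulate(row):
    # Only the first matching branch fires, as in any if/elif chain.
    if row.foo:
        totals['ac_0'] += 1
    elif row.bar:
        totals['ac_1'] += row.baz

# Local equivalent of df.foreach(accumulate).
for row in rows:
    accumulate(row)

print(totals)
```

On a cluster a plain dict like this would not work, because each executor would update its own copy; that is the reason the answer uses accumulators.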

dot product of a combination of elements of an RDD using pySpark

I have an RDD where each element is a tuple of the form
[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]
I would like to take the dot product of each pair of values in this RDD using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python's itertools module has a combinations function that can generate the pairs whose dot products need to be calculated. Could someone provide a code snippet to achieve this? I can only think of calling RDD.collect() to get a list of all elements in the RDD and then running itertools.combinations on that list, but as I understand it this would perform all the calculations on the driver and would not be distributed. Could someone please suggest a more distributed way of achieving this?
def computeDot(sparseVectorA, sparseVectorB):
    """
    Function to compute dot product of two SparseVectors
    """
    return sparseVectorA.dot(sparseVectorB)

# Use the cartesian function on the RDD to create tuples containing
# every ordered pair of rows from the original RDD
combinationRDD = originalRDD.cartesian(originalRDD)

# The records in combinationRDD have the form
# ((Index1, SV1), (Index2, SV2)) and include each row paired with itself,
# so keep only the records whose two indices differ,
# then use map to apply the SparseVector's dot function
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())
The solution to this question should be along these lines.
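For comparison, this is roughly what the pairing looks like locally with itertools.combinations, which the question mentions. It is only a sketch: the sparse vectors are modelled as plain {index: value} dicts rather than mllib SparseVector objects, and `vectors`, `sparse_dot` and `dots` are illustrative names.

```python
from itertools import combinations

# Hypothetical data: (row index, sparse vector as an {index: value} dict).
vectors = [
    (1, {0: 1.0, 2: 1.0}),
    (2, {2: 1.0, 3: 1.0}),
    (3, {0: 1.0, 2: 1.0, 3: 1.0}),
]

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(value * b[key] for key, value in a.items() if key in b)

# combinations yields each unordered pair exactly once.
dots = {(i, j): sparse_dot(va, vb)
        for (i, va), (j, vb) in combinations(vectors, 2)}
print(dots)
```

The cartesian approach with a `!=` filter produces every pair twice, once in each order. If each unordered pair should appear only once, as with combinations, filtering on `x[0][0] < x[1][0]` instead should do that, assuming the indices are orderable.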

Resources