GroupBy of a dataframe by comparison against a set of items - python-3.x

So I have a dataframe of movies with about 10K rows. It has a column that captures each movie's genre as a comma-separated string. Since a movie can be classified under multiple genres, I needed to create a set of genres containing every genre that appears across the 10K rows. I went about it as follows:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
This gets me a set of 24 genres from the ~27K items in simplist, which is awesome. But here's the pinch:
I want to group by genre, comparing each movie's genres against the set, and then do aggregation and other operations, AND
I want the output to be 24 distinct groups such that if a movie has more than one of the genres in the set - it should show up in both groups (removes sorting or tagging bias in the data gathering phase)
Is groupby even the right way to go about this?
Thanks for your input/thoughts/options/approach in advance.

Ok, so I made some headway, but I'm still unable to put the puzzle pieces together.
I started off by making a list and a set (I don't know which I will end up using) of unique values:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
g_list = list(gset)
Then, separately, I use df.pivot_table to structure the analysis:
table7 = df.pivot_table(index=['release_year'], values=['runtime', 'popularity'],
                        aggfunc={'runtime': [np.median], 'popularity': [np.mean]},
                        fill_value=0, dropna=True)
But here's the thing:
it would be awesome if I could index by g_list, or check 'genres' against the gset of 24 distinct items, but df.pivot_table does not support that. Leaving the index as 'genres' creates ~2,000 rows and is not meaningful.

Got it!! I wanted to thank a bunch of offline folks and Pythonistas who pointed me in the right direction. It turns out I'd been spinning my wheels with sets and lists when a single Pandas command (well, 3 to be precise) does the trick!!
df2 = pd.DataFrame(df.genres.str.split(', ').tolist(), index=[df.col1, df.col2, df.coln]).stack()
df2 = df2.reset_index()[[0, 'col1', 'col2', 'coln']]
df2.columns = ['Genre', 'col1', 'col2', 'coln']
This creates a second dataframe (df2) that has the key columns for analysis from the original dataframe, with each row duplicated/attributed to each of its genres. You see the true value of this when you turn around and do something like:
revenue_table = df2.pivot_table(index=['Release Year', 'Genre'], values=['Profit'], aggfunc={'Profit': np.sum}, fill_value=0, dropna=True)
or anything to similar effect or use case.
Closing this but would appreciate any notes on more efficient ways to do this.
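For anyone looking for a shorter route: recent pandas versions (0.25+) have Series.explode, which does the split-and-stack in one go. A minimal sketch, assuming the same df with a comma-separated genres column and no missing values (column names are the ones used above):
import pandas as pd

# one row per (movie, genre) pair; all other columns are carried along automatically
df2 = df.assign(Genre=df['genres'].str.split(', ')).explode('Genre')

# then aggregate per genre as before, e.g. median runtime per genre
runtime_by_genre = df2.groupby('Genre')['runtime'].median()
From there, the same pivot_table calls shown above work unchanged on df2.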

Related

How to use dictionary values in dynamic spark sql query

I am new to Python, please help me with the problem below.
I have a dictionary as below:
city = {"AP":"VIZAG","TELANGANA":"HYDERABAD"}
and I also have a list which I need to loop over for all the state tables, as below:
states=['AP','HYDERABAD']
for st in states:
    df = spark.sql(f"""select * from {st} where city = {city}["{st}"]""")
In the df above I am trying to filter the city based on the dictionary value for each state, but I am not able to get it to work.
New answer
By combining two filter conditions you can do the expected filtering.
from pyspark.sql import functions as F

selected_state = 'AP'
df = df.filter(
    (F.col('state') == selected_state)
    & (F.col('city') == city[selected_state])
)
Old answer
It is a simple change: You can use isin to filter a column based on a list [Docs].
cities = list(city.keys())
df = df.filter(F.col('city').isin(cities))
If you want to construct more complex conditions based on a dictionary see this question.
[Edit] Updated the answer based on the OP's comment. I will leave the old one in there for completeness.
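As a side note, if the goal really is a dynamic SQL string per state (as in the original loop), the f-string only needs to index the dictionary inside the braces and quote the value as a SQL string literal. A minimal sketch, assuming the entries in states are keys of the city dictionary and that a table exists per state:
city = {"AP": "VIZAG", "TELANGANA": "HYDERABAD"}
states = ["AP", "TELANGANA"]

for st in states:
    # look up the city for this state and quote it as a SQL string literal
    df = spark.sql(f"select * from {st} where city = '{city[st]}'")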

Compare column names of multiple pandas dataframes

In the code below, I created a list of dataframes. Now I want to check whether all the dataframes in the dataframes list have the same column names (I just want to compare headers, not the values), and if the condition is not met, it should error out.
dataframes = []
list_of_files = os.listdir(os.path.join(folder_location, quarter, "inputs"))
for files in list_of_files:
    df = pd.read_excel(os.path.join(folder_location, quarter, "inputs", files), header=[0, 1],
                       sheetname="Ratings Inputs", parse_cols="B:AC", index_col=None).reset_index()
    df.columns = pd.MultiIndex.from_tuples([tuple(df.columns.names)] + list(df.columns)[1:])
    dataframes.append(df)
Not the most elegant solution, but it will get you there:
np.all([sorted(dataframes[0].columns) == sorted(i.columns) for i in dataframes])
sorted serves two purposes here: it converts the column Index objects to plain lists, and it makes sure the comparison doesn't fail just because the columns appear in a different order.
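If it should actually error out rather than just return False, a small sketch of the same idea (the exception type and message are just placeholders):
reference = sorted(dataframes[0].columns)
for i, frame in enumerate(dataframes[1:], start=1):
    if sorted(frame.columns) != reference:
        raise ValueError(f"Dataframe {i} has different column headers than dataframe 0")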

extract sub dataframe loc in for loop

I am having a problem slicing a Pandas dataframe into subsets and then joining them in a for loop.
The original dataframe looks like this:
original DataFrame content
*Forgive me for blocking out some details to avoid data security/policy issues; the message should still come across.
This is a query problem: I need to query for each "Surveyname" from a separate list containing the survey names, and output a table of all the surveys (in the order of the list) along with their other information from selected columns.
The original Dataframe has columns of these:
Index(['Surveyname', 'Surveynumber', 'Datasetstatus', 'Datasetname',
'Datasetprocessname', 'Datasetdatatype', 'Datasetseismicformat',
'Datasettapedisplay', 'Inventoryid', 'Inventoryname',
'Inventorybarcode', 'Inventoryremarks', 'Datashipmentnumber',
'Facilityid', 'Facilityname', 'Inventoryfullpathfilename',
'Inventorytype', 'Mmsdatasetname', 'Inventorymediatype',
'Inventoryoriglocation', 'Inventoryreceiveddate',
'Inventorydataattribdesc', 'Firstalternatedesc', 'Secondalternatedesc',
'Thirdalternatedesc', 'field26', 'field27', 'field28', 'field29',
'field30', 'field31', 'field32'],
dtype='object')
And I am selecting only these columns as output:
cols =['Surveyname','Surveynumber','Datasettapedisplay','Inventoryid','Inventorybarcode','Inventoryoriglocation']
I set up an empty dataframe at the start and try to append each "queried" subset dataframe to it, hoping it will grow along with the for loop.
The code looks like this:
f = open('EmptySurveyList2.txt', 'r')
cols = ['Surveyname', 'Surveynumber', 'Datasettapedisplay', 'Inventoryid', 'Inventorybarcode', 'Inventoryoriglocation']
setdf = pd.DataFrame(columns=cols)  # create an empty DataFrame
for line in f:
    print(line)
    # check by string content
    df0 = df_MIG.loc[df_MIG['Surveyname'] == line, cols]
    print(df_MIG.loc[df_MIG['Surveyname'] == line, cols])
    # check by string length for exact match
    df0 = df0.loc[df0['Surveyname'].str.len() == len(line), cols]
    print(df0.loc[df0['Surveyname'].str.len() == len(line), cols])
    print('df0:', len(df0))
    setdf = setdf.append(df0)
    print('setdf:', len(setdf))
However, this code still gives me only a few rows, from the very last survey, in the 'setdf' dataframe.
I went on debugging and found that inside the for loop, df0 does not find the survey information in the main df_MIG for any of the surveys in the list except the last one. Printing out the length of df0 and setdf gives:
>...Centauro
>
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>df0: 0
>
>setdf: 0
>
>Blueberry
>
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>Empty DataFrame
>Columns: [Surveyname, Surveynumber, Datasettapedisplay, Inventoryid,
>Inventorybarcode, Inventoryoriglocation]
>Index: []
>df0: 0
>
>setdf: 0
>
>Baha (G)
> Surveyname Surveynumber Datasettapedisplay Inventoryid Inventorybarcode \
>219 Baha (G) 329130 FIN 1538554 4210380
>
>Inventoryoriglocation
>219 /wgdisk/hn0016/mc03/BAHA_329130/MIGFIN_639_256...
> Surveyname Surveynumber Datasettapedisplay Inventoryid Inventorybarcode \
>219 Baha (G) 329130 FIN 1538554 4210380
>
>Inventoryoriglocation
>219 /wgdisk/hn0016/mc03/BAHA_329130/MIGFIN_639_256...
>df0: 1
>
>setdf: 1
While if I do the query outside of the loop,
a = "Blueberry"
df0=df_MIG.loc[df_MIG['Surveyname']==a,cols]
df0=df0.loc[df0['Surveyname'].str.len()==len(a),cols]
setdf=setdf.append(df0)
It works fine with no issues: it finds the rows that have that survey name and appends them to setdf.
Debugging outside the loop
This is quite a mystery to me. Can anyone help clarify why, or suggest a better alternative?
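One thing worth checking, as a guess based on the code posted: iterating over a text file keeps the trailing newline on every line (except possibly the last one), so df_MIG['Surveyname'] == line can only ever match the final survey. A minimal sketch with the names stripped first, using isin instead of growing the dataframe inside the loop:
with open('EmptySurveyList2.txt', 'r') as f:
    survey_names = [line.strip() for line in f]  # drop trailing newlines/whitespace

cols = ['Surveyname', 'Surveynumber', 'Datasettapedisplay', 'Inventoryid',
        'Inventorybarcode', 'Inventoryoriglocation']
setdf = df_MIG.loc[df_MIG['Surveyname'].isin(survey_names), cols]
If the output must stay in the order of the list, keep the original loop but compare against line.strip() instead of line.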

How to use monotonically_increasing_id to join two pyspark dataframes having no common column?

I have two pyspark dataframes with the same number of rows, but they don't have any common column. So I am adding a new column to both of them using monotonically_increasing_id(), as follows:
from pyspark.sql.functions import monotonically_increasing_id as mi
id=mi()
df1 = df1.withColumn("match_id", id)
cont_data = cont_data.withColumn("match_id", id)
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
But after the join, the resulting dataframe has fewer rows.
What am I missing here? Thanks.
You just don't. This is not an applicable use case for monotonically_increasing_id, which is by definition non-deterministic. Instead (see the sketch after this list):
convert to RDD
zipWithIndex
convert back to DataFrame
join
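A minimal sketch of that recipe (the helper name and the match_id column are placeholders, and an active SparkSession is assumed so that RDD.toDF works):
def with_row_index(df, col_name="match_id"):
    # zipWithIndex assigns a stable, consecutive 0-based index to every row
    indexed = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    return indexed.toDF(df.columns + [col_name])

df1_indexed = with_row_index(df1)
cont_indexed = with_row_index(cont_data)
result = cont_indexed.join(df1_indexed, "match_id", "inner").drop("match_id")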
You can generate the ids with monotonically_increasing_id, save the dataframe to disk, and then read it back in, THEN do whatever joining process you need. I would only suggest this approach if you just need to generate the ids once. At that point they can be used for joining, but for the reasons mentioned above this is hacky and not a good solution for anything that runs regularly.
If you want to get an incremental number on both dataframes and then join, you can generate a consecutive number with monotonically_increasing_id and a window function, with the following code:
df1 = df1.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
df1 = df1.withColumn("match_id", row_number().over(window))
df1 = df1.drop("monotonically_increasing_id")
cont_data = cont_data.withColumn("monotonically_increasing_id",monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
cont_data = cont_data.withColumn("match_id", row_number().over(window))
cont_data = cont_data.drop("monotonically_increasing_id")
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
Warning: it may move all the data to a single partition, since the window has no partitioning! So it may be better to split the match_id out into a separate dataframe together with the monotonically_increasing_id column, generate the consecutive incremental number there, and then join it back with the data.

What is the right way to do a semi-join on two Spark RDDs (in PySpark)?

In my PySpark application, I have two RDDs:
items - This contains the item ID and item name for all valid items. Approx 100,000 items.
attributeTable - This contains the fields user ID, item ID, and an attribute value for that combination, in that order. There is a certain attribute for each user-item combination in the system. This RDD has several hundred thousand rows.
I would like to discard all rows in attributeTable RDD that don't correspond to a valid item ID (or name) in the items RDD. In other words, a semi-join by the item ID. For instance, if these were R data frames, I would have done semi_join(attributeTable, items, by="itemID")
I tried the following approach first, but found that this takes forever to return (on my local Spark installation running on a VM on my PC). Understandably so, because there are such a huge number of comparisons involved:
# Create a broadcast variable of all valid item IDs for doing the filter in the driver
validItemIDs = sc.broadcast(items.map(lambda (itemID, itemName): itemID).collect())
attributeTable = attributeTable.filter(lambda (userID, itemID, attributes): itemID in set(validItemIDs.value))
After a bit of fiddling around, I found that the following approach works pretty fast (a min or so on my system).
# Create a broadcast variable for the item ID to item name mapping (dictionary)
itemIdToNameMap = sc.broadcast(items.collectAsMap())
# From the attribute table, remove records that don't correspond to a valid item name.
# First go over all records in the table and add a dummy field indicating whether the item name is valid.
# Then, filter out all rows with invalid names. Finally, remove the dummy field we added.
attributeTable = (attributeTable
    .map(lambda (userID, itemID, attributes): (userID, itemID, attributes, itemIdToNameMap.value.get(itemID, 'Invalid')))
    .filter(lambda (userID, itemID, attributes, itemName): itemName != 'Invalid')
    .map(lambda (userID, itemID, attributes, itemName): (userID, itemID, attributes)))
Although this works well enough for my application, it feels more like a dirty workaround and I am pretty sure there must be a cleaner or more idiomatic (and possibly more efficient) way to do this in Spark. What would you suggest? I am new to both Python and Spark, so any RTFM advice will also be helpful if you can point me to the right resources.
My Spark version is 1.3.1.
Just do a regular join and then discard the "lookup" relation (in your case items rdd).
If these are your RDDs (example taken from another answer):
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
then you'd do:
(attributeTable.keyBy(lambda x: x[1])
    .join(items)
    .map(lambda (key, (attribute, item)): attribute))
And as a result, you only have tuples from attributeTable RDD which have a corresponding entry in the items RDD:
[(123456, 123, 'Attribute for A')]
Doing it via leftOuterJoin as suggested in another answer will also do the job, but is less efficient. Also, the other answer semi-joins items with attributeTable instead of attributeTable with items.
As others have pointed out, this is probably most easily accomplished by leveraging DataFrames. However, you might be able to accomplish your intended goal by using the leftOuterJoin and the filter functions. Something a bit hackish like the following might suffice:
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
sorted(items.leftOuterJoin(attributeTable.keyBy(lambda x: x[1]))
       .filter(lambda x: x[1][1] is not None)
       .map(lambda x: (x[0], x[1][0])).collect())
returns
[(123, 'Item A')]
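For completeness, a small sketch of the DataFrame route mentioned above (the column names are made up for illustration, an active SQLContext/SparkSession is assumed, and the left_semi join type requires a more recent Spark than 1.3.1):
# build DataFrames from the example RDDs; column names are assumptions
items_df = items.toDF(["itemID", "itemName"])
attrs_df = attributeTable.toDF(["userID", "itemID", "attribute"])

# keep only the attribute rows whose itemID appears in items
filtered = attrs_df.join(items_df, on="itemID", how="left_semi")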
