Find users' most frequent recommendations based on input queries - python-3.x

I have an input query table as follows:
query
0 orange
1 apple
2 meat
which I want to match against the user query table as follows:
user query
0 a1 orange
1 a1 strawberry
2 a1 pear
3 a2 orange
4 a2 strawberry
5 a2 lemon
6 a3 orange
7 a3 banana
8 a6 meat
9 a7 beer
10 a8 juice
Given a query in the input query table, I want to match it against the queries made by other users in the user query table, and return the top 3 other queries ranked by total count.
For example,
orange in the input query table matches users a1, a2 and a3 in the user query table, all of whom have queried orange. The other items they have queried are strawberry (count of 2) and pear, lemon, banana (count of 1 each).
The answer will be strawberry (since it has the max count), pear, lemon (since we only return the top 3).
Similar reasoning applies to apple (no user queried it, so the output is 'nothing') and to meat.
So the final output table is
query recommend
0 orange strawberry
1 orange pear
2 orange lemon
3 apple nothing
4 meat nothing
Here is the code
import pandas as pd
import numpy as np
# Create sample dataframes
df_input = pd.DataFrame( {'query': {0: 'orange', 1: 'apple', 2: 'meat'}} )
df_user = pd.DataFrame( {'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 8: 'a6', 9: 'a7', 10: 'a8'}, 'query': {0: 'orange', 1: 'strawberry', 2: 'pear', 3: 'orange', 4: 'strawberry', 5: 'lemon', 6: 'orange', 7: 'banana', 8: 'meat', 9: 'beer', 10: 'juice'}} )
target_users = df_user[df_user['query'].isin(df_input['query'])]['user']
mask_users=df_user['user'].isin(target_users)
mask_queries=df_user['query'].isin(df_input['query'])
df1=df_user[mask_users & mask_queries]
df2=df_user[mask_users]
df=df1.merge(df2,on='user').rename(columns={"query_x":"query", "query_y":"recommend"})
df=df[df['query']!=df['recommend']]
df=df.groupby(['query','recommend'], as_index=False).count().rename(columns={"user":"count"})
df=df.sort_values(['query','recommend'],ascending=False, ignore_index=False)
df=df.groupby('query').head(3)
df=df.drop(columns=['count'])
df=df_input.merge(df,how='left',on='query').fillna('nothing')
df
Where df is the result. Is there any way to make the code more concise?

Unless there is a particular reason to favor pears over bananas (since they both count for one), I would suggest a more idiomatic way to do it:
import pandas as pd
df_input = pd.DataFrame(...)
df_user = pd.DataFrame(...)
df_input = (
    df_input
    .assign(
        recommend=df_input["query"].map(
            lambda x: df_user[
                (df_user["user"].isin(df_user.loc[df_user["query"] == x, "user"]))
                & (df_user["query"] != x)
            ]
            .value_counts(subset="query")
            .index[0:3]
            .to_list()
            if x in df_user["query"].unique()
            else "nothing"
        )
    )
    .explode("recommend")
    .fillna("nothing")
    .reset_index(drop=True)
)
print(df_input)
# Output
query recommend
0 orange strawberry
1 orange banana
2 orange lemon
3 apple nothing
4 meat nothing
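If a deterministic order among tied recommendations matters, the same idea can probably be sketched with two merges and a grouped count. This is only a sketch under assumptions: the helper name `recommend` and its `top_n` parameter are made up for illustration, and ties are broken alphabetically, which is an arbitrary choice that does not come from the question.

```python
import pandas as pd

df_input = pd.DataFrame({"query": ["orange", "apple", "meat"]})
df_user = pd.DataFrame({
    "user": ["a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a6", "a7", "a8"],
    "query": ["orange", "strawberry", "pear", "orange", "strawberry", "lemon",
              "orange", "banana", "meat", "beer", "juice"],
})


def recommend(df_input, df_user, top_n=3):
    """Hypothetical helper: top_n co-queried items per input query."""
    # Find the users who made each input query, then attach everything
    # those users queried (the second merge adds a "query_other" column).
    pairs = (
        df_input.merge(df_user, on="query")
        .merge(df_user, on="user", suffixes=("", "_other"))
        .rename(columns={"query_other": "recommend"})
    )
    # Drop the trivial pair where the recommendation is the query itself.
    pairs = pairs[pairs["query"] != pairs["recommend"]]
    # Count each (query, recommend) pair; highest count first,
    # ties broken alphabetically (an assumption, see above).
    counts = (
        pairs.groupby(["query", "recommend"]).size().reset_index(name="count")
        .sort_values(["query", "count", "recommend"],
                     ascending=[True, False, True])
    )
    top = counts.groupby("query").head(top_n)[["query", "recommend"]]
    # The left merge keeps input queries that have no recommendation.
    return df_input.merge(top, how="left", on="query").fillna("nothing")


result = recommend(df_input, df_user)
print(result)
```

Sorting on the count column explicitly makes the ranking rule visible, at the cost of a few more lines than the `value_counts` version.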

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
Fruit price
0 apple 1
1 orange 2
2 apple 1
3 banana 3
4 orange 2
5 apple 1
6 banana 3
What I want to do is check if Fruit == banana and if yes, I want the code to scan the preceding as well as the next n rows from the index position of the 'banana' row, for an instance where Fruit == apple. An example of the expected output is shown below taking n=2.
Fruit price
2 apple 1
5 apple 1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2&m3]
output:
Fruit price
2 apple 1
5 apple 1
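To see why a centered window of 2*n+1 rows answers "is this row within n rows of a banana?", the rolling mask can be compared with an explicit version built from shifted copies of the banana mask. This is a minimal sketch; the names `near_rolling` and `near_shift` are illustrative only, and the int cast before `rolling` is a precaution rather than part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["apple", "orange", "apple", "banana", "orange", "apple", "banana"],
    "price": [1, 2, 1, 3, 2, 1, 3],
})
n = 2
is_banana = df["Fruit"].eq("banana")

# Rolling version: a centered window of 2*n+1 rows covers the n rows
# before and the n rows after each position.
near_rolling = (
    is_banana.astype(int)
    .rolling(2 * n + 1, min_periods=1, center=True)
    .max()
    .eq(1)
)

# Explicit version: OR together the banana mask shifted by -n..n rows.
near_shift = pd.Series(False, index=df.index)
for k in range(-n, n + 1):
    near_shift = near_shift | is_banana.shift(k, fill_value=False)

# Keep the apples that sit within n rows of a banana.
out = df[near_shift & df["Fruit"].eq("apple")]
print(out)
```

Both masks should agree row by row; the rolling form avoids the Python-level loop.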

Getting values in pivot table

This is the pivot table, and I hope to get the values in the red and green rectangles:
import pandas as pd
import numpy as np
# Create sample dataframes
df_user = pd.DataFrame( {'user': {0: 'a1', 1: 'a1', 2: 'a1', 3: 'a2', 4: 'a2', 5: 'a2', 6: 'a3', 7: 'a3', 8: 'a6', 9: 'a7', 10: 'a8'}, 'query': {0: 'orange', 1: 'strawberry', 2: 'pear', 3: 'orange', 4: 'strawberry', 5: 'lemon', 6: 'orange', 7: 'banana', 8: 'meat', 9: 'beer', 10: 'juice'}} )
df_user['count']=1
df_pivot=pd.pivot_table(df_user,index=['query'],columns=['user'],values=['count'],aggfunc=np.sum).fillna(0)
#getting value in red rectangle, incorrect
print(df_pivot.loc['banana':'beer','a1':'a2'])
#getting value in green rectangle, error
print(df_pivot.loc[:,'a8'])
What's the right way to get them?
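Because values was passed as a list, the columns are a MultiIndex whose first level is 'count'. Select with a tuple: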
>> df_pivot.loc[["banana","beer"], ('count',['a1','a2'])]
count
user a1 a2
query
banana 0.0 0.0
beer 0.0 0.0
and
>> df_pivot.loc[:, ('count',['a8'])]
count
user a8
query
banana 0.0
beer 0.0
juice 1.0
lemon 0.0
meat 0.0
orange 0.0
pear 0.0
strawberry 0.0
Alternatively, first drop the first column level, like so:
df_pivot = df_pivot.droplevel(0,1)
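Another option, sketched here as an assumption about what the asker wants, is to avoid the extra column level in the first place by passing `values` as a single label instead of a list. The variable names `flat`, `red` and `green` are illustrative only.

```python
import pandas as pd

df_user = pd.DataFrame({
    "user": ["a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a6", "a7", "a8"],
    "query": ["orange", "strawberry", "pear", "orange", "strawberry", "lemon",
              "orange", "banana", "meat", "beer", "juice"],
})
df_user["count"] = 1

# A scalar `values` keeps the columns single-level (just the user names).
flat = pd.pivot_table(
    df_user, index="query", columns="user", values="count",
    aggfunc="sum", fill_value=0,
)

# Plain labels now work for both selections.
red = flat.loc[["banana", "beer"], ["a1", "a2"]]
green = flat["a8"]
print(red)
print(green)
```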

How do I check total amount of times a certain value occurred in a nested loop?

Question: Calculate the total number of apples bought on Monday and Wednesday only.
This is my code currently:
apple = 0
banana = 0
orange = 0
# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [ ['apple', 'cherry', 'apple', 'orange'], \
          ['orange', 'apple', 'apple', 'banana'], \
          ['banana', 'apple', 'cherry', 'orange'] ]
for x in fruit:
    if 'apple' in x:
        if fruit.index(x) == 0 or fruit.index(x) == 2:
            apple + 1
print(apple)
For some reason, the current result which I am getting for printing apple is 0.
What is wrong with my code?
The problem in your code is that you compute apple + 1 but never assign the result to any variable, which is why the initial value is printed:
apple = 0
apple + 1
You need to do:
apple += 1
Also, list.index always returns the index of the first occurrence of an element. For example,
fruit[1].index('apple')
returns 1, the index of the first occurrence of 'apple' in fruit[1]. In your loop, fruit.index(x) would therefore give the wrong day if two days had identical lists.
Even with apple += 1, the loop adds only 1 for each selected day that contains an apple (giving 2), whereas the question asks for the number of apples bought on Monday and Wednesday, which is 3. Counting those two days directly gives the correct result:
apple = 0
banana = 0
orange = 0
# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [ ['apple', 'cherry', 'apple', 'orange'],
['orange', 'apple', 'apple', 'banana'],
['banana', 'apple', 'cherry', 'orange'] ]
apple += fruit[0].count('apple')
apple += fruit[2].count('apple')
print(apple)
There are two issues with your code.
The first issue is:
if fruit.index(x) == 0 or fruit.index(x) == 2:
    apple + 1
apple + 1 does not do anything meaningful. To increment apple, you need apple += 1. This results in apple being 2.
The second issue is that you need to calculate the total number of apples, which is 3, not 2: two apples were bought on Monday and one on Wednesday.
You can use collections.Counter for this:
from collections import Counter
for x in fruit:
    if 'apple' in x:
        if fruit.index(x) == 0 or fruit.index(x) == 2:
            apple += Counter(x)['apple']
It should be apple += 1, not apple + 1.
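The points raised in the answers above (assigning the result, counting every apple rather than one per day, and not relying on `fruit.index(x)`) can be combined in one short sketch. The name `wanted_days` is an illustrative assumption; `enumerate` supplies each day's position directly.

```python
fruit = [
    ["apple", "cherry", "apple", "orange"],   # Monday
    ["orange", "apple", "apple", "banana"],   # Tuesday
    ["banana", "apple", "cherry", "orange"],  # Wednesday
]
wanted_days = {0, 2}  # positions of Monday and Wednesday

# Sum the number of apples on the selected days only.
apple = sum(
    day.count("apple")
    for i, day in enumerate(fruit)
    if i in wanted_days
)
print(apple)  # 3
```

Using the position from `enumerate` also stays correct if two days happen to contain identical lists.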

How can I merge rows if a column data is same and change a value of another specific column on merged column efficiently in pandas?

I am trying to merge rows if the values of a certain column are the same. I have been using groupby first and then replacing the value of a column based on a specific condition. I was wondering if there is a better option for what I am trying to do.
This is what I have been doing
import numpy as np
import pandas as pd
data={'Name': {0: 'Sam', 1: 'Amy', 2: 'Cat', 3: 'Sam', 4: 'Kathy'},
      'Subject1': {0: 'Math', 1: 'Science', 2: 'Art', 3: np.nan, 4: 'Science'},
      'Subject2': {0: np.nan, 1: np.nan, 2: np.nan, 3: 'English', 4: np.nan},
      'Result': {0: 'Pass', 1: 'Pass', 2: 'Fail', 3: 'TBD', 4: 'Pass'}}
df=pd.DataFrame(data)
df=df.groupby('Name').agg({
    'Subject1': 'first',
    'Subject2': 'first',
    'Result': ', '.join}).reset_index()
df['Result']=df['Result'].apply(lambda x: 'RESULT_FAILED' if x=='Pass, TBD' else x )
Starting: df looks like:
Name Subject1 Subject2 Result
0 Sam Math NaN Pass
1 Amy Science NaN Pass
2 Cat Art NaN Fail
3 Sam NaN English TBD
4 Kathy Science NaN Pass
Final result I want is :
Name Subject1 Subject2 Result
0 Amy Science NaN Pass
1 Cat Art NaN Fail
2 Kathy Science NaN Pass
3 Sam Math English RESULT_FAILED
I believe this might not be a good solution if there are more than 100 columns. I will have to manually change the dictionary for aggregation.
I tried using :
df.groupby('Name')['Result'].agg(' '.join).reset_index() but I only get 2 columns.
Your sample suggests that, within each group of rows sharing a Name, every SubjectX column has at most one non-NaN value. If that holds, you may try it this way:
import numpy as np
df_final = (df.fillna('').groupby('Name', as_index=False).agg(''.join)
.replace({'':np.nan, 'PassTBD': 'RESULT_FAILED'}))
Out[16]:
Name Subject1 Subject2 Result
0 Amy Science NaN Pass
1 Cat Art NaN Fail
2 Kathy Science NaN Pass
3 Sam Math English RESULT_FAILED
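For the concern about many columns, one possible approach is to build the aggregation dictionary programmatically instead of typing it by hand. This is a sketch, assuming every column other than Name and Result should take its first non-NaN value per group; the name `agg_spec` is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sam", "Amy", "Cat", "Sam", "Kathy"],
    "Subject1": ["Math", "Science", "Art", np.nan, "Science"],
    "Subject2": [np.nan, np.nan, np.nan, "English", np.nan],
    "Result": ["Pass", "Pass", "Fail", "TBD", "Pass"],
})

# "first" picks the first non-NaN value in each group, so it can be
# applied to every subject column without listing them manually.
agg_spec = {col: "first" for col in df.columns if col not in ("Name", "Result")}
agg_spec["Result"] = ", ".join

merged = df.groupby("Name", as_index=False).agg(agg_spec)
# Same rule as in the question: a joined "Pass, TBD" becomes RESULT_FAILED.
merged["Result"] = merged["Result"].replace("Pass, TBD", "RESULT_FAILED")
print(merged)
```

The dictionary comprehension scales to any number of SubjectX columns, since only the two special columns are named explicitly.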

Compare values in two different pandas columns

I have a dataframe that looks like this:
Fruit Cost Quantity Fruit_Copy
Apple 0.5 6 Watermelon
Orange 0.3 2 Orange
Apple 0.5 8 Apple
Apple 0.5 7 Apple
Banana 0.25 8 Banana
Banana 0.25 7 Banana
Apple 0.5 6 Apple
Apple 0.5 3 Apple
I want to write a snippet that, in pandas, compares Fruit and Fruit_Copy and outputs a new column "Match" that indicates whether the values in Fruit equal those in Fruit_Copy.
Thanks in advance!
Let's say your dataframe is 'fruits'. Then you can use the pandas Series.eq method as follows:
fruits['Match'] = pd.Series.eq(fruits['Fruit'],fruits['Fruit_Copy'])
Something like this would work (rows that do not match are left as NaN):
df.loc[df['Fruit'] == df['Fruit_Copy'], 'Match'] = 'Yes'
Using numpy.where:
df['Match'] = np.where(df['Fruit'] == df['Fruit_Copy'], 'Yes', 'No')
You could try something like this:
import pandas as pd
import numpy as np
fruits = pd.DataFrame({'Fruit':['Apple', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple'], 'Cost':[0.5,0.3,0.5,0.5,0.25,0.25,0.5,0.5], 'Quantity':[6,2,8,7,8,7,6,3], 'Fruit_Copy':['Watermelon', 'Orange', 'Apple', 'Apple', 'Banana', 'Banana', 'Apple', 'Apple']})
fruits['Match'] = np.where(fruits['Fruit'] == fruits['Fruit_Copy'], 1, 0)
fruits
Fruit Cost Quantity Fruit_Copy Match
0 Apple 0.50 6 Watermelon 0
1 Orange 0.30 2 Orange 1
2 Apple 0.50 8 Apple 1
3 Apple 0.50 7 Apple 1
4 Banana 0.25 8 Banana 1
5 Banana 0.25 7 Banana 1
6 Apple 0.50 6 Apple 1
7 Apple 0.50 3 Apple 1
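The approaches above differ mainly in what ends up in the new column. The sketch below puts three of them side by side on the first three rows of the sample data only; the column names `Match_bool`, `Match_label` and `Match_loc` are made up for the comparison.

```python
import numpy as np
import pandas as pd

# Shortened sample: first three rows of the question's data.
fruits = pd.DataFrame({
    "Fruit": ["Apple", "Orange", "Apple"],
    "Fruit_Copy": ["Watermelon", "Orange", "Apple"],
})

same = fruits["Fruit"].eq(fruits["Fruit_Copy"])

fruits["Match_bool"] = same                           # True / False
fruits["Match_label"] = np.where(same, "Yes", "No")   # "Yes" / "No"
fruits.loc[same, "Match_loc"] = "Yes"                 # "Yes" / NaN
print(fruits)
```

The `.loc` assignment only writes the matching rows, so a second assignment (or `fillna`) would be needed if non-matching rows should carry a label as well.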

Resources