compare columns and replace result in existing column - python-3.x

I have two pandas columns; I want to first compare the two columns and then replace an old string with a new one.
My data:
shopping  on_List
Banana    1
Apple     0
Grapes    1
None      0
Banana    1
Nuts      0
Lemon     1
In order to compare the two, I have done the following:
results = []
for shopping, on_list in zip(df.shopping, df.on_List):
    if shopping != 'None' and on_list == 1:
        items = shopping
        if items == 'Banana':
            re = items.replace('Banana', 'Bananas')
        elif items == 'Lemon':
            re = items.replace('Lemon', 'Lemons')
        elif items == 'Apple':
            re = items.replace('Apple', 'Apples')
        results.append(re)
print(results)
Output: ['Bananas', 'Lemons', 'Apples']
Ideally I would like a new column that replaces the old values with the new ones in the 'shopping' column.
This is my desired output, but unfortunately my new list (results) is not the same length as the current df:
shopping
Bananas
Apples
Grapes
None
Bananas
Nuts
Lemons

I suggest creating a dictionary for mapping and replacing only the filtered values:
d = {'Banana':'Bananas', 'Lemon':'Lemons', 'Apple':'Apples'}
mask = df['on_List'].eq(1) & df['shopping'].notnull()
df['shopping'] = df['shopping'].mask(mask, df['shopping'].map(d)).fillna(df['shopping'])
#slower solution
#df['shopping'] = df['shopping'].mask(mask, df['shopping'].replace(d))
print (df)
  shopping  on_List
0  Bananas        1
1    Apple        0
2   Grapes        1
3     None        0
4  Bananas        1
5     Nuts        0
6   Lemons        1
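As a self-contained check of the mapping approach on the sample data (assuming the None in shopping is a real None rather than the string 'None'):

```python
import pandas as pd

df = pd.DataFrame({
    "shopping": ["Banana", "Apple", "Grapes", None, "Banana", "Nuts", "Lemon"],
    "on_List": [1, 0, 1, 0, 1, 0, 1],
})

d = {"Banana": "Bananas", "Lemon": "Lemons", "Apple": "Apples"}

# replace only rows flagged with 1; fall back to the original value
# where the dictionary has no entry (e.g. Grapes)
mask = df["on_List"].eq(1) & df["shopping"].notnull()
df["shopping"] = df["shopping"].mask(mask, df["shopping"].map(d)).fillna(df["shopping"])
print(df)
```

Unmapped flagged values (Grapes) become NaN after map, which is why the trailing fillna restores them from the original column.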

A plain loop also works, as long as every row appends a value so the result has the same length as the df:
val = []
for i in range(len(df)):
    if df["shopping"][i] is not None and df["on_List"][i] == 1:
        if df["shopping"][i] == "Banana":
            val.append("Bananas")
        elif df["shopping"][i] == "Lemon":
            val.append("Lemons")
        elif df["shopping"][i] == "Apple":
            val.append("Apples")
        else:
            val.append(df["shopping"][i])
    else:
        val.append(df["shopping"][i])
df["Result"] = pd.Series(val)

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
Fruit price
0 apple 1
1 orange 2
2 apple 1
3 banana 3
4 orange 2
5 apple 1
6 banana 3
What I want to do is check if Fruit == banana and if yes, I want the code to scan the preceding as well as the next n rows from the index position of the 'banana' row, for an instance where Fruit == apple. An example of the expected output is shown below taking n=2.
Fruit price
2 apple 1
5 apple 1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2 & m3]
Output:
   Fruit  price
2  apple      1
5  apple      1
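The same rolling-window trick can be wrapped in a small helper; the function name and signature below are illustrative, not from the question:

```python
import pandas as pd

def near_matches(df, col, anchor, target, n):
    """Return rows where col == target within n rows of a row where col == anchor."""
    is_anchor = df[col].eq(anchor)
    # a centered window of 2n+1 rows: max() is 1 if any anchor falls inside it
    near_anchor = is_anchor.rolling(2 * n + 1, min_periods=1, center=True).max().eq(1)
    return df[near_anchor & df[col].eq(target)]

df = pd.DataFrame({'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple', 'banana'],
                   'price': [1, 2, 1, 3, 2, 1, 3]})
out = near_matches(df, 'Fruit', 'banana', 'apple', n=2)
print(out)
```

On the sample data this keeps the apples at index 2 and 5, matching the expected output; the apple at index 0 is more than two rows from any banana and is dropped.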

How to encode pandas data frame column with three values fast?

I have a pandas data frame that contains a column called Country. I have more than a million rows in my data frame.
Country
USA
Canada
Japan
India
Brazil
......
I want to create a new column called Country_Encode, which will replace USA with 1, Canada with 2, and all others with 0 like the following.
Country Country_Encode
USA 1
Canada 2
Japan 0
India 0
Brazil 0
..................
I have tried the following.
for idx, row in df.iterrows():
    if df.loc[idx, 'Country'] == 'USA':
        df.loc[idx, 'Country_Encode'] = 1
    elif df.loc[idx, 'Country'] == 'Canada':
        df.loc[idx, 'Country_Encode'] = 2
    else:
        df.loc[idx, 'Country_Encode'] = 0
The above solution works but it is very slow. Do you know how I can do it in a fast way? I really appreciate any help you can provide.
Assuming no row contains two country names, you could assign values in a vectorized way using a boolean condition:
df['Country_encode'] = df['Country'].eq('USA') + df['Country'].eq('Canada')*2
Output:
  Country  Country_encode
0     USA               1
1  Canada               2
2   Japan               0
3   India               0
4  Brazil               0
But in general, loc is very fast:
df['Country_encode'] = 0
df.loc[df['Country'].eq('USA'), 'Country_encode'] = 1
df.loc[df['Country'].eq('Canada'), 'Country_encode'] = 2
There are many ways to do this; the most basic one is the following:
def coding(row):
    if row == "USA":
        return 1
    elif row == "Canada":
        return 2
    else:
        return 0

df["Country_code"] = df["Country"].apply(coding)
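Another common vectorized option, not shown above, is Series.map with a fill value for all other countries; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'Canada', 'Japan', 'India', 'Brazil']})

# countries absent from the dict map to NaN, which fillna turns into 0
df['Country_Encode'] = df['Country'].map({'USA': 1, 'Canada': 2}).fillna(0).astype(int)
print(df)
```

This scales to many categories by just growing the dictionary, with no extra conditions.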

How do I check the total number of times a certain value occurred in a nested loop?

Question: Calculate the total number of apples bought on Monday and Wednesday only.
This is my code currently:
apple = 0
banana = 0
orange = 0
# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [['apple', 'cherry', 'apple', 'orange'],
         ['orange', 'apple', 'apple', 'banana'],
         ['banana', 'apple', 'cherry', 'orange']]
for x in fruit:
    if 'apple' in x:
        if fruit.index(x) == 0 or fruit.index(x) == 2:
            apple + 1
print(apple)
For some reason, the current result which I am getting for printing apple is 0.
What is wrong with my code?
The problem in your code is that apple + 1 computes a new value but never assigns it to anything, which is why apple keeps printing its initial value:
apple = 0
apple + 1
You need to do:
apple += 1
Also note that fruit.index(x) always returns the index of the first occurrence of an element; for example:
fruit[1].index('apple')
returns the index of the first occurrence of 'apple' in fruit[1], which is 1.
Since the question asks only for apples bought on Monday and Wednesday, it is simpler to address those days directly by index rather than calling fruit.index(x). Below is the corrected solution:
apple = 0
banana = 0
orange = 0
# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [['apple', 'cherry', 'apple', 'orange'],
         ['orange', 'apple', 'apple', 'banana'],
         ['banana', 'apple', 'cherry', 'orange']]
apple += fruit[0].count('apple')
apple += fruit[2].count('apple')
print(apple)
There are two issues with your code.
The first issue is:
if fruit.index(x) == 0 or fruit.index(x) == 2:
    apple + 1
apple + 1 does not do anything meaningful. If you want to increment apple, you need apple += 1. With that fix, apple ends up as 2.
The second issue is that you need the total number of apples, which is 3, not 2: two apples were bought on Monday and one on Wednesday.
You can use collections.Counter for this:
from collections import Counter

for x in fruit:
    if 'apple' in x:
        if fruit.index(x) == 0 or fruit.index(x) == 2:
            apple += Counter(x)['apple']

It should be apple += 1, not apple + 1.
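Putting the Counter idea together into a self-contained snippet (indexing the two days directly instead of calling fruit.index):

```python
from collections import Counter

# number of fruits bought on Monday, Tuesday and Wednesday respectively
fruit = [['apple', 'cherry', 'apple', 'orange'],
         ['orange', 'apple', 'apple', 'banana'],
         ['banana', 'apple', 'cherry', 'orange']]

# count apples on Monday (index 0) and Wednesday (index 2) only
apple = sum(Counter(fruit[day])['apple'] for day in (0, 2))
print(apple)
```

This yields 3: two apples on Monday plus one on Wednesday.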

find index of row element in pandas

If you have a df:
apple banana carrot
a 1 2 3
b 2 3 1
c 0 0 1
To find the row index where a column's cell is equal to 0, you can use df[df['apple']==0].index,
but can you transpose this so you find the columns of row c where it is 0?
Basically I need to drop the columns where c==0, and I would like to do this in one line by row rather than by each column.
If you want to test row c and get all columns where it is 0:
c = df.columns[df.loc['c'] == 0]
print (c)
Index(['apple', 'banana'], dtype='object')
If you want to test all rows:
c1 = df.columns[df.eq(0).any()]
print (c1)
Index(['apple', 'banana'], dtype='object')
If you need to remove columns with a 0 in any row:
df = df.loc[:, df.ne(0).all()]
print (df)
carrot
a 3
b 1
c 1
Detail/explanation:
First, compare all values of the DataFrame with ne (!=):
print (df.ne(0))
apple banana carrot
a True True True
b True True True
c False False True
Then use all to check whether each column is entirely True:
print (df.ne(0).all())
apple False
banana False
carrot True
dtype: bool
Last, filter with DataFrame.loc:
print (df.loc[:, df.ne(0).all()])
carrot
a 3
b 1
c 1
If you need to test only row c, the solution is similar; just select row c with loc first and omit all:
df = df.loc[:, df.loc['c'].ne(0)]
Yes you can: df.T[df.T['c']==0].index returns the columns (here apple and banana) where row c is 0.
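Equivalently, if you prefer to name the columns being removed, DataFrame.drop works in one line; a sketch on the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'apple': [1, 2, 0], 'banana': [2, 3, 0], 'carrot': [3, 1, 1]},
                  index=['a', 'b', 'c'])

# drop every column whose value in row 'c' is 0
out = df.drop(columns=df.columns[df.loc['c'].eq(0)])
print(out)
```

Only carrot survives, since apple and banana are 0 in row c.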

Extract keywords from a title, relevant, and final column math

I have a DataFrame that is structured in the following way:
Title; Total Visits; Rank
The dog; 8 ; 4
The cat; 9 ; 4
The dog cat; 10 ; 3
The second DataFrame contains:
Keyword; Rank
snail ; 5
dog ; 1
cat ; 2
What I am trying to accomplish is:
Title; Total Visits; Rank ; Keywords ; Score
The dog; 8 ; 4 ; dog ; 1
The cat; 9 ; 4 ; cat ; 2
The dog cat; 10 ; 3 ; dog,cat ; 1.5
I have made use of the following reference, but for some reason
df['Tweet'].map(lambda x: tuple(re.findall(r'({})'.format('|'.join(w.values)), x)))
returns null. Any help would be appreciated.
You can use:
#create list of all words
wants = df2.Keyword.tolist()
#dict for mapping
d = df2.set_index('Keyword')['Rank'].to_dict()
#split all values by whitespace, create a Series
s = df1.Title.str.split(expand=True).stack()
#filter by list wants
s = s[s.isin(wants)]
print (s)
0  1    dog
1  1    cat
2  1    dog
   2    cat
dtype: object
#create new columns
df1['Keywords'] = s.groupby(level=0).apply(','.join)
df1['Score'] = s.map(d).groupby(level=0).mean()
print (df1)
Title Total Visits Rank Keywords Score
0 The dog 8 4 dog 1.0
1 The cat 9 4 cat 2.0
2 The dog cat 10 3 dog,cat 1.5
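The first solution can be run end-to-end; a self-contained sketch with the sample frames constructed here for illustration (semicolons in the question read as column separators):

```python
import pandas as pd

df1 = pd.DataFrame({'Title': ['The dog', 'The cat', 'The dog cat'],
                    'Total Visits': [8, 9, 10],
                    'Rank': [4, 4, 3]})
df2 = pd.DataFrame({'Keyword': ['snail', 'dog', 'cat'],
                    'Rank': [5, 1, 2]})

d = df2.set_index('Keyword')['Rank'].to_dict()
# one row per word, MultiIndexed by (original row, word position)
s = df1.Title.str.split(expand=True).stack()
s = s[s.isin(df2.Keyword.tolist())]

# join matched words per original row; average their ranks per row
df1['Keywords'] = s.groupby(level=0).apply(','.join)
df1['Score'] = s.map(d).groupby(level=0).mean()
print(df1)
```

Grouping by level=0 collapses the word-position level back to one value per title row, which is what lets the results align with df1 on assignment.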
Another solution with lists manipulation:
wants = df2.Keyword.tolist()
d = df2.set_index('Keyword')['Rank'].to_dict()
#create a list from each value
df1['Keywords'] = df1.Title.str.split()
#remove unnecessary words
df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
#map each word to its Rank
df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
#create new columns
df1['Keywords'] = df1.Keywords.apply(','.join)
#mean
df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
print (df1)
Title Total Visits Rank Keywords Score
0 The dog 8 4 dog 1.0
1 The cat 9 4 cat 2.0
2 The dog cat 10 3 dog,cat 1.5
Timings:
In [96]: %timeit (a(df11, df22))
100 loops, best of 3: 3.71 ms per loop
In [97]: %timeit (b(df1, df2))
100 loops, best of 3: 2.55 ms per loop
Code for testing:
df11 = df1.copy()
df22 = df2.copy()

def a(df1, df2):
    wants = df2.Keyword.tolist()
    d = df2.set_index('Keyword')['Rank'].to_dict()
    s = df1.Title.str.split(expand=True).stack()
    s = s[s.isin(wants)]
    df1['Keywords'] = s.groupby(level=0).apply(','.join)
    df1['Score'] = s.map(d).groupby(level=0).mean()
    return df1

def b(df1, df2):
    wants = df2.Keyword.tolist()
    d = df2.set_index('Keyword')['Rank'].to_dict()
    df1['Keywords'] = df1.Title.str.split()
    df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
    df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
    df1['Keywords'] = df1.Keywords.apply(','.join)
    df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
    return df1

print (a(df11, df22))
print (b(df1, df2))
EDIT by comment:
You can use a list comprehension if there are Keywords consisting of more than one word:
print (df1)
Title Total Visits Rank
0 The dog 8 4
1 The cat 9 4
2 The dog cat 10 3
print (df2)
Keyword Rank
0 snail 5
1 dog 1
2 cat 2
3 The dog 8
4 the Dog 1
5 The Dog 3
wants = df2.Keyword.tolist()
print (wants)
['snail', 'dog', 'cat', 'The dog', 'the Dog', 'The Dog']
d = df2.set_index('Keyword')['Rank'].to_dict()
df1['Keywords'] = df1.Title.apply(lambda x: [item for item in wants if item in x])
df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
df1['Keywords'] = df1.Keywords.apply(','.join)
df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
print (df1)
Title Total Visits Rank Keywords Score
0 The dog 8 4 dog,The dog 4.500000
1 The cat 9 4 cat 2.000000
2 The dog cat 10 3 dog,cat,The dog 3.666667
