Extract keywords from a title column and compute a mean rank score

I have a DataFrame that is structured in the following way:
Title; Total Visits; Rank
The dog; 8 ; 4
The cat; 9 ; 4
The dog cat; 10 ; 3
The second DataFrame contains:
Keyword; Rank
snail ; 5
dog ; 1
cat ; 2
What I am trying to accomplish is:
Title; Total Visits; Rank ; Keywords ; Score
The dog; 8 ; 4 ; dog ; 1
The cat; 9 ; 4 ; cat ; 2
The dog cat; 10 ; 3 ; dog,cat ; 1.5
I have tried the following, taken from a related answer, but for some reason
df['Tweet'].map(lambda x: tuple(re.findall(r'({})'.format('|'.join(w.values)), x)))
returns null. Any help would be appreciated.

You can use:
#create list of all words
wants = df2.Keyword.tolist()
#dict for mapping
d = df2.set_index('Keyword')['Rank'].to_dict()
#split all values by whitespaces, create series
s = df1.Title.str.split(expand=True).stack()
#filter by list wants
s = s[s.isin(wants)]
print (s)
0 1 dog
1 1 cat
2 1 dog
2 cat
dtype: object
#create new columns
df1['Keywords'] = s.groupby(level=0).apply(','.join)
df1['Score'] = s.map(d).groupby(level=0).mean()
print (df1)
Title Total Visits Rank Keywords Score
0 The dog 8 4 dog 1.0
1 The cat 9 4 cat 2.0
2 The dog cat 10 3 dog,cat 1.5
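For reference, this first approach runs end-to-end as follows (a minimal reproduction of the sample frames shown above):

```python
import pandas as pd

df1 = pd.DataFrame({'Title': ['The dog', 'The cat', 'The dog cat'],
                    'Total Visits': [8, 9, 10],
                    'Rank': [4, 4, 3]})
df2 = pd.DataFrame({'Keyword': ['snail', 'dog', 'cat'],
                    'Rank': [5, 1, 2]})

wants = df2.Keyword.tolist()
d = df2.set_index('Keyword')['Rank'].to_dict()

# split titles into words, keep only the known keywords
s = df1.Title.str.split(expand=True).stack()
s = s[s.isin(wants)]

# join matched keywords per row and average their ranks
df1['Keywords'] = s.groupby(level=0).apply(','.join)
df1['Score'] = s.map(d).groupby(level=0).mean()
print(df1)
```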
Another solution with lists manipulation:
wants = df2.Keyword.tolist()
d = df2.set_index('Keyword')['Rank'].to_dict()
#create list from each value
df1['Keywords'] = df1.Title.str.split()
#remove unnecessary words
df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
#map each word to its rank
df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
#join lists into comma-separated strings
df1['Keywords'] = df1.Keywords.apply(','.join)
#mean
df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
print (df1)
Title Total Visits Rank Keywords Score
0 The dog 8 4 dog 1.0
1 The cat 9 4 cat 2.0
2 The dog cat 10 3 dog,cat 1.5
Timings:
In [96]: %timeit (a(df11, df22))
100 loops, best of 3: 3.71 ms per loop
In [97]: %timeit (b(df1, df2))
100 loops, best of 3: 2.55 ms per loop
Code for testing:
df11 = df1.copy()
df22 = df2.copy()

def a(df1, df2):
    wants = df2.Keyword.tolist()
    d = df2.set_index('Keyword')['Rank'].to_dict()
    s = df1.Title.str.split(expand=True).stack()
    s = s[s.isin(wants)]
    df1['Keywords'] = s.groupby(level=0).apply(','.join)
    df1['Score'] = s.map(d).groupby(level=0).mean()
    return df1

def b(df1, df2):
    wants = df2.Keyword.tolist()
    d = df2.set_index('Keyword')['Rank'].to_dict()
    df1['Keywords'] = df1.Title.str.split()
    df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
    df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
    df1['Keywords'] = df1.Keywords.apply(','.join)
    df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
    return df1

print (a(df11, df22))
print (b(df1, df2))
EDIT by comment:
You can apply a list comprehension if there are Keywords with more than one word:
print (df1)
Title Total Visits Rank
0 The dog 8 4
1 The cat 9 4
2 The dog cat 10 3
print (df2)
Keyword Rank
0 snail 5
1 dog 1
2 cat 2
3 The dog 8
4 the Dog 1
5 The Dog 3
wants = df2.Keyword.tolist()
print (wants)
['snail', 'dog', 'cat', 'The dog', 'the Dog', 'The Dog']
d = df2.set_index('Keyword')['Rank'].to_dict()
df1['Keywords'] = df1.Title.apply(lambda x: [item for item in wants if item in x])
df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
df1['Keywords'] = df1.Keywords.apply(','.join)
df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
print (df1)
Title Total Visits Rank Keywords Score
0 The dog 8 4 dog,The dog 4.500000
1 The cat 9 4 cat 2.000000
2 The dog cat 10 3 dog,cat,The dog 3.666667
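Note that `item in x` here is plain substring matching, so a keyword like `dog` would also match inside words such as `dogs` or `dogma`. If whole-word matches are needed, a word-boundary regex can be used instead (a sketch; `patterns` is a hypothetical helper dict, shown on a reduced version of the frames above):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Title': ['The dog', 'The cat', 'The dog cat']})
wants = ['snail', 'dog', 'cat', 'The dog', 'the Dog', 'The Dog']

# \b word boundaries prevent 'dog' from matching inside e.g. 'dogs'
patterns = {w: re.compile(r'\b{}\b'.format(re.escape(w))) for w in wants}
df1['Keywords'] = df1.Title.apply(
    lambda x: [w for w, p in patterns.items() if p.search(x)])
print(df1)
```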

Related

How to compare a string of one column of pandas with rest of the columns and if value is found in any column of the row append a new row?

I want to compare the Category column with all the predicted columns; if the value matches any predicted column in that row, a new rank column should contain 1, else 0.
Use DataFrame.filter to select the predicted columns, compare them with the Category column via DataFrame.eq, convert to integers, rename the columns with DataFrame.add_prefix, and finally add the new columns with DataFrame.join:
df = pd.DataFrame({
    'category':list('abcabc'),
    'B':[4,5,4,5,5,4],
    'predicted1':list('adadbd'),
    'predicted2':list('cbarac')
})
df1 = df.filter(like='predicted').eq(df['category'], axis=0).astype(int).add_prefix('new_')
df = df.join(df1)
print (df)
category B predicted1 predicted2 new_predicted1 new_predicted2
0 a 4 a c 1 0
1 b 5 d b 0 1
2 c 4 a a 0 0
3 a 5 d r 0 0
4 b 5 b a 1 0
5 c 4 d c 0 1
This solution is much less elegant than the one proposed by @jezrael, however you can try it.
#sample dataframe
d = {'cat': ['comp-el', 'el', 'comp', 'comp-el', 'el', 'comp'], 'predicted1': ['com', 'al', 'p', 'col', 'el', 'comp'], 'predicted2': ['a', 'el', 'p', 'n', 's', 't']}
df = pd.DataFrame(data=d)
#iterating through rows
for i, row in df.iterrows():
    #assigning values
    cat = df.loc[i,'cat']
    predicted1 = df.loc[i,'predicted1']
    predicted2 = df.loc[i,'predicted2']
    #condition
    if (cat == predicted1 or cat == predicted2):
        df.loc[i,'rank'] = 1
    else:
        df.loc[i,'rank'] = 0
output:
cat predicted1 predicted2 rank
0 comp-el com a 0.0
1 el al el 1.0
2 comp p p 0.0
3 comp-el col n 0.0
4 el el s 1.0
5 comp comp t 1.0
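For comparison, the same rank column can be computed without iterrows by reusing the filter/eq pattern from the first answer, reduced along the rows with any (a sketch on the same sample data):

```python
import pandas as pd

d = {'cat': ['comp-el', 'el', 'comp', 'comp-el', 'el', 'comp'],
     'predicted1': ['com', 'al', 'p', 'col', 'el', 'comp'],
     'predicted2': ['a', 'el', 'p', 'n', 's', 't']}
df = pd.DataFrame(data=d)

# rank is 1 when any predicted column equals cat in that row
df['rank'] = (df.filter(like='predicted')
                .eq(df['cat'], axis=0)
                .any(axis=1)
                .astype(int))
print(df)
```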

How can I extract numbers as well as string from multiple rows in a Data Frame column?

DF1
index|Number
0 |[Number 1]
1 |[Number 2]
2 |[kg]
3 |[]
4 |[kg,Number 3]
In my dataframe in the Number column, I need to extract the number if present, kg if the string has kg and NaN if there is no value. If the row has both the number and kg then I will extract only the number.
Expected Output
index|Number
0 |1
1 |2
2 |kg
3 |NaN
4 |3
I wrote a lambda function for this but I am getting Error
NumorKG = lambda x: x.str.extract('(\d+)') if x.str.extract('(\d+)').isdigit() else 'kg' if x.str.find('kg') else "NaN"
DF1['Number']=DF1['Number'].apply(NumorKG)
The error that I am getting is:
AttributeError: 'str' object has no attribute 'str'
Use numpy.where to set the values:
#extract numeric to Series
d = df['Number'].str.extract('(\d+)', expand=False)
#test if digit
mask1 = d.str.isdigit().fillna(False)
#test if values contains kg
mask2 = df['Number'].str.contains('kg', na=False)
df['Number'] = np.where(mask1, d,
np.where(mask2 & ~mask1, 'kg',np.nan))
print (df)
Number
0 1
1 2
2 kg
3 nan
4 3
Your solution should be changed:
import re

def NumorKG(x):
    a = re.findall('(\d+)', x)
    if len(a) > 0:
        return a[0]
    elif 'kg' in x:
        return 'kg'
    else:
        return np.nan

df['Number'] = df['Number'].apply(NumorKG)
print (df)
Number
0 1
1 2
2 kg
3 NaN
4 3
And your lambda function should be changed (parentheses are needed for it to span multiple lines):
NumorKG = lambda x: (re.findall('(\d+)', x)[0]
                     if len(re.findall('(\d+)', x)) > 0
                     else 'kg' if 'kg' in x
                     else np.nan)
In apply, what is returned is a scalar, so you can't use the .str accessor.
As you are dealing with only one column, no need for apply.
As an alternative to jezrael's answer (one that is reproducible), this is a possible solution:
DF1 = pd.DataFrame({'Number': [["Number 1"], ["Number 2"], ["kg"], [""], ["kg", "Number 3"]]})
DF1['Number'] = DF1.Number.str.join(sep=" ")
mask_digit = DF1.Number.str.extract('(\d+)', expand=False).str.isdigit().fillna(False)
mask_kg = DF1['Number'].str.contains('kg', na=False)
DF1.loc[mask_digit, 'Number'] = DF1.Number.str.extract('(\d+)', expand=False)
DF1.loc[mask_kg,'Number'] = 'kg'
DF1.loc[~(mask_digit | mask_kg), 'Number'] = np.NaN
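A compact variant of the two masks above, which keeps a real NaN rather than the string 'nan' (a sketch built on the reproducible frame, using Series.where for the fallback):

```python
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'Number': [["Number 1"], ["Number 2"], ["kg"], [""], ["kg", "Number 3"]]})
s = DF1['Number'].str.join(' ')

# prefer an extracted number; otherwise 'kg' if present; otherwise NaN
num = s.str.extract(r'(\d+)', expand=False)
DF1['Number'] = num.where(num.notna(),
                          s.str.contains('kg').map({True: 'kg', False: np.nan}))
print(DF1)
```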

how to split dataframe into equal number of subset in python

I have a dataframe
import pandas as pd
d = {'user': [1, 1, 2,2,2,2 ,2,2,2,2], 'friends':
[1,2,1,5,4,6,7,20,9,7]}
df = pd.DataFrame(data=d)
I am trying to split the df into n pieces in a loop. For example, for n=3
n = 3
for i in range(3):
    subdata = dosomething(df)
    print(subdata)
the output will be something like
# first loop
user friends
0 1 1
1 1 2
2 2 1
3 2 5
# second loop
user friends
0 2 4
1 2 6
2 2 7
3 2 20
#third loop
user friends
0 2 9
1 2 7
You can use iloc and loop through the dataframe, putting each new dataframe in a dictionary for recall later.
dfs = {}
chunk = 4
Loop through the dataframe in chunk-sized steps, creating a df for each slice and adding it to the dict. The iloc slice automatically picks up any leftover rows in the last iteration, and the ceiling division avoids storing an empty frame when the length divides evenly by chunk.
for n in range((df.shape[0] + chunk - 1) // chunk):
    df_temp = df.iloc[n*chunk:(n+1)*chunk]
    df_temp = df_temp.reset_index(drop=True)
    dfs[n] = df_temp
Access the dataframes in the dictionary.
print(dfs[0])
user friends
0 1 1
1 1 2
2 2 1
3 2 5
print(dfs[1])
user friends
0 2 4
1 2 6
2 2 7
3 2 20
print(dfs[2])
user friends
0 2 9
1 2 7
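As an alternative, numpy's array_split produces n roughly equal pieces in one call; note that it balances the sizes (4/3/3 here) rather than filling each chunk first (4/4/2):

```python
import numpy as np
import pandas as pd

d = {'user': [1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
     'friends': [1, 2, 1, 5, 4, 6, 7, 20, 9, 7]}
df = pd.DataFrame(data=d)

# split into 3 roughly equal sub-frames and reset each index
pieces = [p.reset_index(drop=True) for p in np.array_split(df, 3)]
for p in pieces:
    print(p)
```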

How to use pandas df column value in if-else expression to calculate additional columns

I am trying to calculate additional metrics from existing pandas dataframe by using an if/else condition on existing column values.
if (df['Sell_Ind']=='N').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.CurrentPrice, axis=1).astype(float).round(2)
elif (df['Sell_Ind']=='Y').any():
    df['MarketValue'] = df.apply(lambda row: row.SharesUnits * row.Sold_price, axis=1).astype(float).round(2)
else:
    df['MarketValue'] = df.apply(lambda row: 0)
For the if condition the MarketValue is calculated correctly, but for the elif condition it's not giving the correct value.
Can anyone point out what I am doing wrong in this code?
The problem is that (df['Sell_Ind']=='N').any() is a single scalar test for the whole frame, so only one branch ever runs and its formula is applied to every row. I think you need numpy.select; apply can be removed and the columns multiplied with mul:
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
Sample:
df = pd.DataFrame({'Sold_price':[7,8,9,4,2,3],
                   'SharesUnits':[1,3,5,7,1,0],
                   'CurrentPrice':[5,3,6,9,2,4],
                   'Sell_Ind':list('NNYYTT')})
#print (df)
m1 = df['Sell_Ind']=='N'
m2 = df['Sell_Ind']=='Y'
a = df.SharesUnits.mul(df.CurrentPrice).astype(float).round(2)
b = df.SharesUnits.mul(df.Sold_price).astype(float).round(2)
df['MarketValue'] = np.select([m1, m2], [a,b], default=0)
print (df)
CurrentPrice Sell_Ind SharesUnits Sold_price MarketValue
0 5 N 1 7 5.0
1 3 N 3 8 9.0
2 6 Y 5 9 45.0
3 9 Y 7 4 28.0
4 2 T 1 2 0.0
5 4 T 0 3 0.0

Python/Pandas return column and row index of found string

I've searched previous answers relating to this but those answers seem to utilize numpy because the array contains numbers. I am trying to search for a keyword in a sentence in a dataframe ('Timeframe') where the full sentence is 'Timeframe for wave in ____' and would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding a hard code for dynamic reasons. Is there a way to return the [34,0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To check the index you need str.contains with boolean indexing, but then there are 3 possible outcomes:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
A
0 Timeframe for wave in ____
1 a
2 c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        #only one value - return scalar
        return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
Old solution:
It seems you need numpy.where to check for the value Timeframe:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,'Timeframe'],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 Timeframe 0 4 b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
In case you have multiple columns where to look into you can use following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask==True)
result = [find_result[0][0], find_result[1][0]]
Then output for df and result would be:
>>> df
0 1 2 3
0 1 2 3 4
1 a b Timeframe for wave in____ d
2 5 6 7 8
>>> result
[1, 2]
