including word boundary in string modification to be more specific

Background
The following is a minor modification of the setup from "skipping empty list and continuing with function" (included further below).
import pandas as pd
Names = [['ann'],
         [],
         ['elisabeth', 'lis'],
         ['his', 'he'],
         []]
df = pd.DataFrame({'Text': ['ann had an anniversery today',
                            'nothing here',
                            'I like elisabeth and lis 5 lists ',
                            'one day he and his cheated',
                            'same here'],
                   'P_ID': [1, 2, 3, 4, 5],
                   'P_Name': Names})
#rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
Text P_ID P_Name
0 ann had an anniversery today 1 [ann]
1 nothing here 2 []
2 I like elisabeth and lis 5 lists 3 [elisabeth, lis]
3 one day he and his cheated 4 [his, he]
4 same here 5 []
The code below works
m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True)
And does the following
1) uses the name in P_Name to block the corresponding text in the Text column by placing **BLOCK**
2) produces a new column New
The resulting New column is shown below:
New
0 **BLOCK** had an **BLOCK**iversery today
1 NaN
2 I like **BLOCK** and **BLOCK** 5 **BLOCK**ts
3 one day **BLOCK** and **BLOCK** c**BLOCK**ated
4 NaN
Problem
However, this code works a little "too well."
Using ['his','he'] from P_Name to block Text:
Example: one day he and his cheated becomes one day **BLOCK** and **BLOCK** c**BLOCK**ated
Desired: one day he and his cheated becomes one day **BLOCK** and **BLOCK** cheated
In this example, I would like cheated to stay as cheated and not become c**BLOCK**ated
Desired Output
New
0 **BLOCK** had an anniversery today
1 NaN
2 I like **BLOCK** and **BLOCK** 5 lists
3 one day **BLOCK** and **BLOCK** cheated
4 NaN
Question
How do I achieve my desired output?

You need to add a word boundary to each string in the lists of df.loc[m].P_Name, as follows:
s = df.loc[m].P_Name.map(lambda x: [r'\b'+item+r'\b' for item in x])
Out[71]:
0 [\bann\b]
2 [\belisabeth\b, \blis\b]
3 [\bhis\b, \bhe\b]
Name: P_Name, dtype: object
df.loc[m, 'Text'].replace(s, '**BLOCK**',regex=True)
Out[72]:
0 **BLOCK** had an anniversery today
2 I like **BLOCK** and **BLOCK** 5 lists
3 one day **BLOCK** and **BLOCK** cheated
Name: Text, dtype: object
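Putting it together, a minimal sketch of the full assignment; the re.escape call is an extra precaution (not part of the original answer) in case a name ever contains regex metacharacters:
import re
m = df['P_Name'].str.len().ne(0)
# wrap each name in \b word boundaries; re.escape is a defensive assumption,
# only needed if a name could contain characters with special regex meaning
s = df.loc[m, 'P_Name'].map(lambda names: [r'\b' + re.escape(n) + r'\b' for n in names])
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(s, '**BLOCK**', regex=True)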

Sometimes a plain for loop is good practice. Splitting each Text into words and replacing only whole words that appear in P_Name avoids the partial-match problem entirely:
df['New'] = [pd.Series(x).replace(dict.fromkeys(y, '**BLOCK**')).str.cat(sep=' ') for x, y in zip(df.Text.str.split(), df.P_Name)]
df.New.where(df.P_Name.astype(bool), inplace=True)
df
Text ... New
0 ann had an anniversery today ... **BLOCK** had an anniversery today
1 nothing here ... NaN
2 I like elisabeth and lis 5 lists ... I like **BLOCK** and **BLOCK** 5 lists
3 one day he and his cheated ... one day **BLOCK** and **BLOCK** cheated
4 same here ... NaN
[5 rows x 4 columns]
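An equivalent, more explicit version of the same idea (a sketch producing the same New column):
new = []
for text, names in zip(df['Text'], df['P_Name']):
    # replace any word that exactly matches one of the names
    blocked = ['**BLOCK**' if word in names else word for word in text.split()]
    new.append(' '.join(blocked))
df['New'] = new
# rows with an empty name list get NaN instead of the untouched text
df['New'] = df['New'].where(df['P_Name'].astype(bool))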

Related

Optimize groupby->pd.DataFrame->.reset_index->.rename(columns)

I am very new at this, so bear with me please.
I do this:
example=
index Date Column_1 Column_2
1 2019-06-17 Car Red
2 2019-08-10 Car Yellow
3 2019-08-15 Truck Yellow
4 2020-08-12 Truck Yellow
data = example.groupby([pd.Grouper(freq='Y', key='Date'),'Column_1']).nunique()
df1=pd.DataFrame(data)
df2 = df1.reset_index(level=['Column_1','Date'])
df2 = df2.rename(columns={'Date':'interval_year','Column_2':'Sum'})
In order to get this:
df2=
index interval_year Column_1 Sum
1 2019-12-31 Car 2
2 2019-12-31 Truck 1
3 2020-12-31 Truck 1
I get the expected result, but my code gives me a lot of headaches. I create 2 additional DataFrames, and sometimes, when I end up with 2 columns of the same name (one as index), the code becomes even more complicated.
Is there a way to make this more efficient?
Thank you
You can use pd.NamedAgg to do some renaming for you in the groupby like this:
example.groupby([pd.Grouper(key='Date', freq='Y'),'Column_1']).agg(sum=('Date','nunique')).reset_index()
Output:
Date Column_1 sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1
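If the intent is to count unique Column_2 values (as your original code does) rather than unique dates, a variant under that assumption, setting the final column names directly in agg:
df2 = (example.groupby([pd.Grouper(key='Date', freq='Y'), 'Column_1'])
              .agg(Sum=('Column_2', 'nunique'))   # named aggregation: result column is 'Sum'
              .reset_index()
              .rename(columns={'Date': 'interval_year'}))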
To reduce visual noise and avoid the intermediate DataFrames, I suggest method chaining. Try this:
df2 = (
    example
    .assign(Date=lambda d: pd.to_datetime(d['Date']))
    .groupby([pd.Grouper(freq='Y', key='Date'), 'Column_1']).nunique()
    .reset_index()
    .rename(columns={'Date': 'interval_year', 'Column_2': 'Sum'})
)
# Output :
print(df2)
interval_year Column_1 Sum
0 2019-12-31 Car 2
1 2019-12-31 Truck 1
2 2020-12-31 Truck 1

Pandas data frame concat returns same data as first dataframe

I have this dataframe
PNN_sh NN_shap PNN_corr NN_corr
1 25005 1 25005
2 25012 2 25001
3 25011 3 25009
4 25397 4 25445
5 25006 5 25205
Then I made 2 dataframes from this one.
NN_sh = data[['PNN_sh', 'NN_shap']]
NN_corr = data[['PNN_corr', 'NN_corr']]
Thereafter, I sorted them and saved them in new dataframes.
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'])
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'])
Now I want to combine 2 columns from the 2 dataframes above.
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
But what I got is only the first column, copied into the second one as well.
PNN_sh PNN_corr
1 1
5 5
3 3
2 2
4 4
The second column should be
PNN_corr
2
1
3
5
4
Any idea how to fix it? Thanks in advance
Pass ignore_index=True to sort_values():
NN_sh_sort = NN_sh.sort_values(by=['NN_shap'], ignore_index=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr'], ignore_index=True)
Then the result after concat will be:
PNN_sh PNN_corr
0 1 2
1 5 1
2 3 3
3 2 5
4 4 4
I think when you sort you are preserving the original indices of the example DataFrames. Therefore, it is joining the PNN_corr value that was originally in the same row (at same index). Try resetting the index of each DataFrame after sorting, then join/concat.
NN_sh_sort = NN_sh.sort_values(by=['NN_shap']).reset_index()
NN_corr_sort = NN_corr.sort_values(by=['NN_corr']).reset_index()
all_pd = pd.concat([NN_sh_sort['PNN_sh'], NN_corr_sort['PNN_corr']], axis=1, join='inner')
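As a side note, reset_index() without drop=True keeps the old index as an extra 'index' column; that is harmless here because only 'PNN_sh' and 'PNN_corr' are selected afterwards, but drop=True keeps the intermediate frames clean:
NN_sh_sort = NN_sh.sort_values(by=['NN_shap']).reset_index(drop=True)
NN_corr_sort = NN_corr.sort_values(by=['NN_corr']).reset_index(drop=True)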

pandas remove records conditionally based on records count of groups

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country': ['UK'] * 21,
            'Product': ['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
            'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
            'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]}
df2 = pd.DataFrame(raw_data, columns=['Country', 'Product', 'Week', 'val'])
print(df2)
and a mapping dataframe:
mapping = pd.DataFrame({'Product':['A','C'],'Product1':['B','D']}, columns = ['Product','Product1'])
I want to compare products as per the mapping: product A's data should be matched with product B's data. The logic: product A has 4 records, so product B should also keep 4 records, taken from the weeks around (and including) product A's last week number. Product A's last week is 4, so product B keeps week 3 (one week before), week 4 itself, and weeks 5 and 6 (after).
Similarly, product C has 3 records, so product D should also keep 3 records around product C's last week. Product C's last week is 3, so product D keeps weeks 2, 3 and 4.
The records outside these windows (highlighted in yellow in the original screenshot) should be removed.
Define the following function, which selects rows from df for the products in the current row of mapping:
def selRows(row, df):
    # rows for the "reference" product (e.g. A or C)
    rows_1 = df[df.Product == row.Product]
    nr_1 = rows_1.index.size
    lastWk_1 = rows_1.Week.iat[-1]
    # rows for the mapped product, starting one week before the reference
    # product's last week, limited to the same number of records
    rows_2 = df[df.Product.eq(row.Product1) & df.Week.ge(lastWk_1 - 1)].iloc[:nr_1]
    return pd.concat([rows_1, rows_2])
Then call it the following way:
result = pd.concat([selRows(row, grp)
                    for _, grp in df2.groupby(['Country'])
                    for _, row in mapping.iterrows()])
The list comprehension above creates a list of DataFrames - the results of calling selRows on:
each group of rows from df2, for consecutive countries (the outer loop),
each row from mapping (the inner loop).
Then concat concatenates all of them into a single DataFrame.
This solution first creates a mapped column from the mapping DataFrame, then builds dictionaries of the last (maximum) Week and group size per Country and Product:
df2['mapp'] = df2['Product'].map(mapping.set_index('Product1')['Product'])
df1 = df2.groupby(['Country','Product'])['Week'].agg(['max','size'])
#subtracted 1 for last previous value
dprev = df1['max'].sub(1).to_dict()
dlen = df1['size'].to_dict()
print(dlen)
{('UK', 'A'): 4, ('UK', 'B'): 8, ('UK', 'C'): 3, ('UK', 'D'): 6}
Then map the (Country, mapp) pairs to the dictionary values, filter out rows with smaller Week values, and finally limit each group to the required length with DataFrame.head:
df3 = (df2[df2[['Country','mapp']].apply(tuple, 1).map(dprev) <= df2['Week']]
          .groupby(['Country','mapp'])
          .apply(lambda x: x.head(dlen.get(x.name))))
print(df3)
Country Product Week val mapp
Country mapp
UK A 6 UK B 3 7 A
7 UK B 4 8 A
8 UK B 5 9 A
9 UK B 6 10 A
C 16 UK D 2 6 C
17 UK D 3 7 C
18 UK D 4 8 C
Then filter the original rows whose Product is not in mapping['Product1'], append the new df3, and sort:
df = (df2[~df2['Product'].isin(mapping['Product1'])]
         .append(df3, ignore_index=True)
         .sort_values(['Country','Product'])
         .drop('mapp', axis=1))
print(df)
Country Product Week val
0 UK A 1 5
1 UK A 2 4
2 UK A 3 3
3 UK A 4 1
7 UK B 3 7
8 UK B 4 8
9 UK B 5 9
10 UK B 6 10
4 UK C 1 5
5 UK C 2 5
6 UK C 3 5
11 UK D 2 6
12 UK D 3 7
13 UK D 4 8
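As a side note, DataFrame.append was removed in pandas 2.0; on newer versions the same final step can be written with pd.concat, for example:
df = (pd.concat([df2[~df2['Product'].isin(mapping['Product1'])], df3],
                ignore_index=True)
        .sort_values(['Country', 'Product'])
        .drop('mapp', axis=1))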

skipping empty list and continuing with function

Background
import pandas as pd
Names = [['Jon', 'Smith', 'jon', 'John'],
         [],
         ['Bob', 'bobby', 'Bobs']]
df = pd.DataFrame({'Text': ['Jon J Smith is Here and jon John from ',
                            '',
                            'I like Bob and bobby and also Bobs diner '],
                   'P_ID': [1, 2, 3],
                   'P_Name': Names})
#rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
Text P_ID P_Name
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John]
1 2 []
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs]
Goal
I would like to use the following function
df['new']=df.Text.replace(df.P_Name,'**BLOCK**',regex=True)
but skip row 2, since it has an empty list []
Tried
I have tried the following
try:
    df['new'] = df.Text.replace(df.P_Name, '**BLOCK**', regex=True)
except ValueError:
    pass
But I get the following output
Text P_ID P_Name
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John]
1 2 []
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs]
Desired Output
new
0 **BLOCK** J **BLOCK** is Here and **BLOCK** **BLOCK** from
1 []
2 I like **BLOCK** and **BLOCK** and also **BLOCK** diner
Question
How do I get my desired output by skipping row 2 and continuing with my function?
Locate the rows which do not have an empty list and use your replace method only on those rows:
# Boolean indexing the rows which do not have an empty list
m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True)
Output
Text P_ID P_Name New
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John] **BLOCK** J **BLOCK** is Here and **BLOCK** **BLOCK** from
1 2 [] NaN
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs] I like **BLOCK** and **BLOCK** and also **BLOCK**s diner

How to select and subset rows based on string in pandas dataframe?

My dataset looks like the following. I am trying to subset my pandas dataframe such that only the responses given by all 3 people get selected. For example, in the data frame below the responses answered by all 3 people were "I like to eat" and "You have nice day", so only those should be selected. I am not sure how to achieve this in a Pandas dataframe.
Note: I am new to Python ,please provide explanation with your code.
DataFrame example
import pandas as pd
data = {'Person': ['1', '1', '1', '2', '2', '2', '2', '3', '3'],
        'Response': ['I like to eat', 'You have nice day', 'My name is ',
                     'I like to eat', 'You have nice day', 'My name is',
                     'This is it', 'I like to eat', 'You have nice day']}
df = pd.DataFrame(data)
print (df)
Output:
Person Response
0 1 I like to eat
1 1 You have nice day
2 1 My name is
3 2 I like to eat
4 2 You have nice day
5 2 My name is
6 2 This is it
7 3 I like to eat
8 3 You have nice day
IIUC, I am using transform with nunique: for each Response, transform('nunique') counts how many distinct Persons gave that response, and we keep the rows where that count equals the total number of distinct Persons.
yourdf = df[df.groupby('Response').Person.transform('nunique') == df.Person.nunique()]
yourdf
Out[463]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
Method 2
df.groupby('Response').filter(lambda x : pd.Series(df['Person'].unique()).isin(x['Person']).all())
Out[467]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
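Spelling out Method 1 a bit more explicitly, since you asked for an explanation (an equivalent sketch of the same one-liner):
# number of distinct people who gave each response
counts = df.groupby('Response')['Person'].transform('nunique')
# keep only the responses that every distinct person gave
yourdf = df[counts == df['Person'].nunique()]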
