Pandas dataframe: match a column against a list of sub-strings on consecutive rows, keep only the sub-string

In a Pandas dataframe, I want to match Col1 against a list of keywords as follows:
The keywords need to be different, located in the same column, and on 3 consecutive rows (keyword1 != keyword2 != keyword3, located for example on rows x, x+1 and x+2).
I only want the keywords returned as results (in the example below, " def" is removed).
list_keywords = ['abc', 'ghi', 'jkl mnop','blabla']
Index Col1
1 abc def
2 ghi
3 jkl mnop
4 qrstu
5 vw
Expected output:
1 abc
2 ghi
3 jkl mnop

You could do something like this with df.iterrows().
for idx, row in df.iterrows():
    if row['col1'] in list_keywords:
        continue  # already an exact keyword, keep it as is
    # assigning to the iterrows row would not modify df, so write back with df.at
    df.at[idx, 'col1'] = ' '.join(w for w in row['col1'].split() if w in list_keywords)
df
col1
0 abc
1 ghi
2 jkl mnop
3
4
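A loop-free alternative with apply, a sketch of the same per-cell logic on the same df:
df['col1'] = df['col1'].apply(
    lambda v: v if v in list_keywords
    else ' '.join(w for w in v.split() if w in list_keywords))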

Based on @HTRS's answer, here is what seems to be a partial answer to my question.
This piece of code filters the column Brand against a list of keywords and filters out the parts of the strings that differ from the keywords.
import pandas as pd
list_filtered = []
list_keywords = ['abc', 'ghi', 'jkl mnop','blabla']
for _, row in df.iterrows():
    if row['Brand'] in list_keywords:
        # exact keyword, keep unchanged
        list_filtered.append(row['Brand'])
    else:
        val = row['Brand'].split()
        list_filtered.append(' '.join(w for w in val if w in list_keywords))
df['Filtered'] = list_filtered
print(df)
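For the consecutive-rows condition itself (three different keywords on rows x, x+1, x+2), here is a sketch using shift(), assuming the Filtered column built above and that a "run" means three consecutive rows holding distinct keywords:
f = df['Filtered']
is_kw = f.isin(list_keywords)
# flag rows that start a run of 3 values that are all keywords and pairwise different
start = (is_kw & is_kw.shift(-1, fill_value=False) & is_kw.shift(-2, fill_value=False)
         & (f != f.shift(-1)) & (f != f.shift(-2)) & (f.shift(-1) != f.shift(-2)))
# keep a row if it starts such a run or is the 2nd/3rd row of one
keep = start | start.shift(1, fill_value=False) | start.shift(2, fill_value=False)
print(df.loc[keep, 'Filtered'])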

Merge several strings in a dataframe based on a groupby in Pandas

I want to merge several strings in a dataframe based on a groupby in Pandas.
This is my code so far:
import pandas as pd
from io import StringIO
data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
""")
# load string as stream into dataframe
df = pd.read_csv(data,header=0, names=["name","text","date"],parse_dates=[2])
# add column with month
df["month"] = df["date"].apply(lambda x: x.month)
I want the end result to look like this (one concatenated text per name and month):
name month text
name1 11 hej,du
name1 12 aj,oj
name2 11 fin,katt
name2 12 mycket,lite
I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!
You can groupby the 'name' and 'month' columns, then call transform, which returns data aligned to the original df, and apply a lambda that joins the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
name text month
0 name1 hej,du 11
2 name1 aj,oj 12
4 name2 fin,katt 11
6 name2 mycket,lite 12
I subset the original df by passing a list of the columns of interest, df[['name','text','month']], and then call drop_duplicates.
EDIT: actually I can just call apply and then reset_index:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
Update: the lambda is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
We can group by the 'name' and 'month' columns, then call the agg() function of Pandas DataFrame objects.
The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation.
df.groupby(['name', 'month'], as_index = False).agg({'text': ' '.join})
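For instance, a sketch of several statistics per group in one agg() call, using named aggregation (assumes pandas >= 0.25; the column names text and n_rows are just for illustration):
df.groupby(['name', 'month'], as_index=False).agg(
    text=('text', ','.join),   # concatenated strings per group
    n_rows=('text', 'size'),   # row count per group in the same pass
)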
The answer by EdChum provides you with a lot of flexibility, but if you just want to concatenate strings into a column of list objects you can also:
output_series = df.groupby(['name','month'])['text'].apply(list)
If you want to concatenate your "text" into a list:
df.groupby(['name', 'month'], as_index = False).agg({'text': list})
For me the above solutions were close but added some unwanted \n's and a dtype: object suffix, so here's a modified version:
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('(\\n)', '').reset_index()
Please try this line of code:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Although this is an old question, just in case: I used the code below and it seems to work like a charm.
text = ''.join(df[df['date'].dt.month==8]['text'])
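Note that the month filter (8) does not match the sample data above, whose months are 11 and 12; with that data you would write, for example (this joins across all names):
text_nov = ','.join(df[df['date'].dt.month == 11]['text'])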
Thanks to all the other answers, the following is probably the most concise and feels more natural. Using df.groupby("X")["A"].agg() aggregates over one or many selected columns.
df = pd.DataFrame({'A' : ['a', 'a', 'b', 'c', 'c'],
                   'B' : ['i', 'j', 'k', 'i', 'j'],
                   'X' : [1, 2, 2, 1, 3]})
A B X
a i 1
a j 2
b k 2
c i 1
c j 3
df.groupby("X", as_index=False)["A"].agg(' '.join)
X A
1 a c
2 a b
3 c
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
X A B
1 a c i i
2 a b j k
3 c j
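The dict form of agg also lets each column use its own separator, e.g. (a small sketch on the same df):
df.groupby("X", as_index=False).agg({'A': ' '.join, 'B': ','.join})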

pandas help: map and match tab-delimited strings in a column and print into a new column

I have a dataframe data whose last column contains a bunch of strings and digits, and another dataframe info that explains what those strings and digits mean. I want to map user input (items) against info, then match, print and count how many of the corresponding IDs are present in the last column of data, and finally sort data by the number of matches.
import pandas as pd
#data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
#info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
desired output:
id Name MP-ID match count
3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
123 abc MP:001|MP:0085|MP:0985 MP:001 1
456 def MP:005|MP:0258 MP:005 1
789 hij MP:025|MP:5890 0
1122 klm MP:0589|MP:02546 0
Thank you for your help in advance
First get all command-line arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc and extract the matches with Series.str.findall plus Series.str.join; last, use Series.str.count with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
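Since str.findall matches substrings (a hypothetical MP:0012 would also match the pattern MP:001), a stricter token-level variant (a sketch) splits on '|' and keeps exact IDs only:
wanted = set(s)   # s from above: the MP-IDs for the requested items
test_data['MP-ID match'] = test_data['MP-ID'].str.split('|').apply(
    lambda ids: '|'.join(i for i in ids if i in wanted))
test_data['count'] = test_data['MP-ID match'].str.count('MP')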

Prevent column name from disappearing after using replace on dataframe

So I have a real dataframe that somewhat follows this structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries have the " _ ", others have " ".
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
# Replace the '_' with ' '
new_df = df['col1'].str.replace('_',' ')
My problem is that my new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!
The function str.replace returns a Series, so there is no column name, only a Series name.
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
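Because the result is a Series, you can also rename it in the chain with Series.rename (a small sketch):
s = df['col1'].str.replace('_',' ').rename('Name')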
If you need a new column, assign to the same DataFrame as df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series to a df:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
As @anky_91 commented, if you need 2 new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
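For the original goal (the leading number in one column, the rest in another), a one-step sketch that splits on the first '_' or ' ' directly; the regex flag assumes pandas >= 1.4, and the column names num/rest are just for illustration:
df[['num', 'rest']] = df['col1'].str.split(r'[_ ]', n=1, expand=True, regex=True)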

Renaming Duplicates in CSV column in sequence by python

I have CSV data with a particular column having duplicate entries say
like a,b,c,a,b,c,v,f,c... I want to replace the values to
a,b,c,a_1,b_1,c_1,v,f,c_2...
I wrote the code below to find duplicates:
import csv
from collections import Counter
import pandas as pd
duplicate_names=[]
file='2018_Akola_August.csv'
with open(file, 'r', newline='') as csv_file:
    occurrences = Counter()
    for line in csv.reader(csv_file):
        email = line[3]
        if email in occurrences:
            print(email)
            duplicate_names.append(email)
            occurrences[email] += 1
        else:
            occurrences[email] = 1
Also, to replace a string in a CSV column I wrote the code below, but it does not work as desired when a value is duplicated more than once.
df = pd.read_csv(file, index_col=False, header=0)
#Finds 'a' and replaces it with 'a_1'
df.loc[df['Circle'] == 'a' , 'Circle']= 'a_1'
print(df)
df.to_csv(file)
It is not clear to me what effect this statement will have:
df.loc[df['Circle'] == 'a' , 'Circle'] = 'a_1'
How to go about renaming such duplicates in sequence?
Here is a way, in 2 steps:
>>> df
Circle
0 a
1 b
2 c
3 a
4 b
5 c
6 v
7 f
8 c
dups = (df.loc[df['Circle'].duplicated(), 'Circle'] + '_' +
        df.groupby('Circle').cumcount().astype(str))
df.loc[dups.notnull(), 'Circle'] = dups
>>> df
Circle
0 a
1 b
2 c
3 a_1
4 b_1
5 c_1
6 v
7 f
8 c_2
In answer to your second question, the line:
df.loc[df['Circle'] == 'a' , 'Circle'] = 'a_1'
will take all values of Circle where it is equal to 'a' and change them to 'a_1'.
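A more compact variant of the two-step approach above (a sketch, applied to the original df):
occ = df.groupby('Circle').cumcount()   # 0 for the first occurrence, 1, 2, ... for repeats
df.loc[occ > 0, 'Circle'] = df['Circle'] + '_' + occ.astype(str)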

How to compute the difference between 2 string columns in a pandas dataframe

I have 2 columns in the same dataframe as below:
A B
abcdef(as3456) as3456
pqrst(dh6546) dh6546
I need to create another column C such as below:
C
abcdef
pqrst
I have been able to create column B from A; however, my purpose is not fully served yet. Can someone please help me get column C from columns A and B as described? I tried the != operator and "~loc", but that does not seem to work in this case (maybe because it's a string).
For the per-row difference between the columns, use replace with strip:
df = pd.DataFrame({'A': ['abcdef(as3456)', 'pqrst(dh6546)', 'abcdef(dh6546)'],
                   'B': ['as3456', 'dh6546', 'as3456']})
print (df)
A B
0 abcdef(as3456) as3456
1 pqrst(dh6546) dh6546
2 abcdef(dh6546) as3456
# third row added for demonstration: B does not match A per row,
# but '(dh6546)' appears in another row of df.B
# replace df.B values per row
df['C'] = [i.replace(j, '').strip('()') for i, j in zip(df.A, df.B)]
# replace any value from df.B, regardless of row
pat = '|'.join([r'\({}\)'.format(i) for i in df.B])
df['D'] = df.A.str.replace(pat, '')
print (df)
A B C D
0 abcdef(as3456) as3456 abcdef abcdef
1 pqrst(dh6546) dh6546 pqrst pqrst
2 abcdef(dh6546) as3456 abcdef(dh6546 abcdef
df['C']=df.A.replace(regex=r'\(.*$', value='')
df
A B C
0 abcdef(as3456) as3456 abcdef
1 pqrst(dh6546) dh6546 pqrst
Or you can do:
df['C']=df.A.replace(regex=r'\('+ df.B +r'\)',value="")
A B C
0 abcdef(as3456) as3456 abcdef
1 pqrst(dh6546) dh6546 pqrst
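If the values in B can ever contain regex metacharacters, a safer per-row sketch escapes them with re.escape before removing '(B)' from A:
import re
df['C'] = [re.sub(r'\(' + re.escape(b) + r'\)', '', a) for a, b in zip(df.A, df.B)]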
