Pandas Split on '. ' - python-3.x

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'foo':['abc','2. abc','3. abc']})
df
      foo
0     abc
1  2. abc
2  3. abc
I'd like to split on '. ' to produce this:
  foo  bar
0      abc
1   2  abc
2   3  abc
Thanks in advance!

You can do it using the .str.extract() function:
In [163]: df.foo.str.extract(r'(?P<foo>\d*)[\.\s]*(?P<bar>.*)', expand=True)
Out[163]:
  foo  bar
0      abc
1   2  abc
2   3  abc
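If you also want the extracted number as an actual numeric column rather than a string, a small follow-up sketch (the name out is just a placeholder for the extracted frame):
out = df.foo.str.extract(r'(?P<foo>\d*)[\.\s]*(?P<bar>.*)', expand=True)
out['foo'] = pd.to_numeric(out['foo'], errors='coerce')  # empty strings become NaN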

You can use str.split, but then you need to swap the values where the mask is True using numpy.where. Finally, fill NaN in column foo with '':
df1 = (df.foo.str.split('. ', expand=True))
df1.columns = ['foo','bar']
print (df1)
   foo   bar
0  abc  None
1    2   abc
2    3   abc
mask = df1.bar.isnull()
print (mask)
0     True
1    False
2    False
Name: bar, dtype: bool
import numpy as np

df1['foo'], df1['bar'] = (np.where(mask, df1['bar'], df1['foo']),
                          np.where(mask, df1['foo'], df1['bar']))
df1.foo.fillna('', inplace=True)
print (df1)
  foo  bar
0      abc
1   2  abc
2   3  abc

If you have a folder where you can put a temporary file, you can write the frame out as a CSV and read it back with your new separator:
df.to_csv('yourfolder/yourfile.csv',index = False)
df = pd.read_csv('yourfolder/yourfile.csv',sep = '. ')
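Note that a sep longer than one character is interpreted as a regular expression and forces the Python parsing engine, so the '.' above matches any character. A slightly more explicit sketch of the same round trip (the path is just a placeholder, and the bar column name is an assumption):
df.to_csv('yourfolder/yourfile.csv', index=False)
df = pd.read_csv('yourfolder/yourfile.csv', sep=r'\. ', engine='python',
                 header=0, names=['foo', 'bar'])
# rows without '. ' keep their text in foo and get NaN in bar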

Related

How to count unique values in one column based on value in another column by group in Pandas

I'm trying to count unique values in one column only when the value meets a certain condition based on another column. For example, the data looks like this:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI TE203 0
GHI TE203 0
I want to count the number of unique IDs by GroupID, but only when the Value column is > 0. When all values for a GroupID are 0, the result should simply be 0. For example, the result dataset would look like this:
GroupID UniqueNum
ABC 1
DEF 1
GHI 0
I've tried the following, but it simply returns the number of unique IDs regardless of their values. How do I add the condition Value > 0?
count_df = df.groupby(['GroupID'])['ID'].nunique()
positive counts only
You can pre-filter with loc and use named aggregation with groupby.agg(UniqueNum='nunique'):
(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)
Output:
  GroupID  UniqueNum
0     ABC          1
1     DEF          1
all counts (including zero)
If you want to count the groups with no match as zero, you can reindex:
(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reindex(df['GroupID'].unique(), fill_value=0)
   .reset_index()
)
Or mask:
(df['ID'].where(df['Value'].gt(0))
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)
Output:
  GroupID  UniqueNum
0     ABC          1
1     DEF          1
2     GHI          0
Used input:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI AB123 0
If you need 0 for non-matched groups, use Series.where to set NaN where the condition does not hold, then aggregate with SeriesGroupBy.nunique:
df = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                   'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                   'Value': [0, 1, 2, 1, 1, 0]})

df = (df['ID'].where(df["Value"].gt(0))
              .groupby(df['GroupID'])
              .nunique()
              .reset_index(name='nunique'))
print (df)
  GroupID  nunique
0     ABC        1
1     DEF        1
2     NEW        0
How it works:
print (df.assign(new=df['ID'].where(df["Value"].gt(0))))
  GroupID     ID  Value    new
0     ABC  TX123      0    NaN
1     ABC  TX678      1  TX678
2     ABC  TX678      2  TX678
3     DEF  AG123      1  AG123
4     DEF  AG123      1  AG123
5     NEW  AG123      0    NaN   <- set NaN for non-matched condition
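For completeness, the same condition can also be written with a plain groupby.apply. This is only a sketch; df_in re-creates the input frame here, since df was overwritten by the result in the step above:
df_in = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                      'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                      'Value': [0, 1, 2, 1, 1, 0]})
# filter each group on Value > 0 before counting unique IDs
out = (df_in.groupby('GroupID')
            .apply(lambda g: g.loc[g['Value'].gt(0), 'ID'].nunique())
            .reset_index(name='UniqueNum'))
print (out)
  GroupID  UniqueNum
0     ABC          1
1     DEF          1
2     NEW          0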

pandas help: map and match tab delimited strings in a column and print into new column

I have a dataframe data whose last column contains strings of letters and digits, and I have another dataframe info that describes what those strings mean. I want to map the user input (item) against info, match it, print and count how many matches are present in the last column of data, and prioritize the dataframe data based on the number of matches.
import pandas as pd
#data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
#info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
desired output:
   id  Name  MP-ID                    match          count
 3344  nop   MP:08597|MP:001|MP:005   MP:001|MP:005      2
  123  abc   MP:001|MP:0085|MP:0985   MP:001             1
  456  def   MP:005|MP:0258           MP:005             1
  789  hij   MP:025|MP:5890                              0
 1122  klm   MP:0589|MP:02546                            0
Thank you for your help in advance
First get all arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches with Series.str.findall and Series.str.join, and finally use Series.str.count with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
     id Name                   MP-ID    MP-ID match  count
0  3344  nop  MP:08597|MP:001|MP:005  MP:001|MP:005      2
1   123  abc  MP:001|MP:0085|MP:0985         MP:001      1
2   456  def          MP:005|MP:0258         MP:005      1
3   789  hij          MP:025|MP:5890                     0
4  1122  klm        MP:0589|MP:02546                     0
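As a side note, the regex above can in principle hit partial matches (a selected ID that is a prefix of a longer one). A sketch that compares whole IDs instead, assuming the same test_data, test_info and vals as above:
s = set(test_info.loc[test_info['Item'].isin(vals), 'MP-ID'])
parts = test_data['MP-ID'].str.split('|').explode()   # one MP ID per row, original index kept
test_data['MP-ID match'] = (parts[parts.isin(s)]
                            .groupby(level=0).agg('|'.join)
                            .reindex(test_data.index, fill_value=''))
test_data['count'] = parts.isin(s).groupby(level=0).sum()
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)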

Pandas dataframe, match column against list of sub-strings, continuous rows, keep only sub-string

In a Pandas dataframe, I want to match Col1 against a list of keywords as follows:
Keywords need to be different, located in the same column and on 3 consecutive rows (keyword1 != keyword2 != keyword3, located for example on rows x, x+1 and x+2).
I only want the keywords to be returned as results (in the example below, " def" is removed):
list_keywords = ['abc', 'ghi', 'jkl mnop','blabla']
Index Col1
1 abc def
2 ghi
3 jkl mnop
4 qrstu
5 vw
>>>
1 abc
2 ghi
3 jkl mnop
You could do something like this with df.iterrows().
for i, row in df.iterrows():
    if row['col1'] in list_keywords:
        continue
    # keep only the tokens that are themselves keywords, writing back via df.at
    val = row['col1'].split()
    df.at[i, 'col1'] = ' '.join(str(w) for w in val if w in list_keywords)

df
       col1
0       abc
1       ghi
2  jkl mnop
3
4
Based on @HTRS's answer, here is what seems to be a partial answer to my question.
This piece of code filters the Brand column against a list of keywords and filters out the parts of the strings that differ from the keywords.
import pandas as pd

list_filtered = []
list_keywords = ['abc', 'ghi', 'jkl mnop', 'blabla']

for _, row in df.iterrows():
    if row['Brand'] in list_keywords:
        row['Brand'] = row['Brand']
        list_filtered.append(row['Brand'])
    else:
        val = row['Brand'].split()
        row['Brand'] = ' '.join(str(i) for i in val if i in list_keywords)
        list_filtered.append(row['Brand'])

df['Filtered'] = list_filtered
print(df)
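The same filtering can also be written without the explicit loop by mapping a small helper over the column with Series.apply; a minimal sketch, assuming Brand holds plain strings and list_keywords is defined as above:
def keep_keywords(text):
    # keep the value as-is if it is itself a keyword, otherwise keep only keyword tokens
    if text in list_keywords:
        return text
    return ' '.join(w for w in text.split() if w in list_keywords)

df['Filtered'] = df['Brand'].apply(keep_keywords)
print(df)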

How to find if a list is sub string of a string in a Data Frame column?

I am facing a challenge to find a substring from a list inside a DataFrame column
list =['ab', 'bc', 'ca']
DF1
Index|A
0 |ajbijio_ab_jadds
1 |bhjbj_ab_jiui
Expected OUTPUT:
DF
ab
ab
I have written something, but it is giving an error:
unhashable type: 'list'
DF1['A'].str.lower().str.contains(list)
Using str.extract
Ex:
import pandas as pd
lst =['ab','bc','ca']
df = pd.DataFrame({"A": ["ajbijio_ab_jadds", "bhjbj_ab_jiui", "Hello World"]})
df["Found"] = df["A"].str.extract("(" + "|".join(lst) + ")")
print(df)
Output:
                  A Found
0  ajbijio_ab_jadds    ab
1     bhjbj_ab_jiui    ab
2       Hello World   NaN
Use Series.str.extract if you need the first match only, joining the list with | for a regex OR:
L =['ab','bc','ca']
df['new'] = df['A'].str.extract('('+ '|'.join(L) + ')')
print (df)
                  A new
0  ajbijio_ab_jadds  ab
1     bhjbj_ab_jiui  ab
If you need all matches, use Series.str.findall with Series.str.join:
df['new'] = df['A'].str.findall('|'.join(L)).str.join(',')
I am using findall:
df["Found"] = df["A"].str.findall("|".join(lst)).str[0]
df
Out[82]:
                  A Found
0  ajbijio_ab_jadds    ab
1     bhjbj_ab_jiui    ab
2       Hello World   NaN

Prevent column name from disappearing after using replace on dataframe

So I have a real dataframe that roughly follows this structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries have the '_', others have ' '.
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
# Replace the '_' with ' '
new_df = df['col1'].str.replace('_', ' ')
My problem is that new_df has now lost its column name:
0    1 ABC
1    2 DEF
Any way to prevent this from happening?
Thanks!
The function str.replace returns a Series, so there is no column name, only a Series name.
s = df['col1'].str.replace('_',' ')
print (s)
0    1 ABC
1    2 DEF
2    3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign it to the same DataFrame as df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
    col1   Name
0  1_ABC  1 ABC
1  2_DEF  2 DEF
2  3 GHI  3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
    col1
0  1 ABC
1  2 DEF
2  3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series to a DataFrame:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
    col1
0  1 ABC
1  2 DEF
2  3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
     New
0  1 ABC
1  2 DEF
2  3 GHI
As @anky_91 commented, if you need 2 new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
   A    B
0  1  ABC
1  2  DEF
2  3  GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
    col1  A    B
0  1_ABC  1  ABC
1  2_DEF  2  DEF
2  3 GHI  3  GHI
