New column in a pandas DataFrame that numbers the duplicates in a given column - python-3.x

Hi, I have a dataframe with a column "id" like below:
id
abc
def
ghi
abc
abc
xyz
def
I need a new column "id1" with the number 1 appended to each id, and the number should be incremented for every duplicate. The output should look like this:
id id1
abc abc1
def def1
ghi ghi1
abc abc2
abc abc3
xyz xyz1
def def2
Can anyone suggest a solution for this?

Use groupby.cumcount to number the occurrences of each id, add 1, and convert to strings:
df['id1'] = df['id'] + df.groupby('id').cumcount().add(1).astype(str)
print (df)
id id1
0 abc abc1
1 def def1
2 ghi ghi1
3 abc abc2
4 abc abc3
5 xyz xyz1
6 def def2
Detail:
print (df.groupby('id').cumcount())
0 0
1 0
2 0
3 1
4 2
5 0
6 1
dtype: int64
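For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample in the question; nothing else is assumed.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'id': ['abc', 'def', 'ghi', 'abc', 'abc', 'xyz', 'def']})

# cumcount numbers the rows inside each 'id' group starting at 0,
# so add 1 to start at 1, then cast to str for concatenation
occurrence = df.groupby('id').cumcount().add(1).astype(str)
df['id1'] = df['id'] + occurrence

print(df)
```

Because cumcount follows the original row order, the first "abc" gets 1, the second gets 2, and so on, without any sorting.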

Related

How to count unique values in one column based on the value in another column, by group, in pandas

I'm trying to count unique values in one column only when the value meets a certain condition based on another column. For example, the data looks like this:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI TE203 0
GHI TE203 0
I want to count the number of unique IDs per GroupID, but only for rows where Value > 0. When all values for a GroupID are 0, the count should simply be 0. For example, the result dataset would look like this:
GroupID UniqueNum
ABC 1
DEF 1
GHI 0
I've tried the following, but it simply returns the number of unique IDs regardless of Value. How do I add the condition Value > 0?
count_df = df.groupby(['GroupID'])['ID'].nunique()
positive counts only
You can use pre-filtering with loc and named aggregation with groupby.agg('nunique'):
(df.loc[df['Value'].gt(0), 'ID']
.groupby(df['GroupID'])
.agg(UniqueNum='nunique')
.reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
all counts (including zero)
If you want groups with no matching rows to be counted as zero, you can reindex:
(df.loc[df['Value'].gt(0), 'ID']
.groupby(df['GroupID'])
.agg(UniqueNum='nunique')
.reindex(df['GroupID'].unique(), fill_value=0)
.reset_index()
)
Or mask:
(df['ID'].where(df['Value'].gt(0))
.groupby(df['GroupID'])
.agg(UniqueNum='nunique')
.reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
2 GHI 0
Used input:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI AB123 0
If you need 0 for groups with no matching rows, use Series.where to set the IDs that fail the condition to NaN, then aggregate with SeriesGroupBy.nunique, which ignores NaN:
import pandas as pd

df = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                   'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                   'Value': [0, 1, 2, 1, 1, 0]})
# Store the result under a new name so the original df stays available below
out = (df['ID'].where(df["Value"].gt(0)).groupby(df['GroupID'])
       .nunique()
       .reset_index(name='nunique'))
print (out)
GroupID nunique
0 ABC 1
1 DEF 1
2 NEW 0
How it works:
print (df.assign(new=df['ID'].where(df["Value"].gt(0))))
GroupID ID Value new
0 ABC TX123 0 NaN
1 ABC TX678 1 TX678
2 ABC TX678 2 TX678
3 DEF AG123 1 AG123
4 DEF AG123 1 AG123
5 NEW AG123 0 NaN <- set NaN for non matched condition
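As a quick sanity check, the reindex variant and the where variant can be compared side by side. This is only a sketch on the "Used input" sample from the first answer; the names positive, via_reindex and via_where are made up for illustration.

```python
import pandas as pd

# "Used input" sample from the first answer
df = pd.DataFrame({
    'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'GHI'],
    'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123', 'AB123'],
    'Value': [0, 1, 2, 1, 1, 1, 0],
})

positive = df['Value'].gt(0)

# Variant 1: drop non-matching rows, then restore the missing groups with 0
via_reindex = (df.loc[positive, 'ID']
                 .groupby(df['GroupID'])
                 .nunique()
                 .reindex(df['GroupID'].unique(), fill_value=0))

# Variant 2: keep every row but blank out non-matching IDs;
# nunique skips NaN, so all-NaN groups count as 0
via_where = df['ID'].where(positive).groupby(df['GroupID']).nunique()

print(via_reindex)
print(via_where)
```

Both variants should give the same counts; the where variant needs no reindex because no group is ever dropped.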

How can I replace the column names in a pandas DataFrame

I have an Excel file like this:
Old_name new_name
xyz abc
opq klm
And my dataframe looks like this:
Id timestamp xyz opq
1 04-10-2021 3 4
2 05-10-2021 4 9
As you can see, my dataframe uses the old names as column names, and I would like to map and replace them with the new names from that file. How can I do that?
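Try rename, passing the mapping table as a Series indexed by the old names:
# col_names is assumed to hold the mapping table (columns Old_name, new_name),
# e.g. loaded with pd.read_excel or pd.read_csv
df.rename(columns=col_names.set_index('Old_name')['new_name'], inplace=True)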
# verify
print(df)
Output:
Id timestamp abc klm
0 1 04-10-2021 3 4
1 2 05-10-2021 4 9
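A self-contained sketch of the same idea, with the mapping table built inline instead of read from a file. The name mapping is illustrative, and the plain dict is just an explicit alternative to passing the Series.

```python
import pandas as pd

# Mapping table as it might look after loading the file (built inline here)
col_names = pd.DataFrame({'Old_name': ['xyz', 'opq'],
                          'new_name': ['abc', 'klm']})

df = pd.DataFrame({'Id': [1, 2],
                   'timestamp': ['04-10-2021', '05-10-2021'],
                   'xyz': [3, 4],
                   'opq': [4, 9]})

# Build an explicit old -> new dict; columns missing from the dict are left alone
mapping = dict(zip(col_names['Old_name'], col_names['new_name']))
df = df.rename(columns=mapping)

print(df.columns.tolist())
```

Returning a new DataFrame instead of using inplace=True is a style choice here; either form renames the same columns.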

How to create a Pandas Dataframe from the comma separated values in txt file

I have a txt file which contains data as in below format
"column1,column2,column3,column4,column5,column6,column7,column8"
"abc,abc,abc,10,datetime,abc,abc,abc"
"xyz,xyz,""xyz1,xyz2"",2,datetime2,xyz,xyz,xyz"
"xyz,xyz,""xyz1 , xyz2"",2,datetime2,xyz,xyz,xyz"
I want to convert it into a pandas DataFrame with 8 columns, using row 1 as the header.
This is different from the usual DataFrame question.
I tried the following code:
df = pd.read_csv('tst.txt')
But output was
column1,column2,column3,column4,column5,column6,column7,column8
0 abc,abc,abc,10,datetime,abc,abc,abc
1 xyz,xyz,"xyz1,xyz2",2,datetime2,xyz,xyz,xyz
2 xyz,xyz,"xyz1 , xyz2",2,datetime2,xyz,xyz,xyz
I also tried other things, such as:
df1 = pd.DataFrame([line.replace(' , ','$$$').replace('"','').replace('\n','').split(',') for line in open('tst.txt')])
but the output was still not what I expected:
0 1 2 3 4 5 6 7 8
0 column1 column2 column3 column4 column5 column6 column7 column8 None
1 abc abc abc 10 datetime abc abc abc None
2 xyz xyz xyz1 xyz2 2 datetime2 xyz xyz xyz
3 xyz xyz xyz1$$$xyz2 2 datetime2 xyz xyz xyz None
As you can see, there should be only 8 columns, not 9, and datetime should be in the 5th column.
The actual output should look like this:
column1 column2 column3 column4 column5 column6 column7 column8
0 abc abc abc 10 datetime abc abc abc
1 xyz xyz xyz1,xyz2 2 datetime2 xyz xyz xyz
2 xyz xyz xyz1 , xyz2 2 datetime2 xyz xyz xyz
Try passing the quote character explicitly with quotechar:
df=pd.read_csv('tst.txt', quotechar='"', sep=',')
column1 column2 column3 column4 column5 column6 column7 column8
0 abc abc abc 10 datetime abc abc abc
1 xyz xyz xyz1,xyz2 2 datetime2 xyz xyz xyz
2 xyz xyz xyz1 , xyz2 2 datetime2 xyz xyz xyz
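One caveat: quotechar='"' and sep=',' are already the read_csv defaults, so if every line of the file really is wrapped in an outer pair of quotes with the inner quotes doubled, as the sample in the question suggests, the call above may still return a single column. A minimal two-pass sketch under that assumption follows. The StringIO object stands in for open('tst.txt', newline=''), and the names raw and inner_lines are illustrative.

```python
import csv
from io import StringIO

import pandas as pd

# Stand-in for the file contents shown in the question
raw = '''"column1,column2,column3,column4,column5,column6,column7,column8"
"abc,abc,abc,10,datetime,abc,abc,abc"
"xyz,xyz,""xyz1,xyz2"",2,datetime2,xyz,xyz,xyz"
"xyz,xyz,""xyz1 , xyz2"",2,datetime2,xyz,xyz,xyz"
'''

# Pass 1: each physical line is one quoted CSV field; the csv module
# removes the outer quotes and collapses the doubled inner quotes
inner_lines = [row[0] for row in csv.reader(StringIO(raw)) if row]

# Pass 2: the unwrapped lines are ordinary CSV, so read_csv can split them
df = pd.read_csv(StringIO("\n".join(inner_lines)))

print(df)
```

If the file is not wrapped this way, the first pass is unnecessary and a plain read_csv call is enough.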

How to get the top values within each group?

I am new to Pandas and I have a dataset that looks something like this.
s_name Time p_name qty
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0
I am trying to group by s_name and sum the qty of each unique p_name per month, but display only the two p_name values with the largest totals. Below is an example of how I want the final output to look.
s_name Time p_name qty
A 01 DEF 4
A 01 ABC 2
B 02 ABC 3
B 02 DEF 2
B 03 ABC 2
B 03 FGH 0
Do you have any ideas? I have been stuck on this for quite a while, so any help is much appreciated.
Create a month column with the dt accessor, then group by s_name and month. For each group, group again by p_name, sum qty, sort the sums in descending order with sort_values, and keep the first two rows with head:
df.Time = pd.to_datetime(df.Time, format='%d/%m/%Y')
df['month'] = df.Time.dt.month
df_f = df.groupby(['s_name', 'month']).apply(
    lambda g:
        g.groupby('p_name').qty.sum()
         .sort_values(ascending=False).head(2)
).reset_index()
df_f
# s_name month p_name qty
# 0 A 1 DEF 4
# 1 A 1 ABC 2
# 2 B 2 ABC 3
# 3 B 2 DEF 2
# 4 B 3 ABC 3
# 5 B 3 FGH 0
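The same result can also be sketched without apply: sum once over all three keys, sort, and take the first two rows per (s_name, month). This is an alternative formulation rather than the method above, and the names totals and top2 are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    's_name': ['A'] * 5 + ['B'] * 5,
    'Time': ['12/01/2019'] * 5 + ['13/02/2019'] * 3 + ['13/03/2019'] * 2,
    'p_name': ['ABC', 'ABC', 'DEF', 'DEF', 'FGH',
               'ABC', 'DEF', 'DEF', 'ABC', 'FGH'],
    'qty': [1, 1, 2, 2, 0, 3, 1, 1, 3, 0],
})
df['month'] = pd.to_datetime(df['Time'], format='%d/%m/%Y').dt.month

# Total qty per (s_name, month, p_name)
totals = df.groupby(['s_name', 'month', 'p_name'], as_index=False)['qty'].sum()

# Largest totals first inside each (s_name, month), then keep two rows per group
top2 = (totals.sort_values(['s_name', 'month', 'qty'],
                           ascending=[True, True, False])
              .groupby(['s_name', 'month'])
              .head(2)
              .reset_index(drop=True))

print(top2)
```

Ties in qty are broken by whatever order the sort leaves them in, so add p_name to the sort keys if a deterministic tie-break matters.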
I am new to Pandas myself. I am going to attempt to answer your question.
See this code.
from io import StringIO
import pandas as pd
columns = "s_name Time p_name qty"
# Create dataframe from text.
df = pd.read_csv(
StringIO(
f"""{columns}
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0"""
),
sep=" ",
)
S_NAME, TIME, P_NAME, QTY = columns.split()
MONTH = "month"
# Convert the TIME col to datetime types.
df.Time = pd.to_datetime(df.Time, dayfirst=True)
# Create a month column with zfilled strings.
df[MONTH] = df.Time.apply(lambda x: str(x.month).zfill(2))
# Group
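# Sum qty per (s_name, p_name, month). Selecting [[QTY]] keeps a DataFrame
# and leaves the datetime Time column out of the sum.
group = df.groupby(by=[S_NAME, P_NAME, MONTH])
gdf = (
    group[[QTY]].sum()
    .sort_values(by=[S_NAME, MONTH, QTY], ascending=False)
    .reset_index()
)
# Keep the two largest totals per (s_name, month).
gdf.groupby([S_NAME, MONTH]).head(2).sort_values(by=[S_NAME, MONTH]).reset_index(drop=True)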
Is this the result you expected?

Prevent column name from disappearing after using replace on dataframe

So I have a real dataframe that somewhat follows the next structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries have the " _ ", others have " ".
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
#Replace the '_' for ' '
new_df['Name'] = df['Name'].str.replace('_',' ')
My problem is that new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!
str.replace returns a Series, and a Series has no column name, only a name attribute:
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign the result to the same DataFrame, e.g. df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite values of original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
As @anky_91 commented, if you need two new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
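If the only goal is to separate the leading number from the rest, the replace step can be skipped by matching either separator in one pattern. A small sketch with str.extract follows; the pattern and the column names A and B are assumptions based on the sample data, where every entry is digits, then "_" or a space, then the remainder.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['1_ABC', '2_DEF', '3 GHI']})

# Group 1: leading digits; separator: '_' or ' '; group 2: everything after it
parts = df['col1'].str.extract(r'^(\d+)[_ ](.*)$')
df[['A', 'B']] = parts

print(df)
```

Rows that do not match the pattern come back as NaN in both new columns, which makes unexpected formats easy to spot.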
