How to count unique values in one column based on the value in another column, by group, in Pandas

I'm trying to count unique values in one column only when the value meets a certain condition based on another column. For example, the data looks like this:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI TE203 0
GHI TE203 0
I want to count the number of unique IDs per GroupID, but only where the Value column is > 0. When all values for a GroupID are 0, the count should simply be 0. For example, the result dataset would look like this:
GroupID UniqueNum
ABC 1
DEF 1
GHI 0
I've tried the following, but it simply returns the number of unique IDs regardless of their value. How do I add the condition Value > 0?
count_df = df.groupby(['GroupID'])['ID'].nunique()

positive counts only
You can use pre-filtering with loc and named aggregation with groupby.agg:
(df.loc[df['Value'].gt(0), 'ID']
.groupby(df['GroupID'])
.agg(UniqueNum='nunique')
.reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
all counts (including zero)
Pre-filtering drops every GHI row, so GHI is missing from the output above. If you want the groups with no match counted as zero, you can reindex:
(df.loc[df['Value'].gt(0), 'ID']
.groupby(df['GroupID'])
.agg(UniqueNum='nunique')
.reindex(df['GroupID'].unique(), fill_value=0)
.reset_index()
)
Or mask the non-matching IDs as NaN, which nunique ignores:
(df['ID'].where(df['Value'].gt(0))
.groupby(df['GroupID'])
.agg(UniqueNum='nunique')
.reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
2 GHI 0
Used input:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI AB123 0
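
A compact alternative sketch (assuming pandas >= 0.25 for named aggregation), run against the input above: mask the non-positive rows to NaN first, then aggregate; nunique skips NaN, so all-zero groups such as GHI come out as 0. The pos_id column name is just an illustration.
out = (df.assign(pos_id=df['ID'].where(df['Value'].gt(0)))
         .groupby('GroupID', as_index=False)
         .agg(UniqueNum=('pos_id', 'nunique')))
print(out)
#   GroupID  UniqueNum
# 0     ABC          1
# 1     DEF          1
# 2     GHI          0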

If you need 0 for the non-matched groups, use Series.where to set NaN where the condition fails, then aggregate with SeriesGroupBy.nunique:
import pandas as pd

df = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                   'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                   'Value': [0, 1, 2, 1, 1, 0]})
# keep the result in a new variable so the original df is still available below
out = (df['ID'].where(df["Value"].gt(0))
         .groupby(df['GroupID'])
         .nunique()
         .reset_index(name='nunique'))
print (out)
GroupID nunique
0 ABC 1
1 DEF 1
2 NEW 0
How it works:
print (df.assign(new=df['ID'].where(df["Value"].gt(0))))
GroupID ID Value new
0 ABC TX123 0 NaN
1 ABC TX678 1 TX678
2 ABC TX678 2 TX678
3 DEF AG123 1 AG123
4 DEF AG123 1 AG123
5 NEW AG123 0 NaN <- set NaN for non matched condition
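
The key detail is that nunique ignores NaN by default, which is why the masked rows contribute nothing to the count. A quick self-contained check:
import pandas as pd
import numpy as np

# nunique skips NaN, so masked (non-matching) rows add nothing
print(pd.Series(['TX678', 'TX678', np.nan]).nunique())  # 1
print(pd.Series([np.nan, np.nan]).nunique())            # 0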

Related

value count for an attribute from the column when there are multiple values for the attribute

I am trying to count and visualize the Netflix dataset by the country column, but when I checked the data set I found that some rows in the column contain multiple values for country, such as the one below.
The following is the code to count:
country_count=joint_data['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries.shape
so I wanted to count those rows as individual countries to get the proper count of countries.
You can split the country column on , and then .explode(). The next step is .groupby():
df = df['country'].apply(lambda x: x.split(',')).explode().to_frame()
print( df.groupby('country').agg('size') )
Prints:
country
Austria 1
Canada 1
Germany 1
India 2
United Kingdom 1
United States 1
dtype: int64
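In the real Netflix data the separator is usually a comma followed by a space, so a hedged refinement (assuming joint_data is the frame from the question): strip whitespace after exploding, otherwise 'India' and ' India' count separately.
counts = (joint_data['country'].str.split(',')
            .explode()
            .str.strip()
            .value_counts())
print(counts.head(10))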
You can compile all possible values from your 'country' column, make a set out of it and create a new column for each.
Then you can iterate over the rows and fill in a 1 wherever the column name appears in that row's 'country':
import pandas as pd

df = pd.DataFrame({"country": ["A,B,C", "A,D,E,F", "G"]})
print(df)
# create one column per distinct country, initially empty (None)
df[[*sorted(set(','.join(df["country"]).split(",")))]] = None
# mark each row's own countries with 1 (iterrows yields copies, so write via df.loc)
for idx, row in df.iterrows():
    df.loc[idx, row["country"].split(",")] = 1
print(df)
Output:
country A B C D E F G
0 A,B,C 1 1 1 None None None None
1 A,D,E,F 1 None None 1 1 1 None
2 G None None None None None None 1
If you'd rather have 0 instead of None, use df.fillna(0, inplace=True) to convert them:
# 0 instead of None
df.fillna(value=0, inplace=True)
print(df)
# print sums
for c in df.columns:
    if c == "country":
        continue
    print(f"{c} {df[c].sum()}")
Output:
country A B C D E F G
0 A,B,C 1 1 1 0 0 0 0
1 A,D,E,F 1 0 0 1 1 1 0
2 G 0 0 0 0 0 0 1
A 2
B 1
C 1
D 1
E 1
F 1
G 1
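
If you prefer a built-in for the same one-column-per-country layout, Series.str.get_dummies splits and one-hot encodes in one step; a minimal sketch on the same toy data:
import pandas as pd

df = pd.DataFrame({"country": ["A,B,C", "A,D,E,F", "G"]})
# split on ',' and one-hot encode in a single call
dummies = df["country"].str.get_dummies(sep=",")
print(df.join(dummies))
print(dummies.sum())  # per-country totals, matching the sums above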

pandas help: map and match delimited strings in a column and print into a new column

I have a dataframe data whose last column contains a bunch of strings and digits, and I have another dataframe info that explains what those strings and digits mean. I want to map user input (item) against info, then match, print and count how many of them are present in the last column of data, and prioritize the dataframe data based on the number of matches.
import pandas as pd

#data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
#info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
        'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
user input example:
run.py apple mango
desired output:
id Name MP-ID match count
3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
123 abc MP:001|MP:0085|MP:0985 MP:001 1
456 def MP:005|MP:0258 MP:005 1
789 hij MP:025|MP:5890 0
1122 klm MP:0589|MP:02546 0
Thank you for your help in advance
First get all command-line arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches with Series.str.findall and Series.str.join, and last use Series.str.count with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
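
One caveat with the regex approach: str.findall('|'.join(s)) can also match a wanted ID embedded in a longer one (for example MP:001 inside a hypothetical MP:0012). A regex-free sketch that splits each cell on '|' and keeps only exact members of the wanted set (assuming the IDs are strictly '|'-separated):
wanted = set(s)
matched = test_data['MP-ID'].str.split('|').apply(lambda ids: [i for i in ids if i in wanted])
test_data['MP-ID match'] = matched.str.join('|')
test_data['count'] = matched.str.len()
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)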

Prevent column name from disappearing after using replace on dataframe

So I have a real dataframe that somewhat follows this structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries have the " _ ", others have " ".
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
#Replace the '_' for ' '
new_df = df['col1'].str.replace('_',' ')
My problem is that my new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!
The function str.replace returns a Series, so there is no column name, only a Series name.
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign it to the same DataFrame as df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series to a DataFrame:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
As @anky_91 commented, if you need 2 new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
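
If the text part can itself contain spaces (an assumption beyond the sample data), limit the split to the first separator with n=1 so everything after the number stays together; the Num/Rest names are just for illustration:
# n=1 splits only on the first space, keeping the remainder intact
df[['Num', 'Rest']] = df['col1'].str.replace('_', ' ').str.split(' ', n=1, expand=True)
print(df)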

New column in a Pandas DataFrame with respect to duplicates in a given column

Hi, I have a dataframe with a column "id" like below:
id
abc
def
ghi
abc
abc
xyz
def
I need a new column "id1" with the number 1 appended to each value, and the number should be incremented for every duplicate. The output should look like below:
id id1
abc abc1
def def1
ghi ghi1
abc abc2
abc abc3
xyz xyz1
def def2
Can anyone suggest a solution for this?
Use GroupBy.cumcount to number the ids, add 1 and convert to string:
df['id1'] = df['id'] + df.groupby('id').cumcount().add(1).astype(str)
print (df)
id id1
0 abc abc1
1 def def1
2 ghi ghi1
3 abc abc2
4 abc abc3
5 xyz xyz1
6 def def2
Detail:
print (df.groupby('id').cumcount())
0 0
1 0
2 0
3 1
4 2
5 0
6 1
dtype: int64
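
A variant sketch, in case you only want the numeric suffix on ids that actually repeat (an assumption beyond what the question asks; unique ids stay unchanged):
# suffix only the ids that occur more than once
dup = df['id'].duplicated(keep=False)
df['id1'] = df['id'].where(~dup, df['id'] + df.groupby('id').cumcount().add(1).astype(str))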

Pandas If Statements (excel equivalent)

I'm trying to create a simple if statement in Pandas.
The excel version is as follows:
=IF(E2="ABC",C2,E2)
I'm stuck on how to assign it based on a string or partial string.
Here is what I have.
df['New Value'] = df['E'].map(lambda x: df['C'] if x == 'ABC' else df['E'])
I know I'm making a mistake here, as the outcome is the entire column's values in each cell.
Any help would be much appreciated!
Use np.where:
In [36]:
df = pd.DataFrame({'A':np.random.randn(5), 'B':0, 'C':np.arange(5),'D':1, 'E':['asdsa','ABC','DEF','ABC','DAS']})
df
Out[36]:
A B C D E
0 0.831728 0 0 1 asdsa
1 0.734007 0 1 1 ABC
2 -1.032752 0 2 1 DEF
3 1.414198 0 3 1 ABC
4 1.042621 0 4 1 DAS
In [37]:
df['New Value'] = np.where(df['E'] == 'ABC', df['C'], df['E'])
df
Out[37]:
A B C D E New Value
0 0.831728 0 0 1 asdsa asdsa
1 0.734007 0 1 1 ABC 1
2 -1.032752 0 2 1 DEF DEF
3 1.414198 0 3 1 ABC 3
4 1.042621 0 4 1 DAS DAS
The syntax for np.where is:
np.where(condition, value_if_true, value_if_false)
So when the condition is True it returns value_if_true, and when False it returns value_if_false.
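
For the "partial string" part of the question, a hedged sketch: swap the equality for str.contains, assuming "partial" means the substring ABC can appear anywhere in E:
import numpy as np

# match a substring instead of the full value; na=False guards against NaN in E
df['New Value'] = np.where(df['E'].str.contains('ABC', na=False), df['C'], df['E'])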
