Create new variable for grouped data using python - python-3.x

I have a data frame like this:
import pandas

d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}
df = pandas.DataFrame(data=d)
What I want to do is create a new id variable. Whenever a name (say, john) appears for the first time, this id will be equal to 1; for every other occurrence of the same name (john) the id will be 0. This will be done for all the other names in the data. How do I go about doing that?
Final output should be like this:
name Prod Qty id
john 101 5 1
john 102 4 0
john 101 1 0
Tim 501 3 1
Tim 505 5 0
Tim 301 4 0
Bob 302 1 1
Bob 302 3 0
NOTE: If someone knows SAS: there you can sort your data by the name and then use first.name:
"if first.name = 1 then id = 1"
For the first occurrence of a name, first.name = 1; for any repeat occurrence of the same name, first.name = 0. I am trying to replicate the same in Python.
So far I have tried pandas groupby with first(), and also numpy.where(), but couldn't make either work. Any fresh perspective will be appreciated.

You can use cumcount:
s = df.groupby(['Prod', 'name']).cumcount().add(1)
df['counter'] = s.mask(s.gt(1), 0)
df
Out[1417]:
Prod Qty name counter
0 101 5 john 1
1 102 4 john 1
2 101 1 john 0
3 501 3 Tim 1
4 505 5 Tim 1
5 301 4 Tim 1
6 302 1 Bob 1
7 302 3 Bob 0
Update:
s = df.groupby(['name']).cumcount().add(1).le(1).astype(int)
s
Out[1421]:
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 0
dtype: int32
Faster:
df.loc[df['name'].drop_duplicates().index, 'counter'] = 1
df.fillna(0)
Out[1430]:
Prod Qty name counter
0 101 5 john 1.0
1 102 4 john 0.0
2 101 1 john 0.0
3 501 3 Tim 1.0
4 505 5 Tim 0.0
5 301 4 Tim 0.0
6 302 1 Bob 1.0
7 302 3 Bob 0.0
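As an aside (not from the original answer), Series.duplicated expresses the same first-occurrence flag in one line; a minimal sketch:
# duplicated() marks repeat occurrences of a name as True, so negating
# it flags the first occurrence; cast the boolean to int for 1/0
df['counter'] = (~df['name'].duplicated()).astype(int)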

We can just work directly with your dictionary d and loop through to create a new entry.
import pandas

d = {'name': ['john', 'john', 'john', 'Tim', 'Tim', 'Tim', 'Bob', 'Bob'],
     'Prod': ['101', '102', '101', '501', '505', '301', '302', '302'],
     'Qty': ['5', '4', '1', '3', '5', '4', '1', '3']}

names = set()  # store names that have already appeared
id_flags = []
for name in d['name']:
    if name in names:  # if it has appeared, add 0
        id_flags.append(0)
    else:              # add 1 and note that it has appeared
        id_flags.append(1)
        names.add(name)
d['id'] = id_flags  # add the entry to your dictionary
df = pandas.DataFrame(data=d)
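For reference, printing the frame should then give something like this (assuming Python 3.7+, where dict insertion order is preserved):
print(df)
#    name Prod Qty  id
# 0  john  101   5   1
# 1  john  102   4   0
# 2  john  101   1   0
# 3   Tim  501   3   1
# 4   Tim  505   5   0
# 5   Tim  301   4   0
# 6   Bob  302   1   1
# 7   Bob  302   3   0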

Related

De-duplication with merge of data

I have a dataset with duplicates, triplicates and more, and I want to keep only one record per unique id, merging the data. For example:
id name address age city
1  Alex 123,blv
1  Alex         13
3  Alex         24  Florida
1  Alex             Miami
Merging data using the id field:
Output:
id name address age city
1  Alex 123,blv 13  Miami
3  Alex         24  Florida
I've changed the code a bit from this answer.
Code to create the initial dataframe:
import pandas as pd
import numpy as np
d = {'id': [1, 1, 3, 1],
     'name': ["Alex", "Alex", "Alex", "Alex"],
     'address': ["123,blv", None, None, None],
     'age': [None, 13, 24, None],
     'city': [None, None, "Florida", "Miami"]
     }
df = pd.DataFrame(data=d, index=d["id"])
print(df)
Output:
id name address age city
1 1 Alex 123,blv NaN None
1 1 Alex None 13.0 None
3 3 Alex None 24.0 Florida
1 1 Alex None NaN Miami
Aggregation code:
def get_notnull(x):
    if x.notnull().any():
        return x[x.notnull()]
    else:
        return np.nan

aggregation_functions = {'name': 'first',
                         'address': get_notnull,
                         'age': get_notnull,
                         'city': get_notnull
                         }
df = df.groupby(df['id']).aggregate(aggregation_functions)
print(df)
Output:
name address age city
id
1 Alex 123,blv 13.0 Miami
3 Alex NaN 24.0 Florida
(
    df
    .reset_index(drop=True)  # set a unique index for each record
    .drop('id', axis=1)      # exclude the 'id' column from processing
    .groupby(df['id'])       # group by 'id'
    .agg(
        # return the first non-NA/None value for each column
        lambda s: s.get(s.first_valid_index())
    )
    .reset_index()           # get back the 'id' value for each record
)
P.S. As an option:
df.replace([None, ''], pd.NA).groupby('id').first().reset_index()
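As a hedged, self-contained check of that one-liner (rebuilding the question's frame without the custom index):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 3, 1],
                   'name': ['Alex', 'Alex', 'Alex', 'Alex'],
                   'address': ['123,blv', None, None, None],
                   'age': [None, 13, 24, None],
                   'city': [None, None, 'Florida', 'Miami']})

# first() keeps each column's first non-null value per group, so id 1
# ends up with '123,blv', 13.0 and 'Miami'; id 3 with 24.0 and 'Florida'
print(df.replace([None, ''], pd.NA).groupby('id').first().reset_index())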

Making few columns into one column if certain conditions are fulfilled

I have an exercise in which I need to turn several rows into one row if they share the same data in three columns.
substances = pd.DataFrame({'id': ['id_1', 'id_1', 'id_1', 'id_2', 'id_3'],
                           'part': ['1', '1', '2', '2', '3'],
                           'sub': ['paracetamolum', 'paracetamolum', 'ibuprofenum', 'dienogestum', 'etynyloestradiol'],
                           'strength': ['150', '50', '50', '20', '30'],
                           'unit': ['mg', 'mg', 'mg', 'mg', 'mcg'],
                           'other irrelevant columns for this task': ['sth1', 'sth2', 'sth3', 'sth4', 'sth5']
                           })
Now, provided that id, part and substance are the same, I am supposed to merge them into one row, so the end result is:
id   part strength   substance        unit
id_1 1    '150 # 50' paracetamolum    mg
id_1 2    50         ibuprofenum      mg
id_2 2    20         dienogestum      mg
id_3 3    30         etynyloestradiol mcg
The issue is that I have a problem joining these rows into one row to show the possible strengths joined like '150 # 50'. I have tried something like this, but it is not going great:
substances = substances.groupby('id', 'part', 'sub', 'strength').id.apply(lambda x: str(substances['strength']) + ' # ' + str(next(substances['strength'])))
df = substances.groupby(['id', 'part', 'sub', 'unit']).agg({'strength': ' # '.join}).reset_index()
df = df[['id', 'part', 'strength', 'sub', 'unit']]
print(df)
output:
id part strength sub unit
0 id_1 1 150 # 50 paracetamolum mg
1 id_1 2 50 ibuprofenum mg
2 id_2 2 20 dienogestum mg
3 id_3 3 30 etynyloestradiol mcg
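A variant of the same idea, sketched with pandas named aggregation (available since pandas 0.25), which also lets you name the result column in one step:
out = (substances
       .groupby(['id', 'part', 'sub', 'unit'], as_index=False)
       .agg(strength=('strength', ' # '.join)))
print(out[['id', 'part', 'strength', 'sub', 'unit']])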

Replace string column endings matching specific substrings, under a condition on another column, with Pandas

Given a dataset as follows:
id company name value
0 1 Finl Corp. 7
1 2 Fund Tr Corp 6
2 3 Inc Invt Fd 5
3 4 Govt Fd Inc. 3
4 5 Trinity Inc 5
Or:
[{'id': 1, 'company name': 'Finl Corp.', 'value': 7},
{'id': 2, 'company name': 'Fund Tr Corp', 'value': 6},
{'id': 3, 'company name': 'Inc Invt Fd', 'value': 5},
{'id': 4, 'company name': 'Govt Fd Inc.', 'value': 3},
{'id': 5, 'company name': 'Trinity Inc', 'value': 5}]
I need to strip the company name column's content when it ends with any of ['Corp.', 'Corp', 'Inc.', 'Inc'], while at the same time value is >= 5.
The expected result will be:
id company name value
0 1 Finl 7
1 2 Fund Tr 6
2 3 Inc Invt Fd 5
3 4 Govt Fd Inc. 3
4 5 Trinity 5
How could I achieve that in Pandas and regex?
Trial code, which raises TypeError: replace() missing 1 required positional argument: 'repl':
mask = (df1['value'] >= 5)
df1.loc[mask, 'company_name_concise']= df1.loc[mask, 'company name'].str.replace(r'\bCorp.|Corp|Inc.|Inc$', regex=True)
You can turn the values into a regex by adding \s* for optional spaces and $ for end of string:
mask = (df1['value'] >= 5)
L = ['Corp.', 'Corp', 'Inc.', 'Inc']
pat = '|'.join(rf'\s*{x}$' for x in L)
df1.loc[mask, 'company name'] = df1.loc[mask, 'company name'].str.replace(pat, '', regex=True)
print(df1)
id company name value
0 1 Finl 7
1 2 Fund Tr 6
2 3 Inc Invt Fd 5
3 4 Govt Fd Inc. 3
4 5 Trinity 5
str.replace takes two arguments, the pattern and the replacement:
mask = (df1['value'] >= 5)
df1.loc[mask, 'company_name_concise'] = df1.loc[mask, 'company name'].str.replace(r'\b(?:Corp\.?|Inc\.?)$', '', regex=True)
Note that the regex pattern you want here is:
\b        word boundary
(?:
  Corp\.?   match Corp or Corp.
  |         OR
  Inc\.?    match Inc or Inc.
)
$         at the end of the company name
Or, for brevity, you could directly modify the whole column with a masked assignment (note >= 5, per the question, and the anchored pattern from above):
df.loc[df['value'] >= 5, 'company name'] = df['company name'].str.replace(r'\s*\b(?:Corp|Inc)\.?$', '', regex=True)
>>> df
id company name value
0 1 Finl 7
1 2 Fund Tr 6
2 3 Inc Invt Fd 5
3 4 Govt Fd Inc. 3
4 5 Trinity 5
Or a solution with np.where:
>>> df['company name'] = np.where(df['value'] >= 5, df['company name'].str.replace(r'\s*\b(?:Corp|Inc)\.?$', '', regex=True), df['company name'])
>>> df
id company name value
0 1 Finl 7
1 2 Fund Tr 6
2 3 Inc Invt Fd 5
3 4 Govt Fd Inc. 3
4 5 Trinity 5
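If you prefer to build the mask without a regex, Series.str.endswith accepts a tuple of suffixes; a sketch combining it with the value condition (using df1 as in the first answer, and the anchored pattern from above for the actual strip):
suffixes = ('Corp.', 'Corp', 'Inc.', 'Inc')
mask = df1['company name'].str.endswith(suffixes) & (df1['value'] >= 5)
df1.loc[mask, 'company name'] = (df1.loc[mask, 'company name']
                                 .str.replace(r'\s*\b(?:Corp|Inc)\.?$', '', regex=True))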

How to count how many times a given value repeats in a pandas Series when the Series contains lists of string values?

Example 1:
Suppose we have a record data series:
record
['ABC' ,'GHI']
['ABC' , 'XYZ']
['XYZ','PQR']
I want to calculate how many times each value repeats in the record series, like:
value Count
'ABC' 2
'XYZ' 2
'GHI' 1
'PQR' 1
In the record series, 'ABC' and 'XYZ' each appear 2 times; 'GHI' and 'PQR' each appear 1 time.
Example 2:
Below is the new dataframe:
teams
0 ['Australia', 'Sri Lanka']
1 ['Australia', 'Sri Lanka']
2 ['Australia', 'Sri Lanka']
3 ['Ireland', 'Hong Kong']
4 ['Zimbabwe', 'India']
... ...
1412 ['Pakistan', 'Sri Lanka']
1413 ['Bangladesh', 'India']
1414 ['United Arab Emirates', 'Netherlands']
1415 ['Sri Lanka', 'Australia']
1416 ['Sri Lanka', 'Australia']
Now if I apply
print(new_df.explode('teams').value_counts())
it gives me
teams
['England', 'Pakistan'] 29
['Australia', 'Pakistan'] 26
['England', 'Australia'] 25
['Australia', 'India'] 24
['England', 'West Indies'] 23
... ..
['Namibia', 'Sierra Leone'] 1
['Namibia', 'Scotland'] 1
['Namibia', 'Oman'] 1
['Mozambique', 'Rwanda'] 1
['Afghanistan', 'Bangladesh'] 1
Length: 399, dtype: int64
But I want
team occurrence of team
India ?
England ?
Australia ?
... ...
I want the occurrence count of each team from the dataframe. How can I perform this task?
Try explode and value_counts
On Series:
import pandas as pd

s = pd.Series({0: ['ABC', 'GHI'],
               1: ['ABC', 'XYZ'],
               2: ['XYZ', 'PQR']})
r = s.explode().value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
dtype: int64
On DataFrame:
import pandas as pd

df = pd.DataFrame({'record': {0: ['ABC', 'GHI'],
                              1: ['ABC', 'XYZ'],
                              2: ['XYZ', 'PQR']}})
r = df.explode('record')['record'].value_counts()
print(r)
XYZ 2
ABC 2
GHI 1
PQR 1
Name: record, dtype: int64
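If explode seems to leave the lists intact, as in the question's Example 2 output, the teams column likely holds strings that merely look like lists; a sketch (assuming that string format) parsing them with ast.literal_eval first:
import ast
import pandas as pd

# hypothetical frame whose 'teams' column holds list-like strings
new_df = pd.DataFrame({'teams': ["['Australia', 'Sri Lanka']",
                                 "['Ireland', 'Hong Kong']",
                                 "['Sri Lanka', 'Australia']"]})

# parse each string into a real list, then explode and count
new_df['teams'] = new_df['teams'].apply(ast.literal_eval)
print(new_df.explode('teams')['teams'].value_counts())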

how to find minimum and maximum values inside collections.defaultdict

Good day! I'm trying to find the minimum and maximum values for a given dataset:
foo,1,1
foo,2,5
foo,3,0
bar,1,5
bar,2,0
bar,3,0
foo,1,1
foo,2,2
foo,3,4
bar,1,4
bar,2,0
bar,3,1
foo,1,4
foo,2,2
foo,3,3
bar,1,1
bar,2,3
bar,3,0
I try to sort my data using the 1st and 2nd columns as the ID and the 3rd column as the value:
from collections import defaultdict

data = defaultdict(list)
with open("file1.txt", 'r') as infile:
    for line in infile:
        line = line.strip().split(',')
        meta = line[0]
        id_ = line[1]
        value = line[2]
        try:
            value = int(line[2])
            data[meta + id_].append(value)
        except ValueError:
            print('nope', sep='')
the output of my function is:
defaultdict(list,
{'foo1': ['1', '1', '4'],
'foo2': ['5', '2', '2'],
'foo3': ['0', '4', '3'],
'bar1': ['5', '4', '1'],
'bar2': ['0', '0', '3'],
'bar3': ['0', '1', '0']})
Please advise how I can get the minimum and maximum values for each ID. I need output something like this:
defaultdict(list,
{'foo1': ['1', '4'],
'foo2': ['2', '5'],
'foo3': ['0', '4'],
'bar1': ['1', '5'],
'bar2': ['0', '3'],
'bar3': ['0', '1']})
Update:
With @AndiFB's help I added sorting to my lists:
def sorting_func(string):
    return int(string)

from collections import defaultdict

data = defaultdict(list)
with open("file1.txt", 'r') as infile:
    for line in infile:
        line = line.strip().split(',')
        meta = line[0]
        id_ = line[1]
        value = line[2]
        try:
            if value != "-":
                value = int(line[2])
                data[meta + id_].append(value)
                data[meta + id_].sort(key=sorting_func)
                print("max:", *data[meta + id_][-1:], 'min:', *data[meta + id_][:1])
        except ValueError:
            print('nope', sep='')
data
Output:
max: 1 min: 1
max: 5 min: 5
max: 0 min: 0
max: 5 min: 5
max: 0 min: 0
max: 0 min: 0
max: 1 min: 1
max: 5 min: 2
max: 4 min: 0
max: 5 min: 4
max: 0 min: 0
max: 1 min: 0
max: 4 min: 1
max: 5 min: 2
max: 4 min: 0
max: 5 min: 1
max: 3 min: 0
max: 1 min: 0
defaultdict(list,
{'foo1': [1, 1, 4],
'foo2': [2, 2, 5],
'foo3': [0, 3, 4],
'bar1': [1, 4, 5],
'bar2': [0, 0, 3],
'bar3': [0, 0, 1]})
Please advise how to keep only the min and max (the first and the last) values in each list, to get something like this:
defaultdict(list,
{'foo1': ['1', '4'],
'foo2': ['2', '5'],
'foo3': ['0', '4'],
'bar1': ['1', '5'],
'bar2': ['0', '3'],
'bar3': ['0', '1']})
def sorting_func(string):
    return int(string)

d = defaultdict(list)
d['python'].append('10')
d['python'].append('2')
d['python'].append('5')
print("d['python'].__contains__('10'): {}".format(d['python'].__contains__('10')))
print(str(d['python']))
d['python'].sort(key=sorting_func)
print('d["python"]: ' + str(d['python']))
print('d["python"][0]: ' + d['python'][0])
print('d["python"][2]: ' + d['python'][2])
print(str(len(d['python'])))
Resulting in the following output
d['python'].__contains__('10'): True
['10', '2', '5']
d["python"]: ['2', '5', '10']
d["python"][0]: 2
d["python"][2]: 10
3
You can sort the list, leaving the minimum value in the first position and the maximum value in the last one.
Be aware that if a string contained in the dict cannot be cast to int, the sort will raise an exception; the sorting function expects a number to compare. For example, another sorting function could be:
def sorting_func(string):
    return len(string)
This one sorts by the length of the string.
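To directly answer the follow-up of keeping only the first and last values, a minimal sketch (my addition, assuming data already maps each ID to a list of ints as built above):
# collapse each list to [min, max] once the file has been parsed
for key, values in data.items():
    data[key] = [min(values), max(values)]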
Since you are working with a dataset, an easy way to achieve this is with pandas: do a groupby on id and aggregate on value to get the min and max for each id.
#your question
s ="""foo,1,1
foo,2,5
foo,3,0
bar,1,5
bar,2,0
bar,3,0
foo,1,1
foo,2,2
foo,3,4
bar,1,4
bar,2,0
bar,3,1
foo,1,4
foo,2,2
foo,3,3
bar,1,1
bar,2,3
bar,3,0"""
#splitting on new line
t = s.split('\n')
#creating dataframe with comma separation
import pandas as pd
df = pd.DataFrame([i.split(',') for i in t])
Output:
>>> df
0 1 2
0 foo 1 1
1 foo 2 5
2 foo 3 0
3 bar 1 5
4 bar 2 0
5 bar 3 0
6 foo 1 1
7 foo 2 2
8 foo 3 4
9 bar 1 4
10 bar 2 0
11 bar 3 1
12 foo 1 4
13 foo 2 2
14 foo 3 3
15 bar 1 1
16 bar 2 3
17 bar 3 0
#creating 'id' by concatenating the first two columns, renaming the value column to 'value' and dropping the original two columns
df['id'] = df[0] + df[1]
df = df.rename(columns={df.columns[2]: 'value'})
df = df.drop([0, 1], axis=1)
Output:
>>> df
value id
0 1 foo1
1 5 foo2
2 0 foo3
3 5 bar1
4 0 bar2
5 0 bar3
6 1 foo1
7 2 foo2
8 4 foo3
9 4 bar1
10 0 bar2
11 1 bar3
12 4 foo1
13 2 foo2
14 3 foo3
15 1 bar1
16 3 bar2
17 0 bar3
#doing groupby and aggregating to get min and max for each id
df.groupby('id')['value'].agg(['min', 'max'])
Output:
min max
id
bar1 1 5
bar2 0 3
bar3 0 1
foo1 1 4
foo2 2 5
foo3 0 4
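One caveat worth noting (my addition, not part of the original answer): value still holds strings at this point, so min/max compare lexicographically; that happens to work for these single-digit values, but casting first is safer:
# cast to int so min/max compare numerically rather than as strings
df['value'] = df['value'].astype(int)
print(df.groupby('id')['value'].agg(['min', 'max']))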
