Truncate and re-number a column that corresponds to a specific id/group by using Python - python-3.x

I have a dataset given as such in Python:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               2, 2, 2, 2, 2, 2,
               3, 3, 3, 3, 3, 3, 3, 3],
        'runs': [6, 6, 6, 6, 6, 6, 7, 8, 9, 10,
                 3, 3, 3, 4, 5, 6,
                 5, 5, 5, 5, 5, 6, 7, 8],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No',
                     'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
                     'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 120, 141, 123, 128, 66, 120, 141,
                 52, 96, 120, 141, 52, 96,
                 120, 141, 123, 15, 85, 36, 58, 89],
        }
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The above dataframe looks as such:
Here, for every 'id', I wish to drop the rows where 'runs' is repeated and re-number the remaining 'runs' consecutively within that id.
For example,
For id=1, truncate the 'runs' at 6 and re-number the dataset starting from 1.
For id=2, truncate the 'runs' at 3 and re-number the dataset starting from 1.
For id=3, truncate the 'runs' at 5 and re-number the dataset starting from 1.
The net result needs to look as such:
Can somebody please let me know how to achieve this task in Python?

Filter out the duplicates with boolean indexing and duplicated, then renumber with groupby.cumcount:
out = (df[~df.duplicated(subset=['id', 'runs'], keep=False)]
         .assign(runs=lambda d: d.groupby('id').cumcount().add(1))
      )
Output:
id runs Children Days
6 1 1 Yes 128
7 1 2 Yes 66
8 1 3 Yes 120
9 1 4 No 141
13 2 1 Yes 141
14 2 2 Yes 52
15 2 3 Yes 96
21 3 1 Yes 36
22 3 2 Yes 58
23 3 3 No 89
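If you also want a fresh 0-based index on the result (a small optional extra, not something the question asks for), you can reset it:
# optional: renumber the index of the filtered frame
out = out.reset_index(drop=True)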

You can loop over each id and its run cutoff value; on each iteration, select the matching segment of the original dataframe by id and runs, renumber it, and append it to the final dataframe.
df_truncated = pd.DataFrame(columns=df.columns)
for id, run_cutoff in zip([1, 2, 3], [6, 3, 5]):
    df_chunk = df[(df['id'] == id) & (df['runs'] > run_cutoff)].copy()
    df_chunk['runs'] = range(1, len(df_chunk) + 1)
    df_truncated = pd.concat([df_truncated, df_chunk])
Result:
id runs Children Days
6 1 1 Yes 128
7 1 2 Yes 66
8 1 3 Yes 120
9 1 4 No 141
13 2 1 Yes 141
14 2 2 Yes 52
15 2 3 Yes 96
21 3 1 Yes 36
22 3 2 Yes 58
23 3 3 No 89
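If you'd rather not hard-code the ids and cutoffs, one way to derive them is shown below. This is only a sketch, assuming (as in the sample data) that the repeated 'runs' value within each id is always the one to truncate at:
# the cutoff per id is the 'runs' value that is flagged as duplicated
cutoffs = (df[df.duplicated(subset=['id', 'runs'], keep=False)]
             .groupby('id')['runs'].max())
df_truncated = pd.DataFrame(columns=df.columns)
for id_val, run_cutoff in cutoffs.items():  # same loop as above, cutoffs derived
    df_chunk = df[(df['id'] == id_val) & (df['runs'] > run_cutoff)].copy()
    df_chunk['runs'] = range(1, len(df_chunk) + 1)
    df_truncated = pd.concat([df_truncated, df_chunk])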

def function1(dd: pd.DataFrame):
    # drop every row whose 'runs' value is duplicated within the group
    dd1 = dd.drop_duplicates(subset='runs', keep=False)
    # re-number the remaining runs 1..n
    return dd1.assign(runs=dd1.runs.rank().astype(int))

df.groupby('id').apply(function1).reset_index(drop=True)
out:
id runs Children Days
0 1 1 Yes 128
1 1 2 Yes 66
2 1 3 Yes 120
3 1 4 No 141
4 2 1 Yes 141
5 2 2 Yes 52
6 2 3 Yes 96
7 3 1 Yes 36
8 3 2 Yes 58
9 3 3 No 89

Related

Arrange the dataframe in increasing or decreasing order based on specific group/id in Python

I have a dataframe given as such:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': ['A', 'A', 'A', 'A', 'A','A', 'A', 'A', 'A', 'A', 'A',
'B', 'B', 'B', 'B', 'B', 'B',
'C', 'C', 'C', 'C', 'C', 'C',
'D', 'D', 'D', 'D',
'E', 'E', 'E', 'E', 'E','E', 'E', 'E','E'],
'cycle': [1,2, 3, 4, 5,6,7,8,9,10,11,
1,2, 3,4,5,6,
1,2, 3, 4, 5,6,
1,2, 3, 4,
1,2, 3, 4, 5,6,7,8,9,],
'Salary': [7, 7, 7,8,9,10,11,12,13,14,15,
4, 4, 4,4,5,6,
8,9,10,11,12,13,
8,9,10,11,
7, 7,9,10,11,12,13,14,15,],
'Children': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No',
'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
'No','Yes', 'Yes', 'No','No', 'Yes',
'Yes', 'No','Yes', 'Yes',
'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No',],
'Days': [123, 128, 66, 66, 120, 141, 52,96, 120, 141, 52,
96, 120,120, 141, 52,96,
15,123, 128, 66, 120, 141,
141,123, 128, 66,
123, 128, 66, 123, 128, 66, 120, 141, 52,],
}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The above dataframe looks as such:
Here,
id 'A' has 11 cycles
id 'B' has 6 cycles
id 'C' has 6 cycles
id 'D' has 4 cycles
id 'E' has 9 cycles
I need to regroup the dataframe based on the following two cases:
Case 1: Increasing order of the cycle count
The dataframe needs to be arranged in increasing order of the cycle count,
i.e. D (4 cycles) comes first, then B (6 cycles), C (6 cycles), E (9 cycles), A (11 cycles).
The dataframe needs to look as such:
Case 2: Decreasing order of the cycle count
The dataframe needs to be arranged in decreasing order of the cycle count,
i.e. A (11 cycles) comes first, then E (9 cycles), B (6 cycles), C (6 cycles), D (4 cycles).
The dataframe needs to look as such:
In both cases, ids 'B' and 'C' have 6 cycles, so it is immaterial which of 'B' and 'C' comes first.
Also, the index numbers don't change between the original and the regrouped cases.
Can somebody please let me know how to achieve this task in Python?
Use groupby.transform('size') as the sorting value:
Either using a temporary column:
(df.assign(size=df.groupby('id')['cycle'].transform('size'))
   .sort_values(by=['size', 'id'], kind='stable',
                # ascending=False  # uncomment for descending order
                )
   .drop(columns='size')
)
Or, passing as key to sort_values:
df.sort_values(by='id', key=lambda x: df.groupby(x)['cycle'].transform('size'),
kind='stable')
Output:
id cycle Salary Children Days
23 D 1 8 Yes 141
24 D 2 9 No 123
25 D 3 10 Yes 128
26 D 4 11 Yes 66
11 B 1 4 Yes 96
12 B 2 4 Yes 120
13 B 3 4 No 120
14 B 4 4 Yes 141
15 B 5 5 Yes 52
16 B 6 6 Yes 96
17 C 1 8 No 15
18 C 2 9 Yes 123
19 C 3 10 Yes 128
20 C 4 11 No 66
21 C 5 12 No 120
22 C 6 13 Yes 141
27 E 1 7 No 123
28 E 2 7 Yes 128
29 E 3 9 No 66
30 E 4 10 No 123
31 E 5 11 Yes 128
32 E 6 12 Yes 66
33 E 7 13 Yes 120
34 E 8 14 Yes 141
35 E 9 15 No 52
0 A 1 7 No 123
1 A 2 7 Yes 128
2 A 3 7 Yes 66
3 A 4 8 Yes 66
4 A 5 9 Yes 120
5 A 6 10 No 141
6 A 7 11 No 52
7 A 8 12 Yes 96
8 A 9 13 Yes 120
9 A 10 14 Yes 141
10 A 11 15 No 52
col1=df.groupby("id").cycle.transform("max")
case1=df.assign(col1=col1).sort_values(['col1','id'])
case1
case2=df.assign(col1=col1).sort_values(['col1','id'],ascending=False)
case2
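If you want the descending sort to keep ids in ascending order within ties, and to drop the helper column afterwards, a small variation (my sketch, not part of the original answer) would be:
# sort the cycle count descending but keep ties in ascending id order,
# then remove the temporary helper column
case2 = (df.assign(col1=col1)
           .sort_values(['col1', 'id'], ascending=[False, True])
           .drop(columns='col1'))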

Removing rows from DataFrame based on different conditions applied to subset of a data

Here is the Dataframe I am working with:
You can create it using the snippet:
my_dict = {'id': [1,2,1,2,1,2,1,2,1,2,3,1,3, 3],
'category':['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
'value' : [1, 12, 34, 12, 12 ,34, 12, 35, 34, 45, 65, 55, 34, 25]
}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and start removing id's manually by using the code:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But I don't want to do it manually since there are multiple categories. Is there a better way of doing it that considers all the categories at once and removes ids based on conditions listed in a list/dictionary?
If you need to filter the MultiIndex Series s by the dictionary, use Index.get_level_values with Series.map and keep the groups whose counts match via boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print (s[s.eq(s.index.get_level_values(0).map(d))])
category id
a 2 2
3 2
b 1 3
2 3
Name: value, dtype: int64
If you need to filter the original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print (s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print (x[s.eq(x['category'].map(d))])
id category value
1 2 a 12
2 1 b 34
3 2 b 12
4 1 b 12
5 2 b 34
7 2 a 35
8 1 b 34
9 2 b 45
12 3 a 34
13 3 a 25
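As an alternative (my own sketch, not from the original answer), groupby.filter can express the same condition directly, assuming every category appears in the dictionary:
d = {'a': 2, 'b': 3}
# keep only the (category, id) groups whose row count matches the required count
out = (x.groupby(['category', 'id'])
        .filter(lambda g: len(g) == d[g['category'].iloc[0]]))
print(out)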

Efficient evaluation of weighted average variable in a Pandas Dataframe

Please consider the dataframe df generated below:
import pandas as pd
import numpy as np

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'var1': [10, 20, 30, 20, 30, 30],
                'var2': [2, 4, 6, 4, 6, 6],
                'price': [20, 30, 40, 50, 10, 20],
                'sells': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'var1', 'var2', 'price', 'sells'])
    return df
if __name__=="__main__":
df=creatingDataFrame()
setCode=set(df['code'])
listDF=[]
for code in setCode:
dfCode=df[df['code'] == code].copy()
print(dfCode)
lenDfCode=len(dfCode)
if(lenDfCode==1):
theData={'code': [dfCode['code'].iloc[0]],
'var1': [dfCode['var1'].iloc[0]],
'var2': [dfCode['var2'].iloc[0]],
'averagePrice': [dfCode['price'].iloc[0]],
'totalSells': [dfCode['sells'].iloc[0]]
}
else:
dfCode['price*sells']=dfCode['price']*dfCode['sells']
sumSells=np.sum(dfCode['sells'])
sumProducts=np.sum(dfCode['price*sells'])
dfCode['totalSells']=sumSells
av=sumProducts/sumSells
dfCode['averagePrice']=av
theData={'code': [dfCode['code'].iloc[0]],
'var1': [dfCode['var1'].iloc[0]],
'var2': [dfCode['var2'].iloc[0]],
'averagePrice': [dfCode['averagePrice'].iloc[0]],
'totalSells': [dfCode['totalSells'].iloc[0]]
}
dfPart=pd.DataFrame(theData, columns = ['code', 'var1','var2', 'averagePrice','totalSells'])
listDF.append(dfPart)
newDF = pd.concat(listDF)
print(newDF)
I have this dataframe
code var1 var2 price sells
0 1 10 2 20 3
1 2 20 4 30 4
2 3 30 6 40 5
3 2 20 4 50 1
4 3 30 6 10 2
5 3 30 6 20 3
I want to generate the following dataframe:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3
0 2 20 4 34.0 5
0 3 30 6 28.0 10
Note that this dataframe is created from the first by evaluating the average price and the total sells for each code. Furthermore, var1 and var2 are the same for each code. The Python code above does this, but I know it is inefficient. I believe the desired result can be obtained with groupby, but I am not able to work it out.
Use groupby.apply with pd.Series to build both aggregated columns at once:
df.groupby(['code', 'var1', 'var2']).apply(
    lambda x: pd.Series({'averagePrice': sum(x['sells'] * x['price']) / sum(x['sells']),
                         'totalSells': sum(x['sells'])})
).reset_index()
Out[366]:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3.0
1 2 20 4 34.0 5.0
2 3 30 6 28.0 10.0
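If you want to avoid apply entirely, one further option (my own sketch, not part of the original answer) is to pre-compute the weighted column and aggregate with named aggregations:
# weighted sum and total sells per (code, var1, var2), then divide
tmp = df.assign(weighted=df['price'] * df['sells'])
out = (tmp.groupby(['code', 'var1', 'var2'], as_index=False)
          .agg(weighted=('weighted', 'sum'), totalSells=('sells', 'sum')))
out['averagePrice'] = out['weighted'] / out['totalSells']
out = out.drop(columns='weighted')
print(out)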

Pandas: Random integer between values in two columns

How can I create a new column that holds a random integer between the values of two other columns in each row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Result:
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But it doesn't work, of course, because data.start is a Series, not a number.
How can I use numpy.random with data from the columns as a vectorized operation?
You are close; you need to specify axis=1 to process the data by rows, and to change data.start/end to x.start/end so you work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s,e in zip(data['start'], data['end'])]
print (data)
start end rand_between
0 1 10 8
1 2 20 3
2 3 30 23
3 4 40 35
4 5 50 30
5 6 60 28
6 7 70 60
7 8 80 14
8 9 90 85
9 10 100 83
If you want to truly vectorize this, you can generate a random number between 0 and 1 and scale it into the start/end range:
(
data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0 1
1 18
2 18
3 35
4 22
5 27
6 35
7 23
8 33
9 81
dtype: int64
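Another fully vectorized option (my addition, assuming NumPy's Generator API is available) is to pass the columns directly to integers, which broadcasts array-like low/high and keeps the same half-open [start, end) interval as np.random.randint:
rng = np.random.default_rng()
# low/high are broadcast element-wise across the rows
data['rand_between'] = rng.integers(data['start'], data['end'])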

Delete rows of a pandas data frame having string values in python 3.4.1

I have read a csv file with pandas read_csv; it has 8 columns. Each column may contain int/string/float values, but I want to remove the rows that contain string values and return a dataframe with only numeric values. Attaching the csv sample.
I have tried to run the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('new200_with_errors.csv',
                 dtype={'Geo_Level_1': int, 'Geo_Level_2': int, 'Geo_Level_3': int,
                        'Product_Level_1': int, 'Product_Level_2': int,
                        'Product_Level_3': int, 'Total_Sale': float})
print(df)
but I get the following error:
TypeError: unorderable types: NoneType() > int()
I am running with python 3.4.1.
Here is the sample csv.
Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5
The way I would approach this is to convert the columns to int using a user function with a try/except that handles values that cannot be coerced to int; those get set to NaN. Then drop the row with the empty date value; for some reason it actually had a length of 1 when I tested this with your data, so it may work for you with len 0.
In [42]:
# simple function to try to convert the type, returns NaN if the value cannot be coerced
def func(x):
    try:
        return int(x)
    except ValueError:
        return np.nan
# assign multiple columns
df['Pro_L_1'], df['Pro_L_3'], df['Sale'] = df['Pro_L_1'].apply(func), df['Pro_L_3'].apply(func), df['Sale'].apply(func)
# drop the 'empty' date row, take a copy() so we don't get a warning
df = df.loc[df['Date'].str.len() > 1].copy()
# convert the string to a datetime, if we didn't drop the row it would set the empty row to today's date
df['Date'] = pd.to_datetime(df['Date'])
# now convert all the dtypes that are numeric to a numeric dtype
df = df.convert_objects(convert_numeric=True)
# check the dtypes
df.dtypes
Out[42]:
Geo_L_1 int64
Geo_L_2 int64
Geo_L_3 int64
Pro_L_1 float64
Pro_L_2 float64
Pro_L_3 float64
Date datetime64[ns]
Sale float64
dtype: object
In [43]:
# display the current situation
df
Out[43]:
Geo_L_1 Geo_L_2 Geo_L_3 Pro_L_1 Pro_L_2 Pro_L_3 Date Sale
0 1 2 3 129 1 5193316745 2012-01-01 9
1 1 2 3 129 1 5193316745 2013-01-01 NaN
3 1 2 3 129 NaN 5193316745 2012-01-10 10
4 1 2 3 129 1 5193316745 2013-01-10 4
5 1 2 3 NaN 1 5193316745 2014-01-10 6
6 1 2 3 129 1 5193316745 2012-01-11 4
7 1 2 3 129 1 NaN 2013-01-11 2
8 1 2 3 129 1 5193316745 2014-01-11 6
9 1 2 3 129 1 5193316745 2012-01-12 NaN
10 1 2 3 129 1 5193316745 2013-01-12 5
In [44]:
# drop the rows
df.dropna()
Out[44]:
Geo_L_1 Geo_L_2 Geo_L_3 Pro_L_1 Pro_L_2 Pro_L_3 Date Sale
0 1 2 3 129 1 5193316745 2012-01-01 9
4 1 2 3 129 1 5193316745 2013-01-10 4
6 1 2 3 129 1 5193316745 2012-01-11 4
8 1 2 3 129 1 5193316745 2014-01-11 6
10 1 2 3 129 1 5193316745 2013-01-12 5
For the last line, assign it back: df = df.dropna()
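On a modern pandas where convert_objects no longer exists, roughly the same result can be obtained with pd.to_numeric and errors='coerce'. This is only a sketch, assuming the column names from the sample csv:
num_cols = ['Geo_L_1', 'Geo_L_2', 'Geo_L_3', 'Pro_L_1', 'Pro_L_2', 'Pro_L_3', 'Sale']
# non-numeric values become NaN so the offending rows can be dropped
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna()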
