Nodal/label assignment according to the load at a terminal in a pandas DataFrame (python-3.x)

Input DataFrame with the boarding requirement in a transport allocation setting:
import pandas as pd

data1 = {
    'N_Id': [0,1,2,2,2,2,4,5,5,5,10,10,11,11,13,13,13,13],
    'N_Name': ['Destiny','N_Area','Pstation','Pstation','Pstation','Pstation','A_Area','T_depot','T_depot','T_depot','B_colony','B_colony','c_colony','c_colony','Z_colony','Z_colony','Z_colony','Z_colony'],
    'Geocode': ['9.994798,0.206728','2.994798,0.206728','2.989385,0.201941','2.989385,0.201941','2.989385,0.201941','2.989385,0.201941','3.006515,0.247101','3.000568,0.256468','3.000568,0.256468','3.000568,0.256468','3.049031,0.167278','3.049031,0.167278','4.049031,0.167278','4.049031,0.167278','5.2,0.7','5.2,0.7','5.2,0.7','5.2,0.7'],
    'pcode': ['None','M023','L123','M0232','L1234','K0324','M0137','M01368','M01369','K07249','M01375','F04509','F04609','F04610','F1','F2','F3','F4']
}
df1 = pd.DataFrame.from_dict(data1)
Output:
df1
Out[45]:
N_Id N_Name Geocode pcode
0 0 Destiny 9.994798,0.206728 None
1 1 N_Area 2.994798,0.206728 M023
2 2 Pstation 2.989385,0.201941 L123
3 2 Pstation 2.989385,0.201941 M0232
4 2 Pstation 2.989385,0.201941 L1234
5 2 Pstation 2.989385,0.201941 K0324
6 4 A_Area 3.006515,0.247101 M0137
7 5 T_depot 3.000568,0.256468 M01368
8 5 T_depot 3.000568,0.256468 M01369
9 5 T_depot 3.000568,0.256468 K07249
10 10 B_colony 3.049031,0.167278 M01375
11 10 B_colony 3.049031,0.167278 F04509
12 11 c_colony 4.049031,0.167278 F04609
13 11 c_colony 4.049031,0.167278 F04610
14 13 Z_colony 5.2,0.7 F1
15 13 Z_colony 5.2,0.7 F2
16 13 Z_colony 5.2,0.7 F3
17 13 Z_colony 5.2,0.7 F4
Second DataFrame with a route plan, to be loaded according to the terminal/nodal load:
data2 = {
    'BusID': ['V1','V1','V1','V4','V4','V4','V5','V5','V5','V0','v100'],
    'Tcap': [4,4,4,8,8,8,12,12,12,12,8],
    'Terminal_Load': [1,2,1,2,1,1,2,1,1,0,4],
    'N_Id': [1,2,5,2,5,10,11,4,10,0,13],
    'N_Name': ['N_Area','Pstation','T_depot','Pstation','T_depot','B_colony','c_colony','A_Area','B_colony','Destiny','Z_colony'],
}
df2 = pd.DataFrame.from_dict(data2)
Output of the second DataFrame:
df2
Out[46]:
BusID Tcap Terminal_Load N_Id N_Name
0 V1 4 1 1 N_Area
1 V1 4 2 2 Pstation
2 V1 4 1 5 T_depot
3 V4 8 2 2 Pstation
4 V4 8 1 5 T_depot
5 V4 8 1 10 B_colony
6 V5 12 2 11 c_colony
7 V5 12 1 4 A_Area
8 V5 12 1 10 B_colony
9 V0 12 0 0 Destiny
10 v100 8 4 13 Z_colony
Required DataFrame format:
data3 = {
    'BusID': ['V1','V1','V1','V4','V4','V4','V5','V5','V5','V0','v100'],
    'Tcap': [4,4,4,8,8,8,12,12,12,12,8],
    'Terminal_Load': [1,2,1,2,1,1,2,1,1,0,4],
    'N_Id': [1,2,5,2,5,10,11,4,10,0,13],
    'N_Name': ['N_Area','Pstation','T_depot','Pstation','T_depot','B_colony','c_colony','A_Area','B_colony','Destiny','Z_colony'],
    'Pcode': ['M023','L123,M0232','M01368','L1234,K0324','M01369','M01375','F04609,F04610','M0137','F04509','','F1,F2,F3,F4'],
}
df3 = pd.DataFrame.from_dict(data3)
Required output:
df3
Out[47]:
BusID Tcap Terminal_Load N_Id N_Name Pcode
0 V1 4 1 1 N_Area M023
1 V1 4 2 2 Pstation L123,M0232
2 V1 4 1 5 T_depot M01368
3 V4 8 2 2 Pstation L1234,K0324
4 V4 8 1 5 T_depot M01369
5 V4 8 1 10 B_colony M01375
6 V5 12 2 11 c_colony F04609,F04610
7 V5 12 1 4 A_Area M0137
8 V5 12 1 10 B_colony F04509
9 V0 12 0 0 Destiny
10 v100 8 4 13 Z_colony F1,F2,F3,F4
The required DataFrame is a combination of the first two DataFrames: I need to assign the Pcode values based on the Terminal_Load requirement.
The load at a terminal is given by the Terminal_Load value. For example, at nodal ID 1 the load is 1, so one person is listed in the pcode; in the second row the terminal load is 2 at nodal ID 2, so two persons are loaded into Pcode. A terminal load of 0 at node 0 leaves Pcode empty.
I am requesting a pandas DataFrame solution, or any non-pandas implementation, e.g. some labelled pop/queue method or an algorithm that allocates against capacity and removes each label from the queue once it is allocated.

First aggregate pcode into lists, as in the previous solution, and add a new column of aggregated lists to df2:
s = (df1.groupby(['N_Id','N_Name'])['pcode']
        .agg(list)
        .rename('Pcode'))
df = df2.join(s, on=['N_Id','N_Name'])
Then the lists need to be split per group by the Terminal_Load values, as in the linked solution (the list is identical within each group, so the first one is selected with iat[0]); the only addition is a cumulative sum starting at 0:
def f(x):
    # slice boundaries: 0, then the running total of Terminal_Load
    indices = [0] + x['Terminal_Load'].cumsum().tolist()
    # the aggregated list is identical within a group, so take the first
    s = x['Pcode'].iat[0]
    # https://stackoverflow.com/a/10851479/2901002
    x['Pcode'] = [','.join(s[indices[i]:indices[i+1]]) for i in range(len(indices)-1)]
    return x
df = df.groupby(['N_Id','N_Name']).apply(f)
print (df)
BusID Tcap Terminal_Load N_Id N_Name Pcode
0 V1 4 1 1 N_Area M023
1 V1 4 2 2 Pstation L123,M0232
2 V1 4 1 5 T_depot M01368
3 V4 8 2 2 Pstation L1234,K0324
4 V4 8 1 5 T_depot M01369
5 V4 8 1 10 B_colony M01375
6 V5 12 2 11 c_colony F04609,F04610
7 V5 12 1 4 A_Area M0137
8 V5 12 1 10 B_colony F04509
9 V0 12 0 0 Destiny
10 v100 8 4 13 Z_colony F1,F2,F3,F4
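For the non-pandas route hinted at in the question, here is a minimal sketch using plain dicts and collections.deque: each node gets a FIFO queue of its pcodes, and each stop in the route plan pops Terminal_Load labels off that queue, so a label is removed once it is allocated. The names node_queues and assigned are illustrative only, not from any library.
from collections import deque, defaultdict

# build a FIFO queue of pcodes per node from data1
node_queues = defaultdict(deque)
for n_id, pcode in zip(data1['N_Id'], data1['pcode']):
    node_queues[n_id].append(pcode)

# walk the route plan, popping Terminal_Load labels per stop
# (assumes the loads never exceed the codes available at a node)
assigned = []
for n_id, load in zip(data2['N_Id'], data2['Terminal_Load']):
    labels = [node_queues[n_id].popleft() for _ in range(load)]
    assigned.append(','.join(labels))

df2['Pcode'] = assigned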

Related

Reassigning multiple columns with same array

I've been breaking my head over this simple thing. I know we can assign a single value to multiple columns using .loc, but how do we assign the same array to multiple columns?
I know I can do the following. Say we have a DataFrame df in which I wish to replace some columns with the array arr:
import random
import pandas as pd

df = pd.DataFrame({'a': [random.randint(1,25) for i in range(5)],
                   'b': [random.randint(1,25) for i in range(5)],
                   'c': [random.randint(1,25) for i in range(5)]})
>>df
a b c
0 14 8 5
1 10 25 9
2 14 14 8
3 10 6 7
4 4 18 2
arr = [i for i in range(5)]
#Suppose if I wish to replace columns `a` and `b` with array `arr`
df['a'],df['b']=[arr for j in range(2)]
Desired output:
a b c
0 0 0 5
1 1 1 9
2 2 2 8
3 3 3 7
4 4 4 2
I could also do this with a loop of assignments, but is there a more efficient way without repetition or loops?
Let's try with assign:
cols = ['a', 'b']
df.assign(**dict.fromkeys(cols, arr))
a b c
0 0 0 5
1 1 1 9
2 2 2 8
3 3 3 7
4 4 4 2
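Note that assign returns a new DataFrame rather than modifying df in place, so to keep the result you would assign it back; a minimal usage sketch:
cols = ['a', 'b']
df = df.assign(**dict.fromkeys(cols, arr))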
I used a chained assignment statement, df.a = df.b = arr:
df = pd.DataFrame({'a': [random.randint(1,25) for i in range(5)],
                   'b': [random.randint(1,25) for i in range(5)],
                   'c': [random.randint(1,25) for i in range(5)]})
arr = [i for i in range(5)]
df
a b c
0 2 8 18
1 17 15 25
2 6 5 17
3 12 15 25
4 10 10 6
df.a = df.b = arr
df
a b c
0 0 0 18
1 1 1 25
2 2 2 17
3 3 3 25
4 4 4 6
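One general pandas caveat worth adding here (my note, not part of the answer above): attribute-style assignment like df.a only updates a column that already exists; df.new_col = arr would silently set an instance attribute instead of creating a column. Bracket notation avoids that:
df['a'] = df['b'] = arr  # safe for both new and existing columns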

Count total rows of an Id from another column

I have a dataframe:
import pandas as pd

# Initialise data of lists
data = {'Id': ['1','2','3','4','5','6','7','8','9','10'],
        'reply_id': [2,2,2,5,5,6,8,8,1,1]}
# Create DataFrame
df = pd.DataFrame(data)
Id reply_id
0 1 2
1 2 2
2 3 2
3 4 5
4 5 5
5 6 6
6 7 8
7 8 8
8 9 1
9 10 1
I want the total occurrences of each Id in reply_id, in a new column new.
For example, Id=1 occurs 2 times in reply_id, which I want in the new column new.
Desired output
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0
I have tried this line of code, but it compares reply_id to Id row by row (and Id holds strings while reply_id holds integers), so it does not produce the desired counts:
df['new'] = df.reply_id.eq(df.Id).astype(int).groupby(df.Id).transform('sum')
In this answer, I used Series.value_counts to count the values in reply_id and converted the result to a dict. Then I used Series.map on the Id column to associate the counts with each Id; fillna(0) fills in values not present in reply_id:
df['new'] = (df['Id']
             .astype(int)
             .map(df['reply_id'].value_counts().to_dict())
             .fillna(0)
             .astype(int))
Use Series.groupby on the column reply_id, then the aggregation function GroupBy.count to create a mapping series counts; finally use Series.map to map the values in the Id column to their respective counts:
counts = df['reply_id'].groupby(df['reply_id']).count()
# Id holds strings, so cast to int before mapping against the integer-keyed counts
df['new'] = df['Id'].astype(int).map(counts).fillna(0).astype(int)
Result:
# print(df)
Id reply_id new
0 1 2 2
1 2 2 3
2 3 2 0
3 4 5 0
4 5 5 2
5 6 6 1
6 7 8 0
7 8 8 2
8 9 1 0
9 10 1 0

Create an aggregate column based on other columns in pandas dataframe

I have a dataframe as below:
import pandas as pd
import numpy as np
import datetime

# initialise data of lists
data = {'group': ["A","A","B","B","B"],
        'A1_val': [4,5,7,6,5],
        'A1M_val': [10,100,100,10,1],
        'AB_val': [4,5,7,6,5],
        'ABM_val': [10,100,100,10,1],
        'AM_VAL': [4,5,7,6,5]}

# Create DataFrame
df1 = pd.DataFrame(data)
df1
group A1_val A1M_val AB_val ABM_val AM_VAL
0 A 4 10 4 10 4
1 A 5 100 5 100 5
2 B 7 100 7 100 7
3 B 6 10 6 10 6
4 B 5 1 5 1 5
Step 1: I want to create columns as below:
A1_agg_val = A1_val + A1M_val (strip the M out of the column name, and if the names then match, sum those columns)
Similarly, AB_agg_val = AB_val + ABM_val
Since there is no matching column for 'AM_VAL', AM_agg_val = AM_VAL
My expected output:
group A1_val A1M_val AB_val ABM_val AM_VAL A1_AGG_val AB_AGG_val AM_AGG_val
0 A 4 10 4 10 4 14 14 4
1 A 5 100 5 100 5 105 105 5
2 B 7 100 7 100 7 107 107 7
3 B 6 10 6 10 6 16 16 6
4 B 5 1 5 1 5 6 6 5
You can use groupby on axis=1:
out = (df1.assign(**df1.loc[:, df1.columns.str.lower().str.endswith('_val')]
                      .groupby(lambda x: x[:2], axis=1)
                      .sum()
                      .add_suffix('_agg_value')))
print(out)
group A1_val A1M_val AB_val ABM_val AM_VAL A1_agg_value AB_agg_value \
0 A 4 10 4 10 4 14 14
1 A 5 100 5 100 5 105 105
2 B 7 100 7 100 7 107 107
3 B 6 10 6 10 6 16 16
4 B 5 1 5 1 5 6 6
AM_agg_value
0 4
1 5
2 7
3 6
4 5
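A side note, not part of the original answer: DataFrame.groupby(..., axis=1) is deprecated in recent pandas (2.1+). A sketch of an equivalent that groups over the transposed frame, assuming the two-character prefix remains the grouping key:
vals = df1.loc[:, df1.columns.str.lower().str.endswith('_val')]
# group the transposed value columns by their first two characters, sum, transpose back
agg = vals.T.groupby(lambda x: x[:2]).sum().T.add_suffix('_agg_value')
out = df1.assign(**agg)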

How can we groupby selected row values from a column and assign it to a new column in pandas df?

Id B
1 6
2 13
1 6
2 6
1 6
2 6
1 10
2 6
2 6
2 6
I want a new column, say C, holding the grouped count of B=6 at the Id level. My attempt (using the column names from my actual data):
Jan18.loc[Jan18['Enquiry Purpose']==6].groupby(Jan18['Member Reference']).transform('count')
Id B No_of_6
1 6 3
2 13 5
1 6 3
2 6 5
1 6 3
2 6 5
1 10 3
2 6 5
2 6 5
2 6 5
Compare values with Series.eq (the == operator), convert to integers, and use GroupBy.transform to fill the new column with the sum per group:
df['No_of_6'] = df['B'].eq(6).astype(int).groupby(df['Id']).transform('sum')
#alternative
#df['No_of_6'] = df.assign(B= df['B'].eq(6).astype(int)).groupby('Id')['B'].transform('sum')
print (df)
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
In general, create a boolean mask from your condition(s) and pass it as below:
mask = df['B'].eq(6)
#alternative
#mask = (df['B'] == 6)
df['No_of_6'] = mask.astype(int).groupby(df['Id']).transform('sum')
A solution using map. Note that this returns NaN for Id groups that contain no 6 at all:
df['No_of_6'] = df.Id.map(df[df.B.eq(6)].groupby('Id').B.count())
Out[113]:
Id B No_of_6
0 1 6 3
1 2 13 5
2 1 6 3
3 2 6 5
4 1 6 3
5 2 6 5
6 1 10 3
7 2 6 5
8 2 6 5
9 2 6 5
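If the NaN for Ids without any 6 is unwanted, a small follow-up in the spirit of the earlier answers fills and casts the result:
df['No_of_6'] = (df.Id.map(df[df.B.eq(6)].groupby('Id').B.count())
                   .fillna(0)
                   .astype(int))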

How do I calculate the probability of every value in a dataframe column quickly in Python?

I want to calculate the probability of each value in a DataFrame column according to the column's own distribution. For example, my data look like this:
data
0 1
1 1
2 2
3 3
4 2
5 2
6 7
7 8
8 3
9 4
10 1
And the output I expect looks like this:
data pro
0 1 0.155015
1 1 0.155015
2 2 0.181213
3 3 0.157379
4 2 0.181213
5 2 0.181213
6 7 0.048717
7 8 0.044892
8 3 0.157379
9 4 0.106164
10 1 0.155015
I also referred to another question (How to compute the probability ...) and built the example above from it. My code is as follows:
import pandas as pd
import scipy.stats

samples = [1,1,2,3,2,2,7,8,3,4,1]
samples = pd.DataFrame(samples, columns=['data'])
print(samples)
kde = scipy.stats.gaussian_kde(samples['data'].tolist())
samples['pro'] = kde.pdf(samples['data'].tolist())
print(samples)
But if the column is long, the operation becomes slow. Is there a better way to do this in pandas? Thanks in advance.
"Its own distribution" does not have to mean a KDE. You can use value_counts with normalize=True:
df.assign(pro=df.data.map(df.data.value_counts(normalize=True)))
data pro
0 1 0.272727
1 1 0.272727
2 2 0.272727
3 3 0.181818
4 2 0.272727
5 2 0.272727
6 7 0.090909
7 8 0.090909
8 3 0.181818
9 4 0.090909
10 1 0.272727
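If a KDE really is required and speed is the concern, a sketch of one workaround (my suggestion, not from the answer above) is to evaluate kde.pdf only on the unique values and map the results back, since the cost grows with the number of evaluation points:
# evaluate the KDE once per distinct value, then broadcast via map
uniques = samples['data'].unique()
samples['pro'] = samples['data'].map(dict(zip(uniques, kde.pdf(uniques))))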
