Create a new Id column which starts at 0000 and increments by one in Python - python-3.x

I want to add a new Id column to a data frame. It should start from 0000 and increment by one for each row.
Expected output:

You can use this:
df['id'] = pd.Series(np.arange(len(df))).astype(str).str.zfill(4)
Input:
Place Number Code
0 X A 1
1 Y B 2
2 X C 3
3 Y D 0
4 X F 1
5 Y G 2
6 X H 5
7 Y I 4
Output:
Place Number Code id
0 X A 1 0000
1 Y B 2 0001
2 X C 3 0002
3 Y D 0 0003
4 X F 1 0004
5 Y G 2 0005
6 X H 5 0006
7 Y I 4 0007
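One caveat with the one-liner above: assigning a Series to a column aligns on the index, so it only works as shown when df has the default 0..n-1 index. A minimal sketch that builds the ids by position instead (the sample frame and its index values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with a non-default index, to show the alignment issue.
df = pd.DataFrame(
    {"Place": ["X", "Y", "X"], "Number": ["A", "B", "C"], "Code": [1, 2, 3]},
    index=[10, 11, 12],
)

# A plain list is assigned by position, so the index labels do not matter.
# The format spec 04d zero-pads each counter to four digits.
df["id"] = [f"{i:04d}" for i in range(len(df))]

print(df)
```

With more than 10000 rows the ids simply grow to five digits, the same as with zfill(4).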

Related

pandas fill 0s with mean based on rows that match a condition in another column

I have a dataframe like the one below, in which I need to replace each 0 with the mean of the rows whose parent_key matches that row's self_key.
Input DataFrame:
df = pd.DataFrame({'self_key': ['a','b','c','d','e','e','e','f','f','f'], 'parent_key': [np.nan,'a','b','b','c','c','c','d','d','d'], 'value': [0,0,0,0,4,6,14,12,8,22], 'level': [1,2,3,3,4,4,4,4,4,4]})
Row 3 has a self_key of 'd', so I need to replace its 0 in the 'value' column with the mean of rows 7, 8 and 9, which gives the correct value of 14. Since the lower levels feed into the higher levels, I need to work from the lowest level to the highest to fill out the whole dataframe. But the code below doesn't work and raises the error "ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional". How can I fill in the 0s with the means from lowest level to highest?
df['value']=np.where((df['value']==0) & (df['level']==3), df['value'].groupby(df.where(df['parent_key']==df['self_key'])).transform('mean'), df['value'])
Input
self_key parent_key value level
0 a NaN 0 1
1 b a 0 2
2 c b 0 3
3 d b 0 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
My approach is to repeat the above code 3 times, changing the level from 3 to 2 to 1, but it's not working even for level 3.
Expected Output:
self_key parent_key value level
0 a NaN 11 1
1 b a 11 2
2 c b 8 3
3 d b 14 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
If I understand your problem correctly, you are trying to compute the mean in a bottom-up fashion by filtering the dataframe on certain keys. If so, the following should solve it:
# Walk from the second-lowest level up to level 1.
for l in range(df["level"].max()-1, 0, -1):
    df_sub = df[(df["level"] == l) & (df["value"] == 0)]
    self_keys = df_sub["self_key"].tolist()
    for k in self_keys:
        # Replace the 0 with the mean of the rows whose parent_key is k.
        df.loc[df_sub[df_sub["self_key"] == k].index, "value"] = df[df["parent_key"] == k]["value"].mean()
[Out]:
self_key parent_key value level
0 a NaN 11 1
1 b a 11 2
2 c b 8 3
3 d b 14 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
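The inner loop over keys can also be replaced by one groupby per level. This is only a sketch of the same bottom-up idea, not the answer's code: the names child_mean and mask are made up here, and the value column is cast to float first so the means can be stored without a dtype change.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "self_key": ["a", "b", "c", "d", "e", "e", "e", "f", "f", "f"],
    "parent_key": [np.nan, "a", "b", "b", "c", "c", "c", "d", "d", "d"],
    "value": [0, 0, 0, 0, 4, 6, 14, 12, 8, 22],
    "level": [1, 2, 3, 3, 4, 4, 4, 4, 4, 4],
})
df["value"] = df["value"].astype(float)  # means are floats

# Go from the second-lowest level up to level 1.
for level in range(df["level"].max() - 1, 0, -1):
    # Mean of the children, keyed by the parent they point to.
    child_mean = df.groupby("parent_key")["value"].mean()
    # Rows on this level that still hold the placeholder 0.
    mask = (df["level"] == level) & (df["value"] == 0)
    # Look up each row's own key in the child means.
    df.loc[mask, "value"] = df.loc[mask, "self_key"].map(child_mean)

print(df)
```

The means are recomputed inside the loop on purpose, so each level sees the values already filled in one level below. A key with no children would get NaN from map.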

Complete DataFrame with missing steps python

I have a pandas data frame that is missing some rows. It currently has the following format:
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
...
An observation should pass through every step without gaps. For example, id == 1 has steps 1 and 3 but is missing step 2, which I don't want. id == 2 has steps 1 and 2 and no step 3, which is fine because there is no gap. id == 5 has step 2 but not step 1, so a row is missing there.
I need to add rows to complete the steps, keeping var1, var2 and id the same as in the existing row.
I would like to obtain this df:
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
1 2 a h
5 1 f m
...
It would be awesome if anyone could help with a smooth solution.
You can try pivoting the table, then applying ffill and bfill:
(df.pivot(index='id', columns='step')
   .groupby(level=0, axis=1)
   .apply(lambda x: x.ffill().bfill())
   .stack()
   .reset_index()
)
Output:
id step var1 var2
0 1 1 a h
1 1 2 e l
2 1 3 d k
3 2 1 b i
4 2 2 e l
5 2 3 d k
6 3 1 c g
7 3 2 e l
8 3 3 d k
9 5 1 c g
10 5 2 f m
11 5 3 d k
12 6 1 g n
13 6 2 f m
14 6 3 d k
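Note that the output above fills every id up to step 3 and borrows values from neighbouring ids, which is not quite the table the question asks for. A rough sketch that fills gaps within each id only, assuming steps always start at 1 and run up to that id's largest step (the names parts, filled and completed are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3, 1, 2, 5, 6],
    "step": [1, 1, 1, 3, 2, 2, 1],
    "var1": ["a", "b", "c", "d", "e", "f", "g"],
    "var2": ["h", "i", "g", "k", "l", "m", "n"],
})

parts = []
for key, group in df.groupby("id"):
    # Every step from 1 up to the largest step seen for this id.
    full_steps = range(1, group["step"].max() + 1)
    filled = (
        group.drop(columns="id")
             .set_index("step")
             .reindex(full_steps)  # adds empty rows for the missing steps
             .ffill()              # copy from the previous step ...
             .bfill()              # ... or from the next one if step 1 is missing
    )
    filled.index.name = "step"
    filled = filled.reset_index()
    filled.insert(0, "id", key)
    parts.append(filled)

completed = pd.concat(parts, ignore_index=True)
print(completed)
```

Here id 1 gets a step 2 row copied from step 1, and id 5 gets a step 1 row copied from step 2. The rows come out sorted by id and step rather than appended at the end.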

Finding max value in another column for each unique value in a column in pandas

I am trying to get the row with the max start for each id. This is the table that I have:
id descrip start
0 0000 x 4
1 0000 y 60
2 1111 x 7
3 1111 x 0
4 2222 z 452
5 3333 x 36622
6 3333 t 32
And this is what I want:
id descrip start
0 0000 y 60
1 1111 x 7
2 2222 z 452
3 3333 x 36622
I tried doing this:
df.loc[df.reset_index().groupby(['id'])['start'].idxmax()]
But I have been getting this error:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
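A likely cause: reset_index() gives the grouped frame fresh 0..n-1 labels, so idxmax returns labels that may not exist in df's own index, and .loc then fails on them. A minimal sketch that calls idxmax on df directly, using the table from the question (best_rows is a made-up name):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["0000", "0000", "1111", "1111", "2222", "3333", "3333"],
    "descrip": ["x", "y", "x", "x", "z", "x", "t"],
    "start": [4, 60, 7, 0, 452, 36622, 32],
})

# idxmax returns, for each id, the index label of the row with the largest
# start. Those labels come from df itself, so df.loc can look them up.
best_rows = df.loc[df.groupby("id")["start"].idxmax()].reset_index(drop=True)

print(best_rows)
```

The final reset_index(drop=True) only renumbers the result 0..3 as in the wanted output. If two rows tie for the maximum, idxmax keeps the first one.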

If a value in a column has multiple values in another column, how to filter based on priority in pandas

If I have a data frame like this:
id descrip
0 0000 x
1 0000 y
2 0000 z
3 1111 x
4 1111 z
5 2222 z
6 3333 x
7 3333 y
I want to keep one row per id based on a priority in the descrip column: if there is a z, it is preferred over a y, which is preferred over an x.
So I basically want this:
id descrip
0 0000 z
1 1111 z
2 2222 z
3 3333 y
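I'm not sure how to approach this.
Because z > y > x alphabetically, taking the maximum of descrip per id returns the preferred value:
df.groupby('id')['descrip'].max().reset_index()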
id descrip
0 0 z
1 1111 z
2 2222 z
3 3333 y
It's always good to keep track of exactly what is preferred over what.
Let's say the ordering was different, i.e. y < z < x, where x is the most preferred. Then we could do:
df['descrip'] = df.descrip.astype('category').cat.reorder_categories(['y', 'z', 'x']).\
cat.as_ordered()
df.groupby('id')['descrip'].max().reset_index()
id descrip
0 0 x
1 1111 x
2 2222 z
3 3333 x
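If you would rather not convert the column to a categorical, the priority can also be written out as a plain mapping. This is a sketch of that alternative, not the code above; priority, rank and result are made-up names, and the ranks encode the question's order x < y < z:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["0000", "0000", "0000", "1111", "1111", "2222", "3333", "3333"],
    "descrip": ["x", "y", "z", "x", "z", "z", "x", "y"],
})

# Higher number = more preferred. Change the numbers to change the order.
priority = {"x": 0, "y": 1, "z": 2}
rank = df["descrip"].map(priority)

# For each id, take the label of the row with the highest rank and keep it.
result = df.loc[rank.groupby(df["id"]).idxmax()].reset_index(drop=True)

print(result)
```

Because whole rows are selected, any other columns in df come along unchanged. A descrip value missing from the mapping gets NaN as its rank and is never picked over a ranked one.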

Pandas Dataframe show Count with Group by and Aggregate

I have this data
ID Value1 Value2 Type Type2
1 3 1 A X
2 2 2 A X
3 5 3 B Y
4 2 4 B Z
5 6 8 C Z
6 7 9 C Z
7 8 0 C L
8 3 2 D M
9 4 3 D M
10 6 5 D M
11 8 7 D M
Right now I am able to generate this output using this code:
pandabook.groupby(['Type','Type2'],as_index=False)['Value1', 'Value2'].agg({'Value1': 'sum','Value2': 'sum'})
ID Value1 Value2 Type Type2
1 5 3 A X
2 5 3 B Y
3 2 4 B Z
4 13 17 C Z
5 8 0 C L
6 21 17 D M
I want to show the aggregated count as well, as shown in this example
How can I achieve this output?
Add a new entry to the dictionary that uses the size function, and remove as_index=False to prevent:
ValueError: cannot insert Type, already exists
Finally, rename the column and call reset_index:
df = pandabook.groupby(['Type','Type2']).agg({'Value1': 'sum','Value2': 'sum', 'Type':'size'})
df = df.rename(columns={'Type':'Count'}).reset_index()
print (df)
Type Type2 Value1 Value2 Count
0 A X 5 3 2
1 B Y 5 3 1
2 B Z 2 4 1
3 C L 8 0 1
4 C Z 13 17 2
5 D M 21 17 4
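On pandas 0.25 or later, named aggregation can produce the Count column directly, without aggregating a grouping column and renaming it afterwards. A minimal sketch with the data from the question (summary is a made-up name; the count is taken over Value1, but any column works with size):

```python
import pandas as pd

pandabook = pd.DataFrame({
    "ID": list(range(1, 12)),
    "Value1": [3, 2, 5, 2, 6, 7, 8, 3, 4, 6, 8],
    "Value2": [1, 2, 3, 4, 8, 9, 0, 2, 3, 5, 7],
    "Type": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D"],
    "Type2": ["X", "X", "Y", "Z", "Z", "Z", "L", "M", "M", "M", "M"],
})

# Each keyword is an output column: (input column, aggregation function).
summary = (
    pandabook.groupby(["Type", "Type2"])
             .agg(
                 Value1=("Value1", "sum"),
                 Value2=("Value2", "sum"),
                 Count=("Value1", "size"),  # number of rows in the group
             )
             .reset_index()
)

print(summary)
```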
