Complete DataFrame with missing steps - python-3.x

I have a pandas data frame which misses some rows. It actually has the following format:
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
...
An observation should pass through every step with no gaps. I mean id == 1 has steps 1 and 3 but misses step 2 (which I don't want); id == 2 has steps 1 and 2 and there is no step 3, and this is fine because there is no gap; id == 5 has step 2 but doesn't have step 1, so I am missing a line there.
I need to add some rows to complete the steps, keeping id, var1 and var2 the same as in the existing rows for that id.
I would like to obtain this df :
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
1 2 a h
5 1 f m
...
It would be awesome if anyone could help with a smooth solution

You can try pivoting the table, then ffill and bfill along the step axis so each id keeps its own values:
(df.pivot(index='id', columns='step')
.groupby(level=0, axis=1)
.apply(lambda x: x.ffill(axis=1).bfill(axis=1))
.stack()
.reset_index()
)
Output:
id step var1 var2
0 1 1 a h
1 1 2 a h
2 1 3 d k
3 2 1 b i
4 2 2 e l
5 2 3 e l
6 3 1 c g
7 3 2 c g
8 3 3 c g
9 5 1 f m
10 5 2 f m
11 5 3 f m
12 6 1 g n
13 6 2 g n
14 6 3 g n
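A self-contained sketch of the same pivot-then-fill idea that avoids the column-axis groupby (which recent pandas versions deprecate) by moving the variable level into the index, so the fills run along the steps of a single (id, variable) pair only:

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 2, 3, 1, 2, 5, 6],
    'step': [1, 1, 1, 3, 2, 2, 1],
    'var1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'var2': ['h', 'i', 'g', 'k', 'l', 'm', 'n'],
})

out = (df.pivot(index='id', columns='step')   # columns: (variable, step)
         .stack(level=0)                      # index: (id, variable), columns: step
         .ffill(axis=1)                       # fill a gap from the earlier step
         .bfill(axis=1)                       # fill a leading gap from a later step
         .stack()                             # long format: (id, variable, step)
         .unstack(level=1)                    # variables back to columns
         .reset_index())
```

This fills every id out to the full step range, so ids with no gap (like id == 2) also gain trailing rows, as in the answer above.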

Related

Groupby count in pandas of two categorical columns - pandas

I have df as shown below.
df
ID Type Status Age
1 2 P 23
2 1 P 28
3 1 F 33
4 3 P 48
5 1 F 23
6 2 P 28
7 2 F 23
8 3 P 38
From the above I would like to perform groupby count of Status based on Type
Expected output:
Type Status Frequency
1 F 2
1 P 1
2 F 1
2 P 2
3 F 0
3 P 2
I tried the code below:
df.groupby('Type').agg({'Status': 'size'}).\
sort_values(ascending = False).reset_index()
I think you want value_counts:
df.groupby('Type').Status.value_counts().reset_index(name='Frequency')
Output:
Type Status Frequency
0 1 F 2
1 1 P 1
2 2 P 2
3 2 F 1
4 3 P 2
Or replace reset_index with unstack to get the missing groups:
df.groupby('Type').Status.value_counts().unstack(fill_value=0)
Output:
Status F P
Type
1 2 1
2 1 2
3 0 2
Note: df.groupby('Type').Status.value_counts() is somewhat equivalent to df.groupby(['Type', 'Status']).size().
Let us try crosstab
pd.crosstab(df.Type, df.Status)
Out[268]:
Status F P
Type
1 2 1
2 1 2
3 0 2
pd.crosstab(df.Type, df.Status).stack().reset_index(name = 'freq')
Out[273]:
Type Status freq
0 1 F 2
1 1 P 1
2 2 F 1
3 2 P 2
4 3 F 0
5 3 P 2
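For reference, a runnable sketch of the crosstab route on the sample data, which keeps the zero-count (3, F) group:

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     range(1, 9),
    'Type':   [2, 1, 1, 3, 1, 2, 2, 3],
    'Status': ['P', 'P', 'F', 'P', 'F', 'P', 'F', 'P'],
    'Age':    [23, 28, 33, 48, 23, 28, 23, 38],
})

# crosstab builds the full Type x Status table (empty cells are 0),
# and stack flattens it back to one row per (Type, Status) pair.
freq = (pd.crosstab(df.Type, df.Status)
          .stack()
          .reset_index(name='Frequency'))
```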

pandas transform one row into multiple rows

I have a dataframe as below.
ID list
1 a, b, c
2 a, s
3 NA
5 f, j, l
I need to break each item in the list column (string) into an independent row as below:
ID item
1 a
1 b
1 c
2 a
2 s
3 NA
5 f
5 j
5 l
Thanks.
Use str.split to separate your items then explode:
print (df.assign(list=df["list"].str.split(", ")).explode("list"))
ID list
0 1 a
0 1 b
0 1 c
1 2 a
1 2 s
2 3 NaN
3 5 f
3 5 j
3 5 l
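A self-contained version of the split-and-explode answer, assuming the NA cell is an actual missing value (None/NaN) rather than the literal string "NA":

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 5],
                   'list': ['a, b, c', 'a, s', None, 'f, j, l']})

# str.split turns each string into a Python list (None stays NaN),
# then explode emits one row per list element, repeating the index.
out = df.assign(list=df['list'].str.split(', ')).explode('list')
```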
A beginner's approach: just another way of doing the same thing using pd.DataFrame.stack
df['list'] = df['list'].map(lambda x : str(x).split(', '))
dfOut = pd.DataFrame(df['list'].values.tolist())
dfOut.index = df['ID']
dfOut = dfOut.stack().reset_index()
del dfOut['level_1']
dfOut.rename(columns = {0 : 'list'}, inplace = True)
Output:
ID list
0 1 a
1 1 b
2 1 c
3 2 a
4 2 s
5 3 nan
6 5 f
7 5 j
8 5 l

Pandas Dataframe show Count with Group by and Aggregate

I have this data
ID Value1 Value2 Type Type2
1 3 1 A X
2 2 2 A X
3 5 3 B Y
4 2 4 B Z
5 6 8 C Z
6 7 9 C Z
7 8 0 C L
8 3 2 D M
9 4 3 D M
10 6 5 D M
11 8 7 D M
Right now I am able to generate this output using this code:
pandabook.groupby(['Type','Type2'],as_index=False)['Value1', 'Value2'].agg({'Value1': 'sum','Value2': 'sum'})
ID Value1 Value2 Type Type2
1 5 3 A X
2 5 3 B Y
3 2 4 B Z
4 13 17 C Z
5 8 0 C L
6 21 17 D M
I want to show the aggregated count as well, as shown in this example.
How can I achieve this output?
Add a new key to the dictionary with the size function, and remove as_index=False to prevent:
ValueError: cannot insert Type, already exists
Last, rename the column and reset_index:
df = pandabook.groupby(['Type','Type2']).agg({'Value1': 'sum','Value2': 'sum', 'Type':'size'})
df = df.rename(columns={'Type':'Count'}).reset_index()
print (df)
Type Type2 Value1 Value2 Count
0 A X 5 3 2
1 B Y 5 3 1
2 B Z 2 4 1
3 C L 8 0 1
4 C Z 13 17 2
5 D M 21 17 4
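Newer pandas versions reject aggregating a grouping column via the dict, so a safer sketch of the same table uses named aggregation, where any column can feed the size count:

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     range(1, 12),
    'Value1': [3, 2, 5, 2, 6, 7, 8, 3, 4, 6, 8],
    'Value2': [1, 2, 3, 4, 8, 9, 0, 2, 3, 5, 7],
    'Type':   list('AABBCCCDDDD'),
    'Type2':  list('XXYZZZLMMMM'),
})

# Named aggregation: output column = (input column, aggregation).
out = (df.groupby(['Type', 'Type2'])
         .agg(Value1=('Value1', 'sum'),
              Value2=('Value2', 'sum'),
              Count=('Value1', 'size'))
         .reset_index())
```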

Python Pandas: copy several columns at specific row from one dataframe to another with different names

I have dataframe1 with columns a,b,c,d with 5 rows.
I also have another dataframe2 with columns e,f,g,h
Let's say I want to copy columns a,b in row 3 from dataframe1 to columns f,g in row 3 at dataframe2.
I tried to use this code:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']]
The result was NaN in dataframe2.
Any ideas how I can solve it?
One idea is to convert to a numpy array to avoid aligning the data by column names:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
Sample:
dataframe1 = pd.DataFrame({'a':list('abcdef'),
'b':[4,5,4,5,5,4],
'c':[7,8,9,4,2,3]})
print (dataframe1)
a b c
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
dataframe2 = pd.DataFrame({'f':list('HIJK'),
'g':[0,0,7,1],
'h':[0,1,0,1]})
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 K 1 1
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 d 5 1

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8, b, A], [8, c, A], [6, i, A], [5, h, A], [4, g, A]].
Consider the dataframe df:
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and 10 - 5 (also 5) largest
v = df.values
i = df.index.values
k = len(v) - 5
pd.DataFrame(
i[v.argpartition(k, 0)[-k:]],
np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
For a single column, you can also sort and slice the top five:
print(your_dataframe['A'].sort_values(ascending=False)[0:5])
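To get exactly the requested [value, index, column] triples, one sketch builds the frame from the values shown above and uses Series.nlargest per column:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 8, 8, 0, 3, 3, 4, 5, 6, 2],
                   'B': [2, 7, 6, 4, 2, 6, 5, 9, 4, 6],
                   'C': [7, 0, 0, 9, 4, 7, 3, 8, 7, 6],
                   'D': [3, 6, 2, 7, 3, 7, 7, 7, 6, 5]},
                  index=list('abcdefghij'))

# nlargest(5) returns the 5 largest values sorted descending
# (ties kept in order of appearance); items() yields (label, value).
top5 = {col: [[val, idx, col] for idx, val in df[col].nlargest(5).items()]
        for col in df.columns}
```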
