Fill missing rows in a python pandas dataframe with repetitive pattern - python-3.x

I am trying to fix missing rows in a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame([[1, 1.2, 3.4], [2, 4.5, 6.7], [3, 1.3, 2.5], [4, 5.6, 7.3],
                   [1, 3.4, 5.8], [2, 5.7, 8.9], [4, 2.4, 2.6], [1, 6.7, 8.4],
                   [3, 6.9, 4.2], [4, 4.2, 1.2]], columns=['#', 'foo', 'bar'])
The above code gives me a pandas DataFrame like this:
Out[10]:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 4 2.4 2.6
7 1 6.7 8.4
8 3 6.9 4.2
9 4 4.2 1.2
As you probably noticed, the values in the '#' column follow a repetitive pattern of 1, 2, 3, 4, 1, 2, 3, 4 ... but with some values missing (in this instance, a 3 before row 6 and a 2 before row 8). My question is: is there any built-in method (function) in pandas to fill the missing rows in this DataFrame according to the repetitive pattern of the '#' column? The values in the other columns of the filled rows can be NaN, or the interpolation/extrapolation/average of the values before and/or after the filled rows. In other words, what I want is like this:
Out[16]:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
I tried to set the '#' column as the index of the DataFrame and reindex it with the regular pattern without missing values, but the problem is that reindex doesn't work with duplicate index values. I know I can always go the traditional way and iterate over the rows in a loop to fix it, but I am afraid this would be time-consuming when working with large data.
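For reference, a minimal sketch of that failing attempt (the exact ValueError wording varies across pandas versions):
df.set_index('#').reindex([1, 2, 3, 4] * 3)
# ValueError: cannot reindex on an axis with duplicate labels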
I would appreciate it if anyone can give me a hint on this.

You need to create groups somehow - here the difference of consecutive '#' values is compared with 1 using Series.lt (a difference below 1 marks the start of a new cycle), then GroupBy.apply is used with DataFrame.reindex:
df1 = (df.groupby(df['#'].diff().lt(1).cumsum())
         .apply(lambda x: x.set_index('#').reindex(range(1, 5)))
         .reset_index(level=0, drop=True)
         .reset_index())
print(df1)
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
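For reference, the grouping key is simply a counter that increases whenever the '#' sequence restarts (i.e. its difference drops below 1):
print(df['#'].diff().lt(1).cumsum())
# 0    0
# 1    0
# 2    0
# 3    0
# 4    1
# 5    1
# 6    1
# 7    2
# 8    2
# 9    2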
Another idea is to create a MultiIndex and reshape with unstack and stack:
import numpy as np

df = (df.set_index(['#', df['#'].diff().lt(1).cumsum()])
        .unstack()
        .reindex(np.arange(4) + 1)
        .stack(dropna=False)
        .sort_index(level=1)
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
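Since the question also allows interpolated values instead of NaN, either result can be post-processed with DataFrame.interpolate; a minimal sketch on df1 from the first approach:
# linear interpolation fills each NaN row from the neighbouring values
df1[['foo', 'bar']] = df1[['foo', 'bar']].interpolate()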

We can mark each group of 1, 2, 3, 4 using eq and cumsum.
Then we group by these markers, reindex each group, and finally concat them back together.
import numpy as np

s = df['#'].eq(4).shift().cumsum().bfill()
pd.concat(
    [d.set_index('#').reindex(np.arange(4) + 1) for _, d in df.groupby(s)]
).reset_index()
Output
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
Note: if a 4 happened to be one of the missing values in your # column, this method would fail.
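If that edge case matters, one option is to borrow the diff-based key from the answers above, which starts a new group whenever '#' stops increasing and therefore does not rely on the value 4 being present; a sketch:
s = df['#'].diff().lt(1).cumsum()
pd.concat(
    [d.set_index('#').reindex(np.arange(4) + 1) for _, d in df.groupby(s)]
).reset_index()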

This is similar to @jezrael's answer, sans the reindex and sort_index (this works because every value 1-4 appears at least once in the '#' column, so unstack already creates all four labels):
df['rep'] = df['#'].diff().le(0).cumsum()
(df.set_index(['rep', '#'])
   .unstack('#')
   .stack('#', dropna=False)
   .reset_index('#')
   .reset_index(drop=True)
)
Output:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2

You could use the complete function from pyjanitor to expose the missing values:
# pip install pyjanitor
import pandas as pd
import janitor as jn

# cumsum creates identifiers for the groups in `#`
(df.assign(counter=df['#'].eq(1).cumsum())
   .complete('#', 'counter')
   # sorting can be skipped if order is not important
   .sort_values('counter', ignore_index=True)
   .drop(columns='counter'))
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
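If you would rather avoid the extra dependency, roughly the same completion can be sketched in plain pandas with a MultiIndex product (counter is the same helper as above):
counter = df['#'].eq(1).cumsum()
full = pd.MultiIndex.from_product([range(1, 5), counter.unique()],
                                  names=['#', 'counter'])
out = (df.assign(counter=counter)
         .set_index(['#', 'counter'])
         .reindex(full)
         .sort_index(level='counter')  # restore cycle-major row order
         .reset_index()
         .drop(columns='counter'))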

Related

How to groupby a key and return min/max values in other columns on a single row?

I have a set of data that I am trying to group together based on a common key in column A and I want it to return a single row of information per grouped key value. Grouping is easy, but I am having issues with my other columns returning the values that I need. Here is the dataframe:
df = pd.DataFrame({'A': [1,2,1,2,3,3,3,4,5,6,6,4,5,5],
                   'B': [1.1,2.1,1.2,2.2,3.1,3.2,3.3,4.1,5.1,6.1,6.2,4.2,5.2,5.3],
                   'C': [10.1,20.1,10.1,20.1,30.1,30.1,30.1,40.1,50.1,60.1,60.1,40.1,50.1,50.1],
                   'D': ['','',10.2,20.2,'','',30.2,'','','',60.2,40.2,'',50.2]})
df
--------------------------------------------------------------------------------------------------
A B C D
0 1 1.1 10.1
1 2 2.1 20.1
2 1 1.2 10.1 10.2
3 2 2.2 20.1 20.2
4 3 3.1 30.1
5 3 3.2 30.1
6 3 3.3 30.1 30.2
7 4 4.1 40.1
8 5 5.1 50.1
9 6 6.1 60.1
10 6 6.2 60.1 60.2
11 4 4.2 40.1 40.2
12 5 5.2 50.1
13 5 5.3 50.1 50.2
I want to group by column "A", have column "B" display the minimum value, and then column "D" return the maximum value. My desired output would look something like this:
A B C D
0 1 1.1 10.1 10.2
1 2 2.1 20.1 20.2
2 3 3.1 30.1 30.2
3 4 4.1 40.1 40.2
4 5 5.1 50.1 50.2
5 6 6.1 60.1 60.2
I have tried grouping by column "A" and then pulling only the row with the minimum value of column "B" for each grouped key, which displays the remaining column values for that row in a single line, but it outputs the empty values for column "D". Currently the output of the code looks like this:
df = df.loc[df.groupby('A')['B'].idxmin()]
df
------------------------------------------------------------------------------------------------
A B C D
0 1 1.1 10.1
1 2 2.1 20.1
4 3 3.1 30.1
7 4 4.1 40.1
8 5 5.1 50.1
9 6 6.1 60.1
I also tried using groupby with lambda and ffill().tail(1), and got the result I wanted for column "D" but column "B" isn't the minimum/lowest value. Here is the code and output for that:
out = df.replace({'': pd.NA}) \
        .groupby("A", as_index=False) \
        .apply(lambda x: x.ffill().tail(1)) \
        .reset_index(level=0, drop=True)
df = out
df
-------------------------------------------------------------------------------------------------
A B C D
2 1 1.2 10.1 10.2
3 2 2.2 20.1 20.2
6 3 3.3 30.1 30.2
11 4 4.2 40.1 40.2
13 5 5.3 50.1 50.2
10 6 6.2 60.1 60.2
Any ideas how I can combine these two pieces of code so that I get the minimum value in column "B" and the maximum value in column "D" in the same row, based on the common key value?
Any help is appreciated.
Try it via the replace() method:
df['D'] = df['D'].replace(r'^\s*$', float('NaN'), regex=True)
# replace the '' (or whitespace-only) strings with NaN
Finally use groupby() and agg():
out = df.groupby('A', as_index=False).agg({'B': 'min', 'C': 'first', 'D': 'max'})
# use groupby and agg according to your needs
Output of out:
A B C D
0 1 1.1 10.1 10.2
1 2 2.1 20.1 20.2
2 3 3.1 30.1 30.2
3 4 4.1 40.1 40.2
4 5 5.1 50.1 50.2
5 6 6.1 60.1 60.2
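Note: in the sample frame, D holds a mix of empty strings and floats, so 'max' happens to work; if your real data stores D entirely as strings (e.g. freshly read from a file), the comparison would be lexicographic. Converting first keeps it numeric; a sketch:
df['D'] = pd.to_numeric(df['D'])  # floats and NaN instead of strings
out = df.groupby('A', as_index=False).agg({'B': 'min', 'C': 'first', 'D': 'max'})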

Replace multiple columns' NaNs with other columns' values in Pandas

Given a dataframe as follows:
date city gdp gdp1 gdp2 gross domestic product pop pop1 pop2
0 2001-03 bj 3.0 NaN NaN NaN 7.0 NaN NaN
1 2001-06 bj 5.0 NaN NaN NaN 6.0 6.0 NaN
2 2001-09 bj 8.0 NaN NaN 8.0 4.0 4.0 NaN
3 2001-12 bj 7.0 NaN 7.0 NaN 2.0 NaN 2.0
4 2001-03 sh 4.0 4.0 NaN NaN 3.0 NaN NaN
5 2001-06 sh 5.0 NaN NaN 5.0 5.0 5.0 NaN
6 2001-09 sh 9.0 NaN NaN NaN 4.0 4.0 NaN
7 2001-12 sh 3.0 3.0 NaN NaN 6.0 NaN 6.0
I want to replace NaNs from gdp and pop with values of gdp1, gdp2, gross domestic product and pop1, pop2 respectively.
date city gdp pop
0 2001-03 bj 3 7
1 2001-06 bj 5 6
2 2001-09 bj 8 4
3 2001-12 bj 7 2
4 2001-03 sh 4 3
5 2001-06 sh 5 5
6 2001-09 sh 9 4
7 2001-12 sh 3 6
The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns?
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)
The idea is to back-fill the missing values across the columns selected by DataFrame.filter. If several of those columns have values in the same row, the leftmost one takes priority; if you change .bfill(axis=1).iloc[:, 0] to .ffill(axis=1).iloc[:, -1], the rightmost one takes priority instead:
# if the first of the filtered columns is gdp / pop
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']

# if the first column could be any of them
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
But if at most one non-missing value per row is possible, you can use max, min, ...:
df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
If you need to specify the column names with lists:
gdp_c = ['gdp1','gdp2','gross domestic product']
pop_c = ['pop1','pop2']
df['gdp'] = df[gdp_c].bfill(axis=1).iloc[:, 0]
df['pop'] = df[pop_c].bfill(axis=1).iloc[:, 0]
df = df[['date','city','gdp','pop']]
print(df)
date city gdp pop
0 2001-03 bj 3.0 7.0
1 2001-06 bj 5.0 6.0
2 2001-09 bj 8.0 4.0
3 2001-12 bj 7.0 2.0
4 2001-03 sh 4.0 3.0
5 2001-06 sh 5.0 5.0
6 2001-09 sh 9.0 4.0
7 2001-12 sh 3.0 6.0
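For completeness, a chained-fillna sketch that mirrors the original loc-based code most directly:
df['gdp'] = (df['gdp'].fillna(df['gdp1'])
                      .fillna(df['gdp2'])
                      .fillna(df['gross domestic product']))
df['pop'] = df['pop'].fillna(df['pop1']).fillna(df['pop2'])
df = df[['date', 'city', 'gdp', 'pop']]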

Python 3.x: Pandas DataFrame How do we combine multiple csv files into one csv file?

I have multiple datasets that have the same number of rows and columns. The first column is 0.1, 2, 3, 4, 5, 6, 7, 8.
For instance,
Data1
0.1 3
2 3
3 0.1
4 10
5 5
6 7
7 9
8 2
Data2
0.1 2
2 1
3 0.1
4 0.5
5 4
6 0.3
7 9
8 2
I want to combine the datasets. However, I would like to combine them by keeping the first column and appending the second column from each file.
0.1 3 2
2 3 1
3 0.1 0.1
4 10 0.5
5 5 4
6 7 0.3
7 9 9
8 2 2
I prefer to use a pandas DataFrame. Any clever way to go about this?
Assuming the first column is the index and the second is data:
df = Data1.join(Data2, lsuffix='_1', rsuffix='_2')
Or using merge, with the column names set to 'A' and 'B':
pd.merge(df1, df2, on='A',suffixes=('_data1','_data2'))
A B_data1 B_data2
0 0.1 3.0 2.0
1 2.0 3.0 1.0
2 3.0 0.1 0.1
3 4.0 10.0 0.5
4 5.0 5.0 4.0
5 6.0 7.0 0.3
6 7.0 9.0 9.0
7 8.0 2.0 2.0
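Since the title asks about CSV files on disk, here is a minimal sketch of reading and combining several of them; the data*.csv pattern and the headerless two-column layout are assumptions:
import glob
import pandas as pd

# read each file with its first column as the shared key
frames = [pd.read_csv(path, header=None, index_col=0)
          for path in sorted(glob.glob('data*.csv'))]

combined = pd.concat(frames, axis=1)  # one value column per file, aligned on the key
combined.to_csv('combined.csv')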

shifting a column down in a pandas dataframe

I have data in the following way
A B C
1 2 3
2 5 6
7 8 9
I want to change the dataframe into
A B C
2 3
1 5 6
2 8 9
3
One way would be to add a blank row to the dataframe and then use shift:
# input df:
A B C
0 1 2 3
1 2 5 6
2 7 8 9
df.loc[len(df.index), :] = None
df['A'] = df.A.shift(1)
print(df)
A B C
0 NaN 2.0 3.0
1 1.0 5.0 6.0
2 2.0 8.0 9.0
3 7.0 NaN NaN
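An equivalent sketch that grows the frame with reindex instead of assigning a None row:
out = df.reindex(range(len(df) + 1))  # appends one all-NaN row at the bottom
out['A'] = out['A'].shift(1)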

Overwrite Value in Dataframe with checking Line before

So the DataFrame is:
1 28.3
2 27.9
3 22.4
4 18.1
5 15.5
6 7.1
7 5.1
8 12.0
9 15.1
10 10.1
Now I want to replace all values over 25 with HSE and all values below 8 with LSE. Everything else is "Middle". But I want to know whether it was over 25 or below 8 before it became "Middle": if it was over 25 before, I would replace the value with "fHtM", and if it was below 8 before, I would replace it with "fLtM".
Thank you in advance.
Desired output, maybe like this:
1 S4
2 S4
3 S4
4 dS3 (down to class S3)
5 dS3
6 dS2
7 dS1
8 uS2 (up to class S2)
9 uS3
10 dS2
You can use cut:
import numpy as np

bins = [-np.inf, 6, 13, 19, np.inf]
labels = ['S1', 'S2', 'S3', 'S4']
df['label'] = pd.cut(df['value'], bins=bins, labels=labels)
print(df)
a value label
0 1 28.3 S4
1 2 27.9 S4
2 3 22.4 S4
3 4 18.1 S3
4 5 15.5 S3
5 6 7.1 S2
6 7 5.1 S1
7 8 12.0 S2
8 9 15.1 S3
9 10 10.1 S2
And if you need to add the trend, use diff:
Explanation:
First take the second character of the label column with str[1], convert it to an int and compute diff. Consecutive duplicates produce 0, so replace those with NaN and forward fill with ffill().
dif = df.label.str[1].astype(int).diff().replace(0, np.nan).ffill()
print(dif)
0 NaN
1 NaN
2 NaN
3 -1.0
4 -1.0
5 -1.0
6 -1.0
7 1.0
8 1.0
9 -1.0
Name: label, dtype: float64
Then use numpy.where to create 'u' where the value is 1.0 and 'd' where it is -1.0, with an empty string for the NaN rows, and prepend this to the label column.
df['label'] = dif.where(dif.isnull(), np.where(dif == 1.0, 'u', 'd')).fillna('') + df.label.astype(str)
print(df)
a value label
0 1 28.3 S4
1 2 27.9 S4
2 3 22.4 S4
3 4 18.1 dS3
4 5 15.5 dS3
5 6 7.1 dS2
6 7 5.1 dS1
7 8 12.0 uS2
8 9 15.1 uS3
9 10 10.1 dS2
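For the literal HSE/LSE/Middle/fHtM/fLtM labelling described at the top of the question, a minimal sketch (the value column name and the new state column are assumptions):
base = pd.Series(pd.NA, index=df.index, dtype='object')
base[df['value'] > 25] = 'HSE'
base[df['value'] < 8] = 'LSE'

# middle rows inherit the most recent HSE/LSE state before them
came_from = base.ffill().map({'HSE': 'fHtM', 'LSE': 'fLtM'})
df['state'] = base.fillna(came_from).fillna('Middle')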
