Can DataFrames have "holes", and how do I implement that? - python-3.x

In pandas we can have a DataFrame of the form

    a0   a1   a2
1  2.3  3.4  2.1
2  3.1  2.1  2.2
3  3.4  4.5  3.3

but is it possible to have something like

    a0   a1   a2
1  2.3  3.4  2.1
2       2.1  2.2
3  3.4       3.3

meaning missing values (holes) inside the DataFrame?
How can I achieve something like this?
(My plan is to later plot this, plotting nothing in the holes.)

Are you trying to use NaN?

import pandas as pd
import numpy as np

df = pd.DataFrame({'a0': [2.3, np.nan, 3.4],
                   'a1': [3.4, 2.1, np.nan],
                   'a2': [2.1, 2.2, 3.3]})
print(df)

Output:

    a0   a1   a2
0  2.3  3.4  2.1
1  NaN  2.1  2.2
2  3.4  NaN  3.3
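Since the plan is to plot, note that plotting libraries generally leave a gap wherever a value is NaN, so the holes simply are not drawn. A minimal sketch, assuming matplotlib as the plotting backend:

import matplotlib.pyplot as plt

# Line plots break at NaN positions, which produces exactly
# the "holes" in the figure.
df.plot(marker='o')
plt.show()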

Related

python pandas data frame: single column to multiple columns based on values

I am new to pandas.
I am trying to split a single column into multiple columns based on index value using groupby. Below is the program I wrote.
import pandas as pd

data = [(0, 1.1),
        (1, 1.2),
        (2, 1.3),
        (0, 2.1),
        (1, 2.2),
        (0, 3.1),
        (1, 3.2),
        (2, 3.3),
        (3, 3.4)]
df = pd.DataFrame(data, columns=['ID', 'test_data'])
df = df.groupby('ID', sort=True).apply(lambda g: pd.Series(g['test_data'].values))
print(df)
df = df.unstack(level=-1).rename(columns=lambda x: 'test_data%s' % x)
print(df)
I have to use unstack(level=-1) because, when the groups have uneven sizes, the groupby and Series store the result as shown below.
ID
0  0    1.1
   1    2.1
   2    3.1
1  0    1.2
   1    2.2
   2    3.2
2  0    1.3
   1    3.3
3  0    3.4
dtype: float64
The end result I am getting after unstack is like below:

    test_data0  test_data1  test_data2
ID
0          1.1         2.1         3.1
1          1.2         2.2         3.2
2          1.3         3.3         NaN
3          3.4         NaN         NaN

but what I am expecting is

    test_data0  test_data1  test_data2
ID
0          1.1         2.1         3.1
1          1.2         2.2         3.2
2          1.3         NaN         3.3
3          NaN         NaN         3.4
Let me know if there is any better way to do this other than groupby.
This will work if your dataframe is sorted as shown:

# count how many times ID has reset to 0 so far; this labels each run
df['num_zeros_seen'] = df['ID'].eq(0).cumsum()
# reshape the table
df = df.pivot(
    index='ID',
    columns='num_zeros_seen',
    values='test_data',
)
print(df)

Output:

num_zeros_seen    1    2    3
ID
0               1.1  2.1  3.1
1               1.2  2.2  3.2
2               1.3  NaN  3.3
3               NaN  NaN  3.4
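For intuition, here is the intermediate column on the question's data, as a quick standalone sketch: each time ID returns to 0 a new run begins, so the running count of zeros labels the runs.

import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2, 0, 1, 0, 1, 2, 3],
                   'test_data': [1.1, 1.2, 1.3, 2.1, 2.2, 3.1, 3.2, 3.3, 3.4]})
# eq(0) marks the start of each run; cumsum() turns those marks
# into the run labels 1, 1, 1, 2, 2, 3, 3, 3, 3
df['num_zeros_seen'] = df['ID'].eq(0).cumsum()
print(df)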

Two row header to one row header for a data frame in Pandas

I have a data set with a multi-index, 2-dimensional header. I would like to merge it into a single header row, replacing empty column names in the first row with the previous non-NaN column name in the same row.
Below is the structure of the dataframe I have.
The first two rows are the header.

         id  One            Two
   response   X1   Y1   Z1    X2    Y2
0         0  1.1  1.2  1.4  1.11  1.22
1         1  1.1  1.2  1.3  1.11  1.22
2         2  1.1  1.2  1.1  1.11  1.22
I want to change the above data frame to the one below:

   id  One 1.X1  One 2.Y1  One 3.Z1  Two 1.X2  Two 2.Y2
0   0       1.1       1.2       1.4      1.11      1.22
1   1       1.1       1.2       1.3      1.11      1.22
2   2       1.1       1.2       1.1      1.11      1.22

The actual data frame has more than 100 columns.
Hope someone can help me here.
Thank you so much.
Mary Abin.
If your columns are indeed a MultiIndex, i.e.

print(df.columns)

MultiIndex([( 'id', 'response'),
            ('One',       'X1'),
            ('One',       'Y1'),
            ('One',       'Z1'),
            ('Two',       'X2'),
            ('Two',       'Y2')],
           )
then we can pass them into a new data frame and use a cumulative count on the first level before flattening the columns.
s = pd.DataFrame.from_records(df.columns)
s['col'] = (s.groupby(0).cumcount() + 1).astype(str) + '.'
# skip the first row and re-order columns to match your desired order
df.columns = ['id'] + s.iloc[1:, [0, 2, 1]].astype(str).agg(' '.join, 1).tolist()
print(df)
   id  One 1. X1  One 2. Y1  One 3. Z1  Two 1. X2  Two 2. Y2
0   0        1.1        1.2        1.4       1.11       1.22
1   1        1.1        1.2        1.3       1.11       1.22
2   2        1.1        1.2        1.1       1.11       1.22

print(s)

     0         1 col
0   id  response  1.
1  One        X1  1.
2  One        Y1  2.
3  One        Z1  3.
4  Two        X2  1.
5  Two        Y2  2.
Alternatively, if you only need to drop the top header row (note this does not merge the names into the second level):

df.columns = df.columns.droplevel(0)

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.droplevel.html
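As an aside, starting from the original MultiIndex columns, if a flat, readable header is all you need and the cumulative numbering is not required, a common idiom is to join the levels directly. A minimal sketch:

# Join the two header levels into one string per column,
# e.g. ('One', 'X1') -> 'One X1'
df.columns = [' '.join(map(str, col)).strip() for col in df.columns]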

Cannot relate new groupby dataframe to original dataframe

I have a dataframe called differ_vt_outer to which I want to apply groupby and summarize on the columns 'CO_FIPS' and 'FLD_ZONE':

  FLD_ZONE  AWPOP_1  Area_SQMI_1  AWHU_1  AWPOP_2  Area_SQMI_2  AWHU_2  CO_FIPS
0        A     18.1         23.1   101.3     3.00         23.1     3.1    50001
1       AE      6.7         13.5    58.6     0.03         13.5     4.8    50001
2        N      1.3          1.2    23.0     7.10          1.2     8.3    50001
3        X      0.0         38.5     0.0     0.00         38.5     0.0    50001
4     X500      4.6         44.5     4.8     4.80         44.5     2.1    50001
I create a new dataframe for the grouped and summarized data:
vt_sum = differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE']).agg(
    {'AWPOP_1': 'sum', 'Area_SQMI_1': 'sum', 'AWHU_1': 'sum',
     'AWPOP_2': 'sum', 'Area_SQMI_2': 'sum', 'AWHU_2': 'sum'})
The new dataframe looks something like this:

vt_sum.head()

                   AWPOP_1  Area_SQMI_1   AWHU_1  AWPOP_2  Area_SQMI_2   AWHU_2
CO_FIPS FLD_ZONE
50001   A           2335.8         79.7   1095.1   2334.0         79.7   1094.1
        AE          2134.5         74.1   1179.5   2134.5         74.1   1179.5
        N             96.8          0.2     13.1     94.0          0.2     11.7
        X          68119.7       1333.2  30623.9  68115.5       1333.2  30621.9
        X500         339.2          4.4    149.8    339.2          4.4    149.8
50003   A           1006.9          4.8    542.7   1006.9          4.8    542.7
        AE          2441.6          2.3   1265.0   2441.6          2.3   1265.0
        AO             3.1          0.0      3.5      3.1          0.0      3.5
        X          34896.6        700.4  20075.2  34896.6        700.4  20075.2
Now, I want to relate the summarized dataframe back to the original differ_vt_outer dataframe and create new columns based on the summarized values. For example, for CO_FIPS = 50001 and FLD_ZONE = A, I want to add a column called Tot_AWPOP_1 that has a value of 2335.8.
differ_vt_outer['Tot_AWPOP_1'] = vt_sum['AWPOP_1'].values
However, when I run this, I get the error:
ValueError: Length of values does not match length of index
How can I resolve this?
You can use transform instead of agg after the groupby and join the result to the original dataframe after add_prefix on the columns' names; try:

list_col_sum = ['AWPOP_1', 'Area_SQMI_1', 'AWHU_1', 'AWPOP_2', 'Area_SQMI_2', 'AWHU_2']
differ_vt_outer = differ_vt_outer.join(
    differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE'])[list_col_sum]
                   .transform('sum')
                   .add_prefix('Tot_')
)
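The reason this works: agg collapses each group to a single row, while transform returns a result aligned to the original index, so it can be assigned back directly. A minimal single-column sketch of the same idea:

# transform('sum') broadcasts each group's total back to every row
# of that group, so the length matches the original frame
differ_vt_outer['Tot_AWPOP_1'] = (
    differ_vt_outer.groupby(['CO_FIPS', 'FLD_ZONE'])['AWPOP_1']
                   .transform('sum')
)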

How do I fill these `NaN` values properly?

Here's my original dataframe with NaN values, which I'm trying to fill:
https://prnt.sc/i40j33
If I use df.interpolate(axis=1) to fill up the NaN values, only some of the rows fill up properly with a number.
For example:
https://prnt.sc/i40mgq
As you can see in the screenshot, column 1981, row 3, which had a NaN value, has filled up properly with a value other than NaN. How do I fill the rest of the NaN values like that as well?
Using DataFrame.interpolate()
In your case it is failing because there are no columns to the left, and therefore the interpolate method doesn't know what to interpolate it to: missing_value = (left_value + right_value)/2
So you could, for example, insert a column to the left with all 0's (if you would like to impute your missing values on the first column with half of the next value), as such:
df.insert(loc=0, column='allZeroes', value=0)
After this, you could interpolate as you are doing and remove the column
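Putting those two steps together, a minimal sketch (the allZeroes name is just the helper column suggested above):

# Pad with a zero column so row-wise interpolation has a left
# neighbor, interpolate, then drop the helper column
df.insert(loc=0, column='allZeroes', value=0)
df = df.interpolate(axis=1).drop(columns='allZeroes')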
General missing value imputation
Either use df.fillna('DEFAULT-VALUE'), as Alex mentioned in the comments to the question (see the DataFrame.fillna docs),
or do something like:

df.loc[df.my_col.isnull(), 'my_col'] = 'DEFAULT-VALUE'

I'd recommend using fillna, as it supports methods such as forward fill (ffill) -- impute the missing values with the previous value -- and other similar strategies.
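For example, a minimal forward-fill sketch (assuming a column-wise fill is what you want; ffill also accepts axis=1 for row-wise filling):

# Each NaN takes the previous non-missing value in its column
df_filled = df.ffill()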
It seems like you might want to interpolate on axis=0, column-wise:

>>> df = pd.DataFrame(np.arange(35, dtype=float).reshape(5, 7),
...                   columns=[1951, 1961, 1971, 1981, 1991, 2001, 2011],
...                   index=range(0, 5))
>>> df.iloc[1:3, 0] = np.nan
>>> df.iloc[3, 3] = np.nan
>>> df.interpolate(axis=0)
   1951  1961  1971  1981  1991  2001  2011
0   0.0   1.0   2.0   3.0   4.0   5.0   6.0
1   7.0   8.0   9.0  10.0  11.0  12.0  13.0
2  14.0  15.0  16.0  17.0  18.0  19.0  20.0
3  21.0  22.0  23.0  24.0  25.0  26.0  27.0
4  28.0  29.0  30.0  31.0  32.0  33.0  34.0
Currently you're interpolating row-wise. NaNs that "begin" a Series aren't padded by a value on either side, making interpolation impossible for them.
Update: pandas is adding some more optionality for this in v 0.23.0.
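For instance, interpolate's limit_direction parameter can also fill NaNs that sit before the first or after the last valid value. A hedged sketch:

# limit_direction='both' extends the fill to leading and trailing
# NaNs in each column, using the nearest valid values
df.interpolate(axis=0, limit_direction='both')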

groupby python TypeError: unorderable types: tuple() < str()

I initially wrote some code in Python 2.7, but now I have switched to Python 3.5.
I want to aggregate numeric data from a couple of columns, grouping by the rest of them, or at least one.
Here's my initial dataframe "testdf":

  PROD_TAG BRAND             Market  ('VAL', 'Per1')  ('VAL', 'Per2')
       P_1     A       Modern Trade              4.3            0.155
       P_2     A  Traditional Trade              5.7            0
       P_3     B       Modern Trade             10.0           11.2
       P_3     B  Traditional Trade              8.7            6.3
       P_4     C       Modern Trade             12.1           12.3
       P_5     D       Modern Trade              8.0            7.0
The last two column headers are tuples (thanks, captain obvious). Per1 and Per2 stand for the respective periods.
I want to execute a line of code which worked previously on Python 2.7:

testdf = testdf.groupby(['BRAND'])[('VAL', 'Per1'), ('VAL', 'Per2')].sum()

It doesn't work because of the tuple type of the column headers and raises:

TypeError: unorderable types: tuple() < str()

Now, if I rename the column headers like this:

testdf.columns = ['PROD_TAG', 'BRAND', 'Market', 'VAL-P1', 'VAL-P2']

(removing the tuples), I'll be able to execute the same line of code with the new column names:

testdf1 = testdf.groupby(['BRAND'])['VAL-P1', 'VAL-P2'].sum()
and get in the end:

       VAL-P1  VAL-P2
BRAND
A        10.0   0.155
B        18.7  17.5
C        12.1  12.3
D         8.0   7.0
The weirdest thing here is that if I use .mean(), .min() or .max() instead of .sum(), everything works fine even with tuples.
Can anybody explain to me how I can make such aggregations with tuple column names work on Python 3.5?
I think you need to use groupby.agg and pass a function that aggregates the sum of each group, as shown:

df = pd.DataFrame({'PROD_TAG': ["P_1", "P_2", "P_3", "P_3", "P_4", "P_5"],
                   'BRAND': ["A", "A", "B", "B", "C", "D"],
                   'Market': ["Modern Trade", "Traditional Trade",
                              "Modern Trade", "Traditional Trade",
                              "Modern Trade", "Modern Trade"],
                   ('VAL', 'Per1'): [4.3, 5.7, 10.0, 8.7, 12.1, 8.0],
                   ('VAL', 'Per2'): [0.155, 0, 11.2, 6.3, 12.3, 7.0]})

type(df[('VAL', 'Per1')].name)
#<class 'tuple'>

df.groupby(['BRAND'])[('VAL', 'Per1'), ('VAL', 'Per2')].agg(lambda x: x.sum())

       (VAL, Per1)  (VAL, Per2)
BRAND
A             10.0        0.155
B             18.7       17.500
C             12.1       12.300
D              8.0        7.000
Alternatively, with as_index=False the grouper column is kept as a regular column instead of becoming the index. This also gets rid of the TypeError caused by comparing the mixed tuple/str column names:
df.groupby(['BRAND'], as_index=False)[('VAL', 'Per1'), ('VAL', 'Per2')].sum()

  BRAND  (VAL, Per1)  (VAL, Per2)
0     A         10.0        0.155
1     B         18.7       17.500
2     C         12.1       12.300
3     D          8.0        7.000
But if you rename the tuple columns to strings, you can continue as before without using agg functions:

df.rename(index=str, columns={('VAL', 'Per1'): "('VAL','Per1')",
                              ('VAL', 'Per2'): "('VAL','Per2')"}, inplace=True)

type(df["('VAL','Per1')"].name)
#<class 'str'>

df.groupby(['BRAND'])["('VAL','Per1')", "('VAL','Per2')"].sum()

       ('VAL','Per1')  ('VAL','Per2')
BRAND
A                10.0           0.155
B                18.7          17.500
C                12.1          12.300
D                 8.0           7.000
Note: Tested in Python 3.5
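A hedged alternative that sidesteps the issue entirely: flatten the tuple labels to plain strings once, then aggregate as usual.

# '_'.join is applied only to tuple labels; plain string labels
# such as 'BRAND' pass through unchanged
df.columns = ['_'.join(c) if isinstance(c, tuple) else c for c in df.columns]
df.groupby('BRAND')[['VAL_Per1', 'VAL_Per2']].sum()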
