Can't assign value to cell in multiindex dataframe (assigning to copy / slice of df?) - python-3.x

I am trying to assign a value (the mean of values in another column) to a cell in a multi-index Pandas dataframe, iterating over the dataframe to calculate means over a moving window. But when I try to assign the value, it doesn't change.
I am not used to working with multi-indexes and have solved several other problems but this one has me stumped for now...
Toy code that reproduces the problem:
import pandas as pd
import numpy as np

tuples = [
    ('AFG', 1963), ('AFG', 1964), ('AFG', 1965), ('AFG', 1966), ('AFG', 1967), ('AFG', 1968),
    ('BRA', 1963), ('BRA', 1964), ('BRA', 1965), ('BRA', 1966), ('BRA', 1967), ('BRA', 1968)
]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, None], [0, None],
          [12, None], [0, 4],
          [12, 5], [0, 4],
          [12, 2], [0, 4],
          [12, 2], [0, 4],
          [1, 4], [7, 1]]
df = pd.DataFrame(values, columns=['Oil', 'Pop'], index=index)
lag = -2
lead = 0
indicator = 'Pop'
new_indicator = 'Mean_pop'
df[new_indicator] = np.nan
df
Gives:
Oil Pop Mean_pop
AFG 1963 12 NaN NaN
1964 0 NaN NaN
1965 12 NaN NaN
1966 0 4.0 NaN
1967 12 5.0 NaN
1968 0 4.0 NaN
BRA 1963 12 2.0 NaN
1964 0 4.0 NaN
1965 12 2.0 NaN
1966 0 4.0 NaN
1967 1 4.0 NaN
1968 7 1.0 NaN
Then to iterate over the df:
for country, country_df in df.groupby(level=0):
    oldestyear = country_df[indicator].first_valid_index()[1]
    latestyear = country_df[indicator].last_valid_index()[1]
    for t in range(oldestyear, latestyear+1):
        print(country, oldestyear, latestyear, t)
        print(" For", country, ", calculate mean over ", t+lag, "to", t+lead,
              "and add to row for year", t)
        dftt = country_df.loc[(country, t+lag):(country, t+lead)]
        print(dftt[indicator])
        mean = dftt[indicator].mean(axis=0)
        print("mean for ", indicator, "in", country, "during", t+lag, "to", t+lead, "is", mean)
        df.loc[country, t][new_indicator] = mean
Diagnostic output not pasted, but df looks the same after iterating over it and I get the following warning on some iterations:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if sys.path[0] == '':
Any pointers will be greatly appreciated.

I think it is as easy as setting the last line to:
df.loc[(country, t), new_indicator] = mean
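The difference is that df.loc[country, t][new_indicator] = ... is chained indexing: the second [] may be applied to a temporary copy of the row, which is exactly what the SettingWithCopyWarning is about. Passing the full (row, column) key to a single .loc call writes into df itself. A minimal before/after sketch of just that line (everything else in the loop stays the same):
# chained indexing: the assignment may land on a temporary copy of the row
df.loc[country, t][new_indicator] = mean
# single .loc call with the full (row key, column name): writes into df
df.loc[(country, t), new_indicator] = mean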

Related

Add Multiindex Dataframe and corresponding Series

I am failing to add a multiindex dataframe and a corresponding series. E.g.,
import pandas as pd

df = pd.DataFrame({
    'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1],
    'c': [1, 2, 3, 4], 'd': [1, 1, 1, 1]}).set_index(['a', 'b'])
# Dataframe might contain records that are not in the series and vice versa
s = df['d'].iloc[1:]
df + s
produces
ValueError: cannot join with no overlapping index names
Does anyone know how to resolve this? I can work around the issue by adding each column separately, using e.g.
df['d'] + s
But I would like to add the two in a single operation. Any help is much appreciated.
By default, + tries to align along columns; the following would work with +:
s = df.iloc[:, 1:]
df + s
# c d
#a b
#0 0 NaN 2
# 1 NaN 2
#1 0 NaN 2
# 1 NaN 2
In your case, you need to align along the index. You can explicitly specify axis=0 with the add method for that:
df.add(s, axis=0)
# c d
#a b
#0 0 NaN NaN
# 1 3.0 2.0
#1 0 4.0 2.0
# 1 5.0 2.0

How to properly create a pandas dataframe with the given data?

I have the following experiment data:
experiment1_device = 'Dev2'
experiment1_values = [1, 2, 3, 4]
experiment1_timestamps = [1, 2, 3, 6]
experiment1_task = 'Oil Level'
experiment1_user = 'Sean'
experiment2_device = 'Dev1'
experiment2_values = [5, 6, 7, 8, 9]
experiment2_timestamps = [1, 2, 3, 4, 6]
experiment2_task = 'Ventilation'
experiment2_user = 'Martin'
experiment3_device = 'Dev1'
experiment3_values = [10, 11, 12, 13, 14]
experiment3_timestamps = [1, 2, 3, 4, 6]
experiment3_task = 'Ventilation'
experiment3_user = 'Sean'
Each experiment consists of:
A user who conducted the experiment
The task the user had to do
The device the user was using
A series of timestamps ...
and a series of data values which have been read at a certain timestamp, so len(experimentx_values) == len(experimentx_timestamps)
The data is currently given in the above format, i.e. as individual variables, but I could change this output format if needed, for example if it would be better to put everything in a dict or so.
The expected output format I would like to achieve is the following:
Timestamp, User and Task should be a MultiIndex, whereas the device is supposed to be the column name, and empty (grey) cells should just contain NaN.
I tried multiple approaches with pd.DataFrame.from_records but couldn't get the desired output format.
Any help is highly appreciated!
Since the data are stored in all of those different variables, it will be a lot of writing. However, you should try to store the results of each experiment in a DataFrame (directly from whatever outputs those values) and hold all of those DataFrames in a list, which will cut down on the variables you have floating around (a sketch of that appears at the end of this answer).
Given your variables, construct DataFrames as follows:
import pandas as pd

df1 = pd.DataFrame({'Timestamp': experiment1_timestamps,
                    'User': experiment1_user,
                    'Task': experiment1_task,
                    experiment1_device: experiment1_values})
df2 = pd.DataFrame({'Timestamp': experiment2_timestamps,
                    'User': experiment2_user,
                    'Task': experiment2_task,
                    experiment2_device: experiment2_values})
df3 = pd.DataFrame({'Timestamp': experiment3_timestamps,
                    'User': experiment3_user,
                    'Task': experiment3_task,
                    experiment3_device: experiment3_values})
Now join them together into a single DataFrame, setting the index to the columns you want. It looks like your output wants the cartesian product of all possibilities, so we'll reindex to get the fully NaN rows back in:
df = pd.concat([df1, df2, df3]).set_index(['Timestamp', 'User', 'Task']).sort_index()
idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique()
                                  for i in range(df.index.nlevels)])
df = df.reindex(idx)
Dev2 Dev1
Timestamp User Task
1 Martin Ventilation NaN 5.0
Oil Level NaN NaN
Sean Ventilation NaN 10.0
Oil Level 1.0 NaN
2 Martin Ventilation NaN 6.0
Oil Level NaN NaN
Sean Ventilation NaN 11.0
Oil Level 2.0 NaN
3 Martin Ventilation NaN 7.0
Oil Level NaN NaN
Sean Ventilation NaN 12.0
Oil Level 3.0 NaN
4 Martin Ventilation NaN 8.0
Oil Level NaN NaN
Sean Ventilation NaN 13.0
Oil Level NaN NaN
6 Martin Ventilation NaN 9.0
Oil Level NaN NaN
Sean Ventilation NaN 14.0
Oil Level 4.0 NaN
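As an aside, the list-of-DataFrames idea mentioned at the top of this answer could look roughly like the sketch below. The tuple layout for experiments is hypothetical; adapt it to however your experiment results actually arrive:
# each entry: (device, values, timestamps, task, user) -- one per experiment
experiments = [
    (experiment1_device, experiment1_values, experiment1_timestamps, experiment1_task, experiment1_user),
    (experiment2_device, experiment2_values, experiment2_timestamps, experiment2_task, experiment2_user),
    (experiment3_device, experiment3_values, experiment3_timestamps, experiment3_task, experiment3_user),
]
frames = [pd.DataFrame({'Timestamp': ts, 'User': user, 'Task': task, dev: vals})
          for dev, vals, ts, task, user in experiments]
df = pd.concat(frames).set_index(['Timestamp', 'User', 'Task']).sort_index()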
Maybe there is a simpler way, but you can create sub-dataframes:
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2, 3, 4],[1, 2, 3, 6]]).T
df1.columns = ['values','timestamps']
df1['dev'] = 'Dev2'
df1['user'] = 'Sean'
df1['task'] = 'Oil Level'
df2 = pd.DataFrame(data=[[5, 6, 7, 8, 9],[1, 2, 3, 4, 6]]).T
df2.columns = ['values','timestamps']
df2['dev'] = 'Dev1'
df2['user'] = 'Martin'
df2['task'] = 'Ventilation'
df3 = pd.DataFrame(data=[[10, 11, 12, 13, 14],[1, 2, 3, 4, 6]]).T
df3.columns = ['values','timestamps']
df3['dev'] = 'Dev1'
df3['user'] = 'Sean'
df3['task'] = 'Ventilation'
merge them into one:
df = df1.merge(df2, on=['timestamps','dev','user','task','values'], how='outer')
df = df.merge(df3, on=['timestamps','dev','user','task','values'], how='outer')
and create the aggregation:
piv = df.groupby(['timestamps','user','task','dev']).sum()
piv = piv.unstack() # for the columns dev1, dev2
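One detail worth noting: after the groupby(...).sum().unstack(), the columns come out as a MultiIndex with a leftover 'values' level on top, e.g. ('values', 'Dev1'). If you want plain Dev1/Dev2 column names like in the first answer, dropping that level is one option (a small sketch):
piv.columns = piv.columns.droplevel(0)  # keep only the device names as column labels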

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna but I cannot find a way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
[1, 2, 3, 4, 5, 6],
[None, None, 1, 2, 3, 4],
[None, 1, 2, 3, 4, 5],
[None, None, 5, 3, 3, 2],
[None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative, but if possible I would like to use subset and how='all'.
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
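If you prefer to avoid building any intermediate DataFrame at all, the same column subset can be taken straight from the columns index; a small sketch that is equivalent for this example:
# boolean-mask the columns MultiIndex by its first level
cols = df.columns[df.columns.get_level_values(0).isin([1, 2])]
df.dropna(axis=0, how="all", subset=cols)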
If you want to apply dropna to the columns whose first level is 1 or 2, you can do it as follows:
cols = [(c0, c1) for (c0, c1) in df.columns if c0 in [1, 2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last row (index=4) is gone, because all columns under 1 and 2 were NaN in that row. If you would rather remove every row where any NaN occurs in those columns, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
'c': ['123', '100', '6', np.nan]})
print(df)
a b c
0 20 1.0 NaN
1 50 NaN 1.0
2 100 1.0 1.0
print(df_id)
b c
0 50 123
1 4954 100
2 93920 6
3 20 NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
'c': [np.nan, np.nan, 1]})
print(result)
a b c
0 20 1.0 NaN
1 50 NaN NaN # df_id['c'] did not contain '50'
2 100 NaN 1.0 # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b','c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                  .isin(df_id[letter].tolist()) else np.nan, axis=1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, pandas version 0.20.1
You can solve your problem using this instead:
for letter in ['b','c']:  # dropped the enumerate since it isn't needed here; maybe you do for the rest of your code
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist() else np.nan, axis=1)
Just replace isin with in.
The problem is that when you use apply on df with axis=1, x represents a row of df, so when you select x['a'] you're actually selecting a single element (a scalar).
However, isin only applies to Series or other list-like structures, which is what raises the error, so instead we just use in to check whether that element is in the list.
Hope that was helpful. If you have any questions please ask.
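If you would rather avoid the row-wise apply altogether, the same thing can be done column by column with isin and where; a vectorized sketch, assuming the identifier dtypes in df['a'] and df_id line up as they do in your example:
for letter in ['b', 'c']:
    # keep df[letter] where df['a'] has a match in df_id[letter], otherwise NaN
    df[letter] = df[letter].where(df['a'].isin(df_id[letter]))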
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for i, letter in enumerate(['b','c']):
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
If you have a bigger DataFrame and performance is important to you, you can first build a mask DataFrame and then apply it to your original DataFrame.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
b c
0 True False
1 True False
2 False True
This can be applied to the original dataframe:
df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
a b c
0 20 1.0 NaN
1 50 NaN NaN
2 100 NaN 1.0

Pandas Rows with missing values in multiple columns

I have a dataframe with columns age, date and location.
I would like to count how many rows are empty across ALL columns (not some, but all at the same time). I have the following code; each line works independently, but how do I say age AND date AND location isnull?
df['age'].isnull().sum()
df['date'].isnull().sum()
df['location'].isnull().sum()
I would like to return a dataframe after removing the rows with missing values in ALL these three columns, so something like the following lines but combined in one statement:
df.mask(row['location'].isnull())
df[np.isfinite(df['age'])]
df[np.isfinite(df['date'])]
You basically can use your approach, but drop the column indices:
df.isnull().sum().sum()
The first .sum() returns the per-column NaN counts, while the second .sum() adds those up into the total number of NaN values.
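If what you want to count is specifically the rows where age, date and location are all null at the same time (rather than the total number of NaN cells), a small sketch along the same lines:
df[['age', 'date', 'location']].isnull().all(axis=1).sum()  # rows that are NaN in all three columns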
Similar to Vaishali's answer, you can use df.dropna() to drop all values that are NaN or None and only return your cleaned DataFrame.
In [45]: df = pd.DataFrame({'age': [1, 2, 3, np.NaN, 4, None], 'date': [1, 2, 3, 4, None, 5], 'location': ['a', 'b', 'c', None, 'e', 'f']})
In [46]: df
Out[46]:
age date location
0 1.0 1.0 a
1 2.0 2.0 b
2 3.0 3.0 c
3 NaN 4.0 None
4 4.0 NaN e
5 NaN 5.0 f
In [47]: df.isnull().sum().sum()
Out[47]: 4
In [48]: df.dropna()
Out[48]:
age date location
0 1.0 1.0 a
1 2.0 2.0 b
2 3.0 3.0 c
You can find the number of rows with all NaNs by
len(df) - len(df.dropna(how = 'all'))
and drop by
df = df.dropna(how = 'all')
This will drop the rows with all the NaN values
