How to properly create a pandas dataframe with the given data? - python-3.x

I have the following experiment data:
experiment1_device = 'Dev2'
experiment1_values = [1, 2, 3, 4]
experiment1_timestamps = [1, 2, 3, 6]
experiment1_task = 'Oil Level'
experiment1_user = 'Sean'
experiment2_device = 'Dev1'
experiment2_values = [5, 6, 7, 8, 9]
experiment2_timestamps = [1, 2, 3, 4, 6]
experiment2_task = 'Ventilation'
experiment2_user = 'Martin'
experiment3_device = 'Dev1'
experiment3_values = [10, 11, 12, 13, 14]
experiment3_timestamps = [1, 2, 3, 4, 6]
experiment3_task = 'Ventilation'
experiment3_user = 'Sean'
Each experiment consists of:
A user who conducted the experiment
The task the user had to do
The device the user was using
A series of timestamps ...
and a series of data values, each read at a certain timestamp, so len(experimentx_values) == len(experimentx_timestamps)
The data is currently given in the above format, i.e. as individual variables, but I could change this output format if needed. For example, if it would be better, I could put everything in a dict.
The expected output format I would like to achieve is the following:
Timestamp, User and Task should be a MultiIndex, whereas the device is supposed to be the column name, and empty (grey) cells should just contain NaN.
I tried multiple approaches with pd.DataFrame.from_records but couldn't get the desired output format.
Any help is highly appreciated!

Since the data are stored in all of those separate variables, it will take a lot of writing. However, you should try to store the results of each experiment in a DataFrame (built directly from whatever produces those values) and hold all of those DataFrames in a list, which will cut down on the variables you have floating around, as in the sketch below.
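As a rough sketch of that pattern (assuming each experiment can be emitted as a dict; the container format and names here are illustrative, not part of the original data):
import pandas as pd

# Hypothetical container: one dict per experiment (illustrative names).
experiments = [
    {'device': 'Dev2', 'values': [1, 2, 3, 4],
     'timestamps': [1, 2, 3, 6], 'task': 'Oil Level', 'user': 'Sean'},
    # ... one dict per further experiment ...
]

# One DataFrame per experiment, collected in a list instead of numbered variables.
frames = [pd.DataFrame({'Timestamp': e['timestamps'],
                        'User': e['user'],
                        'Task': e['task'],
                        e['device']: e['values']})
          for e in experiments]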
Given your variables, construct DataFrames as follows:
df1 = pd.DataFrame({'Timestamp': experiment1_timestamps,
                    'User': experiment1_user,
                    'Task': experiment1_task,
                    experiment1_device: experiment1_values})
df2 = pd.DataFrame({'Timestamp': experiment2_timestamps,
                    'User': experiment2_user,
                    'Task': experiment2_task,
                    experiment2_device: experiment2_values})
df3 = pd.DataFrame({'Timestamp': experiment3_timestamps,
                    'User': experiment3_user,
                    'Task': experiment3_task,
                    experiment3_device: experiment3_values})
Now join them together into a single DataFrame, setting the index to the columns you want. It looks like your output wants the Cartesian product of all index values, so we'll reindex to get the fully-NaN rows back in:
df = pd.concat([df1, df2, df3]).set_index(['Timestamp', 'User', 'Task']).sort_index()
idx = pd.MultiIndex.from_product([df.index.get_level_values(i).unique()
                                  for i in range(df.index.nlevels)])
df = df.reindex(idx)
                              Dev2  Dev1
Timestamp User   Task
1         Martin Ventilation   NaN   5.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  10.0
                 Oil Level     1.0   NaN
2         Martin Ventilation   NaN   6.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  11.0
                 Oil Level     2.0   NaN
3         Martin Ventilation   NaN   7.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  12.0
                 Oil Level     3.0   NaN
4         Martin Ventilation   NaN   8.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  13.0
                 Oil Level     NaN   NaN
6         Martin Ventilation   NaN   9.0
                 Oil Level     NaN   NaN
          Sean   Ventilation   NaN  14.0
                 Oil Level     4.0   NaN

Maybe there is a simpler way, but you can create sub-dataframes:
import pandas as pd
df1 = pd.DataFrame(data=[[1, 2, 3, 4],[1, 2, 3, 6]]).T
df1.columns = ['values','timestamps']
df1['dev'] = 'Dev2'
df1['user'] = 'Sean'
df1['task'] = 'Oil Level'
df2 = pd.DataFrame(data=[[5, 6, 7, 8, 9],[1, 2, 3, 4, 6]]).T
df2.columns = ['values','timestamps']
df2['dev'] = 'Dev1'
df2['user'] = 'Martin'
df2['task'] = 'Ventilation'
df3 = pd.DataFrame(data=[[10, 11, 12, 13, 14],[1, 2, 3, 4, 6]]).T
df3.columns = ['values','timestamps']
df3['dev'] = 'Dev1'
df3['user'] = 'Sean'
df3['task'] = 'Ventilation'
merge them into one:
df = df1.merge(df2, on=['timestamps','dev','user','task','values'], how='outer')
df = df.merge(df3, on=['timestamps','dev','user','task','values'], how='outer')
and create the aggregation:
piv = df.groupby(['timestamps','user','task','dev']).sum()
piv = piv.unstack() # for the columns dev1, dev2
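For what it's worth, the groupby/unstack pair can also be sketched in one step with pivot_table; under these same column names and with sum as the aggregation, the result should be equivalent:
piv = df.pivot_table(index=['timestamps', 'user', 'task'],
                     columns='dev', values='values', aggfunc='sum')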

Related

pandas pivot dataframe, but add unseen values

I have 2 dataframes:
purchases = pd.DataFrame([['Alice', 'sweeties', 4],
                          ['Bob', 'chocolate', 5],
                          ['Alice', 'chocolate', 3],
                          ['Claudia', 'juice', 2]],
                         columns=['client', 'item', 'quantity'])
goods = pd.DataFrame([['sweeties', 15],
                      ['chocolate', 7],
                      ['juice', 8],
                      ['lemons', 3]], columns=['good', 'price'])
and I want to transform purchases so that its columns and index look like in this photo:
My first thought was to use pivot:
purchases.pivot(columns="item", values="quantity")
Output:
item  chocolate  juice  sweeties
0           NaN    NaN       4.0
1           5.0    NaN       NaN
2           3.0    NaN       NaN
3           NaN    2.0       NaN
The problem is: I also need the lemons column in the pivot result because it's present in the goods dataframe (just filled with None values).
How can I accomplish that?
You can chain with reindex:
purchases.pivot(columns="item", values="quantity").reindex(goods['good'], axis=1)
Output:
good  sweeties  chocolate  juice  lemons
0          4.0        NaN    NaN     NaN
1          NaN        5.0    NaN     NaN
2          NaN        3.0    NaN     NaN
3          NaN        NaN    2.0     NaN
You can use df.merge with df.pivot:
In [3626]: x = goods.merge(purchases, left_on='good', right_on='item', how='left')
In [3628]: x['total'] = x.price * x.quantity # you can tweak this calculation
In [3634]: res = x[['good', 'client', 'total']].pivot(index='client', columns='good', values='total').dropna(how='all').fillna(0)
In [3635]: res
Out[3635]:
good     chocolate  juice  lemons  sweeties
client
Alice         21.0    0.0     0.0      60.0
Bob           35.0    0.0     0.0       0.0
Claudia        0.0   16.0     0.0       0.0

Can't assign value to cell in multiindex dataframe (assigning to copy / slice of df?)

I am trying to assign a value (the mean over a moving window of values in another column) to a cell in a multi-index Pandas dataframe while iterating over it. But when I try to assign the value, it doesn't change.
I am not used to working with multi-indexes and have solved several other problems but this one has me stumped for now...
Toy code that reproduces the problem:
import pandas as pd
import numpy as np

tuples = [
    ('AFG', 1963), ('AFG', 1964), ('AFG', 1965), ('AFG', 1966), ('AFG', 1967), ('AFG', 1968),
    ('BRA', 1963), ('BRA', 1964), ('BRA', 1965), ('BRA', 1966), ('BRA', 1967), ('BRA', 1968)
]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, None], [0, None],
          [12, None], [0, 4],
          [12, 5], [0, 4],
          [12, 2], [0, 4],
          [12, 2], [0, 4],
          [1, 4], [7, 1]]
df = pd.DataFrame(values, columns=['Oil', 'Pop'], index=index)
lag = -2
lead = 0
indicator = 'Pop'
new_indicator = 'Mean_pop'
df[new_indicator] = np.nan
df
Gives:
          Oil  Pop  Mean_pop
AFG 1963   12  NaN       NaN
    1964    0  NaN       NaN
    1965   12  NaN       NaN
    1966    0  4.0       NaN
    1967   12  5.0       NaN
    1968    0  4.0       NaN
BRA 1963   12  2.0       NaN
    1964    0  4.0       NaN
    1965   12  2.0       NaN
    1966    0  4.0       NaN
    1967    1  4.0       NaN
    1968    7  1.0       NaN
Then to iterate over the df:
for country, country_df in df.groupby(level=0):
    oldestyear = country_df[indicator].first_valid_index()[1]
    latestyear = country_df[indicator].last_valid_index()[1]
    for t in range(oldestyear, latestyear + 1):
        print(country, oldestyear, latestyear, t)
        print(" For", country, ", calculate mean over ", t + lag, "to", t + lead,
              "and add to row for year", t)
        dftt = country_df.loc[(country, t + lag):(country, t + lead)]
        print(dftt[indicator])
        mean = dftt[indicator].mean(axis=0)
        print("mean for ", indicator, "in", country, "during", t + lag, "to", t + lead, "is", mean)
        df.loc[country, t][new_indicator] = mean
Diagnostic output not pasted, but df looks the same after iterating over it and I get the following warning on some iterations:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if sys.path[0] == '':
Any pointers will be greatly appreciated.
I think it is as easy as setting the last line to:
df.loc[(country, t), new_indicator] = mean
The original df.loc[country, t][new_indicator] = mean is chained indexing: the first .loc builds an intermediate object (often a copy), and the assignment then lands on that copy rather than on df, which is exactly what the SettingWithCopyWarning is about. Indexing row and column in a single .loc call writes directly into the frame.
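As a side note, the loop itself can usually be replaced by a grouped rolling mean. A vectorized sketch, assuming each country's rows cover consecutive years as in the toy data (lag = -2, lead = 0 means a window of the current row and the two preceding ones):
# Per-country mean over t-2..t; like Series.mean in the loop, rolling().mean() skips NaN.
df['Mean_pop'] = (
    df.groupby(level=0)['Pop']
      .rolling(window=3, min_periods=1)
      .mean()
      .droplevel(0)  # groupby prepends the country level a second time; drop it
)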

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of dropna, but I cannot find the way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
    [1, 2, 3, 4, 5, 6],
    [None, None, 1, 2, 3, 4],
    [None, 1, 2, 3, 4, 5],
    [None, None, 5, 3, 3, 2],
    [None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative, but if possible I would like to use subset and how='all'.
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6
1  NaN  NaN  1.0  2.0  3  4
2  NaN  1.0  2.0  3.0  4  5
3  NaN  NaN  5.0  3.0  3  2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6
1  NaN  NaN  1.0  2.0  3  4
2  NaN  1.0  2.0  3.0  4  5
3  NaN  NaN  5.0  3.0  3  2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
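Another way to get those columns without building any intermediate frame is to ask the column index directly; a small sketch:
cols = df.columns[df.columns.get_level_values(0).isin([1, 2])]
df.dropna(axis=0, how="all", subset=cols)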
If you want to apply the dropna to the columns which have 1 or 2 in the first level, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6
1  NaN  NaN  1.0  2.0  3  4
2  NaN  1.0  2.0  3.0  4  5
3  NaN  NaN  5.0  3.0  3  2
As you can see, the last row (index=4) is gone, because all columns under 1 and 2 were NaN in that row. If you would rather remove every row where any NaN occurs in those columns, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
     1         2       3
     a    b    a    b  a  b
0  1.0  2.0  3.0  4.0  5  6

Pandas Rows with missing values in multiple columns

I have a dataframe with columns age, date and location.
I would like to count how many rows are empty across ALL columns (not just some, but all of them at the same time). I have the following code; each line works independently, but how do I say age AND date AND location isnull?
df['age'].isnull().sum()
df['date'].isnull().sum()
df['location'].isnull().sum()
I would like to return a dataframe after removing the rows with missing values in ALL these three columns, so something like the following lines but combined in one statement:
df.mask(row['location'].isnull())
df[np.isfinite(df['age'])]
df[np.isfinite(df['date'])]
You basically can use your approach, but drop the column indices:
df.isnull().sum().sum()
The first .sum() returns the NaN count per column, and the second .sum() adds those counts up into the total number of NaN values.
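Note that this counts individual NaN cells rather than rows. If you specifically want the rows where age AND date AND location are all null at once, a small sketch:
all_null = df[['age', 'date', 'location']].isnull().all(axis=1)
all_null.sum()  # number of rows that are empty in all three columns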
Similar to Vaishali's answer, you can use df.dropna() to drop all values that are NaN or None and only return your cleaned DataFrame.
In [45]: df = pd.DataFrame({'age': [1, 2, 3, np.NaN, 4, None], 'date': [1, 2, 3, 4, None, 5], 'location': ['a', 'b', 'c', None, 'e', 'f']})
In [46]: df
Out[46]:
   age  date location
0  1.0   1.0        a
1  2.0   2.0        b
2  3.0   3.0        c
3  NaN   4.0     None
4  4.0   NaN        e
5  NaN   5.0        f
In [47]: df.isnull().sum().sum()
Out[47]: 4
In [48]: df.dropna()
Out[48]:
   age  date location
0  1.0   1.0        a
1  2.0   2.0        b
2  3.0   3.0        c
You can find the number of rows in which all values are NaN by
len(df) - len(df.dropna(how='all'))
and drop them by
df = df.dropna(how='all')
This will drop only the rows in which every value is NaN.
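Since the question restricts the test to three specific columns rather than the whole frame, the subset parameter combines with how='all'; a sketch:
df = df.dropna(subset=['age', 'date', 'location'], how='all')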

Filling in nans for numbers in a column-specific way

Given a DataFrame and a list of index labels, is there an efficient pandas function that puts NaN in place of all values vertically preceding each of the entries of the list?
For example, suppose we have the list [4,8] and the following DataFrame:
index    0    1
5        1    2
2        9    3
4      3.2    3
8        9  8.7
The desired output is simply:
index    0    1
5      nan  nan
2      nan  nan
4      3.2  nan
8        9  8.7
Any suggestions for such a function that does this fast?
Here's one NumPy approach based on np.searchsorted -
s = [4, 8]
a = df.values
idx = df.index.values
sidx = np.argsort(idx)
# positions (row numbers) of the labels in s within the unsorted index
matching_row_indx = sidx[np.searchsorted(idx, s, sorter=sidx)]
# one column per entry of s: True for every row strictly above the matching row
mask = np.arange(a.shape[0])[:, None] < matching_row_indx
a[mask] = np.nan
Sample run -
In [107]: df
Out[107]:
         0    1
index
5      1.0  2.0
2      9.0  3.0
4      3.2  3.0
8      9.0  8.7
In [108]: s = [4,8]
In [109]: a = df.values
...: idx = df.index.values
...: sidx = np.argsort(idx)
...: matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
...: mask = np.arange(a.shape[0])[:,None] < matching_row_indx
...: a[mask] = np.nan
...:
In [110]: df
Out[110]:
         0    1
index
5      NaN  NaN
2      NaN  NaN
4      3.2  NaN
8      9.0  8.7
It was a bit tricky to recreate your example but this should do it:
import pandas as pd
import numpy as np

df = pd.DataFrame({'index': [5, 2, 4, 8], 0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]})
df.set_index('index', inplace=True)

for i, item in enumerate([4, 8]):
    for index, row in df.iterrows():
        if index != item:
            # write into df itself: the row yielded by iterrows is a copy,
            # so assigning to row[i] would be silently lost
            df.loc[index, i] = np.nan
        else:
            break
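A shorter variant of the same idea, sketched with Index.get_indexer (this assumes each target label actually appears in the index, one label per column):
import numpy as np

positions = df.index.get_indexer([4, 8])  # row positions of the target labels
for col, pos in enumerate(positions):
    df.iloc[:pos, col] = np.nan           # blank everything above the matching row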
