I'm reading this data from an Excel file:
    a       b
0   x   y   x   y
1   0   1   2   3
2   0   1   2   3
3   0   1   2   3
4   0   1   2   3
5   0   1   2   3
For each of the categories a and b (a.k.a. samples), there are two columns of x and y values. I want to convert this Excel data into a DataFrame that looks like this (vertically concatenating the data from samples a and b):
sample x y
0 a 0.0 1.0
1 a 0.0 1.0
2 a 0.0 1.0
3 a 0.0 1.0
4 a 0.0 1.0
5 b 2.0 3.0
6 b 2.0 3.0
7 b 2.0 3.0
8 b 2.0 3.0
9 b 2.0 3.0
I've written the following code:
x = np.arange(0, 4, 2)  # positions of the even columns (the first column of each sample)
sample_df = pd.DataFrame()  # create an empty DataFrame
for i in x:  # looping through the Excel data
    sample = pd.read_excel(xls2, usecols=[i, i], nrows=0, header=0)
    values_df = pd.read_excel(xls2, usecols=[i, i + 1], nrows=5, header=1)
    values_df.insert(loc=0, column='sample', value=sample.columns[0])
    sample_df = pd.concat([sample_df, values_df], ignore_index=True)
display(sample_df)
But this is the output I obtain:
sample x y x.1 y.1
0 a 0.0 1.0 NaN NaN
1 a 0.0 1.0 NaN NaN
2 a 0.0 1.0 NaN NaN
3 a 0.0 1.0 NaN NaN
4 a 0.0 1.0 NaN NaN
5 b NaN NaN 2.0 3.0
6 b NaN NaN 2.0 3.0
7 b NaN NaN 2.0 3.0
8 b NaN NaN 2.0 3.0
9 b NaN NaN 2.0 3.0
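The extra x.1 and y.1 columns appear to come from pandas renaming duplicate header labels, so the second pair of columns no longer lines up with x and y when the pieces are concatenated. A minimal sketch of one way around this, assuming the sheet can be read with a two-row header (for example pd.read_excel(xls2, header=[0, 1]), which has not been checked against the actual file). The frame named wide below is a hand-built stand-in for that read, and all variable names are illustrative:

```python
import pandas as pd

# Stand-in for the Excel read (assumption): two header rows,
# sample name on top and x / y underneath.
columns = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])
wide = pd.DataFrame([[0.0, 1.0, 2.0, 3.0]] * 5, columns=columns)

parts = []
for sample in wide.columns.get_level_values(0).unique():
    block = wide[sample].copy()        # columns are now plain 'x' and 'y'
    block.insert(0, "sample", sample)  # tag every row with its sample name
    parts.append(block)

# Stack the per-sample blocks vertically with a fresh 0..n-1 index.
sample_df = pd.concat(parts, ignore_index=True)
print(sample_df)
```

Because every block carries the same x and y labels, the concatenation lines the values up in two columns instead of four.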
I have the following dataframe:
data = pd.DataFrame({
'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
'2010-04-02 12:01:00', '2010-04-01 07:24:00', '2011-01-01 12:01:00',
'2011-01-01 01:27:33', '2013-01-01 12:01:00', '2014-01-01 12:01:00',
'2014-01-01 01:27:33', '2015-01-01 01:27:33', '2016-01-01 01:27:33',
'2011-01-01 01:28:00'],
'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
And I would like to get the following columns:
1- sum_order_total_1: the sum of the column order for each ID over the rows where sort is 1, with NaN on the rows where sort is 0
2- sum_order_total_0: the sum of the column order for each ID over the rows where sort is 0, with NaN on the rows where sort is 1
3- count_order_date_1: the sum of the column order for each ID and each date in Date_Time over the rows where sort is 1, with NaN on the rows where sort is 0
4- count_order_date_0: the sum of the column order for each ID and each date in Date_Time over the rows where sort is 0, with NaN on the rows where sort is 1
The expected results should look like the attached photo:
The problem with groupby (and pd.pivot_table) is that they only do half of the job. They give you the numbers, but not in the format that you want. To finalize the format you can use apply.
For the total counts I used:
# Retrieve your data, similar to the groupby query you provided (df is the DataFrame called data in the question).
data_total = pd.pivot_table(df, values='order', index=['ID'], columns=['sort'], aggfunc=np.sum)
data_total.reset_index(inplace=True)
Which results in the table:
sort ID 0 1
0 1 6.0 11.0
1 2 15.0 NaN
2 3 NaN 9.0
3 4 3.0 2.0
4 5 5.0 NaN
5 6 8.0 6.0
Now, using this table as a lookup ('ID' for the row and 0 or 1 for the sort column), we can write a small function that fills in the right value:
def filter_count(data, row, sort_value):
    """ Select the count that belongs to the correct ID and sort combination. """
    if row['sort'] == sort_value:
        return data[data['ID'] == row['ID']][sort_value].values[0]
    return np.nan
# Applying the above function for both sort values 1 and 0.
df['total_1'] = df.apply(lambda row: filter_count(data_total, row, 1), axis=1, result_type='expand')
df['total_0'] = df.apply(lambda row: filter_count(data_total, row, 0), axis=1, result_type='expand')
This leads to:
ID Date_Time order sort total_1 total_0
0 1 2010-01-01 12:01:00 2 1 11.0 NaN
1 1 2010-01-01 01:27:33 4 1 11.0 NaN
2 1 2010-04-02 12:01:00 5 1 11.0 NaN
3 1 2010-04-01 07:24:00 6 0 NaN 6.0
4 2 2011-01-01 12:01:00 7 0 NaN 15.0
5 2 2011-01-01 01:27:33 8 0 NaN 15.0
6 3 2013-01-01 12:01:00 9 1 9.0 NaN
7 4 2014-01-01 12:01:00 2 1 2.0 NaN
8 4 2014-01-01 01:27:33 3 0 NaN 3.0
9 5 2015-01-01 01:27:33 5 0 NaN 5.0
10 6 2016-01-01 01:27:33 6 1 6.0 NaN
11 6 2011-01-01 01:28:00 8 0 NaN 8.0
Now we can apply the same logic to the date, except that the date also contains the hours, minutes and seconds. These can be stripped out using:
# Since we are interested in a per-day basis, we remove the hour/minute/second part
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.strftime('%Y/%m/%d')
Now applying the same trick as above, we create a new pivot table, based on the 'ID' and 'order_day':
data_date = pd.pivot_table(df, values='order', index=['ID', 'order_day'], columns=['sort'], aggfunc=np.sum)
data_date.reset_index(inplace=True)
Which is:
sort ID order_day 0 1
0 1 2010/01/01 NaN 6.0
1 1 2010/04/01 6.0 NaN
2 1 2010/04/02 NaN 5.0
3 2 2011/01/01 15.0 NaN
4 3 2013/01/01 NaN 9.0
5 4 2014/01/01 3.0 2.0
6 5 2015/01/01 5.0 NaN
7 6 2011/01/01 8.0 NaN
8 6 2016/01/01 NaN 6.0
Writing a second function to fill in the correct value based on 'ID' and 'date':
def filter_date(data, row, sort_value):
    """ Select the sum that belongs to the correct ID, day and sort combination. """
    if row['sort'] == sort_value:
        return data[(data['ID'] == row['ID']) & (data['order_day'] == row['order_day'])][sort_value].values[0]
    return np.nan
# Applying the above function for both sort values 0 and 1.
df['date_0'] = df.apply(lambda row: filter_date(data_date, row, 0), axis=1, result_type='expand')
df['date_1'] = df.apply(lambda row: filter_date(data_date, row, 1), axis=1, result_type='expand')
Now we only have to drop the temporary column 'order_day':
df.drop(labels=['order_day'], axis=1, inplace=True)
And the final answer becomes:
ID Date_Time order sort total_1 total_0 date_0 date_1
0 1 2010-01-01 12:01:00 2 1 11.0 NaN NaN 6.0
1 1 2010-01-01 01:27:33 4 1 11.0 NaN NaN 6.0
2 1 2010-04-02 12:01:00 5 1 11.0 NaN NaN 5.0
3 1 2010-04-01 07:24:00 6 0 NaN 6.0 6.0 NaN
4 2 2011-01-01 12:01:00 7 0 NaN 15.0 15.0 NaN
5 2 2011-01-01 01:27:33 8 0 NaN 15.0 15.0 NaN
6 3 2013-01-01 12:01:00 9 1 9.0 NaN NaN 9.0
7 4 2014-01-01 12:01:00 2 1 2.0 NaN NaN 2.0
8 4 2014-01-01 01:27:33 3 0 NaN 3.0 3.0 NaN
9 5 2015-01-01 01:27:33 5 0 NaN 5.0 5.0 NaN
10 6 2016-01-01 01:27:33 6 1 6.0 NaN NaN 6.0
11 6 2011-01-01 01:28:00 8 0 NaN 8.0 8.0 NaN
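As a possible alternative to the row-wise apply calls, the same four columns can be sketched with groupby(...).transform('sum') followed by where. This is only a sketch, not the method used above: it rebuilds the question's frame under the name df, and the names day, total and per_day are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
    'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
                  '2010-04-02 12:01:00', '2010-04-01 07:24:00',
                  '2011-01-01 12:01:00', '2011-01-01 01:27:33',
                  '2013-01-01 12:01:00', '2014-01-01 12:01:00',
                  '2014-01-01 01:27:33', '2015-01-01 01:27:33',
                  '2016-01-01 01:27:33', '2011-01-01 01:28:00'],
    'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
    'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})

# Calendar day of each row (time of day set to midnight).
day = pd.to_datetime(df['Date_Time']).dt.normalize().rename('order_day')

# Sum of 'order' broadcast back to every row of its group.
total = df['order'].groupby([df['ID'], df['sort']]).transform('sum')
per_day = df['order'].groupby([df['ID'], df['sort'], day]).transform('sum')

for s in (1, 0):
    mask = df['sort'] == s
    # where() keeps the value on rows with this sort value and puts NaN elsewhere.
    df[f'total_{s}'] = total.where(mask)
    df[f'date_{s}'] = per_day.where(mask)

print(df)
```

The grouping is done once per column pair instead of once per row, which tends to matter on larger frames.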
The last two real numbers in each row of my data were measured with error, and I want to replace them with np.nan. The number of real numbers differs by row (i.e., each row already has a different number of NaNs). Column headers indicate the measurement number, and the index is the experimental trial. The value in a cell is a measurement reading. Some trials had more measurement readings than others; thus, some rows have more NaNs than others. The code below creates a data frame similar to mine.
import pandas as pd
import numpy as np
data = np.array(([1, 2, 3, 4, 5, 2, np.nan],
                 [2, 2, 3, 2, 3, np.nan, np.nan],
                 [4, 4, 5, 1, np.nan, np.nan, np.nan]))
df1 = pd.DataFrame(data, columns = ['0','1','2','3','4','5','6'])
The data frame produced by this code, which looks similar to mine:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 5.0 2.0 NAN
1 2.0 2.0 3.0 2.0 3.0 NAN NAN
2 4.0 4.0 5.0 1.0 NAN NAN NAN
This is what I want the new data frame to look like:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 NAN NAN NAN
1 2.0 2.0 3.0 NAN NAN NAN NAN
2 4.0 4.0 NAN NAN NAN NAN NAN
I have tried counting the NaNs and using that to locate the positions of the last and second-to-last numeric values, but it gets me nowhere.
Ultimately, I want to ignore the NaNs in the original data frame, take the last two real values in each row, and replace them with np.nan, so that the original data frame looks like the new data frame in the examples above. One of the main issues is that the position of the last two real numbers can differ by row.
Method #1 would be simply to shift everything over by 2 and keep the values which remain non-null:
In [61]: df.where(df.shift(-2, axis=1).notnull())
Out[61]:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 NaN NaN NaN
1 2.0 2.0 3.0 NaN NaN NaN NaN
2 4.0 4.0 NaN NaN NaN NaN NaN
Method #2 would be to count the number of non-null values from the right, and only keep non-null values after the second:
In [62]: df.where((df.notnull().iloc[:, ::-1].cumsum(axis=1) > 2))
Out[62]:
0 1 2 3 4 5 6
0 1.0 2.0 3.0 4.0 NaN NaN NaN
1 2.0 2.0 3.0 NaN NaN NaN NaN
2 4.0 4.0 NaN NaN NaN NaN NaN
This isn't as pretty, but would allow for finer levels of customization if we needed to shift differently for each row, for example if it weren't true that we had a row of non-null values followed by null values.
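To illustrate that last point, here is a small sketch on a made-up row that has a NaN in the middle, which is not a case from the question. The frame row_with_gap and the other names are hypothetical. The counting approach still removes exactly the last two real values, while the plain shift does not.

```python
import numpy as np
import pandas as pd

# Hypothetical row with an interior NaN; its last two real values are 4 and 5.
row_with_gap = pd.DataFrame(
    [[1, 2, np.nan, 4, 5, np.nan, np.nan]],
    columns=list("0123456"), dtype=float)

# Method #1: keep a cell only if the cell two places to its right is non-null.
shifted = row_with_gap.where(row_with_gap.shift(-2, axis=1).notnull())

# Method #2: count non-null cells from the right and keep a cell once
# more than two non-null cells have been seen at or to the right of it.
counts_from_right = (row_with_gap.notnull().astype(int)
                     .iloc[:, ::-1].cumsum(axis=1))
counts_from_right = counts_from_right[row_with_gap.columns]  # restore column order
counted = row_with_gap.where(counts_from_right > 2)

print(shifted)
print(counted)
```

On this row the counting version keeps 1 and 2, whereas the shift version drops the 1, because the cell two places to its right is the interior NaN.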
I am looking to use pd.rolling_mean in a groupby operation. In each group, I want a rolling mean of the previous elements within the same group. Here is an example:
id val
0 1
0 2
0 3
1 4
1 5
2 6
Grouping by id, this should be transformed into:
id val
0 nan
0 1
0 1.5
1 nan
1 4
2 nan
I believe you want pd.Series.expanding
df.groupby('id').val.apply(lambda x: x.expanding().mean().shift())
0 NaN
1 1.0
2 1.5
3 NaN
4 4.0
5 NaN
Name: val, dtype: float64
I think you need groupby with shift and rolling; the window size can be set to a scalar:
df['val']=df.groupby('id')['val'].apply(lambda x: x.shift().rolling(2, min_periods=1).mean())
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN
Thank you 3novak for the comment - you can set the window size to the length of the largest group:
f = lambda x: x.shift().rolling(df['id'].value_counts().iloc[0], min_periods=1).mean()
df['val'] = df.groupby('id')['val'].apply(f)
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN
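For reference, a self-contained sketch of the same idea using transform instead of apply. The assumption here is that on recent pandas versions apply may return a result indexed by the group keys, which would not assign straight back to a column, whereas transform always returns a result aligned with the original index. The column name prev_mean is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 0, 0, 1, 1, 2],
                   "val": [1, 2, 3, 4, 5, 6]})

# Within each id: mean of all values seen so far, shifted down by one row
# so that each row only sees the elements before it.
df["prev_mean"] = (df.groupby("id")["val"]
                     .transform(lambda s: s.expanding().mean().shift()))
print(df)
```

The first row of every group has no previous elements, so it comes out as NaN.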
This is an example of the existing Data Frame:
A B C t
0 2.0 NaN NaN 0.2
1 NaN 1.0 NaN 0.2
2 NaN NaN 3.0 0.2
3 2.0 NaN NaN 0.2
4 NaN 1.0 NaN 0.2
5 NaN NaN 3.0 0.2
What I would like to have as a result looks like this:
A B C t
0 2 1 3 0.2
1 2 1 3 0.6
In this case, the rows with index 1 and 2 are merged into the first row. This should also work for longer DataFrames with the same shape.
In addition, the timestamp (column 't') is the relative timestamp between rows. This means the timestamps have to be added up when rows are merged.
Thanks for the answers, and sorry for the bad English :)
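One possible reading of the expected output, sketched under explicit assumptions: a new block starts at every row where column A is non-null, and the t of a merged row is the time elapsed since the start of the previous block (the first block keeps its own t). That reading reproduces the 0.2 and 0.6 shown above up to floating-point rounding, but it is a guess at the intent. The names block, start and merged are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2.0, np.nan, np.nan, 2.0, np.nan, np.nan],
    "B": [np.nan, 1.0, np.nan, np.nan, 1.0, np.nan],
    "C": [np.nan, np.nan, 3.0, np.nan, np.nan, 3.0],
    "t": [0.2] * 6,
})

# Assumption: every non-null value in A opens a new block of rows.
block = (df["A"].notna().cumsum() - 1).rename("block")

# first() takes the first non-null value per column within each block.
merged = df.groupby(block)[["A", "B", "C"]].first()

# Absolute time at the first row of each block, then back to relative steps.
start = df["t"].cumsum().groupby(block).first()
merged["t"] = start.diff().fillna(start.iloc[0])

merged = merged.reset_index(drop=True)
print(merged)
```

If the blocks are known to be exactly three rows long, the block label could instead be derived from the row position, for example with integer division of the index by 3.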