I'm trying to reshape this sample dataframe from long to wide format, without aggregating any of the data.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'SubjectID': ['A', 'A', 'A', 'B', 'B', 'C', 'A'],
    'Date': ['2010-03-14', '2010-03-15', '2010-03-16', '2010-03-14',
             '2010-05-15', '2010-03-14', '2010-03-14'],
    'Var1': [1, 12, 4, 7, 90, 1, 9],
    'Var2': [0, 0, 1, 1, 1, 0, 1],
    'Var3': [np.nan, 1, 0, np.nan, 0, 1, np.nan],
})
df['Date'] = pd.to_datetime(df['Date'])
df
Date SubjectID Var1 Var2 Var3
0 2010-03-14 A 1 0 NaN
1 2010-03-15 A 12 0 1.0
2 2010-03-16 A 4 1 0.0
3 2010-03-14 B 7 1 NaN
4 2010-05-15 B 90 1 0.0
5 2010-03-14 C 1 0 1.0
6 2010-03-14 A 9 1 NaN
To get around the duplicate values, I'm grouping by the "Date" column and getting the cumulative count for each value. Then I make a pivot table
df['idx'] = df.groupby('Date').cumcount()
dfp = df.pivot_table(index = 'SubjectID', columns = 'idx'); dfp
Var1 Var2 Var3
idx 0 1 2 3 0 1 2 3 0 2
SubjectID
A 5.666667 NaN NaN 9.0 0.333333 NaN NaN 1.0 0.5 NaN
B 90.000000 7.0 NaN NaN 1.000000 1.0 NaN NaN 0.0 NaN
C NaN NaN 1.0 NaN NaN NaN 0.0 NaN NaN 1.0
However, I want the idx column index to be the values from the "Date" column and I don't want to aggregate any data. The expected output is
Var1_2010-03-14 Var1_2010-03-14 Var1_2010-03-15 Var1_2010-03-16 Var1_2010-05-15 Var2_2010-03-14 Var2_2010-03-14 Var2_2010-03-15 Var2_2010-03-16 Var2_2010-05-15 Var3_2010-03-14 Var3_2010-03-14 Var3_2010-03-15 Var3_2010-03-16 Var3_2010-05-15
SubjectID
A 1 9 12 4 NaN 0 1 0 1.0 NaN NaN NaN 1.0 0.0 NaN
B 7.0 NaN NaN NaN 90 1 NaN NaN 1.0 NaN NaN NaN NaN NaN 0.0
C 1 NaN NaN NaN NaN 0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
How can I do this? Eventually, I'll merge the two column levels with dfp.columns = [col[0] + '_' + str(col[1]) for col in dfp.columns].
You are on the correct path:
# group
df['idx'] = df.groupby('Date').cumcount()
# set index and unstack
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
# drop the idx level from the columns
new.columns = new.columns.droplevel(1)
new.columns = [f'{val}_{date}' for val, date in new.columns]
I think this gives your expected output.
Using map looks like it will be a little faster:
df['idx'] = df.groupby('Date').cumcount()
df['Date'] = df['Date'].astype(str)
new = df.set_index(['idx','Date', 'SubjectID']).unstack(level=[0,1])
new.columns = new.columns.droplevel(1)
#new.columns = [f'{val}_{date}' for val, date in new.columns]
new.columns = new.columns.map('_'.join)
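If you would rather keep Date as a real datetime column, you can skip the astype(str) step and format the timestamps while joining the labels. A minimal sketch (the date format string is my assumption):
new.columns = new.columns.map(lambda c: f'{c[0]}_{c[1]:%Y-%m-%d}')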
Here is a 50,000 row test example:
#data
data = pd.DataFrame(pd.date_range('2000-01-01', periods=50000, freq='D'))
data['a'] = list('abcd')*12500
data['b'] = 2
data['c'] = list('ABCD')*12500
data.rename(columns={0:'date'}, inplace=True)
# list comprehension:
%%timeit -r 3 -n 200
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = [f'{x}_{y}' for x,y in new.columns]
# 98.2 ms ± 13.3 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
# map with join:
%%timeit -r 3 -n 200
data['date'] = data['date'].astype(str)
new = data.set_index(['a','date','c']).unstack(level=[0,1])
new.columns = new.columns.droplevel(0)
new.columns = new.columns.map('_'.join)
# 84.6 ms ± 3.87 ms per loop (mean ± std. dev. of 3 runs, 200 loops each)
For a df like the one below, I use pct_change() to calculate rolling percentage changes:
price = [np.NaN, 10, 13, np.NaN, np.NaN, 9]
df = pd.DataFrame(price, columns=['price'])
df
Out[75]:
price
0 NaN
1 10.0
2 13.0
3 NaN
4 NaN
5 9.0
But I get these unexpected results:
df.price.pct_change(periods = 1, fill_method='bfill')
Out[76]:
0 NaN
1 0.000000
2 0.300000
3 -0.307692
4 0.000000
5 0.000000
Name: price, dtype: float64
df.price.pct_change(periods = 1, fill_method='pad')
Out[77]:
0 NaN
1 NaN
2 0.300000
3 0.000000
4 0.000000
5 -0.307692
Name: price, dtype: float64
df.price.pct_change(periods = 1, fill_method='ffill')
Out[78]:
0 NaN
1 NaN
2 0.300000
3 0.000000
4 0.000000
5 -0.307692
Name: price, dtype: float64
I would like the result to be NaN wherever a NaN is involved in the calculation, instead of the values being filled forward or backward and then calculated.
How can I achieve this? Thanks.
The expected result:
0 NaN
1 NaN
2 0.300000
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html
Maybe you can compute the percentage change manually with diff and shift:
period = 1
pct = df.price.diff(period).div(df.price.shift(period))
print(pct)
# Output
0 NaN
1 NaN
2 0.3
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
Update: you can pass fill_method=None
period = 1
pct = df.price.pct_change(periods=period, fill_method=None)
print(pct)
# Output
0 NaN
1 NaN
2 0.3
3 NaN
4 NaN
5 NaN
Name: price, dtype: float64
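Note that recent pandas releases have been deprecating the implicit filling in pct_change, so passing fill_method=None explicitly is also the forward-compatible spelling. It works the same way on a whole frame, as a minimal sketch:
df.pct_change(periods=period, fill_method=None)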
I'm trying to split a dataframe wherever all-NaN rows occur, using grps = dfs.isnull().all(axis=1).cumsum().
But this is not working when some of the rows have a NaN entry in only a single column.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, 0.3, np.nan, 2, 3, 1],
}
df = pd.DataFrame(d)
dup = df['t'].diff().lt(0).cumsum()
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
pprint(dfs)
grps = dfs.isnull().all(axis=1).cumsum()
temp = [dfs.dropna() for _, dfs in dfs.groupby(grps)]
i = 0
dfm = pd.DataFrame()
for df in temp:
df["name"] = f'name{i}'
i=i+1
df = df.append(pd.Series(dtype='object'), ignore_index=True)
dfm = dfm.append(df, ignore_index=True)
print(dfm)
Input df:
t input type value
0 0.0 2.0 A 0.1
1 1.0 2.0 A 0.2
2 2.0 2.0 A 0.3
NaN NaN NaN NaN
3 0.0 2.0 B NaN
4 2.0 2.0 B 2.0
NaN NaN NaN NaN
5 0.0 2.0 B 3.0
6 1.0 4.0 A 1.0
Output obtained:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 2.0 2.0 B 2.0 name1
5 NaN NaN NaN NaN NaN
6 0.0 2.0 B 3.0 name2
7 1.0 4.0 A 1.0 name2
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
Expected:
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
I am basically doing this to append names as the last column of the dataframe after splitting df using
dfs = (
df.groupby(dup, as_index=False, group_keys=False)
.apply(lambda x: pd.concat([x, pd.Series(index=x.columns, name='').to_frame().T]))
)
and appending NaN rows.
Again, I use the NaN rows to split the df into a list and add a new column. But dfs.isnull().all(axis=1).cumsum() isn't working for me, and I also get an extra NaN row at the end of the output.
Suggestions on how to get the expected output will be really helpful.
Setup
df = pd.DataFrame(d)
print(df)
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 0.3
3 0 2 B NaN
4 2 2 B 2.0
5 0 2 B 3.0
6 1 4 A 1.0
Simplify your approach
# assign name column before splitting
m = df['t'].diff().lt(0)
df['name'] = 'name' + m.cumsum().astype(str)
# Create null dataframes to concat
nan_rows = pd.DataFrame(index=m[m].index)
last_nan_row = pd.DataFrame(index=df.index[[-1]])
# Concat and sort index
df_out = pd.concat([nan_rows, df, last_nan_row]).sort_index(ignore_index=True)
Result
t input type value name
0 0.0 2.0 A 0.1 name0
1 1.0 2.0 A 0.2 name0
2 2.0 2.0 A 0.3 name0
3 NaN NaN NaN NaN NaN
4 0.0 2.0 B NaN name1
5 2.0 2.0 B 2.0 name1
6 NaN NaN NaN NaN NaN
7 0.0 2.0 B 3.0 name2
8 1.0 4.0 A 1.0 name2
9 NaN NaN NaN NaN NaN
Alternatively, if you still want to start from dfs (the frame that already has the NaN separator rows inserted), here is another approach:
dfs = dfs.reset_index(drop=True)
m = dfs.isna().all(1)
dfs.loc[~m, 'name'] = 'name' + m.cumsum().astype(str)
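If you also need the list of per-block frames (the temp list from the question), the same mask can drive the split; note dropna(how='all'), so a row with a NaN in only one column (like the B row with a missing value) is kept. A sketch:
temp = [g.dropna(how='all') for _, g in dfs.groupby(m.cumsum())]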
I have the following dataframe:
data = pd.DataFrame({
'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
'2010-04-02 12:01:00', '2010-04-01 07:24:00', '2011-01-01 12:01:00',
'2011-01-01 01:27:33', '2013-01-01 12:01:00', '2014-01-01 12:01:00',
'2014-01-01 01:27:33', '2015-01-01 01:27:33', '2016-01-01 01:27:33',
'2011-01-01 01:28:00'],
'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
And I would like to get the following columns:
1- sum_order_total_1: the sum of order for each ID over the rows where sort is 1, with NaN on the rows where sort is 0
2- sum_order_total_0: the sum of order for each ID over the rows where sort is 0, with NaN on the rows where sort is 1
3- count_order_date_1: the sum of order for each ID and Date_Time over the rows where sort is 1, with NaN on the rows where sort is 0
4- count_order_date_0: the sum of order for each ID and Date_Time over the rows where sort is 0, with NaN on the rows where sort is 1
The expected results should look like the attached photo:
The problem with groupby (and pd.pivot_table) is that they only do half of the job: they give you the numbers, but not in the format you want. To finish the formatting you can use apply.
For the total counts I used:
# Aggregate the data, similar to the groupby query you provided.
data_total = pd.pivot_table(df, values='order', index=['ID'], columns=['sort'], aggfunc=np.sum)
data_total.reset_index(inplace=True)
Which results in the table:
sort ID 0 1
0 1 6.0 11.0
1 2 15.0 NaN
2 3 NaN 9.0
3 4 3.0 2.0
4 5 5.0 NaN
5 6 8.0 6.0
Now, using this as a lookup table (keyed by 'ID' and by 0 or 1 for sort), we can write a small function that fills in the right value:
def filter_count(data, row, sort_value):
""" Select the count that belongs to the correct ID and sort combination. """
if row['sort'] == sort_value:
return data[data['ID'] == row['ID']][sort_value].values[0]
return np.NaN
# Applying the above function for both sort values 0 and 1.
df['total_0'] = df.apply(lambda row: filter_count(data_total, row, 0), axis=1, result_type='expand')
df['total_1'] = df.apply(lambda row: filter_count(data_total, row, 1), axis=1, result_type='expand')
This leads to:
ID Date_Time order sort total_1 total_0
0 1 2010-01-01 12:01:00 2 1 11.0 NaN
1 1 2010-01-01 01:27:33 4 1 11.0 NaN
2 1 2010-04-02 12:01:00 5 1 11.0 NaN
3 1 2010-04-01 07:24:00 6 0 NaN 6.0
4 2 2011-01-01 12:01:00 7 0 NaN 15.0
5 2 2011-01-01 01:27:33 8 0 NaN 15.0
6 3 2013-01-01 12:01:00 9 1 9.0 NaN
7 4 2014-01-01 12:01:00 2 1 2.0 NaN
8 4 2014-01-01 01:27:33 3 0 NaN 3.0
9 5 2015-01-01 01:27:33 5 0 NaN 5.0
10 6 2016-01-01 01:27:33 6 1 6.0 NaN
11 6 2011-01-01 01:28:00 8 0 NaN 8.0
Now we can apply the same logic to the date, except that the date also contains hours, minutes and seconds, which can be stripped out:
# Since we are interested in a per-day basis, remove the hour/minute/second part
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.strftime('%Y/%m/%d')
Now applying the same trick as above, we create a new pivot table, based on the 'ID' and 'order_day':
data_date = pd.pivot_table(df, values='order', index=['ID', 'order_day'], columns=['sort'], aggfunc=np.sum)
data_date.reset_index(inplace=True)
Which is:
sort ID order_day 0 1
0 1 2010/01/01 NaN 6.0
1 1 2010/04/01 6.0 NaN
2 1 2010/04/02 NaN 5.0
3 2 2011/01/01 15.0 NaN
4 3 2013/01/01 NaN 9.0
5 4 2014/01/01 3.0 2.0
6 5 2015/01/01 5.0 NaN
7 6 2011/01/01 8.0 NaN
Writing a second function to fill in the correct value based on 'ID' and 'order_day':
def filter_date(data, row, sort_value):
if row['sort'] == sort_value:
return data[(data['ID'] == row['ID']) & (data['order_day'] == row['order_day'])][sort_value].values[0]
return np.NaN
# Applying the above function for both sort values 0 and 1.
df['date_1'] = df.apply(lambda row: filter_date(data_date, row, 1), axis=1, result_type='expand')
df['date_0'] = df.apply(lambda row: filter_date(data_date, row, 0), axis=1, result_type='expand')
Now we only have to drop the temporary column 'order_day':
df.drop(labels=['order_day'], axis=1, inplace=True)
And the final answer becomes:
ID Date_Time order sort total_1 total_0 date_0 date_1
0 1 2010-01-01 12:01:00 2 1 11.0 NaN NaN 6.0
1 1 2010-01-01 01:27:33 4 1 11.0 NaN NaN 6.0
2 1 2010-04-02 12:01:00 5 1 11.0 NaN NaN 5.0
3 1 2010-04-01 07:24:00 6 0 NaN 6.0 6.0 NaN
4 2 2011-01-01 12:01:00 7 0 NaN 15.0 15.0 NaN
5 2 2011-01-01 01:27:33 8 0 NaN 15.0 15.0 NaN
6 3 2013-01-01 12:01:00 9 1 9.0 NaN NaN 9.0
7 4 2014-01-01 12:01:00 2 1 2.0 NaN NaN 2.0
8 4 2014-01-01 01:27:33 3 0 NaN 3.0 3.0 NaN
9 5 2015-01-01 01:27:33 5 0 NaN 5.0 5.0 NaN
10 6 2016-01-01 01:27:33 6 1 6.0 NaN NaN 6.0
11 6 2011-01-01 01:28:00 8 0 NaN 8.0 8.0 NaN
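For reference, on larger frames the row-wise apply can get slow. The same four columns can also be built with groupby().transform('sum') and a mask; this is only a sketch of that alternative (not the approach above), with column names taken from the question's wording:
s = df['sort'].eq(1)
day = pd.to_datetime(df['Date_Time']).dt.normalize()
# broadcast each group's total back onto its rows, then blank out the other sort value
total = df.groupby(['ID', 'sort'])['order'].transform('sum')
per_day = df.groupby(['ID', 'sort', day])['order'].transform('sum')
df['sum_order_total_1'] = total.where(s)
df['sum_order_total_0'] = total.where(~s)
df['count_order_date_1'] = per_day.where(s)
df['count_order_date_0'] = per_day.where(~s)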
I'm relatively new to pandas and I don't know the best approach to solve my problem.
Well, I have a df with an index, the data in a column called 'Data', and an empty column called 'Sum'.
I need help creating a function that writes, in the 'Sum' column, the sum of each variable-length group of rows of the 'Data' column. The grouping criterion is that a group must not contain any empty rows.
Here an example:
index Data Sum
0 1
1 1 2
2
3
4 1
5 1
6 1 3
7
8 1
9 1 2
10
11 1
12 1
13 1
14 1
15 1 5
16
17 1 1
18
19 1 1
20
As you can see, the length of each group in 'Data' is variable; it could be a single row or any number of rows. The sum must always go at the end of the group. For example, the sum of the group formed by rows 4, 5 and 6 of the 'Data' column should go in row 6 of the 'Sum' column.
Any insight will be appreciated.
UPDATE
The problem was solved by implementing Method 3 suggested by ansev. However, due to a change in the main program, the sum of each block now needs to be at the beginning of the block (when the block has more than one row). So I use the df = df.iloc[::-1] instruction twice, to reverse the frame and then restore the original order. Thank you very much!
df = df.iloc[::-1]
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['Sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
df = df.iloc[::-1]
print(df)
Data Sum
0 1.0 2.0
1 1.0 NaN
2 NaN NaN
3 NaN NaN
4 1.0 3.0
5 1.0 NaN
6 1.0 NaN
7 NaN NaN
8 1.0 2.0
9 1.0 NaN
10 NaN NaN
11 1.0 5.0
12 1.0 NaN
13 1.0 NaN
14 1.0 NaN
15 1.0 NaN
16 NaN NaN
17 1.0 1.0
18 NaN NaN
19 1.0 1.0
20 NaN NaN
We can use GroupBy.cumsum:
# if you need to replace blank strings with NaN first
#df = df.replace(r'^\s*$', np.nan, regex=True)
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
print(df)
index Data sum
0 0 1.0 NaN
1 1 1.0 2.0
2 2 NaN NaN
3 3 NaN NaN
4 4 1.0 NaN
5 5 1.0 NaN
6 6 1.0 3.0
7 7 NaN NaN
8 8 1.0 NaN
9 9 1.0 2.0
10 10 NaN NaN
11 11 1.0 NaN
12 12 1.0 NaN
13 13 1.0 NaN
14 14 1.0 NaN
15 15 1.0 5.0
16 16 NaN NaN
17 17 1.0 1.0
18 18 NaN NaN
19 19 1.0 1.0
20 20 NaN NaN
Method 2
#df = df.drop(columns='index') # if necessary
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))
Method 3
Series.duplicated and Series.mask
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
As you can see, the methods only differ in how they mask the values we don't need in the sum column.
We can also use .transform('sum') instead of .cumsum().
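For example, a sketch of Method 3 with transform('sum'); the extra mask on the all-NaN separator rows keeps the output identical to the table above:
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
# transform('sum') broadcasts each block's total to every row of the block;
# keep it only on each block's last row and blank the separator rows
df['sum'] = df.groupby(blocks)['Data'].transform('sum').mask(m).mask(df['Data'].isnull())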
performance with the sample dataframe
%%timeit
s = df['Data'].isnull()
df['sum'] = df.groupby(s.cumsum())['Data'].cumsum().where((~s) & (s.shift(-1)))
4.52 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
g = df.reset_index().groupby(df['Data'].isnull().cumsum())
df['sum'] = g['Data'].cumsum().where(lambda x: x.index == g['index'].transform('idxmax'))
8.52 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
blocks = df['Data'].isnull().cumsum()
m = blocks.duplicated(keep='last')
df['sum'] = df.groupby(blocks)['Data'].cumsum().mask(m)
3.02 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Code Used for replication
import numpy as np
import pandas as pd
data = {'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(data)
Iterative Approach Solution
count = 0
for i in range(df.shape[0]):
    if df.iloc[i, 0] == 1:
        count += 1
    elif i != 0 and count != 0:
        df.at[i - 1, 'Sum'] = count
        print(count)
        count = 0
Create a new column that is equal to the index at the data gaps and NaN otherwise:
df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)
Backfill the column, count the non-null Data values in each identically labeled span, and redefine the column:
df.Sum = df.groupby(df.Sum.bfill())['Data'].count()
Align the new column with the original data:
df.Sum = df.Sum.shift(-1)
Eliminate 0-length spans:
df.loc[df.Sum == 0, 'Sum'] = np.nan
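For reference, here is the whole recipe in one runnable piece on the sample frame from the replication code above (a sketch, assuming the single 'Data' column layout):
import numpy as np
import pandas as pd
data = {'Data': [1, 1, np.nan, np.nan, 1, 1, 1, np.nan, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1, np.nan]}
df = pd.DataFrame(data)
# 1. Mark each gap row with its own index value, NaN everywhere else.
df.loc[:, 'Sum'] = np.where(df.Data.isnull(), df.index, np.nan)
# 2. Backfill the markers and count the non-null Data values in each span.
df.Sum = df.groupby(df.Sum.bfill())['Data'].count()
# 3. Shift the counts onto the last data row of each span.
df.Sum = df.Sum.shift(-1)
# 4. Drop the 0-length spans produced by consecutive gap rows.
df.loc[df.Sum == 0, 'Sum'] = np.nan
print(df)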
I have a couple of toy dataframes I can stack using df.append, but I also need to keep track of which source dataframe each row came from, in its own column. I can't seem to find anything about how to do that. Here's what I do have:
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2005
A B C G
0 1 2 3 7
1 2 4 5 8
2 3 5 7 9
3 4 6 8 10
d2006
A B D F
0 2 3 a 7
1 1 1 c 8
2 4 5 d 10
3 5 6 e 12
Then I can stack them like this:
d_combined = d2005.append(d2006, ignore_index = True, sort = True)
d_combined
A B C D F G
0 1 2 3.0 NaN NaN 7.0
1 2 4 5.0 NaN NaN 8.0
2 3 5 7.0 NaN NaN 9.0
3 4 6 8.0 NaN NaN 10.0
4 2 3 NaN a 7.0 NaN
5 1 1 NaN c 8.0 NaN
6 4 5 NaN d 10.0 NaN
7 5 6 NaN e 12.0 NaN
But what I really need is another column naming the source dataframe, added at the right end of d_combined. Something like this:
A B C D G F From
0 1 2 3.0 NaN 7.0 NaN d2005
1 2 4 5.0 NaN 8.0 NaN d2005
2 3 5 7.0 NaN 9.0 NaN d2005
3 4 6 8.0 NaN 10.0 NaN d2005
4 2 3 NaN a NaN 7.0 d2006
5 1 1 NaN c NaN 8.0 d2006
6 4 5 NaN d NaN 10.0 d2006
7 5 6 NaN e NaN 12.0 d2006
Hopefully someone has a quick trick they can share.
Thanks.
This gets what you want but there should be a more elegant way:
df_list = [d2005, d2006]
name_list = ['2005', '2006']
for df, name in zip(df_list, name_list):
df['from'] = name
Then
d_combined = d2005.append(d2006, ignore_index=True)
d_combined
A B C D F G from
0 1 2 3.0 NaN NaN 7.0 2005
1 2 4 5.0 NaN NaN 8.0 2005
2 3 5 7.0 NaN NaN 9.0 2005
3 4 6 8.0 NaN NaN 10.0 2005
4 2 3 NaN a 7.0 NaN 2006
5 1 1 NaN c 8.0 NaN 2006
6 4 5 NaN d 10.0 NaN 2006
7 5 6 NaN e 12.0 NaN 2006
Alternatively, you can set df.name at the time of creation of the df and use it in the for loop.
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]} )
d2005.name = 2005
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
d2006.name = 2006
df_list = [d2005, d2006]
for df in df_list:
df['from'] = df.name
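On newer pandas (2.0+), where DataFrame.append has been removed, pd.concat with keys gives the same stacking plus the source label in one call. A sketch, starting from the original d2005/d2006 and using the literal labels the question asks for:
# Stack with one key per source frame, then turn the key level into a column.
d_combined = pd.concat([d2005, d2006], keys=['d2005', 'd2006'], sort=True)
d_combined = d_combined.reset_index(level=0).rename(columns={'level_0': 'From'}).reset_index(drop=True)
# Move 'From' to the right end, matching the layout in the question.
d_combined = d_combined[[c for c in d_combined.columns if c != 'From'] + ['From']]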
I believe this can be simply achieved by adding the From column to the original dataframes themselves.
So effectively,
d2005 = pd.DataFrame({"A": [1,2,3,4], "B": [2,4,5,6], "C": [3,5,7,8],
"G": [7,8,9,10]})
d2006 = pd.DataFrame({"A": [2,1,4,5], "B": [3,1,5,6], "D": ["a","c","d","e"],
"F": [7,8,10,12]})
Then,
d2005['From'] = 'd2005'
d2006['From'] = 'd2006'
And then you append,
d_combined = d2005.append(d2006, ignore_index = True, sort = True)
gives you something like this: