Pandas: How to extra only latest date in pivot table dataframe - python-3.x

How do I create a new dataframe which only include as index the latest date of the column 'txn_date' for each 'day' based on the pivot table in the picture?
Thank you

d1 = pd.to_datetime(['2016-06-25'] *2 + ['2016-06-28']*4)
df = pd.DataFrame({'txn_date':pd.date_range('2012-03-05 10:20:03', periods=6),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'day':d1}).set_index(['day','txn_date'])
print (df)
B C D E
day txn_date
2016-06-25 2012-03-05 10:20:03 4 7 1 5
2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-07 10:20:03 4 9 5 6
2012-03-08 10:20:03 5 4 7 9
2012-03-09 10:20:03 5 2 1 2
2012-03-10 10:20:03 4 3 0 4
1.
I think you need first sort_index if necessary first, then groupby by level day and aggregate last:
df1 = df.sort_index().reset_index(level=1).groupby(level='day').last()
print (df1)
txn_date B C D E
day
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4
2.
Filter by boolean indexing with duplicated:
#if necessary
df = df.sort_index()
df2 = df[~df.index.get_level_values('day').duplicated(keep='last')]
print(df2)
B C D E
day txn_date
2016-06-25 2012-03-06 10:20:03 5 8 3 3
2016-06-28 2012-03-10 10:20:03 4 3 0 4

Related

How to replenish a data frame based on another one?

Given two data frames. One contains a column of repeated values (a, in this case). The other contains what this value corresponds to (in this example, it corresponds to some "d" values). How do I efficiently replenish the first data frame with a new column, values in which correspond to some existent column, according to a rule recorded in the other data frame. Here is an example code that works really slow:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
temp = d2.loc[d2['a'] == d1.at[i,'a']]
d[i] = temp['d'].array[0]
d1['d'] = d
This is d1 original:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is a resultant d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
Another way is to use map and make only the values you need.
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11

Create a new column with the minimum of other columns on same row

I have the following DataFrame
Input:
A B C D E
2 3 4 5 6
1 1 2 3 2
2 3 4 5 6
I want to add a new column that has the minimum of A, B and C for that row
Output:
A B C D E Goal
2 3 4 5 6 2
1 1 2 3 2 1
2 3 4 5 6 2
I have tried to use
df = df[['A','B','C]].min()
but I get errors about hashing lists and also I think this will be the min of the whole column I only want the min of the row for those specific columns.
How can I best accomplish this?
Use min along the columns with axis=1
Inline solution that produces copy that doesn't alter the original
df.assign(Goal=lambda d: d[['A', 'B', 'C']].min(1))
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Same answer put different
Add column to existing dataframe
new = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = new
df
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Add axis = 1 to your min
df['Goal'] = df[['A','B','C']].min(axis = 1)
you have to define an axis across which you are applying the min function, which would be 1 (columns).
df['ABC_row_min'] = df[['A', 'B', 'C']].min(axis = 1)

In Pandas, how to filter against other dataframe with Multi-Index

I have two dataframes. The first one (df1) has a Multi-Index A,B.
The second one (df2) has those fields A and B as columns.
How do I filter df2 for a large dataset (2 million rows in each) to get only the rows in df2 where A and B are not in the multi index of df1
import pandas as pd
df1 = pd.DataFrame([(1,2,3),(1,2,4),(1,2,4),(2,3,4),(2,3,1)],
columns=('A','B','C')).set_index(['A','B'])
df2 = pd.DataFrame([(7,7,1,2,3),(7,7,1,2,4),(6,6,1,2,4),
(5,5,6,3,4),(2,7,2,2,1)],
columns=('X','Y','A','B','C'))
df1:
C
A B
1 2 3
2 4
2 4
2 3 4
3 1
df2 before filtering:
X Y A B C
0 7 7 1 2 3
1 7 7 1 2 4
2 6 6 1 2 4
3 5 5 6 3 4
4 2 7 2 2 1
df2 wanted result:
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1
Create MultiIndex in df2 by A,B columns and filter by Index.isin with ~ for invert boolean mask with boolean indexing:
df = df2[~df2.set_index(['A','B']).index.isin(df1.index)]
print (df)
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1
Another similar solution with MultiIndex.from_arrays:
df = df2[~pd.MultiIndex.from_arrays([df2['A'],df2['B']]).isin(df1.index)]
Another solution by #Sandeep Kadapa:
df = df2[df2[['A','B']].ne(df1.reset_index()[['A','B']]).any(axis=1)]

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8,b,A], [8, c, A], [6,i,A], [5, h, A], [4,g,A]].
consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and 10 - 5 (also 5) largest
v = df.values
i = df.index.values
k = len(v) - 5
pd.DataFrame(
i[v.argpartition(k, 0)[-k:]],
np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
print(your_dataframe.sort_values(ascending=False)[0:4])

Pandas use variable for column names part 2

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
How can one assign column names to variables for use in referring to said column names?
For example, if I do this:
cols=['A','B']
cols2=['C','D']
I then want to do something like this:
df[cols,'F',cols2]
But the result is this:
TypeError: unhashable type: 'list'
I think you need add column F to list:
allcols = cols + ['F'] + cols2
print df[allcols]
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Or:
print df[cols + ['F'] +cols2]
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Need give a list with columns for reference.
In [48]: df[cols+['F']+cols2]
Out[48]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
and, consider using df.loc[:, cols+['F']+cols2], df.ix[:, cols+['F']+cols2] for slicing.
Python 3 solution:
In [154]: df[[*cols,'F',*cols2]]
Out[154]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5

Resources