Pandas use variable for column names part 2 - python-3.x

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
How can one assign column names to variables for use in referring to said column names?
For example, if I do this:
cols=['A','B']
cols2=['C','D']
I then want to do something like this:
df[cols,'F',cols2]
But the result is this:
TypeError: unhashable type: 'list'

I think you need to add column F to the list:
allcols = cols + ['F'] + cols2
print(df[allcols])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Or:
print(df[cols + ['F'] + cols2])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5

You need to pass a single flat list of column names:
In [48]: df[cols+['F']+cols2]
Out[48]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Also consider df.loc[:, cols+['F']+cols2] for slicing. (Older answers also suggest df.ix[:, cols+['F']+cols2], but .ix was deprecated in pandas 0.20 and removed in pandas 1.0.)

Python 3 solution (iterable unpacking inside a list requires Python 3.5+):
In [154]: df[[*cols,'F',*cols2]]
Out[154]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
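
The answers above all come down to building one flat list of labels before indexing. A minimal sketch of the three equivalent spellings, reusing the question's df, cols and cols2 (the names by_concat, by_unpack and by_loc are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})
cols = ['A', 'B']
cols2 = ['C', 'D']

# df[cols, 'F', cols2] hands pandas one tuple that contains lists,
# which it cannot use as a column key. Flatten to a single list instead.
by_concat = df[cols + ['F'] + cols2]       # list concatenation
by_unpack = df[[*cols, 'F', *cols2]]       # unpacking, Python 3.5+
by_loc = df.loc[:, cols + ['F'] + cols2]   # explicit label-based selection

print(by_concat.columns.tolist())
```

All three select the same columns in the same order, so the choice is mostly a matter of taste.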

Related

How to replenish a data frame based on another one?

Given two data frames: one contains a column of repeated values (a, in this case), and the other records what each value corresponds to (in this example, some "d" values). How do I efficiently add a new column to the first data frame whose values are looked up in the second data frame by matching on the existing column? Here is example code that works but is really slow:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
    temp = d2.loc[d2['a'] == d1.at[i,'a']]
    d[i] = temp['d'].array[0]
d1['d'] = d
This is d1 original:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is a resultant d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
Another way is to use map, which looks up only the one column you need:
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
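
To compare the two suggestions side by side, here is a small sketch that rebuilds d1 and d2 from the tables in the question and applies both. The names merged, lookup and mapped are illustrative only, and the map variant assumes each value of 'a' appears once in d2.

```python
import pandas as pd

d1 = pd.DataFrame({'a': [1, 2, 3, 2, 3],
                   'b': [2, 4, 4, 1, 4],
                   'c': [3, 5, 5, 4, 5]})
d2 = pd.DataFrame({'a': [1, 2, 3], 'd': [7, 8, 11]})

# Left merge: keeps every row of d1 and brings in all other columns of d2.
merged = d1.merge(d2, on='a', how='left')

# map: turn d2 into a lookup Series (index 'a', values 'd'),
# then translate d1['a'] through it. Assumes unique keys in d2.
lookup = d2.set_index('a')['d']
mapped = d1.assign(d=d1['a'].map(lookup))

print(merged)
print(mapped)
```

merge is the more general tool when several columns need to come across; map is handy when exactly one column is being looked up.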

Pandas how to turn each group into a dataframe using groupby

I have a dataframe that looks like this:
A B
1 2
1 3
1 4
2 5
2 6
3 7
3 8
If I call df.groupby('A'), how do I turn each group into its own sub-dataframe? For A=1 it should look like:
A B
1 2
1 3
1 4
for A=2,
A B
2 5
2 6
for A=3,
A B
3 7
3 8
Use get_group:
g=df.groupby('A')
g.get_group(1)
Out[367]:
A B
0 1 2
1 1 3
2 1 4
You are close; you need to convert the groupby object to a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('A')))
print (dfs[1])
A B
0 1 2
1 1 3
2 1 4
print (dfs[2])
A B
3 2 5
4 2 6
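
Iterating over a groupby object yields (key, sub-dataframe) pairs, which is why dict(tuple(...)) works. A short sketch of the same idea written as a dict comprehension, with the question's data rebuilt by hand (dfs_reset is an illustrative name for a variant that renumbers each group's index):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 3],
                   'B': [2, 3, 4, 5, 6, 7, 8]})

# Each iteration step gives the group key and the rows belonging to it.
dfs = {key: group for key, group in df.groupby('A')}

# Optional: start each sub-dataframe's index again at 0.
dfs_reset = {key: group.reset_index(drop=True)
             for key, group in df.groupby('A')}

for key, group in dfs.items():
    print(f"A={key}")
    print(group)
```

By default each sub-dataframe keeps the row labels it had in df, which is why dfs[2] above is indexed 3 and 4.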

Repeating elements in a dataframe

Hi all, I have the following dataframe:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
And I am trying to only repeat the last two rows of the data so that it looks like this:
A | B | C
1 2 3
2 3 4
3 4 5
3 4 5
4 5 6
4 5 6
I have tried using append, concat and repeat to no avail.
repeated = lambda x:x.repeat(2)
df.append(df[-2:].apply(repeated),ignore_index=True)
This returns the following dataframe, which is incorrect:
A | B | C
1 2 3
2 3 4
3 4 5
4 5 6
3 4 5
3 4 5
4 5 6
4 5 6
You can use numpy.repeat to repeat the last two index labels and select them with loc to create df1. Then drop the last 2 rows of the original with iloc and append df1 to what remains:
df1 = df.loc[np.repeat(df.index[-2:].values, 2)]
print (df1)
A B C
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
print (df.iloc[:-2])
A B C
0 1 2 3
1 2 3 4
df = pd.concat([df.iloc[:-2], df1], ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
If you want to keep your own code, use iloc so that only the last 2 rows are repeated and the first rows are taken without them:
repeated = lambda x:x.repeat(2)
df = pd.concat([df.iloc[:-2], df.iloc[-2:].apply(repeated)], ignore_index=True)
print (df)
A B C
0 1 2 3
1 2 3 4
2 3 4 5
3 3 4 5
4 4 5 6
5 4 5 6
Use pd.concat and index slicing with .iloc:
pd.concat([df,df.iloc[-2:]]).sort_values(by='A')
Output:
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
I'm partial to manipulating the index into the pattern we are aiming for then asking the dataframe to take the new form.
Option 1
Use pd.DataFrame.reindex
df.reindex(df.index[:-2].append(df.index[-2:].repeat(2)))
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Same thing in multiple lines
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.reindex(idx)
Could also use loc
i = df.index
idx = i[:-2].append(i[-2:].repeat(2))
df.loc[idx]
Option 2
Reconstruct from values. Only do this if all dtypes are the same.
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
pd.DataFrame(df.values[idx], df.index[idx])
0 1 2
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
Option 3
You can also pass a NumPy array of positions to iloc
i = np.arange(len(df))
idx = np.append(i[:-2], i[-2:].repeat(2))
df.iloc[idx]
A B C
0 1 2 3
1 2 3 4
2 3 4 5
2 3 4 5
3 4 5 6
3 4 5 6
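
The index-manipulation idea generalises to any number of trailing rows and repeats. A minimal sketch, assuming a hypothetical helper named repeat_last_rows (not part of pandas) and n_rows of at least 1:

```python
import numpy as np
import pandas as pd

def repeat_last_rows(frame, n_rows=2, times=2):
    """Hypothetical helper: repeat the last n_rows rows `times` times each."""
    positions = np.arange(len(frame))
    # One copy of every row, except `times` copies of the trailing rows.
    counts = np.ones(len(frame), dtype=int)
    counts[-n_rows:] = times
    return frame.iloc[np.repeat(positions, counts)].reset_index(drop=True)

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5], 'C': [3, 4, 5, 6]})
result = repeat_last_rows(df)
print(result)
```

Because the rows are selected by position with iloc, this keeps each column's dtype and keeps the repeated rows next to their originals without any sorting step.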

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8,b,A], [8, c, A], [6,i,A], [5, h, A], [4,g,A]].
Consider the dataframe df:
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
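I'm going to use np.argpartition to split each column into its 5 smallest and its 10 - 5 = 5 largest values, and then look up the index labels of the largest ones. Note that the labels within each column are not returned in sorted order: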
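v = df.values
i = df.index.values
n = 5            # how many of the largest values to keep
k = len(v) - n   # partition point: rows from position k onward hold the n largest
pd.DataFrame(
    i[v.argpartition(k, 0)[-n:]],
    np.arange(n), df.columns
)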
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
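For a single column, you can also sort it in descending order and take the first five rows (column 'A' is used here as an example):
print(your_dataframe['A'].sort_values(ascending=False).head(5))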
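
To get exactly the [value, index, column] triples the question asks for, one option is Series.nlargest per column. This is a sketch rather than either answer's method: top_n is a hypothetical helper name, and the data frame is typed in from the table above instead of relying on the random seed.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [0, 8, 8, 0, 3, 3, 4, 5, 6, 2],
     'B': [2, 7, 6, 4, 2, 6, 5, 9, 4, 6],
     'C': [7, 0, 0, 9, 4, 7, 3, 8, 7, 6],
     'D': [3, 6, 2, 7, 3, 7, 7, 7, 6, 5]},
    index=list('abcdefghij'))

def top_n(frame, n=5):
    """Hypothetical helper: [value, index label, column name] for the
    n largest entries of every column, largest first."""
    rows = []
    for col in frame.columns:
        for idx, value in frame[col].nlargest(n).items():
            rows.append([value, idx, col])
    return rows

top = top_n(df)
print(top[:5])   # the five entries for column 'A'
```

nlargest returns the values in descending order and, with its default keep='first', breaks ties by order of appearance, so the two 8s in column A come out as b and then c.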

pandas moving aggregate string

from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), sep=r'\s+')
I want to create a column that shows the cumulative concatenation of the state column, by id.
id months state result
1 1 C C
1 2 3 C3
1 3 6 C36
1 4 9 C369
2 1 C C
2 2 C CC
2 3 3 CC3
2 4 6 CC36
2 5 9 CC369
2 6 9 CC3699
2 7 9 CC36999
2 8 C CC36999C
Basically, a cumulative concatenation of a string column. What is the best way to do it?
So long as the dtype is str then you can do the following:
In [17]:
df['result']=df.groupby('id')['state'].transform(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially, we group by the 'id' column and compute a cumulative sum within each group. On string values, the cumulative sum is a cumulative concatenation. The result is a Series with its index aligned to the original df, so you can add it as a column.
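
If cumsum is not available for the column's dtype in a given pandas version, the same running concatenation can be written with itertools.accumulate. A minimal sketch on a shortened version of the question's data; cumulative_concat is a hypothetical helper name, not a pandas function:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2],
                   'state': ['C', '3', '6', '9', 'C', 'C', '3', '6']})

def cumulative_concat(series):
    """Hypothetical helper: running string concatenation of one group."""
    # accumulate() uses + by default, which concatenates strings.
    parts = accumulate(series.astype(str))
    return pd.Series(list(parts), index=series.index)

# transform keeps the original row index, so the result can be assigned back.
df['result'] = df.groupby('id')['state'].transform(cumulative_concat)
print(df)
```

The astype(str) call is there so that a column read in as numbers is still concatenated rather than added.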

Resources