Slicing a pandas dataframe - python-3.x

import pandas as pd
x = pd.DataFrame([[1,2,3],[4,5,6]])
x[::2]
What does the above command mean, and how does it work?

It is easier to see with more data. The slice [::2] takes every second row starting from the first, so it returns only the rows at even positions:
x = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[0,1,2]])
print (x)
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
3 0 1 2
print (x[::2])
0 1 2
0 1 2 3
2 7 8 9
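As a small sketch of how the slice is read: a bare slice inside the square brackets selects rows by position using start:stop:step, so it should behave like the explicit positional form with iloc. The variable names below are only for illustration.

```python
import pandas as pd

x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 2]])

# A bare slice inside [] selects rows by position: start:stop:step.
every_second = x[::2]          # rows at positions 0 and 2
same_with_iloc = x.iloc[::2]   # explicit positional form of the same slice
odd_rows = x.iloc[1::2]        # start at position 1 instead: rows 1 and 3

print(every_second.index.tolist())
print(odd_rows.index.tolist())
```

Using iloc makes it explicit that the selection is positional, which helps when the index labels are not 0, 1, 2, ...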

Related

Removing Suffix From Dataframe Column Names - Python

I am trying to remove a suffix from all columns in a dataframe, however I am getting error messages. Any suggestions would be appreciated.
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df.add_suffix('_x')
def strip_right(df.columns, _x):
if not text.endswith("_x"):
return text
# else
return text[:len(df.columns)-len("_x")]
Error:
def strip_right(tmp, "_x"):
^
SyntaxError: invalid syntax
I've also tried removing the quotations.
def strip_right(df.columns, _x):
if not text.endswith(_x):
return text
# else
return text[:len(df.columns)-len(_x)]
Error:
def strip_right(df.columns, _x):
^
SyntaxError: invalid syntax
Here is a more concrete example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
print ("With Suffix")
print(df.head())
def strip_right(df, suffix='_x'):
    df.columns = df.columns.str.rstrip(suffix)
strip_right(df)
print ("\n\nWithout Suffix")
print(df.head())
Output:
With Suffix
A_x B_x C_x D_x
0 0 7 0 2
1 5 1 8 5
2 6 2 0 1
3 6 6 5 6
4 8 6 5 8
Without Suffix
A B C D
0 0 7 0 2
1 5 1 8 5
2 6 2 0 1
3 6 6 5 6
4 8 6 5 8
I found a bug in the implementation of the accepted answer. The docs for pandas.Series.str.rstrip() reference str.rstrip(), which states:
"The chars argument is not a suffix; rather, all combinations of its values are stripped."
Instead I had to use pandas.Series.str.replace to remove the actual suffix from my column names. See the modified example below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
df = df.add_suffix('_x')
df['Ex_'] = np.random.randint(0,10,size=(10, 1))
df1 = pd.DataFrame(df, copy=True)
print ("With Suffix")
print(df1.head())
def strip_right(df, suffix='_x'):
    df.columns = df.columns.str.rstrip(suffix)
strip_right(df1)
print ("\n\nAfter .rstrip()")
print(df1.head())
def replace_right(df, suffix='_x'):
    df.columns = df.columns.str.replace(suffix+'$', '', regex=True)
print ("\n\nWith Suffix")
print(df.head())
replace_right(df)
print ("\n\nAfter .replace()")
print(df.head())
Output:
With Suffix
A_x B_x C_x D_x Ex_
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
After .rstrip()
A B C D E
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
After .replace()
A B C D Ex_
0 4 9 2 3 4
1 1 6 5 8 6
2 2 5 2 3 6
3 1 4 7 6 4
4 3 9 3 5 8
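For comparison, here is a minimal sketch that avoids both character stripping and regular expressions by checking the suffix with plain string methods. The helper name strip_suffix and the three-column frame are made up for illustration; they are not from the answers above.

```python
import pandas as pd

def strip_suffix(name, suffix="_x"):
    """Return name without a trailing suffix; leave other names untouched."""
    if suffix and name.endswith(suffix):
        return name[: -len(suffix)]
    return name

# Hypothetical frame: two suffixed columns and one that merely contains '_' and 'x'.
df = pd.DataFrame([[1, 2, 3]], columns=["A_x", "B_x", "Ex_"])
df.columns = [strip_suffix(c) for c in df.columns]
print(list(df.columns))
```

Because the suffix is compared literally, characters that are special in a regex (such as '.' or '+') need no escaping here. On Python 3.9 and later, the built-in str.removesuffix does the same job as the helper.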

Pandas how to turn each group into a dataframe using groupby

I have a dataframe that looks like this:
A B
1 2
1 3
1 4
2 5
2 6
3 7
3 8
If I call df.groupby('A'), how do I turn each group into a sub-dataframe, so that it looks like this for A=1:
A B
1 2
1 3
1 4
for A=2,
A B
2 5
2 6
for A=3,
A B
3 7
3 8
Use get_group:
g=df.groupby('A')
g.get_group(1)
Out[367]:
A B
0 1 2
1 1 3
2 1 4
You are close. You need to convert the groupby object to a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('A')))
print (dfs[1])
A B
0 1 2
1 1 3
2 1 4
print (dfs[2])
A B
3 2 5
4 2 6
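If the sub-dataframes are only needed one at a time, a groupby object can also be iterated directly. The sketch below assumes the same A/B data as the question; the names key, sub and sizes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 3, 3], "B": [2, 3, 4, 5, 6, 7, 8]})

# Iterating a groupby yields (key, sub-DataFrame) pairs.
for key, sub in df.groupby("A"):
    print(f"A={key}")
    print(sub.reset_index(drop=True))  # renumber rows from 0 within each group

# Example of using the pairs: number of rows in each group.
sizes = {key: len(sub) for key, sub in df.groupby("A")}
```

This avoids building the whole dictionary up front when each group is processed once and then discarded.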

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new to Python.
I am trying to compute a cumulative sum for each client to count the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1s therefore needs to reset whenever there is a 0. It also needs to reset when a new client starts. See the example below, where a is the client column and b holds the dates.
After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume I need to combine them.
Adapting the code from 'Cumsum reset at NaN' to reset at 0 works:
cumsum = v.cumsum().ffill()
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I can't get it to work with a groupby: the count just carries on across clients.
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
                   'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
                   'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
This should result in a dataframe with the columns a, b, c and d, with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply, and within each group label the runs of contiguous equal values with (x != x.shift()).cumsum(). Then groupby.cumcount numbers the rows within each run starting from 0, so add 1 to count from 1.
Multiplying by the original column acts as a logical AND: rows where c is 0 become 0, and only the runs of 1s keep their count.
# group_keys=False keeps the original row index so the result aligns with df
df['d'] = df.groupby('a', group_keys=False)['c'] \
    .apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way is to apply a function after calling expanding() on the groupby object. An expanding window computes a value from the first row of the group up to the current row.
Inside it, functools.reduce applies a two-argument function cumulatively to the items of the window, reducing them to a single value: the running count, which resets to 0 whenever a 0 is seen.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
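Since the question stresses calculation time on a large dataset, here is a sketch of a variant that uses only built-in groupby operations and no Python-level lambda. It assumes the flag column c contains only 0 and 1; the name run_id is made up for illustration, and no timing is claimed for it.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "c": [1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
})

# Each 0 starts a new run; counting the zeros seen so far per client labels the runs.
run_id = df["c"].eq(0).astype(int).groupby(df["a"]).cumsum().rename("run_id")

# A cumulative sum inside each (client, run) pair restarts at every 0
# and at every new client.
df["d"] = df.groupby([df["a"], run_id])["c"].cumsum()
print(df["d"].tolist())
```

Each run begins with the 0 that triggered the reset, so the cumulative sum starts again from 0 there and then counts the 1s that follow.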
I think you need a custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
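def f(x):
    # .ix was removed in pandas 1.0; .loc does the same label-based assignment
    x = x.copy()
    x.loc[x.c == 1, 'e'] = 1
    a = x.e.notnull()
    x['e'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return x
print(df.groupby('a', group_keys=False).apply(f))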
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4

Pandas use variable for column names part 2

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
How can one assign column names to variables for use in referring to said column names?
For example, if I do this:
cols=['A','B']
cols2=['C','D']
I then want to do something like this:
df[cols,'F',cols2]
But the result is this:
TypeError: unhashable type: 'list'
I think you need to add column F to the list:
allcols = cols + ['F'] + cols2
print(df[allcols])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
Or:
print(df[cols + ['F'] + cols2])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
You need to pass a single flat list of column names.
In [48]: df[cols+['F']+cols2]
Out[48]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
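Also consider df.loc[:, cols+['F']+cols2] for slicing. (df.ix[:, cols+['F']+cols2] did the same in older pandas, but .ix was removed in pandas 1.0.)
Python 3 solution (iterable unpacking inside a list literal, Python 3.5+):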
In [154]: df[[*cols,'F',*cols2]]
Out[154]:
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
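The variants suggested in the answers above can be put side by side in one short sketch. It assumes the six-column frame from the question; the names by_concat, by_unpack and by_loc are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [1, 3, 5], "E": [5, 3, 6], "F": [7, 4, 3]})
cols = ["A", "B"]
cols2 = ["C", "D"]

by_concat = df[cols + ["F"] + cols2]        # list concatenation
by_unpack = df[[*cols, "F", *cols2]]        # unpacking inside a list literal
by_loc = df.loc[:, cols + ["F"] + cols2]    # explicit label-based form

print(list(by_concat.columns))
```

All three build one flat list of labels before indexing, which is what the indexing operator needs; nesting lists inside the brackets, as in df[cols,'F',cols2], passes a tuple that contains lists instead.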

Pandas use variable for column names [duplicate]

Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
How can I access columns via a variable?
I tried this:
cols='A','B'
df[cols]
...which resulted in this:
KeyError: ('A', 'B')
Bonus Question:
What if my data frame were like this?:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
df
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
and I wanted to do this?:
cols=['A','B']
cols2=['C','D']
df[cols,'F',cols2]
Thanks in advance!
You can subset by a list of column names:
cols=['A','B']
print(df[cols])
A B
0 1 4
1 2 5
2 3 6
It is the same as:
print(df[['A','B']])
A B
0 1 4
1 2 5
2 3 6
Bonus answer:
cols=['A','B']
cols2=['C','D']
allcols = cols + ['F'] + cols2
print(df[allcols])
A B F C D
0 1 4 7 7 1
1 2 5 4 8 3
2 3 6 3 9 5
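To connect this back to the KeyError in the question: writing cols='A','B' creates a tuple, and a tuple key is looked up as a single column label rather than as several columns. The sketch below assumes the three-column frame from the question; the names failed and subset are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

cols = "A", "B"            # this is a tuple, not a list
try:
    df[cols]               # looked up as ONE column labelled ('A', 'B')
    failed = False
except KeyError:
    failed = True

subset = df[list(cols)]    # a list selects several columns
print(failed, list(subset.columns))
```

Converting the tuple with list(cols), or defining cols as a list in the first place, gives the column subset the question was after.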
