Selecting columns by their position while using a function in pandas - python-3.x

My dataframe looks something like this:
import pandas as pd

frame = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'week1_values': [0, 0, 13, 39, 64],
                      'week2_values': [32, 35, 25, 78, 200]})
I am trying to apply a function to calculate the week-over-week percentage difference between two columns ('week1_values' and 'week2_values') whose names are being generated dynamically.
I want to create a function to calculate the percentage difference between weeks keeping in mind the zero values in the 'week1_values' column.
My function is something like this:
def WoW(df):
    if df.iloc[:, 1] == 0:
        return (df.iloc[:, 1] - df.iloc[:, 2])
    else:
        return ((df.iloc[:, 1] - df.iloc[:, 2]) / df.iloc[:, 1]) * 100

frame['WoW%'] = frame.apply(WoW, axis=1)
When I try to do that, I end up with this error:
IndexingError: ('Too many indexers', 'occurred at index 0')
How is it that one is supposed to specify columns by their positions inside a function?
PS: Just want to clarify that since the column names are being generated dynamically, I am trying to select them by their position with the iloc indexer.

Because apply with axis=1 passes each row to the function as a Series, remove the column axis from the indexing (a row Series has only one axis):
def WoW(df):
    if df.iloc[1] == 0:
        return df.iloc[1] - df.iloc[2]
    else:
        return ((df.iloc[1] - df.iloc[2]) / df.iloc[1]) * 100

frame['WoW%'] = frame.apply(WoW, axis=1)
Vectorized alternative:
import numpy as np

s = frame.iloc[:, 1] - frame.iloc[:, 2]
frame['WoW%1'] = np.where(frame.iloc[:, 1] == 0, s, (s / frame.iloc[:, 1]) * 100)
print(frame)
   id  week1_values  week2_values        WoW%       WoW%1
0   1             0            32  -32.000000  -32.000000
1   2             0            35  -35.000000  -35.000000
2   3            13            25  -92.307692  -92.307692
3   4            39            78 -100.000000 -100.000000
4   5            64           200 -212.500000 -212.500000

You can use the pandas pct_change method to automatically compute the percent change.
s = (frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100)
frame['WoW%'] = s.mask(np.isinf(s), frame.iloc[:, -1])
output:
   id  week1_values  week2_values        WoW%
0   1             0            32   32.000000
1   2             0            35   35.000000
2   3            13            25   92.307692
3   4            39            78  100.000000
4   5            64           200  212.500000
Note however that the formula in your custom function is inconsistent: changes from 0->20, 10->12, or 100->120 would all produce a value of 20 (up to sign), which is ambiguous.
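A quick check of that formula on hypothetical week values (not taken from the frame above) illustrates the point:

# standalone restatement of the mixed formula used in WoW(), on made-up values
def wow_value(week1, week2):
    if week1 == 0:
        return week1 - week2
    return (week1 - week2) / week1 * 100

print(wow_value(0, 20), wow_value(10, 12), wow_value(100, 120))
# -20 -20.0 -20.0  -- three very different changes collapse to the same number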
Suggested alternative: use the classical percent increase, even if it produces infinite values when the starting week is 0:
frame['WoW'] = frame.iloc[:, 1:].pct_change(axis=1).iloc[:, -1]*100
output:
   id  week1_values  week2_values         WoW
0   1             0            32         inf
1   2             0            35         inf
2   3            13            25   92.307692
3   4            39            78  100.000000
4   5            64           200  212.500000

Related

Finding which rows have duplicates in a .csv, but only if they have a certain amount of duplicates

I am trying to determine which sequential rows have at least 50 duplicates within one column. Then I would like to be able to read which rows have the duplicates in a summarized manner, e.g.:
start  end  total
    9   60     51
  200  260     60
I'm trying to keep the start and end separate so I can call on them independently later.
I have this to open the .csv file and read its contents:
df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)
df.groupby(0).filter(lambda x: len(x) > 0)
Which gives me this:
0
0 52.0
1 65.0
2 52.0
3 52.0
4 52.0
... ...
4995 8.0
4996 8.0
4997 8.0
4998 8.0
4999 8.0
5000 rows × 1 columns
I'm having a number of problems with this. 1) I'm not sure I totally understand the second line (the groupby/filter). It seems like it is supposed to group the numbers in my column together. This code:
df.groupby(0).count()
gives me this:
0
0.0
1.0
2.0
3.0
4.0
...
68.0
69.0
70.0
71.0
73.0
65 rows × 0 columns
Which I assume means that there are a total of 65 different unique identities in my column. This just doesn't tell me what they are or where they are. I thought that's what this one would do
df.groupby(0).filter(lambda x: len(x) > 0)
but if I change the 0 to anything else then it screws up my generated list.
Problem 2) I think in order to get the number of duplicates in a sequence, and which rows they are in, I would probably need to use a for loop, but I'm not sure how to build it. So far, I've been pulling my hair out all day trying to figure it out but I just don't think I know Python well enough yet.
Can I get some help, please?
UPDATE
Thanks! So this is what I have thanks to @piterbarg:
# function to identify which behaviors have at least 49 frames, and give the
# starting frame, ending frame, and number of frames
def behavior():
    df2 = (df
           .reset_index()
           .shift(periods=-1)
           .groupby((df[0].diff() != 0).cumsum())  # if the diff between a row and the prev row is not 0, increase the cumulative sum
           .agg({0: 'mean', 'index': ['first', 'last', len]}))  # mean is the behavior category
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()                 # drop N/A
              .astype(int)              # type = int
              .reset_index(drop=True))
    print(df3)
out:
      0  index
   mean  first  last  len
0     7     32    87   56
1    19    277   333   57
2     1    785   940  156
3    30   4062  4125   64
4    29   4214  4269   56
5     7   4450  4599  150
6     1   4612  4775  164
7     7   4778  4882  105
8     8   4945  4999   56
The current issue is trying to make it so the dataframe includes the last row of my .csv. If anyone happens to see this, I would love your input!
Let's start by mocking a df:
import numpy as np
import pandas as pd

np.random.seed(314)
df = pd.DataFrame({0: np.random.randint(10, size=5000)})
# make sure we have a couple of large blocks
df.loc[300:400, 0] = 5
df.loc[600:660, 0] = 4
First we identify where the value changes from one row to the next, and group by each run of identical consecutive values. For each run we record where it starts, where it finishes, and its size:
df2 = (df.reset_index()
         .groupby((df[0].diff() != 0).cumsum())
         .agg({'index': ['first', 'last', len]})
       )
Then we pick only those groups that are longer than 50:
(df2.where(df2[('index', 'len')] > 50)
    .dropna()
    .astype(int)
    .reset_index(drop=True)
)
output:
  index
  first  last  len
0   300   400  101
1   600   660   61
As for your question about what df.groupby(0).filter(lambda x: len(x) > 0) does: as far as I can tell, nothing. It groups by the distinct values in column 0 and then discards the groups whose size is 0, which by definition is none of them, so it returns your full df unchanged.
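For contrast, here is a small sketch (made-up values) where the filter threshold actually removes something; note that it counts total occurrences of a value anywhere in the column, not consecutive runs, so it is not quite what you need here:

import pandas as pd

demo = pd.DataFrame({0: [52.0, 52.0, 52.0, 65.0, 8.0]})
# keep only rows whose value appears at least 3 times anywhere in column 0
print(demo.groupby(0).filter(lambda x: len(x) >= 3))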
Edit
Your code is not quite right; it should be:
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())
             .agg({0: 'mean', 'index': ['first', 'last', len]}))
    df3 = (df2.where(df2[('index', 'len')] > 50)
              .dropna()
              .astype(int)
              .reset_index(drop=True))
    print(df3)
Note that we define and print df3, not df2. I also amended the code to report, in the mean column, the value that is repeated in each run (sorry, the names are not very intuitive, but you can change them if you want).
first is the index where the repetition starts, last is the last index, and len is how many elements there are.
# function to identify which behaviors have at least 49 frames, and give the
# starting frame, ending frame, and number of frames
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())  # if the diff between a row and the prev row is not 0, increase the cumulative sum
             .agg({0: 'mean', 'index': ['first', 'last', len]}))  # mean is the behavior category
    # .shift(-1)  # not applied here; see the question about shift() below
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()                 # drop N/A
              .astype(int)              # type = int
              .reset_index(drop=True))
    print(df3)
yields this:
      0  index
   mean  first  last  len
0     7     31    86   56
1    19    276   332   57
2     1    784   939  156
3    31   4061  4124   64
4    29   4213  4268   56
5     8   4449  4598  150
6     1   4611  4774  164
7     8   4777  4881  105
8     8   4944  4999   56
Which I love. I did notice that the group with 56x duplicates of '7' actually starts on row 32, and ends on row 87 (just one later in both cases, and the pattern is consistent throughout the sheet). Am I right in believing that this can be fixed with the shift() function somehow? I'm toying around with this still :D
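If the mismatch is simply 0-based versus 1-based row numbering, one sketch of a fix (hypothetical: it assumes behavior() is changed to return df3 rather than only print it) is to bump the index columns by one after aggregating, instead of shifting:

df3 = behavior()               # hypothetical: behavior() returns df3
df3[('index', 'first')] += 1   # report 1-based start row
df3[('index', 'last')] += 1    # report 1-based end row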

Loop application in a dataframe to find output

I have the following data:
dict = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 233, 29, 2], 'C': [10, 20, 3040, 230, 238], ...}
and
df = pd.DataFrame(dict)
In this manner I have 20 columns with 5 numerical entries in each column.
I want to have a new column whose values follow this logic:
0    A[0]*B[0] + A[0]*C[0] + A[0]*D[0] ...
1    A[1]*B[1] + A[1]*C[1] + A[1]*D[1] ...
2    A[2]*B[2] + A[2]*C[2] + A[2]*D[2] ...
I tried the following, but I cannot manually write out 20 columns, so I wanted to know how to apply a loop to get the desired output:
lst = []
for i in range(0, 5):
    j = df.A[i]*df.B[i] + df.A[i]*df.C[i] + ...
    lst.append(j)
A potential solution is the following. I am only taking the example you posted, but it works fine for more columns. Your data is df:
   A    B     C
0  1   10    10
1  2   20    20
2  3  233  3040
3  4   29   230
4  5    2   238
You can create a new column, D, by first subsetting your dataframe:
add = df.loc[:, df.columns != 'A']
and then multiplying column A by the row-wise sum of those columns (since A*B + A*C + ... = A*(B + C + ...)):
df['D'] = df['A']*add.sum(axis=1)
which returns
   A    B     C     D
0  1   10    10    20
1  2   20    20    80
2  3  233  3040  9819
3  4   29   230  1036
4  5    2   238  1200
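If you would rather keep the loop structure from the question, an equivalent sketch (assuming 'A' is always the column to multiply by and every other column should be included) could look like this:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 233, 29, 2],
                   'C': [10, 20, 3040, 230, 238]})

# accumulate A*col for every column except 'A' itself
total = 0
for col in df.columns:
    if col != 'A':
        total = total + df['A'] * df[col]

df['D'] = total
print(df)  # same D values as above: 20, 80, 9819, 1036, 1200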

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001  Feb-2001  Jan-2002  Feb-2002  ....
     100        30        10  ...
     110        25         1  ...
      40         5        50
      70        11         4
     120        35         2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems, as I see it.
1) Printing the highlighting: this depends on the output target you're trying to get to, be it STDOUT, a file, or some specific program.
2) Identifying outliers based on the column data. It's hard to tell whether you want this based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of whack.
In the example below I provide a method that attacks the data with standard deviation / z-scoring based on the mean of each entire column. You will have to tweak the > and < thresholds to get to your desired state; there are many intricacies to this concept, and I would suggest taking a look at a few resources about the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The styling approach described at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few programs (notebook/HTML front ends).
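As a rough sketch of that styling route (assuming a notebook/HTML front end and reusing the z-score thresholds from the sample code below), something like this could color the flagged cells:

import pandas as pd

df = pd.read_csv("full.txt")

def highlight_outliers(col):
    # same thresholds as the z-score code below; tweak to taste
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: red' if (v > 1.5 or v < -0.5) else '' for v in z]

# Styler.apply runs column-wise by default; the result renders in notebooks,
# or can be exported with .to_html() in recent pandas versions
styled = df.style.apply(highlight_outliers, axis=0)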
To get at the original item, identifying the outliers in your data, you could use something like the code below, which flags values based on standard deviation and z-score.
Sample Code:
import pandas as pd

df = pd.read_csv("full.txt")
original = df.columns
print(df)

for col in original:
    col_zscore = col + "_zscore"
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])

print(df)
Output 1: # prints the original dataframe
   Jan-2001  Feb-2001  Jan-2002
0       100        30        10
1       110        25         1
2        40         5        50
3        70        11         4
4       120        35     20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php

Assigning integers to dataframe fields: `OverflowError: Python int too large to convert to C unsigned long`

I have a dataframe df that looks like this:
var val
0 clump_thickness 5
1 unif_cell_size 1
2 unif_cell_shape 1
3 marg_adhesion 1
4 single_epith_cell_size 2
5 bare_nuclei 1
6 bland_chrom 3
7 norm_nucleoli 1
8 mitoses 1
9 class 2
11 unif_cell_size 4
12 unif_cell_shape 4
13 marg_adhesion 5
14 single_epith_cell_size 7
15 bare_nuclei 10
17 norm_nucleoli 2
20 clump_thickness 3
25 bare_nuclei 2
30 clump_thickness 6
31 unif_cell_size 8
32 unif_cell_shape 8
34 single_epith_cell_size 3
35 bare_nuclei 4
37 norm_nucleoli 7
40 clump_thickness 4
43 marg_adhesion 3
50 clump_thickness 8
51 unif_cell_size 10
52 unif_cell_shape 10
53 marg_adhesion 8
... ... ...
204 single_epith_cell_size 5
211 unif_cell_size 5
215 bare_nuclei 7
216 bland_chrom 7
217 norm_nucleoli 10
235 bare_nuclei -99999
257 norm_nucleoli 6
324 single_epith_cell_size 8
I want to create a new column that holds the values of the var and val columns, converted to a number. I wrote the following code:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
When I run this code I get the following error:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4384, in _apply_standard
result = Series(results)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 205, in __init__
default=np.nan)
File "pandas/_libs/src/inference.pyx", line 1701, in pandas._libs.lib.fast_multiget (pandas/_libs/lib.c:68371)
File "pandas/_libs/src/inference.pyx", line 1165, in pandas._libs.lib.maybe_convert_objects (pandas/_libs/lib.c:58498)
OverflowError: Python int too large to convert to C unsigned long
I don't understand why. If I run
maximum = 0
for column in df['var'].unique():
    for value in df['val'].unique():
        if int.from_bytes('{}{}'.format(column, value).encode(), 'little') > maximum:
            maximum = int.from_bytes('{}{}'.format(column, value).encode(), 'little')
        print(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
print()
print(maximum)
I get the following result:
65731626445514392434127804442952952931
67060854441299308307031611503233297507
65731626445514392434127804442952952931
68390082437084224179935418563513642083
69719310432869140052839225623793986659
73706994420223887671550646804635020387
16399285238650560638676108961167827102819
67060854441299308307031611503233297507
72377766424438971798646839744354675811
75036222416008803544454453864915364963
69719310432869140052839225623793986659
16399285238650560638676108961167827102819
76365450411793719417358260925195709539
68390082437084224179935418563513642083
76365450411793719417358260925195709539
73706994420223887671550646804635020387
83632281929131549175300318205721294812263623257187
71048538428654055925743032684074331235
75036222416008803544454453864915364963
72377766424438971798646839744354675811
277249955343544548646026928445812341
256480767909405238131904943128931957
266865361626474893388965935787372149
287634549060614203903087921104252533
64059424565585367137514643836585471605
261673064767940065760435439458152053
282442252202079376274557424775032437
.....
60968996531299
69179002195346541894528099
58769973275747
62068508159075
59869484903523
6026341019714892551838472781928948268513458935618931750446847388019
Based on these results I would say that the conversion to integers works fine. Furthermore, the largest created integer is not so big that it should cause problems when being inserted into the dataframe right?
Question: How can I successfully create a new column with the newly created integers? What am I doing wrong here?
Edit: Although bws's solution
str(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
solves the error, I now have a new problem: the ids are all unique. I don't understand why this happens, but I suddenly have 3000 unique ids, while there are only 92 unique var/val combinations.
I don't know why. Maybe the lambda uses NumPy's int64 by default instead of a plain Python int?
I have a workaround that may be useful for you.
Convert the result to a string (object):
df['id'] = df.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis=1)
This is interesting to know: https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html
uint64 Unsigned integer (0 to 18446744073709551615)
Edit:
After reading the last link, I assume that when you use a plain loop you are working with the Python int type, not the int type pandas uses (which comes from NumPy). So when you work with a DataFrame you are restricted to the types that NumPy provides...
Since NumPy's fixed-size integers cannot hold values this large, I think the correct way to work with large integers here is to store them as object dtype.
That's my conclusion, but maybe I am wrong.
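A minimal sketch of that limit (a hypothetical check, just to show where the overflow comes from): the encoded strings are far larger than the largest NumPy unsigned integer, while a plain Python int or an object column holds them fine.

import numpy as np
import pandas as pd

big = int.from_bytes('clump_thickness5'.encode(), 'little')
print(big > 2**64 - 1)             # True: beyond the uint64 range

pd.Series([big], dtype=object)     # works: stored as a Python object
try:
    np.uint64(big)                 # fails: does not fit in a C unsigned long
except OverflowError as exc:
    print(exc)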
Edit (second question):
A simple example works:
d2 = {'val': [2, 1, 1, 2],
      'var': ['clump_thickness', 'unif_cell_size', 'unif_cell_size', 'clump_thickness']}
df2 = pd.DataFrame(data=d2)
df2['id'] = df2.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis=1)
Result of df2:
print (df2)
val var id
0 2 clump_thickness 67060854441299308307031611503233297507
1 1 unif_cell_size 256480767909405238131904943128931957
2 1 unif_cell_size 256480767909405238131904943128931957
3 2 clump_thickness 67060854441299308307031611503233297507

Pandas multi-index subtract from value based on value in other column part 2

Based on a thorough and accurate response to this question, I am now faced with a new issue based on slightly different data.
Given this data frame:
df = pd.DataFrame({
    ('A', 'a'): [23, 3, 54, 7, 32, 76],
    ('B', 'b'): [23, 'n/a', 54, 'n/a', 32, 76],
    ('possible', 'possible'): [100, 100, 100, 100, 100, 100]
})
df
    A    B  possible
    a    b  possible
0  23   23       100
1   3  n/a       100
2  54   54       100
3   7  n/a       100
4  32   32       100
5  76   76       100
I'd like to subtract 4 from 'possible', per row, for any instance (column) where the value is 'n/a' for that row (and then change all 'n/a' values to 0).
    A    B  possible
    a    b  possible
0  23   23       100
1   3  n/a        96
2  54   54       100
3   7  n/a        96
4  32   32       100
5  76   76       100
Some conditions:
It may occur that a column is all floats (though they appear to be integers upon inspection). This was not factored into the original question.
It may also occur that a row contains two instances (columns) of 'n/a' values. This was addressed by the previous solution.
Here is the previous solution:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -= (df.loc[:, idx[('A','B'),:]] == 'n/a').sum(axis=1) * 4
df.replace({'n/a':0}, inplace=True)
It works, except for where a column (A or B) contains all floats (seemingly integers). When that's the case, this error occurs:
TypeError: Could not compare ['n/a'] with block values
I think you can add casting to string with astype in the condition:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -= (
    (df.loc[:, idx[('A', 'B'), :]].astype(str) == 'n/a').sum(axis=1) * 4
)
df.replace({'n/a': 0}, inplace=True)
print(df)
    A   B  possible
    a   b  possible
0  23  23       100
1   3   0        96
2  54  54       100
3   7   0        96
4  32  32       100
5  76  76       100
