How to filter a dataframe using a cumulative sum of a column as parameter - python-3.x

I have this df:
df=pd.DataFrame({'Name':['John','Mike','Lucy','Mary','Andy'],
'Age':[10,23,13,12,15],
'%':[20,20,10,25,25]})
I want to filter this df by taking rows from row 0 to row n, stopping when the sum of the % column reaches 50.
I don't want to sort the % column or the df; I just need the first rows whose % values sum to 50.
The expected output is:
filtered=pd.DataFrame({'Name':['John','Mike','Lucy'],'Age':[10,23,13],'%':[20,20,10]})

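Use cumsum to build a boolean mask, then slice with the loc or iloc accessor. The iloc version relies on the default RangeIndex, because idxmax returns a label while iloc expects a position: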
df.iloc[:(df['%'].cumsum()==50).idxmax()+1,:]
Name Age %
0 John 10 20
1 Mike 23 20
2 Lucy 13 10
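The one-liner above assumes the running total hits 50 exactly. If it never does, the mask is all False and idxmax returns the first label, so only the first row comes back. A minimal sketch of a guarded variant, assuming an empty result is wanted in that case (the names target, running and hits are illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mike', 'Lucy', 'Mary', 'Andy'],
                   'Age': [10, 23, 13, 12, 15],
                   '%': [20, 20, 10, 25, 25]})

target = 50  # threshold taken from the question
running = df['%'].cumsum()
hits = running.eq(target).to_numpy()  # True where the running total equals target

if hits.any():
    # argmax gives the position of the first True, so iloc is safe for any index
    filtered = df.iloc[:hits.argmax() + 1]
else:
    # assumption: return no rows when the target is never reached exactly
    filtered = df.iloc[0:0]

print(filtered)
```

Because the position comes from the NumPy array rather than from idxmax, this also works when the frame does not have a default RangeIndex.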

Related

How to drop records containing cell values equals to the header in pandas

I have read in this dataframe (called df):
As you can see there is a record that contains the same values as the header (ltv and age).
How do I drop that record in pandas?
Data:
df = pd.DataFrame({'ltv':[34.56, 50, 'ltv', 12.3], 'age':[45,56,'age',45]})
Check with
out = df[~df.eq(df.columns).any(axis=1)]
Out[203]:
ltv age
0 34.56 45
1 50 56
3 12.3 45
One way is to just filter it out (assuming the strings match the column name they are in):
out = df[df['ltv']!='ltv']
Another could be to use to_numeric + dropna:
out = df.apply(pd.to_numeric, errors='coerce').dropna()
Output:
ltv age
0 34.56 45
1 50 56
3 12.3 45
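After the stray header row is dropped, the columns are still of object dtype because they once held strings. A small sketch, assuming the remaining values are all numeric, that compares each column with its own name and then converts the result (is_header and clean are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'ltv': [34.56, 50, 'ltv', 12.3],
                   'age': [45, 56, 'age', 45]})

# True where a cell equals the name of the column it sits in
is_header = df.apply(lambda col: col == col.name).any(axis=1)

# drop those rows, then convert each column from object to a numeric dtype
clean = df.loc[~is_header].apply(pd.to_numeric)

print(clean)
print(clean.dtypes)
```

Unlike the to_numeric + dropna route, this keeps the integer column as integers, since no NaN is ever introduced.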

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of indices for rows (from dataframe B) for which position column falls in the interval (specified by start_coordinate and end_coordinate column) in dataframe A.
The result for this task will be:
lst = [0,1]. ### because row 0 of B falls in interval of row 1 in A and row 1 of B falls in interval of row 3 of A.
The indices that I get from task 1, I want to keep it from dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(axis=1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
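For reference, the broadcasting approach can be written out end to end to produce both the index list and the reduced frame. This is a sketch that rebuilds the sample frames from the question; the variable names inside and lst are illustrative:

```python
import pandas as pd

dfA = pd.DataFrame({'Name.': ['Peter', 'Hugo', 'Jennie'],
                    'gender': ['M', 'M', 'F'],
                    'start_coordinate': [30, 4500, 300],
                    'end_coordinate': [150, 6000, 700],
                    'ID': [1, 2, 3]})
dfB = pd.DataFrame({'ID_sim.': [1, 4, 5],
                    'position': [89, 568, 938437],
                    'string': ['aa', 'bb', 'cc']})

s = dfA['start_coordinate'].to_numpy()   # shape (3,)
e = dfA['end_coordinate'].to_numpy()     # shape (3,)
p = dfB['position'].to_numpy()[:, None]  # shape (3, 1)

# inside[i, j] is True when position i lies within interval j (endpoints included)
inside = (p >= s) & (p <= e)

lst = dfB.index[inside.any(axis=1)].tolist()    # task 1: matching row labels
result = dfB.loc[lst, ['position', 'string']]   # task 2: reduced frame

print(lst)
print(result)
```

The comparison builds a len(dfB) by len(dfA) boolean matrix, so memory use grows with the product of the two lengths.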
You could use a pandas IntervalIndex to get the positions, and afterwards use a boolean array to pull the relevant rows from B (here A and B stand for dfA and dfB):
Create IntervalIndex:
intervals = pd.IntervalIndex.from_tuples(
    [*zip(A['start_coordinate'], A['end_coordinate'])],
    closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
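This works when the two frames are already aligned row by row, because the merge on the index compares row i of one frame only with row i of the other. Note that > and < also exclude the interval endpoints. Less elegant but easier to comprehend.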
import pandas as pd
data = [['Name.','gender', 'start_coordinate','end_coordinate','ID'],
['Peter','M',30,150,1],
['Hugo','M',4500,6000,2],
['Jennie','F',300,700,3]]
data2 = [['ID_sim.','position','string'],
['1',89,'aa'],
['4',568,'bb'],
['5',938437,'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
merged = pd.merge(df1, df2, left_index=True, right_index=True)
print (merged[(merged['position'] > merged['start_coordinate']) & (merged['position'] < merged['end_coordinate'])])
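If every position has to be checked against every interval, a cross join is one merge-based way to do it. The following is a sketch rather than the answer's own method, and it assumes pandas 1.2 or later for how='cross'; pairs, hit and lst are illustrative names:

```python
import pandas as pd

dfA = pd.DataFrame({'Name.': ['Peter', 'Hugo', 'Jennie'],
                    'gender': ['M', 'M', 'F'],
                    'start_coordinate': [30, 4500, 300],
                    'end_coordinate': [150, 6000, 700],
                    'ID': [1, 2, 3]})
dfB = pd.DataFrame({'ID_sim.': ['1', '4', '5'],
                    'position': [89, 568, 938437],
                    'string': ['aa', 'bb', 'cc']})

# reset_index keeps dfB's row labels in a column named 'index'
pairs = dfB.reset_index().merge(dfA, how='cross')

# keep the pairs where the position lies inside the interval (endpoints included)
hit = ((pairs['position'] >= pairs['start_coordinate'])
       & (pairs['position'] <= pairs['end_coordinate']))

lst = sorted(pairs.loc[hit, 'index'].unique().tolist())
result = dfB.loc[lst, ['position', 'string']]

print(lst)
print(result)
```

The cross join produces len(dfA) * len(dfB) rows, so it is easy to read but only practical for small frames.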

if specific value/string occurs in the entire dataframe I want to sum its index values

I have a dataframe in which I need to find a specific image name across the entire dataframe and sum its index values every time it is found. My dataframe looks like this:
c 1 2 3 4
g
0 180731-1-61.jpg 180731-1-61.jpg 180731-1-61.jpg 180731-1-61.jpg
1 1209270004-2.jpg 180609-2-31.jpg 1209270004-2.jpg 1209270004-2.jpg
2 1209270004-1.jpg 180414-2-38.jpg 180707-1-31.jpg 1209050002-1.jpg
3 1708260004-1.jpg 1209270004-2.jpg 180609-2-31.jpg 1209270004-1.jpg
4 1108220001-5.jpg 1209270004-1.jpg 1108220001-5.jpg 1108220001-2.jpg
I need to find 1209270004-2.jpg in the entire dataframe. It occurs at index 1 in column 1, index 3 in column 2, and index 1 in columns 3 and 4, so adding the index values should give
1+3+1+1=6.
I tried the code:
img_fname = '1209270004-2.jpg'
df2 = df1[df1.eq(img_fname).any(axis=1)]
sum = int(np.sum(df2.index.values))
print(sum)
I am getting a sum of 4, i.e. 1+3=4, but it should be 6.
If the string occurs in only some of the columns, for example 180707-1-31, which appears only in column 3, then the sum should be 45+45+3+45 = 138. That is, for every column where the string is not present, take the value 45 instead of the index value.
You can multiply the boolean mask by the index values and then sum:
img_fname = '1209270004-1.jpg'
s = df1.eq(img_fname).mul(df1.index.to_series(), axis=0).sum()
print (s)
1 2
2 4
3 0
4 3
dtype: int64
out = np.where(s == 0, 45, s).sum()
print (out)
54
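One caveat with testing s == 0: a match at index 0 also contributes 0, so a column where the name is found in row 0 would be treated as "not found". A sketch that separates the two cases by checking the mask directly; the helper name index_sum and its missing parameter are hypothetical, and the frame is rebuilt from the question's sample:

```python
import pandas as pd

df1 = pd.DataFrame({
    1: ['180731-1-61.jpg', '1209270004-2.jpg', '1209270004-1.jpg',
        '1708260004-1.jpg', '1108220001-5.jpg'],
    2: ['180731-1-61.jpg', '180609-2-31.jpg', '180414-2-38.jpg',
        '1209270004-2.jpg', '1209270004-1.jpg'],
    3: ['180731-1-61.jpg', '1209270004-2.jpg', '180707-1-31.jpg',
        '180609-2-31.jpg', '1108220001-5.jpg'],
    4: ['180731-1-61.jpg', '1209270004-2.jpg', '1209050002-1.jpg',
        '1209270004-1.jpg', '1108220001-2.jpg'],
})


def index_sum(df, name, missing=45):
    """Sum the row labels where name occurs; use missing for columns without it."""
    mask = df.eq(name)
    per_col = mask.mul(df.index.to_series(), axis=0).sum()
    # decide "not found" from the mask, not from the sum being zero
    per_col = per_col.where(mask.any(), missing)
    return int(per_col.sum())


print(index_sum(df1, '1209270004-2.jpg'))  # 1 + 3 + 1 + 1
print(index_sum(df1, '1209270004-1.jpg'))  # 2 + 4 + 45 + 3
print(index_sum(df1, '180731-1-61.jpg'))   # found at index 0 in every column
```

With the s == 0 test, the last call would return 180 (four times 45) instead of 0.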
If the dataset does not have many columns, a plain loop also works for your original question:
df1 = pd.DataFrame({"A":["aa","ab", "cd", "ab", "aa"], "B":["ab","ab", "ab", "aa", "ab"]})
s = 0
for i in df1.columns:
    # add the row labels where this column equals "ab"
    s = s + sum(df1.index[df1.loc[:, i] == "ab"].tolist())
Input :
A B
0 aa ab
1 ab ab
2 cd ab
3 ab aa
4 aa ab
Output :11
Based on second requirement:

Dataframe: Computed row based on cell above and cell on the left

I have a dataframe with a bunch of integer values. I then compute the column totals and append it as a new row to the dataframe. So far so good.
Now I want to append another computed row where the value of each cell is the sum of cell above and the cell on the left. You can see what I mean below:
----------------------------------------------------------------
|250000 |0 |145000 |145000 |220000 |165000 |145000 |145000 |
----------------------------------------------------------------
|250000 |250000 |395000 |540000 |760000 |925000 |1070000|1215000 |
----------------------------------------------------------------
How can this be done?
I think you need Series.cumsum applied to the last row (the totals row), which you can select with DataFrame.iloc:
df = pd.DataFrame({
'B':[4,5,4],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
})
df.loc['sum'] = df.sum()
df.loc['cumsum'] = df.iloc[-1].cumsum()
# if only the cumsum row is needed
# df.loc['cumsum'] = df.sum().cumsum()
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
sum 13 24 9 14
cumsum 13 37 46 60
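Why cumsum matches the "cell above plus cell on the left" rule: each new cell equals the total above it plus the new cell to its left, which is the definition of a running total. A short check using the totals row from the question (the names totals, by_rule and by_cumsum are illustrative):

```python
import pandas as pd

# totals row from the question
totals = pd.Series([250000, 0, 145000, 145000, 220000, 165000, 145000, 145000])

# apply the rule literally: new cell = cell above + new cell on the left
by_rule = []
left = 0  # nothing to the left of the first cell
for above in totals:
    left = int(above) + left
    by_rule.append(left)

by_cumsum = totals.cumsum().tolist()

print(by_rule)
print(by_cumsum)
```

Both lists come out identical and match the second row shown in the question.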

Pandas Pivot Table Conditional Counting

I have a simple dataframe:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0]})
df
id value
0 a 0
1 a 15
2 a 20
3 b 30
4 b 0
And I want a pivot table with the number of values greater than zero.
I tried this:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:len(x>0))
But returned this:
value
id
a 3
b 2
What I need:
value
id
a 2
b 1
I read lots of solutions with groupby and filter. Is it possible to achieve this only with pivot_table command? If it is not, which is the best approach?
Thanks in advance
UPDATE
Just to make it clearer why I am avoiding the filter solution: in my real, more complex df, I have other columns, like this:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0],'other':[2,3,4,5,6]})
df
id other value
0 a 2 0
1 a 3 15
2 a 4 20
3 b 5 30
4 b 6 0
I need to sum the column 'other', but when I filter first I get this:
df=df[df['value']>0]
raw = pd.pivot_table(df, index='id',values=['value','other'],aggfunc={'value':len,'other':sum})
other value
id
a 7 2
b 5 1
Instead of:
other value
id
a 9 2
b 11 1
To count the values greater than zero, sum the True values produced by the condition x>0:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:(x>0).sum())
print (raw)
value
id
a 2
b 1
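For the case in the UPDATE, where 'other' must be summed over all rows while 'value' counts only the positive entries, aggfunc also accepts a dict with one function per column. A sketch using the question's second frame:

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'value': [0, 15, 20, 30, 0],
                   'other': [2, 3, 4, 5, 6]})

raw = pd.pivot_table(
    df,
    index='id',
    values=['value', 'other'],
    aggfunc={
        'value': lambda x: (x > 0).sum(),  # count of values greater than zero
        'other': 'sum',                    # plain sum over every row in the group
    },
)

print(raw)
```

No rows are filtered out beforehand, so the sum of 'other' still covers the rows where 'value' is zero.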
As @Wen mentioned, another solution is:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id',values='value',aggfunc=len)
You can filter the dataframe before pivoting:
pd.pivot_table(df.loc[df['value']>0], index='id',values='value',aggfunc='count')
