How to drop records containing cell values equal to the header in pandas - python-3.x

I have read in a dataframe (called df) that contains a record with the same values as the header (ltv and age).
How do I drop that record in pandas?
Data:
df = pd.DataFrame({'ltv':[34.56, 50, 'ltv', 12.3], 'age':[45,56,'age',45]})

Check each row against the column names and keep only the rows where no cell equals its header:
out = df[~df.eq(df.columns).any(axis=1)]
out
     ltv age
0  34.56  45
1     50  56
3   12.3  45

One way is to just filter it out (assuming the strings match the column name they are in):
out = df[df['ltv']!='ltv']
Another option is to use to_numeric + dropna:
out = df.apply(pd.to_numeric, errors='coerce').dropna()
Output:
     ltv age
0  34.56  45
1     50  56
3   12.3  45
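Note that the mask-based filters above leave the columns as object dtype, since they once held strings. A minimal sketch of restoring numeric dtypes after filtering (the conversion step is an assumption, safe here because only numeric values remain):

import pandas as pd

df = pd.DataFrame({'ltv': [34.56, 50, 'ltv', 12.3], 'age': [45, 56, 'age', 45]})

# Drop rows whose cells match the header, then convert back to numbers.
out = df[~df.eq(df.columns).any(axis=1)]
out = out.apply(pd.to_numeric)

print(out.dtypes)  # ltv float64, age int64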

Related

How to filter a dataframe using a cumulative sum of a column as parameter

I have this df:
df=pd.DataFrame({'Name':['John','Mike','Lucy','Mary','Andy'],
'Age':[10,23,13,12,15],
'%':[20,20,10,25,25]})
I want to filter this df by taking rows from row 0 down to the first row n at which the cumulative sum of the % column reaches 50.
I don't want to sort the % column or the df; I just need the leading rows whose % column sums to 50.
The expected output is:
filtered=pd.DataFrame({'Name':['John','Mike','Lucy'],'Age':[10,23,13],'%':[20,20,10]})
Take the cumulative sum, locate the first row where it equals 50 with idxmax, and slice up to that row with the iloc accessor:
df.iloc[:(df['%'].cumsum() == 50).idxmax() + 1, :]
   Name  Age   %
0  John   10  20
1  Mike   23  20
2  Lucy   13  10
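One caveat: if no prefix sums to exactly 50, the boolean mask is all False, idxmax returns 0, and the slice silently yields only the first row. A more defensive sketch, assuming non-negative % values so the running total is monotone (it keeps rows while the total stays at or below the target):

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mike', 'Lucy', 'Mary', 'Andy'],
                   'Age': [10, 23, 13, 12, 15],
                   '%': [20, 20, 10, 25, 25]})

# Keep the leading rows whose running total of % does not exceed 50.
filtered = df[df['%'].cumsum().le(50)]
print(filtered)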

filtering rows in one dataframe based on two columns of another dataframe

I have two data frames. One dataframe (dfA) looks like:
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (dfB) looks like:
Name position string
Peter 89 aa
Jennie 568 bb
Jennie 90 cc
I want to filter both dataframes so that a row is kept only when the position in dfB falls in the interval given by dfA (start_coordinate to end_coordinate) and the names match. For example, the position in row 1 of dfB falls in the interval given by row 1 of dfA and the corresponding name is the same, so I want this row. In contrast, the position in row 3 of dfB also falls in the interval of row 1 of dfA, but the name is different, so I don't want that record.
The expected output therefore becomes:
##new_dfA
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
##new_dfB
Name position string
Peter 89 aa
Jennie 568 bb
In reality, dfB has shape (443068765, 10) and dfA has shape (100000, 3), so I don't want to use numpy broadcasting because I run into memory errors. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.
If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with an identical "Name", you could merge on "Name" and then keep the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]
Output:
>>> dfA_new
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
>>> dfB_new
Name position string
0 Peter 89 aa
1 Jennie 568 bb
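Given dfB's size, even a single merge may not fit in memory. A minimal sketch of processing dfB in chunks instead, assuming dfA is already loaded and dfB lives in a (hypothetical) dfB.csv; the chunksize would need tuning:

import pandas as pd

results = []
# Stream dfB from disk in pieces; merge and filter each chunk on its own
# so only the (small) matching rows accumulate in memory.
for chunk in pd.read_csv('dfB.csv', chunksize=1_000_000):
    merged = dfA.merge(chunk, on='Name')
    merged = merged[merged['position'].between(merged['start_coordinate'],
                                               merged['end_coordinate'])]
    results.append(merged)

dfC = pd.concat(results, ignore_index=True)
dfA_new = dfC[dfA.columns].drop_duplicates()
dfB_new = dfC[dfB.columns]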
Another option is pandasql, whose sqldf function runs SQL (via SQLite) against the dataframes in scope:
from pandasql import sqldf
sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
sqldf("select dfB.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name position string
0 Peter 89 aa
1 Jennie 568 bb

Get total of Pandas column and row

I have a Pandas data frame, as shown below,
a b c
A 100 60 60
B 90 44 44
A 70 50 50
Now I would like to get totals per index label and per row, skipping column c, as shown below:
a b sum
A 170 110 280
B 90 44 134
I do not know how to do this; please help.
My example dataframe is:
df = pd.DataFrame(dict(a=[100, 90,70], b=[60, 44,50],c=[60, 44,50]),index=["A", "B","A"])
(
df.groupby(level=0)[['a', 'b']].sum()
  .assign(sum=lambda x: x.sum(axis=1))
)
Use:
#remove the unnecessary column
df = df.drop(columns='c')
#get the sum of each row
df['sum'] = df.sum(axis=1)
#get the sum per index label
df = df.groupby(level=0).sum()
print (df)
print (df)
a b sum
A 170 110 280
B 90 44 134
df["sum"] = df[["a","b"]].sum(axis=1) #Column-wise sum of "a" and "b"
df[["a", "b", "sum"]] #show all columns but not "c"
The pandas way is:
#create sum column
df['sum'] = df['a'] + df['b']
#remove column c by selecting the others
df = df[['a', 'b', 'sum']]
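For reference, a self-contained run of the groupby approach above, using the example dataframe from the question:

import pandas as pd

df = pd.DataFrame(dict(a=[100, 90, 70], b=[60, 44, 50], c=[60, 44, 50]),
                  index=["A", "B", "A"])

# Sum duplicate index labels for a and b, then add a row-wise total.
out = (df.groupby(level=0)[['a', 'b']].sum()
         .assign(sum=lambda x: x.sum(axis=1)))
print(out)
#      a    b  sum
# A  170  110  280
# B   90   44  134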

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can I get it? I've tried using pivot_table but I didn't get my expected output. Is there any way to get it?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or, with pivot after resetting the index:
df = df.reset_index().pivot(index='Name', columns='Term', values='value')
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicated Name and Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
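A minimal, self-contained run of the unstack approach, assuming Name is the index as in the question:

import pandas as pd

df = pd.DataFrame({'Term': [1, 2, 3, 1, 2, 3],
                   'value': [35, 40, 50, 20, 45, 50]},
                  index=pd.Index(['A', 'A', 'A', 'B', 'B', 'B'], name='Name'))

# Move Term into the index alongside Name, then pivot it out to columns.
out = df.set_index('Term', append=True)['value'].unstack()
print(out)
# Term   1   2   3
# Name
# A     35  40  50
# B     20  45  50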

row subtraction in lambda pandas dataframe

I have a dataframe with multiple columns. One of the columns is a cumulative revenue column. If the year has not ended yet, the revenue stays constant for the rest of the period because the remaining daily revenue is 0.
The dataframe has yearly columns such as 2016 and 2017 (see the sample data in the answers below).
Now I want to create a new column where each row is subtracted by the previous row; if the result is 0, print 0 for that row in the new column, otherwise use the row's value.
My idea was to do this with the apply lambda method. So this is the thinking:
df['2017new'] = df['2017'].apply(lambda x: 0 if row - lastrow == 0 else x)
But I do not know how to write the row - lastrow part of the code. How can I do this? Thanks in advance!
By using np.where on diff, which computes each row minus the previous row:
import numpy as np
df2['New'] = np.where(df2['2017'].diff().eq(0), 0, df2['2017'])
df2
   2016  2017  New
0    10    21   21
1    15    34   34
2    70    40   40
3    90    53   53
4    93    53    0
5    99    53    0
We can shift the data and fill the values based on the condition using np.where, i.e.
df['new'] = np.where(df['2017'] - df['2017'].shift(1) == 0, 0, df['2017'])
or with df.where, i.e.
df['new'] = df['2017'].where(df['2017'] - df['2017'].shift(1) != 0, 0)
   2016  2017  new
0    10    21   21
1    15    34   34
2    70    40   40
3    90    53   53
4    93    53    0
5    99    53    0
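A self-contained sketch of the diff-based answer, using the sample data shown above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'2016': [10, 15, 70, 90, 93, 99],
                   '2017': [21, 34, 40, 53, 53, 53]})

# diff() is row minus previous row; where it is 0 the cumulative
# revenue stopped growing, so the new column gets 0 there.
df['new'] = np.where(df['2017'].diff().eq(0), 0, df['2017'])
print(df)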
