filtering rows in one dataframe based on two columns of another dataframe - python-3.x

I have two data frames. One dataframe (dfA) looks like:
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (dfB) looks like:
Name position string
Peter 89 aa
Jennie 568 bb
Jennie 90 cc
I want to filter data from dfA such that a position from dfB falls in the interval of dfA (start_coordinate and end_coordinate) and the names match as well. For example, the position value of row # 1 of dfB falls in the interval specified by row # 1 of dfA and the corresponding name value is the same, so I want this row. In contrast, row # 3 of dfB also falls in the interval of row # 1 of dfA, but the name value is different, so I don't want this record.
The expected output therefore becomes:
##new_dfA
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
##new_dfB
Name position string
Peter 89 aa
Jennie 568 bb
In reality, dfB is of size (443068765, 10) and dfA is of size (100000, 3), so I don't want to use numpy broadcasting because I run into a memory error. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.

If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]
output:
>>> dfA_new
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
>>> dfB_new
Name position string
0 Peter 89 aa
1 Jennie 568 bb
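If a single merge of 443 million rows is still too big even after joining on "Name", a minimal sketch of the same idea applied to dfB in chunks (the chunk size of 1,000,000 is an arbitrary illustration; tune it to your RAM):
import pandas as pd

chunks = []
for start in range(0, len(dfB), 1_000_000):
    # merge and filter one slice of dfB at a time so the
    # intermediate merge result stays small
    part = dfB.iloc[start:start + 1_000_000].merge(dfA, on='Name')
    part = part[part['position'].between(part['start_coordinate'],
                                         part['end_coordinate'])]
    chunks.append(part)

dfC = pd.concat(chunks, ignore_index=True)
dfA_new = dfC[dfA.columns].drop_duplicates()  # a Name can match several positions
dfB_new = dfC[dfB.columns]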

Use pandasql (its sqldf function runs a SQL query against your dataframes):
from pandasql import sqldf
sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
pd.sql("select df2.* from df1 inner join df2 on df2.name=df1.name and df2.position between df1.start_coordinate and df1.end_coordinate",globals())
Name position string
0 Peter 89 aa
1 Jennie 568 bb
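Note that pandasql executes the query by copying the referenced dataframes into an in-memory SQLite database first, so at dfB's scale the copy alone may be prohibitive; treat this as a convenience for moderate sizes rather than a fix for the memory problem.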

Related

How to drop records containing cell values equal to the header in pandas

I have read in this dataframe (called df):
As you can see, there is a record that contains the same values as the header (ltv and age).
How do I drop that record in pandas?
Data:
df = pd.DataFrame({'ltv':[34.56, 50, 'ltv', 12.3], 'age':[45,56,'age',45]})
Check each row against the column names and drop any row where a cell equals its own header:
out = df[~df.eq(df.columns).any(axis=1)]
Out[203]:
ltv age
0 34.56 45
1 50 56
3 12.3 45
One way is to just filter it out (assuming the strings match the column name they are in):
out = df[df['ltv']!='ltv']
Another could be to use to_numeric + dropna:
out = df.apply(pd.to_numeric, errors='coerce').dropna()
Output:
ltv age
0 34.56 45
1 50 56
3 12.3 45
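Note that the first two variants leave the columns as object dtype (the frame was built with mixed strings and numbers), so if you need numeric columns afterwards, convert them; a small sketch:
out = out.apply(pd.to_numeric)  # restore numeric dtypes after dropping the header row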

Pandas Aggregate columns dynamically

My goal is to aggregate data similar to SAS's "proc summary using types". My starting pandas dataframe could look like this, where the database has already grouped by all dimensions/classification variables and applied an aggregate function to the measures.
So in sql this would look like
select gender, age, sum(height), sum(weight)
from db.table
group by gender, age
gender  age  height  weight
F       19   70      123
M       24   72      172
I would then like to use pandas to calculate summary rows based on different group bys, to come out with this:
gender  age  height  weight
.       .    142     295
.       19   70      123
.       24   72      172
F       .    70      123
M       .    72      172
F       19   70      123
M       24   72      172
Here the first row is the aggregate with no group by, rows 2 and 3 are aggregated by age alone, rows 4 and 5 by gender alone, and the rest are the normal rows.
My current code looks like this:
# normally dynamic, just hard coded for this example
measures = {'height': {'stat': 'sum'}, 'age': {'stat': 'sum'}}
msr_config_dict = {}
for measure in measures:
    if measure in message_measures:
        stat = measures[measure]['stat']
        msr_config_dict[measure] = pd.NamedAgg(measure, stat)

# compute agg with no group by as starting point
df = self.df.agg(**msr_config_dict)

dimensions = ['gender', 'age']  # also dynamic in real life
dim_vars = []
for dim in dimensions:
    dim_vars.append(dim)
    if len(dim_vars) > 1:
        # compute agg of compound dimensions
        df_temp = self.df.groupby(dim_vars, as_index=False).agg(msr_config_dict)
        df = df.append(df_temp, ignore_index=True)
    # always compute agg of solo dimension
    df_temp = self.df.groupby(dim, as_index=False).agg(msr_config_dict)
    df = df.append(df_temp, ignore_index=True)
With this code I get AttributeError: 'height' is not a valid function for 'Series' object
For the input to the agg function I have also tried {'height': [('height', 'sum')], 'weight': [('weight', 'sum')]}, where I am trying to compute the sum of all heights and name the output height. That also raised an attribute error.
I know I will only ever compute one aggregate function per measure, so I would like to dynamically build the input to the pandas agg function and always rename the stat to itself, so that I can just append it to the dataframe I am building with the summary rows.
I am new to pandas, coming from a SAS background.
Any help would be much appreciated.
IIUC:
cols = ['height', 'weight']
out = pd.concat([df[cols].sum(axis=0).to_frame().T,
                 df.groupby('age')[cols].sum().reset_index(),
                 df.groupby('gender')[cols].sum().reset_index(),
                 df], ignore_index=True)[df.columns].fillna('.')
Output:
>>> out
gender age height weight
0 . . 142 295
1 . 19.0 70 123
2 . 24.0 72 172
3 F . 70 123
4 M . 72 172
5 F 19.0 70 123
6 M 24.0 72 172
Here is a more flexible solution, extending the solution of @Corralien. You can use itertools.combinations to create all combinations of dimensions, for every possible combination length.
from itertools import combinations

# your input
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'min'}}
dimensions = ['gender', 'age']

# flatten the nested dictionary to a {column: stat} mapping
msr_config_dict = {key: val['stat'] for key, val in measures.items()}

# concat all possible aggregations
res = pd.concat(
    # case with everything aggregated
    [df.agg(msr_config_dict).to_frame().T]
    # cases with at least one dimension to group over
    + [df.groupby(list(_dimCols)).agg(msr_config_dict).reset_index()
       # for combinations of length 1, 2, ... short of all dimensions
       for nb_cols in range(1, len(dimensions))
       # all combinations of that specific length
       for _dimCols in combinations(dimensions, nb_cols)]
    # original dataframe supplies the fully grouped rows
    + [df],
    ignore_index=True)[df.columns].fillna('.')
print(res)
# gender age height weight
# 0 . . 142 123
# 1 F . 70 123
# 2 M . 72 172
# 3 . 19.0 70 123
# 4 . 24.0 72 172
# 5 F 19.0 70 123
# 6 M 24.0 72 172
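As for the AttributeError in the original code: it most likely arises because passing the dict positionally, as in agg({'height': pd.NamedAgg('height', 'sum')}), makes pandas treat the NamedAgg tuple as a list of two aggregation functions, and the string 'height' is not a function. NamedAgg entries work when unpacked as keyword arguments; a minimal sketch:
df.groupby('gender').agg(**{'height': pd.NamedAgg(column='height', aggfunc='sum')})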

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like:
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of the indices of rows (from dataframe B) whose position column falls in an interval (specified by the start_coordinate and end_coordinate columns) in dataframe A.
The result for this task will be:
lst = [0, 1]  ### because row 0 of B falls in the interval of row 0 of A, and row 1 of B falls in the interval of row 2 of A.
Using the indices from task 1, I want to keep those rows of dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst = dfB[dfB['position'].between(dfA.loc[0, 'start_coordinate'],
                                  dfA.loc[len(dfA) - 1, 'end_coordinate'])].index.tolist()
result = dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(axis=1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
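Note that the broadcast comparison materializes an intermediate boolean array of shape (len(dfB), len(dfA)), so memory use grows with the product of the two lengths; it is fine at this sample's scale but can fail on very large inputs.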
You could use pandas' IntervalIndex to get the positions, and afterwards use a boolean array to pull the relevant rows from B.
Create the IntervalIndex:
intervals = pd.IntervalIndex.from_tuples(
    [*zip(A['start_coordinate'], A['end_coordinate'])],
    closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
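If you also want the plain list of indices from task 1, the same indexer provides it; a small sketch:
mask = intervals.get_indexer(B.position) >= 0
lst = B.index[mask].tolist()  # [0, 1]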
This should work. Less elegant but easier to comprehend: cross-join the two frames so every row of df1 is paired with every row of df2, then keep the pairs whose position falls inside the interval.
import pandas as pd

data = [['Name.', 'gender', 'start_coordinate', 'end_coordinate', 'ID'],
        ['Peter', 'M', 30, 150, 1],
        ['Hugo', 'M', 4500, 6000, 2],
        ['Jennie', 'F', 300, 700, 3]]
data2 = [['ID_sim.', 'position', 'string'],
         ['1', 89, 'aa'],
         ['4', 568, 'bb'],
         ['5', 938437, 'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])

# how='cross' needs pandas >= 1.2
merged = pd.merge(df1, df2, how='cross')
print(merged[(merged['position'] >= merged['start_coordinate'])
             & (merged['position'] <= merged['end_coordinate'])])

How to take the values in a column as the columns of the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can I get it? I've tried using pivot_table but I didn't get my expected output. Is there any way to get it?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = df.pivot(columns='Term', values='value')
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicate Name/Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
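For reference, a minimal runnable reproduction of the sample frame and the first approach (the values are taken from the question):
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Term': [1, 2, 3, 1, 2, 3],
                   'value': [35, 40, 50, 20, 45, 50]}).set_index('Name')
print(df.set_index('Term', append=True)['value'].unstack())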

Sorting and Grouping in Pandas data frame column alphabetically

I want to sort a pandas data frame by a column alphabetically, grouping equal values together.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" so that it is in alphabetical order and grouped as well, i.e., the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
It will sort your dataframe by column a, and the equal values end up grouped together as a side effect of the sorting. See ya!
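The blank cells in the expected output are purely cosmetic; if you want to reproduce them, one sketch is to blank out consecutive repeats after sorting (note this turns the column into strings, so do it only for display):
result = df.sort_values('a')
result['a'] = result['a'].mask(result['a'].eq(result['a'].shift()), '')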
