Pandas - Fastest way indexing with 2 dataframes - python-3.x

I am developing a software in Python 3 with Pandas library.
Time is very important but memory not so much.
For better visualization I am using the names a and b with few values, although there are many more:
a -> 50000 rows
b -> 5000 rows
I need to select from dataframe a and b (using multiples conditions)
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] ,
'a2': [1, 2, 3],
'a3': [3.14, 2.73, -23.00],
'a4': [pd.np.nan, pd.np.nan, pd.np.nan]
})
a
a1 a2 a3 a4
0 x 1 3.14 NaN
1 y 2 2.73 NaN
2 z 3 -23.00 NaN
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'],
'b2': [2018, 2019, 2020, 2015, 2012]
})
b
b1 b2
0 x 2018
1 y 2019
2 z 2020
3 k 2015
4 l 2012
So far my code is like this:
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if (len(mask.index) != 0): #not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except KeyError: #not found
pass
But as you can see, I'm using for iterrows that is very slow compared to other methods and I'm changing the value of the DataFrame I'm iterating, that is not recommended.
Could you help me find a better way? The results should be like this:
a
a1 a2 a3 a4
0 x 1 3.14 2018
1 y 2 2.73 NaN
2 z 3 -23.00 2020
I tried things like this below, but I didnt made it work.
a.loc[ (a['a1'] == b['b1']) , 'a4'] = b.loc[b['b2'] != 2019]
*the real code has more conditions
Thanks!
EDIT
I benchmark using: iterrows, merge, set_index/loc. Here is the code:
import timeit
import pandas as pd
def f_iterrows():
for index, row in a.iterrows():
try:
# create a key
a1 = row["a1"]
a3 = row["a3"]
mask = b.loc[(b['b1'] == a1) & (b['b2'] != 2019)]
# check if exists
if len(mask.index) != 0: # not empty
a.loc[[index], ['a4']] = mask.iloc[0]['b2']
except: # not found
pass
def f_merge():
a.merge(b[b.b2 != 2019], left_on='a1', right_on='b1', how='left').drop(['a4', 'b1'], 1).rename(columns={'b2': 'a4'})
def f_lock():
df1 = a.set_index('a1')
df2 = b.set_index('b1')
df1.loc[:, 'a4'] = df2.b2[df2.b2 != 2019]
#variables for testing
number_rows = 100
number_iter = 100
a = pd.DataFrame({
'a1': ['x', 'y', 'z'] * number_rows,
'a2': [1, 2, 3] * number_rows,
'a3': [3.14, 2.73, -23.00] * number_rows,
'a4': [pd.np.nan, pd.np.nan, pd.np.nan] * number_rows
})
b = pd.DataFrame({
'b1': ['x', 'y', 'z', 'k', 'l'] * number_rows,
'b2': [2018, 2019, 2020, 2015, 2012] * number_rows
})
print('For: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
print('Merge: %s s' % str(timeit.timeit(f_merge, number=number_iter)))
print('Loc: %s s' % str(timeit.timeit(f_iterrows, number=number_iter)))
They all worked :) and the time to run is:
For: 277.9994369489998 s
Loc: 274.04929955067564 s
Merge: 2.195712725706926 s
So far Merge is the fastest.
If another option appears I will update here, thanks again.

IIUC
a.merge(b[b.b2!=2019],left_on='a1',right_on='b1',how='left').drop(['a4','b1'],1).rename(columns={'b2':'a4'})
Out[263]:
a1 a2 a3 a4
0 x 1 3.14 2018.0
1 y 2 2.73 NaN
2 z 3 -23.00 2020.0

Related

Identify the latest series of Continuous same value in Python Pandas DataFrame

I have the following DataFrame (Date in dd-mm-yyyy format):
import pandas as pd
data={'Id':['A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'C', 'B', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Date':['20-10-2022', '20-10-2022', '20-10-2022', '21-10-2022', '21-10-2022', '21-10-2022',
'22-10-2022', '22-10-2022', '23-10-2022', '23-10-2022', '24-10-2022', '24-10-2022',
'25-10-2022', '25-10-2022', '26-10-2022', '26-10-2022', '26-10-2022', '27-10-2022',
'27-10-2022', '27-10-2022']}
df=pd.DataFrame.from_dict(data)
df
Id Date
0 A 20-10-2022
1 B 20-10-2022
2 C 20-10-2022
3 A 21-10-2022
4 B 21-10-2022
5 C 21-10-2022
6 B 22-10-2022
7 C 22-10-2022
8 A 23-10-2022
9 C 23-10-2022
10 B 24-10-2022
11 C 24-10-2022
12 B 25-10-2022
13 C 25-10-2022
14 A 26-10-2022
15 B 26-10-2022
16 C 26-10-2022
17 A 27-10-2022
18 B 27-10-2022
19 C 27-10-2022
This is the Final DataFrame that I want:
I have tried the following code:
# Find first occurance and last occurance of any given Id.
df_first_duplicate = df.drop_duplicates(subset=['Id'], keep='first')
df_first_duplicate.rename(columns = {'Date':'DateOfFirstOccurance'}, inplace = True)
df_first_duplicate.reset_index(inplace = True, drop = True)
df_last_duplicate = df.drop_duplicates(subset=['Id'], keep='last')
df_last_duplicate.rename(columns = {'Date':'DateOfLastOccurance'}, inplace = True)
df_last_duplicate.reset_index(inplace = True, drop = True)
# Merge the above two df's on key
df_merged = pd.merge(df_first_duplicate, df_last_duplicate, on='Id')
df_merged
But this is the output that I get:
Id DateOfFirstOccurance DateOfLastOccurance
0 A 20-10-2022 27-10-2022
1 B 20-10-2022 27-10-2022
2 C 20-10-2022 27-10-2022
What should I do to get the desired output?
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
records = []
for key, group in df.groupby(by='Id'):
filt = group['Date'].diff(-1).dt.days >= -1
filt.iloc[filt.shape[0]-1] = True
max_false_index = filt[~filt].index.max()
min_date = group['Date'].min() if type(max_false_index) == float else group.loc[max_false_index+1:, 'Date'].min()
records.append([key, min_date, group['Date'].max()])
pd.DataFrame(records, columns=['Id', 'DateOfFirstOccurance', 'DateOfLastOccurance'])
Here is one way to do it.
Sort your data by Id and Date. Use pandas.Series.diff to get the difference of each row compared to the last one, change it with dt.days to a floating number and create a boolean Series by comparing each value if it is greater/equal to 1. Convert the boolean Series from True/False to 1/0 with astype(int) and build the cumulative sum. The idx with the biggest value is the first/last occurence of your data.
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df = df.sort_values(['Id', 'Date'])
out = (
df
.groupby('Id')['Date']
.agg(
first_occurence = lambda x: x[
(x.diff().dt.days>1)
.astype(int)
.cumsum()
.idxmax()
],
last_occurence = lambda x: x[
(x.diff().dt.days==1)
.astype(int)
.cumsum()
.idxmax()
],
)
)
print(out)

The output of my code comes out too slowly.. How can i speed up my process

Thanks to the help from some users of this sites.
My code seems to work fine, but it's taking too long..
I'm trying to compare two data frames.(df1 has 1,291,250 rows / df2 has 1,286,692 rows)
if df1.iloc[0,0] == df2.iloc[0,0] and df1.iloc[0,1] == df2.iloc[0,1], then compare df1.iloc[0,2], df2.iloc[0,2].
If the first(df1.iloc[0,2]) is larger, I want to put the first index into the list, and if the second(df2.iloc[0,2]) is larger, I want to put the second index into the list.
Example DataFrame
In [1]: df1 = pd.DataFrame([[0, 1, 98], [1, 1, 198], [2, 2, 228]], columns = ['A1', 'B1', 'C1'])
In [2]: df1
Out[3]:
A1 B1 C1
0 0 1 98
1 1 1 198
2 2 2 228
In [4]: df2 = pd.DataFrame([[0, 1, 228], [1, 2, 110], [2, 2, 130]], columns = ['A2', 'B2', 'C2'])
In [5]: df2
Out[6]:
A2 B2 C2
0 0 1 228
1 1 2 110
2 2 2 130
In [7]: def find_high(df1, df2) # def function code is below
Out[8]: ([2], [0]) # The result what i want
This is just simple example. my data is bigger than this
my code is:
for i in range(60):
setattr(mod, f'df_1_{i}', np.array_split(df1, 60)[i])
getattr(mod, f'df_1_{i}').to_pickle(f'df_1_{i}')
import glob
files = glob.glob('df_1_*')
def find_high_pre(df1 = files, df2):
subtract_df2 = []
subtract_df1 = []
same_data = []
for df1_index, line in enumerate(df1.to_numpy()):
for df2_idx, row in enumerate(df2.to_numpy()):
if (line[0:2] == row[0:2]).all():
if line[2] < row[2]:
subtract_df2.append(df2_idx)
break
elif line[2] > row[2]:
subtract_df1.append(df1_idx)
break
else:
continue
break
return df1.iloc[subtract_df1].index.tolist(), df2.iloc[subtract_df2].index.tolist(), df1.iloc[same_data].index.to_list();
data_1 = []
for i in files:
e_data = pd.read_pickle(i)
num_cores = 30
df_split = np.array_split(e_data, num_cores)
data_1 += parmap.map(find_high_pre, df_split, pm_pbar=True, pm_processes =num_cores)
My code seems to work fine, but it's taking too long..
Chances are that replacing your nested for loops with a DataFrame.merge operation will take less time:
keys = ['A', 'B']
df1.columns = [*keys, 'C1']
df2.columns = [*keys, 'C2']
df = df1.reset_index().set_index(keys).merge(
df2.reset_index().set_index(keys), on=keys)
# now we have a merged dataframe like this:
# index_x C1 index_y C2
# A B
# 0 1 0 98 0 228
# 2 2 2 228 2 130
# therefrom we can easily extract the wanted indexes
data = [df.loc[df['C1'] > df['C2'], 'index_x'].values,
df.loc[df['C1'] < df['C2'], 'index_y'].values]

Matching subset of two columns of two different dataframes

Comparing specific columns from two different dataframes. Counting if subset of both dataframe is matching or not matching.
Condition:
If any element of file small['genes of cluster'] is matching with the big['genes of cluster'], output should be: match: 1.
For below example only OR4F16 is matching to both dataframes.
So Output: match: 1; unmatch: 3.
file1: big <tab separated>
cl nP genes of cluster
1 11 DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D
2 4 OR4F16, OR4F3, OR4F29, LOC100132287
3 64 LOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, LINC02593, SAMD11
4 7 GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ
file2: small <tab separated>
cl nP genes of cluster
1 11 A, B, C, D
2 4 OR4F16, X, Y, Z
My Code: Python3
def genes_coordinates(big, small):
b = pd.read_csv(big, header=0, sep="\t")
s = pd.read_csv(small, header=0, sep="\t")
match = 0
unmatch = 0
for index, row in b.iterrows():
if row[row['genes of cluster'].isin(s['genes of cluster'])]:
match+1
else:
unmatch+1
print("match: ", match, "\nunmatch: ", unmatch)
genes_coordinates('big','small')
I would go with a pandas.merge() followed by counting by list comprehension.
import pandas as pd
df1 = pd.DataFrame({'cl':[1,2], 'nP':[11,4], 'gene of cluster':[['A', 'B', 'C', 'D'], ['OR4F16', 'X', 'Y', 'Z']]})
df2 = pd.DataFrame({'cl':[1,2,3,4], 'nP':[11,4,64,7], 'gene of cluster':[['DDX11L1', 'MIR6859-3', 'WASH7P', 'MIR1302-2', 'FAM138C', 'FAM138F', 'FAM138A', 'OR4F5', 'LOC729737', 'LOC102725121', 'FAM138D'], ['OR4F16', 'OR4F3', 'OR4F29', 'LOC100132287'], ['LOC100133331', 'LOC100288069', 'FAM87B', 'LINC00115', 'LINC01128', 'FAM41C', 'LINC02593', 'SAMD11'], ['GNB1', 'CALML6', 'TMEM52', 'CFAP74', 'GABRD', 'LOC105378591', 'PRKCZ']]})
df_m = df1.merge(df2, on=['cl', 'nP'], how='outer')
>>>df_m
cl nP gene of cluster_x gene of cluster_y
0 1 11 [A, B, C, D] [DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138...
1 2 4 [OR4F16, X, Y, Z] [OR4F16, OR4F3, OR4F29, LOC100132287]
2 3 64 NaN [LOC100133331, LOC100288069, FAM87B, LINC00115...
3 4 7 NaN [GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC10537...
# An np.nan value is an outright 'unmatch'
found = []
for x in df_m.index:
if isinstance(df_m.iloc[x]['gene of cluster_x'], float):
found.append(0)
else:
if isinstance(df_m.iloc[x]['gene of cluster_y'], float):
found.append(0)
elif any([y in df_m.iloc[x]['gene of cluster_y'] for y in df_m.iloc[x]['gene of cluster_x']]):
found.append(1)
else:
found.append(0)
# The counts
match = sum(found)
unmatch = len(found) - match

Pandas, concatenating values of columns.

I have found answers to this question on here before, but none of them seem to work for me. Right now I have a data frame with a list of clients and their address. However, each address is separated into many columns and i'm trying to put them all under one.
The code I have so far read as so:
data1_df['Address'] = data1_df['Address 1'].map(str) + ", " + data1_df['Address 2'].map(str) + ", " + data1_df['Address 3'].map(str) + ", " + data1_df['city'].map(str) + ", " + data1_df['city'].map(str) + ", " + data1_df['Province/State'].map(str) + ", " + data1_df['Country'].map(str) + ", " + data1_df['Postal Code'].map(str)
However, the error I get is:
TypeError: Unary plus expects numeric dtype, not object
I'm not sure why it's not accepting the strings as they are and using the + operator. Shouldn't the plus accommodate objects?
Hopefully you'll find this example helpful:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,2,3],
'B': list('ABC'),
'C': [4,5,np.nan],
'D': ['One', np.nan, 'Three']})
addColumns = ['B', 'C', 'D']
df['Address'] = df[addColumns].astype(str).apply(lambda x: ', '.join([i for i in x if i != 'nan']), axis=1)
df
# A B C D Address
#0 1 A 4.0 One A, 4.0, One
#1 2 B 5.0 NaN B, 5.0
#2 3 C NaN Three C, Three
The above will work as str representation of NaN is nan.
Or you can make it with filling NaN with empty strings:
df['Address'] = df[addColumns].fillna('').astype(str).apply(lambda x: ', '.join([i for i in x if i]), axis=1)
In the case of columns with NaN values that you need to add together, here's some logic:
def add_cols_w_nan(df, col_list, space_char, new_col_name):
""" Add together multiple columns where some of the columns
may contain NaN, with the appropriate amount of spacing between columns.
Examples:
'Mr.' + NaN + 'Smith' becomes 'Mr. Smith'
'Mrs.' + 'J.' + 'Smith' becomes 'Mrs. J. Smith'
NaN + 'J.' + 'Smith' becomes 'J. Smith'
Args:
df: pd.DataFrame
DataFrame for which strings are added together.
col_list: ORDERED list of column names, eg. ['first_name',
'middle_name', 'last_name']. The columns will be added in order.
space_char: str
Character to insert between concatenation of columns.
new_col_name: str
Name of the new column after adding together strings.
Returns: pd.DataFrame with a string addition column
"""
df2 = df[col_list].copy()
# Convert to strings, leave nulls alone
df2 = df2.where(df2.isnull(), df2.astype('str'))
# Add space character, NaN remains NaN, which is important
df2.loc[:, col_list[1:]] = space_char + df2.loc[:, col_list[1:]]
# Fix rows where leading columns are null
to_fix = df2.notnull().idxmax(1)
for col in col_list[1:]:
m = to_fix == col
df2.loc[m, col] = df2.loc[m, col].str.replace(space_char, '')
# So that summation works
df2[col_list] = df2[col_list].replace(np.NaN, '')
# Add together all columns
df[new_col_name] = df2[col_list].sum(axis=1)
# If all are missing replace with missing
df[new_col_name] = df[new_col_name].replace('', np.NaN)
del df2
return df
Sample Data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Address 1': ['AAA', 'ABC', np.NaN, np.NaN, np.NaN],
'Address 2': ['foo', 'bar', 'baz', None, np.NaN],
'Address 3': [np.NaN, np.NaN, 17, np.NaN, np.NaN],
'city': [np.NaN, 'here', 'there', 'anywhere', np.NaN],
'state': ['NY', 'TX', 'WA', 'MI', np.NaN]})
# Address 1 Address 2 Address 3 city state
#0 AAA foo NaN NaN NY
#1 ABC bar NaN here TX
#2 NaN baz 17.0 there WA
#3 NaN None NaN anywhere MI
#4 NaN NaN NaN NaN NaN
df = add_cols_w_nan(
df,
col_list = ['Address 1', 'Address 2', 'Address 3', 'city', 'state'],
space_char = ', ',
new_col_name = 'full_address')
df.full_address.tolist()
#['AAA, foo, NY',
# 'ABC, bar, here, TX',
# 'baz, 17.0, there, WA',
# 'anywhere, MI',
# nan]

New column based on a row with conditions in Pandas

I'm trying to do an operation with Dataframes but i'm not sure how I can solve the problem using the built-in Pandas Operations (Actualy my code is based on a for so I'm trying to build a more elegant solution).
Given the following Dataframes, defined with the columns described below
original_df = [o1, o2, o3, o4]
weights_df = [w1, w2, w3, w4]
conditions_df = [c1, c2, c3, c4]
I need to built a new column on original_df based on the division of o1/w1 but depending on the value of c1, with takes the values ["+" or "-" I need to do the -o1/w1 operation.
As long as I did was:
orignal_df['newcolumn'] = original_df / weights_df
Where of course I divided the two terms but without applying the condition, I'm trying to do with map and apply functions but I'm not sure how I can add the third column into the function.
original_df = [100, 200, 300, 400]
weights_df = [10, 20, 30, 40]
conditions_df = [1, 2, 3, 4]
df = pd.DataFrame({'x':original_df, 'y':weights_df, 'z':conditions_df})
def div(x, y, z):
if z > 2:
return float(x/y)
else:
return float(-1*x/y)
df['new_feature'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
This is one way of solving. If your conditions_df contains '+'/'-' then you can change the condition in def div(x, y, z) accordingly.
You can use numpy.where for mask by condition:
#data from lisa answer
#df = pd.DataFrame({'x':original_df, 'y':weights_df, 'z':conditions_df})
df['new_feature'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
print (df)
x y z new_feature
0 100 10 1 -10.0
1 200 20 2 -10.0
2 300 30 3 10.0
3 400 40 4 10.0
Timings:
#4k rows
df = pd.concat([df]*1000).reset_index(drop=True)
#lisa answer
In [95]: %timeit df['new_feature1'] = df.apply(lambda p: div(p['x'], p['y'], p['z']), axis=1)
10 loops, best of 3: 123 ms per loop
In [96]: %timeit df['new_feature2'] = df['x'] / df['y'] * np.where(df['z'] > 2, 1, -1)
1000 loops, best of 3: 595 µs per loop

Resources