I have two very large tables df1 and df2 (multiple millions of rows each) of person-related data and each table has a column that contains the name of a person (column name: "Name"). The names of one and the same person can be written differently (e.g. "Jeff McGregor" or "Mr. J McGregor", etc.) among the two tables, which is why I want to apply fuzzy string matching with the fuzzywuzzy package in Python (this simply compares two strings and returns a similarity measure).
As an output (see df3 for the desired output table), I would like to fill the "Match_Flag" and the "Match_List" columns in df1 according to the entries in df2. For every (unique) person in df1, I want to check if there are (fuzzy string) matches in df2. If there is a match, the column "Match_Flag" should contain a "yes" and if not, a "no". The "Match_List" column should contain for every name a list of matches. If there is one match, the list would contain one entry and if there are e.g. three matches, the list would contain three entries. If there is no match, the list should just be empty.
This is the data:
data_df1 = {'ID':[56382, 34732, 12423, 29574, 76532],
'Name':['Tom Hilley', 'Andreas Puthz', 'Jeff McGregor', 'Jack Ebbstein', 'Lisa Norwat'],
'Match_Flag':["", "", "", "", ""],
'Match_List':["", "", "", "", ""]}
df1 = pd.DataFrame(data_df1)
ID Name Match_Flag Match_List
0 56382 Tom Hilley
1 34732 Andreas Puthz
2 12423 Jeff McGregor
3 29574 Jack Ebbstein
4 76532 Lisa Norwat
data_df2 = {'Name':['Tom Hilley', 'Madalina Peter', 'Russel Cross', 'Jenni Pey', 'Kanush Hawks', 'Mr. J McGregor', 'Ebbstein Jack', 'Mr. Jack Ebbstein'],
'Age':[16, 56, 33, 44, 24, 26, 86, 32]}
df2 = pd.DataFrame(data_df2)
Name Age
0 Tom Hilley 16
1 Madalina Peter 56
2 Russel Cross 33
3 Jenni Pey 44
4 Kanush Hawks 24
5 Mr. J McGregor 26
6 Ebbstein Jack 86
7 Mr. Jack Ebbstein 32
data_df3 = {'ID':[56382, 34732, 12423, 29574, 76532],
'Name':['Tom Hilley', 'Andreas Puthz', 'Jeff McGregor', 'Jack Ebbstein', 'Lisa Norwat'],
'Match_Flag':["yes", "no", "yes", "yes", "no"],
'Match_List':[["Tom Hilley"], [], ["Mr. J McGregor"], ["Ebbstein Jack","Mr. Jack Ebbstein"], []]}
df3 = pd.DataFrame(data_df3)
ID Name Match_Flag Match_List
0 56382 Tom Hilley yes [Tom Hilley]
1 34732 Andreas Puthz no []
2 12423 Jeff McGregor yes [Mr. J McGregor]
3 29574 Jack Ebbstein yes [Ebbstein Jack, Mr. Jack Ebbstein]
4 76532 Lisa Norwat no []
My approach:
# import libraries
import pandas as pd
from fuzzywuzzy import fuzz
# create matching
for i in df1["Name"].unique().tolist():
    # initialize matching list
    matching_list = []
    for j in df2["Name"].unique().tolist():
        # create matching score
        if fuzz.token_set_ratio(i, j) >= 90:
            matching_list.append(j)
    # create red flags
    if matching_list:
        df1.loc[df1['Name'] == i,'Match_Flag'] = 'yes'
        df1.loc[df1['Name'] == i,'Match_List'] = matching_list
    else:
        df1.loc[df1['Name'] == i,'Match_Flag'] = 'no'
        df1.loc[df1['Name'] == i,'Match_List'] = ["-"]
Output of my approach:
line 611, in _setitem_with_indexer
raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable
Since my approach is 1. not working and 2. it will be way too slow for millions of rows, I ask you to help me and find a more efficient and working approach please.
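One way to avoid both the ValueError and the repeated .loc assignments is to collect the matches in a dictionary first and fill the two columns once at the end. The sketch below is only an illustration under stated assumptions: it uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.token_set_ratio, the helper names token_sort and similarity are made up, and the 0.9 cutoff is arbitrary. It still compares every pair of names, so it does not by itself solve the runtime problem for millions of rows.

```python
import difflib

import pandas as pd


def token_sort(name):
    # Lowercase, drop periods, and sort the words so that word order is ignored.
    return " ".join(sorted(name.lower().replace(".", "").split()))


def similarity(a, b):
    # difflib ratio in [0, 1]; an assumed stand-in for fuzz.token_set_ratio.
    return difflib.SequenceMatcher(None, token_sort(a), token_sort(b)).ratio()


df1 = pd.DataFrame({"ID": [56382, 34732], "Name": ["Tom Hilley", "Andreas Puthz"]})
df2 = pd.DataFrame({"Name": ["Tom Hilley", "Madalina Peter", "Hilley Tom"]})

# Build name -> list of matches once per unique name.
candidates = df2["Name"].unique()
matches = {
    name: [c for c in candidates if similarity(name, c) >= 0.9]
    for name in df1["Name"].unique()
}

# Assign whole columns instead of writing lists through .loc row by row.
df1["Match_List"] = df1["Name"].map(lambda n: matches[n])
df1["Match_Flag"] = df1["Match_List"].map(lambda m: "yes" if m else "no")
print(df1)
```

Mapping a function over the Name column stores one list object per row, which is what the .loc assignment in the question could not do.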

This answer might take a while to run, but should work.
I imported names to create larger dataframes with random names.
import pandas as pd
from fuzzywuzzy import fuzz
import random
import os
import names
id_col = range(10000)
name_col = [names.get_full_name() for _ in range(10000)]
df1 = pd.DataFrame({'ID':id_col, 'name_col':name_col})
age = [random.randint(1, 95) for _ in range(10000)]
name_col2 = [names.get_full_name() for _ in range(10000)]
df2 = pd.DataFrame({'name_col2':name_col2, 'age':age})
Since we want to iterate through df1, I dropped duplicates of the name column. We're going to do a cross join to bring the whole row of the dataframe into the 2nd dataframe, so I assigned v=1
df1_deduped = df1.drop_duplicates('name_col')
df2 = df2.assign(v=1)
define the fuzzy function to use in .apply
def func(row):
    return fuzz.token_set_ratio(row['name_col'], row['name_col2'])
Here we're going to loop through the length of the deduplicated first dataframe, and for every row (unique name), we're joining it to the 2nd dataframe. We then .apply the fuzzy function to create a tokenthresh column, and filter the dataframe down to rows above the threshold of 70. If there are any matches, they are written to a csv. This way it's not all done in memory, which will most likely be an issue for you with multi-million row dataframes on both sides; the work is chunked into pieces. Alternatively, instead of going row by row into a million row dataframe, you could do it in 5s or 10s. That could slow it down, I'm not sure.
for i in range(len(df1_deduped)):
    # take row i of the deduplicated frame and cross join it with df2 via the helper column v
    df3 = pd.merge(df1_deduped.assign(v=1).iloc[[i],:], df2, on='v').drop(['v'], axis=1)
    df3['tokenthresh'] = df3.apply(func, axis=1)
    df3 = df3[df3.tokenthresh > 70]
    print('there are', len(df3), 'records that exceeded the threshold')
    if len(df3) > 0:
        df3.to_csv(str(i)+'.csv', index=False)
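The idea of processing several names at a time instead of one row per iteration can be sketched as follows. This is an illustration, not the code used in this answer: difflib stands in for fuzz.token_set_ratio, the function names score and chunked_matches are hypothetical, the chunk size of 2 is arbitrary, and merge(how="cross") assumes pandas 1.2 or later.

```python
import difflib

import pandas as pd


def score(a, b):
    # Similarity on a 0-100 scale; difflib is an assumed stand-in for fuzzywuzzy.
    return 100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def chunked_matches(left, right, chunk_size=2, threshold=70):
    """Yield the pairs above the threshold, one cross-joined chunk of `left` at a time."""
    for start in range(0, len(left), chunk_size):
        chunk = left.iloc[start:start + chunk_size]
        pairs = chunk.merge(right, how="cross")  # every chunk row against every right row
        pairs["tokenthresh"] = [
            score(a, b) for a, b in zip(pairs["name_col"], pairs["name_col2"])
        ]
        yield pairs[pairs["tokenthresh"] > threshold]


left = pd.DataFrame({"ID": [0, 1, 2], "name_col": ["Ann Lee", "Bob Ray", "Cy Dunn"]})
right = pd.DataFrame({"name_col2": ["Ann Lee", "Bob Rae", "Zed Quix"], "age": [30, 41, 52]})

# Keep only chunks that produced matches, mirroring the len(df3) > 0 check above.
parts = [m for m in chunked_matches(left, right) if not m.empty]
result = pd.concat(parts, ignore_index=True)
print(result)
```

Each chunk only materialises chunk_size * len(right) rows, so the chunk size trades memory for the number of iterations.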
We then can read in the files that were created:
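files = []
for file in os.listdir():
    # collect only the csv files written by the loop above
    if file.endswith('.csv'):
        files.append(pd.read_csv(file))
data = pd.concat(files)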
and lastly concat the different answers:
data['concat_group'] = data.groupby(['ID', 'name_col'])['name_col2'].transform(lambda x: ', '.join(x))
data = data.drop_duplicates(['ID', 'name_col'])

Based on this topic, I believe merging the two dataframes is a lot more efficient than iterating through the whole data.
Since you want matched names, you should use an inner join.


Merge two Dataframes in combination with .isin() or .contains() or difflib? [duplicate]

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib's).
Say one DataFrame has the following data:
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
one 1
two 2
three 3
four 4
five 5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
one a
too b
three c
fours d
five e
Then I want to get the resulting DataFrame
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
one a
two b
three c
four d
five e
In [31]: df1.join(df2)
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
If these were columns, in the same vein you could apply to the column then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
Using fuzzywuzzy
Since there are no examples with the fuzzywuzzy package, here's a function I wrote which will return all matches based on a threshold you can set as a user:
Example dataframe
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
# df1
0 Apple
1 Banana
2 Orange
3 Strawberry
# df2
0 Aple
1 Mango
2 Orag
3 Straw
4 Bannanna
5 Berry
Function for fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1
Using our function on the dataframes: #1
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
Using our function on the dataframes: #2
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
Col1 matches
0 Microsoft Mcrsoft
1 Google gogle
2 Amazon Amason
pip install fuzzywuzzy
conda install -c conda-forge fuzzywuzzy
I have written a Python package which aims to solve this problem:
pip install fuzzymatcher
You can find the repo here and docs here.
Basic usage:
Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:
import fuzzymatcher
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Or if you just want to link on the closest match:
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package:
import jellyfish
import pandas
def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        # jaro_winkler_similarity is the current name; older jellyfish releases called it jaro_winkler
        current_score = jellyfish.jaro_winkler_similarity(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match
df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
For a general approach: fuzzy_merge
For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge but with fuzzy matching:
import difflib
def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
df_other= df2.copy()
df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
for x in df_other[right_on]]
return df1.merge(df_other, on=left_on, how=how)
def get_closest_match(x, other, cutoff):
matches = difflib.get_close_matches(x, other, cutoff=cutoff)
return matches[0] if matches else None
Here are some use cases with two sample dataframes:
key number
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
key_close letter
0 three c
1 one a
2 too b
3 fours d
4 a very different string e
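To reproduce the outputs that follow, the two sample frames can be built as shown here. This is a sketch inferred from the printed tables; the variable names df1 and df2 are chosen to match the calls in this answer.

```python
import pandas as pd

# Left frame: exact keys with their numbers.
df1 = pd.DataFrame({
    "key": ["one", "two", "three", "four", "five"],
    "number": [1, 2, 3, 4, 5],
})

# Right frame: slightly different spellings, plus one string with no close match.
df2 = pd.DataFrame({
    "key_close": ["three", "one", "too", "fours", "a very different string"],
    "letter": ["c", "a", "b", "d", "e"],
})

print(df1)
print(df2)
```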
With the above example, we'd get:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
And we could do a left join with:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
key number key_close letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 NaN NaN
For a right join, all non-matching keys in the left dataframe would be set to None:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
key number key_close letter
0 one 1.0 one a
1 two 2.0 too b
2 three 3.0 three c
3 four 4.0 fours d
4 None NaN a very different string e
Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to say:
one a
too b
three c
fours d
a very different string e
We'd get an index out of range error:
df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
IndexError: list index out of range
In order to solve this, the function get_closest_match above returns the closest match by indexing the list returned by difflib.get_close_matches only if it actually contains any matches.
As far as I know, pandas' merge does not have a hook function to do this on the fly. Would be nice though...
I would just do a separate step and use difflib's get_close_matches to create a new column in one of the 2 dataframes, and then merge/join on the fuzzy matched column.
I used Fuzzymatcher package and this worked well for me. Visit this link for more details on this.
use the below command to install
pip install fuzzymatcher
Below is the sample Code (already submitted by RobinL above)
import fuzzymatcher
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Errors you may get
ZeroDivisionError: float division by zero ---> Refer to this link to resolve it
OperationalError: No Such Module:fts4 --> download the sqlite3.dll from here and replace the DLL file in your python or anaconda DLLs folder.
Pros :
Works faster. In my case, I compared one dataframe with 3000 rows with another dataframe with 170,000 records. This also uses SQLite3 search across text, so it is faster than many other approaches.
Can check across multiple columns and 2 dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name might be the same, but the address is a good thing to check too.
Gives you a score for all the closest matches for the same record. You choose what the cutoff score is.
Cons :
Original package installation is buggy
Requires C++ and Visual Studio to be installed too
Won't work for 64 bit anaconda/Python
There is a package called fuzzy_pandas that can use levenshtein, jaro, metaphone and bilenko methods. There are some great examples here.
import pandas as pd
import fuzzy_pandas as fpd
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
results = fpd.fuzzy_merge(df1, df2,
Key Key
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
As a heads up, this basically works, except if no match is found, or if you have NaNs in either column. Instead of directly applying get_close_matches, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.
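def fuzzy_match(a, b):
    # replace NaNs with placeholders that cannot match each other
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    # np.nan is the spelling that also works on NumPy 2.0 and later
    return out[0] if out else np.nan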
You can use d6tjoin for that
import d6tjoin.top1
index number index_right letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 five e
It has a variety of additional features such as:
check join quality, pre and post join
customize similarity function, eg edit distance vs hamming distance
specify max distance
multi-core compute
For details see
MergeTop1 examples - Best match join examples notebook
PreJoin examples - Examples for diagnosing join problems
I have used fuzzywuzzy in a very minimal way whilst matching the existing behaviour and keywords of merge in pandas.
Just specify your accepted threshold for matching (between 0 and 100):
from fuzzywuzzy import process
def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):
    def fuzzy_apply(x, df, column, threshold=threshold):
        if type(x)!=str:
            return None
        match, score, *_ = process.extract(x, df[column], limit=1)[0]
        if score >= threshold:
            return match
        return None
    if on is not None:
        left_on = on
        right_on = on
    # create temp column as the best fuzzy match (or None!)
    df2['tmp'] = df2[right_on].apply(
        fuzzy_apply,
        df=df,
        column=left_on,
        threshold=threshold
    )
    merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
    del merged_df['tmp']
    return merged_df
Try it out using the example data:
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
fuzzy_merge(df1, df2, on='Key', threshold=80)
Using thefuzz
Using SeatGeek's great package thefuzz, which makes use of Levenshtein distance. This works with data held in columns. It adds matches as rows rather than columns, to preserve a tidy dataset, and allows additional columns to be easily pulled through to the output dataframe.
Sample data
df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})
col_a col_b
0 one 1
1 two 2
2 three 3
3 four 4
4 five 5
df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})
col_a col_b
0 one a
1 too b
2 three c
3 fours d
4 five e
Function used to do the matching
def fuzzy_match(
    df_left, df_right, column_left, column_right, threshold=90, limit=1
):
    # Create a series
    series_matches = df_left[column_left].apply(
        lambda x: process.extract(x, df_right[column_right], limit=limit) # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
    )
    # Convert matches to a tidy dataframe
    df_matches = series_matches.to_frame()
    df_matches = df_matches.explode(column_left) # Convert list of matches to rows
    df_matches[
        ['match_string', 'match_score', 'df_right_id']
    ] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index) # Convert match tuple to columns
    df_matches.drop(column_left, axis=1, inplace=True) # Drop column of match tuples
    # Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
    if df_matches.index.name:
        index_name = df_matches.index.name # Stash index name
    else:
        index_name = 'index' # Default used by pandas
    df_matches.reset_index(inplace=True)
    df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True) # The previous index has now become a column: rename for ease of reference
    # Drop matches below threshold
    df_matches.drop(
        df_matches.loc[df_matches['match_score'] < threshold].index,
        inplace=True
    )
    return df_matches
Use function and merge data
import pandas as pd
from thefuzz import process
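df_matches = fuzzy_match(df1, df2, 'col_a', 'col_a', threshold=60, limit=1)
df_output = df1.merge(
    df_matches, how='left', left_index=True, right_on='df_left_id'
).merge(
    df2, how='left', left_on='df_right_id', right_index=True,
    suffixes=['_df1', '_df2']
)
df_output.set_index('df_left_id', inplace=True) # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table
df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']] # Drop columns used in the matching
df_output.index.name = 'id'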
id col_a_df1 col_b_df1 col_b_df2
0 one 1 a
1 two 2 b
2 three 3 c
3 four 4 d
4 five 5 e
Tip: Fuzzy matching using thefuzz is much quicker if you optionally install the python-Levenshtein package too.
For more complex use cases to match rows with many columns you can use recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas data frames which helps to deduplicate your data when merging. I have written a detailed article about the package here
If the join axis is numeric, this could also be used to match indexes with a specified tolerance:
def fuzzy_left_join(df1, df2, tol=None):
    index1 = df1.index.values
    index2 = df2.index.values
    diff = np.abs(index1.reshape((-1, 1)) - index2)
    mask_j = np.argmin(diff, axis=1) # position of the closest df2 index for each df1 row
    mask_i = np.arange(mask_j.shape[0])
    df1_ = df1.iloc[mask_i]
    df2_ = df2.iloc[mask_j]
    if tol is not None:
        mask = np.abs(df2_.index.values - df1_.index.values) <= tol
        df1_ = df1_.loc[mask]
        df2_ = df2_.loc[mask]
    df2_.index = df1_.index
    out = pd.concat([df1_, df2_], axis=1)
    return out
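pandas also ships pd.merge_asof, which covers a similar nearest-key join with a tolerance. The sketch below is an alternative to the function above, not a rewrite of it: it matches on a column called x (a made-up name) instead of the index, the data is invented, and both frames must already be sorted by the key.

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 5.0, 10.0], "a": ["p", "q", "r"]})
right = pd.DataFrame({"x": [1.1, 4.0, 20.0], "b": ["u", "v", "w"]})

# For each left key take the nearest right key, but only if it is within 0.5.
out = pd.merge_asof(left, right, on="x", direction="nearest", tolerance=0.5)
print(out)
```

Rows whose nearest neighbour lies outside the tolerance are kept with NaN in the right-hand columns, so this behaves like a left join.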
thefuzz is the new name of the fuzzywuzzy package.
In order to fuzzy-join string-elements in two big tables you can do this:
Use apply to go row by row
Use swifter to parallel, speed up and visualize default apply function (with colored progress bar)
Use OrderedDict from collections to get rid of duplicates in the output of merge and keep the initial order
Increase limit in thefuzz.process.extract to see more options for merge (stored in a list of tuples with % of similarity)
'*' You can use thefuzz.process.extractOne instead of thefuzz.process.extract to return just one best-matched item (without specifying any limit). However, be aware that several results could have same % of similarity and you will get only one of them.
'**' Somehow the swifter takes a minute or two before starting the actual apply. If you need to process small tables you can skip this step and just use progress_apply instead
from thefuzz import process
from collections import OrderedDict
import swifter
def match(x):
    matches = process.extract(x, df1, limit=6)
    matches = list(OrderedDict((x, True) for x in matches).keys())
    print(f'{x:20} : {matches}')
    return str(matches)
df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))

return records with first value from column partitioned by other columns [duplicate]

The pandas drop_duplicates function is great for "uniquifying" a dataframe. I would like to drop all rows which are duplicates across a subset of columns. Is this possible?
     A  B  C
0  foo  0  A
1  foo  1  A
2  foo  1  B
3  bar  1  A
As an example, I would like to drop rows which match on columns A and C so this should drop rows 0 and 1.
This is much easier in pandas now with drop_duplicates and the keep parameter.
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)
Just want to add to Ben's answer on drop_duplicates:
keep : {‘first’, ‘last’, False}, default ‘first’
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
So setting keep to False will give you desired answer.
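A short sketch comparing the three keep values on the frame from the question may make the difference concrete. The variable names first, last and none are only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar"],
    "B": [0, 1, 1, 1],
    "C": ["A", "A", "B", "A"],
})

# Rows 0 and 1 share the same (A, C) pair, so they are duplicates of each other.
first = df.drop_duplicates(subset=["A", "C"], keep="first")  # keeps row 0, drops row 1
last = df.drop_duplicates(subset=["A", "C"], keep="last")    # keeps row 1, drops row 0
none = df.drop_duplicates(subset=["A", "C"], keep=False)     # drops both rows 0 and 1

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
```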
DataFrame.drop_duplicates(*args, **kwargs)
Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters:
subset : column label or sequence of labels, optional. Only consider certain columns for identifying duplicates, by default use all of the columns
keep : {‘first’, ‘last’, False}, default ‘first’. first : Drop duplicates except for the first occurrence. last : Drop duplicates except for the last occurrence. False : Drop all duplicates.
take_last : deprecated
inplace : boolean, default False. Whether to drop duplicates in place or to return a copy
cols : kwargs only argument of subset [deprecated]
Returns: deduplicated :
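If you want the result to be stored in another dataset:
new_df = df.drop_duplicates(keep=False, inplace=False)
If the same dataset needs to be updated:
df.drop_duplicates(keep=False, inplace=True)
With keep=False, the examples above remove every row that has a duplicate and keep none of the copies. To keep one row per group of duplicates, similar to DISTINCT * in SQL, use the default keep='first'.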
use groupby and filter
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.groupby(["A", "C"]).filter(lambda df:df.shape[0] == 1)
Try these various things
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar","foo"], "B":[0,1,1,1,1], "C":["A","A","B","A","A"]})
>>>df.drop_duplicates( "A" , keep='first')
>>>df.drop_duplicates( keep='first')
>>>df.drop_duplicates( keep='last')
Actually, dropping rows 0 and 1 only requires the following (no observation containing a matched A and C is kept):
In [335]:
df['AC'] = df.A + df.C
In [336]:
print(df.drop_duplicates('C', keep='last')) # this dataset is a special case; in general, one may need to first drop_duplicates by 'C' and then by 'A'.
     A  B  C    AC
2  foo  1  B  fooB
3 bar 1 A barA
[2 rows x 4 columns]
But I suspect what you really want is this (one observation containing matched A and C is kept.):
In [337]:
print(df.drop_duplicates('AC'))
0 foo 0 A fooA
2 foo 1 B fooB
3 bar 1 A barA
[3 rows x 4 columns]
Now it is much clearer, therefore:
In [352]:
DG=df.groupby(['A', 'C'])
print(pd.concat([DG.get_group(item) for item, value in DG.groups.items() if len(value)==1]))
2 foo 1 B
3 bar 1 A
[2 rows x 3 columns]
You can use duplicated() to flag all duplicates and filter out flagged rows. If you need to assign columns to new_df later, make sure to call .copy() so that you don't get SettingWithCopyWarning later on.
new_df = df[~df.duplicated(subset=['A', 'C'], keep=False)].copy()
One nice feature of this method is that you can conditionally drop duplicates with it. For example, to drop all duplicated rows only if column A is equal to 'foo', you can use the following code.
new_df = df[~( df.duplicated(subset=['A', 'B', 'C'], keep=False) & df['A'].eq('foo') )].copy()
Also, if you don't wish to write out columns by name, you can pass slices of df.columns to subset=. This is also true for drop_duplicates() as well.
# to consider all columns for identifying duplicates
df[~df.duplicated(subset=df.columns, keep=False)].copy()
# the same is true for drop_duplicates
df.drop_duplicates(subset=df.columns, keep=False)
# to consider columns in positions 0 and 2 (i.e. 'A' and 'C') for identifying duplicates
df.drop_duplicates(subset=df.columns[[0, 2]], keep=False)
If you want to check 2 columns with try and except statements, this one can help out.
if "column_2" in df.columns:
    try:
        df[['column_1', "column_2"]] = df[['header', "column_2"]].drop_duplicates(subset = ["column_2", "column_1"] ,keep="first")
    except KeyError:
        df[["column_2"]] = df[["column_2"]].drop_duplicates(subset="column_2" ,keep="first")
        print(f"No column_1 for {path}.")
else:
    try:
        df[["column_1"]] = df[["column_1"]].drop_duplicates(subset="column_1" ,keep="first")
    except KeyError:
        print(f"No column_1 or column_2 for {path}.")

Adding a row to existing dataframe [duplicate]

How do I create an empty DataFrame, then add rows, one by one?
I created an empty DataFrame:
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
Then I can add a new row at the end and fill a single field with:
df = df._set_value(index=len(df), col='qty1', value=10.0)
It works for only one field at a time. What is a better way to add new row to df?
You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.
>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
>>> df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
>>> df
lib qty1 qty2
0 name0 3 3
1 name1 2 4
2 name2 2 8
3 name3 2 1
4 name4 9 6
In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:
Create a list of dictionaries in which each dictionary corresponds to an input data row.
Create a data frame from this list.
I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.
rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(row)  # assumes each input row is already dict-like
    rows_list.append(dict1)
df = pd.DataFrame(rows_list)
In the case of adding a lot of rows to dataframe, I am interested in performance. So I tried the four most popular methods and checked their speed.
Using .append (NPE's answer)
Using .loc (fred's answer)
Using .loc with preallocating (FooBar's answer)
Using dict and create DataFrame in the end (ShikharDua's answer)
Runtime results (in seconds):
1000 rows
5000 rows
10 000 rows
.loc without prealloc
.loc with prealloc
So I use addition through the dictionary for myself.
import pandas as pd
import numpy as np
import time
del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    # DataFrame.append was removed in pandas 2.0, so this variant only runs on older versions
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
P.S.: I believe my realization isn't perfect, and maybe there is some optimization that could be done.
You could use pandas.concat(). For details and examples, see Merge, join, and concatenate.
For example:
def append_row(df, row):
    return pd.concat([
        df,
        pd.DataFrame([row], columns=row.index)]
    ).reset_index(drop=True)
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
new_row = pd.Series({'lib':'A', 'qty1':1, 'qty2': 2})
df = append_row(df, new_row)
NEVER grow a DataFrame!
Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?
Here are the most important reasons, taken from my post here.
It is always cheaper/faster to append to a list and create a DataFrame in one go.
Lists take up less memory and are a much lighter data structure to work with, append, and remove.
dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.
This is The Right Way™ to accumulate your data
data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
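A runnable version of the same pattern, with a made-up generator standing in for some_function_that_yields_data (its body is purely an assumption for the demo), also shows the dtype inference mentioned above:

```python
import pandas as pd


def some_function_that_yields_data():
    # Hypothetical data source: an int, a float and a string per row.
    for i in range(3):
        yield i, i * 0.5, f"row{i}"


# Accumulate plain Python lists, then build the frame once.
data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=["A", "B", "C"])
print(df.dtypes)
```

Because the frame is built in one go, column A comes out as an integer column and column B as a float column without any explicit casting.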
These options are horrible
append or concat inside a loop
append and concat aren't inherently bad in isolation. The
problem starts when you iteratively call them inside a loop - this
results in quadratic memory usage.
# Creates empty DataFrame and appends
# (DataFrame.append was removed in pandas 2.0; shown here as the anti-pattern)
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)
    # This is equally bad:
    # df = pd.concat(
    #     [df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])],
    #     ignore_index=True)
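To see where the quadratic cost comes from, assume every append or concat call builds a new frame and copies all rows accumulated so far. Under that simplified model, which ignores per-call overhead, adding n rows one at a time copies 0 + 1 + ... + (n - 1) = n(n - 1)/2 rows in total. The helper name below is made up for the illustration.

```python
def rows_copied_when_growing(n):
    """Total rows copied when the frame is rebuilt for each of n single-row appends."""
    # Before the k-th append (k = 0 .. n-1) the frame already holds k rows to copy.
    return sum(range(n))


for n in (1_000, 10_000):
    print(n, rows_copied_when_growing(n))
```

Ten times more rows means roughly a hundred times more copying, whereas appending to a list and building the frame once touches each row a constant number of times.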
Empty DataFrame of NaNs
Never create a DataFrame of NaNs as the columns are initialized with
object (slow, un-vectorizable dtype).
# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.
If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):
import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )
# now fill it up row by row
for x in np.arange(0, numberOfRows):
    # loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
lib qty1 qty2
0 -1 -1 -1
1 0 0 0
2 -1 0 -1
3 0 -1 0
4 -1 0 0
Speed comparison
In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, #fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop
And - as from the comments - with a size of 6000, the speed difference becomes even larger:
Increasing the size of the array (12) and the number of rows (500) makes
the speed difference more striking: 313ms vs 2.29s
mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row
You can append a single row as a dictionary using the ignore_index option.
>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
Animal Color
0 cow blue
1 horse red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
Animal Color
0 cow blue
1 horse red
2 mouse black
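DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above fails on current versions. A minimal sketch of the same single-row addition with pd.concat, reusing the sample frame from this answer (new_row is an illustrative name):

```python
import pandas as pd

f = pd.DataFrame({'Animal': ['cow', 'horse'], 'Color': ['blue', 'red']})
new_row = {'Animal': 'mouse', 'Color': 'black'}

# Wrap the dict in a one-row DataFrame, then concatenate and renumber the index.
f = pd.concat([f, pd.DataFrame([new_row])], ignore_index=True)
print(f)
```

As with append, this copies the whole frame on every call, so it is fine for a single row but not inside a loop.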
For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.
Add rows by assigning through loc to an index key that does not exist yet (older pandas versions also allowed ix, which has since been removed). For example:
In [1]: se = pd.Series([1,2,3])
In [2]: se
0 1
1 2
2 3
dtype: int64
In [3]: se[5] = 5.
In [4]: se
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
   .....:                 columns=['A','B'])
In [2]: dfi
   A  B
0  0  1
1  2  3
2  4  5
In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']
In [4]: dfi
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
In [5]: dfi.loc[3] = 5
In [6]: dfi
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5
For the sake of a Pythonic way:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
lib qty1 qty2
0 NaN 10.0 NaN
You can also build up a list of lists and convert it to a dataframe -
import pandas as pd
columns = ['i','double','square']
rows = []
for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)  # collect each row in the list
df = pd.DataFrame(rows, columns=columns)
i double square
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 9
4 4 8 16
5 5 10 25
If you always want to add a new row at the end, use this:
df.loc[len(df)] = ['name5', 9, 0]
I figured out a simple and nice way:
>>> df
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
one 1 2 3
two 4 5 6
Note the caveat with performance as noted in the comments.
This is not an answer to the OP question, but a toy example to illustrate ShikharDua's answer which I found very useful.
While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the statistics below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you ShikharDua!
import pandas as pd
BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
'Territory' : ['West','East','South','West','East','South'],
'Product' : ['Econ','Luxe','Econ','Std','Std','Econ']})
columns = ['Customer','Num Unique Products', 'List Unique Products']
rows_list = []  # collect one dict per customer
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd = {}  # initialise an empty dict
    RecordtoAdd.update({'Customer' : name})
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})
    rows_list.append(RecordtoAdd)
AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)
You can use a generator object to create a DataFrame, which will be more memory efficient than a list.
num = 10
# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (it can be consumed only once and cannot be reused)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
To add a row to an existing DataFrame you can use the append method.
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])
Instead of a list of dictionaries as in ShikharDua's answer (row-based), we can also represent our table as a dictionary of lists (column-based), where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.
In both cases, the dictionary keys are always the column names. Row order is stored implicitly as order in a list. For c columns and n rows, this uses one dictionary of c lists, versus one list of n dictionaries. The list-of-dictionaries method has each dictionary storing all keys redundantly and requires creating a new dictionary for every row. Here we only append to lists, which overall is the same time complexity (adding entries to list and dictionary are both amortized constant time) but may have less overhead due to being a simple operation.
# Current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# Adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# At the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
Create a new record (data frame) and add to old_data_frame.
Pass a list of values and the corresponding column names to create a new_record (data_frame):
new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=['a', 'b', 'c', 'd', 'e'])
old_data_frame = pd.concat([old_data_frame, new_record])
Here is the way to add/append a row in a Pandas DataFrame:
def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1
    return df.sort_index()
add_row(df, [1,2,3])
It can be used to insert/append a row in an empty or populated Pandas DataFrame.
If you want to add a row at the end, append it as a list:
valuestoappend = [val1, val2, val3]
res = res.append(pd.Series(valuestoappend, index = ['lib', 'qty1', 'qty2']), ignore_index = True)
Another way to do it (probably not very performant):
# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))
You can also enhance the DataFrame class like this:
import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row
All you need is loc[df.shape[0]] or loc[len(df)]
# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]
df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]
You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index (not numeric).
So, I put the data for the new row in a dict and its index label in a list.
new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
You can use a for loop to iterate through values or can add arrays of values.
val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
1 11 15 21
2 12 16 22
3 13 17 43
Make it simple. By taking a list as input which will be appended as a row in the data-frame:
import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)
DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
With ignore_index set to True:
df.append(df2, ignore_index=True)
If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:
df.loc[len(df)] = new_list
If you want to add a new data frame new_df under data frame df, then you can use:
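A minimal sketch of stacking new_df under df with pd.concat, assuming both frames share the same columns and that a fresh 0..n-1 index is wanted (the sample values below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])      # hypothetical existing frame
new_df = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])  # hypothetical frame to add

# Stack new_df below df; ignore_index=True renumbers the rows 0..n-1.
df = pd.concat([df, new_df], ignore_index=True)
print(df)
```

Drop ignore_index=True if the original index labels of both frames should be kept.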
We often see the construct df.loc[subscript] = … to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method using dict and create DataFrame in the end. He found the latter to be the fastest by far.
But if we replace the df3.loc[i] = … (with preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the one using dict. So we should more often consider using df.values[subscript] = …. Note, however, that .values takes a zero-based positional subscript, which may differ from the DataFrame.index.
Before adding a row, we convert the dataframe to a dictionary. Its keys are the columns of the dataframe, and the values of each column are again stored in a dictionary whose keys are the index labels of the dataframe.
That idea led me to write the code below.
df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"] # This is the total row that we are going to add
i = 0
for x in df.columns: # Here df.columns gives us the main dictionary key
    df2[x][101] = values[i] # Here the 101 is our index number. It is also the key of the sub dictionary
    i += 1
If all data in your Dataframe has the same dtype you might use a NumPy array. You can write rows directly into the predefined array and convert it to a dataframe at the end.
It seems to be even faster than converting a list of dicts.
import time
import pandas as pd
import numpy as np
from string import ascii_uppercase
startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numrows))
This code snippet uses a list of dictionaries to update the data frame. It adds on to ShikharDua's and Mikhail_Sam's answers.
import pandas as pd
colour = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
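feat_list = []  # one dict per (colour, fruit) pair
for x in colour:
    for y in fruits:
        # print(x, y)
        dict1 = dict([('x',x),('y',y)])
        # print(f'dict 1 {dict1}')
        feat_list.append(dict1)
        # print(f'feat_list {feat_list}')
feat_df = pd.DataFrame(feat_list)  # build the data frame once, after the loops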

Check if Pandas Dataframe group has 2 specific values in a column and return those rows

I have a groupby object. For each of these groups, I need to check, if a particular column has rows that contain value-A and value-B and return only those 2 rows within the group. If I use isin or "|" I would get cases where either one of these values are present. Right now I am doing a sloppy job of checking for first condition and then checking for second condition if first one is true and concatenating the results of both the checks.
My code is as follows:
import pandas as pd
from datetime import datetime, timedelta
from statistics import mean
dict = {'col-a': ['T1A', 'T1A', 'T1A', 'T1B', 'T1B', 'T1C', 'T1C', 'P1', 'P1'],
        'col-b': ['07:57:00', '09:00:00', '12:00:00', '08:00:00', '08:25:00', '08:15:00', '07:25:00', '10:00:00', '07:45:00'],
        'col-c': ['11111', '22222', '99999', '33333', '22222', '22222', '99999', '22222', '99999'],
        'col-d': ['07:58:00', '09:01:00', '12:01:00', '08:01:00', '08:26:00', '08:16:00', '07:26:00', '10:01:00', '07:46:00']}
original_df = pd.DataFrame(dict)
print("original df\n", original_df)
# condition 1: must contain T1 in col-a
# condition 2: must contain 22222(variable) amongst each group of col-a
# condition 3: record containing 22222 should have col-b value between 7 and 9
# condition 4: must contain 99999(stays the same) among amongst each group of col-a where above conditions are met
no_to_check = '22222' # comes from another dataframe column
# filtering rows where col-a contains T1
filtered_df = original_df[original_df['col-a'].str.contains('T1')]
# grouping by col-a
trip_groups = filtered_df.groupby('col-a')
# checking if it contains '22222' in column c and '22222' has time between 7 and 9 in column b
trips_time_dict = {}
for group_key, group in trip_groups:
    check1 = group[(group['col-c'] == no_to_check) & (group['col-b'].between('07:00:00', '09:00:00'))]
    if len(check1) != 0:
        # checking if the group contains '99999' in column c
        check2 = group[group['col-c'] == '99999']
        if len(check2) != 0:
            all_conditions = pd.concat([check1,check2])
The desired output should contain one row for 22222 and one row for 99999 for each group that satisfies the criteria.
IIUC, you can do the following with df as your original dataframe:
df[df['col-a'].str.contains('T1')].groupby('col-a').apply(lambda x: x[(x['col-c']=='22222') & (x['col-b'].between('07:00:00', '09:00:00')) & (x['col-c']=='99999').any()])
col-a col-b col-c col-d
T1A 1 T1A 09:00:00 22222 09:01:00
T1C 5 T1C 08:15:00 22222 08:16:00
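The output above keeps only the 22222 rows. If the 99999 row of each qualifying group should be returned as well, as the question asks, one possible variant is sketched below. It swaps the groupby/apply for row-level boolean masks combined with group-level transform('any'); the names t1, is_target, is_end, has_target, has_end and result are illustrative and not from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col-a': ['T1A', 'T1A', 'T1A', 'T1B', 'T1B', 'T1C', 'T1C', 'P1', 'P1'],
    'col-b': ['07:57:00', '09:00:00', '12:00:00', '08:00:00', '08:25:00',
              '08:15:00', '07:25:00', '10:00:00', '07:45:00'],
    'col-c': ['11111', '22222', '99999', '33333', '22222',
              '22222', '99999', '22222', '99999'],
    'col-d': ['07:58:00', '09:01:00', '12:01:00', '08:01:00', '08:26:00',
              '08:16:00', '07:26:00', '10:01:00', '07:46:00'],
})
no_to_check = '22222'

# Condition 1: keep only rows whose col-a contains T1.
t1 = df[df['col-a'].str.contains('T1')]

# Row-level flags: the variable value inside the time window, and the fixed 99999.
is_target = (t1['col-c'] == no_to_check) & t1['col-b'].between('07:00:00', '09:00:00')
is_end = t1['col-c'] == '99999'

# Group-level flags: does the row's group contain at least one row of each kind?
has_target = is_target.groupby(t1['col-a']).transform('any')
has_end = is_end.groupby(t1['col-a']).transform('any')

# Keep both kinds of rows, but only in groups that satisfy both conditions.
result = t1[(is_target | is_end) & has_target & has_end]
print(result)
```

With the sample data this keeps rows 1 and 2 (group T1A) and rows 5 and 6 (group T1C); T1B is dropped because it has no 99999 row.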
