how to remove rows from a pandas dataframe if two rows contain at least one matching element - python-3.x

I have a pandas dataframe that contains many columns such as Name, Email, Mobile Number, etc., which looks like this:
Sr No. Name Email Mobile Number
1. John joh***#gmail.com 1234567890,2345678901
2. kylie k.ki**#yahoo.com 6789012345
3. jon null 1234567890
4. kia kia***#gmail.com 6789012345
5. sam b.sam**#gmail.com 4567890123
I want to remove the rows which contain the same Mobile Number. One person can have more than one number. I tried to do this with the drop_duplicates function:
newdf = df.drop_duplicates(subset = ['Mobile Number'],keep=False)
Here is the output:
Sr No. Name Email Mobile Number
1. John joh***#gmail.com 1234567890,2345678901
3. jon null 1234567890
5. sam b.sam**#gmail.com 4567890123
But the problem is that it only removes rows which are exactly the same, whereas I want to remove rows that share at least one number, e.g. Sr. No. 1 and 3 have one number in common. How can I remove them so the final output looks like this:
final output:
Sr No. Name Email Mobile Number
5. sam b.sam**#gmail.com 4567890123

Alright, it is a somewhat involved solution, but I was able to solve it.
Here's how I am doing it.
First, I take all the mobile numbers and split them on ','. Then I explode them (explode retains the original index).
Then I find the indexes of all rows that contain a duplicated number.
Then I exclude a row from the dataframe if its index was part of those duplicates.
This will give you the unique rows that do not have any duplicates.
I modified your dataframe to have a few options.
c = ['Name','Email','Mobile Number']
d = [['John','joh***#gmail.com','1234567890,2345678901,6789012345'],
['kylie','k.ki**#yahoo.com','6789012345'],
['jon','null','1234567890'],
['kia','kia***#gmail.com','6789012345'],
['mac','mac***#gmail.com','2345678901,1098765432'],
['kfc','kfc***#gmail.com','6237778901,1098765432,3034045050'],
['pig','pig***#gmail.com','8007778001,8018765454,5054043030'],
['bil','bil***#gmail.com','1098765432'],
['jun','jun***#gmail.com','9098785434'],
['sam','b.sam**#gmail.com','4567890123']]
import pandas as pd
df = pd.DataFrame(d,columns=c)
print (df)
temp = df.copy()
temp['Mobile Number'] = temp['Mobile Number'].apply(lambda x: x.split(','))  # split each entry into a list of numbers
temp = temp.explode('Mobile Number')  # one row per number; the original index is retained
#print (temp)
# keep only the rows whose index never appears among the duplicated numbers
df2 = df[~df.index.isin(temp[temp['Mobile Number'].duplicated(keep=False)].index)]
print (df2)
The output of this is:
Original DataFrame:
Name Email Mobile Number
0 John joh***#gmail.com 1234567890,2345678901,6789012345 # duplicated index: 1, 2, 3, 4
1 kylie k.ki**#yahoo.com 6789012345 # duplicated index: 0, 3
2 jon null 1234567890 # duplicated index: 0
3 kia kia***#gmail.com 6789012345 # duplicated index: 0
4 mac mac***#gmail.com 2345678901,1098765432 # duplicated index: 0
5 kfc kfc***#gmail.com 6237778901,1098765432,3034045050 # duplicated index: 7
6 pig pig***#gmail.com 8007778001,8018765454,5054043030 # no duplicate; should output
7 bil bil***#gmail.com 1098765432 # duplicated index: 5
8 jun jun***#gmail.com 9098785434 # no duplicate; should output
9 sam b.sam**#gmail.com 4567890123 # no duplicate; should output
The output of this will be the 3 rows (index: 6, 8, and 9):
Name Email Mobile Number
6 pig pig***#gmail.com 8007778001,8018765454,5054043030
8 jun jun***#gmail.com 9098785434
9 sam b.sam**#gmail.com 4567890123
Since temp is not needed anymore, you can just delete it using del temp.
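If you only need the duplicate detection, you can also skip the temporary copy and explode just the one column. A sketch (assuming, as above, that 'Mobile Number' is always a comma-separated string):
numbers = df['Mobile Number'].str.split(',').explode()   # a Series: one number per row, original index kept
dup_idx = numbers.index[numbers.duplicated(keep=False)]  # indexes that share at least one number
df2 = df[~df.index.isin(dup_idx)]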

One possible solution is to do the following. Say your df is given by
Sr No. Name Email Mobile Number
0 1.0 John joh***#gmail.com 1234567890 , 2345678901
1 2.0 kylie k.ki**#yahoo.com 6789012345
2 3.0 jon NaN 1234567890
3 4.0 kia kia***#gmail.com 6789012345
4 5.0 sam b.sam**#gmail.com 4567890123
You can split your Mobile Number column into two (or more) columns mob1, mob2, ... and then drop duplicates:
df[['mob1', 'mob2']] = df["Mobile Number"].str.split(" , ", n=1, expand=True)
newdf = df.drop_duplicates(subset=['mob1'], keep=False)
which returns
Sr No. Name Email Mobile Number mob1 mob2
4 5.0 sam b.sam**#gmail.com 4567890123 4567890123 None
EDIT
To handle the possible swapped order of numbers, one can extend the method by dropping duplicates from all created columns:
df[['mob1', 'mob2']] = df["Mobile Number"].str.split(" , ", n=1, expand=True)
newdf = df.drop_duplicates(subset=['mob1'], keep=False)
newdf = newdf.drop_duplicates(subset=['mob2'], keep=False)
which returns:
Sr No. Name Email Mobile Number mob1 \
0 1.0 John joh***#gmail.com 2345678901 , 1234567890 2345678901
mob2
0 1234567890
If there are individuals with more than two numbers, then as many columns as the maximum number of phone numbers need to be created, as sketched below.
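A rough sketch of that extension (it stays within the column-splitting approach, so it still only compares numbers that land in the same position; the mobN column names are just illustrative):
mobs = df["Mobile Number"].str.split(" , ", expand=True)          # one column per phone number
mobs.columns = ['mob%d' % (i + 1) for i in range(mobs.shape[1])]
df = df.join(mobs)
newdf = df
for col in mobs.columns:
    # keep=False drops both sides of a duplicate; the notna() guard avoids
    # treating missing numbers as duplicates of each other
    dup = newdf[col].notna() & newdf.duplicated(subset=[col], keep=False)
    newdf = newdf[~dup]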

Related

Perform unique row operation after a groupby

I have been stuck on a problem where I have done all the groupby operations and got the resultant dataframe shown below, but the problem comes in the last operation: the calculation of one additional column.
Current dataframe:
code industry category count duration
2 Retail Mobile 4 7
3 Retail Tab 2 33
3 Health Mobile 5 103
2 Food TV 1 88
The question: I want an additional operation column which calculates the ratio of the count for industry 'Retail' to the total count for that specific code entry.
For example: code 2 has two industry entries, Retail and Food, so the operation column should have the value 4/(4+1) = 0.8, and similarly for code 3, as shown below.
O/P:
code industry category count duration operation
2 Retail Mobile 4 7 0.8
3 Retail Tab 2 33 2/7 = 0.285
3 Health Mobile 5 103 -
2 Food TV 1 88 -
Help on this as well: if I just do a groupby, I will lose the category and duration information. Also, what would be a better way to represent the output df? There can be multiple industries, and operation is limited to just Retail.
I can't think of a single operation, but the way via a dictionary should work. Oh, and in advance for the other answerers, here is the code to create the example dataframe.
import pandas as pd

st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l, columns=['code', 'industry', 'category', 'count', 'duration'])
And now my attempt:
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
df['operation'] = df.apply(lambda x: x['count']/sums[x['code']], axis=1)
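If, as in the desired output, the ratio should only be filled in for the Retail rows, you could swap the apply line for a masked assignment that reuses the sums dictionary (a small extension, not part of the original attempt):
retail = df['industry'].eq('Retail')
# rows that are not Retail are left as NaN in the new column
df.loc[retail, 'operation'] = df.loc[retail, 'count'] / df.loc[retail, 'code'].map(sums)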
You can create a new column with the total count of each code using groupby.transform(), and then use loc to find only the rows that have as their industry 'Retail' and perform your division:
df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)
df.drop('total_per_code',axis=1,inplace=True)
prints back:
code industry category count duration operation
0 2 Retail Mobile 4 7 0.800000
1 3 Retail Tab 2 33 0.285714
2 3 Health Mobile 5 103 NaN
3 2 Food TV 1 88 NaN

Can we separate data using Unique ID into the following format?

Current Format:
UNIQUE ID  NAME   AGE  DEP  RANK
001        John   10   4th  1
002        Priya  11   4th  2
003        Jack   15   5th  2
004        Jill   14   5th  1
Expected Format:
UNIQUE ID  NAME   COLUMN_NO
001        John   1
001        10     2
001        4th    3
001        1      4
002        Priya  1
002        11     2
002        4th    3
002        2      4
My starting point:
>>> df
UNIQUE ID NAME AGE DEP RANK
0 1 John 10 4th 1
1 2 Priya 11 4th 2
2 3 Jack 15 5th 2
3 4 Jill 14 5th 1
The basic transformation you need is provided by df.stack, which results in:
0 UNIQUE ID 1
NAME John
AGE 10
DEP 4th
RANK 1
1 UNIQUE ID 2
NAME Priya
[...]
However, you want column UNIQUE ID to be treated separately. This can be accomplished by making it the index:
>>> df.set_index('UNIQUE ID').stack()
UNIQUE ID
1 NAME John
AGE 10
DEP 4th
RANK 1
2 NAME Priya
AGE 11
DEP 4th
RANK 2
The last missing bit is the column names: you want them renamed to numbers. This can be accomplished in two different ways: a) by re-assigning df.columns (after having moved the UNIQUE ID column to the index first):
df = df.set_index('UNIQUE ID')
df.columns = range(1, 5)
or b) by df.renaming the columns:
df = df.set_index('UNIQUE ID')
df = df.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
And finally you can convert the resulting Series back to a DataFrame. The most elegant way to get COLUMN NO at the right place is using df.rename_axis before stacking. All together as one expression (possibly better to split it up):
>>> (df.set_index('UNIQUE ID')
.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
.rename_axis('COLUMN NO', axis=1)
.stack()
.to_frame('NAME')
.reset_index())
UNIQUE ID COLUMN NO NAME
0 1 1 John
1 1 2 10
2 1 3 4th
3 1 4 1
4 2 1 Priya
5 2 2 11
6 2 3 4th
7 2 4 2
8 3 1 Jack
9 3 2 15
10 3 3 5th
11 3 4 2
12 4 1 Jill
13 4 2 14
14 4 3 5th
15 4 4 1
Things left out: reading the data, and preserving the correct type: UNIQUE ID only looks numeric, but it has leading zeros that probably want to be preserved, so parsing it as a string would be better.
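A sketch of that, assuming the data comes from a CSV file (the file name here is made up):
import pandas as pd

# reading UNIQUE ID as a string keeps the leading zeros intact
df = pd.read_csv('data.csv', dtype={'UNIQUE ID': str})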

Algo to identify slightly different uniquely identifiable common names in 3 DataFrame columns

Sample DataFrame df has 3 columns to identify any given person, viz., name, nick_name, initials. They can have slight differences in the way they are specified, but looking at the three columns together it is possible to overcome these differences, separate out all the rows for a given person, and normalize these 3 columns to a single value for each person.
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':range(9), 'name':['Theodore', 'Thomas', 'Theodore', 'Christian', 'Theodore', 'Theodore R', 'Thomas', 'Tomas', 'Cristian'], 'nick_name':['Tedy', 'Tom', 'Ted', 'Chris', 'Ted', 'Ted', 'Tommy', 'Tom', 'Chris'], 'initials':['TR', 'Tb', 'TRo', 'CS', 'TR', 'TR', 'tb', 'TB', 'CS']})
>>> df
ID name nick_name initials
0 0 Theodore Tedy TR
1 1 Thomas Tom Tb
2 2 Theodore Ted TRo
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore R Ted TR
6 6 Thomas Tommy tb
7 7 Tomas Tom TB
8 8 Cristian Chris CS
In this case the desired output is as follows:
ID name nick_name initials
0 0 Theodore Ted TR
1 1 Thomas Tom TB
2 2 Theodore Ted TR
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore Ted TR
6 6 Thomas Tom TB
7 7 Thomas Tom TB
8 8 Christian Chris CS
The common value can be anything as long as it is normalized to the same value. For example, the name can be Theodore or Theodore R - both are fine.
My actual DataFrame has about 4000 rows. Could someone help specify an optimal algorithm to do this?
You'll want to use Levenshtein distance to identify similar strings. A good Python package for this is fuzzywuzzy. Below I used a basic dictionary approach to collect similar rows together, then overwrite each cluster with a designated master row. Note this leaves a CSV with many duplicate rows; I don't know if that is what you want, but if not, it is easy enough to take the duplicates out (see the snippet after the code).
import pandas as pd
from itertools import chain
from fuzzywuzzy import fuzz
def cluster_rows(df):
    # group rows whose 'name' values are close (fuzz.ratio >= threshold)
    row_clusters = {}
    threshold = 90
    name_rows = list(df.iterrows())
    for i, nr in name_rows:
        name = nr['name']
        new_cluster = True
        for other in row_clusters.keys():
            if fuzz.ratio(name, other) >= threshold:
                row_clusters[other].append(nr)
                new_cluster = False
        if new_cluster:
            row_clusters[name] = [nr]
    return row_clusters

def normalize_rows(row_clusters):
    # overwrite every row in a cluster with the cluster's first (master) row
    for name in row_clusters:
        master = row_clusters[name][0]
        for row in row_clusters[name][1:]:
            for key in row.keys():
                row[key] = master[key]
    return row_clusters

if __name__ == '__main__':
    df = pd.read_csv('names.csv')
    rc = cluster_rows(df)
    normalized = normalize_rows(rc)
    pd.DataFrame(chain(*normalized.values())).to_csv('norm-names.csv')
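If you do want to take those duplicates out, one option (a small follow-up, not part of the original code) is to replace the last line of the main block with:
    # the normalized clusters hold identical copies of each master row,
    # so a plain drop_duplicates removes the repeats before writing
    result = pd.DataFrame(chain(*normalized.values())).drop_duplicates()
    result.to_csv('norm-names.csv', index=False)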

pandas: search column values from one df in another df column that contains lists

I need to search for the values from the df1['numsearch'] column in the lists in df2['Numbers']. If a number is in one of those lists, I want to add the corresponding value from the df2['Score'] column to df1. See the desired output below.
df1 = pd.DataFrame(
{'Day':['M','Tu','W','Th','Fr','Sa','Su'],
'numsearch':['1','20','14','99','19','6','101']
})
df2 = pd.DataFrame(
{'Letters':['a','b','c','d'],
'Numbers':[['1','2','3','4'],['5','6','7','8'],['10','20','30','40'],['11','12','13','14']],
'Score': ['1.1','2.2','3.3','4.4']})
desired output
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 "No score"
4 Fr 19 "No score"
5 Sa 6 2.2
6 Su 101 "No score"
I have written a for loop that works with the test data.
scores = []
for s, ns in enumerate(ppr_data['SN']):
    match = ''
    for k, q in enumerate(jcr_data['All_ISSNs']):
        if ns in q:
            scores.append(jcr_data['Journal Impact Factor'][k])
            match = 1
        else:
            continue
    if match == "":
        scores.append('No score')
        match = ""
df1['Score'] = np.array(scores)
In my small test the above code works, but when working with larger data files it creates duplicates, so this clearly isn't the best way to do it.
I'm sure there's a more pandas-proper line of code that ends in .fillna("No score").
I tried to use a loc statement, but I get hung up on searching the values of one dataframe in a column that contains lists.
Can anyone shed some light?
df2 = df2.explode('Numbers')  # explode df2 on Numbers: one row per number
d = dict(zip(df2.Numbers, df2.Score))  # dict mapping each Number to its Score
df1['Score'] = df1.numsearch.map(d).fillna('No Score')  # map dict onto df1, filling NaN with 'No Score'
You can shorten it as follows:
df2 = df2.explode('Numbers')
df1['Score'] = df1.numsearch.map(dict(zip(df2.Numbers, df2.Score))).fillna('No Score')
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No Score
4 Fr 19 No Score
5 Sa 6 2.2
6 Su 101 No Score
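One thing to watch out for (an aside, not in the original answer): the map only matches when both sides have the same dtype. Here numsearch and Numbers both hold strings; if one side were numeric, you could align them first, for example:
df1['Score'] = df1.numsearch.astype(str).map(dict(zip(df2.Numbers, df2.Score))).fillna('No Score')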
You can try left join and fillna:
df1.merge(df2.explode('Numbers'),
left_on='numsearch',
right_on='Numbers', how='left')[['Day', 'numsearch', 'Score']].fillna("No score")
Output:
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No score
4 Fr 19 No score
5 Sa 6 2.2
6 Su 101 No score

pandas: get rows from one dataframe which exist in another dataframe

I have two dataframes, as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6)
Here in df1, the numbers column holds multiple unique numbers. These numbers are also present in df2, where they are repeated depending on the duration. I want to get all the data from df2 that matches the numbers available in df1.
Here is the code I've tried, but I'm not able to figure out how this can be solved using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique numbers column which is there in df2, but I need to get all the data, including the other columns from the second dataframe.
It would be great if anyone could help me with this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same.
df1=pd.DataFrame({"numbers":[123,1234,12345,5421]})
df2=pd.DataFrame({"numbers":[123,1234,12345,123,123,45643],"B":[1,2,3,4,5,6],"C":[2,3,4,5,6,7]})
final_df=df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: all rows of df2 whose numbers are present in df1 are returned.
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
