How to compare row by row in a dataframe - python-3.x

I have a data frame that has a name and the URL ID of the name. For example:
Abc 123
Abc.com 123
Def 345
Pqr 123
PQR.com 123
Due to a data extraction error, different names sometimes have the same ID. I want to clean the table such that if the names are different and the ID is the same, the record remains unchanged. If the names are similar and the ID is also the same, the names should be unified to a single name. To be clear,
The expected output should be:
Abc.com 123
Abc.com 123
Def 345
PQR.com 123
PQR.com 123
That is, the last records were a data entry error: both had the same name (the first word of the string was the same), so they are changed to a single name based on the ID.
The first and second records had the same ID as the last ones, but their names did not match those and were completely different.
I do not understand how to achieve this.
I would appreciate some guidance. Thanks in advance.
Note: The size of the dataset is almost 16 million of such records.

The idea is to use the fuzzy-matching library fuzzywuzzy to compute the ratio for all combinations of names within each ID. Build the combinations with a self-join via DataFrame.merge, remove rows with the same name in both columns with DataFrame.query, and add a new column with the name lengths via Series.str.len:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
df1['len'] = df1['Name_x'].str.len()
print (df1)
Name_x ID Name_y ratio len
1 Abc 123 BCD 0 3
2 BCD 123 Abc 0 3
6 Pqr 789 PQR.com 20 3
7 PQR.com 789 Pqr 20 7
Then filter rows by a threshold using boolean indexing. Next you need to choose which name to keep; one possible solution is to take the longer text. For that, use DataFrameGroupBy.idxmax with DataFrame.loc, then DataFrame.set_index to get a Series:
N = 15
df2 = df1[df1['ratio'].gt(N)]
s = df2.loc[df2.groupby('ID')['len'].idxmax()].set_index('ID')['Name_x']
print (s)
ID
789 PQR.com
Name: Name_x, dtype: object
Finally, use Series.map by ID and replace unmatched values with the originals using Series.fillna:
df['Name'] = df['ID'].map(s).fillna(df['Name'])
print (df)
Name ID
0 Abc 123
1 BCD 123
2 Def 345
3 PQR.com 789
4 PQR.com 789
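For 16 million rows a per-ID self-join with a row-wise fuzzy ratio may be slow. The question states that matching names share the same first word, so a fully vectorized alternative is sketched here. It is only a sketch under that assumption: the regular expression for the key and the "keep the longest spelling" rule are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Abc", "BCD", "Def", "Pqr", "PQR.com"],
    "ID": [123, 123, 345, 789, 789],
})

# Assumed rule: two names match when their first alphanumeric token,
# lower-cased, is identical ("Pqr" and "PQR.com" both give "pqr").
key = df["Name"].str.lower().str.extract(r"([a-z0-9]+)", expand=False)

# Sort by name length so the longest spelling comes last in each (ID, key) group.
# A stable sort keeps the original order among names of equal length.
by_len = (
    df.assign(_key=key, _len=df["Name"].str.len())
      .sort_values("_len", kind="stable")
)
canonical = by_len.groupby(["ID", "_key"])["Name"].transform("last")

# Assignment aligns on the index; names without a usable key keep their value.
df["Name"] = canonical.fillna(df["Name"])
print(df)
```

Because this uses only string methods and a single groupby, it avoids building the pairwise table, at the cost of a much cruder notion of similarity than a fuzzy ratio.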
EDIT: If there are several valid strings per ID, it is more complicated:
print (df)
Name ID
0 Air Ordnance 1578013421
1 Air-Ordnance.com 1578013421
2 Garreett 1578013421
3 Garrett 1578013421
First compute fuzz.ratio as in the previous solution:
from fuzzywuzzy import fuzz
df1 = df.merge(df, on='ID').query('Name_x != Name_y')
df1['ratio'] = df1.apply(lambda x: fuzz.ratio(x['Name_x'], x['Name_y']), axis=1)
print (df1)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
2 Air Ordnance 1578013421 Garreett 30
3 Air Ordnance 1578013421 Garrett 32
4 Air-Ordnance.com 1578013421 Air Ordnance 79
6 Air-Ordnance.com 1578013421 Garreett 25
7 Air-Ordnance.com 1578013421 Garrett 26
8 Garreett 1578013421 Air Ordnance 30
9 Garreett 1578013421 Air-Ordnance.com 25
11 Garreett 1578013421 Garrett 93
12 Garrett 1578013421 Air Ordnance 32
13 Garrett 1578013421 Air-Ordnance.com 26
14 Garrett 1578013421 Garreett 93
Then filter by threshold:
N = 50
df2 = df1[df1['ratio'].gt(N)]
print (df2)
Name_x ID Name_y ratio
1 Air Ordnance 1578013421 Air-Ordnance.com 79
4 Air-Ordnance.com 1578013421 Air Ordnance 79
11 Garreett 1578013421 Garrett 93
14 Garrett 1578013421 Garreett 93
For more precision, you need to specify which strings are valid in a list L and filter by that list:
L = ['Air-Ordnance.com','Garrett']
df2 = df2.loc[df2['Name_x'].isin(L),['Name_x','Name_y','ID']].rename(columns={'Name_y':'Name'})
print (df2)
Name_x Name ID
4 Air-Ordnance.com Air Ordnance 1578013421
14 Garrett Garreett 1578013421
Finally, merge with a left join to the original and replace missing values:
df = df.merge(df2, on=['Name','ID'], how='left')
df['Name'] = df.pop('Name_x').fillna(df['Name'])
print (df)
Name ID
0 Air-Ordnance.com 1578013421
1 Air-Ordnance.com 1578013421
2 Garrett 1578013421
3 Garrett 1578013421
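The self-join above produces every pair twice (A/B and B/A). If fuzzywuzzy is not available, the pairing and threshold step can be sketched with the standard library instead. This is a minimal sketch, not the answer's method: difflib.SequenceMatcher is used as a stand-in for fuzz.ratio (the scores may differ slightly from fuzzywuzzy's), and the helper name similarity and the column names Name_a and Name_b are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "Name": ["Air Ordnance", "Air-Ordnance.com", "Garreett", "Garrett"],
    "ID": [1578013421] * 4,
})

def similarity(a, b):
    # Stand-in for fuzz.ratio: difflib's ratio scaled to 0-100 and rounded.
    return round(100 * SequenceMatcher(None, a, b).ratio())

N = 50  # same threshold as in the answer
rows = []
for id_, names in df.groupby("ID")["Name"]:
    # combinations() yields each unordered pair of distinct names once.
    for a, b in combinations(names.unique(), 2):
        score = similarity(a, b)
        if score > N:
            rows.append({"ID": id_, "Name_a": a, "Name_b": b, "ratio": score})

pairs = pd.DataFrame(rows)
print(pairs)
```

Deciding which spelling of each pair is the valid one still needs a rule or a list such as L; the similarity score alone cannot tell "Garrett" from "Garreett".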

Related

How can I sort 3 columns and assign it to one python pandas

I have a dataframe:
df = {'A':[1,1,1], 'B':[2012,3014,3343], 'C':[12,13,45], 'D':[111,222,444]}
but I need to join the last 3 columns horizontally, in consecutive order, and assign the result to the first column, something like this:
df2 = {'A':[1,1,1,2,2,2], 'Fusion3':[2012,12,111,3014,13,222]}
I have tried with .melt, but I am struggling and would be grateful for your comments.
From the desired output, I'm assuming that the initial dataframe should have 1,2,3 in the A column rather than 1,1,1:
import pandas as pd
df= pd.DataFrame({'A':[1,2,3], 'B':[2012,3014,3343], 'C':[12,13,45], 'D':[111,222,444]})
df = df.set_index('A')
df = df.stack().droplevel(1)
will give you this series:
A
1 2012
1 12
1 111
2 3014
2 13
2 222
3 3343
3 45
3 444
Check melt
out = df.melt('A').drop(columns='variable')
Out[15]:
A value
0 1 2012
1 2 3014
2 3 3343
3 1 12
4 2 13
5 3 45
6 1 111
7 2 222
8 3 444
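The melt output above is ordered column by column, whereas the desired output interleaves the values row by row. A minimal sketch of one way to get that order, assuming (as the first answer does) that column A holds 1, 2, 3: a stable sort on A keeps the B, C, D order within each value of A.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [2012, 3014, 3343],
    "C": [12, 13, 45],
    "D": [111, 222, 444],
})

out = (
    df.melt("A", value_name="Fusion3")   # long format: one row per (A, column)
      .drop(columns="variable")          # the source column name is not needed
      .sort_values("A", kind="stable")   # stable sort keeps B, C, D order per A
      .reset_index(drop=True)
)
print(out)
```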

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and calculate the mean of sell_price/market_price for the rows where status is finished and start_date falls in 2019-09 or 2019-10
result_count: group by id and address and count the rows where status is either finished or failed and start_date falls in 2019-09 or 2019-10
The desired output looks like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter start_date isin the range of 2019-09 and 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter row status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these to get the final result. Thanks in advance for your help.
I think you need GroupBy.agg. Because filtering excludes some groups entirely (like id=1), add them back with DataFrame.join against all unique id/address pairs in df2, then replace missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
result_count=('ratio','size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
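An alternative that avoids dropping groups in the first place is to keep every row and mask the values instead of filtering. The following is a minimal sketch on a reduced copy of the sample rows (ids 1, 2 and 5 only, with dates rewritten in ISO format for brevity); the helper column name hit is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 2, 5, 5, 5],
    "address": ["7552 Atlantic Lane"] + ["888 Foster Street"] * 3 + ["68 Henry Drive"] * 3,
    "sell_price": [1170787.3, 1066708.28, 1871757.05, np.nan,
                   2699959.92, 5830725.66, 2668401.36],
    "market_price": [1463484.12, 1333385.35, 1416757.05, 763744.52,
                     np.nan, np.nan, 1903401.36],
    "status": ["finished", "finished", "finished", "current",
               "failed", "failed", "finished"],
    "start_date": pd.to_datetime(["2019-08-02", "2019-08-02", "2019-10-14", "2019-10-12",
                                  "2019-10-08", "2019-10-08", "2019-12-08"]),
})

months = [pd.Period("2019-09"), pd.Period("2019-10")]
in_range = df["start_date"].dt.to_period("M").isin(months)
finished = df["status"].eq("finished")
counted = df["status"].isin(["finished", "failed"])
ratio = df["sell_price"] / df["market_price"]

out = (
    df.assign(
        ratio=ratio.where(in_range & finished),   # NaN outside the condition
        hit=(in_range & counted).astype(int),     # 1 for rows that should be counted
    )
    .groupby(["id", "address"], as_index=False, sort=False)
    .agg(mean_ratio=("ratio", "mean"), result_count=("hit", "sum"))
)
out["mean_ratio"] = out["mean_ratio"].round(2)
print(out)
```

Since no rows are removed before grouping, every id/address pair appears in the result and result_count stays an integer column without a join or fillna.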
Some helpers
import numpy as np
import pandas as pd

def mean_ratio(idf):
    # keep finished rows in the date range that have a computable ratio
    idf = idf[
        (idf['status'] == 'finished') &
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    # count finished/failed rows in the date range
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate the ratio beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
# Group the df
g = df.groupby(['id', 'address'])
# separate names so the helper functions are not shadowed
mean_ratio_df = g.apply(mean_ratio).to_frame('mean_ratio')
result_count_df = g.apply(result_count).to_frame('result_count')
# Final result
pd.concat((mean_ratio_df, result_count_df), axis=1)

Match pandas column values and headers across dataframes

I have 3 files that I am reading into dataframes (https://pastebin.com/v7BnSH3s)
map_df: Maps data_file headers to codes_df headers
Field Name Code Name
Gender gender_codes
Race race_codes
Ethnicity ethnicity_codes
code_df: Valid codes
gender_codes race_codes ethnicity_codes
1 1 1
2 2 2
3 3 3
4 4 4
NaN NaN 5
NaN NaN 6
NaN NaN 7
data_df: the actual data that needs to be checked against the codes
Name Gender Race Ethnicity
Alex 99 1 7
Cindy 2 4 5
Tom 1 99 1
Problem:
I need to confirm that each value in every column of data_df is a valid code. If not, I need to write the Name, the invalid value and the column header label as a new column. So my example data_df would yield the following dataframe for the gender_codes check:
result_df:
Name Value Column
Alex 99 Gender
Background:
My actual data file has over 100 columns.
A code column can map to multiple columns in the data_df.
I am currently not using the map_df other than to know which columns map to
which codes. However, if I can incorporate this into my script, that would be
ideal.
What I've tried:
I am currently sending each code column to a list, removing the nan string, performing the lookup with loc and isin, then setting up the result_df...
# code column to list
gender_codes = codes_df["gender_codes"].tolist()
# remove nan string
gender_codes = [gender_codes
for gender_codes in gender_codes
if str(gender_codes) != "nan"]
# check each value against code list
result_df = data_df.loc[(~data_df.Gender.isin(gender_codes))]
result_df = result_df.filter(items = ["Name","Gender"])
result_df.rename(columns = {"Gender":"Value"}, inplace = True)
result_df['Column'] = 'Gender'
This works but obviously is extremely primitive and won't scale with my dataset. I'm hoping to find an iterative and pythonic approach to this problem.
EDIT:
Modified Dataset with np.nan
https://pastebin.com/v7BnSH3s
Boolean indexing
I'd reformat your data into different forms
m = dict(map_df.itertuples(index=False))
c = code_df.T.stack().groupby(level=0).apply(set)
ddf = data_df.melt('Name', var_name='Column', value_name='Value')
ddf[[val not in c[col] for val, col in zip(ddf.Value, ddf.Column.map(m))]]
Name Column Value
0 Alex Gender 99
5 Tom Race 99
Details
m # Just a dictionary with the same content as `map_df`
{'Gender': 'gender_codes',
'Race': 'race_codes',
'Ethnicity': 'ethnicity_codes'}
c # Series of membership sets
ethnicity_codes {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0}
gender_codes {1.0, 2.0, 3.0, 4.0}
race_codes {1.0, 2.0, 3.0, 4.0}
dtype: object
ddf # Melted dataframe to help match the final output
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
You will need to preprocess your dataframes and define a validation function. Something like below:
1. Preprocessing
# call melt() to convert columns to rows
mcodes = codes_df.melt(
value_vars=list(codes_df.columns),
var_name='Code Name',
value_name='Valid Code').dropna()
mdata = data_df.melt(
id_vars='Name',
value_vars=list(data_df.columns[1:]),
var_name='Column',
value_name='Value')
validation_df = mcodes.merge(map_df, on='Code Name')
Out:
mcodes:
Code Name Valid Code
0 gender_codes 1
1 gender_codes 2
7 race_codes 1
8 race_codes 2
9 race_codes 3
10 race_codes 4
14 ethnicity_codes 1
15 ethnicity_codes 2
16 ethnicity_codes 3
17 ethnicity_codes 4
18 ethnicity_codes 5
19 ethnicity_codes 6
20 ethnicity_codes 7
mdata:
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
validation_df:
Code Name Valid Code Field Name
0 gender_codes 1 Gender
1 gender_codes 2 Gender
2 race_codes 1 Race
3 race_codes 2 Race
4 race_codes 3 Race
5 race_codes 4 Race
6 ethnicity_codes 1 Ethnicity
7 ethnicity_codes 2 Ethnicity
8 ethnicity_codes 3 Ethnicity
9 ethnicity_codes 4 Ethnicity
10 ethnicity_codes 5 Ethnicity
11 ethnicity_codes 6 Ethnicity
12 ethnicity_codes 7 Ethnicity
2. Validation Function
def isValid(row):
    valid_list = validation_df[validation_df['Field Name'] == row.Column]['Valid Code'].tolist()
    return row.Value in valid_list
3. Validation
mdata['isValid'] = mdata.apply(isValid, axis=1)
result = mdata[mdata.isValid == False]
Out:
result:
Name Column Value isValid
0 Alex Gender 99 False
5 Tom Race 99 False
m, df1 = dict(map_df.values), data_df.set_index('Name')
df1[df1.apply(lambda x:~x.isin(code_df[m[x.name]]))].stack().reset_index()
Out:
Name level_1 0
0 Alex Gender 99.0
1 Tom Race 99.0
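The same check can also be written as an anti-join, which stays vectorized for wide data files and uses map_df directly. This is a minimal sketch that rebuilds the three sample frames inline (using the name codes_df from the question's code); casting the data values to float so they compare cleanly with the NaN-padded code columns is an assumption that fits numeric codes only.

```python
import numpy as np
import pandas as pd

map_df = pd.DataFrame({
    "Field Name": ["Gender", "Race", "Ethnicity"],
    "Code Name": ["gender_codes", "race_codes", "ethnicity_codes"],
})
codes_df = pd.DataFrame({
    "gender_codes": [1, 2, 3, 4, np.nan, np.nan, np.nan],
    "race_codes": [1, 2, 3, 4, np.nan, np.nan, np.nan],
    "ethnicity_codes": [1, 2, 3, 4, 5, 6, 7],
})
data_df = pd.DataFrame({
    "Name": ["Alex", "Cindy", "Tom"],
    "Gender": [99, 2, 1],
    "Race": [1, 4, 99],
    "Ethnicity": [7, 5, 1],
})

# One row per valid (Column, Value) pair, translated through map_df.
valid = (
    codes_df.melt(var_name="Code Name", value_name="Value")
            .dropna()
            .merge(map_df, on="Code Name")
            .rename(columns={"Field Name": "Column"})[["Column", "Value"]]
)

# One row per (Name, Column, Value) in the data.
long = data_df.melt("Name", var_name="Column", value_name="Value")
long["Value"] = long["Value"].astype(float)  # assumption: codes are numeric

# Rows found only on the left side have no matching valid code.
checked = long.merge(valid, on=["Column", "Value"], how="left", indicator=True)
result_df = (
    checked.loc[checked["_merge"] == "left_only", ["Name", "Value", "Column"]]
           .reset_index(drop=True)
)
print(result_df)
```

If one code column maps to several data columns, adding more rows to map_df is enough; the merge fans the valid codes out to each mapped field.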

Pandas : merge dataframes with conditions

I think what I'd like to do is fairly complicated.
I have 2 pandas DataFrames,
contact_extrafields (which is a CSV file converted to a DataFrame):
contact_id departement age region size
0 17068CE3 5 19.5
1 788159ED 59 18 ABC
2 4796EDA9 69 100.0
3 2BB080E4 32 DEF 50.5
4 8562B30E 10 GHI 79.95
5 9602758E 67 JKL 23.7
6 3CBBA9F7 65 MNO 14.7
7 DAE5EE44 75 98 159.6
8 5B9E3410 49 10 PQR 890.1
...
datafield_types (which is a dictionary converted to a DataFrame):
name datatype_id datafield_id datatype_name
0 size 1 4 float
1 region 2 3 string
2 age 3 2 integer
3 departement 3 1 integer
I would like a new DataFrame like this :
contact_id datafield_id string_value integer_value boolean_value float_value
0 17068CE3 4 19.5
1 17068CE3 3
2 17068CE3 2 5
3 17068CE3 1
4 788159ED 4
5 788159ED 3 ABC
6 788159ED 2 18
7 788159ED 1 59
....
The DataFrame contact_extrafields contains about 3 million lines.
EDIT (example):
If I take contact_id 788159ED from the DataFrame contact_extrafields,
I take the name of each column and its value,
then look up the type of the value in the DataFrame datafield_types using the column name.
For example, for the column departement the value is 59 and its type is integer according to datafield_types, so the datatype_id is 3;
this should insert a line like the following into the new DataFrame:
contact_id datafield_id string_value integer_value boolean_value float_value
0 788159ED 1 59
....
The datafield_id is retrieved from the DataFrame datafield_types; this tells me that contact 788159ED had the value 59 for the column departement, which is of integer type.
Each column creates a row in the DataFrame I want to create.
Is it possible to do it with pandas?
How to do it?
The columns in contact_extrafields can change (so I will change the names in datafield_types too).
Everything I've tried so far has exhausted the memory.
My code is running on a machine with 16 GB of RAM.
Thanks a lot !
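One possible approach, sketched here on the first two sample contacts only, is to melt the wide table into one row per contact and field, merge the field metadata, and then spread the value into one column per datatype. This is a hedged sketch rather than a tested solution for 3 million rows: the placement of the missing values in the two sample rows is inferred from the desired output, and the intermediate names long and result are made up.

```python
import numpy as np
import pandas as pd

contact_extrafields = pd.DataFrame({
    "contact_id": ["17068CE3", "788159ED"],
    "departement": [np.nan, 59],
    "age": [5, 18],
    "region": [np.nan, "ABC"],
    "size": [19.5, np.nan],
})
datafield_types = pd.DataFrame({
    "name": ["size", "region", "age", "departement"],
    "datatype_id": [1, 2, 3, 3],
    "datafield_id": [4, 3, 2, 1],
    "datatype_name": ["float", "string", "integer", "integer"],
})

# One row per (contact, field); the field name ends up in the "name" column.
long = contact_extrafields.melt("contact_id", var_name="name", value_name="value")
long = long.merge(datafield_types[["name", "datafield_id", "datatype_name"]], on="name")

# Copy the value into the column matching its datatype, leaving the others empty.
for type_name in ["string", "integer", "boolean", "float"]:
    long[f"{type_name}_value"] = long["value"].where(long["datatype_name"] == type_name)

result = (
    long.drop(columns=["name", "value", "datatype_name"])
        .sort_values(["contact_id", "datafield_id"], ascending=[True, False])
        .reset_index(drop=True)
)
print(result)
```

The melted frame has one row per contact per field, so on the full data it may help to process contact_extrafields in chunks (for example with the chunksize parameter of pd.read_csv) and append each chunk's result to the output, rather than melting everything at once.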

Find occurrences of conditional value from one column and count values from another column in a dataframe

I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to group by the userIds if X is greater than 3 for 3 weeks.
I have tried using groupby and lambda in pandas but I am stuck
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds Week X
123 14 3
123 15 4
123 16 7
123 17 2
123 18 1
456 14 4
456 15 5
456 16 11
456 17 2
456 18 6
The result I am aiming for is a dataframe containing user 456 and how many weeks the condition occurred.
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1,'UserIds'])]
Count the values greater than 3 per user by summing the boolean mask, then keep the users whose count is greater than 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print (out)
UserIds count
0 456 4
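The question can also be read as "X greater than 3 for 3 weeks in a row". A minimal sketch of both readings on the sample data follows; it assumes the rows are sorted by user and week with no missing weeks, and the names weeks_over, run_id and longest_run are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "UserIds": [123] * 5 + [456] * 5,
    "Week": [14, 15, 16, 17, 18] * 2,
    "X": [3, 4, 7, 2, 1, 4, 5, 11, 2, 6],
})

over = df["X"].gt(3)

# Reading 1: total number of weeks with X > 3 per user.
weeks_over = over.groupby(df["UserIds"]).sum()

# Reading 2: longest run of consecutive weeks with X > 3 per user.
# A new run starts whenever the flag flips or the user changes.
new_run = over.ne(over.shift()) | df["UserIds"].ne(df["UserIds"].shift())
run_id = new_run.cumsum().rename("run")
run_len = over.groupby([df["UserIds"], run_id]).sum()  # runs of False sum to 0
longest_run = run_len.groupby(level=0).max()

print(weeks_over[weeks_over.ge(3)])
print(longest_run[longest_run.ge(3)])
```

On this sample both readings select only user 456, with 4 qualifying weeks in total and a longest streak of 3.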
