changing values in data frame based on duplicates - python - python-3.x

I have quite a large data set of over 100k rows with many duplicates and some missing or faulty values. I'm trying to simplify the problem in the snippet below.
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', np.nan],
    'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
I'm trying to change the values based on duplicate rows, so if any three fields match then the fourth one should match as well. I should get an outcome like this:
result = {
    'BI Business Name': ['AAA', 'AAA', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', 'www#3'],
    'BI Telephone': ['999', '999', '666', '12345', '12345']
}
df = pd.DataFrame(result)
I have found an extremely long-winded method; here I'm showing just the part for changing the name.
df['Phone_code_web'] = df['BId Postcode'] + df['BI Website'] + df['BI Telephone']
reference_name = df[['BI Business Name', 'BI Telephone', 'BId Postcode', 'BI Website']]
reference_name = reference_name.dropna()
reference_name['Phone_code_web'] = reference_name['BId Postcode'] + reference_name['BI Website'] + reference_name['BI Telephone']
duplicate_ref = reference_name[reference_name['Phone_code_web'].duplicated()]
reference_name = pd.concat([reference_name, duplicate_ref]).drop_duplicates(keep=False)
reference_name

def replace_name(row):
    try:
        old_name = row['BI Business Name']
        reference = row['Phone_code_web']
        new_name = reference_name[reference_name['Phone_code_web'] == reference].iloc[0, 0]
        print(new_name)
        return new_name
    except Exception as e:
        return old_name

df['BI Business Name'] = df.apply(replace_name, axis=1)
df
Is there an easier way of doing this?

You can try this:
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', np.nan],
    'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
print(df)

def fill_gaps(_df, _x):  # _df is the dataframe, _x is one of its rows
    # pd.isnull(_x) = Booleans indicating which columns of this row are NaN
    # _df.columns[pd.isnull(_x)] = the columns whose value is NaN in this row
    for col in _df.columns[pd.isnull(_x)]:
        # len(set(y) & set(_x)) = size of the intersection between the row being
        # considered (_x) and each of the other rows in turn (y)
        # the mask is a list of Booleans which are True if:
        # 1) y[col] is not null (e.g. for row 3 we need a non-NaN 'BI Telephone' value)
        # 2) and the intersection above has exactly 3 elements (three matching fields)
        mask = _df.apply(lambda y: pd.notnull(y[col]) and len(set(y) & set(_x)) == 3, axis=1)
        # if the mask has at least one True, take the value from the corresponding
        # column (if there are several candidates, take the first one)
        _x[col] = _df[mask][col].iloc[0] if any(mask) else _x[col]
    return _x

# Apply the logic described above to each row in turn (x = each row)
df = df.apply(lambda x: fill_gaps(df, x), axis=1)
print(df)
Output:
  BI Business Name BId Postcode BI Website BI Telephone
0              AAA      NW1 8NZ      www#1          999
1              BBB      NW1 8NZ      www#1          999
2              CCC     WC2N 4AA      www#2          666
3              DDD      CV7 9JY      www#3          NaN
4              DDD      CV7 9JY        NaN        12345

  BI Business Name BId Postcode BI Website BI Telephone
0              AAA      NW1 8NZ      www#1          999
1              BBB      NW1 8NZ      www#1          999
2              CCC     WC2N 4AA      www#2          666
3              DDD      CV7 9JY      www#3        12345
4              DDD      CV7 9JY      www#3        12345
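Note that fill_gaps only fills the NaNs; the expected result in the question also renames 'BBB' to 'AAA' because the other three fields match row 0. A possible follow-up sketch (my addition, not part of the answer above) using groupby/transform on the three key columns:
key = ['BId Postcode', 'BI Website', 'BI Telephone']
# after the NaN fill, rows 0/1 and rows 3/4 share the same key, so taking the
# first business name per group turns 'BBB' into 'AAA' and leaves the rest as-is
df['BI Business Name'] = df.groupby(key)['BI Business Name'].transform('first')
print(df)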

Related

replacing nan values with a function python

I have a big data set (100k+) with many more columns than in the snippet attached. I need to replace missing values with values from a reference table. I have found countless articles on how to replace NaN values with the same number, but I can't find relevant help on replacing them with different values obtained from a function. My problem is that np.nan is not equal to np.nan, so how can I make a comparison? I'm trying to say that if the value is null, then replace it with the particular value from the reference table. I have found the way shown below, but it's a dangerous method, as it only does the replacement via the exception handler, so if anything goes wrong I wouldn't see it. Here is the snippet:
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#2', 'www#3', 'www#3'],
    'BI Telephone': ['999', '999', '666', '001', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
df
and here is my method:
feature = 'BI Telephone'
df[[feature]] = df[[feature]].astype('string')

def missing_phone(row):
    try:
        old_value = row[feature]
        if old_value == 'NaN' or old_value == 'nan' or old_value == np.nan or old_value is None or old_value == '':
            reference_value = row[reference_column]
            new_value = reference_table[reference_table[reference_column] == reference_value].iloc[0, 0]
            print('changed')
            return new_value
        else:
            print('unchanged as value is not nan. The value is {}'.format(old_value))
            return old_value
    except Exception as e:
        reference_value = row[reference_column]
        new_value = reference_table[reference_table[reference_column] == reference_value].iloc[0, 0]
        print('exception')
        return new_value

df[feature] = df.apply(missing_phone, axis=1)
df
If I don't change the data type to string, the NaN just stays unchanged. How can I fix it?
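For what it's worth, a minimal sketch (my addition, not from the original question): nulls can be detected with pd.isna() instead of string comparisons, since pd.isna() is True for np.nan, None and pd.NA and does not depend on np.nan == np.nan.
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False, which is why the equality check never fires
print(pd.isna(np.nan))    # True
print(pd.isna(None))      # True

# applied to the snippet above (lookup_phone is a hypothetical helper that does
# the reference_table lookup for one row):
# mask = df[feature].isna()
# df.loc[mask, feature] = df.loc[mask].apply(lookup_phone, axis=1)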

Categorizing data based on a string in each row

I have the following dataframe:
raw_data = {'name': ['Willard', 'Nan', 'Omar', 'Spencer'],
'Last_Name': ['Smith', 'Nan', 'Sheng', 'Poursafar'],
'favorite_color': ['blue', 'red', 'Nan', "green"],
'Statues': ['Match', 'Mis-Match', 'Match', 'Mis_match']}
df = pd.DataFrame(raw_data, columns = ['name', 'age', 'favorite_color', 'grade'])
df
I want to do the following tasks:
Separate the rows that contain Match and Mis-match.
Make a category that only contains people whose first name and last name are Nan and who love a color (any color except for Nan).
Can you guys help me?
Use boolean indexing:
df1 = df[df['Statues'] == 'Match']
df2 = df[df['Statues'] =='Mis-Match']
If the missing values are not strings, use Series.isna and Series.notna:
df3 = df[df['name'].isna() & df['Last_Name'].isna() & df['favorite_color'].notna()]
If the NaNs are strings, compare with 'Nan':
df3 = df[(df['name'] == 'Nan') &
         (df['Last_Name'] == 'Nan') &
         (df['favorite_color'] != 'Nan')]
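With the sample data above (where the missing values are the literal string 'Nan'), only the second row satisfies that condition, so df3 should contain just:
print(df3)
#   name Last_Name favorite_color    Statues
# 1  Nan       Nan            red  Mis-Match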

Using non-zero values from columns in function - pandas

I have the dataframe below and would like to calculate the difference between columns 'animal1' and 'animal2' over their sum within a function, while only taking into consideration the values that are bigger than 0 in each of the columns 'animal1' and 'animal2'.
How could I do this?
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in a row are greater than 0 with DataFrame.gt, test with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for not equal 0
# df = data[data.ne(0).all(axis=1)].copy()
print(df)
         animal1  animal2
Cat            4        2
Chicken        3        2

print(animals(df))
Cat
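As a side note, an equivalent sketch (my addition) that avoids adding the helper column to the frame: compute the ratio as a Series on the filtered frame and take the index of its largest absolute value.
filtered = data[data.gt(0).all(axis=1)]
ratio = (filtered['animal1'] - filtered['animal2']) / (filtered['animal1'] + filtered['animal2'])
print(ratio.abs().idxmax())   # Cat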

Pandas, concatenating values of columns.

I have found answers to this question on here before, but none of them seem to work for me. Right now I have a data frame with a list of clients and their addresses. However, each address is separated into many columns and I'm trying to put them all into one.
The code I have so far reads as follows:
data1_df['Address'] = data1_df['Address 1'].map(str) + ", " + data1_df['Address 2'].map(str) + ", " + data1_df['Address 3'].map(str) + ", " + data1_df['city'].map(str) + ", " + data1_df['city'].map(str) + ", " + data1_df['Province/State'].map(str) + ", " + data1_df['Country'].map(str) + ", " + data1_df['Postal Code'].map(str)
However, the error I get is:
TypeError: Unary plus expects numeric dtype, not object
I'm not sure why it's not accepting the strings as they are and using the + operator. Shouldn't the plus accommodate objects?
Hopefully you'll find this example helpful:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': list('ABC'),
                   'C': [4, 5, np.nan],
                   'D': ['One', np.nan, 'Three']})
addColumns = ['B', 'C', 'D']
df['Address'] = df[addColumns].astype(str).apply(lambda x: ', '.join([i for i in x if i != 'nan']), axis=1)
df
#    A  B    C      D      Address
# 0  1  A  4.0    One  A, 4.0, One
# 1  2  B  5.0    NaN       B, 5.0
# 2  3  C  NaN  Three     C, Three
The above works because the string representation of NaN is 'nan'.
Or you can do it by filling the NaNs with empty strings:
df['Address'] = df[addColumns].fillna('').astype(str).apply(lambda x: ', '.join([i for i in x if i]), axis=1)
In the case of columns with NaN values that you need to add together, here's some logic:
def add_cols_w_nan(df, col_list, space_char, new_col_name):
    """Add together multiple columns where some of the columns
    may contain NaN, with the appropriate amount of spacing between columns.

    Examples:
        'Mr.' + NaN + 'Smith' becomes 'Mr. Smith'
        'Mrs.' + 'J.' + 'Smith' becomes 'Mrs. J. Smith'
        NaN + 'J.' + 'Smith' becomes 'J. Smith'

    Args:
        df: pd.DataFrame
            DataFrame for which strings are added together.
        col_list: ORDERED list of column names, e.g. ['first_name',
            'middle_name', 'last_name']. The columns will be added in order.
        space_char: str
            Character to insert between concatenation of columns.
        new_col_name: str
            Name of the new column after adding together strings.

    Returns: pd.DataFrame with a string addition column
    """
    df2 = df[col_list].copy()
    # Convert to strings, leave nulls alone
    df2 = df2.where(df2.isnull(), df2.astype('str'))
    # Add the space character; NaN remains NaN, which is important
    df2.loc[:, col_list[1:]] = space_char + df2.loc[:, col_list[1:]]
    # Fix rows where the leading columns are null
    to_fix = df2.notnull().idxmax(1)
    for col in col_list[1:]:
        m = to_fix == col
        df2.loc[m, col] = df2.loc[m, col].str.replace(space_char, '')
    # So that summation works
    df2[col_list] = df2[col_list].replace(np.NaN, '')
    # Add together all columns
    df[new_col_name] = df2[col_list].sum(axis=1)
    # If all are missing, replace with missing
    df[new_col_name] = df[new_col_name].replace('', np.NaN)
    del df2
    return df
Sample Data:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Address 1': ['AAA', 'ABC', np.NaN, np.NaN, np.NaN],
                   'Address 2': ['foo', 'bar', 'baz', None, np.NaN],
                   'Address 3': [np.NaN, np.NaN, 17, np.NaN, np.NaN],
                   'city': [np.NaN, 'here', 'there', 'anywhere', np.NaN],
                   'state': ['NY', 'TX', 'WA', 'MI', np.NaN]})
#   Address 1 Address 2  Address 3      city state
# 0       AAA       foo        NaN       NaN    NY
# 1       ABC       bar        NaN      here    TX
# 2       NaN       baz       17.0     there    WA
# 3       NaN      None        NaN  anywhere    MI
# 4       NaN       NaN        NaN       NaN   NaN
df = add_cols_w_nan(
    df,
    col_list=['Address 1', 'Address 2', 'Address 3', 'city', 'state'],
    space_char=', ',
    new_col_name='full_address')

df.full_address.tolist()
#['AAA, foo, NY',
# 'ABC, bar, here, TX',
# 'baz, 17.0, there, WA',
# 'anywhere, MI',
# nan]
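A shorter alternative sketch (my addition, not from the original answer) that should produce the same full_address values on this sample: stack() drops the NaN/None cells of each row before joining.
addr_cols = ['Address 1', 'Address 2', 'Address 3', 'city', 'state']
df['full_address_alt'] = (
    df[addr_cols]
    .stack()               # drops NaN and None, keeps a (row, column) MultiIndex
    .astype(str)
    .groupby(level=0)      # regroup by the original row index
    .agg(', '.join)
)
# rows where every column is missing simply stay NaN, as with add_cols_w_nan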

numpy unique could not filter out groups with the same value on a specific column

I tried to group a df and then select the groups that do not have the same value in a specific column and whose group size is > 1:
df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
The df looks like this (note that some account_no strings contain only a single space, ext_id and int_id are also strings, and amount is a float):
account_no   ext_id      amount   int_id
             2665057  439.504062  D000192
             2665057  439.504062  D000192
              353724     2758.92      952
              353724     2758.92      952
The code is supposed to return an empty df, since none of the rows in the sample satisfy the conditions here, but the rows with int_id = 952 remained, so how do I fix the issue here?
P.S. numpy 1.14.3, pandas 0.22.0, python 3.5.2
In my opinion the problem is some trailing whitespace or similar.
You can check it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'account_no': ['a', 'a', 'a', 'a'],
                   'ext_id': [2665057, 2665057, 353724, 353724],
                   'amount': [439.50406200000003, 439.50406200000003, 2758.92, 2758.92],
                   'int_id': ['D000192', 'D000192', ' 952', '952']})
print (df)
  account_no       amount   ext_id   int_id
0          a   439.504062  2665057  D000192
1          a   439.504062  2665057  D000192
2          a  2758.920000   353724      952
3          a  2758.920000   353724      952

df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
  account_no   amount  ext_id int_id
2          a  2758.92  353724    952
3          a  2758.92  353724    952

print (df1['int_id'].tolist())
[' 952', '952']
And then remove it with str.strip:
df['int_id'] = df['int_id'].str.strip()
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
Empty DataFrame
Columns: [account_no, amount, ext_id, int_id]
Index: []
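An equivalent check (a sketch, not from the original answer) that skips numpy entirely: Series.nunique counts the distinct values directly, so the same filter can be written as:
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(
    lambda x: (len(x) > 1) & (x['int_id'].nunique() != 1))
print (df1)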
