I have a big data set (100k+ rows) with many more columns than in the snippet attached. I need to replace missing values with values from a reference table. I have found countless articles on how to replace NaN values with the same number, but no relevant help on replacing them with different values obtained from a function. My problem is that np.nan is not equal to np.nan, so how can I make the comparison? What I want to say is: if the value is null, replace it with the corresponding value from the reference table. I have found the way shown below, but it is a dangerous method because it performs the replacement only in the exception handler, so if anything goes wrong I wouldn't see it. Here is the snippet:
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#2', 'www#3', 'www#3'],
    'BI Telephone': ['999', '999', '666', '001', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
df
and here is my method:
feature = 'BI Telephone'
df[[feature]] = df[[feature]].astype('string')

def missing_phone(row):
    try:
        old_value = row[feature]
        if (old_value == 'NaN' or old_value == 'nan' or old_value == np.nan
                or old_value is None or old_value == ''):
            reference_value = row[reference_column]
            new_value = reference_table[reference_table[reference_column] == reference_value].iloc[0, 0]
            print('changed')
            return new_value
        else:
            print('unchanged as value is not nan. The value is {}'.format(old_value))
            return old_value
    except Exception as e:
        reference_value = row[reference_column]
        new_value = reference_table[reference_table[reference_column] == reference_value].iloc[0, 0]
        print('exception')
        return new_value

df[feature] = df.apply(missing_phone, axis=1)
df
If I don't change the data type to string, the NaN is simply left unchanged. How can I fix it?
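For reference, pandas provides pd.isna (and the Series.isna method) precisely because np.nan == np.nan is False; here is a minimal sketch of the null check, assuming the same df as above:
import numpy as np
import pandas as pd

print(np.nan == np.nan)            # False: NaN never compares equal to itself
print(pd.isna(np.nan))             # True: pd.isna handles np.nan, None and pd.NA
print(df['BI Telephone'].isna())   # boolean Series marking the missing entries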
There is a function in the Python 2 code that I am rewriting in Python 3:
def abc(self, id):
    if not isinstance(id, int):
        id = int(id)
    mask = self.programs['ID'] == id
    assert sum(mask) > 0
    name = self.programs[mask]['name'].values[0]
"id" here is a panda series where the index is strings and the column is int like the following
data = np.array(['1', '2', '3', '4', '5'])
# providing an index (it must have one label per element)
ser = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(ser)
self.programs['ID'] is a DataFrame column holding integer data, with one row like this:
import pandas as pd
# initialize a list of lists
data = [[1, 'abc']]
# create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'name'])
I am really confused by the lines mask = self.programs['ID'] == id and assert sum(mask) > 0. Could someone enlighten me?
Basically, mask = self.programs['ID'] == id returns a Series of boolean values indicating whether those 'ID' values are equal to id or not.
Then assert sum(mask) > 0 sums up the boolean Series. Note that bool True is treated as 1 in Python and False as 0, so this asserts that there is at least one row where the programs['ID'] column has a value equal to id.
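A minimal, self-contained demo of that pattern (the programs DataFrame here is made up for illustration):
import pandas as pd

programs = pd.DataFrame({'ID': [1, 2, 2], 'name': ['abc', 'def', 'ghi']})

mask = programs['ID'] == 2
print(mask)                                # a boolean Series: False, True, True
print(sum(mask))                           # 2, since True counts as 1 and False as 0
assert sum(mask) > 0                       # passes because at least one row matched
print(programs[mask]['name'].values[0])    # 'def', the first matching name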
I have a fairly large data set of over 100k rows with many duplicates and some missing or faulty values. I'm trying to simplify the problem in the snippet below.
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', np.nan],
    'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
I'm trying to change the values based on duplicate rows: if any three fields match, then the fourth one should match as well. I should get an outcome like this:
result = {
    'BI Business Name': ['AAA', 'AAA', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', 'www#3'],
    'BI Telephone': ['999', '999', '666', '12345', '12345']
}
df = pd.DataFrame(result)
I have found an extremely long-winded method; here I'm showing just the part for changing the name.
df['Phone_code_web'] = df['BId Postcode'] + df['BI Website'] + df['BI Telephone']
reference_name = df[['BI Business Name', 'BI Telephone', 'BId Postcode', 'BI Website']]
reference_name = reference_name.dropna()
reference_name['Phone_code_web'] = (reference_name['BId Postcode']
                                    + reference_name['BI Website']
                                    + reference_name['BI Telephone'])
duplicate_ref = reference_name[reference_name['Phone_code_web'].duplicated()]
reference_name = pd.concat([reference_name, duplicate_ref]).drop_duplicates(keep=False)
reference_name

def replace_name(row):
    try:
        old_name = row['BI Business Name']
        reference = row['Phone_code_web']
        new_name = reference_name[reference_name['Phone_code_web'] == reference].iloc[0, 0]
        print(new_name)
        return new_name
    except Exception as e:
        return old_name

df['BI Business Name'] = df.apply(replace_name, axis=1)
df
Is there an easier way of doing this?
You can try this:
import numpy as np
import pandas as pd

sampleData = {
    'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD', 'DDD'],
    'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
    'BI Website': ['www#1', 'www#1', 'www#2', 'www#3', np.nan],
    'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
print(df)
def fill_gaps(_df, _x):  # _df and _x are local variables representing the dataframe and one of its rows, respectively
    # pd.isnull(_x) = list of booleans indicating which columns hold NaNs
    # _df.columns[pd.isnull(_x)] = list of columns whose value is NaN
    for col in _df.columns[pd.isnull(_x)]:
        # len(set(y) & set(_x)) = size of the intersection of the row being considered (_x) and each of the other rows in turn (y)
        # the mask is a list of booleans which are True if:
        # 1) y[col] is not null (e.g. for row 3 we need to replace (BI Telephone = NaN) with a non-NaN 'BI Telephone' value)
        # 2) and the size of the intersection above is exactly 3 (as required)
        mask = _df.apply(lambda y: pd.notnull(y[col]) and len(set(y) & set(_x)) == 3, axis=1)
        # if the mask has at least one True value, select the value in the corresponding column (if there are several candidates, take the first one)
        _x[col] = _df[mask][col].iloc[0] if any(mask) else _x[col]
    return _x

# Apply the logic described above to each row in turn (x = each row)
df = df.apply(lambda x: fill_gaps(df, x), axis=1)
print(df)
Output:
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www#1 999
1 BBB NW1 8NZ www#1 999
2 CCC WC2N 4AA www#2 666
3 DDD CV7 9JY www#3 NaN
4 DDD CV7 9JY NaN 12345
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www#1 999
1 BBB NW1 8NZ www#1 999
2 CCC WC2N 4AA www#2 666
3 DDD CV7 9JY www#3 12345
4 DDD CV7 9JY www#3 12345
I have this df:
df = pd.DataFrame({'Option': ["A", "B", "C"]})
I'm trying to create a new column, Identifier, that equals 1 if the value in the Option column equals "A". If not, the value in Identifier should be 0.
I created the following function to do this:
def trigger(row):
    if df['Option'] == "A":
        return 1
    else:
        return 0
Here is what I tried for the Identifier column:
df['Identifier'] = df['Option'].apply(trigger, axis=1)
When I print(df), I get the following error: TypeError: trigger() got an unexpected keyword argument 'axis'
The final df should look like this:
finaldf = {'Option': ["A", "B", "C"],
           'Identifier': [1, 0, 0]}
It seems like a relatively straightforward problem, but I don't know why it doesn't work.
Your method does not work because you are not using row inside trigger. Furthermore, you can do this fully vectorized:
df['Identifier'] = 0
df.loc[df.Option == 'A', 'Identifier'] = 1
Try:
df['Identifier'] = np.where(df.Option == 'A', 1,0)
For multiple conditions you might try
df["Identfier"] = np.where(df.Option.isin(["A", "B"]), 1, 0)
I have the following dataframe:
import pandas as pd

raw_data = {'name': ['Willard', 'Nan', 'Omar', 'Spencer'],
            'Last_Name': ['Smith', 'Nan', 'Sheng', 'Poursafar'],
            'favorite_color': ['blue', 'red', 'Nan', 'green'],
            'Statues': ['Match', 'Mis-Match', 'Match', 'Mis_match']}
df = pd.DataFrame(raw_data, columns=['name', 'Last_Name', 'favorite_color', 'Statues'])
df
I want to do the following tasks:
Separate the rows that contain Match and Mis-Match.
Make a category that only contains people whose first name and last name are 'Nan' and who love a color (any color except 'Nan').
Can you guys help me?
Use boolean indexing:
df1 = df[df['Statues'] == 'Match']
df2 = df[df['Statues'] == 'Mis-Match']
If the missing values are not strings, use Series.isna and
Series.notna:
df3 = df[df['name'].isna() & df['Last_Name'].isna() & df['favorite_color'].notna()]
If the Nans are strings, compare against 'Nan':
df3 = df[(df['name'] == 'Nan') &
         (df['Last_Name'] == 'Nan') &
         (df['favorite_color'] != 'Nan')]
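Putting both pieces together on the sample data (with the string 'Nan' markers, as in the question):
df1 = df[df['Statues'] == 'Match']          # the Willard and Omar rows
df3 = df[(df['name'] == 'Nan') &
         (df['Last_Name'] == 'Nan') &
         (df['favorite_color'] != 'Nan')]   # the 'Nan Nan' person whose favorite color is red
print(df1)
print(df3)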
I have a data frame with 201279 entries; the last column, labeled "text", holds customer reviews. The problem is that most of them are missing values and come up as NaN.
I read some interesting information from this question:
Python numpy.nan and logical functions: wrong results
and I tried applying it to my problem:
df1.columns
Index(['id', 'sku', 'title', 'reviewCount', 'commentCount', 'averageRating',
'date', 'time', 'ProductName', 'CountOfBigTransactions', 'ClassID',
'Weight', 'Width', 'Depth', 'Height', 'LifeCycleName', 'FinishName',
'Color', 'Season', 'SizeOrUtility', 'Material', 'CountryOfOrigin',
'Quartile', 'display-name', 'online-flag', 'long-description', 'text'],
dtype='object')
I tried experimenting by doing this:
df['firstName'][202360] == np.nan
which returns False, even though that index really does contain an np.nan.
So I looked for an answer, read through the question I linked, and saw that
np.bool(df1['text'][201279])==True
is a true statement. I thought, okay, I can run with this.
So, here's my code so far:
from textblob import TextBlob
import string

def remove_num_punct(aText):
    p = string.punctuation
    d = string.digits
    j = p + d
    table = str.maketrans(j, len(j) * ' ')
    return aText.translate(table)
# Process text
aList = []
for text in df1['text']:
    if np.bool(df1['text']) == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(text)
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
Then I would just convert aList with the sentiment to pd.DataFrame and join it to df1, then impute the missing values with K-nearest neighbors.
My problem is that the little routine I made throws a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So I'm not really sure what else to try. Thanks in advance!
EDIT: I have tried this:
i = 0
aList = []
for txt in df1['text'].isnull():
    i += 1
    if txt == True:
        aList.append(np.nan)
which correctly populates the list with NaN.
But this gives me a different error:
i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
        i += 1
AttributeError: 'float' object has no attribute 'translate'
Which doesn't make sense, since if it is not NaN, then it contains text, right?
You can try dealing with the null text this way:
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})
df1 = df['toy']
for i in range(len(df1)):
    if not df1[i]:           # None (and any other falsy value) fails this test
        df2 = df1.drop(i)
df2
I fixed it: I had to move the i += 1 out of the else block, back to the level of the for loop, so that it runs on every iteration:
i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
    i += 1
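As an aside, here is a sketch of the same loop without the manual counter, iterating over the values directly and testing each one with pd.isna (an alternative formulation, not the poster's code):
import numpy as np
import pandas as pd
from textblob import TextBlob

aList = []
for text in df1['text']:
    if pd.isna(text):                 # per-element null test, no index bookkeeping
        aList.append(np.nan)
    else:
        b = remove_num_punct(text)    # the cleaning helper defined earlier
        aList.append(TextBlob(b).sentiment.polarity)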