How to skip over np.nan while iterating through a dataframe for sentiment analysis - python-3.x

I have a data frame with 201279 entries, the last column is labeled "text" with customer reviews. The problem is that most of them are missing values, and come up as NaN.
I read some interesting information from this question:
Python numpy.nan and logical functions: wrong results
and I tried applying it to my problem:
df1.columns
Index(['id', 'sku', 'title', 'reviewCount', 'commentCount', 'averageRating',
'date', 'time', 'ProductName', 'CountOfBigTransactions', 'ClassID',
'Weight', 'Width', 'Depth', 'Height', 'LifeCycleName', 'FinishName',
'Color', 'Season', 'SizeOrUtility', 'Material', 'CountryOfOrigin',
'Quartile', 'display-name', 'online-flag', 'long-description', 'text'],
dtype='object')
I tried experimenting by doing this:
df['firstName'][202360]== np.nan
which returns False but indeed that index contains an np.nan.
So I looked for an answer, read through the question I linked, and saw that
np.bool(df1['text'][201279])==True
is a true statement. I thought, okay, I can run with this.
So, here's my code so far:
from textblob import TextBlob
import string
def remove_num_punct(aText):
    p = string.punctuation
    d = string.digits
    j = p + d
    table = str.maketrans(j, len(j)* ' ')
    return aText.translate(table)
#Process text
aList = []
for text in df1['text']:
    if np.bool(df1['text'])==True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(text)
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
Then I would just convert aList with the sentiment to pd.DataFrame and join it to df1, then impute the missing values with K-nearest neighbors.
My problem is that the little routine I made throws a value error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So I'm not really sure what else to try. Thanks in advance!
EDIT: I have tried this:
i = 0
aList = []
for txt in df1['text'].isnull():
    i += 1
    if txt == True:
        aList.append(np.nan)
which correctly populates the list with NaN.
But this gives me a different error:
i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
        i+=1
AttributeError: 'float' object has no attribute 'translate'
Which doesn't make sense, since if it is not NaN, then it contains text, right?

import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})
df1 = df['toy']
for i in range(len(df1)):
    # None (and other falsy values) are treated as missing text
    if not df1[i]:
        df2 = df1.drop(i)
df2
You can try handling the null text this way.

I fixed it. I had to move the i += 1 out of the else block and back to the for-loop level, so that the index advances on every row and not only on rows that contain text:
i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
    i+=1
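A possible way to avoid the manual counter altogether, sketched under assumptions: NaN never compares equal to anything, including itself, which is why the `== np.nan` test in the question returned False; `pd.isna` is the check that works. The sketch below applies a per-value function with `Series.apply`. The sample reviews and `toy_scorer` are hypothetical stand-ins so the example runs without TextBlob; in the real code the scorer would presumably be something like `lambda t: TextBlob(t).sentiment.polarity`.

```python
import string

import numpy as np
import pandas as pd

# Translation table that turns punctuation and digits into spaces
_CHARS = string.punctuation + string.digits
_TABLE = str.maketrans(_CHARS, " " * len(_CHARS))


def remove_num_punct(text):
    return text.translate(_TABLE)


def toy_scorer(text):
    # Hypothetical stand-in for TextBlob(text).sentiment.polarity
    words = text.lower().split()
    return float(words.count("good") - words.count("bad"))


def polarity_or_nan(value, scorer):
    # pd.isna catches both np.nan and None; value == np.nan never would
    if pd.isna(value):
        return np.nan
    return scorer(remove_num_punct(value))


# Hypothetical sample data standing in for df1['text']
df1 = pd.DataFrame({"text": ["Good product!!! 10/10", np.nan, "bad, bad zipper", None]})
df1["polarity"] = df1["text"].apply(lambda v: polarity_or_nan(v, toy_scorer))
print(df1["polarity"].tolist())  # [1.0, nan, -2.0, nan]
```

Because the result is assigned straight back to the DataFrame, there is no separate list to convert and join afterwards.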

Related

Set decimal values to 2 points in list under list pandas

I am trying to limit the values in a nested list result to at most 2 decimal places. I have already tried setting the display precision and a few other things, but cannot find a way.
r_ij_matrix = variables[1]
print(type(r_ij_matrix))
print(type(r_ij_matrix[0]))
pd.set_option('display.expand_frame_repr', False)
pd.set_option("display.precision", 2)
data = pd.DataFrame(r_ij_matrix, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Combined Decision Matrix')
You can solve your problem with the apply() method of the dataframe. You can do something like this:
df.apply(lambda x: [[round(elt, 2) for elt in list_] for list_ in x])
Solved it by copying the list to another with the desired decimal points. Thanks everyone.
rij_matrix = variables[1]
rij_nparray = np.empty([8, 6, 3])
for i in range(8):
    for j in range(6):
        for k in range(3):
            rij_nparray[i][j][k] = round(rij_matrix[i][j][k], 2)
rij_list = rij_nparray.tolist()
pd.set_option('display.expand_frame_repr', False)
data = pd.DataFrame(rij_list, columns= Attributes, index= Names)
df = data.style.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
df.set_properties(**{'text-align': 'center'})
df.set_caption('Table: Normalized Fuzzy Decision Matrix (r_ij)')
applymap seems to be a good fit here, but there is a caveat: be aware that it is probably not the best idea to store lists as values of a DataFrame, because you give up the functionality of pandas. Also, after formatting them like this, they are stored as strings. This (if really wanted) should only be used for presentation.
df.applymap(lambda lst: list(map("{:.2f}".format, lst)))
Output:
                    A                   B
0  [2.05, 2.28, 2.49]  [3.11, 3.27, 3.42]
1  [2.05, 2.28, 2.49]  [3.11, 3.27, 3.42]
2  [2.05, 2.28, 2.49]  [3.11, 3.27, 3.42]
Used Input:
df = pd.DataFrame({
    'A': [[2.04939015319192, 2.280350850198276, 2.4899799195977463],
          [2.04939015319192, 2.280350850198276, 2.4899799195977463],
          [2.04939015319192, 2.280350850198276, 2.4899799195977463]],
    'B': [[3.1144823004794873, 3.271085446759225, 3.420526275297414],
          [3.1144823004794873, 3.271085446759225, 3.420526275297414],
          [3.1144823004794873, 3.271085446759225, 3.420526275297414]]})
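If the values should stay numeric instead of becoming strings, the triple loop from the earlier answer can probably be collapsed into a single `np.round` call. This is only a sketch: the 3 x 2 x 3 `matrix` below is made-up sample data built from the numbers above, not the asker's actual `variables[1]`.

```python
import numpy as np

# Hypothetical nested list: 3 rows, 2 columns, 3 values per cell
matrix = [
    [[2.04939015319192, 2.280350850198276, 2.4899799195977463],
     [3.1144823004794873, 3.271085446759225, 3.420526275297414]]
    for _ in range(3)
]

# Round every element at once, then convert back to plain nested lists
rounded = np.round(np.asarray(matrix), 2).tolist()
print(rounded[0])  # [[2.05, 2.28, 2.49], [3.11, 3.27, 3.42]]
```

The resulting nested list still holds floats, so it can be handed to the DataFrame constructor the same way the original list was.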

Python3 multiple equal sign in the same line

There is a function in the python2 code that I am re-writing into python3
def abc(self, id):
    if not isinstance(id, int):
        id = int(id)
    mask = self.programs['ID'] == id
    assert sum(mask) > 0
    name = self.programs[mask]['name'].values[0]
"id" here is a pandas Series where the index is strings and the column is int, like the following:
data = np.array(['1', '2', '3', '4', '5'])
# providing an index (one label per value, otherwise pandas raises a length mismatch error)
ser = pd.Series(data, index =['a', 'b', 'c', 'd', 'e'])
print(ser)
self.programs['ID'] is a dataframe column where there is one row with integer data like '1':
import pandas as pd
# initialize list of lists
data = [[1, 'abc']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['ID', 'name'])
I am really confused by the lines "mask = self.programs['ID'] == id \ assert sum(mask) > 0". Could someone enlighten me?
Basically, mask = self.programs['ID'] == id returns a Series of boolean values indicating whether each of those 'ID' values is equal to id or not.
Then assert sum(mask) > 0 sums up the boolean Series. Note that in Python True is treated as 1 and False as 0, so this asserts that there is at least one row where the programs['ID'] column has a value equal to id.
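A small sketch of the explanation above. The `programs` table and `wanted_id` are hypothetical sample values, not the asker's data; the point is only that the first `=` is an assignment and the `==` is an element-wise comparison.

```python
import pandas as pd

# Hypothetical stand-in for self.programs
programs = pd.DataFrame({"ID": [1, 2, 3], "name": ["abc", "def", "ghi"]})
wanted_id = int("2")  # mirrors the int(id) conversion in the method

# Element-wise comparison: one boolean per row
mask = programs["ID"] == wanted_id
print(mask.tolist())  # [False, True, False]

# True counts as 1, so the sum is the number of matching rows
assert sum(mask) > 0

# Boolean indexing keeps only the matching rows
name = programs[mask]["name"].values[0]
print(name)  # def
```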

Truth value of a Series is ambiguous when using apply lambda to check condition

I am attempting to strip commas from columns that I will later convert to numeric and was hoping I could get some advice regarding this error.
I have defined my columns that I want to conduct the str.replace operation on. I can remove whitespace using the same approach with no issues, but when I run the below code I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Here is my code. Thanks for some pointers in how I am misusing the lambda function.
numeric_cols = ['Doses – AZ/SII (indicative distribution)',
'Doses – AZ/SKBio (indicative distribution)',
'Doses – Pfizer-BioNTech (exceptional allocation)']
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df = df.apply(lambda x: x.str.replace(',', '') if x in numeric_cols else x)
You can do it with a helper function as follows:
import pandas as pd
cols = ['AZ/SII','AZ/SKBio','Pfizer-BioNTech','non-numeric']
df = pd.DataFrame([['3,2', '4,5', '5,6', 'a,b,c,d'], ['5,','3,2','4,7','a,f,r,h']], columns= cols)
print(f'Before stripping: \n{df}\n')
def remove_comma(row):
    numeric_cols = ['AZ/SII', 'AZ/SKBio', 'Pfizer-BioNTech']
    for col in numeric_cols:
        row[col] = row[col].replace(',','')
    return row
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df = df.apply(lambda row: remove_comma(row), axis = 1)
print(f'After stripping: \n{df}\n')
#output:
Before stripping:
  AZ/SII AZ/SKBio Pfizer-BioNTech non-numeric
0    3,2      4,5             5,6     a,b,c,d
1     5,      3,2             4,7     a,f,r,h
After stripping:
  AZ/SII AZ/SKBio Pfizer-BioNTech non-numeric
0     32       45              56     a,b,c,d
1      5       32              47     a,f,r,h
This is also an option:
for col in numeric_cols:
    df[col] = df[col].str.replace(',', '')
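As for why the original lambda fails: `df.apply` passes each column to the lambda as a Series, so `x in numeric_cols` compares a whole Series against each string and then asks for the truth value of the resulting Series, which is what raises the ValueError. Testing the column's name instead should keep the one-liner shape. This is a sketch with shortened, hypothetical column names and sample values.

```python
import pandas as pd

# Hypothetical data; column names shortened for the example
numeric_cols = ["AZ/SII", "AZ/SKBio"]
df = pd.DataFrame({
    "AZ/SII": ["1,000", "2,500"],
    "AZ/SKBio": ["3,200", "40"],
    "note": ["a,b", "c,d"],
})

# col.name is the column label, a plain string, so the membership test is unambiguous
df = df.apply(
    lambda col: col.str.replace(",", "", regex=False) if col.name in numeric_cols else col
)

# Convert the cleaned columns to numbers
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
print(df["AZ/SII"].tolist())  # [1000, 2500]
```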

Pandas print unique values as string

I've got a list of unique values from a selected column in a pandas dataframe. What I want to achieve is to print the result as a string.
import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(a)
Output: ['A' 'C' 'B']
Desired output: A, C, B
So far I've tried below,
print(a.to_string())
Got this error: AttributeError: 'numpy.ndarray' object has no attribute 'to_string'
print(a.tostring())
Got this: b'\xf0\x04\xa6P\x9e\x01\x00\x000\xaf\x92P\x9e\x01\x00\x00\xb0\xaf\x92P\x9e\x01\x00\x00'
Can anyone give me a hint?
import pandas as pd
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(', '.join(a)) # or print(*a, sep=', ')
Prints:
A, C, B
EDIT: To store as variable:
text = ', '.join(a)
print(text)
This should work:
print(', '.join(a))
Python 3 solution:
df = pd.DataFrame({'A':['A','C','C','B','A','C','B']})
a = df['A'].unique()
print(*a, sep=", ")
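One caveat worth sketching: `', '.join(...)` only accepts strings, so for a non-string column the values need converting first. The numeric column `n` below is a hypothetical addition to the example DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["A", "C", "C", "B", "A", "C", "B"],
    "n": [1, 2, 2, 3, 1, 2, 3],  # hypothetical numeric column
})

# String column: join works directly
letters = ", ".join(df["A"].unique())
print(letters)  # A, C, B

# Numeric column: convert each value to str before joining
numbers = ", ".join(map(str, df["n"].unique()))
print(numbers)  # 1, 2, 3
```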

How to remove rows in pandas dataframe column that contain the hyphen character?

I have a DataFrame given as follows:
new_dict = {'Area_sqfeet': '[1002, 322, 420-500,300,1.25acres,100-250,3.45 acres]'}
df = pd.DataFrame([new_dict])
df.head()
I want to remove hyphen values and change acres to sqfeet in this dataframe.
How may I do it efficiently?
Use list comprehension:
mylist = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
# ['1002', '322', '420-500', '300', '1.25acres', '100-250', '3.45 acres']
Step 1: Remove hyphens
filtered_list = [i for i in mylist if "-" not in i] # remove hyphens
Step 2: Convert acres to sqfeet:
final_list = [i if 'acres' not in i else float(i.split('acres')[0])*43560 for i in filtered_list] # convert to sq foot (float() instead of eval() to avoid executing arbitrary text)
#['1002', '322', '300', 54450.0, 150282.0]
Also, if you want to keep the "sqfeet" next to the converted values use this:
final_list = [i if 'acres' not in i else "{} sqfeet".format(float(i.split('acres')[0])*43560) for i in filtered_list]
# ['1002', '322', '300', '54450.0 sqfeet', '150282.0 sqfeet']
It's not clear if this is homework, and you haven't shown us what you have already tried per https://stackoverflow.com/help/how-to-ask
Here's something that might get you going in the right direction:
import pandas as pd
col_name = 'Area_sqfeet'
# per comment on your question, you need to make a dataframe with more
# than one row, your original question only had one row
new_list = ["1002", "322", "420-500","300","1.25acres","100-250","3.45 acres"]
df = pd.DataFrame(new_list)
df.columns = ["Area_sqfeet"]
# once you have the df as strings, here's how to remove the ones with hyphens
df = df[df["Area_sqfeet"].str.contains("-")==False]
print(df.head())
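The acres conversion from the first answer can be combined with this DataFrame approach. A sketch, assuming one value per row as above and 43560 square feet per acre (the factor used in the list-comprehension answer); `SQFT_PER_ACRE`, `is_acres` and `values` are names made up for the example.

```python
import pandas as pd

SQFT_PER_ACRE = 43560

df = pd.DataFrame({"Area_sqfeet": ["1002", "322", "420-500", "300",
                                   "1.25acres", "100-250", "3.45 acres"]})

# Drop rows whose value contains a hyphen
df = df[~df["Area_sqfeet"].str.contains("-", regex=False)].copy()

# Remember which rows are in acres, then strip the unit and parse as numbers
is_acres = df["Area_sqfeet"].str.contains("acres", regex=False)
values = pd.to_numeric(
    df["Area_sqfeet"].str.replace("acres", "", regex=False).str.strip()
)

# Keep square-foot values as they are; scale the acre values
df["Area_sqfeet"] = values.where(~is_acres, values * SQFT_PER_ACRE)
print(df)
```

Unlike the list version, every value ends up numeric here, so the column can be used directly in later calculations.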
