Iterating over a text column in a dataframe - python-3.x

[screenshot of the DataFrame]
Hi all. I am working on a dataframe (pictured above) with over 18000 observations. What I'd like to do is get the text in the column 'review', row by row, and then do a word count on it later. At the moment I have been trying to iterate over it, but I keep getting errors like "TypeError: 'float' object is not iterable". Here is the code I used:
def tokenize(text):
    for row in text:
        for i in row:
            if i is not None:
                words = i.lower().split()
                return words
            else:
                return None
data['review_two'] = data['review'].apply(tokenize)
Now my question is: how do I iterate effectively and efficiently over the column 'review' so that I can preprocess each row, one after the other, before performing the word count?

My hypothesis for the error is that you have missing data, which shows up as NaN and makes the tokenize function fail. You can check it with pd.isnull(df["review"]), which gives you a boolean array indicating whether each row is NaN. If any(pd.isnull(df["review"])) is true, then there is a missing value in the column.
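A minimal sketch of that check (assuming your DataFrame is called data, as in your snippet):

import pandas as pd

# NaN values are floats, which is why string operations on them raise the TypeError
missing = pd.isnull(data["review"])
print(missing.any())   # True if at least one review is missing
print(data[missing])   # inspect the offending rows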
I cannot reproduce the error since I don't have the data, but I think your goal can be achieved with the following.
import pandas as pd
from collections import Counter

df = pd.DataFrame([{"name": "A", "review": "No it is not good.", "rating": 2},
                   {"name": "B", "review": "Awesome!", "rating": 5},
                   {"name": "C", "review": "This is fine.", "rating": 3},
                   {"name": "C", "review": "This is fine.", "rating": 3}])

# first .lower, then .replace to drop punctuation, and finally .split to get lists of words
df["splitted"] = df.review.str.lower().str.replace(r'[^\w\s]', '', regex=True).str.split()

# pass every list to a Counter, then sum the Counters (Counters can be added)
df["splitted"].transform(lambda x: Counter(x)).sum()
Counter({'awesome': 1,
         'fine': 2,
         'good': 1,
         'is': 3,
         'it': 1,
         'no': 1,
         'not': 1,
         'this': 2})
The str.replace part is there to remove punctuation; see the answer to "Replacing punctuation in a data frame based on punctuation list" from @EdChum.

I'm not sure what you're trying to do, especially with for i in row. In any case, apply already iterates over the rows of your DataFrame/Series, so there's no need to do it in the function that you pass to apply.
Besides, your code does not raise a TypeError for a DataFrame like yours where the columns contain strings. See here for how to check whether your 'review' column contains only text.
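If the column does contain non-string values (for example NaN), one way to guard against them is to type-check inside the function you pass to apply. A rough sketch, not the only way to do it:

import pandas as pd

def tokenize(text):
    # apply passes one cell at a time; skip anything that is not a string (e.g. NaN)
    if isinstance(text, str):
        return text.lower().split()
    return None

data['review_two'] = data['review'].apply(tokenize)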

Maybe something like this, which gives you the word count; the rest of what you want I did not understand.
import pandas as pd

a = ['hello friend', 'a b c d']
b = pd.DataFrame(a)
print(b[0].str.split().str.len())
>> 0    2
   1    4

Related

Find a match in a column from list of strings from another column

I have two dataframes. I need to find a match and return the results in another column using the criteria below.
df1 = pd.DataFrame(
    {
        "Keywords": ["SYS", "SYS2", "SYS3"]
    }
)
df2 = pd.DataFrame(
    {
        "Lookup": ["TEST SYSTEM", "SYS", "DUMMY", "THIS IS SYS3"]
    }
)
My expected end result is:
df2 = pd.DataFrame(
    {
        "LookupResults": ["SYS", "THIS IS SYS3"]
    }
)
Basically I need to find the rows whose strings contain one of my keywords as a full word. Note I don't want TEST SYSTEM in my result, i.e. no partial matches.
Here is what I have tried so far.
# Convert the keywords column to a list
findwords = df1['Keywords'].values
# Split the Lookup strings into a list of words
df2['words'] = [set(words) for words in
                df2['Lookup'].str.strip().str.split()]
# Search using the below
df2['match'] = df2.words.apply(lambda words: all(target_word in words for target_word in findwords))
I am not getting the desired result. However, if I do something like findwords = ['SYS'], I do get the desired result.
Clearly I am a novice and missing some basics.
Any help is appreciated.
Thanks
# define the pattern from Keywords in df1
# \b : word boundary
pat = '\\b(' + '|'.join(df1['Keywords'].values) + ')\\b'
pat
'\\b(SYS|SYS2|SYS3)\\b'
# extract the pattern and filter using loc
df2.loc[df2['Lookup'].str.extract(pat)[0].notna()]
         Lookup
1           SYS
3  THIS IS SYS3
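As a small follow-up sketch (same df1 and df2 as above, names unchanged): str.extract also tells you which keyword was found, so you can keep both the matching rows and the matched keyword if that is useful.

import pandas as pd

df1 = pd.DataFrame({"Keywords": ["SYS", "SYS2", "SYS3"]})
df2 = pd.DataFrame({"Lookup": ["TEST SYSTEM", "SYS", "DUMMY", "THIS IS SYS3"]})

pat = '\\b(' + '|'.join(df1['Keywords'].values) + ')\\b'

matched = df2['Lookup'].str.extract(pat)[0]     # the keyword found in each row, NaN otherwise
result = df2.loc[matched.notna()].copy()
result['MatchedKeyword'] = matched[matched.notna()]
print(result)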

How to get the count of a word in a database using pandas

For the same database, the following 2 pieces of code give different answers when executed. According to the given answer the 2nd one is correct, but what is the mistake in the 1st code?
code 1
df = pd.read_csv("amazon_baby.csv", index_col="name")
sw = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
for i in sw:
    df[i] = df["review"].str.count(i)
    y = df[i].sum(axis=0)
    print(i, y)
code 2
df = pd.read_csv("amazon_baby.csv", index_col="name")
sw = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
df['word_count'] = df['review'].apply(lambda x: Counter(str(x).split()))

def great_count(x):
    if 'great' in x:
        return x.get('great')
    else:
        return 0

df['great3'] = df['word_count'].apply(great_count)
print(sum(df['great3']))
These pieces of code are quite different.
The first one takes each of the words in the sw list and counts its occurrences as a substring. This means that in the string "this is great, this is the greatest" the word "great" is counted 2 times, because "greatest" also contains it. This is the error, I suppose.
The second code splits the text into separate words: ['this', 'is', 'great,', 'this', 'is', 'the', 'greatest'], then calculates the counts: Counter({'this': 2, 'is': 2, 'great,': 1, 'the': 1, 'greatest': 1}) and shows the sum of the column.
But!! There is no word "great" in the Counter - this is because of the comma. So this is also wrong.
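A quick illustration of both problems on that example sentence:

from collections import Counter

s = "this is great, this is the greatest"
print(s.count("great"))             # 2: substring matching also hits "greatest"
print(Counter(s.split())["great"])  # 0: the comma turns the token into "great,"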
A better way would be to get rid of the punctuation first. For example like this (where t is the review text and string is the standard library's string module):
sum(1 for i in ''.join(ch for ch in t if ch not in string.punctuation).split() if i == 'great')
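Applied to the whole column, a sketch of the same idea could look like this (assuming df and the review column from the question; the lowercasing is an extra assumption on my part):

import string
import pandas as pd

sw = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible',
      'bad', 'terrible', 'awful', 'wow', 'hate']

# strip punctuation, lowercase, split into exact tokens
tokens = (df['review'].astype(str)
                      .str.lower()
                      .str.translate(str.maketrans('', '', string.punctuation))
                      .str.split())

for word in sw:
    print(word, tokens.apply(lambda toks: toks.count(word)).sum())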

Robot framework: Lists should be equal problem with empty values: "None" and ''

The problem here is that I am comparing two lists from different sources. One list is from Excel and the other list is from a particular table which represents the imported values of the same Excel values.
All values are correct, but the Excel gives one or possibly more values as None, while from the table I get those values as empty strings ''. How can I change None to '' or vice versa?
In this particular case None and '' are in the 10th value slot of the lists, but over time that can change because different values are put into the Excel.
So how can I remove or replace/modify these Nones to ''s, or vice versa?
Excel list: [1, 'X', 'Y', 200, 1999, 'Z', 'W', 4, 'V', None, 2, 1100]
Table list: [1, 'X', 'Y', 200, 1999, 'Z', 'W', 4, 'V', '', 2, 1100]
I am using ExcelLibrary and ExcelRobot to get a mixture of keywords; below is the rough approach:
${iTotalRows} =    Get Row Count    Sheet1    # etc., from Excel
${item1} =    Get Table Cell    //table[@class="xx"]    2    1
${item1} =    Get Table Cell    //table[@class="xx"]    2    2    # etc.
Lists Should Be Equal    ${x}    ${y}
Thank you in advance
I don't think there is a prepared keyword for this (e.g. in Collections library). If I'm wrong and I'm reinventing the wheel, please let me know, I can edit or delete my answer.
I'd create a custom keyword in Python and import it as a library into RF. This could be easily done in Python (one line in fact), so it doesn't even take much time or effort to create it.
Libraries/ListUtils.py:
def substitute_values_in_list(list, value_to_substitute, substitute_to):
    return [substitute_to if ele == value_to_substitute else ele for ele in list]
Then in a test or in keywords:
*** Settings ***
Library    ../Libraries/ListUtils.py

*** Test Cases ***
Empty List Value
    ${list}=    Create List    1    2    ${None}
    Log To Console    ${list}
    ${new_list}=    Substitute Values In List    ${list}    ${None}    ${Empty}
    Log To Console    ${new_list}
The first console output will be:
['1', '2', None]
and the second one with substituted values:
['1', '2', '']
You can also parametrize the custom keyword Substitute Values In List the other way around, for example to substitute None for empty strings instead, or something like that.
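If you want to sanity-check the helper outside Robot Framework first, a quick sketch (assuming the Libraries folder is on your Python path) is:

from ListUtils import substitute_values_in_list

excel_list = [1, 'X', 'Y', 200, 1999, 'Z', 'W', 4, 'V', None, 2, 1100]
table_list = [1, 'X', 'Y', 200, 1999, 'Z', 'W', 4, 'V', '', 2, 1100]

# replace None with '' so the two lists compare equal
print(substitute_values_in_list(excel_list, None, '') == table_list)  # True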

How to get the value of the cell from np.where() instead of True/False

Hi, I am trying to get my code to print the value of the cell in the column "Position", so that, for example, if the previous cell in "Position" is Long, it keeps putting Long until it says Short, based on the column "Signal", which returns either Buy, Sell or "None". However, when I do this, what I actually get is True or False, I'm assuming based on the "Long" or "Short", but I am new to this so I could be mistaken. The code does what I want in that it correctly picks whether we are long or short, but instead of returning True or False I would like it to return "Long" or "Short", i.e. the value in the cell above (talking about when I convert it to a csv here).
df["Position"] = np.where(
    df['Signal'].ne("None"),
    np.where(df['Signal'].eq("Buy"), "Long", "Short"),
    np.where(df["Position"].shift(1).ne("None"), df["Position"].shift(1), "None")
)
np.where(), when called with just a condition, returns the indices where the condition is true. You can use these indices to get the values at those positions. Here is a simple example:
import numpy as np
a = np.array([1, 2, 3, 4, 0])
mask = np.where(a > 1)
values = a[mask]
>>array([2, 3, 4])
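With three arguments, as in the original code, np.where acts as a vectorized if/else and returns values rather than indices or booleans. A small sketch of that form (the Signal values here are just made up for illustration):

import numpy as np
import pandas as pd

signal = pd.Series(["Buy", "Sell", "Buy"])
# vectorized if/else: pick "Long" where the signal is "Buy", otherwise "Short"
position = np.where(signal.eq("Buy"), "Long", "Short")
print(position)   # ['Long' 'Short' 'Long']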

pandas: change the previous cell value of a column based on conditions in another column

I have a Pandas dataset that looks like this:
[screenshot: dataset of words and their features]
I would like to replace the "x" in the "Gender" column under the condition that if a word from a list of words like "Mädchen" is in the "Words" column, "Neutral" should be put in the "Gender" column in the previous word's row (which is a number).
So, for example, this:
Gender   Words
x        10.
x        Mädchen
Should become:
Gender   Words
Neutral  10.
x        Mädchen
I have already tried np.where like this:
Food2_case["Gender"] = np.where(Food2_case.Words.isin(["Mädchen"]), (dropped_data.Words.str.contains('\d', regex=True) == 'A'), "x")
But I got this error:
ValueError: operands could not be broadcast together with shapes (8000,) (275988,) ()
Try the following:
for index, row in Food2_case.iterrows():
    if isinstance(row['Words'], str):
        if 'Mädchen' in row['Words']:
            Food2_case['Gender'][index-1] = 'Neutral'
If I understood your question correctly, it should work.
[EDIT]
If you want to check for words other than Mädchen, you can do the following:
words_to_check = ['Mädchen', ...]

for index, row in Food2_case.iterrows():
    if isinstance(row['Words'], str):
        if any(x in row['Words'] for x in words_to_check):
            Food2_case['Gender'][index-1] = 'Neutral'
import pandas as pd

# Create the dataset
data = pd.DataFrame([[0, 0, 0], [10, "Madchen", 5]]).T
data.columns = ["Gender", "Words"]

# Shift the column of interest so each row sees the word of the row below it
data.loc[:, "iswordin"] = data.Words.shift(-1)

# Do what you want to do
data.loc[data.iswordin.isin(["Madchen", "Girl", "boy", "..."]), "Gender"] = "Neutral"

# Now you can drop the "iswordin" column, which is no longer useful
data = data.drop(columns=["iswordin"])
