Removing specific word in a string in pandas - python-3.x

I'm trying to remove several words in each value of a column but nothing is happening.
stop_words = ["and","lang","naman","the","sa","ko","na",
"yan","n","yang","mo","ung","ang","ako","ng",
"ndi","pag","ba","on","un","Me","at","to",
"is","sia","kaya","I","s","sla","dun","po","b","pro"
]
newdata['Verbatim'] = newdata['Verbatim'].replace(stop_words,'', inplace = True)
I'm trying to generate a word cloud from the result of the replacement, but I am still getting the same words (which don't mean anything but have a lot of volume).

You can use word boundaries \b with the values joined by | for a regex OR:
pat = '|'.join(r"\b{}\b".format(x) for x in stop_words)
newdata['Verbatim'] = newdata['Verbatim'].str.replace(pat, '', regex=True)  # regex=True so the pattern is treated as a regex (newer pandas defaults to literal replacement)
Another solution is to split the values, remove the stopwords, and join back with a space in a lambda function:
stop_words = set(stop_words)
f = lambda x: ' '.join(w for w in x.split() if not w in stop_words)
newdata['Verbatim'] = newdata['Verbatim'].apply(f)
Sample:
stop_words = ["and","lang","naman","the","sa","ko","na",
"yan","n","yang","mo","ung","ang","ako","ng",
"ndi","pag","ba","on","un","Me","at","to",
"is","sia","kaya","I","s","sla","dun","po","b","pro"
]
newdata = pd.DataFrame({'Verbatim':['I love my lang','the boss come to me']})
pat = '|'.join(r"\b{}\b".format(x) for x in stop_words)
newdata['Verbatim1'] = newdata['Verbatim'].str.replace(pat, '', regex=True)
stop_words = set(stop_words)
f = lambda x: ' '.join(w for w in x.split() if not w in stop_words)
newdata['Verbatim2'] = newdata['Verbatim'].apply(f)
print (newdata)
Verbatim Verbatim1 Verbatim2
0 I love my lang love my love my
1 the boss come to me boss come me boss come me

Related

Filter user names from a string

I'm trying to filter the usernames that are being referenced in a tweet like in the following example:
Example:
tw = 'TR #uname1, #uname2, #uname3, text1, text2, #uname4, text3, #uname5, RT #uname6'
the desired output will be:
rt_unames = ['uname1', 'uname6']
mt_unames = ['uname2', 'uname3', 'uname4', 'uname5']
I can do something like a for loop that goes over the string like the naïve solution below:
Naïve Solution:
def find_end_idx(tw_part):
    end_space_idx = len(tw)
    try:
        end_space_idx = tw[start_idx:].index(' ')
    except Exception as e:
        pass
    end_dot_idx = len(tw)
    try:
        end_dot_idx = tw[start_idx:].index('.')
    except Exception as e:
        pass
    end_semi_idx = len(tw)
    try:
        end_semi_idx = tw[start_idx:].index(',')
    except Exception as e:
        pass
    return min(end_space_idx, end_dot_idx, end_semi_idx)

tw = 'RT #uname1, #uname2, #uname3, text1, text2, #uname4, text3, #uname5, RT #uname6'
acc = ''
rt_unames = []
mt_unames = []
for i, c in enumerate(tw):
    acc += c
    if acc[::-1][:2][::-1] == 'RT':
        start_idx = i + 2
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx + end_idx]
        if uname not in mt_unames:
            rt_unames.append(uname)
        acc = ''
    elif acc[::-1][:1] == '#':
        start_idx = i
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx + end_idx]
        if uname not in rt_unames:
            mt_unames.append(uname)
        acc = ''

rt_unames, mt_unames
which outputs:
(['#uname1', '#uname6'], ['#uname2', '#uname3', '#uname4', '#uname5'])
Question:
As I need to apply it to every tweet in a pandas.DataFrame, I'm looking for a more elegant and fast solution to get this outcome.
I'd appreciate any suggestions.
Let's try re.findall with a regex pattern: the lookbehind (?<=TR |RT ) keeps only the names preceded by "TR " or "RT ", and the negative lookbehind (?<!TR |RT ) keeps the rest:
import re
rt_unames = re.findall(r'(?<=TR |RT )#([^,]+)', tw)
mt_unames = re.findall(r'(?<!TR |RT )#([^,]+)', tw)
In a similar way, you can use the str.findall method on the column of the dataframe:
df['rt_unames'] = df['tweet'].str.findall(r'(?<=TR |RT )#([^,]+)')
df['mt_unames'] = df['tweet'].str.findall(r'(?<!TR |RT )#([^,]+)')
Result:
['uname1', 'uname6']
['uname2', 'uname3', 'uname4', 'uname5']
If the format of input string is always the same, I would do it like this:
def parse_tags(str_tags):
    rts = []
    others = []
    for tag in [tag.strip() for tag in str_tags.split(',')]:
        if tag.startswith('RT'):
            rts.append(tag[3:])
        elif tag.startswith('#'):
            others.append(tag)
    return rts, others
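For example, calling it on the 'RT'-prefixed sample tweet from the question gives the following (a quick sanity check, not part of the original answer):
tw = 'RT #uname1, #uname2, #uname3, text1, text2, #uname4, text3, #uname5, RT #uname6'
rts, others = parse_tags(tw)
# rts    -> ['#uname1', '#uname6']
# others -> ['#uname2', '#uname3', '#uname4', '#uname5']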
An alternative approach using filters and list comprehension.
import re

def your_func_name(tw):
    tw_list = [x.strip() for x in tw.split(",")]
    rt_unames_raw = filter(lambda x: "#" in x and x.startswith("RT"), tw_list)
    mt_unames_raw = filter(lambda x: x.startswith("#"), tw_list)
    rt_unames = [re.sub(r"RT|#", "", uname).strip() for uname in rt_unames_raw]
    mt_unames = [re.sub("#", "", uname).strip() for uname in mt_unames_raw]
    return rt_unames, mt_unames
tw = 'RT #uname1, #uname2, #uname3, text1, text2, #uname4, text3, #uname5, RT #uname6'
your_func_name(tw=tw)
You can use regex patterns and use the apply function on the tweet column of your dataframe
import pandas as pd
import re
pattern1 = r"(RT\s+#[^\,]+)|(TR\s+#[^\,]+)"
pattern2 = r"#[^\,]+"
df = pd.DataFrame(['TR #uname1, #uname2, #uname3, text1, text2, #uname4, text3, #uname5, RT #uname6'], columns=['Tweet'])
df['group1'] = df.Tweet.apply(lambda x: re.findall(pattern1, x))
df['group2'] = df.Tweet.apply(lambda x: re.findall(pattern2, x))
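Note that because pattern1 contains two capture groups, re.findall returns a tuple per match with the unmatched group left empty, while pattern2 returns plain strings. Roughly, for the sample tweet above (my own illustration, not part of the original answer):
# group1: [('', 'TR #uname1'), ('RT #uname6', '')]
# group2: ['#uname1', '#uname2', '#uname3', '#uname4', '#uname5', '#uname6']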
This is my second time, so I will try to make it as easy as possible.
tw = 'TR #uname1, #uname2, #uname3, text1, text2, #uname4, text3, #uname5, RT #uname6'
res = tw.replace(", ", " ").split()
final = []
k = "#"
for e in res:
    if e[0].lower() == k.lower():  # call lower(); comparing the method objects themselves is always False
        final.append(e)
stringe = str(final).replace(",", "")
stringe = stringe.replace("[", "")
stringe = stringe.replace("]", "")
stringe =stringe.replace("'", "")
print("Result is :", str(stringe))
From what I can see, you already know Python, so this example should only take you a little while to follow.
Here, I use the replace function to replace every ", " with a single space, and use the split function, which separates the words by spaces. The result is then stored in res.
In the next few lines, I use the replace function to replace all unwanted strings like "[", "]" and "'" with a blank.
Then, I simply print the result.
Hit me up at #Vishma Pratim Das on twitter if you don't understand something

How to simplify text comparison for big data-set where text meaning is same but not exact - deduplicate text data

I have a text data set (different menu items like chocolate, cake, coke, etc.) of around 1.8 million records which belong to 6 different categories (A, B, C, D, E, F). One of the categories has around 700k records. Many of the menu items appear in categories to which they don't belong; for example, cake belongs to category 'A' but is found in categories 'B' and 'C' as well.
I want to identify those misclassified items and report them, but the challenge is that the item name is not always spelled correctly because it is human-typed text. For example, chocolate might be entered as hot chclt, sweet choklate, chocolat, etc. There can also be items like chocolate cake ;)
To handle this, I tried a simple method using cosine similarity to compare category-wise and identify those anomalies, but it takes a lot of time since I am comparing each item against 1.8 million records (sample code below). Can anyone suggest a better way to deal with this problem?
# Function
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def cos_similarity(a, b):
    X = a
    Y = b
    # tokenization
    X_list = word_tokenize(X)
    Y_list = word_tokenize(Y)
    # sw contains the list of stopwords
    sw = stopwords.words('english')
    l1 = []; l2 = []
    # remove stop words from the string
    X_set = {w for w in X_list if not w in sw}
    Y_set = {w for w in Y_list if not w in sw}
    # form a set containing keywords of both strings
    rvector = X_set.union(Y_set)
    for w in rvector:
        if w in X_set: l1.append(1)  # create a vector
        else: l1.append(0)
        if w in Y_set: l2.append(1)
        else: l2.append(0)
    c = 0
    # cosine formula
    for i in range(len(rvector)):
        c += l1[i] * l2[i]
    if float((sum(l1) * sum(l2)) ** 0.5) > 0:
        cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
    else:
        cosine = 0
    return cosine
# Base code
cos_sim_list = []
for i in category_B.index:
    ln_cosdegree = 0
    ln_degsem = []
    for j in category_A.index:
        ln_j = str(category_A['item_name'][j])
        ln_i = str(category_B['item_name'][i])
        degreeOfSimilarity = cos_similarity(ln_j, ln_i)
        if degreeOfSimilarity > 0.5:
            cos_sim_list.append([ln_j, ln_i, degreeOfSimilarity])
Assume the text is already cleaned.
I used KNeighbors and cosine similarity to solve this case. Though I am running the code multiple times to compare category by category, it is still effective because of the small number of categories. Please suggest if any better solution is available.
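The TfidfVectorizer call below passes analyzer=ngrams, a helper that is not shown in the answer. A minimal sketch of such a character n-gram analyzer, assuming 3-character grams (the exact implementation in the original is unknown):
def ngrams(string, n=3):
    # split a string into overlapping character n-grams, e.g. 'cake' -> ['cak', 'ake']
    string = str(string)
    return [string[i:i + n] for i in range(len(string) - n + 1)]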
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

cat_A_clean = category_A['item_name'].unique()
print('Vectorizing the data - this could take a few minutes for large datasets...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(cat_A_clean)
print('Vectorizing completed...')

nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
unique_B = set(category_B['item_name'].values)

def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

import time
t1 = time.time()
print('getting nearest n...')
distances, indices = getNearestN(unique_B)
t = time.time() - t1
print("COMPLETED IN:", t)
unique_B = list(unique_B)
print('finding matches...')
matches = []
for i, j in enumerate(indices):
    # cat_A_clean is a NumPy array of unique names and j is a 1-element index array,
    # so index it directly rather than via ['item_name']
    temp = [round(distances[i][0], 2), cat_A_clean[j[0]], unique_B[i]]
    matches.append(temp)
print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'ITEM_A', 'ITEM_B'])
print('Done')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_string(text):
    text = str(text)
    text = text.lower()
    return text

def cosine_sim_vectors(vec1, vec2):
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]

def cos_similarity(sentences):
    cleaned = list(map(clean_string, sentences))
    print(cleaned)
    vectorizer = CountVectorizer().fit_transform(cleaned)
    vectors = vectorizer.toarray()
    print(vectors)
    return cosine_sim_vectors(vectors[0], vectors[1])

cos_sim_list = []
for ind in matches.index:
    a = matches['Match confidence (lower is better)'][ind]
    b = matches['ITEM_A'][ind]
    c = matches['ITEM_B'][ind]
    degreeOfSimilarity = cos_similarity([b, c])
    cos_sim_list.append([a, b, c, degreeOfSimilarity])

How can I set up a new column in python with a value based on the return of a function?

I am doing some text mining in python and want to set up a new column with the value 1 if the return of my search function is true and 0 if it's false.
I have tried various if statements, but cannot get anything to work.
A simplified version of what I'm doing is below:
import pandas as pd
import nltk
nltk.download('punkt')
df = pd.DataFrame(
    {
        'student number': [1, 2, 3, 4, 5],
        'answer': ['Yes, she is correct.', 'Yes', 'no', 'north east', 'No its North East']
        # I know there's an apostrophe missing
    }
)
print(df)
# change all text to lower case
df['answer'] = df['answer'].str.lower()
# split the answer into individual words
df['text'] = df['answer'].apply(nltk.word_tokenize)
# Check if given words appear together in a list of sentence
def check(sentence, words):
    res = []
    for substring in sentence:
        k = [w for w in words if w in substring]
        if len(k) == len(words):
            res.append(substring)
    return res
# Driver code
sentence = df['text']
words = ['no','north','east']
print(check(sentence, words))
This is what you want I think:
df['New'] = df['answer'].isin(words)*1
This one works for me:
for i in range(0, len(df)):
    if set(words) <= set(df.text[i]):
        df.loc[i, 'NEW'] = 1   # .loc creates and fills the NEW column without chained indexing
    else:
        df.loc[i, 'NEW'] = 0
You don't need the function if you use this method.
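If you prefer to avoid the explicit loop, the same check can be written with apply (a sketch, assuming df['text'] already holds the token lists from word_tokenize):
df['NEW'] = df['text'].apply(lambda tokens: 1 if set(words) <= set(tokens) else 0)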

how to update contents of file in python

def update():
    global mylist
    i = j = 0
    mylist[:] = []
    key = input("enter student's tp")
    myf = open("data.txt", "r+")
    ml = myf.readlines()
    #print(ml[1])
    for line in ml:
        words = line.split()
        mylist.append(words)
    print(mylist)
    l = len(mylist)
    w = len(words)
    print(w)
    print(l)
    for i in range(l):
        for j in range(w):
            print(mylist[i][j])
##            if(key == mylist[i][j]):
##                print("found at ", i, j)
##                del mylist[i][j]
##                mylist[i].insert((j+1), "xxx")
Below is the error:
print(mylist[i][j])
IndexError: list index out of range
I am trying to update the contents of a file. I read the file into a list of lines, and each line is then saved as another list of words, so "mylist" is a 2D list, but it is giving me an IndexError.
Your w variable is the length of the last line's word list. Other lines could be shorter, so mylist[i][j] runs off the end.
A better idiom is to use a for loop to iterate over a list.
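For instance, a sketch of the nested iteration without any index arithmetic (not from the original answer):
for row in mylist:
    for word in row:
        print(word)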
But there is an even better way.
It appears you want to replace a "tp" (whatever that is) with the string xxx everywhere. A quicker way to do that would be to use regular expressions.
import re

with open('data.txt') as myf:
    myd = myf.read()
newd = re.sub(key, 'xxx', myd)
with open('newdata.txt', 'w') as newf:
    newf.write(newd)

K-means clustering by using Apache Spark

I would like to do "text clustering" using k-means and Spark on a massive dataset. As you know, before running k-means, I have to apply pre-processing steps such as TF-IDF and NLTK-based cleaning to my big dataset. The following is my code in Python:
if __name__ == '__main__':
    # Cluster a bunch of text documents.
    import re
    import sys
    import csv
    import string
    import nltk
    from collections import defaultdict
    from nltk.stem.porter import PorterStemmer

    k = 6
    vocab = {}
    xs = []
    ns = []
    cat = []
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex to remove special characters
    remove_num = re.compile(r'[\d]+')
    #nltk.download()
    stop_words = nltk.corpus.stopwords.words('english')

    for a in ns:
        x = defaultdict(float)
        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
        a3 = remove_num.sub("", a2)  # Remove numbers
        # Remove stop words
        words = a3.split()
        filter_stop_words = [w for w in words if not w in stop_words]
        stemed = [PorterStemmer().stem_word(w) for w in filter_stop_words]  # stem_word is the older NLTK API; newer versions use stem()
        ws = sorted(stemed)
        #ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(x.items())
Can anyone explain how I can do the pre-processing step in Spark before running k-means?
This is in response to the comment by user3789843: yes, each stop word on a separate line, without quotes.
Sorry, I do not have permission to comment.
