Deleting a word in column based on frequencies - python-3.x

I have an NLP project where I would like to remove the words that appear only once in the keywords. That is to say, for each row I have a list of keywords and their frequencies.
I would like something like:
if the frequency of a word across the whole ['keywords'] column == 1, then replace it with "".
I cannot test word by word. So my idea was to build a list of all the words, remove the duplicates, then for each word in that list sum its counts over the column and delete it if the total is 1. But I have no idea how to do that.
Any ideas? Thanks!
Here's what my data looks like:
sample.head(4)
ID keywords age sex
0 1 fibre:16;quoi:1;dangers:1;combien:1;hightech:1... 62 F
1 2 restaurant:1;marrakech.shtml:1 35 M
2 3 payer:1;faq:1;taxe:1;habitation:1;macron:1;qui... 45 F
3 4 rigaud:3;laurent:3;photo:11;profile:8;photopro... 46 F

To add on to what @jpl mentioned about scikit-learn's CountVectorizer: it has an option, min_df, that does exactly what you want, provided you can get your data into the right format. Here's an example:
from sklearn.feature_extraction.text import CountVectorizer
# assuming you want the token to appear in >= 2 documents
vectorizer = CountVectorizer(min_df=2)
documents = ['hello there', 'hello']
X = vectorizer.fit_transform(documents)
This gives you:
# Notice the dimensions of our array – 2 documents by 1 token
>>> X.shape
(2, 1)
# Here is a count of how many times the tokens meeting the inclusion
# criteria are observed in each document (as you see, "hello" is seen once
# in each document)
>>> X.toarray()
array([[1],
       [1]])
# this is the entire vocabulary our vectorizer knows – see how "there" is excluded?
>>> vectorizer.vocabulary_
{'hello': 0}
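To get the keywords column from the question into that shape, one option (my sketch, not part of the original answer) is a custom tokenizer that keeps only the word part of each word:count pair. Note that min_df then counts in how many rows a word appears, which is not quite the same as the total count when a word already has a count greater than 1 within a single row. The keyword_tokenizer helper is hypothetical, and get_feature_names_out assumes a recent scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer

def keyword_tokenizer(cell):
    # "fibre:16;quoi:1" -> ["fibre", "quoi"]  (per-row counts are dropped)
    return [kv.split(':', 1)[0] for kv in cell.split(';') if kv]

vectorizer = CountVectorizer(tokenizer=keyword_tokenizer, lowercase=False, min_df=2)
X = vectorizer.fit_transform(sample['keywords'])
kept_words = vectorizer.get_feature_names_out()  # words that appear in at least 2 rows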

Your representation makes that difficult.
You could build a dataframe where each column is a word; then you can easily use pandas operations like sum to do whatever you want.
However, this will lead to a very sparse dataframe, which is never good.
Many libraries, e.g. scikit-learn's CountVectorizer, let you do what you want efficiently.
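A rough pandas sketch of the asker's own plan (sum each word's counts over the whole column, then drop the words whose total is 1); the two-row sample frame and the drop_rare helper are only illustrative:
import pandas as pd

sample = pd.DataFrame({'keywords': ['fibre:16;quoi:1;dangers:1',
                                    'restaurant:1;marrakech.shtml:1']})

# total count of every word across the whole column
pairs = (sample['keywords'].str.split(';').explode()
                           .str.split(':', n=1, expand=True))
pairs.columns = ['word', 'count']
totals = pairs.assign(count=pairs['count'].astype(int)).groupby('word')['count'].sum()

# words whose total frequency over the column is exactly 1
rare = set(totals[totals == 1].index)

def drop_rare(cell):
    kept = [kv for kv in cell.split(';') if kv.split(':', 1)[0] not in rare]
    return ';'.join(kept)

sample['keywords'] = sample['keywords'].apply(drop_rare)
# row 0 becomes 'fibre:16', row 1 becomes an empty string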

Related

Check if the lines in the dataframe roughly correspond to each other

I have a data frame with names of cities in Morocco and another one with similar names that were not encoded properly. Here's the first one:
>>> df[['new_regiononame']].head()
new_regiononame
0 Grand Casablanca-Settat
1 Fès-Meknès
2 Souss-Massa
3 Laayoune-Sakia El Hamra
4 Fès-Meknès
and here's the other one, whose names I want to change to match the first one's (at least this source reads them correctly):
>>> X_train[['S02Q03A_Region']].head()
S02Q03A_Region
10918 Fès-Meknès
1892 Rabat-Salé-Kénitra
6671 Casablanca-Settat
4837 Marrakech-Safi
6767 Casablanca-Settat
How can I check if the lines in the dataframe roughly correspond to each other and, if so, rename X_train rows by df ones?
So far I only know how to extract which rows in X_train have exact equivalents in df:
X_train['S02Q03A_Region'][X_train['S02Q03A_Region'].isin(df['new_regiononame'].unique())]
The Levenshtein distance could do the job here.
The Levenshtein distance gives you the distance between two words by calculating the number of single-character edits that are needed to convert one word to the other. You could establish a reasonable threshold for comparing one dataframe column to the other, such as:
- Does it start with the same character?
- Are the lengths of the city names at most x characters apart?
- Is the Levenshtein distance less than y?
etc.
The code to calculate Levenshtein distance is:
import nltk
nltk.edit_distance("FÃ¨s-MeknÃ¨s", "Fès-Meknès")
Output:
4
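One possible way to use that here (just a sketch: the threshold of 4, the apply-based replacement, and the closest_region helper are all assumptions to tune):
import nltk

def closest_region(name, candidates, max_distance=4):
    # return the candidate with the smallest edit distance,
    # or the original name if nothing is close enough
    best = min(candidates, key=lambda c: nltk.edit_distance(name, c))
    return best if nltk.edit_distance(name, best) <= max_distance else name

candidates = df['new_regiononame'].unique()
X_train['S02Q03A_Region'] = X_train['S02Q03A_Region'].apply(
    lambda name: closest_region(name, candidates))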

How to check if a word in one csv exist in another column of another csv file

I have 2 csv files: one is dictionary.csv, which contains a list of words, and the other is story.csv. story.csv has many columns, and one of them, news_story, contains a lot of words. I want to check whether the words from dictionary.csv exist in the news_story column. Afterwards I want to write all of the rows in which the news_story column contains words from dictionary.csv to a new csv file called New.csv.
This is the code I have tried so far:
import csv
import pandas as pd

news = pd.read_csv("story.csv")
dictionary = pd.read_csv("dictionary.csv")

pattern = '|'.join(dictionary)
exist = news['news_story'].str.contains(pattern)

for CHECK in exist:
    if not CHECK:
        news['NEWcolumn'] = 'NO'
    else:
        news['NEWcolumn'] = 'YES'

news.to_csv('New.csv')
I kept getting NOs even though there should be some YESes.
story.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
live.com pbandJ 2001 I made a sandwich today
key.com uAndI 1992 A code name of a spy
dictionary.csv
red
tie
lace
books
functional
New.csv
news_url news_title news_date news_story
goog.com functional 2019 This story is about a functional requirement
First, read the file into a Series: pass header=None to read_csv so the first value is not dropped as a header, and squeeze=True so the single column comes back as a Series:
dictionary=pd.read_csv("dictionary.csv", header=None, squeeze=True)
print (dictionary)
0 red
1 tie
2 lace
3 books
4 functional
Name: 0, dtype: object
pattern = '|'.join(dictionary)
# to avoid matching substrings, use word boundaries:
#pattern = '|'.join(r"\b{}\b".format(x) for x in dictionary)
Finally, filter by boolean indexing:
exist = news['news_story'].str.contains(pattern)
news[exist].to_csv('New.csv')
Detail:
print (news[exist])
news_url news_title news_date \
0 goog.com functional 2019
news_story
0 This story is about a functional requirement
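If the YES/NO column from the original attempt is still wanted, a small follow-up (np.where is my suggestion here, not part of the filtering step above):
import numpy as np

news['NEWcolumn'] = np.where(exist, 'YES', 'NO')
news.to_csv('New.csv', index=False)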

How to convert negations and single words with same repetitive letter

I have a data frame that has a column with text data in it. I want to remove words that mean nothing and convert negations like "isn't" to "is not" in the text data, because when I remove the punctuation "isn't" becomes "isn t", and when I then remove words shorter than 2 letters the "t" gets deleted completely. So, I want to do the following 3 tasks:
1) convert negations like "isn't" to "is not"
2) remove words that mean nothing
3) remove words shorter than 2 letters
For example, the df column looks similar to this:
user_id text data column
1 it's the coldest day
2 they aren't going
3 aa
4 how are you jkhf
5 v
6 ps
7 jkhf
The output should be-
user_id text data column
1 it is the coldest day
2 they are not going
3
4 how are you
5
6
7
How to implement this?
def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    w = str(w)  # caller should have provided a single word as input
    return len(w) > 1 and all(c == w[0] for c in w[1:])
Feed all words in the corpus to that function,
to accumulate a list of repetitive words.
Then add such words to your list of stop words.
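For example (a sketch that assumes the question's frame is called df and the column keeps the name shown above):
# gather every word in the corpus and keep the repetitive ones as extra stop words
corpus_words = {w for text in df['text data column'].dropna() for w in str(text).split()}
repetitive_stop_words = {w for w in corpus_words if is_repetitive(w)}
# e.g. {'aa'} for the sample data; merge this into your usual stop word list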
1) Use SpaCy or NLTK's lemmatization tools to convert strings (though they do other things like convert plural to singular as well - so you may end up needing to write your own code to do this).
2) Use stopwords from NLTK or spacy to remove the obvious stop words. Alternatively, feed them your own list of stop words (their default stop words are things like is, a, the).
3) Use a basic filter: if len < 2, remove the row.
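A rough sketch combining 1) and 3), reusing is_repetitive from the other answer; the hand-made contractions map is only an illustration, and step 2 (nonsense words like jkhf) would still need a dictionary or stop-word check:
import re

contractions = {"isn't": "is not", "aren't": "are not", "it's": "it is"}

def clean(text):
    # expand negations before any punctuation stripping happens
    for short, full in contractions.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    # drop single-character tokens and repetitive ones like "aa"
    tokens = [t for t in text.split() if len(t) >= 2 and not is_repetitive(t)]
    return ' '.join(tokens)

df['text data column'] = df['text data column'].astype(str).apply(clean)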

Retrieve first element in a column list and sum over it (e.g. if first element = k, sum over k) in python

Really sorry if this has been answered already; I'm new to Python and might have been searching for the wrong terminology.
I'm working with the US baby names data as in Python for Data Analysis, 2nd ed. Basically I've concatenated the datasets into a df called name_df that looks like:
id name births
1 Aaron 20304
2 Adam 10000
etc.
I'm looking to do a sum over the name rows whose first letter is a K (or any other letter). I'm struggling to get the first-letter part, though - here is what I have so far:
count = 0
letter = ['K']
for n in ['name']:
    if name_df['name'][0] == letter:
        count += 1
    else:
        count += 0
print(count)
Clearly that just retrieves the first element. Do I need to use some sort of slicing technique instead?
Would you like to count the names starting with 'K'?
len([n for n in name_df['name'] if n[0]=='K'])
Or do you want to sum up to get the number of babies?
sum([c for n,c in name_df[['name','births']].values if n[0]=='K'])
Or with more 'pandaish' syntax:
sum(name_df.loc[name_df['name'].str[0]=='K','births'])
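Or, filtering with str.startswith and calling .sum() on the selection (just another spelling of the same thing):
name_df.loc[name_df['name'].str.startswith('K'), 'births'].sum()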

How to count a specific word separated by paragraphs?

So I want to be able to count the number of times a certain sequence, such as "AGCT", appears in a document full of letters. However, I don't just want the total for the whole document; I want how many times it shows up in each part separated by ">".
So for example if the document contained: asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>...
It would tell me:
2
1
1
since the sequence "AGCT" appears twice before the first ">" and once after the next one and once more after the third one and so on.
I do not know how to do this and any help would be appreciated.
You can use a combination of string methods and a Python list comprehension, like this:
Split your text into paragraphs, and for each paragraph count the occurrences of the wanted substring. It is actually more concise in Python than in English:
>>> mytext = "asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>"
>>> count = [para.count("agct") for para in mytext.split(">")]
>>> count
[2, 1, 1, 0]
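Since the third paragraph contains both "AGCT" and "agct", a case-insensitive count (if that is what's wanted) would lowercase each paragraph first:
>>> count = [para.lower().count("agct") for para in mytext.split(">")]
>>> count
[2, 1, 2, 0]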
