How to convert negations and remove single words with the same repetitive letter - python-3.x

I have a data frame with a column of text data. I want to remove words that mean nothing and convert negations like "isn't" to "is not" in the text data, because when I remove the punctuation "isn't" becomes "isn t", and when I then remove words shorter than 2 letters the "t" is deleted completely. So, I want to do the following 3 tasks:
1) convert negations like "isn't" to "is not"
2) remove words that mean nothing
3) remove words shorter than 2 letters
For example, the df column looks similar to this:
user_id    text data column
1          it's the coldest day
2          they aren't going
3          aa
4          how are you jkhf
5          v
6          ps
7          jkhf
The output should be:
user_id    text data column
1          it is the coldest day
2          they are not going
3
4          how are you
5
6
7
How to implement this?

def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    w = str(w)  # caller should have provided a single word as input
    return len(w) > 1 and all(c == w[0] for c in w[1:])
Feed all words in the corpus to that function to accumulate a list of repetitive words, then add such words to your list of stop words.
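A minimal sketch of that accumulation step (the texts list here is an illustrative stand-in for your own corpus):

texts = ["it's the coldest day", "aa", "v ps"]  # illustrative sample corpus

stop_words = set()
for text in texts:
    for word in text.split():
        if is_repetitive(word):
            stop_words.add(word)

print(stop_words)  # {'aa'}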

1) Use spaCy's or NLTK's lemmatization tools to convert the strings (though they do other things as well, like converting plurals to singular, so you may end up needing to write your own code for this).
2) Use stop words from NLTK or spaCy to remove the obvious stop words. Alternatively, feed them your own list of stop words (their default stop words are things like is, a, the).
3) Use a basic filter: if len < 2, remove the word. A combined sketch of all three steps is shown below.
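Here is a minimal pandas sketch of the three steps; the column name text, the contraction map, and the stop list are illustrative assumptions, not a fixed recipe (in practice you would plug in NLTK's or spaCy's stop words plus the repetitive words collected above):

import pandas as pd

negations = {"isn't": "is not", "aren't": "are not", "it's": "it is"}  # illustrative map
stop_words = {"aa", "jkhf"}  # e.g. NLTK's list plus your repetitive words

def clean(text):
    words = [negations.get(w.lower(), w) for w in str(text).split()]  # 1) expand negations
    words = " ".join(words).split()  # re-split so "is not" becomes two words
    words = [w for w in words if w.lower() not in stop_words]  # 2) drop meaningless words
    words = [w for w in words if len(w) >= 2]  # 3) drop words shorter than 2 letters
    return " ".join(words)

df = pd.DataFrame({"user_id": [1, 2, 3],
                   "text": ["it's the coldest day", "they aren't going", "aa"]})
df["text"] = df["text"].apply(clean)
# df["text"] is now: "it is the coldest day", "they are not going", ""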

Related

Deleting a word in column based on frequencies

I have an NLP project where I would like to remove the words that appear only once in the keywords. That is to say, for each row I have a list of keywords and their frequencies.
I would like something like:
if the frequency of a word across the whole column ['keywords'] == 1, then replace it with "".
I cannot test word by word, so my idea was to create a list with all the words, remove the duplicates, then for each word in this list compute count.sum and then delete. But I have no idea how to do that.
Any ideas? Thanks!
Here's what my data looks like:
sample.head(4)
   ID  keywords                                            age  sex
0   1  fibre:16;quoi:1;dangers:1;combien:1;hightech:1...   62   F
1   2  restaurant:1;marrakech.shtml:1                       35   M
2   3  payer:1;faq:1;taxe:1;habitation:1;macron:1;qui...    45   F
3   4  rigaud:3;laurent:3;photo:11;profile:8;photopro...    46   F
To add on to what @jpl mentioned about scikit-learn's CountVectorizer: there exists an option min_df that does exactly what you want, provided you can get your data into the right format. Here's an example:
from sklearn.feature_extraction.text import CountVectorizer
# assuming you want the token to appear in >= 2 documents
vectorizer = CountVectorizer(min_df=2)
documents = ['hello there', 'hello']
X = vectorizer.fit_transform(documents)
This gives you:
# Notice the dimensions of our array – 2 documents by 1 token
>>> X.shape
(2, 1)
# Here is a count of how many times the tokens meeting the inclusion
# criteria are observed in each document (as you see, "hello" is seen once
# in each document)
>>> X.toarray()
array([[1],
[1]])
# this is the entire vocabulary our vectorizer knows – see how "there" is excluded?
>>> vectorizer.vocabulary_
{'hello': 0}
Your representation makes that difficult.
You should build a dataframe where each column is a word; then you can easily use pandas operations like sum to do whatever you want.
However, this will lead to a very sparse dataframe, which is rarely desirable.
Many libraries, e.g. scikit-learn's CountVectorizer, allow you to do what you want efficiently.
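For completeness, here is a hedged sketch of the original ask without CountVectorizer, staying with the "word:count;word:count" cell format from the sample (whether "frequency == 1" means the stored counts summed over the column or the number of rows is ambiguous; this sketch sums the stored counts):

import pandas as pd
from collections import Counter

df = pd.DataFrame({"keywords": ["fibre:16;quoi:1", "faq:2", "faq:1"]})  # toy sample

def parse(cell):
    return {w: int(c) for w, c in (pair.split(":") for pair in cell.split(";"))}

totals = Counter()  # total frequency of each word across the whole column
for cell in df["keywords"]:
    totals.update(parse(cell))

def prune(cell):  # drop words whose column-wide total is 1
    kept = {w: c for w, c in parse(cell).items() if totals[w] > 1}
    return ";".join(f"{w}:{c}" for w, c in kept.items())

df["keywords"] = df["keywords"].apply(prune)
# -> ["fibre:16", "faq:2", "faq:1"]  ("quoi" appeared only once overall)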

OpenRefine: Split multi-valued cells by token/word count?

I have a large corpus of text data that I'm pre-processing in OpenRefine for document classification with MALLET.
Some of the cells are long (>150,000 characters) and I'm trying to split them into segments of <1,000 words/tokens.
I'm able to split long cells into 6,000-character chunks using "Split multi-valued cells" by field length, which roughly translates to 1,000-word/token chunks, but it splits words across rows, so I'm losing some of my data.
Is there a function I could use to split long cells at the first whitespace (" ") after every 6,000th character, or, even better, to split every 1,000 words?
Here is my simple solution:
Go to Edit cells -> Transform and enter
value.replace(/((\s+\S+?){999})\s+/,"$1###")
This will replace every 1000th whitespace (consecutive whitespaces are counted as one and replaced if they appear at the split border) with ### (you can choose any token you like, as long as it doesn't appear in the original text).
Then go to Edit cells -> Split multi-valued cells and split using the token ### as separator.
The simplest way is probably to split your text by spaces, to insert a very rare character (or group of characters) after each group of 1000 elements, to reconcatenate, then to use "Split multivalued cells" with your weird character(s).
You can do that in GREL, but it will be much clearer if you choose "Python/Jython" as the script language.
So: Edit cells -> Transform -> Python/Jython:
my_list = value.split(' ')
n = 1000
i = n
while i < len(my_list):
    my_list.insert(i, '|||')
    i += n + 1
return " ".join(my_list)
Here is a more compact version:
text = value.split(' ')
n = 1000
return "|||".join([' '.join(text[i:i+n]) for i in range(0,len(text),n)])
You can then split using ||| as separator.
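To sanity-check that chunking logic outside OpenRefine, here is the same expression as plain Python with a small n:

value = "a b c d e f g"
text = value.split(' ')
n = 3
print("|||".join(' '.join(text[i:i+n]) for i in range(0, len(text), n)))
# -> a b c|||d e f|||g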
If you prefer to split by characters instead of words, it looks like you can do that in two lines with textwrap:
import textwrap
return "|||".join(textwrap.wrap(value, 6000))

TCL, extract 2 integers from string into list?

I have 2 strings formatted like this:
(1234, 4567)
And I have a list
points {0 1 2 4}
I would like to extract the 2 integers from the first string and replace the first two integers in the list; after that, extract the two integers from the 2nd string and replace the 3rd and 4th integers in the list, so that at the end I have a list of 4 integers taken from the two strings.
So far I have tried all kinds of things but always end up with errors or brackets in the list, which I do not want. I feel I am missing an easy way to do this.
With the first set of values, you can parse with scan or regexp; in this case, I think scan looks better:
set input "(1234, 5678)"
scan $input "(%d,%d)" a b
To update a Tcl list (formally, one in a variable), you use lset; you can give a sequence of (zero-based) indices to it to navigate into the exact place in the list where you want to update:
set workingArea "points {0 1 2 4}"
lset workingArea 1 2 $a
lset workingArea 1 3 $b
puts $workingArea
# prints: points {0 1 1234 5678}

How to count a specific word separated by paragraphs?

So I want to be able to count the number of times a certain sequence such as "AGCT" appears in a document full of letters. However, I don't just want the total amount in the document; I want how many times it shows up in each segment separated by ">".
So for example if the document contained: asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>...
It would tell me:
2
1
1
since the sequence "AGCT" appears twice before the first ">", once after the next one, once more after the third one, and so on.
I do not know how to do this and any help would be appreciated.
You can use a combination of string methods and a Python list comprehension like this:
split your text into paragraphs, and for each paragraph count the occurrences of the wanted substring. It is actually more concise in Python than in English:
>>> mytext = "asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>"
>>> count = [para.count("agct") for para in mytext.split(">")]
>>> count
[2, 1, 1, 0]
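Note that the text mixes "agct" and "AGCT", and the trailing ">" produces a final empty paragraph (the 0 above). If you want a case-insensitive count and no empty segments, a small variation handles both:
>>> counts = [para.lower().count("agct") for para in mytext.split(">") if para]
>>> counts
[2, 1, 2]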

MATLAB - How to get number of occurrences of each word in a string?

Suppose we want to check how many times any word occurs in a particular text file through MATLAB. How do we do that?
Now, since I'm checking whether the word is a SPAM word or a HAM word (doing content filtering), I'm looking to find the probability of the word being spam, i.e. n(spam occurrences)/n(total occurrences) would give the probability.
Hints?
As an example, consider a text file called text.txt containing the following text:
These two sentences, like all sentences, contain words. Some of those words are repeated; but not all.
A possible approach is as follows:
s = importdata('text.txt'); %// import text. Gives a 1x1 cell containing a string
words = regexp([lower(s{1}) '.'], '[\s\.,;:-''"?!/()]+', 'split'); %// split
%// into words. Make sure there's always at least a final punctuation sign.
%// You may want to extend the list of separators (between the brackets)
%// I have made this case insensitive using "lower"
words = words(1:end-1); %// remove last "word", which will always be empty
[uniqueWords, ~, intLabels] = unique(words); %// this is the important part:
%// get unique words and an integer label for each one
count = histc(intLabels, 1:numel(uniqueWords)); %// occurrences of each label
The result is uniqueWords and count:
uniqueWords =
'all' 'are' 'but' 'contain' 'like' 'not' 'of' 'repeated'
'sentences' 'some' 'these' 'those' 'two' 'words'
count =
2 1 1 1 1 1 1 1 2 1 1 1 1 2
You can use regular expressions to find the number of occurrences of a word; with the 'match' option, regexp returns one cell per occurrence, so numel gives the count.
For example:
txt = fileread( fileName );
matches = regexp( txt, word, 'match' ); %// one cell per occurrence of word
count = numel( matches );
Here word is the string you are searching for.
