How to count a specific word separated by paragraphs? - python-3.x

So I want to be able to count the number of times a certain sequence such as "AGCT" appears in a document full of letters. However, I don't just want the total for the whole document; I want a separate count for each section delimited by ">".
So for example if the document contained: asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>...
It would tell me:
2
1
1
since the sequence "AGCT" appears twice before the first ">" and once after the next one and once more after the third one and so on.
I do not know how to do this and any help would be appreciated.

You can use a combination of string methods and a Python list comprehension like this:
Split your text into paragraphs, and for each paragraph count the occurrences of the wanted substring. It is actually more concise in Python than in English:
>>> mytext = "asdflkdafagctalkjsdjagctlkdjf>asdlfkjaagct>adjkhfhAGCTlksdjfagct>"
>>> count = [para.count("agct") for para in mytext.split(">")]
>>> count
[2, 1, 1, 0]

Related

Merge 2 lists in Python

List 1 = ['_','_','_','a','_']
List 2 = ['d','_','_','_','_']
I'd like to merge two lists of identical length where:
Alphabets in List 2 must replace special characters in List 1 but
Special characters in List 2 must not replace alphabets in List 1.
The merged list would look like this:
Merged List = ['d','_','_','a','_']
Any tips on the fastest way to accomplish this would be much appreciated!
You can use enumerate and list comprehension:
mergedlist = [c if c.isalnum() else list2[i] for i, c in enumerate(list1)]
NB: I have used isalnum to distinguish alphanumeric characters from "special" ones. Depending on your definition of what makes a character special, you may need a different predicate.
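Put together with the example lists from the question, the comprehension looks like this:

```python
list1 = ['_', '_', '_', 'a', '_']
list2 = ['d', '_', '_', '_', '_']

# Keep alphanumeric characters from list1; otherwise take the character
# at the same index from list2.
mergedlist = [c if c.isalnum() else list2[i] for i, c in enumerate(list1)]
print(mergedlist)  # ['d', '_', '_', 'a', '_']
```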

Why is count() method giving strange answer?

s = "ANANAS"
print(s.count("ANA"))
print(s.count("AN"))
print(s.count("A"))
"ANA" occurs two times in "ANANAS" but Python prints 1, whereas
"AN" occurs two times and Python prints 2, and "A" occurs three times and Python prints 3. Why this strange behaviour?
Straight from the documentation:
str.count(sub[, start[, end]])
Return the number of non-overlapping
occurrences of substring sub in the range [start, end]. Optional
arguments start and end are interpreted as in slice notation.
The two occurrences of "ANA" in "ANANAS" overlap, hence s.count("ANA") only returns 1.
Put differently, "ANA" would only be counted twice in a string like "testANAANAAN", which contains two full, non-overlapping occurrences. Once a match is found, the search resumes after it, so the characters already consumed by the first "ANA" cannot contribute to a second match.
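If overlapping matches should be counted, a small helper does the job — a sketch that advances the search start by one after each match (via str.find) instead of skipping past the whole match:

```python
def count_overlapping(s, sub):
    """Count occurrences of sub in s, allowing overlaps, by restarting
    the search one character after each match instead of after it."""
    count = 0
    start = s.find(sub)
    while start != -1:
        count += 1
        start = s.find(sub, start + 1)
    return count

s = "ANANAS"
print(s.count("ANA"))               # 1 (non-overlapping)
print(count_overlapping(s, "ANA"))  # 2 (overlapping matches counted)
```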

How to convert negations and single words with same repetitive letter

I have a data frame with a column of text data. I want to remove words that mean nothing and convert negations like "isn't" to "is not". The problem is that if I simply strip punctuation, "isn't" becomes "isn t", and when I later remove words shorter than two letters, the stray "t" disappears entirely. So I want to do the following three tasks:
1) convert negations like "isn't" to "is not"
2) remove words that mean nothing
3) remove less than length 2 letters
For eg, the df column looks similar to this-
user_id text data column
1 it's the coldest day
2 they aren't going
3 aa
4 how are you jkhf
5 v
6 ps
7 jkhf
The output should be-
user_id text data column
1 it is the coldest day
2 they are not going
3
4 how are you
5
6
7
How to implement this?
def is_repetitive(w):
    """Predicate, true for words like jj or aaaaa."""
    w = str(w)  # caller should have provided a single word as input
    return len(w) > 1 and all(c == w[0] for c in w[1:])
Feed all words in the corpus to that function,
to accumulate a list of repetitive words.
Then add such words to your list of stop words.
1) Use SpaCy or NLTK's lemmatization tools to convert strings (though they do other things like convert plural to singular as well - so you may end up needing to write your own code to do this).
2) Use stopwords from NLTK or spacy to remove the obvious stop words. Alternatively, feed them your own list of stop words (their default stop words are things like is, a, the).
3) Use a basic filter: if len < 2, remove the word (leaving the row empty when nothing remains)

Retrieve first element in a column list and sum over it (e.g. if first element = k, sum over k) in python

Really sorry if this has been answered already; I'm new to Python and might have been searching with the wrong terminology.
I'm working with the US baby name data as in Python for Data Analysis, 2nd ed. Basically I've concatenated the datasets into a df called name_df that looks like
id name births
1 Aaron 20304
2 Adam 10000
etc.
I'm looking to sum over the first letter of each name element if it is a K (or any other letter). I'm struggling to get the first element part though - here is what I have so far:
count = 0
letter = ['K']
for n in ['name']:
    if name_df['name'][0] == letter:
        count += 1
    else:
        count += 0
print(count)
Clearly that just checks the first element. Do I need to use some sort of slicing technique instead?
Would you like to count the distinct names starting with 'K'?
len([n for n in name_df['name'] if n[0]=='K'])
Or do you want to sum up to get the number of babies?
sum([c for n,c in name_df[['name','births']].values if n[0]=='K'])
Or with more 'pandaish' syntax:
sum(name_df.loc[name_df['name'].str[0]=='K','births'])
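Both variants can be checked on a tiny made-up frame matching the question's shape (the names and birth counts here are invented for illustration):

```python
import pandas as pd

# Made-up sample data in the same shape as name_df from the question.
name_df = pd.DataFrame({"name": ["Aaron", "Adam", "Kevin", "Kyle"],
                        "births": [20304, 10000, 5000, 3000]})

# Number of distinct names starting with 'K'
n_names = (name_df["name"].str[0] == "K").sum()

# Total births for names starting with 'K'
n_births = name_df.loc[name_df["name"].str[0] == "K", "births"].sum()

print(n_names, n_births)  # 2 8000
```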

MATLAB - How to get number of occurences of each word in a string?

Suppose we want to check how many times each word occurs in a particular text file using MATLAB. How do we do that?
Since I'm checking whether each word is a SPAM word or a HAM word (doing content filtering), I'm looking for the probability that a word is spam, i.e. n(spam occurrences) / n(total occurrences).
Hints?
As an example, consider a text file called text.txt containing the following text:
These two sentences, like all sentences, contain words. Some of those words are repeated; but not all.
A possible approach is as follows:
s = importdata('text.txt'); %// import text. Gives a 1x1 cell containing a string
words = regexp([lower(s{1}) '.'], '[\s\.,;:-''"?!/()]+', 'split'); %// split
%// into words. Make sure there's always at least a final punctuation sign.
%// You may want to extend the list of separators (between the brackets)
%// I have made this case insensitive using "lower"
words = words(1:end-1); %// remove last "word", which will always be empty
[uniqueWords, ~, intLabels] = unique(words); %// this is the important part:
%// get unique words and an integer label for each one
count = histc(intLabels, 1:numel(uniqueWords)); %// occurrences of each label
The result is uniqueWords and count:
uniqueWords =
'all' 'are' 'but' 'contain' 'like' 'not' 'of' 'repeated'
'sentences' 'some' 'these' 'those' 'two' 'words'
count =
2 1 1 1 1 1 1 1 2 1 1 1 1 2
You can use regular expressions to find the number of occurrences of a single word.
For example:
txt = fileread( fileName );
matches = regexp( txt, word, 'match' );
count = numel( matches );
Here word is the string you are searching for.
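For comparison, the same whole-file word-frequency count is short in Python as well — a sketch using collections.Counter, with a separator set mirroring the MATLAB regexp above:

```python
import re
from collections import Counter

text = ("These two sentences, like all sentences, contain words. "
        "Some of those words are repeated; but not all.")

# Split on runs of whitespace and punctuation, case-insensitively
# (same idea as the MATLAB separator class above).
words = [w for w in re.split(r"[\s.,;:'\"?!/()-]+", text.lower()) if w]
counts = Counter(words)
print(counts["sentences"])  # 2
print(counts["words"])      # 2
```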
