I have a list of sequences to be found in the sequencing data. So I run a for loop to find the matching sequences in a dataset, and use Counter() to get the most frequent sequence. But I found that Counter() keeps adding data from previous loop iterations instead of counting each query separately.
ls = ['AGC', 'GCT', 'TAC', 'CGT']
dataset.txt contains a bunch of sequences like "AGTAGCTTT", "AGTTAGC", ...
def xfind(seq):
    ls2 = []
    with open(dataset.txt, 'r') as f:
        for line in f:
            if seq in line:
                ls2.append(line)
    import collections
    from collections import Counter
    cnt = Counter()
    for l in ls2:
        cnt[l] += 1
    print(cnt.most_common()[0])

for l2 in ls:
    xfind(l2)
The results look like:
('AGTAGCTTT", 2)
('AGTAGCTTT", 5)
It should be:
('AGTAGCTTT', 2)
('GCT...', 3)
I'm not sure you understand your own code very well, and your use of Counter isn't really how it's intended to be used, I think.
You start by checking, for each line of the text file, whether the substring is in that sequence (line), and if it is you add the line to a list ls2.
Then for every element of that list (which are the whole lines/sequences from the text file) you add 1 to the counter for that key. You do this in a loop, when the whole point of Counter is that you can simply call:
cnt = Counter(ls2)
This all means that you are reporting the most common full line in the file among those that contain the given subsequence.
Now it is actually a bit hard to say what your exact output should be, without knowing what your dataset.txt looks like.
I would start by tidying up the code a little:
from collections import Counter

subsequences = ['AGC', 'GCT', 'TAC', 'CGT']

def xfind(subseq):
    contains_ss = []
    with open("dataset.txt", 'r') as f:
        for line in f:
            if subseq in line:
                contains_ss.append(line)
    cnt = Counter(contains_ss)
    print(cnt.most_common()[0])

for ss in subsequences:
    xfind(ss)
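If you would rather collect one result per subsequence instead of printing inside the function, a minimal variation could look like this (my own sketch, not part of the original answer; xfind_value and results are hypothetical names):

# Sketch (assumption): return the (line, count) pair instead of printing it
def xfind_value(subseq):
    with open("dataset.txt") as f:
        matches = [line for line in f if subseq in line]
    return Counter(matches).most_common(1)[0] if matches else None

results = {ss: xfind_value(ss) for ss in subsequences}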
I very often face the following problem:
I have a list with unknown elements in it (but each element is of the same type, e.g. str) and I want to count the occurrences of each element. Sometimes I also want to do something with the occurrence values, so I usually store them in a dictionary.
My problem is that I cannot "auto initialize" a dictionary with +=1, so first I have to check whether the given element is already in the dictionary.
My usual go-to solution:
dct = {}
for i in iterable:
    if i in dct:
        dct[i] += 1
    else:
        dct[i] = 1
Is there a simpler solution to this problem?
Yes! A defaultdict.
from collections import defaultdict

dct = defaultdict(int)
for i in iterable:
    dct[i] += 1
You can auto-initialise with other types too:
Docs: https://docs.python.org/3.3/library/collections.html#collections.defaultdict
d = defaultdict(str)
d[i] += 'hello'
If you're just counting things, you could use a Counter instead:
from collections import Counter
c = Counter(iterable) # c is a subclass of dict
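For a quick comparison (hypothetical toy data), both approaches produce the same counts; the defaultdict just initialises missing keys to int() == 0:

from collections import Counter, defaultdict

words = ["a", "b", "a"]          # hypothetical example data

dd = defaultdict(int)
for w in words:
    dd[w] += 1                   # missing keys start at 0, so no KeyError

print(dict(dd))                  # {'a': 2, 'b': 1}
print(Counter(words))            # Counter({'a': 2, 'b': 1})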
I've got a dataset containing lists of tokens in CSV format like this:
song, tokens
aaa,"['everyon', 'pict', 'becom', 'somebody', 'know']"
bbb,"['tak', 'money', 'tak', 'prid', 'tak', 'littl']"
First I want to find all the words that appear in the text at least a certain number of times, let's say 5, and this is easily done:
import pandas as pd

# converters simply reconstruct the string of tokens into a list of tokens
darklyrics = pd.read_csv('dataset.csv',
                         converters={'tokens': lambda x: x.strip("[]").replace("'", "").split(", ")})

# List of all words
allwords = [word for tokens in darklyrics['tokens'] for word in tokens]
allwords = pd.DataFrame(allwords, columns=['word'])

more5 = allwords[allwords.groupby("word")["word"].transform('size') >= 5]
more5 = set(more5['word'])

frequentwords = [token.strip() for token in more5]
frequentwords.sort()
Now I want to remove from each list of tokens those that do not appear inside frequentwords; to do so I'm using this code:
import numpy as np
from multiprocessing import Pool

def remove_non_frequent(x):
    global frequentwords
    output = []
    for token in x:
        if token in frequentwords:
            output.append(token)
    return output

def remove_on_chunk(df):
    df['tokens'] = df.apply(lambda x: remove_non_frequent(x['tokens']), axis=1)
    return df

def parallelize_dataframe(df, func, n_split=10, n_cores=4):
    df_split = np.array_split(df, n_split)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

lyrics_reconstructed = parallelize_dataframe(lyrics, remove_on_chunk)
The non-multiprocess version takes around 2.30-3 hours to compute, while this version takes 1 hour.
Surely it's a slow process, because I have to search roughly 130 million tokens against a list of 30k elements, but I'm quite sure my code is not particularly good.
Is there a faster and surely better way to achieve something like this?
Go for set operations. I've saved your example data to a "tt1" file, so this should work. Also, if you are generating the data yourself somehow, do yourself a favour and drop the quotes and square brackets; it would save you time in pre-processing.
from collections import Counter
import re

rgx = re.compile(r"[\[\]\"' \n]")  # data cleanup

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()
    for line in o:
        parts = line.split(',')
        clean_parts = {re.sub(rgx, "", i) for i in parts[1:]}
        counter.update(clean_parts)
        data.append((parts[0], clean_parts))

n = 2  # <- here set threshold for number of occurrences
common_words = {i[0] for i in counter.items() if i[1] > n}

# process the data
clean_data = []
for s, r in data:
    clean_data.append((s, r - common_words))
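One detail worth noting (a small illustration with hypothetical tokens): clean_parts is built with a set comprehension, so a word repeated inside a single line only counts once towards the threshold, whereas updating the counter with a list would count every repetition:

from collections import Counter

counter = Counter()
counter.update({'tak', 'money'})   # set: 'tak' counted once for this line
counter.update(['tak', 'tak'])     # list: 'tak' counted twice
print(counter)                     # Counter({'tak': 3, 'money': 1})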
It's been a while, but I'll post the correct solution to the problem, thanks to Marek, because it's just a slight modification of his code.
He uses sets, which can't handle duplicates, so the obvious idea is to reuse the same code but with multisets.
I've worked with this implementation: https://pypi.org/project/multiset/
from collections import Counter
import re
from multiset import Multiset

rgx = re.compile(r"[\[\]\"' \n]")  # data cleanup

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()
    for line in o:
        parts = line.split(',')
        clean_parts = [re.sub(rgx, "", i) for i in parts[1:]]
        counter.update(clean_parts)
        ms = Multiset()
        for word in clean_parts:
            ms.add(word)
        data.append([parts[0], ms])

n = 2  # <- here set threshold for number of occurrences
common_words = Multiset()

# I'm using intersection with the most common words since
# common_words is way smaller than uncommon_words.
# Intersection returns the lowest count between two multisets,
# e.g. ('sky', 10) and ('sky', 1) will produce ('sky', 1).
# I want the number of repeated words in my document, so I set the
# common words counter to be very high.
for item in counter.items():
    if item[1] >= n:
        common_words.add(item[0], 100)

# process the data
clean_data = []
for s, r in data:
    clean_data.append((s, r.intersection(common_words)))

output_data = []
for s, ms in clean_data:
    tokens = []
    for item in ms.items():
        for i in range(0, item[1]):
            tokens.append(item[0])
    output_data.append([s] + [tokens])
This code extracts the most frequent words and filters each document according to that list; on a 110 MB dataset it performs the job in less than 2 minutes.
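For reference, here is a minimal sketch (with made-up words) of the intersection behaviour described in the comments above: the intersection keeps the lower multiplicity of each element, which is why common_words is added with a deliberately high count of 100.

from multiset import Multiset

doc = Multiset(['sky', 'sky', 'sea'])     # document tokens: sky=2, sea=1
common = Multiset()
common.add('sky', 100)                    # frequent word, added with a high count
print(doc.intersection(common))           # keeps 'sky' with multiplicity 2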
So first of all, I have a function that counts words in a text file, and a program that creates a dictionary based on how many occurrences of each word are in that text file. The program is:
def counter(AllWords):
    d = {}
    for word in AllWords:
        if word in d.keys():
            d[word] = d[word] + 1
        else:
            d[word] = 1
    return d

f = open("test.txt", "r")
AllWords = []
for word in f.read().split():
    AllWords.append(word.lower())

print(counter(AllWords))
Now, given that dictionary, I want to create a list of objects such that each object has two instance variables: the word (string) and how many times it appears (integer). Any help is appreciated!
What about:
list(d.items())
It will create a list of tuples like:
[('Foo',3),('Bar',2)]
Or you can define your own class:
class WordCount:
    def __init__(self, word, count):
        self.word = word
        self.count = count
and use list comprehension:
[WordCount(*item) for item in d.items()]
So here you create a list of WordCount objects.
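A quick usage sketch (assuming d is the dictionary produced above), just to show the two instance variables being accessed:

word_counts = [WordCount(*item) for item in d.items()]
for wc in word_counts:
    print(wc.word, wc.count)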
Nevertheless, your counter(..) method is actually not necessary: Python already has a Counter:
from collections import Counter
which is "a dictionary with things" so to speak: you can simply construct it like:
from collections import Counter
Counter(allWords)
No need to reinvent the wheel to count items.
What about a quasi one-liner to do all the heavy lifting, using of course collections.Counter and the mighty str.split?
import collections

with open("text.txt") as f:
    c = collections.Counter(f.read().split())
Now c contains the pairs (word, number of occurrences of the word).
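For example (hypothetical contents of text.txt), you can then query the counter directly:

print(c['the'])           # occurrences of 'the' (0 if it never appears)
print(c.most_common(3))   # the three most frequent words with their counts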
I am writing code that allows the user to input a .txt file of their choice. So, for example, if the text read:
"I am you. You ArE I."
I would like my code to create a dictionary that resembles this:
{I: 2, am: 1, you: 2, are: 1}
Having the words in the file appear as the key, and the number of times as the value. Capitalization should be irrelevant, so are = ARE = ArE = arE = etc...
This is my code so far. Any suggestions/help?
file = input("\n Please select a file")
name = open(file, 'r')
dictionary = {}
with name:
    for line in name:
        (key, val) = line.split()
        dictionary[int(key)] = val
Take a look at the examples in this answer:
Python : List of dict, if exists increment a dict value, if not append a new dict
You can use collections.Counter() to trivially do what you want, but if for some reason you can't use that, you can use a defaultdict or even a simple loop to build the dictionary you want.
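As a minimal sketch of the defaultdict alternative (assuming words is a hypothetical list of the lower-cased words from the file):

from collections import defaultdict

counts = defaultdict(int)
for word in words:        # words: hypothetical list of lower-cased words
    counts[word] += 1     # missing keys auto-initialise to 0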
Here is code that solves your problem. This will work in Python 3.1 and newer.
from collections import Counter
import string

def filter_punctuation(s):
    return ''.join(ch if ch not in string.punctuation else ' ' for ch in s)

def lower_case_words(f):
    for line in f:
        line = filter_punctuation(line)
        for word in line.split():
            yield word.lower()

def count_key(tup):
    """
    Key function to make a count dictionary sort into descending order
    by count, then case-insensitive word order when counts are the same.

    tup must be a tuple in the form: (word, count)
    """
    word, count = tup
    return (-count, word.lower())

dictionary = {}

fname = input("\nPlease enter a file name: ")
with open(fname, "rt") as f:
    dictionary = Counter(lower_case_words(f))

print(sorted(dictionary.items(), key=count_key))
From your example I could see that you wanted punctuation stripped away. Since we are going to split the string on white space, I wrote a function that filters punctuation to white space. That way, if you have a string like hello,world this will be split into the words hello and world when we split on white space.
The function lower_case_words() is a generator, and it reads an input file one line at a time and then yields up one word at a time from each line. This neatly puts our input processing into a tidy "black box" and later we can simply call Counter(lower_case_words(f)) and it does the right thing for us.
Of course you don't have to print the dictionary sorted, but I think it looks better this way. I made the sort order put the highest counts first, and where counts are equal, put the words in alphabetical order.
With your suggested input, this is the resulting output:
[('i', 2), ('you', 2), ('am', 1), ('are', 1)]
Because of the sorting it always prints in the above order.
Write a function that takes as argument a filename to read, and returns the number of even numbers present in the file.
I have tried and tried, please someone help. It does not return the number of even numbers.
def counteven(l):
    infile = open('even.txt', 'r')
    num = infile.read()
    for i in infile:
        if (i %2!=0):
            return i
    infile.close()

assertEqual(counteven('even.txt'),2)
@Ergwun already pointed out the problems in your code. Here's another solution:
def counteven(integers):
    return sum(1 for n in integers if n % 2 == 0)

with open('even.txt') as f:
    numbers = (int(line) for line in f)
    print(counteven(numbers))
You do not say what the format of the file is. Based on your attempt, I'm assuming that your file contains just a single integer on each line.
Here are some of the problems with your function:
You are passing an argument to the function called l, but not using it. You should be using it as the name of the file to open, instead of hard coding 'even.txt'.
You are reading the entire file into a variable called num and then do not even use that variable. Having read in the entire file, there is nothing left to iterate over in your for loop.
Your for loop iterates over the lines of the file as strings. You need to convert the line to an integer before testing if it's divisible by two.
Inside the for loop, you are going to return the first even number found, rather than counting all the even numbers. You need to create a count variable before the loop, increment it in the loop every time an even number is found, and then return the count after the loop has completed.
If you fix those problems, your function should look something like this:
def counteven(filename):
    countOfEvenNumbers = 0
    infile = open(filename, 'r')
    for line in infile:
        number = int(line)
        if (number % 2 == 0):
            countOfEvenNumbers += 1
    infile.close()
    return countOfEvenNumbers
...
UPDATE (to address your comment):
assertEqual is a method of the TestCase class provided by the unittest module.
If you are writing a unit test, then assertEqual should be called in a test case in a class derived from TestCase.
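A minimal sketch of what that looks like (assuming counteven is defined or imported in the same test file):

import unittest

class TestCountEven(unittest.TestCase):
    def test_counteven(self):
        self.assertEqual(counteven('even.txt'), 2)

if __name__ == '__main__':
    unittest.main()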
If you simply want to make an assertion outside of a unit test you can write:
assert counteven('even.txt') == 2, ' Number of even numbers must be 2'