Writing sequences into separate list or array - python-3.x

I'm trying to extract these sequences from a file into separate lists or arrays in Python.
My data looks like:
>gene_FST
AGTGGGTAATG--TGATG...GAAATTTG
>gene_FPY
AGT-GG..ATGAAT---AAATGAAAT--G
I would like to have
seq1 = [AGTGGGTAATG--TGATG...GAAATTTG]
seq2 = [AGT-GG..ATGAAT---AAATGAAAT--G]
My plan is to later compare the contents of the lists.
I would appreciate any advice.

So far, here's what I have done:
f = open(r"C:\Users\Olukayode\Desktop\my_file.txt", 'r')  # the leading r makes this a raw string, so the backslashes in the path are taken literally

def parse_fasta(lines):
    seq = []
    seq1 = []
    seq2 = []
    head = []
    data = ''
    for line in lines:
        if line.startswith('>'):
            if data:
                seq.append(data)
                data = ''
            head.append(line[1:])
        else:
            data += line.rstrip()
    seq.append(data)
    return seq
h = parse_fasta(f)
print(h)
print(h[0])
print(h[1])
gives:
['AGTGGGTAATG--TGATG...GAAATTTG', 'AGT-GG..ATGAAT---AAATGAAAT--G']
AGTGGGTAATG--TGATG...GAAATTTG
AGT-GG..ATGAAT---AAATGAAAT--G
I think I just figured it out: I can take each string from the list containing both sequences and put it into a separate list, if that is possible.
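A minimal sketch of that idea, reusing the h returned by parse_fasta above:
h = parse_fasta(f)
seq1 = [h[0]]  # ['AGTGGGTAATG--TGATG...GAAATTTG']
seq2 = [h[1]]  # ['AGT-GG..ATGAAT---AAATGAAAT--G']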

If you want to get the exact results you were looking for in your original question, i.e.
seq1 = [AGTGGGTAATG--TGATG...GAAATTTG]
seq2 = [AGT-GG..ATGAAT---AAATGAAAT--G]
you can do it in a variety of ways. Instead of changing anything you already have though, you can just convert your data into a dictionary and print the dictionary items.
# your code block from above, then:
h = parse_fasta(f)
sDict = {}
for i in range(len(h)):
    sDict["seq" + str(i+1)] = [h[i]]
for seq, data in sDict.items():
    print(seq, "=", data)
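With the sample file above, this should print something like:
seq1 = ['AGTGGGTAATG--TGATG...GAAATTTG']
seq2 = ['AGT-GG..ATGAAT---AAATGAAAT--G']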

Related

How To Generate A List Of Possible Combination From A Set Dictionary

I'm a beginner coder, I have the code below
def PossibleNum(List):
    DefaultSymbol = '%'
    NumDict = ["0","1","2","3","4","5","6","7","8","9"]
    FinishList = []
    for Item in List:
        for i in range(len(NumDict)):
            _item = Item.replace(DefaultSymbol, NumDict[i])
            FinishList.append(_item)
    return FinishList

List = ["AAAA%%","BBB%%%","CC%%C%"]
print(PossibleNum(List))
I'm trying to get every possible combination by replacing each "%" with every possible digit from NumDict.
Wanted output: [AAAA00, AAAA01, AAAA02, AAAA03, ..., AAAA99]
Current output: [AAAA11, AAAA22, AAAA33, AAAA44, AAAA55, AAAA66]
You can use str.replace with the count parameter set to 1. To obtain the combinations, I used the str.format method.
For example:
lst = ["AAAA%%", "BBB%%%", "CC%%C%"]
output = []
for i in lst:
    n = i.count('%')
    backup = i
    for v in range(10**n):
        i = backup
        for ch in '{:0{n}}'.format(v, n=n):
            i = i.replace('%', ch, 1)
        output.append(i)

# pretty print:
from pprint import pprint
pprint(output)
Prints:
['AAAA00',
'AAAA01',
'AAAA02',
'AAAA03',
...all the way to:
'CC99C5',
'CC99C6',
'CC99C7',
'CC99C8',
'CC99C9']
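The piece doing the zero padding is the nested format spec: '{:0{n}}'.format(v, n=n) pads v with zeros to a width of n digits, which is what produces 00, 01, ... rather than 0, 1, .... A minimal sketch with made-up values:
print('{:0{n}}'.format(7, n=3))   # -> 007
print('{:0{n}}'.format(42, n=2))  # -> 42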
An option using itertools.product to get all the possible inserts:
import itertools

l = ["AAAA%%", "BBB%%%", "CC%%C%"]
DefaultSymbol = '%'
NumDict = ["0","1","2","3","4","5","6","7","8","9"]

out = []
for s in l:
    n = s.count(DefaultSymbol)
    prod = itertools.product(NumDict, repeat=n)
    for p in prod:
        tmp = s
        for i in p:
            tmp = tmp.replace(DefaultSymbol, i, 1)
        out.append(tmp)
Pretty straightforward: for each input list element, get the number of replacements (the count of '%'), compute all possible tuples to insert using itertools.product, then iterate over these tuples (for p in prod) and do the replacements one at a time (for i in p, with the replace count set to 1).
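For reference, a minimal sketch (with a shortened, made-up two-digit alphabet) of what itertools.product yields:
import itertools
# with repeat=2, product yields every ordered pair from the alphabet
print(list(itertools.product("01", repeat=2)))
# -> [('0', '0'), ('0', '1'), ('1', '0'), ('1', '1')]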

Fastest way to filter non frequent words inside lists of words

I have a dataset containing lists of tokens in CSV format like this:
song, tokens
aaa,"['everyon', 'pict', 'becom', 'somebody', 'know']"
bbb,"['tak', 'money', 'tak', 'prid', 'tak', 'littl']"
First I want to find all the words that appear in the text at least a certain number of times, let's say 5, and this is easily done:
import pandas as pd

# converters simply reconstruct the string of tokens into a list of tokens
lyrics = pd.read_csv('dataset.csv',
                     converters={'tokens': lambda x: x.strip("[]").replace("'", "").split(", ")})

# list of all words
allwords = [word for tokens in lyrics['tokens'] for word in tokens]
allwords = pd.DataFrame(allwords, columns=['word'])

more5 = allwords[allwords.groupby("word")["word"].transform('size') >= 5]
more5 = set(more5['word'])

frequentwords = [token.strip() for token in more5]
frequentwords.sort()
Now, for each list of tokens, I want to keep only the tokens that appear inside frequentwords; to do so I'm using this code:
import numpy as np
from multiprocessing import Pool

def remove_non_frequent(x):
    global frequentwords
    output = []
    for token in x:
        if token in frequentwords:
            output.append(token)
    return output

def remove_on_chunk(df):
    df['tokens'] = df.apply(lambda x: remove_non_frequent(x['tokens']), axis=1)
    return df

def parallelize_dataframe(df, func, n_split=10, n_cores=4):
    df_split = np.array_split(df, n_split)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
lyrics_reconstructed = parallelize_dataframe(lyrics, remove_on_chunk)
The non-multiprocess version takes around 2.30-3 hours to compute, while this version takes 1 hour.
It's surely a slow process, because I have to search for roughly 130 million tokens in a list of 30k elements, but I'm quite sure my code is not particularly good.
Is there a faster and better way to achieve something like this?
Go for set operations. I've saved your example data to a "tt1" file, so this should work. Also, if you are generating the data yourself, do yourself a favour and drop the quotes and square brackets; it would save you time in pre-processing.
from collections import Counter
import re

rgx = re.compile(r"[\[\]\"' \n]")  # data cleanup

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()
    for line in o:
        parts = line.split(',')
        clean_parts = {re.sub(rgx, "", i) for i in parts[1:]}
        counter.update(clean_parts)
        data.append((parts[0], clean_parts))

n = 2  # <- here set the threshold for the number of occurrences
common_words = {i[0] for i in counter.items() if i[1] > n}

# process the data
clean_data = []
for s, r in data:
    clean_data.append((s, r - common_words))
It's been a while, but I'll post the correct solution to the problem, thanks to Marek, because it's just a slight modification of his code.
He uses sets, which can't handle duplicates, so the obvious idea is to reuse the same code but with multisets.
I've worked with this implementation: https://pypi.org/project/multiset/
from collections import Counter
import re
from multiset import Multiset

rgx = re.compile(r"[\[\]\"' \n]")  # data cleanup

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()
    for line in o:
        parts = line.split(',')
        clean_parts = [re.sub(rgx, "", i) for i in parts[1:]]
        counter.update(clean_parts)
        ms = Multiset()
        for word in clean_parts:
            ms.add(word)
        data.append([parts[0], ms])

n = 2  # <- here set the threshold for the number of occurrences

common_words = Multiset()
# I'm using intersection with the most common words since
# common_words is way smaller than uncommon_words.
# Intersection returns the lowest count between two multisets,
# e.g. ('sky', 10) and ('sky', 1) will produce ('sky', 1).
# I want the number of repeated words in my document, so I set the
# common words counter to be very high.
for item in counter.items():
    if item[1] >= n:
        common_words.add(item[0], 100)

# process the data
clean_data = []
for s, r in data:
    clean_data.append((s, r.intersection(common_words)))

output_data = []
for s, ms in clean_data:
    tokens = []
    for item in ms.items():
        for i in range(0, item[1]):
            tokens.append(item[0])
    output_data.append([s] + [tokens])
This code extracts the most frequent words and filters each document against that list; on a 110 MB dataset it does the job in less than 2 minutes.
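To make the intersection behaviour concrete, here is a minimal sketch using only the Multiset calls already shown above; the words and counts are made up for illustration:
from multiset import Multiset

doc = Multiset()
for w in ['sky', 'sky', 'sky', 'fall']:  # hypothetical document tokens
    doc.add(w)

common = Multiset()
common.add('sky', 100)  # the very high count on the common-words side

kept = doc.intersection(common)  # keeps the lower count per word: 'sky' x 3, 'fall' dropped
print(sorted(kept.items()))      # -> [('sky', 3)]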

Common values in a Python dictionary

I'm trying to write code that will return common values from a dictionary based on a list of words.
Example:
inp = ['here','now']
dict = {'here':{1,2,3}, 'now':{2,3}, 'stop':{1, 3}}
for val in inp.intersection(D):
    lst = D[val]
    print(sorted(lst))
output: [2, 3]
The input inp may contain any one or all of the above words, and I want to know what values they have in common. I just cannot seem to figure out how to do that. Please, any help would be appreciated.
The easiest way to do this is to just count them all, and then make a dict of the values that are equal to the number of sets you intersected.
To accomplish the first part, we do something like this:
answer = {}
for word in inp:
    for itm in dict[word]:  # look up the set of values for this word
        if itm in answer:
            answer[itm] += 1
        else:
            answer[itm] = 1
To accomplish the second part, we just have to iterate over answer and build an array like so:
answerArr = []
for i in answer:
    if answer[i] == len(inp):
        answerArr.append(i)
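With the example inp and dictionary above, answer should end up as {1: 1, 2: 2, 3: 2}, so answerArr becomes [2, 3], since 2 and 3 are the only values present for both 'here' and 'now'.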
I'm not certain that I understood your question perfectly, but I think this is what you meant, albeit in a very simple way:
inp = ['here','now']
dict = {'here':{1,2,3}, 'now':{2,3}, 'stop':{1, 3}}
output = []
for item in inp:
    output.append(dict[item])
for item in output:
    occurances = output.count(item)
    if occurances <= 1:
        output.remove(item)
print(output)
This should output the items from the dict which occur in more than one input. If you want values common to all of the inputs, just change the <= 1 to the number of inputs given.
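For reference, a minimal sketch of the intersection idea the question was reaching for, using the same data (the dictionary is renamed word_sets here only to avoid shadowing the built-in dict):
inp = ['here', 'now']
word_sets = {'here': {1, 2, 3}, 'now': {2, 3}, 'stop': {1, 3}}

# intersect the value sets of the requested words
common = set.intersection(*(word_sets[w] for w in inp))
print(sorted(common))  # -> [2, 3]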

nested lists converting certain elements into ints

I've imported a file with tabular data and have made it a nested list. I would like to convert the numerical string elements to integers. How do I convert them to ints?
This is what I have thus far:
f = open("data.txt", "r")
prov_data = []
for line in f:
prov_data.append(line.strip().split(","))
prov = []
for prov in prov_data:
for prov in range(len(prov_data)):
prov.append(prov_data[prov])
f.close()
The list is:
l = [['MB', '1281000', '14'], ['NB', '754900', '14'], ['NL', '528300', '7'], ['NT', '43900', '1']]
Basically I'm trying to understand how to convert those second and third elements.
Just convert them after the split
def convert_to_int(foo):
    try:
        return int(foo.strip())
    except ValueError:
        return foo

...

prov_data = [[convert_to_int(x) for x in line.strip().split(",")]
             for line in f]
...
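For example, with a row like the MB line from the data above, the comprehension would yield:
line = "MB,1281000,14\n"
print([convert_to_int(x) for x in line.strip().split(",")])
# -> ['MB', 1281000, 14]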

python 3.4.2 joining strings into lists

I am a Python newbie, so I am writing small programs to get more familiar. I have a Raspberry Pi too; I'm very Unix skilled and have done programming, but not Python 3. One of these programs is a simple bubble sort: it reads in two txt files, one with numbers like 5 9 2 19 18 17 13 and another with different numbers like 10 14 2 4 6 20.
I use a function to read in each file, then join them before I bubble-sort the whole string. I'm aware it needs to be a list so that the bubble-sort function can move the numbers around during each pass. From what I can tell, my issue is that mergesort (the variable name for the concatenated list) is always a string.
Can anyone shed any light on why this is so, and how I could convert the two files into a single list?
------------------sample code-------------------
mergesort = []

def readfile1():
    tempfile1 = open('sortfile1.txt', 'r')
    tempfile1 = tempfile1.read()
    return tempfile1

def readfile2():
    tempfile2 = open('sortfile2.txt', 'r')
    tempfile2 = tempfile2.read()
    return tempfile2

sortstring1 = readfile1()
# print (sortstring1)
sortstring2 = readfile2()
# print (sortstring2)

# mergesort = list(set(sortstring1) | set(sortstring2)
mergesort = sortstring1 + sortstring2
print(mergesort, "Type=", type(mergesort))
Assuming you want to get one list of integers, you could do it like this. Notice I also combined your functions into one because they were doing the exact same thing.
In your code, you are not splitting the contents of the file into a list, so it was read in as a string. Use the split() method to split a string into a list.
def read_file_to_list(filename):
    temp = open(filename, 'r')
    string = temp.read()
    numbers = [int(x) for x in string.split(' ')]
    return numbers

sort1 = read_file_to_list('sortfile1.txt')
sort2 = read_file_to_list('sortfile2.txt')
total = sort1 + sort2
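Once total is a single list of ints, it can be sorted in place, either with your own bubble sort or, as a quick sanity check, the built-in sort:
total.sort()  # or pass total to your bubble-sort function instead
print(total, "Type=", type(total))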
