How to count strings in specified field within each line of one or more csv files - python-3.x

I am writing a Python 3 program to count strings in a specified field within each line of one or more CSV files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
Where the error occurs:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos] ##<---LINE generating error--->##
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!

Judging from the output, I'd say the fields in the CSV file do not influence the count of each string. If string uniqueness is case-insensitive, remember to call yourstring.lower() so that matches differing only in case are counted as one. Also keep in mind that if your text is large, the number of unique strings can be very large as well, so the result needs to be sorted to make sense of it (otherwise it may be a long list of counts in arbitrary order, most of them 1).
Using the collections module is an easy way to get a count of unique strings.
# read the whole file into one string; the with-block closes it afterwards
with open('yourfile.txt', encoding="utf8") as file:
    a = file.read()
# if you have some words you'd like to exclude
with open('stopwords.txt') as stopfile:
    stopwords = set(line.strip() for line in stopfile)
stopwords = stopwords.union({'<media', 'omitted>', "it's", 'two', 'said'})
# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split():  # use the delimiter you want (a comma I think?)
    # strip punctuation so it isn't counted as part of a word
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace("\"", "")
    word = word.replace("!", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict should contain each word and its frequency. After that, just sort it using collections and print it out.
import collections
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
I hope this solves your problem. Lemme know if you face problems.
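Coming back to the TypeError in the question: the field numbers arrive from the command line as text, so they have to be converted to integers before they can index a row. Below is a minimal sketch of that idea, not the asker's actual script. The function name count_fields is hypothetical, and it assumes the field numbers are 1-based and that the first row of each file is a header.

```python
import csv
from collections import Counter

def count_fields(paths, fields):
    """Count the values found in the given 1-based field numbers across CSV files."""
    # "1,2,3".split(",") yields strings; list indexes must be integers
    indexes = [int(field) - 1 for field in fields]
    counts = Counter()
    for path in paths:
        with open(path, newline="") as handle:
            # skipinitialspace drops the blank that follows each comma
            reader = csv.reader(handle, skipinitialspace=True)
            next(reader, None)  # assumption: the first row is a header
            for row in reader:
                for index in indexes:
                    if index < len(row):
                        counts[row[index].strip()] += 1
    return counts
```

From a script it could be called as count_fields(sys.argv[2:], sys.argv[1].split(",")), and counts.most_common() then returns the values sorted by frequency.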

Related

find a sequence in string

Hi, I've got a problem set in CS50 and I'm having difficulties, as this is my first week in Python. I would appreciate it if you didn't write out a full answer directly but pointed me to the right functions or methods to use.
We've been given a long string sequence in a .txt file: one line, no white space. I have to find the longest consecutive run of a given DNA string.
example txt:
GGAGGCCAAAGTCTTGTGATATCGGGCAACTCCCCGGGAGGAACACAGGCCCACCGAAAACAGCTTGAAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG
The original text is around 5000 characters, but it looks like the example above. My task is to find the longest consecutive run of the 'AGATC' string.
Let's say the first consecutive run repeats 23 times, and after I keep reading I find another consecutive run of 34 repeats. I have to store the biggest number.
My problem is not finding a way to read and analyse a string. I can read a string, find the total number of repetitions and so on, but finding the longest repetition has not made sense in any way I've tried. I thought C was hard, but I could write this code in C easily, as we can manipulate strings in so many ways in C. At least in C there are ways to read a given size, but as far as I can see Python reads everything at once and there is no control over the read. In Python it doesn't seem you can do much with it, at least at my current level of knowledge :/ Python probably has one-line solutions for this; please don't judge, this is my 3rd day and 4th program in Python.
What functions or methods should I look at to analyze a string in this way? I've watched videos about a similar task, but for runs of a single character, not a string. I also bought Python Crash Course to learn about string manipulation but couldn't find anything related to this case. I also checked the Python documentation, but obviously it's too complicated for day 3 in Python.
Could anyone help me, please? TIA
Here is my not-working and not-making-sense code:
import csv
import sys
#check the arguments count
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(1)
#create a dictionary to store str results
SEQ = {
    "AGATC": 0,
    "AATG": 0,
    "TATC": 0
}
counter = 0 #keeps the length of the sequence
seq = 0 #keeps the longest sequence
DNA = '' ## keeps the key of SEQ, "AGATC" etc.
#find the longest consecutive sequence of DNA
def findSEQ(file, DNA): #get the sequences text file and the string of the key as parameters
    for DNA in (DNA, file):
        if file[i:i + len(DNA)] == DNA: #if find a match
            counter += 1 #count up the sequence
        else:
            if counter > seq: #if it's not a sequence the next thing it reads
                seq = counter
                counter = 0
    return seq
seq = 0
#open sequence file and read
with open(sys.argv[2],'r') as file:
    reader = csv.reader(file)
    #find the longest sequence of AGATC
    findSEQ("AGATC", file)
    #update the seq dictionary
    SEQ["AGATC"] = seq
    #find the longest sequence of AATG
    findSEQ(file, "AATG")
    #update the seq dictionary
    SEQ["AATG"] = seq
    #find the longest sequence of TATC
    findSEQ(file, "TATC")
    #update the seq dictionary
    SEQ["TATC"] = seq
#open and read database
with open(sys.argv[1], "r") as file:
    reader = csv.reader(file)
    #skip the first row
    next(reader)
    #compare the seq dictionary results with database
    for row in reader:
        seq1, seq2, seq3 = row[1], row[2], row[3]
        #if found any match print the name
        if SEQ[seq1] == row[1] and SEQ[seq2] == row[2] and SEQ[seq3] == row[3]:
            print(row[0])
        #otherwise print not found
        else:
            print("Not found any match.")
To elaborate on my comment, please find the following example:
import re
text = 'GGAGGCCAAGATCAAGTCTTGTGATATCGGGCAACTCCCCGGGAAGATCAGATCAGATCGGAACACAGGCCCACCGAAAACAGCTTGAAGATCAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG'
sequence = 'AGATC'
pattern = f'(?:{sequence})+'
findings = sorted(re.findall(pattern, text), key=len)
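# integer division gives a whole repeat count; fall back to 0 when nothing matched
longest_sequence = len(findings[-1]) // len(sequence) if findings else 0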
print(f'longest sequence: {longest_sequence}')
This program uses regex (regular expressions) to find sequences of the pattern you're looking for. It then sorts the findings by length (in an ascending order), allowing you to find the longest sequences in the last index of the list.
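Since the question asked which string methods to look at, here is a minimal sketch of the same search without regular expressions. It is one possible approach, not the course's reference solution. The function name longest_run is hypothetical, and the only method it relies on is str.startswith with a start offset.

```python
def longest_run(text, sequence):
    """Return the largest number of back-to-back copies of sequence in text."""
    if not sequence:
        return 0  # an empty pattern would otherwise match forever
    best = 0
    step = len(sequence)
    for start in range(len(text)):
        count = 0
        # keep hopping one pattern length ahead while the pattern repeats
        while text.startswith(sequence, start + count * step):
            count += 1
        best = max(best, count)
    return best
```

Trying every start position is slower than skipping past a run that was just counted, but it stays correct for patterns that overlap with themselves.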

looping through two files and count string occurrences of 2nd file in lines of first file using python

Thank you for your help, and apologies for my naivety.
I need to generate permutations of some letters (A T G C), actually nucleotides, for di-composition (e.g. AA AT AG AC), tri-composition (AAA AAT AAC AAG), tetra, penta etc. (one at a time), and then count the occurrences of each permutation in another file that contains sequences with some values. I generated the permutation list.
Now I need to loop through the sequences only (splitting the sequences from the values), count each of the permutations generated above, and write the output to a new file.
But I'm getting the answer for only the first sequence and not for the other sequences.
The logic of the program I tried to follow is:
Generate the permutations of ATCG in a file1 (e.g. AT AG AC AA ...)
Read the generated file1 and the sequence#value file (DNA_seq_val.txt)
Read the sequences and separate the sequences from the values
Loop through the sequences for the permutations and print their occurrence with values (each separated with a comma) in the results file.
Input test file = DNA_seq_val.txt
AAAATTTT#99 \n
CCCCGGGG#77\n
ATATATCGCGCG#88\n
Output I got is:
2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,99 AAAATTTT \n
77 CCCCGGGG\n
88 ATATATCGCGCG
Output needed is:
2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2,99 AAAATTTT \n
x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,77 CCCCGGGGx \n
x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,88 ATATATCGCGCG (where x= corresponding counts as in first line)
from itertools import product
import os
f2 = open('TRYYY', 'a')
#********Generate the permutations start********
per = product('ACGT', repeat=2) # ATGC =nucleotides; 2= for di ntd(replace 2 with 3 for tri ntds and so on)
f = open('myfile', 'w')
p = ""
for p in per:
    p = "".join(p)
    f.write(p + "\n")
f.close()
#********Generate the permutations ENDS********
with open('DNA_seq_val.txt', 'r+') as SEQ, open('myfile', 'r+') as TET: #open two files
    SEQ_lines = sum(1 for line in open('DNA_seq_val.txt')) #count lines in sequences file
    #print (SEQ_lines)
    compo_lines = sum(1 for line in open('myfile')) #count lines in composition
    #print (compo_lines)
    for lines in SEQ:
        line,val1 = lines.split("#")
        val2 = val1.rstrip('\n')
        val = str(val2)
        line = line.rstrip('\n')
        length =len(line)
        #print (line)
        #print (val)
        LIN = line, val
        #print (LIN)
        newstr = "".join((line))
        print (newstr)
        #while True: # infinite loop
        for PER in TET:
            #print (line)
            PER = PER.rstrip('\n')
            length2 =len(PER)
            #print (length2)
            #print (line)
            # print (PER)
            C_PER = str(line.count(PER))
            # print (C_PER)
            for R in C_PER:
                R1 = "".join(R)
                f2.write(R1+ ",")
        f2.write(val,)
        f2.write('\t')
        f2.write(line)
        f2.write('\n')
        #exit()
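A likely cause of the missing counts is that the inner loop over the open file TET consumes the file on the first sequence, so for every later sequence there is nothing left to iterate over. A minimal sketch of one way around this is to build the compositions once as a list and reuse it for every line. The function names composition_counts and format_line are hypothetical, and the output layout is only assumed from the example above.

```python
from itertools import product

def composition_counts(sequence, k=2, alphabet="ACGT"):
    """Count each k-letter composition of alphabet inside sequence."""
    # a list can be iterated again for every sequence, unlike an open file
    kmers = ["".join(letters) for letters in product(alphabet, repeat=k)]
    # str.count counts non-overlapping matches, as in the original script
    return [sequence.count(kmer) for kmer in kmers]

def format_line(raw_line, k=2):
    """Turn 'SEQUENCE#value' into 'c1,c2,...,value<TAB>SEQUENCE'."""
    sequence, value = raw_line.strip().split("#")
    counts = composition_counts(sequence, k)
    return ",".join(str(count) for count in counts) + "," + value + "\t" + sequence
```

Writing format_line(line) + "\n" for each line of DNA_seq_val.txt would then produce one row of counts per sequence. Calling TET.seek(0) before the inner loop would be an alternative that keeps the file-based layout.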

Return number of alphabetical substrings within input string

I'm trying to generate code to return the number of substrings within an input that are in sequential alphabetical order.
i.e. Input: 'abccbaabccba'
Output: 2
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x):
    for i in range(len(x)):
        for j in range (len(x)+1):
            s = x[i:j+1]
            l = 0
            if s in alphabet:
                l += 1
    return l
print (cake('abccbaabccba'))
So far my code will only return 1. Based on tests I've done on it, it seems it just returns a 1 if there are letters in the input. Does anyone see where I'm going wrong?
You are getting the output 1 every time because your code resets the count to l = 0 on every pass through the loop.
If you fix this, you will get the answer 96, because you are including a lot of redundant checks on empty strings ('' in alphabet returns True).
If you fix that, you will get 17, because your test string contains substrings of length 1 and 2, as well as 3+, that are also substrings of the alphabet. So, your code needs to take into account the minimum substring length you would like to consider—which I assume is 3:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x, minLength=3):
    l = 0
    for i in range(len(x)):
        # carefully specify both the start and end values of the loop that determines where your substring will end;
        # the end value is len(x)+1 so that substrings reaching the last character are included
        for j in range(i+minLength, len(x)+1):
            s = x[i:j]
            if s in alphabet:
                print(repr(s))
                l += 1
    return l
print (cake('abccbaabccba'))

Change Letters in A String One at a Time (Pandas,Python3)

I have a list of words in Pandas (DF)
Words
Shirt
Blouse
Sweater
What I'm trying to do is swap out certain letters in those words with letters from my dictionary one letter at a time.
so for example:
mydict = {"e":"q,w",
"a":"z"}
would create a new list that first replaces all the "e" in a list one at a time, and then iterates through again replacing all the "a" one at a time:
Words
Shirt
Blouse
Sweater
Blousq
Blousw
Swqater
Swwater
Sweatqr
Sweatwr
Swezter
I've been looking around at solutions here: Mass string replace in python?
and have tried the following code, but it changes all instances of "e" instead of doing so one at a time. Any help?
mydict = {"e":"q,w"}
s = DF
for k, v in mydict.items():
    for j in v:
        s['Words'] = s["Words"].str.replace(k, j)
DF["Words"] = s
this doesn't seem to work either:
s = DF.replace({"Words": {"e": "q","w"}})
This answer is very similar to Brian's answer, but a little bit sanitized and the output has no duplicates:
words = ["Words", "Shirt", "Blouse", "Sweater"]
md = {"e": "q,w", "a": "z"}
md = {k: v.split(',') for k, v in md.items()}
newwords = []
for word in words:
    newwords.append(word)
    for c in md:
        occ = word.count(c)
        pos = 0
        for _ in range(occ):
            pos = word.find(c, pos)
            for r in md[c]:
                tmp = word[:pos] + r + word[pos+1:]
                newwords.append(tmp)
            pos += 1
Content of newwords:
['Words', 'Shirt', 'Blouse', 'Blousq', 'Blousw', 'Sweater', 'Swqater', 'Swwater', 'Sweatqr', 'Sweatwr', 'Swezter']
Prettyprint:
Words
Shirt
Blouse
Blousq
Blousw
Sweater
Swqater
Swwater
Sweatqr
Sweatwr
Swezter
Any errors are a result of the current time. ;)
Update (explanation)
tl;dr
The main idea is to find the occurrences of the character in the word one after another. For each occurrence we then substitute the replacement character (again one after another). The replaced word gets added to the output list.
I will try to explain everything step by step:
words = ["Words", "Shirt", "Blouse", "Sweater"]
md = {"e": "q,w", "a": "z"}
Well. Your basic input. :)
md = {k: v.split(',') for k, v in md.items()}
A simpler way to deal with the replacement dictionary. md now looks like {"e": ["q", "w"], "a": ["z"]}. Now we don't have to handle "q,w" and "z" differently: the replacement step is the same for both and ignores the fact that "a" has only one replacement character.
newwords = []
The new list to store the output in.
for word in words:
newwords.append(word)
We have to do those actions for each word (I assume the reason is clear). We also append the word directly to our just-created output list (newwords).
for c in md:
c as short for character. So for each character we want to replace (all keys of md), we do the following stuff.
occ = word.count(c)
occ for occurrences (yeah, count would fit as well :P). word.count(c) returns the number of occurrences of the character/string c in word. So "Sweater".count("o") => 0 and "Sweater".count("e") => 2.
We use this here to know how often we have to look at word to get all those occurrences of c.
pos = 0
Our start position to look for c in word. It comes into use in the next loop.
for _ in range(occ):
For each occurrence. As the running number has no value for us here, we "discard" it by naming it _. At this point we don't know where c is in word. Yet.
pos = word.find(c, pos)
Oh. Look. We found c. :) word.find(c, pos) returns the index of the first occurrence of c in word, starting at pos. At the beginning, this means from the start of the string => the first occurrence of c. But with this call we already update pos. This plus the last line (pos += 1) moves our search window for the next round to start just behind the previous occurrence of c.
for r in md[c]:
Now you see why we updated md previously: we can easily iterate over it now (a md[c].split(',') on the old md would do the job as well). So we now do the replacement for each of the replacement characters.
tmp = word[:pos] + r + word[pos+1:]
The actual replacement. We store it in tmp (for debugging reasons). word[:pos] gives us word up to the (current) occurrence of c (excluding c). r is the replacement. word[pos+1:] adds the remaining word (again without c).
newwords.append(tmp)
Our so created new word tmp now goes into our output-list (newwords).
pos += 1
The already mentioned adjustment of pos to "jump over c".
Additional question from OP: Is there an easy way to dictate how many letters in the string I want to replace [(meaning e.g. multiple at a time)]?
Surely. But I have currently only a vague idea on how to achieve this. I am going to look at it, when I got my sleep. ;)
words = ["Words", "Shirt", "Blouse", "Sweater", "multipleeee"]
md = {"e": "q,w", "a": "z"}
md = {k: v.split(',') for k, v in md.items()}
num = 2 # this is the number of replaces at a time.
newwords = []
for word in words:
    newwords.append(word)
    for char in md:
        for r in md[char]:
            pos = multiples = 0
            current_word = word
            while current_word.find(char, pos) != -1:
                pos = current_word.find(char, pos)
                current_word = current_word[:pos] + r + current_word[pos+1:]
                pos += 1
                multiples += 1
                if multiples == num:
                    newwords.append(current_word)
                    multiples = 0
                    current_word = word
Content of newwords:
['Words', 'Shirt', 'Blouse', 'Sweater', 'Swqatqr', 'Swwatwr', 'multipleeee', 'multiplqqee', 'multipleeqq', 'multiplwwee', 'multipleeww']
Prettyprint:
Words
Shirt
Blouse
Sweater
Swqatqr
Swwatwr
multipleeee
multiplqqee
multipleeqq
multiplwwee
multipleeww
I added multipleeee to demonstrate how the replacement works: for num = 2 the first two occurrences are replaced, and after them the next two, so the replaced parts never overlap. If you wanted something like ['multiplqqee', 'multipleqqe', 'multipleeqq'] instead, you would have to store the position of the "first" occurrence of char. You can then restore pos to that position in the if multiples == num: block.
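The overlapping variant can be sketched as follows. This is an illustration under my own assumptions, not the answer author's code: instead of restoring pos, it collects all positions first and slides a window of num occurrences over them. The function name windowed_replacements is hypothetical.

```python
def windowed_replacements(word, char, replacement, num):
    """Replace every window of num successive occurrences of char in word."""
    positions = [index for index, letter in enumerate(word) if letter == char]
    variants = []
    # slide over the occurrence list one step at a time, so windows overlap
    for start in range(len(positions) - num + 1):
        letters = list(word)
        for pos in positions[start:start + num]:
            letters[pos] = replacement
        variants.append("".join(letters))
    return variants
```

With "multipleeee", "e", "q" and num = 2 this yields the three overlapping variants listed above. A word with fewer than num occurrences yields an empty list.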
If you got further questions, feel free to ask. :)
Because you need to replace letters one at a time, this doesn't sound like a good problem to solve with pandas, since pandas is about doing everything at once (vectorized operations). I would dump out your DataFrame into a plain old list and use list operations:
words = DF.to_dict()["Words"].values()
for find, replace in reversed(sorted(mydict.items())):
    for word in words:
        occurences = word.count(find)
        if not occurences:
            print(word)
            continue
        start_index = 0
        for i in range(occurences):
            for replace_char in replace.split(","):
                modified_word = list(word)
                index = modified_word.index(find, start_index)
                modified_word[index] = replace_char
                modified_word = "".join(modified_word)
                print(modified_word)
            start_index = index + 1
Which gives:
Words
Shirt
Blousq
Blousw
Swqater
Swwater
Sweatqr
Sweatwr
Words
Shirt
Blouse
Swezter
Instead of printing the words, you can append them to a list and re-create a DataFrame if that's what you want to end up with.
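A minimal sketch of that list-building variant, written as a plain function so it does not depend on pandas. The name single_replacements is hypothetical, and the mapping is assumed to use the same comma-separated format as mydict in the question.

```python
def single_replacements(word, mapping):
    """Return every variant of word with exactly one character swapped."""
    variants = []
    for target, replacements in mapping.items():
        for pos, letter in enumerate(word):
            if letter != target:
                continue
            # "q,w" means: try "q" at this position, then "w"
            for replacement in replacements.split(","):
                variants.append(word[:pos] + replacement + word[pos + 1:])
    return variants
```

Collecting the original words plus single_replacements(word, mydict) for each word into one list, and passing that list to the DataFrame constructor under the "Words" column, would give a result like the one the question asks for.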
If you are looping, you need to update s at each cycle of the loop. You also need to loop over v.
mydict = {"e":"q,w"}
s=deduped
for k, v in mydict.items():
    for j in v.split(","):
        s = s.replace(k, j)
Then reassign it to your dataframe:
df["Words"] = s
If you can write this as a function that takes a 1-D array (list, NumPy array, etc.), you can apply it to any column using df.apply().

How do I read a delimited file with strings/numbers with Octave?

I am trying to read a text file containing digits and strings using Octave. The file format is something like this:
A B C
a 10 100
b 20 200
c 30 300
d 40 400
e 50 500
but the delimiter can be space, tab, comma or semicolon. The textread function works fine if the delimiter is space/tab:
[A,B,C] = textread ('test.dat','%s %d %d','headerlines',1)
However, it does not work if the delimiter is a comma or semicolon. I tried to use dlmread:
dlmread ('test.dat',';',1,0)
but it does not work because the first column is a string.
Basically, with textread I can't specify the delimiter and with dlmread I can't specify the format of the first column. Not with the versions of these functions in Octave, at least. Has anybody ever had this problem before?
textread allows you to specify the delimiter; it honors the property arguments of strread. The following code worked for me:
[A,B,C] = textread( 'test.dat', '%s %d %d' ,'delimiter' , ',' ,'headerlines', 1 )
I couldn't find an easy way to do this in Octave currently. You could use fopen() to loop through the file and manually extract the data. I wrote a function that would do this on arbitrary data:
function varargout = coltextread(fname, delim)
  % Initialize the variable output argument
  varargout = cell(nargout, 1);
  % Initialize elements of the cell array to nested cell arrays
  % This syntax is due to {:} producing a comma-separated list
  [varargout{:}] = deal(cell());
  fid = fopen(fname, 'r');
  while true
    % Get the current line
    ln = fgetl(fid);
    % Stop if EOF
    if ln == -1
      break;
    endif
    % Split the line string into components and parse numbers
    elems = strsplit(ln, delim);
    nums = str2double(elems);
    nans = isnan(nums);
    % Special case of all strings (header line)
    if all(nans)
      continue;
    endif
    % Find the indices of the NaNs
    % (i.e. the indices of the strings in the original data)
    idxnans = find(nans);
    % Assign each corresponding element in the current line
    % into the corresponding cell array of varargout
    for i = 1:nargout
      % Detect if the current index is a string or a num
      if any(ismember(idxnans, i))
        varargout{i}{end+1} = elems{i};
      else
        varargout{i}{end+1} = nums(i);
      endif
    endfor
  endwhile
  % Release the file handle once all lines are read
  fclose(fid);
endfunction
It accepts two arguments: the file name, and the delimiter. The function is governed by the number of return variables that are specified, so, for example, [A B C] = coltextread('data.txt', ';'); will try to parse three different data elements from each row in the file, while A = coltextread('data.txt', ';'); will only parse the first elements. If no return variable is given, then the function won't return anything.
The function ignores rows that have all-strings (e.g. the 'A B C' header). Just remove the if all(nans)... section if you want everything.
By default, the 'columns' are returned as cell arrays, although the numbers within those arrays are actually converted numbers, not strings. If you know that a cell array contains only numbers, then you can easily convert it to a column vector with: cell2mat(A)'.
