find a sequence in string - python-3.x

Hi, I've got a problem set in CS50 and I'm having difficulties, as this is my first week in Python. I would appreciate it if you didn't write out a full answer but pointed me to the right functions or methods to use.
We've been given a long DNA sequence in a .txt file: one line, no whitespace. I have to find the longest run of consecutive repeats of a given short string in that DNA sequence.
Example txt:
GGAGGCCAAAGTCTTGTGATATCGGGCAACTCCCCGGGAGGAACACAGGCCCACCGAAAACAGCTTGAAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG
The original text is around 5000 characters, but it looks like the example above. My task is to find the longest run of consecutive 'AGATC' strings.
Let's say the first consecutive run is 23 repeats long; if I keep reading and find another run of 34 repeats, I have to store the bigger number.
My problem is not reading the string. I can read a string, find the total number of repetitions and so on, but finding the longest repetition hasn't worked in any way I've tried. I thought C was hard, but I could write this code in C easily, since we can manipulate strings in so many ways there. At least in C there are ways to read a given size, but as far as I can see Python reads everything at once and there is no control over the read. In Python it doesn't seem like you can do much with it, at least at my level of knowledge at the moment :/ Python probably has a one-line solution for this; please don't judge, this is my 3rd day and 4th program in Python.
What functions or methods should I look at to analyze a string in this way? I've watched videos on a similar thing, but for a sequence of a single character, not a string. I also bought Python Crash Course to learn about string manipulation but couldn't find anything related to this case. I also checked the Python documentation, but obviously it's too complicated for day 3 in Python.
Could anyone help me, please? TIA
Here is my not-working and not-making-sense code:
import csv
import sys
#check the arguments count
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(1)
#create a dictionary to store str results
SEQ = {
    "AGATC": 0,
    "AATG": 0,
    "TATC": 0
}
counter = 0 #keeps the length of the sequence
seq = 0 #keeps the longest sequence
DNA = '' ## keeps the key of SEQ, "AGATC" etc.
#find the longest consecutive sequence of DNA
def findSEQ(file, DNA): #get the sequences text file and the string of the key as parameters
    for DNA in (DNA, file):
        if file[i:i + len(DNA)] == DNA: #if find a match
            counter += 1 #count up the sequence
        else:
            if counter > seq: #if it's not a sequence the next thing it reads
                seq = counter
                counter = 0
    return seq
seq = 0
#open sequence file and read
with open(sys.argv[2],'r') as file:
    reader = csv.reader(file)
    #find the longest sequence of AGATC
    findSEQ("AGATC", file)
    #update the seq dictionary
    SEQ["AGATC"] = seq
    #find the longest sequence of AATG
    findSEQ(file, "AATG")
    #update the seq dictionary
    SEQ["AATG"] = seq
    #find the longest sequence of TATC
    findSEQ(file, "TATC")
    #update the seq dictionary
    SEQ["TATC"] = seq
#open and read database
with open(sys.argv[1], "r") as file:
    reader = csv.reader(file)
    #skip the first row
    next(reader)
    #compare the seq dictionary results with database
    for row in reader:
        seq1, seq2, seq3 = row[1], row[2], row[3]
        #if found any match print the name
        if SEQ[seq1] == row[1] and SEQ[seq2] == row[2] and SEQ[seq3] == row[3]:
            print(row[0])
        #otherwise print not found
        else:
            print("Not found any match.")

To elaborate on my comment, please find the following example:
import re
text = 'GGAGGCCAAGATCAAGTCTTGTGATATCGGGCAACTCCCCGGGAAGATCAGATCAGATCGGAACACAGGCCCACCGAAAACAGCTTGAAGATCAATGGGAAACGTTCCCGATCTACGCCGGGCCAGAGG'
sequence = 'AGATC'
pattern = f'(?:{sequence})+'
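# every run of one or more back-to-back copies of `sequence`, shortest first
findings = sorted(re.findall(pattern, text), key=len)
# integer division gives a whole repeat count (3, not 3.0); 0 if nothing matched
longest_sequence = len(findings[-1]) // len(sequence) if findings else 0
print(f'longest sequence: {longest_sequence}')
This program uses regex (regular expressions) to find runs of the pattern you're looking for. It then sorts the findings by length (in ascending order), so the longest run is the last item in the list. Dividing the length of that run by the length of the pattern gives the number of consecutive repeats.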
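For comparison, here is a minimal sketch of the same idea without regex, in case a plain loop is easier to follow in week one. The function name longest_run and the short sample string are my own, not part of the problem set. It tries every start position and counts how many back-to-back copies of the unit begin there, using str.startswith with a start offset.

```python
def longest_run(text, unit):
    """Return the largest number of back-to-back copies of `unit` in `text`."""
    if not unit:
        return 0  # an empty unit would otherwise match forever
    best = 0
    step = len(unit)
    for start in range(len(text)):
        count = 0
        # Keep stepping forward one unit at a time while copies keep matching.
        while text.startswith(unit, start + count * step):
            count += 1
        best = max(best, count)
    return best


# Hypothetical sample: a run of 2 repeats, then a run of 3 repeats.
sample = "XXAGATCAGATCXAGATCAGATCAGATCXX"
print(longest_run(sample, "AGATC"))
```

Checking every start position is slower than skipping ahead after a run, but for a text of a few thousand characters it is fast enough and avoids missing runs that begin inside an earlier partial match.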

Related

How to count strings in specified field within each line of one or more csv files

I am writing a Python 3 program to count strings in a specified field within each line of one or more csv files.
Where the csv file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
the script is executed, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
Where the error occurs:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos] ##<---LINE generating error--->##
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
Judging from the output, I'd say that the fields in the csv file do not influence the count of the string. If string uniqueness is case-insensitive, remember to use yourstring.lower() so that matches differing only in case are counted as one. Also keep in mind that if your text is large, the number of unique strings could be very large as well, so some sort of sorting must be in place to make sense of it! (Otherwise it might be a long list of random counts, with a large portion of them being just 1s.)
Now, an easy way to get a count of unique strings is to use the collections module.
file = open('yourfile.txt', encoding="utf8")
a = file.read()
#if you have some words you'd like to exclude
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['<media','omitted>','it\'s','two','said']))
# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split(): #use the delimiter you want (a comma I think?)
    # replace punctuation so they arent counted as part of a word
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace("\"","")
    word = word.replace("!","")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict should contain each word and its frequency. After that, just sort it using collections and print it out.
import collections
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
I hope this solves your problem. Let me know if you face problems.
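On the TypeError itself: a likely cause, assuming field is built by splitting the command-line argument "1,2,3,4", is that its items are still strings, and a list cannot be indexed with a string. The sketch below converts each position with int() before indexing. The function name count_fields, the 1-based field numbering, and the in-memory sample are my assumptions, since the full script is not shown.

```python
import csv
import io
from collections import Counter


def count_fields(field_spec, sources):
    """Count the values found in the given 1-based fields of each CSV source."""
    # "1,2,3,4" -> [0, 1, 2, 3]; int() is the conversion the failing line lacked.
    positions = [int(p) - 1 for p in field_spec.split(",")]
    counts = Counter()
    for source in sources:
        # skipinitialspace drops the blank after each comma in "A, B, C, D".
        for row in csv.reader(source, skipinitialspace=True):
            for pos in positions:
                if pos < len(row):  # ignore rows that are too short
                    counts[row[pos].strip()] += 1
    return counts


# The data rows from the question, held in memory instead of file.csv.
sample = "A, B, C, D\nA, E, F, G\nZ, E, C, D\nZ, W, C, Q\n"
counts = count_fields("1,2,3,4", [io.StringIO(sample)])
for value, n in counts.most_common():
    print(value, n)
```

With real files, each entry in sources would be an open file object, for example one opened with open(name, newline="") per file name given on the command line.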

How can I get the sum of all combined word lengths to print as a single value?

I am having trouble computing the total of all word lengths. I have tried various methods; the count per word is correct, but I just can't get a total.
file=open(r"sheSaid.txt","r+")
from collections import Counter
wordcount = Counter(file.read().split())
for word in file.read().split(' '):
    word = word.rstrip(".""',?!")
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
#print (word.rstrip(".""..',?!"),wordcount)
for item in wordcount.items(): print("{}\t{}".format(*item))
wordCount = len(wordcount)
#can count all word lengths just fine
print ("The total word count is:", word, wordCount) # when I use Len() Or #Count I cannot get the sum of all wordCounts?
print ("The total length of all words are:","total_of_all_word_counts?")
#I need the sum to complete this
print ("The avg length is:", wordcount/total_of_all_word_counts?)
file.close();
Something like this will do. It's not entirely clear what you are looking for, but this counts the words in the text file, calculates the total character count of all words, and then the average number of characters per word:
file = open(r"sheSaid.txt","r+")
file_contents = file.read().split() # Reads the file and split by spaces to create a list of words
file.close() # Always good idea to close the file after done with it
words_count = len(file_contents)
characters_count = len(''.join(file_contents))
average_characters = characters_count / words_count
You may want to make an extra effort and handle some other cases, like removing characters such as ! and . from the words. That will be straightforward. This is just a starting point to build upon; only you know the corner cases and what will be in the text file.
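As a rough sketch of that extra step, the version below strips punctuation from both ends of each word before measuring it. The function name word_length_stats and the sample sentence are made up for illustration; with a real file you would pass in the result of file.read().

```python
import string


def word_length_stats(text):
    """Return (word count, total characters, average word length) for `text`."""
    # Strip leading/trailing punctuation such as quotes, commas and "!".
    words = [w.strip(string.punctuation) for w in text.split()]
    # Drop anything that was only punctuation and is now empty.
    words = [w for w in words if w]
    total = sum(len(w) for w in words)
    average = total / len(words) if words else 0.0
    return len(words), total, average


count, total, average = word_length_stats('She said, "Hello there!"')
print("The total word count is:", count)
print("The total length of all words is:", total)
print("The avg length is:", average)
```

Note that str.strip only touches the ends of a word, so an apostrophe inside a word like "don't" is kept and counted.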

Return number of alphabetical substrings within input string

I'm trying to generate code to return the number of substrings within an input that are in sequential alphabetical order.
i.e. Input: 'abccbaabccba'
Output: 2
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x):
    for i in range(len(x)):
        for j in range (len(x)+1):
            s = x[i:j+1]
            l = 0
            if s in alphabet:
                l += 1
    return l
print (cake('abccbaabccba'))
So far my code will only return 1. Based on tests I've done on it, it seems it just returns a 1 if there are letters in the input. Does anyone see where I'm going wrong?
You are getting the output 1 every time because your code resets the count to l = 0 on every pass through the loop.
If you fix this, you will get the answer 96, because you are including a lot of redundant checks on empty strings ('' in alphabet returns True).
If you fix that, you will get 17, because your test string contains substrings of length 1 and 2, as well as 3+, that are also substrings of the alphabet. So, your code needs to take into account the minimum substring length you would like to consider—which I assume is 3:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x, minLength=3):
    l = 0
    for i in range(len(x)):
        # carefully specify both the start and end values of the loop that determines where your substring will end;
        # the end is len(x)+1 so that substrings reaching the last character are checked too
        for j in range(i+minLength, len(x)+1):
            s = x[i:j]
            if s in alphabet:
                print(repr(s))
                l += 1
    return l
print (cake('abccbaabccba'))
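A different reading of the question is to count maximal alphabetical runs rather than every qualifying substring, so that 'abcd' counts once instead of three times ('abc', 'bcd', 'abcd'). The sketch below does that in a single pass. The name count_alphabetical_runs is hypothetical, and it assumes lowercase input like the example; whether this interpretation is the one you want depends on the assignment.

```python
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'


def count_alphabetical_runs(text, min_length=3):
    """Count maximal runs of consecutive alphabet letters of at least min_length."""
    runs = 0
    run_length = 1 if text else 0
    for prev, cur in zip(text, text[1:]):
        if prev + cur in ALPHABET:
            # cur directly follows prev in the alphabet, so the run continues.
            run_length += 1
        else:
            if run_length >= min_length:
                runs += 1
            run_length = 1
    # Account for a run that reaches the end of the string.
    if run_length >= min_length:
        runs += 1
    return runs


print(count_alphabetical_runs('abccbaabccba'))
```

For the example input both interpretations agree, because neither run extends past three letters.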

Python calling class variable in forloop

I am new to Python, and I am confused by the for loop usage below. Can anyone please help me understand how the class is used in this for loop?
import sys
def checkline():
    glb.linecount += 1
    w = glb.l.split()
    glb.wordcount += len(w)
class glb:
    linecount = 0
    wordcount = 0
    l = []
f = open('Untitled9.ipynb','r')
for glb.l in f.readlines(): #what glb.l exactly does?
    checkline()
print(glb.linecount, glb.wordcount)
This entire program counts the lines and words in a file. Specifically, the for loop assigns each line of the file to the class attribute glb.l in turn, so that checkline() can split that line and count its words.
Let me pseudocode it for you.
Open the file `Untitled9.ipynb` for reading. // f
For each line in the file:
Store the line in glb.l. // the loop target is the class attribute, so each pass overwrites glb.l
Call checkline(), which does the next two steps.
Add one to the line count.
Add the number of whitespace-separated words in the line to the word count. // the length of glb.l.split()
Print the line count and the word count.
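To see the attribute-as-loop-target behaviour in isolation, here is a small sketch that uses a list of strings in place of the file. The two sample lines are made up; the class and function are the ones from the question.

```python
class glb:
    linecount = 0
    wordcount = 0
    l = ""


def checkline():
    # Reads whatever line the for loop most recently stored in glb.l.
    glb.linecount += 1
    glb.wordcount += len(glb.l.split())


lines = ["first line here\n", "second\n"]  # stand-in for f.readlines()
for glb.l in lines:  # each pass assigns the next item to the attribute glb.l
    checkline()

print(glb.linecount, glb.wordcount)
print(repr(glb.l))  # after the loop, glb.l still holds the last line
```

Any assignable target works after the for keyword, not just a plain variable name. Here the class is effectively being used as a container for global state, which is why checkline() takes no arguments.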

Python Spell Checker Linear Search

I'm learning Python and one of the labs requires me to import a list of words to serve as a dictionary, then compare that list of words to some text that is also imported. This isn't for a class, I'm just learning this on my own, or I'd ask the teacher. I've been hung up on how to convert that imported text to uppercase before making the comparison.
Here is the URL to the lab: http://programarcadegames.com/index.php?chapter=lab_spell_check
I've looked at the posts/answers below and some youtube videos and I still can't figure out how to do this. Any help would be appreciated.
Convert a Python list with strings all to lowercase or uppercase
How to convert upper case letters to lower case
Here is the code I have so far:
# Chapter 16 Lab 11
import re
# This function takes in a line of text and returns
# a list of words in the line.
def split_line(line):
    return re.findall('[A-Za-z]+(?:\'[A-Za-z]+)?',line)
dfile = open("dictionary.txt")
dictfile = []
for line in dfile:
    line = line.strip()
    dictfile.append(line)
dfile.close()
print ("--- Linear Search ---")
afile = open("AliceInWonderLand200.txt")
for line in afile:
    words = []
    line = split_line(line)
    words.append(line)
    for word in words:
        lineNumber = 0
        lineNumber += 1
        if word != (dictfile):
            print ("Line ",(lineNumber)," possible misspelled word: ",(word))
afile.close()
Like the lab says, you use .upper():
dictfile = []
for line in dfile:
    line = line.strip()
    dictfile.append(line.upper()) # <- here.
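The same .upper() call can be applied to each word from the text just before the comparison, so both sides are in the same case. Below is a minimal sketch of the whole linear search under that assumption. The tiny dictionary and the two text lines are invented stand-ins for dictionary.txt and AliceInWonderLand200.txt, and linear_search is a hypothetical helper name.

```python
import re


def split_line(line):
    """Return a list of the words in a line of text."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", line)


def linear_search(items, target):
    """Return True if target equals any element of items."""
    for item in items:
        if item == target:
            return True
    return False


# Stand-ins for the two files used in the lab.
dictionary = [w.upper() for w in ["alice", "was", "beginning", "to", "get", "very", "tired"]]
text_lines = ["Alice was beginning", "to get very tyred"]

misspelled = []
for line_number, line in enumerate(text_lines, start=1):
    for word in split_line(line):
        # Uppercase the word so it matches the uppercased dictionary entries.
        if not linear_search(dictionary, word.upper()):
            misspelled.append((line_number, word))
            print("Line", line_number, "possible misspelled word:", word)
```

Using enumerate keeps the line number counting across the whole file, and looping over the list returned by split_line checks one word at a time instead of comparing a whole list against the dictionary.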
