I want to extract 011700 from the line below (these are 6-digit codes), without the semicolon; later I will look up a value for it in a dict.
How do I extract only 011700 (or 6 digit number from that line)?
And how to print it as a 6 digit number - instead of printing it like ['011700']?
Thanks.
import re
line = "N 011700; 3;20:34:00:02:ac:07:e9:d5;2f:f7:00:02:ac:07:e9:d5; 3333"
line_list = line.split()
print(line_list)
result = (re.findall('\\d+', line))
print(result)
Here's how I would go about modifying your current code.
First, I would specify that you are trying to split the string by semicolons, by changing your split line to:
line_list = line.split(";")
Then I would trim off any whitespace, which you could do with a second line like:
line_list = [l.strip() for l in line_list]
(or by combining them like)
line_list = [l.strip() for l in line.split(";")]
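Then I would loop through the list. Note that the first field is "N 011700", which is longer than six characters, so check the last whitespace-separated token of each field rather than the whole field:
for l in line_list:
    parts = l.split()
    if parts and len(parts[-1]) == 6 and parts[-1].isdigit():
        result = parts[-1]
        break
And if you want the result to be an actual number and not just a string of the number, change the assignment to the line below. Be aware that int drops the leading zero (011700 becomes 11700), so keep the string if you need all six digits:
result = int(parts[-1])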
Altogether that would look like this:
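line = "N 011700; 3;20:34:00:02:ac:07:e9:d5;2f:f7:00:02:ac:07:e9:d5; 3333"
line_list = [l.strip() for l in line.split(";")]
result = None  # stays None if no six-digit code is found
for l in line_list:
    parts = l.split()
    if parts and len(parts[-1]) == 6 and parts[-1].isdigit():
        result = parts[-1]
        break
print(line_list)
print(result)
result now contains the first six-digit code found in your original string. It is kept as a string, so the leading zero survives and it prints as 011700 rather than ['011700'].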
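Since the question already imports re, here is a minimal regex sketch of the same extraction. The names dict and its contents are made up for illustration; only the line itself comes from the question.

```python
import re

line = "N 011700; 3;20:34:00:02:ac:07:e9:d5;2f:f7:00:02:ac:07:e9:d5; 3333"

# \b\d{6}\b matches a run of exactly six digits with word boundaries on both sides.
match = re.search(r"\b\d{6}\b", line)
code = match.group(0) if match else None
print(code)

# Hypothetical lookup table, keyed by the code as a string so the leading zero is kept.
names = {"011700": "example value"}
print(names.get(code, "unknown"))
```

re.search returns a single match object rather than a list, which is why printing code shows the bare digits instead of a one-element list.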
I'm working on a project that generates a word list. I have a word and some rules: the character % stands for a digit and ^ stands for a special character. For example, January%%^ should create things like:
January00!
January01!
January02!
January03!
January04!
January05!
January06!
etc.
For now I'm trying to do it with only digit and create a recursive function, because people can add as many digits and special characters as they want
January^%%%^% (for example)
This is the first function I have created:
month = "January"
nbDigit = "%%%"
def addNumber(month : list, position: int):
    for i in range(position, len(month)):
        for j in range(0,10):
            month[position] = j
            if(position == len(month)-1):
                print (''.join(str(v) for v in month))
            if position < len(month):
                if month[position+1] == "%":
                    addNumber(month, position+1)
The problem is for each % that I have there is another output (three %, three times as output January000-January999/January000-January999/January000-January999).
When I tried to add the new function special character it's even worse, because I can't manage the output since every word can't end with a special character or digit. (AddSpecialChar is also a recursive function).
I believe what you are looking for is the following:
month = 'January'
nbDigit = "%%"
def addNumbers(root: str, mask: str)-> list:
    # create a list of words using root followed by digits
    rslt = []
    mxNmb = 0
    for i in range(len(mask)):
        mxNmb += 9 * 10**i
    mxNmb += 1
    for i in range(mxNmb):
        word = f"{root}{((str(i).rjust(len(mask), '0')))}"
        rslt.append(word)
    return rslt
Calling addNumbers(month, nbDigit) will produce:
['January00',
'January01',
'January02',
'January03',
'January04',
'January05',
'January06',
'January07',
'January08',
'January09',
'January10',
'January11',
'January12',
'January13',
'January14',
'January15',
'January16',
'January17',
'January18',
'January19',
'January20',
'January21',
'January22',
'January23',
'January24',
'January25',
'January26',
'January27',
'January28',
'January29',
'January30',
'January31',
'January32',
'January33',
'January34',
'January35',
'January36',
'January37',
'January38',
'January39',
'January40',
'January41',
'January42',
'January43',
'January44',
'January45',
'January46',
'January47',
'January48',
'January49',
'January50',
'January51',
'January52',
'January53',
'January54',
'January55',
'January56',
'January57',
'January58',
'January59',
'January60',
'January61',
'January62',
'January63',
'January64',
'January65',
'January66',
'January67',
'January68',
'January69',
'January70',
'January71',
'January72',
'January73',
'January74',
'January75',
'January76',
'January77',
'January78',
'January79',
'January80',
'January81',
'January82',
'January83',
'January84',
'January85',
'January86',
'January87',
'January88',
'January89',
'January90',
'January91',
'January92',
'January93',
'January94',
'January95',
'January96',
'January97',
'January98',
'January99']
Adding another position to the nbDigit variable (nbDigit = "%%%") will produce the numeric sequence from 000 to 999.
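For patterns that mix % and ^ in any position, such as January^%%%^%, one option is itertools.product instead of recursion. This is only a sketch: the expand function and the SPECIALS set are assumptions for illustration, not something from the question.

```python
from itertools import product

DIGITS = "0123456789"
SPECIALS = "!@#$"  # assumed set of special characters; adjust as needed


def expand(pattern: str) -> list:
    """Expand % into every digit and ^ into every special character."""
    choices = []
    for ch in pattern:
        if ch == "%":
            choices.append(DIGITS)
        elif ch == "^":
            choices.append(SPECIALS)
        else:
            choices.append(ch)  # a literal character is a single choice
    # product yields one tuple of characters per combination
    return ["".join(combo) for combo in product(*choices)]


words = expand("January%%^")
print(len(words))  # 10 * 10 * len(SPECIALS) combinations
print(words[:3])
```

Because every position contributes exactly one list of choices, each word is generated once, which avoids the repeated output described in the question.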
I am currently having some issues trying to append strings into a new list. However, when I get to the end, my list looks like this:
['MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH\nIIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT\nCGLANMFAYHTLGYEDLDELQKEPQPLVFVIELLQVDAPSDYQRETWNLSNHEKMKAVPV\nLHGEGNRLFKLGRYEEASSKYQEAIICLRNLQTKEKPWEVQWLKLEKMINTLILNYCQCL\nLKKEEYYEVLEHTSDILRHHPGIVKAYYVRARAHAEVWNEAEAKADLQKVLELEPSMQKA\nVRRELRLLENRMAEKQEEERLRCRNMLSQGATQPPAEPPTEPPAQSSTEPPAEPPTAPSA\nELSAGPPAEPATEPPPSPGHSLQH\n']
I'd like to remove the newlines somehow. I looked at other questions on here and most suggest using .rstrip; however, after adding that to my code I get the same output. What am I missing here? Apologies if this question has been asked.
My input also looks like this(took the first 3 lines):
>sp|Q9NZN9|AIPL1_HUMAN Aryl-hydrocarbon-interacting protein-like 1 OS=Homo sapiens OX=9606 GN=AIPL1 PE=1 SV=2
MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH
IIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT
from sys import argv
protein = argv[1] #fasta file
sequence = '' #string linker
get_line = False #False = not the sequence
Uniprot_ID = []
sequence_list =[]
with open(protein) as pn:
    for line in pn:
        line.rstrip("\n")
        if line.startswith(">") and get_line == False:
            sp, u_id, name = line.strip().split('|')
            Uniprot_ID.append(u_id)
            get_line = True
            continue
        if line.startswith(">") and get_line == True:
            sequence.rstrip('\n')
            sequence_list.append(sequence) #add the amino acids onto the list
            sequence = '' #resets the str
        if line != ">" and get_line == True: #if the first line is not a fasta ID and is it a sequence?
            sequence += line
print(sequence_list)
Per documentation, rstrip removes trailing characters – the ones at the end. You probably misunderstood others' use of it to remove \ns because typically those would only appear at the end.
To replace a character with something else in an entire string, use replace instead.
These commands do not modify your string! They return a new string, so if you want to change something 'in' a current string variable, assign the result back to the original variable:
>>> line = 'ab\ncd\n'
>>> line.rstrip('\n')
'ab\ncd' # note: this is the immediate result, which is not assigned back to line
>>> line = line.replace('\n', '')
>>> line
'abcd'
When I asked this question I didn't take the time to read the documentation and understand my code. After looking, I realized two things:
My code isn't actually getting what I am interested in.
For the specific question I asked, I could have simply used line.strip() to remove the '\n'.
from sys import argv
protein = argv[1] #fasta file, as in the question
Uniprot_ID = []
sequence_list = []
sequence = '' #string linker
get_line = False #False = not the sequence
uni_seq = {}
"""this block of code takes a uniprot FASTA file and creates a
dictionary with the key as the uniprot id and the value as a sequence"""
with open (protein) as pn:
    for line in pn:
        if line.startswith(">"):
            if get_line == False:
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                get_line = True
            else:
                # a new header: store the sequence collected for the previous ID
                uni_seq[u_id] = sequence
                sequence_list.append(sequence)
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                sequence = ''
        else:
            if get_line == True:
                sequence += line.strip() # removes the newline space
# store the last record, which has no following header to trigger it
uni_seq[u_id] = sequence
sequence_list.append(sequence)
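The same idea can be sketched as a small function that works on any iterable of lines (an open file included). The name read_fasta and the shortened example sequences are illustrative assumptions; the header layout ">sp|ID|NAME ..." is taken from the input shown in the question.

```python
def read_fasta(lines):
    """Map each UniProt ID to its sequence with the newlines removed."""
    sequences = {}
    current_id = None
    for line in lines:
        line = line.strip()  # strip returns a new string, so reassign it
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumes a header like ">sp|Q9NZN9|AIPL1_HUMAN ..."
            current_id = line.split("|")[1]
            sequences[current_id] = []
        elif current_id is not None:
            sequences[current_id].append(line)
    # Join the collected pieces once per record
    return {uid: "".join(parts) for uid, parts in sequences.items()}


example = [
    ">sp|Q9NZN9|AIPL1_HUMAN Aryl-hydrocarbon-interacting protein-like 1\n",
    "MDAALLLNVEGVKK\n",  # shortened sequence lines for illustration
    "IIIGNMFKLEVWEI\n",
]
print(read_fasta(example))
```

With a real file this could be called as read_fasta(open(protein)), assuming every header contains at least two | separators.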
stackoverfollowers!
I have a task that I can't quite finish:
Write a function words(a, b, txt)
txt = ['All in the golden afternoon\nFull leisurely we glide;\nFor both our oars, with little skill,\nBy little arms are plied,\nWhile little hands make vain pretence\nOur wanderings to guide.']
a = 6
b = 8
The function should return, for each line, all the words that are 6 to 8 letters long. If a line doesn't have any such words, it returns an empty string for that line. If a line has more than one such word, they should appear in the same order as in the line.
Function words(a,b,txt) should return
['golden', '', 'little','little', 'little pretence', '']
I have written code like this:
def noalpha(s):
    noa = '' # choose all non-alphabetic symbols
    for c in s:
        if not (c in noa or c.isalpha()):
            noa += c
    return noa
def words(a,b,txt):
    lst = []
    for i in txt: # work with a whole text that is one element in list txt
        i = i.splitlines() # split text in lines \n
        for s in i: # iteration in lines
            s = s.split()
            for w in s: # iteration in words
                w = w.replace(noalpha(w), '')
                if a <= len(w) <= b:
                    lst.append(w)
    return lst
So I can't find a way to:
return '' (an empty string) for a whole line that doesn't contain words of the required length
return the words like 'word1 word2 word3' when a line contains more than one of them
Something like this?
def alpha(word):
    return ''.join(char for char in word if char.isalpha())
result = []
for line in txt[0].splitlines():
    words = [alpha(word) for word in line.split()]
    result.append(' '.join(word for word in words if a <= len(word) <= b))
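Wrapped in the words(a, b, txt) signature the task asks for, the same approach might look like the sketch below. It loops over every element of txt rather than only txt[0], which is an assumption about how the function will be called.

```python
def alpha(word):
    # Keep only the alphabetic characters of a word.
    return "".join(char for char in word if char.isalpha())


def words(a, b, txt):
    result = []
    for text in txt:
        for line in text.splitlines():
            cleaned = [alpha(word) for word in line.split()]
            # One string per line; it is empty when nothing matches.
            result.append(" ".join(w for w in cleaned if a <= len(w) <= b))
    return result


txt = ["All in the golden afternoon\nFull leisurely we glide;\n"
       "For both our oars, with little skill,\nBy little arms are plied,\n"
       "While little hands make vain pretence\nOur wanderings to guide."]
print(words(6, 8, txt))
# ['golden', '', 'little', 'little', 'little pretence', '']
```

Joining the filtered words of each line handles both cases from the question: a line with no matches joins an empty sequence and gives '', and a line with several matches gives them separated by spaces in their original order.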
I have a file like this:
NA|polymerase|KC545393|Bundibugyo_ebolavirus|EboBund_112_2012|NA|2012|Human|Democratic_Republic_of_the_Congo
NA|VP24|KC545393|Bundibugyo_ebolavirus|EboBund_112_2012|NA|2012|Human|Democratic_Republic_of_the_Congo
NA|VP30|KC545393|Bundibugyo_ebolavirus|EboBund_112_2012|NA|2012|Human|Democratic_Republic_of_the_Congo
I am trying to print these fields from each line:
polymerase|KC545393
VP24|KC545393
VP30|KC545393
How can I do this?
I tried this code:
for character in line:
    if character=="|":
        print line[1:i.index(j)]
Use str.split() to split each line by the '|' character; you can limit the splitting because you only need the first 3 columns:
elems = line.split('|', 3)
print('|'.join(elems[1:3]))
The print line then takes the elements at index 1 and 2 and joins them together again using the '|' character to produce your desired output.
Demo:
>>> lines = '''\
... NA|polymerase|KC545393|Bundibugyo_ebolavirus|EboBund_112_2012|NA|2012|Human|Democratic_Republic_of_the_Congo
... NA|VP24|KC545393|Bundibugyo_ebolavirus|EboBund_112_2012|NA|2012|Human|Democratic_Republic_of_the_Congo
... NA|VP30|KC545393|Bundibugyo_ebolavirus|EboBund_112_2012|NA|2012|Human|Democratic_Republic_of_the_Congo
... '''.splitlines(True)
>>> for line in lines:
...     elems = line.split('|', 3)
...     print('|'.join(elems[1:3]))
...
polymerase|KC545393
VP24|KC545393
VP30|KC545393
Assuming you know that each line has at least two separators, you can use:
>>> s = 'this|is|a|string'
>>> s
'this|is|a|string'
>>> s[:s.find('|',s.find('|')+1)]
'this|is'
This finds the first | starting at the character position beyond the first | (i.e., it finds the second |), then gives you the substring up to but not including that point.
If it may not have two separators, you just have to be more careful:
s = 'blah blah'
result = s
if s.find('|') >= 0:
    if s.find('|',s.find('|')+1) >= 0:
        result = s[:s.find('|',s.find('|')+1)]
If that's the case, you'll probably want it in a more general-purpose function, something like:
def substringUpToNthChar(str,n,ch):
    if n < 1: return ""
    pos = -1
    while n > 0:
        pos = str.find(ch,pos+1)
        if pos < 0: return str
        n -= 1
    return str[:pos]
This will correctly handle the case where there are fewer separators than desired, and will also handle (relatively elegantly) getting more than the first two fields.
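To make those cases concrete, here is a sketch of the same logic with a few calls. The snake_case name and the parameter name text are my own choices (the latter avoids shadowing the built-in str); the behavior is intended to match the function above.

```python
def substring_up_to_nth(text, n, sep):
    """Return text up to, but not including, the n-th occurrence of sep."""
    if n < 1:
        return ""
    pos = -1
    for _ in range(n):
        pos = text.find(sep, pos + 1)
        if pos < 0:
            return text  # fewer than n separators: return everything
    return text[:pos]


print(substring_up_to_nth("this|is|a|string", 2, "|"))  # this|is
print(substring_up_to_nth("this|is|a|string", 3, "|"))  # this|is|a
print(substring_up_to_nth("blah blah", 2, "|"))         # blah blah
```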
So far, I have this:
def main():
    bad_filename = True
    l =[]
    while bad_filename == True:
        try:
            filename = input("Enter the filename: ")
            fp = open(filename, "r")
            for f_line in fp:
                a=(f_line)
                b=(f_line.strip('\n'))
                l.append(b)
                print (l)
            bad_filename = False
        except IOError:
            print("Error: The file was not found: ", filename)
main()
This is my program, and when I run it this is what I get:
['1,2,3,4,5']
['1,2,3,4,5', '6,7,8,9,0']
['1,2,3,4,5', '6,7,8,9,0', '1.10,2.20,3.30,0.10,0.30']
but instead I need to get
[1,2,3,4,5]
[6,7,8,9,0.00]
[1.10,2.20,3.30,0.10,0.30]
Each line of the file is a series of numbers separated by commas, but to Python they are just characters. You need one more conversion step to get your string into a list. First split on commas to create a list of strings, each of which is a number. Then use what is called a "list comprehension" (or a for loop) to convert each string into a number:
b = f_line.strip('\n').split(',')
c = [float(v) for v in b]
l.append(c)
If you really want to reset the list each time through the loop (your desired output shows only the last line) then instead of appending, just assign the numerical list to l:
b = f_line.strip('\n').split(',')
l = [float(v) for v in b]
List comprehension is a shorthand way of saying:
l = []
for v in b:
    l.append(float(v))
You don't need the variable a, or the extra parentheses around the values assigned to a and b.
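Putting the pieces together, a minimal sketch of the conversion step could look like this. The function name parse_number_lines is made up for illustration, and it takes any iterable of lines so it can be tried without a file on disk.

```python
def parse_number_lines(lines):
    """Turn lines such as '1,2,3' into lists of floats, one list per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        rows.append([float(v) for v in line.split(",")])
    return rows


sample = ["1,2,3,4,5\n", "6,7,8,9,0\n", "1.10,2.20,3.30,0.10,0.30\n"]
for row in parse_number_lines(sample):
    print(row)
```

Inside main() the same function could be given the open file object, since iterating over a file yields its lines. Note that float prints 1.10 as 1.1; if the two-decimal display matters, format the values when printing rather than when parsing.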