Text parsing with specific element Python - python-3.x

I have an alignment result with multiple sequences in a text file. I want to split each result into its own text file. So far I can detect each sequence by its '>' and split the input into files. However, the new text files are written without the line that contains '>'.
with open("result.txt",'r') as fo:
    start=0
    op= ' '
    cntr=1
    # print(fo.readlines())
    for x in fo.readlines():
        # print(x)
        if (x[0]== '>'):
            if (start==1):
                with open(str(cntr)+'.txt','w') as opf:
                    opf.write(op)
                    opf.close()
                    op= ' '
                    cntr+=1
            else:
                start=1
        else:
            if (op==''):
                op=x
            else:
                op= op + '\n' + x
    fo.close()
print('completed')
>P51051.1 RecName: Full=Melatonin receptor type 1B; Short=Mel-1B-R; Short=Mel1b
receptor [Xenopus laevis]
Length=152
This is how I want each text file to begin, but instead they start like this:
receptor [Xenopus laevis]
Length=152
How can I include the header line at the beginning of each file?

You can do it like this:
with open("result.txt", encoding='utf-8') as fo:
    for index, txt in enumerate(fo.read().split(">")):
        if txt:
            with open(f'{index}.txt', 'w') as opf:
                opf.write(txt)
You should provide the encoding of the file, e.g. utf-8. There is no need to specify the read mode 'r', since it is the default, and there is no need to close the file if you are using a context manager (i.e. with). Use read instead of readlines to get the whole file as one string, then call split on that string. I'm using enumerate to get a counter alongside each piece of text, and an f-string because it is a cleaner way to build the file name than string concatenation.
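Note that split(">") drops the '>' character itself, so a variant that keeps it may be closer to what the question asks for. The following is only a minimal sketch, assuming every record starts with '>' at the beginning of a line. The helper names split_records and write_records and the out_dir parameter are made up for illustration and do not come from the question.

```python
from pathlib import Path


def split_records(text):
    """Return one string per record, each starting with its '>' header line."""
    records = []
    current = []
    for line in text.splitlines(keepends=True):
        # A line starting with '>' closes the previous record, if any.
        if line.startswith(">") and current:
            records.append("".join(current))
            current = []
        # Ignore anything that appears before the first '>' header.
        if current or line.startswith(">"):
            current.append(line)
    if current:
        records.append("".join(current))
    return records


def write_records(src, out_dir="."):
    """Write each record of `src` to 1.txt, 2.txt, ... inside `out_dir`."""
    text = Path(src).read_text(encoding="utf-8")
    for index, record in enumerate(split_records(text), start=1):
        Path(out_dir, f"{index}.txt").write_text(record, encoding="utf-8")
```

Calling write_records("result.txt") would then write 1.txt, 2.txt and so on, each beginning with its own '>' line.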

Related

How to modify and print list items in python?

I am a beginner in Python working on a small piece of logic. I have a text file with HTML links in it, one per line. I have to read each line of the file and print each link with the same prefix and suffix,
so that the output looks like this:
<item>LINK1</item>
<item>LINK2</item>
<item>LINK3</item>
and so on.
I have tried this code, but something is wrong in my approach,
def file_read(fname):
    with open(fname) as f:
        #Content_list is the list that contains the read lines.
        content_list = f.readlines()
        for i in content_list:
            print(str("<item>") + i + str("</item>"))
file_read(r"C:\Users\mandy\Desktop\gd.txt")
In the output the suffix is not where I expected it. As I am a beginner, can anyone sort this out for me?
<item>www.google.com
</item>
<item>www.bing.com
</item>
I think when you use .readlines() the end-of-line character stays at the end of each i.
If I understand you correctly, you want to print
<item>www.google.com</item>
Then try strip():
https://www.journaldev.com/23625/python-trim-string-rstrip-lstrip-strip
print(str("<item>") + i.strip() + str("</item>"))
When you use the readlines() method, each line keeps the newline character ("\n") that ends it in the file.
You could use the .strip() method, which removes whitespace (including newline characters) from the beginning and end of each line. That gives the output you expect.
def file_read(fname):
    with open(fname) as f:
        #Content_list is the list that contains the read lines.
        content_list = f.readlines()
        for i in content_list:
            print(str("<item>") + i.strip() + str("</item>"))
file_read(r"C:\Users\mandy\Desktop\gd.txt")
I assume you wanted to print in the following way
<item>www.google.com</item>
readlines() leaves a '\n' at the end of each line. To avoid that you can strip the string, and for printing you can use f-strings.
with open(fname) as f:
    lin=f.readlines()
    for i in lin:
        print(f"<item>{i.strip()}</item>")
Another method:
with open('stacksource') as f:
    lin=f.read().splitlines()
    for i in lin:
        print(f"<item>{i}</item>")
Here splitlines() splits the text into lines and returns them as a list, without the line endings.

Trying to pull a twitter handle from a text file

I am trying to extract a set of alphanumeric characters from a text file.
Below are some lines from the file. I want to extract the '#' as well as anything that follows it.
im trying to pull #bob from a file.
this is a #line in the #file
#bob is a wierdo
The code below is what I have so far.
def getAllPeople(fileName):
    #give empty list
    allPeople=[]
    #open TweetsFile.txt
    with open(fileName, 'r') as f1:
        lines=f1.readlines()
        #split all words into strings
        for word in lines:
            char = word.split("#")
            print(char)
        #close the file
        f1.close()
What I am trying to get is:
['#bob','#line','#file', '#bob']
If you do not want to use re, take Andrew's suggestion (here tweet is a string holding the text to search):
mentions = list(filter(lambda x: x.startswith('#'), tweet.split()))
Otherwise, see the marked duplicate. If you cannot use filter or lambda, a list comprehension does the same thing:
mentions = [w for w in tweet.split() if w.startswith('#')]
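Combined with the file-reading part of the question, the list comprehension could be used as in the sketch below. It assumes that a handle is any whitespace-separated word starting with '#', so trailing punctuation (as in "#bob.") would stay attached. The function name follows the question's getAllPeople, and the utf-8 encoding is an assumption about the file.

```python
def getAllPeople(fileName):
    """Return every word starting with '#' found in the file, in order."""
    allPeople = []
    # encoding="utf-8" is an assumption; adjust it to match the real file.
    with open(fileName, "r", encoding="utf-8") as f1:
        for line in f1:
            # split() with no argument splits on any run of whitespace.
            allPeople.extend(w for w in line.split() if w.startswith("#"))
    return allPeople
```

The with block closes the file on its own, so no explicit f1.close() is needed.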

Python: Write to file diacritical marks as escape character sequence

I read lines of text from an input file, and after cutting them I have strings like:
-pokaż wszystko-
–ყველას გამოჩენა–
and I must write something like this to another file:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
My Python script starts like this:
file_input = open('input.txt', 'r', encoding='utf-8')
file_output = open('output.txt', 'w', encoding='utf-8')
Unfortunately, what gets written to the file is not what is expected.
I got a tip about why I have to change it, but I can't figure out the conversion:
If the diacritical marks are saved as UTF-8 ("-pokaż wszystko-"), it works correctly only if NLS_LANG = AMERICAN_AMERICA.AL32UTF8.
If the output file has the diacritics saved in escaped form ("-poka\017C wszystko-"), the script works correctly for any NLS_LANG setting.
Python 3.6 solution...format characters outside the ASCII range:
#coding:utf8
s = ['-pokaż wszystko-','–ყველას გამოჩენა–']
def convert(s):
    return ''.join(x if ord(x) < 128 else f'\\{ord(x):04X}' for x in s)
for t in s:
    print(convert(t))
Output:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
Note: I don't know if or how you want to handle Unicode characters outside the Basic Multilingual Plane (BMP, > U+FFFF). This code emits five or six hex digits for them, which is probably not what you need. I would need more information about your escape sequence requirements.
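To connect this with the input.txt and output.txt files from the question, the conversion could be applied line by line, roughly as sketched below. The convert_file helper is hypothetical (it is not part of the answer above), and writing the output as ASCII is an assumption that holds only because every non-ASCII character has already been escaped.

```python
def convert(s):
    """Replace each non-ASCII character with a backslash and its hex code point."""
    return ''.join(x if ord(x) < 128 else f'\\{ord(x):04X}' for x in s)


def convert_file(src, dst):
    """Copy `src` to `dst` line by line, escaping non-ASCII characters."""
    with open(src, 'r', encoding='utf-8') as file_input, \
            open(dst, 'w', encoding='ascii') as file_output:
        for line in file_input:
            # Newlines are ASCII, so they pass through convert() unchanged.
            file_output.write(convert(line))
```

For the question's files this would be called as convert_file('input.txt', 'output.txt').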

Merging multiple text files into one and related problems

I'm using Windows 7 and Python 3.4.
I have several multi-line text files (all in Persian) and I want to merge them into one under one condition: each line of the output file must contain the whole text of each input file. It means if there are nine text files, the output text file must have only nine lines, each line containing the text of a single file. I wrote this:
import os
os.chdir (r'C:\Dir')
with open ('test.txt', 'w', encoding = 'UTF8') as OutFile:
    with open ('news01.txt', 'r', encoding = 'UTF8') as InFile:
        while True:
            _Line = InFile.readline()
            if len (_Line) == 0:
                break
            else:
                _LineString = str (_Line)
                OutFile.write (_LineString)
It worked for that one file, but the text takes up more than one line in the output file, and the output also contains unwanted character sequences such as &amp and &nbsp, even though the source files don't contain any of them.
Also, I've got some other texts: news02.txt, news03.txt, news04.txt ... news09.txt.
Considering all these:
How can I correct my code so that it reads all the files one after another, putting each on a single line?
How can I clean up these unfamiliar and strange characters, or prevent them from appearing in my final text?
Here is an example that will do the merging portion of your question:
def merge_file(infile, outfile, separator = ""):
    print(separator.join(line.strip("\n") for line in infile), file = outfile)
def merge_files(paths, outpath, separator = ""):
    with open(outpath, 'w') as outfile:
        for path in paths:
            with open(path) as infile:
                merge_file(infile, outfile, separator)
Example use:
merge_files([r"C:\file1.txt", r"C:\file2.txt"], r"C:\output.txt")
Note this makes the rather large assumption that the contents of 'infile' can fit into memory. That is reasonable for most text files, but possibly quite unreasonable otherwise. If your text files are very large, you can use this alternative merge_file implementation:
def merge_file(infile, outfile, separator = ""):
    for line in infile:
        outfile.write(line.strip("\n")+separator)
    outfile.write("\n")
It's slower, but shouldn't run into memory problems.
Answering question 1:
You were right about the UTF-8 part.
You probably want to create a function that takes the input files as a tuple (or via *args) of file objects or path strings. It then reads every input file and replaces each "\n" (newline) with a delimiter (default ""). out_file may also appear in in_files, but this assumes that the contents of the files can be loaded into memory. Both out_file and the entries of in_files can be either paths or file objects.
def write_from_files(out_file, in_files, delimiter="", dir=r"C:\Dir"):
    import io
    import os
    import html  # See part 2 of answer
    os.chdir(dir)
    output = []
    for file in in_files:
        file_ = file
        if not isinstance(file_, io.TextIOWrapper):
            # If it isn't a file object, open it as one
            file_ = open(file_, "r", -1, "UTF-8")
        file_.seek(0, 0)
        # Replace all newlines with delimiter
        output.append(file_.read().replace("\n", delimiter))
        file_.close()  # Close file to prevent IO errors
    if not isinstance(out_file, io.TextIOWrapper):
        out_file = open(out_file, "w", -1, "UTF-8")
    # Decode HTML entities and keep the result so it can be written
    joined = html.unescape("\n".join(output))
    out_file.write(joined)
    out_file.close()
    return joined  # Do not have to return
Answering question 2:
I think you may have copied the text from a webpage. This does not happen to me. &amp and &nbsp are the HTML entities for & and the non-breaking space. You may need to replace them with their corresponding characters. I would use the html module from the standard library: html.unescape() (available since Python 3.4; the older HTMLParser().unescape() method was removed in Python 3.9) turns HTML escape sequences into the characters they stand for. E.g.:
>>> import html
>>> html.unescape("Alpha &lt β")
'Alpha < β'
This will not work in Python 2.x, where the module is named HTMLParser. There, replace the corresponding lines with:
import HTMLParser
HTMLParser.HTMLParser().unescape("\n".join(output))
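Both parts of the question can be handled in one short function, roughly as sketched here. This is only an illustration under stated assumptions: the name merge_to_lines is hypothetical, the default separator of a single space is a choice made for this example (the answers above default to an empty string), and html.unescape requires Python 3.4 or later.

```python
import html


def merge_to_lines(paths, outpath, separator=" "):
    """Write one output line per input file, with HTML entities decoded."""
    with open(outpath, "w", encoding="utf-8") as outfile:
        for path in paths:
            with open(path, "r", encoding="utf-8") as infile:
                # splitlines() drops the line endings; the pieces are then
                # joined back into a single line.
                text = separator.join(infile.read().splitlines())
            outfile.write(html.unescape(text) + "\n")


# Example call for the files named in the question (news01.txt ... news09.txt):
#   merge_to_lines([f"news{n:02d}.txt" for n in range(1, 10)], "test.txt")
```

Like the first merge_file above, this reads each input file fully into memory before writing it out.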

python3 opening files and reading lines

Can you explain what is going on in this code? I don't seem to understand
how you can open the file and read it line by line instead of all of the sentences at the same time in a for loop. Thanks
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv
filename = input("Please enter the name of a file: ")
f = open(filename,'r')
d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId,StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop. On each iteration, it retrieves the next line of the file. Because it includes the last character in the line, which could be a newline, it's often useful to check and remove the last character.
As Moshe wrote, open file objects can be iterated over. However, they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, the unit of iteration is one line of text, including the \n.
You can use line = line.rstrip() to remove the \n along with any trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: an open file should always be closed. That means calling f.close() after the for loop. Alternatively, you can wrap the open() call in the newer with construct, which will close the file for you. I suggest getting used to the latter approach.
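As a rough illustration of the with construct, the counting loop from the question could be rewritten as follows. The function name count_visits and the snake_case variable names are chosen for this sketch only, and skipping blank lines is an added assumption about the input.

```python
def count_visits(filename):
    """Count how often each (animal, station) pair appears in the file."""
    counts = {}
    # The with block closes the file even if an error occurs inside it.
    with open(filename, "r") as f:
        for line in f:              # the file object yields one line per iteration
            line = line.rstrip()    # drop the trailing newline and whitespace
            if not line:
                continue            # skip blank lines
            animal_id, timestamp, station_id = line.split(":")
            key = (animal_id, station_id)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Each key of the returned dictionary is an (animal, station) tuple, matching the keys of d1ct in the original code.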
