Python: Write to file diacritical marks as escape character sequence - python-3.x

I read text lines from an input file, and after cutting them I have strings like these:
-pokaż wszystko-
–ყველას გამოჩენა–
and I must write something like this to another file:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
My Python script starts like this:
file_input = open('input.txt', 'r', encoding='utf-8')
file_output = open('output.txt', 'w', encoding='utf-8')
Unfortunately, the file I write is not in the form the consuming script expects.
I got a tip explaining why I have to change it, but I can't figure out the conversion:
If the diacritical marks are saved as UTF-8 ("-pokaż wszystko-"), the script works correctly only if NLS_LANG = AMERICAN_AMERICA.AL32UTF8.
If the output file has the diacritics saved in escaped form ("-poka\017C wszystko-"), the script works correctly for any NLS_LANG setting.

Python 3.6 solution...format characters outside the ASCII range:
#coding:utf8
s = ['-pokaż wszystko-','–ყველას გამოჩენა–']
def convert(s):
    return ''.join(x if ord(x) < 128 else f'\\{ord(x):04X}' for x in s)
for t in s:
    print(convert(t))
Output:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
Note: I don't know if or how you want to handle Unicode characters outside the basic multilingual plane (BMP, > U+FFFF), but this code probably won't handle them. Need more information about your escape sequence requirements.
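To tie the conversion back to the files in the question, here is a minimal sketch that streams lines from one file object to another. The helper name escape_stream is hypothetical, and io.StringIO stands in for the real files so the example runs on its own. How characters above U+FFFF should be written is still an open requirement, so the sketch does not treat them specially.

```python
import io


def convert(text):
    """Replace every non-ASCII character with a backslash plus its hex code point."""
    return ''.join(ch if ord(ch) < 128 else f'\\{ord(ch):04X}' for ch in text)


def escape_stream(src, dst):
    """Copy text from src to dst line by line, escaping non-ASCII characters."""
    for line in src:
        dst.write(convert(line))


# In-memory demonstration. With real files, open input.txt and output.txt
# (both with encoding='utf-8') and pass those file objects instead.
source = io.StringIO('-pokaż wszystko-\n')
target = io.StringIO()
escape_stream(source, target)
print(target.getvalue(), end='')
```

Because the escaped output contains only ASCII characters, the encoding of the output file should no longer matter to whatever reads it.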

Related

Text parsing with specific element Python

I have an alignment result with multiple sequences as a text file. I want to split each result into a new text file. So far I can detect each sequence by its '>' and split into files. However, the new text files are written without the line that contains '>'.
with open("result.txt",'r') as fo:
    start=0
    op= ' '
    cntr=1
    # print(fo.readlines())
    for x in fo.readlines():
        # print(x)
        if (x[0]== '>'):
            if (start==1):
                with open(str(cntr)+'.txt','w') as opf:
                    opf.write(op)
                    opf.close()
                op= ' '
                cntr+=1
            else:
                start=1
        else:
            if (op==''):
                op=x
            else:
                op= op + '\n' + x
    fo.close()
print('completed')
>P51051.1 RecName: Full=Melatonin receptor type 1B; Short=Mel-1B-R; Short=Mel1b
receptor [Xenopus laevis]
Length=152
This is how I want each text file to begin, but instead they start like this:
receptor [Xenopus laevis]
Length=152
How can I include the header line that starts with '>' at the beginning of each file?
You can do it like this:
with open("result.txt", encoding='utf-8') as fo:
    for index, txt in enumerate(fo.read().split(">")):
        if txt:
            with open(f'{index}.txt', 'w') as opf:
                opf.write(txt)
You should pass the file's encoding explicitly (e.g. utf-8). There is no need to specify the read mode 'r', since it is the default, and no need to close the file when you use a context manager (with). Use read instead of readlines to get a single string, then call split on it. enumerate supplies a counter alongside each chunk, and the f-string is a cleaner way to build the file name than string concatenation.
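Note that str.split(">") discards the separator, so the chunks written by the code above no longer begin with '>'. If the marker should stay, as the expected output in the question suggests, a sketch along the following lines may help. The function name split_records is hypothetical, and any text before the first '>' is kept as its own record, which is an assumption about what the input looks like.

```python
def split_records(text):
    """Split text into records, each beginning with its own '>' header line."""
    records = []
    current = []
    for line in text.splitlines(keepends=True):
        # A new header closes the record collected so far.
        if line.startswith('>') and current:
            records.append(''.join(current))
            current = []
        current.append(line)
    if current:
        records.append(''.join(current))
    return records


sample = ">P51051.1 first header\nLength=152\n>Q00000.1 second header\nLength=10\n"
for number, record in enumerate(split_records(sample), start=1):
    # With real data, write each record to f'{number}.txt' instead of printing.
    print(number, record.splitlines()[0])
```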

What is the behavior of str.isalnum() for files open in binary mode? Is it independent of locale()?

I want to write a Python 2.7 program that opens a file in binary mode and collects all the alphanumeric characters/bytes based on their ASCII table values. I want this program to work for any file extension, which is why I am opening the files in binary mode. I am not using any specific encoding since I do not want a codec error.
def get_alnum_from_file(filename):
    res = set()
    with open(filename, "rb") as myfile:
        text = myfile.read()
        for ch in text:
            if ch.isalnum():
                res.add(ch)
    return res
This has worked for all the inputs I have worked on. However, is there an edge case where ch.isalnum() will return true for characters not a-z A-Z 0-9?
For example, if it encounters characters like á, é, í, ó, ú; will it return true on some occasions? The documentation states that isalnum() depends on the locale, but I am not sure if this counts if the characters are raw binary.
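Whatever the locale behaviour turns out to be, one way to make the result independent of it is to test each byte against an explicit ASCII set instead of calling isalnum(). The sketch below is written for Python 3, where iterating over a bytes object yields integers, rather than for the Python 2.7 of the question. The names ASCII_ALNUM and ascii_alnum_bytes are hypothetical.

```python
import string

# The 62 byte values for a-z, A-Z and 0-9, stored as integers.
ASCII_ALNUM = frozenset((string.ascii_letters + string.digits).encode('ascii'))


def ascii_alnum_bytes(data):
    """Return the set of single-byte values in data that are ASCII letters or digits."""
    return {bytes([value]) for value in data if value in ASCII_ALNUM}


# b'\xe1' is how 'á' is encoded in Latin-1; it is not in the explicit set.
print(sorted(ascii_alnum_bytes(b'a1\xe1 Z!')))
```

To use it on a file, read the contents with open(filename, "rb") and pass the resulting bytes to the function.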

Python 3 split('\n')

How do I split a text string according to an explicit newline ('\n')?
Unfortunately, instead of a properly formatted csv file, I am dealing with a long string of text with "\n" where the newline would be. (example format: "A0,B0\nA1,B1\nA2,B2\nA3,B3\n ...") I thought a simple bad_csv_list = text.split('\n') would give me a list of the two-valued cells (example split ['A0,B0', 'A1,B1', 'A2,B2', 'A3,B3', ...]). Instead I end up with one cell and "\n" gets converted to "\\n". I tried copy-pasting a section of the string and using split('\n') and it works as I had hoped. The print statement for the file object tells me the following:
<_io.TextIOWrapper name='stats.csv' mode='r' encoding='cp1252'>
...so I suspect the problem is with the cp1252 encoding? Of note though: Notepad++ says the file I am working with is "UTF-8 without BOM"... I've looked in the docs and around SO and tried importing io and codecs, prefixing the open call with them, and declaring encoding='utf8', but I am at a loss and I don't really grok text encoding. Maybe there is a better solution?
from sys import argv
# import io, codecs
filename = argv[1]
file_object = open(filename, 'r')
# file_object = io.open(filename, 'r', encoding='utf8')
# file_object = codecs.open(filename, 'r', encoding='utf8')
file_contents = file_object.read()
file_list = file_contents.split('\n')
print("1.) Here's the name of the file: {}".format(filename))
print("2.) Here's the file object info: {}".format(file_object))
print("3.) Here's all the files contents:\n{}".format(file_contents))
print("4.) Here's a list of the file contents:\n{}".format(file_list))
Any help would be greatly appreciated, thank you.
If it helps to explain what I am dealing with, here's the contents of the stats.csv file:
Albuquerque,749\nAnaheim,371\nAnchorage,828\nArlington,503\nAtlanta,1379\nAurora,425\nAustin,408\nBakersfield,542\nBaltimore,1405\nBoston,835\nBuffalo,1288\nCharlotte-Mecklenburg,647\nCincinnati,974\nCleveland,1383\nColorado Springs,455\nCorpus Christi,658\nDallas,675\nDenver,615\nDetroit,2122\nEl Paso,423\nFort Wayne,362\nFort Worth,587\nFresno,543\nGreensboro,563\nHenderson,168\nHouston,992\nIndianapolis,1185\nJacksonville,617\nJersey City,734\nKansas City,1263\nLas Vegas,784\nLexington,352\nLincoln,397\nLong Beach,575\nLos Angeles,481\nLouisville Metro,598\nMemphis,1750\nMesa,399\nMiami,1172\nMilwaukee,1294\nMinneapolis,992\nMobile,522\nNashville,1216\nNew Orleans,815\nNew York,639\nNewark,1154\nOakland,1993\nOklahoma City,919\nOmaha,594\nPhiladelphia,1160\nPhoenix,636\nPittsburgh,752\nPlano,130\nPortland,517\nRaleigh,423\nRiverside,443\nSacramento,738\nSan Antonio,503\nSan Diego,413\nSan Francisco,704\nSan Jose,363\nSanta Ana,401\nSeattle,597\nSt. Louis,1776\nSt. Paul,722\nStockton,1548\nTampa,616\nToledo,1171\nTucson,724\nTulsa,990\nVirginia Beach,169\nWashington,1177\nWichita,742
And the result from the split('\n'):
['Albuquerque,749\\nAnaheim,371\\nAnchorage,828\\nArlington,503\\nAtlanta,1379\\nAurora,425\\nAustin,408\\nBakersfield,542\\nBaltimore,1405\\nBoston,835\\nBuffalo,1288\\nCharlotte-Mecklenburg,647\\nCincinnati,974\\nCleveland,1383\\nColorado Springs,455\\nCorpus Christi,658\\nDallas,675\\nDenver,615\\nDetroit,2122\\nEl Paso,423\\nFort Wayne,362\\nFort Worth,587\\nFresno,543\\nGreensboro,563\\nHenderson,168\\nHouston,992\\nIndianapolis,1185\\nJacksonville,617\\nJersey City,734\\nKansas City,1263\\nLas Vegas,784\\nLexington,352\\nLincoln,397\\nLong Beach,575\\nLos Angeles,481\\nLouisville Metro,598\\nMemphis,1750\\nMesa,399\\nMiami,1172\\nMilwaukee,1294\\nMinneapolis,992\\nMobile,522\\nNashville,1216\\nNew Orleans,815\\nNew York,639\\nNewark,1154\\nOakland,1993\\nOklahoma City,919\\nOmaha,594\\nPhiladelphia,1160\\nPhoenix,636\\nPittsburgh,752\\nPlano,130\\nPortland,517\\nRaleigh,423\\nRiverside,443\\nSacramento,738\\nSan Antonio,503\\nSan Diego,413\\nSan Francisco,704\\nSan Jose,363\\nSanta Ana,401\\nSeattle,597\\nSt. Louis,1776\\nSt. Paul,722\\nStockton,1548\\nTampa,616\\nToledo,1171\\nTucson,724\\nTulsa,990\\nVirginia Beach,169\\nWashington,1177\\nWichita,742']
Why does it ADD a \ ?
dOh!!! ROYAL FACE PALM! I just wrote all this out and then realized that all I needed to do was put an escape backslash before the \n:
file_list = file_contents.split('\\n')
I'm gonna post this anyways so y'all can have a chuckle ^_^
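For anyone puzzled by the extra backslash: the file holds the two characters backslash and n rather than real line breaks, and when a list is printed Python shows each string's repr, which writes a literal backslash as \\. Nothing was added to the data. A small sketch of the difference, using a shortened copy of the data and the hypothetical helper name split_literal_newlines:

```python
def split_literal_newlines(text):
    """Split on the two-character sequence backslash + n, not on a real newline."""
    return text.split('\\n')


# A raw string keeps the backslash and the n as two separate characters,
# which is what the file in the question contains.
raw = r'Albuquerque,749\nAnaheim,371\nAnchorage,828'

print(raw.split('\n'))              # one element: there is no real newline to split on
print(split_literal_newlines(raw))  # three elements
rows = [item.split(',') for item in split_literal_newlines(raw)]
print(rows)
```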

Merging multiple text files into one and related problems

I'm using Windows 7 and Python 3.4.
I have several multi-line text files (all in Persian) and I want to merge them into one under one condition: each line of the output file must contain the whole text of each input file. It means if there are nine text files, the output text file must have only nine lines, each line containing the text of a single file. I wrote this:
import os
os.chdir (r'C:\Dir')
with open ('test.txt', 'w', encoding = 'UTF8') as OutFile:
    with open ('news01.txt', 'r', encoding = 'UTF8') as InFile:
        while True:
            _Line = InFile.readline()
            if len (_Line) == 0:
                break
            else:
                _LineString = str (_Line)
                OutFile.write (_LineString)
It worked for that one file, but the text still takes up more than one line in the output file, and the output also contains unwanted sequences such as &amp, &nbsp and the like. The source files don't seem to contain any of them.
Also, I've got some other texts: news02.txt, news03.txt, news04.txt ... news09.txt.
Considering all these:
How can I correct my code so that it reads all the files one after another, putting each on a single line?
How can I clean up these unfamiliar and strange characters, or prevent them from appearing in my final text?
Here is an example that will do the merging portion of your question:
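def merge_file(infile, outfile, separator = ""):
    # Join all lines of one input file and write them as a single output line
    print(separator.join(line.strip("\n") for line in infile), file = outfile)
def merge_files(paths, outpath, separator = ""):
    # The question's files are UTF-8, so the encoding is passed explicitly
    with open(outpath, 'w', encoding = 'utf-8') as outfile:
        for path in paths:
            with open(path, encoding = 'utf-8') as infile:
                merge_file(infile, outfile, separator)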
Example use:
merge_files([r"C:\file1.txt", r"C:\file2.txt"], r"C:\output.txt")
Note this makes the rather large assumption that the contents of 'infile' can fit into memory. That is reasonable for most text files, but possibly quite unreasonable otherwise. If your text files will be very large, you can use this alternate merge_file implementation:
def merge_file(infile, outfile, separator = ""):
    for line in infile:
        outfile.write(line.strip("\n")+separator)
    outfile.write("\n")
It's slower, but shouldn't run into memory problems.
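To apply the same idea to the news01.txt ... news09.txt files from the question, the inputs can be collected with a glob pattern. The following is only a sketch under assumptions: the function name merge_one_line_each is hypothetical, a space is used as the separator, and a temporary directory with two small stand-in files replaces the real folder.

```python
import tempfile
from pathlib import Path


def merge_one_line_each(paths, outpath, separator=' '):
    """Write one output line per input file, joining that file's lines with separator."""
    with open(outpath, 'w', encoding='utf-8') as outfile:
        for path in paths:
            with open(path, encoding='utf-8') as infile:
                pieces = (line.rstrip('\n') for line in infile)
                outfile.write(separator.join(pieces) + '\n')


# Demonstration in a temporary directory with two small stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / 'news01.txt').write_text('line one\nline two\n', encoding='utf-8')
    (folder / 'news02.txt').write_text('other text\n', encoding='utf-8')
    inputs = sorted(folder.glob('news*.txt'))  # sorted so the order is predictable
    merged_path = folder / 'test.txt'
    merge_one_line_each(inputs, merged_path)
    merged = merged_path.read_text(encoding='utf-8')

print(merged.count('\n'))  # number of lines written, one per input file
```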
Answering question 1:
You were right about the UTF-8 part.
You probably want a function that takes the input files as a tuple (or *args) of file objects or path strings. It reads every input file and replaces each "\n" (newline) with a delimiter (default ""). out_file can also be one of in_files, but this assumes that the contents of all the files fit in memory. Both out_file and the entries of in_files can be file objects instead of paths.
def write_from_files(out_file, in_files, delimiter="", dir=r"C:\Dir"):
    import io
    import os
    import html.parser  # See part 2 of answer
    os.chdir(dir)
    output = []
    for file in in_files:
        file_ = file
        if not isinstance(file_, io.TextIOWrapper):
            file_ = open(file_, "r", -1, "UTF-8")  # If it isn't a file, make it a file
        file_.seek(0, 0)
        output.append(file_.read().replace("\n", delimiter))  # Replace all newlines with delimiter
        file_.close()  # Close file to prevent IO errors
    if not isinstance(out_file, io.TextIOWrapper):
        out_file = open(out_file, "w", -1, "UTF-8")
    # Keep the unescaped text; the original discarded it and wrote an undefined name
    result = html.parser.HTMLParser().unescape("\n".join(output))
    out_file.write(result)
    out_file.close()
    return result  # Do not have to return
Answering question 2:
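I think you may have copied the text from a webpage. This does not happen to me. &amp and &nbsp are HTML entities for the ampersand (&) and the non-breaking space. You may need to replace them with their corresponding characters. I would use html.parser. As you can see above, it turns HTML escape sequences into the Unicode characters they stand for. Note that HTMLParser.unescape() was deprecated in Python 3.4 and removed in Python 3.9; html.unescape(), available since Python 3.4, does the same job. E.g.: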
>>> html.parser.HTMLParser().unescape("Alpha &lt β")
'Alpha < β'
This will not work in Python 2.x, because the module was renamed in 3.x. In 2.x, replace the relevant lines with:
import HTMLParser
HTMLParser.HTMLParser().unescape("\n".join(output))
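On current Python 3 versions the standard-library function html.unescape() covers this step. A minimal sketch follows; the helper name clean_text is hypothetical, and turning the non-breaking space into an ordinary space is an assumption about what the cleaned text should look like.

```python
import html


def clean_text(text):
    """Decode HTML entities, then turn non-breaking spaces into ordinary spaces."""
    return html.unescape(text).replace('\u00a0', ' ')


print(clean_text('Tom &amp; Jerry&nbsp;show'))
```

Applying such a function to each merged line before writing it would remove the entities from the output file.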

VIM: deleting non-roman characters

I'm working with a document that contains both Roman and Asian characters, and I want to put each group into its own separate file while keeping the original structure. Is that possible?
Thanks
Might be easier in Python. Here's a script that reads a text file and creates two output files: one with low-ASCII and one with everything else. If you have Python support compiled in Vim, the following should also be usable from within Vim (with minimal changes).
import codecs
mixedInput = codecs.open('mixed.txt', 'r', 'utf-8')
lowAsciiOutput = codecs.open('lowAscii.txt', 'w', 'utf-8')
otherOutput = codecs.open('other.txt', 'w', 'utf-8')
for rawline in mixedInput:
    line = rawline.rstrip()
    for c in line:
        if ord(c) < 2**7:
            lowAsciiOutput.write(c)
        else:
            otherOutput.write(c)
    otherOutput.write('\n')
    lowAsciiOutput.write('\n')
mixedInput.close()
lowAsciiOutput.close()
otherOutput.close()
example input file (mixed.txt):
欢迎来到Mifos管理区域
Does that do what you want?
Also saved as a gist: https://gist.github.com/855545
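The same split can be tried on a string without touching any files. This is a sketch of the idea in the script above, not a replacement for it; the function name split_by_ascii is hypothetical, and the sample is the example input line.

```python
def split_by_ascii(text):
    """Return (ascii_part, other_part), keeping the line structure of text in both."""
    ascii_lines = []
    other_lines = []
    for line in text.splitlines():
        # Code points below 128 are the low-ASCII range used by the script above.
        ascii_lines.append(''.join(c for c in line if ord(c) < 128))
        other_lines.append(''.join(c for c in line if ord(c) >= 128))
    return '\n'.join(ascii_lines), '\n'.join(other_lines)


low, other = split_by_ascii('欢迎来到Mifos管理区域')
print(low)
print(other)
```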

Resources