What is the behavior of str.isalnum() for files open in binary mode? Is it independent of the locale?

I want to write a Python 2.7 program that opens a file in binary mode and collects all the alphanumeric characters/bytes, based on their ASCII table values. I want the program to work for any file extension, which is why I am opening the files in binary mode. I am not specifying any encoding because I do not want a codec error.
def get_alnum_from_file(filename):
    res = set()
    with open(filename, "rb") as myfile:
        text = myfile.read()
        for ch in text:
            if ch.isalnum():
                res.add(ch)
    return res
This has worked for all the inputs I have tried. However, is there an edge case where ch.isalnum() will return True for characters other than a-z, A-Z, 0-9?
For example, if it encounters characters like á, é, í, ó, ú, will it return True on some occasions? The documentation states that isalnum() depends on the locale, but I am not sure whether that applies when the characters are read as raw bytes.
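One way to check this on your own system (a minimal sketch, my addition, for Python 2.7) is to ask every possible byte value whether isalnum() accepts it under the current locale:

import locale
import string

locale.setlocale(locale.LC_ALL, '')  # switch from the default C locale to the user's locale
ascii_alnum = set(string.ascii_letters + string.digits)
# every byte value that isalnum() accepts beyond plain ASCII letters and digits
extras = [ch for ch in map(chr, range(256)) if ch.isalnum() and ch not in ascii_alnum]
print(repr(extras))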

Related

Python: Write to file diacritical marks as escape character sequence

I read text lines from an input file and, after some cutting, I have strings like:
-pokaż wszystko-
–ყველას გამოჩენა–
and I must write something like this to another file:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
My Python script starts like this:
file_input = open('input.txt', 'r', encoding='utf-8')
file_output = open('output.txt', 'w', encoding='utf-8')
Unfortunately, what gets written to the file is not what is expected.
I got a tip about why I have to change it, but I can't figure out the conversion:
With diacritic marks saved in UTF-8 ("-pokaż wszystko-"), it works correctly only if NLS_LANG = AMERICAN_AMERICA.AL32UTF8.
If the output file has diacritics saved in escaped form ("-poka\017C wszystko-"), the script works correctly for any NLS_LANG setting.
A Python 3.6 solution: format characters outside the ASCII range as escape sequences:
#coding:utf8
s = ['-pokaż wszystko-', '–ყველას გამოჩენა–']

def convert(s):
    return ''.join(x if ord(x) < 128 else f'\\{ord(x):04X}' for x in s)

for t in s:
    print(convert(t))
Output:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
Note: I don't know if or how you want to handle Unicode characters outside the Basic Multilingual Plane (i.e. code points above U+FFFF), but this code probably won't handle them. More information about your escape-sequence requirements would be needed.
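If you did need to cover code points above U+FFFF, one possibility (just a sketch, since the required escape format isn't specified) is to widen the escape to eight hex digits for those characters:

def convert(s):
    # Sketch: 4-digit escapes for BMP characters, 8-digit escapes above U+FFFF.
    return ''.join(
        x if ord(x) < 128
        else f'\\{ord(x):04X}' if ord(x) <= 0xFFFF
        else f'\\{ord(x):08X}'
        for x in s)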

Degree character (°) encoding/decoding

I am on the Windows platform and I use Python 3.
I have a text file which contains degree characters (°).
I want to read the whole text file, do some processing, and write it back with the modifications applied. Here is a sample of my code:
import io
import re
import csv
import numpy as np
import pandas as pd

with io.open('myTextFile.txt', encoding='ASCII') as f:
    for item in allItem:
        i = 0
        myData = pd.DataFrame(data=np.zeros((n, 1)))
        for line in f:
            myRegex = "(AD" + item + ")"
            if re.match(myRegex, line):
                myData.loc[i, 0] = line
                i += 1
        myData = myData[(myData.T != 0).any()]
        myData = myData.append(pd.DataFrame(["\n"], index=[myData.index[-1] + 1]))
        myData = myData[0].map(lambda x: x.strip()).to_frame()
        myData.to_csv('myModifiedTextFile.txt', header=False, index=False, mode='a',
                      quoting=csv.QUOTE_NONE, escapechar=' ', encoding='ASCII')
However, I am getting Unicode errors although I tried specifying the encoding/decoding:
'ascii' codec can't decode byte 0xe9 in position 512: ordinal not in range(128)
ASCII is not very useful here, since it only knows 128 characters, the ones in the ASCII table. Notice there is no degree sign in that table. I am unsure what the actual encoding of your file is – Unicode and commonly used Windows code pages (1250/1252) have the degree sign at 0xB0.
I assume that in your file there is a degree sign at position 512 and it is causing the error. If that is the case, you need to be more specific with your encoding argument: figure out which code page / encoding was used to save the file, and confirm this by looking up that code page and finding the degree sign at 0xE9.
If there is a different character at position 512 ("é" is a good candidate), then simply specify an encoding like cp1250, cp1252, or cp1257.
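For example, if the file turns out to be Windows-1252 (only a guess; confirm it first), the fix is just the encoding argument. A minimal sketch:

import io

# Sketch: read and write with the encoding the file was actually saved in
# (cp1252 is an assumption here; substitute whatever you determine).
with io.open('myTextFile.txt', encoding='cp1252') as f:
    text = f.read()

# ... processing ...

with io.open('myModifiedTextFile.txt', 'w', encoding='cp1252') as f:
    f.write(text)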

Writing to files in ASCII with Python3, not UTF8

I have a program that I created with two sections.
The first one copies a text file, placing an integer in the middle of the file name, in this format:
file = "Filename" + str(int) + ".txt"
The user can create as many copies of the file as they would like.
The second part of the program is what I am having the problem with. There is an integer at the very bottom of the file that is supposed to correspond to the integer in the file name. After the first part is done, I open each file one at a time in "r+" read/write mode, so I can file.seek(1000) to roughly where the integer is in the file.
Now, in my opinion, the next part should be easy: I should simply be able to write str(int) into the file right there. But it wasn't that easy. It worked just fine that way on Linux at home, but at work on Windows it proved difficult. What I ended up having to do after file.seek(1000) is write to the file using Unicode UTF-8. I accomplished this with the code snippet below, taken from the rest of the program; it is commented so that it can be understood what is going on.
Instead of having to write this in Unicode, I would love to be able to write it in good old regular ASCII characters. Eventually this program will be expanded to include a lot more data at the bottom of each file, and having to write the data in Unicode is going to make things extremely difficult. If I just write the data without turning it into Unicode, this is the result: the string is supposed to say #2 =1534, but instead it says #2 =ㄠ㌵433.
If someone can show me what I am doing wrong, that would be great. I would love to just use something like file.write('1534') to write the data to the file instead of having to do it in Unicode UTF-8.
while a1 < d1:
    file = "file" + str(a1) + ".par"
    f = open(file, "r+")
    f.seek(1011)
    data = f.read()  # reads the data from that point in the file into a variable
    numList = list(str(a1))  # "a1" is the integer in the file name; turned into a list for the next step
    replaceData = '\x00' + numList[0] + '\x00' + numList[1] + '\x00' + numList[2] + '\x00' + numList[3] + '\x00'  # this line turns the integer into UTF-8 Unicode; I am by no means a Unicode expert
    currentData = data  # probably didn't need to be done, now that I'm looking at this
    data = data.replace(currentData, replaceData)  # replaces the UTF-8 string in "data" with the new UTF-8 string in "replaceData"
    f.seek(1011)  # return to where I need to be in the file to write the data
    f.write(data)  # write the new Unicode data to the file
    f.close()  # close the file
    f.close()  # make sure the file is closed (sometimes it seems this fails on Windows)
    a1 += 1  # advance the integer, then return to the top of the loop
This is an example of writing to a file in ASCII. You need to open the file in binary mode, and using the .encode method on strings is a convenient way to get the end result you want.
s = '12345'
ascii = s.encode('ascii')
with open('somefile', 'wb') as f:
    f.write(ascii)
You can obviously also open in rb+ (binary read/write mode) in your case if the file already exists.
with open('somefile', 'rb+') as f:
    existing = f.read()
    f.write(b'ascii without encoding!')
You can also just pass string literals with the b prefix, which are already bytes containing ASCII, as shown in the second example.
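Applied to the loop from the question, that approach could look roughly like this (a sketch, assuming plain ASCII digits are really what the file format expects at that offset):

a1 = 1000  # assumed starting value; the real loop bounds aren't shown in the question
d1 = 1010  # assumed end value
while a1 < d1:
    name = "file" + str(a1) + ".par"
    with open(name, "rb+") as f:          # binary read/write, no text decoding involved
        f.seek(1011)
        f.write(str(a1).encode('ascii'))  # e.g. b'1534' instead of '\x00'-padded text
    a1 += 1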

How to read files without knowing the encoding and extract the desired URLs with Python 3?

Environment: Python 3.
There are many files; some of them are encoded with gbk, others with utf-8.
I want to extract all the jpg URLs with a regular expression.
For s.html, which is encoded with gbk:
tree = open("/tmp/s.html","r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte
tree = open("/tmp/s.html","r",encoding="gbk").read()
pat = "http://.+\.jpg"
result = re.findall(pat,tree)
print(result)
['http://somesite/2017/06/0_56.jpg']
It is a huge job to open every file with its specific encoding; I want a smart way to extract the jpg URLs from all the files.
I had a similar problem, and here is how I solved it.
In get_file_encoding(filename), I first check whether there is a BOM (Byte Order Mark) in the file; if so, I get the encoding from the BOM, via the function get_file_bom_encoding(filename).
If that does not return a value, I get a list from the function get_all_file_encodings(filename).
This list contains all the encodings in which the file can be opened. For my purposes I just needed one, and I did not care about the rest, so I simply chose the first value of the list: file_encoding = str(encoding_list[0][0])
This is obviously not 100% accurate, but at least it will give you either the correct encoding from the BOM, or a list of encodings in which the file can be opened. Or, if you do the same as I did, it gives you one encoding (the first value) with which the file can be opened without errors.
Here is the code:
# -*- coding: utf-8 -*-
import codecs

def get_file_bom_encoding(filename):
    with open(filename, 'rb') as openfileobject:
        line = str(openfileobject.readline())
        if line[2:14] == str(codecs.BOM_UTF8).split("'")[1]: return 'utf_8'
        if line[2:10] == str(codecs.BOM_UTF16_BE).split("'")[1]: return 'utf_16'
        if line[2:10] == str(codecs.BOM_UTF16_LE).split("'")[1]: return 'utf_16'
        if line[2:18] == str(codecs.BOM_UTF32_BE).split("'")[1]: return 'utf_32'
        if line[2:18] == str(codecs.BOM_UTF32_LE).split("'")[1]: return 'utf_32'
    return ''

def get_all_file_encodings(filename):
    encoding_list = []
    encodings = ('utf_8', 'utf_16', 'utf_16_le', 'utf_16_be',
                 'utf_32', 'utf_32_be', 'utf_32_le',
                 'cp850', 'cp437', 'cp852', 'cp1252', 'cp1250', 'ascii',
                 'utf_8_sig', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp500',
                 'cp720', 'cp737', 'cp775', 'cp855', 'cp856', 'cp857',
                 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864',
                 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932',
                 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1251',
                 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257',
                 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213',
                 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp',
                 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004',
                 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
                 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5',
                 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9',
                 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15',
                 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic',
                 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
                 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004',
                 'shift_jisx0213'
                 )
    for e in encodings:
        try:
            fh = codecs.open(filename, 'r', encoding=e)
            fh.readlines()
        except UnicodeDecodeError:
            fh.close()
        except UnicodeError:
            fh.close()
        else:
            encoding_list.append([e])
            fh.close()
            continue
    return encoding_list

def get_file_encoding(filename):
    file_encoding = get_file_bom_encoding(filename)
    if file_encoding != '':
        return file_encoding
    encoding_list = get_all_file_encodings(filename)
    file_encoding = str(encoding_list[0][0])
    if file_encoding[-3:] == '_be' or file_encoding[-3:] == '_le':
        file_encoding = file_encoding[:-3]
    return file_encoding

def main():
    print('This Python script is only for import functionality, it does not run interactively')

if __name__ == '__main__':
    main()
I am sure there are modules/packages which can do this more accurately, but this is done with standard packages (which was another requirement I had).
It does mean that you are reading the files multiple times, which is not a very fast way of doing things. You may be able to use this for your own particular problem, or even improve on it; the "reading multiple times" part especially is something you could look at, opening the file immediately once a working encoding is found.
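As a rough illustration of that idea (my sketch, not part of the code above), you can read the raw bytes once and try candidate encodings against them, stopping at the first one that decodes cleanly:

def detect_and_read(filename, candidates=('utf_8', 'gbk', 'latin_1')):
    # Sketch: read the raw bytes once, then return the first candidate
    # encoding that decodes them without errors.
    with open(filename, 'rb') as f:
        raw = f.read()
    for enc in candidates:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise UnicodeError('no candidate encoding worked for ' + filename)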
If they have mixed encodings, you could try one encoding and fall back to another:
# first open as binary
with open(..., 'rb') as f:
    f_contents = f.read()

try:
    contents = f_contents.decode('UTF-8')
except UnicodeDecodeError:
    contents = f_contents.decode('gbk')
...
If they are html files, you may also be able to find the encoding tag, or search them as binary with a binary regex:
contents = open(..., 'rb').read()
regex = re.compile(b'http://.+\.jpg')
result = regex.findall(contents)
# now you'll probably want to `.decode()` each of the urls, but you should be able to do that pretty trivially with even the `ASCII` codec
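For example, the decode step mentioned in the comment could be as simple as this (my sketch):

urls = [u.decode('ascii') for u in result]  # each match is bytes; jpg URLs are ASCII-safe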
Though now that I think of it, you probably don't really want to use a regex to capture the links, as you'll then have to deal with HTML entities (e.g. &amp;), and you may do better with something like pyquery.
Here's a quick example using pyquery:
import pyquery  # third-party package

contents = open(..., 'rb').read()
pq = pyquery.PyQuery(contents)
images = pq.find('img')
for img in images:
    img = pyquery.PyQuery(img)
    if img.attr('src').endswith('.jpg'):
        print(img.attr('src'))
I'm not at a computer with these packages installed, so mileage with these code samples may vary.

Merging multiple text files into one and related problems

I'm using Windows 7 and Python 3.4.
I have several multi-line text files (all in Persian) and I want to merge them into one, under one condition: each line of the output file must contain the whole text of one input file. That means if there are nine text files, the output text file must have only nine lines, each line containing the text of a single file. I wrote this:
import os
os.chdir('C:\Dir')
with open('test.txt', 'w', encoding='UTF8') as OutFile:
    with open('news01.txt', 'r', encoding='UTF8') as InFile:
        while True:
            _Line = InFile.readline()
            if len(_Line) == 0:
                break
            else:
                _LineString = str(_Line)
                OutFile.write(_LineString)
It worked for that one file, but it looks like it takes up more than one line in the output file, and the output file also contains disturbing character sequences like &amp and &nbsp. The source files don't contain any of them.
Also, I've got some other texts: news02.txt, news03.txt, news04.txt ... news09.txt.
Considering all this:
How can I correct my code so that it reads all the files one after another, putting each one on only one line?
How can I clean out these unfamiliar, strange characters, or prevent them from appearing in my final text?
Here is an example that will do the merging portion of your question:
def merge_file(infile, outfile, separator=""):
    print(separator.join(line.strip("\n") for line in infile), file=outfile)

def merge_files(paths, outpath, separator=""):
    with open(outpath, 'w') as outfile:
        for path in paths:
            with open(path) as infile:
                merge_file(infile, outfile, separator)
Example use:
merge_files(["C:\file1.txt", "C:\file2.txt"], "C:\output.txt")
Note this makes the rather large assumption that the contents of infile can fit into memory. That is reasonable for most text files, but possibly quite unreasonable otherwise. If your text files will be very large, you can use this alternate merge_file implementation:
def merge_file(infile, outfile, separator=""):
    for line in infile:
        outfile.write(line.strip("\n") + separator)
    outfile.write("\n")
It's slower, but shouldn't run into memory problems.
Answering question 1:
You were right about the UTF-8 part.
You probably want to create a function which takes multiple files, either as a tuple of file objects / path strings or as *args. It then reads all the input files and replaces every "\n" (newline) with a delimiter (default ""). out_file can be among in_files, but this makes the assumption that the contents of the files can be loaded into memory. Also, out_file can be a file object, and in_files can contain file objects.
def write_from_files(out_file, in_files, delimiter="", dir="C:\Dir"):
    import _io
    import os
    import html.parser  # See part 2 of the answer
    os.chdir(dir)
    output = []
    for file in in_files:
        file_ = file
        if not isinstance(file_, _io.TextIOWrapper):
            file_ = open(file_, "r", -1, "UTF-8")  # If it isn't a file object, make it one
        file_.seek(0, 0)
        output.append(file_.read().replace("\n", delimiter))  # Replace all newlines with the delimiter
        file_.close()  # Close the file to prevent IO errors
    if not isinstance(out_file, _io.TextIOWrapper):
        out_file = open(out_file, "w", -1, "UTF-8")
    joined = html.parser.HTMLParser().unescape("\n".join(output))  # See part 2 of the answer
    out_file.write(joined)
    out_file.close()
    return joined  # Does not have to be returned
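For the nine files from the question, a call could look like this (a sketch; the paths are assumed to be relative to C:\Dir, which the function chdirs into):

write_from_files('test.txt', ['news0%d.txt' % i for i in range(1, 10)])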
Answering question 2:
I think you may have copied the text from a webpage. This does not happen for me. &amp and &nbsp are HTML entities, for the ampersand (&) and the non-breaking space ( ). You may need to replace them with their corresponding characters. I would use html.parser. As you can see above, it turns HTML escape sequences into the corresponding Unicode characters. E.g.:
>>> html.parser.HTMLParser().unescape("Alpha &lt β")
'Alpha < β'
This will not work in Python 2.x, because the module was renamed in 3.x. In Python 2.x, replace those lines with:
import HTMLParser
HTMLParser.HTMLParser().unescape("\n".join(output))
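Since the question mentions Python 3.4, note (my addition, beyond the original answer) that the same conversion is also available there as a plain module-level function:

import html
print(html.unescape("Alpha &lt; β"))  # -> Alpha < β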
