Seek from end of file in python 3 - string

One of the changes in Python 3 has been to remove the ability to seek from the end of the file in normal text mode. What is the generally accepted alternative to this?
For example, in Python 2.7 I would enter file.seek(-3, 2).
I've read a bit about why they did this, so please don't just link to a PEP. I know that using 'rb' would allow me to seek, but this makes my text file read in the wrong format.

In Python 2, the file data wasn't being decoded while reading. Seeking backwards and multi-byte encodings don't mix well (after an arbitrary seek you can't know where the next character starts), which is why it is disabled in Python 3.
You can still seek on the underlying buffer object, via the TextIOBase.buffer attribute, but then you'll have to attach a new io.TextIOWrapper, as the current wrapper will no longer know where it is:
import io
file.buffer.seek(-3, 2)
file = io.TextIOWrapper(
    file.buffer, encoding=file.encoding, errors=file.errors,
    newline=file.newlines)
I've copied across any encoding and line handling information to the io.TextIOWrapper() object.
Take into account that decoding could break for UTF-16, UTF-32, UTF-8 and other multi-byte codecs, because the new position may fall in the middle of a character.
Demo:
>>> import io
>>> with open('demo.txt', 'w') as out:
...     out.write('Demonstration\nfor seeking from the end')
...
38
>>> with open('demo.txt') as inf:
...     print(inf.readline())
...     inf.buffer.seek(-3, 2)
...     inf = io.TextIOWrapper(inf.buffer)
...     print(inf.readline())
...
Demonstration
35
end
You could wrap this up in a utility function:
import io
def textio_seek(fobj, amount, whence=0):
    fobj.buffer.seek(amount, whence)
    return io.TextIOWrapper(
        fobj.buffer, encoding=fobj.encoding, errors=fobj.errors,
        newline=fobj.newlines)
and use this as:
with open(somefile) as file:
    # ...
    file = textio_seek(file, -3, 2)
    # ...
Using the file object as a context manager still works, because the original file object still references the original buffer object and can therefore still be used to close the file.
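If only the tail of the file is needed, one alternative is to do the seek in binary mode and decode the bytes afterwards. The following is a minimal sketch under that assumption, not the approach from the answer above. The names read_tail and demo_path are hypothetical, and errors="replace" is chosen so that a seek landing in the middle of a multi-byte character does not raise:

```python
import io
import os
import tempfile


def read_tail(path, nbytes, encoding="utf-8"):
    """Return the text decoded from (at most) the last `nbytes` bytes of `path`."""
    with open(path, "rb") as raw:
        raw.seek(0, io.SEEK_END)          # jump to the end to learn the size
        size = raw.tell()
        raw.seek(max(size - nbytes, 0))   # never seek before the start
        # A cut through a multi-byte character becomes U+FFFD instead of an error.
        return raw.read().decode(encoding, errors="replace")


# Demo on a throwaway file (hypothetical content, same as the demo above).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8", newline="") as out:
    out.write("Demonstration\nfor seeking from the end")

tail = read_tail(demo_path, 3)
whole = read_tail(demo_path, 10_000)  # asking for more than exists returns everything
os.remove(demo_path)
print(tail)
```

Counting in bytes rather than characters is the trade-off here: for pure ASCII content the two are the same, but for other encodings the number of characters returned may be smaller than nbytes.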

Related

Writing to files in ASCII with Python3, not UTF8

I have a program that I created with two sections.
The first one copies a text file with an integer in the middle of the file name in this format.
file = "Filename" + "str(int)" + ".txt"
The user can create as many copies of the file as they would like.
The second part of the program is where I am having the problem. There is an integer at the very bottom of the file that has to correspond to the integer in the file name. After the first part is done, I open each file one at a time in "r+" read/write mode, so I can file.seek(1000) to about where the integer is in the file.
In my opinion the next part should be easy: I should simply have to write str(int) into the file at that position. But it wasn't that easy. It worked just fine that way on Linux at home, but at work on Windows it proved difficult. What I ended up having to do after file.seek(1000) is write to the file using Unicode UTF-8. I accomplished this with the code snippet below, taken from the rest of the program; I have commented it so that it is clear what is going on. Instead of having to write this in Unicode, I would love to be able to write it in good old regular English ASCII characters. Eventually this program will be expanded to include a lot more data at the bottom of each file, and having to write the data in Unicode is going to make things extremely difficult. If I just write the data without turning it into Unicode, this is the result: the string is supposed to say #2 =1534, but instead it says #2 =ㄠ㌵433.
If someone can show me what I am doing wrong, that would be great. I would love to just use something like file.write('1534') to write the data to the file instead of having to do it in Unicode UTF-8.
while a1 < d1 :
    file = "file" + str(a1) + ".par"
    f = open(file, "r+")
    f.seek(1011)
    data = f.read() # reads the data from that point in the file into a variable.
    numList = list(str(a1)) # "a1" is the integer in the file name. I had to turn the integer into a list to accomplish the next task.
    replaceData = '\x00' + numList[0] + '\x00' + numList[1] + '\x00' + numList[2] + '\x00' + numList[3] + '\x00' # This line turns the integer into Utf 8 Unicode. I am by no means a Unicode expert.
    currentData = data # probably didn't need to be done now that I'm looking at this.
    data = data.replace(currentData, replaceData) # replaces the Utf 8 string in the "data" variable with the new Utf 8 string in "replaceData."
    f.seek(1011) # Return to where I need to be in the file to write the data.
    f.write(data) # Write the new Unicode data to the file
    f.close() # close the file
    f.close() # make sure the file is closed (sometimes it seems that this fails in Windows.)
    a1 += 1 # advances the integer, and then return to the top of the loop
This is an example of writing to a file in ASCII. You need to open the file in byte mode, and using the .encode method for strings is a convenient way to get the end result you want.
s = '12345'
ascii = s.encode('ascii')
with open('somefile', 'wb') as f:
    f.write(ascii)
You can obviously also open in rb+ (read and write byte mode) in your case if the file already exists.
with open('somefile', 'rb+') as f:
    existing = f.read()
    f.write(b'ascii without encoding!')
You can also just pass string literals with the b prefix, and they will be encoded with ascii as shown in the second example.
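Applied to the question's situation (overwriting a number at a known position), the binary-mode approach might look like the sketch below. The helper name overwrite_at, the sample content and the offset are all hypothetical. The interleaved \x00 bytes in the question suggest the real file may be UTF-16 rather than ASCII; if that is the case, passing encoding="utf-16-le" would be the matching choice, but that is an assumption about the file, not something the question confirms:

```python
import os
import tempfile


def overwrite_at(path, offset, text, encoding="ascii"):
    """Encode `text` and write it over the bytes starting at `offset`."""
    data = text.encode(encoding)
    with open(path, "rb+") as f:   # binary read/write, file must already exist
        f.seek(offset)
        f.write(data)              # overwrites in place, does not insert
    return len(data)


# Demo on a throwaway file with a placeholder number (hypothetical content).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"#2 =0000\n")

written = overwrite_at(path, 4, "1534")

with open(path, "rb") as f:
    result = f.read()
os.remove(path)
print(result)
```

Because the write replaces bytes in place, the new value must encode to the same number of bytes as the old one, otherwise neighbouring data is overwritten or left behind.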

a bytes-like object is required, not 'str': typeerror in compressed file

I am searching for a substring in a compressed file using the following Python script, and I am getting "TypeError: a bytes-like object is required, not 'str'". Can anyone help me fix this?
from re import *
import re
import gzip
import sys
import io
import os
seq={}
with open(sys.argv[1],'r') as fh:
    for line1 in fh:
        a=line1.split("\t")
        seq[a[0]]=a[1]
        abcd="AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
        print(a[0],"\t",seq[a[0]])
count={}
with gzip.open(sys.argv[2]) as gz_file:
    with io.BufferedReader(gz_file) as f:
        for line in f:
            for b in seq:
                if abcd in line:
                    count[b] +=1
for c in count:
    print(c,"\t",count[c])
fh.close()
gz_file.close()
f.close()
and input files are
TruSeq2_SE AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
The second file is a compressed text file. The line "if abcd in line:" raises the error.
The "BufferedReader" class gives you byte strings, not text strings, and you can't directly compare the two kinds of object in Python 3.
Since these strings just use a few ASCII characters and are not actually text, you can work with byte strings all the way through your code.
So, whenever you "open" a file (not gzip.open), open it in binary mode, i.e. use open(sys.argv[1],'rb') instead of 'r'.
Also prefix your hardcoded string with a b so that Python uses a byte string internally: abcd=b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG". This will avoid a similar error on your if abcd in line, though the error message should be different from the one you presented.
Alternatively, use text everywhere. This gives you more methods to work with the strings (Python 3's byte strings are somewhat crippled) and better presentation of data when printing, and it should not be much slower. In that case, instead of the changes suggested above, add an extra line to decode each line fetched from your data file:
with io.BufferedReader(gz_file) as f:
    for line in f:
        line = line.decode("latin1")
        for b in seq:
(Besides the error, your program logic seems to be a bit faulty, as you don't actually use a variable string in your innermost comparison, just the fixed abcd value. But I suppose you can fix that once you get rid of the errors.)
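The all-bytes variant described above can be sketched as follows. This is only an illustration under assumed data: the function name count_lines_containing and the sample reads are hypothetical, and the demo writes its own small gzip file instead of using the question's inputs:

```python
import gzip
import os
import tempfile

# The adapter from the question, as a byte string so it can be compared with bytes.
ADAPTER = b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"


def count_lines_containing(gz_path, needle):
    """Count the lines of a gzip-compressed file that contain the bytes `needle`."""
    count = 0
    with gzip.open(gz_path, "rb") as f:   # binary mode yields bytes lines
        for line in f:
            if needle in line:            # bytes in bytes: no TypeError
                count += 1
    return count


# Demo: build a tiny compressed file with made-up reads.
fd, gz_path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(gz_path, "wb") as gz:
    gz.write(b"ACGT" + ADAPTER + b"\n")
    gz.write(b"TTTTGGGGCCCC\n")
    gz.write(ADAPTER + b"AAAA\n")

matches = count_lines_containing(gz_path, ADAPTER)
os.remove(gz_path)
print(matches)
```

The same function would raise the TypeError from the question if needle were passed as a str, since the lines read in "rb" mode are bytes.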

How does one add string to tarfile in Python3

I have a problem adding a str to a tar archive in Python. In Python 2 I used this method:
fname = "archive_name"
params_src = "some arbitrarty string to be added to the archive"
params_sio = io.StringIO(params_src)
archive = tarfile.open(fname+".tgz", "w:gz")
tarinfo = tarfile.TarInfo(name="params")
tarinfo.size = len(params_src)
archive.addfile(tarinfo, params_sio)
It's essentially the same as what can be found here.
It worked well. However, after moving to Python 3 it broke and results in the following error:
  File "./translate_report.py", line 67, in <module>
    main()
  File "./translate_report.py", line 48, in main
    archive.addfile(tarinfo, params_sio)
  File "/usr/lib/python3.2/tarfile.py", line 2111, in addfile
    copyfileobj(fileobj, self.fileobj, tarinfo.size)
  File "/usr/lib/python3.2/tarfile.py", line 276, in copyfileobj
    dst.write(buf)
  File "/usr/lib/python3.2/gzip.py", line 317, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface
To be honest, I have trouble understanding where it comes from, since I do not feed any str to the tarfile module apart from the point where I construct the StringIO object.
I know the meanings of StringIO, str, bytes and such changed a bit from Python 2 to 3, but I do not see a mistake and cannot come up with better logic to solve this task.
I create the StringIO object precisely to provide buffer methods around the string I want to add to the archive, yet the error complains that some str does not provide them. On top of that, the exception is raised around lines that seem to be responsible for checksum calculations.
Can someone please explain what I am misunderstanding, or at least give an example of how to add a simple str to a tar archive without creating an intermediate file on the file system?
When writing to a file, you need to encode your unicode data to bytes explicitly. StringIO objects do not do this for you; a StringIO is an in-memory text file. Use io.BytesIO() instead and encode:
params_sio = io.BytesIO(params_src.encode('utf8'))
Adjust your encoding to your data, of course.
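Put together, a self-contained version might look like the sketch below. The helper name add_string_to_tar and the sample text are hypothetical, and the archive is written to an in-memory buffer only to keep the example free of files on disk. One detail worth noting is that tarinfo.size has to be the length of the encoded bytes, which can differ from len() of the original str when the text is not pure ASCII:

```python
import io
import tarfile


def add_string_to_tar(archive, name, text, encoding="utf-8"):
    """Add `text` to an open tarfile.TarFile as a member called `name`."""
    data = text.encode(encoding)
    info = tarfile.TarInfo(name=name)
    info.size = len(data)              # size in bytes, not in characters
    archive.addfile(info, io.BytesIO(data))


# Write a gzip-compressed archive into memory (hypothetical content).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    add_string_to_tar(archive, "params", "naïve string to be archived")

# Read it back to check the round trip.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    extracted = archive.extractfile("params").read().decode("utf-8")
print(extracted)
```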

Extracting source code from html file using python3.1 urllib.request

I'm trying to obtain data using regular expressions from a html file, by implementing the following code:
import re
import urllib.request
def extract_words(wdict, urlname):
    uf = urllib.request.urlopen(urlname)
    text = uf.read()
    print (text)
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
which returns an error:
File "extract.py", line 33, in extract_words
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Upon experimenting further in IDLE, I noticed that uf.read() does return the HTML source the first time I invoke it, but from then on it returns b''. Is there any way to get around this?
uf.read() will only read the contents once. Then you have to close it and reopen it to read it again. This is true for any kind of stream. This is however not the problem.
The problem is that reading from any kind of binary source, such as a file or a webpage, will return the data as a bytes type, unless you specify an encoding. But your regexp is not specified as a bytes type, it's specified as a unicode str.
The re module will quite reasonably refuse to use unicode patterns on byte data, and the other way around.
The solution is to make the regexp pattern a bytes string, which you do by putting a b in front of it. Hence:
match = re.findall(b"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
That should work. Another option is to decode the text so that it is also a unicode str:
encoding = uf.headers.get_content_charset()  # None if the server sends no charset
text = text.decode(encoding)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
(Also, to extract data from HTML, I would say that lxml is a better option).
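The rule that pattern and subject must be of the same type can be checked without any network access. The sketch below uses a made-up page fragment and simplified patterns (BYTES_ROW, TEXT_ROW and page are hypothetical names, not the question's data), and it assumes the page is UTF-8 for the decode step:

```python
import re

# A made-up fragment standing in for what uf.read() would return (bytes).
page = b"<tr> <td>alpha</td> <td>beta</td> </tr>"

# Same simplified pattern twice: once as bytes, once as str.
BYTES_ROW = re.compile(rb"<tr>\s*<td>(\w+)</td>\s+<td>(\w+)</td>\s*</tr>")
TEXT_ROW = re.compile(r"<tr>\s*<td>(\w+)</td>\s+<td>(\w+)</td>\s*</tr>")

# Option 1: bytes pattern on bytes data; the groups come back as bytes.
bytes_matches = BYTES_ROW.findall(page)

# Option 2: decode first, then use the str pattern; the groups come back as str.
text_matches = TEXT_ROW.findall(page.decode("utf-8"))

# Mixing the two is what raises the TypeError from the question.
try:
    TEXT_ROW.findall(page)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(bytes_matches, text_matches, mixed_ok)
```

Writing the patterns as raw literals (the r or rb prefix) also keeps sequences such as \s from being treated as string escapes.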

Comparing input against .txt and receiving error

I'm trying to compare a user's input with a .txt file, but they are never equal. The .txt file contains the number 12. When I print the object I get from opening the .txt file, it shows up as
<_io.TextIOWrapper name='text.txt' encoding='cp1252'>
my code is
import vlc
a = input("test ")
rflist = open("text.txt", "r")
print(a)
print(rflist)
if rflist == a:
    p = vlc.MediaPlayer('What Sarah Said.mp3')
    p.play()
else:
    print('no')
So am I doing something wrong with my open(), or is it something else entirely?
To print the contents of the file instead of the file object, try
print(rflist.read())
instead of
print(rflist)
A file object is not the text contained in the file itself, but rather a wrapper object that facilitates operations on the file, like reading its contents or closing it.
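The comparison itself then has to be made against the file's contents. A minimal sketch, assuming the file holds a single short value as in the question: the helper name file_matches_input and the demo file are hypothetical, and stripping whitespace is an added assumption so that a trailing newline in the file does not make the comparison fail:

```python
import os
import tempfile


def file_matches_input(path, user_input):
    """Return True if the text stored in `path` equals `user_input`."""
    with open(path, "r") as f:
        contents = f.read()            # the text, not the file object
    # Ignore surrounding whitespace such as the newline many editors append.
    return contents.strip() == user_input.strip()


# Demo with a throwaway file containing "12" followed by a newline.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("12\n")

same = file_matches_input(path, "12")
different = file_matches_input(path, "13")
os.remove(path)
print(same, different)
```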
rflist.read() or rflist.readline() is correct.
Read section 7.2 of the documentation.
Dive Into Python is a fantastic book for getting started with Python. Take a look at it and you won't be able to put it down.

Resources