How does one add a string to a tarfile in Python 3?

I have a problem adding a str to a tar archive in Python. In Python 2 I used this method:
fname = "archive_name"
params_src = "some arbitrarty string to be added to the archive"
params_sio = io.StringIO(params_src)
archive = tarfile.open(fname+".tgz", "w:gz")
tarinfo = tarfile.TarInfo(name="params")
tarinfo.size = len(params_src)
archive.addfile(tarinfo, params_sio)
It's essentially the same as what can be found here.
It worked well. However, after moving to Python 3 it broke and now fails with the following error:
File "./translate_report.py", line 67, in <module>
main()
File "./translate_report.py", line 48, in main
archive.addfile(tarinfo, params_sio)
File "/usr/lib/python3.2/tarfile.py", line 2111, in addfile
copyfileobj(fileobj, self.fileobj, tarinfo.size)
File "/usr/lib/python3.2/tarfile.py", line 276, in copyfileobj
dst.write(buf)
File "/usr/lib/python3.2/gzip.py", line 317, in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface
To be honest, I have trouble understanding where it comes from, since I do not feed any str to the tarfile module apart from the point where I construct the StringIO object.
I know the meanings of StringIO, str, bytes and such changed a bit from Python 2 to 3, but I do not see the mistake and cannot come up with better logic to solve this task.
I create the StringIO object precisely to provide buffer methods around the string I want to add to the archive. Yet it strikes me that some str does not provide them. On top of that, the exception is raised around lines that seem to be responsible for checksum calculations.
Can someone please explain what I am misunderstanding, or at least give an example of how to add a simple str to a tar archive without creating an intermediate file on the file system?

When writing to a file, you need to encode your Unicode data to bytes explicitly. A StringIO object does not do this for you; it is an in-memory text file. Use io.BytesIO() instead and encode:
params_sio = io.BytesIO(params_src.encode('utf8'))
Adjust your encoding to your data, of course.
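For completeness, here is a minimal sketch of the whole round trip under Python 3. The helper name add_text_member and the in-memory buffer are illustrative assumptions, not part of the question's code. The detail that is easy to miss is that tarinfo.size must be the length of the encoded bytes, not the number of characters in the str.

```python
import io
import tarfile


def add_text_member(archive, name, text, encoding="utf-8"):
    """Add `text` to an open tarfile as a member called `name` (hypothetical helper)."""
    payload = text.encode(encoding)
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)  # size in bytes, which can differ from len(text)
    archive.addfile(info, io.BytesIO(payload))


# Write the archive into memory so the sketch leaves no file behind.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    add_text_member(archive, "params", "some arbitrary string with a non-ASCII é")

# Read it back to check that the member survived intact.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    restored = archive.extractfile("params").read().decode("utf-8")
print(restored)
```

To write to disk as in the question, replace the fileobj=buffer argument with the file name, e.g. tarfile.open(fname + ".tgz", "w:gz").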

Parsing xml file from url to a astropy votable without downloading

From http://svo2.cab.inta-csic.es/theory/fps/ you can get the transmission curves for many filters used in astronomical observations. I would like to get these data by opening the url with the corresponding xml file (for each filter), parse it to astropy's votable that helps to read the table data easily.
I have managed to do this by downloading the file, decoding it as UTF-8 and saving it locally as an XML file. Opening the local file then works fine, as is obvious from the following example.
However, I do not want to save the file and open it again. When I tried to avoid that by calling votable = parse(xml_file), it raised OSError: File name too long, because parse treats the whole string as a file name.
from urllib.request import urlopen
from astropy.io.votable import parse
fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
url = urlopen(fltr).read()
xml_file = url.decode('UTF-8')
with open('tmp.xml','w') as out:
    out.write(xml_file)
votable = parse('tmp.xml')
data = votable.get_first_table().to_table(use_names_over_ids=True)
print(votable)
print(data["Wavelength"])
The output in this case is:
<VOTABLE>... 1 tables ...</VOTABLE>
Wavelength
AA
----------
12890.0
13150.0
...
18930.0
19140.0
Length = 58 rows
Indeed according to the API documentation, votable.parse's first argument is either a filename or a readable file-like object. It doesn't specify this exactly, but apparently the file also has to be seekable meaning that it can be read with random access.
The HTTPResponse object returned by urlopen is indeed a file-like object with a .read() method, so in principle it might be possible to pass directly to parse(), but this is how I found out it has to be seekable:
>>> fltr = 'http://svo2.cab.inta-csic.es/theory/fps/fps.php?ID=2MASS/2MASS.H'
>>> u = urlopen(fltr)
>>> parse(u)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "astropy/io/votable/table.py", line 135, in parse
_debug_python_based_parser=_debug_python_based_parser) as iterator:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 157, in get_xml_iterator
with _convert_to_fd_or_read_function(source) as fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/xml/iterparser.py", line 63, in _convert_to_fd_or_read_function
with data.get_readable_fileobj(fd, encoding='binary') as new_fd:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "astropy/utils/data.py", line 210, in get_readable_fileobj
fileobj.seek(0)
io.UnsupportedOperation: seek
So you need to wrap the data in a seekable file-like object. Along the lines of what @keflavich wrote, you can use io.BytesIO (io.StringIO won't work, as explained below).
It turns out that there's no reason to explicitly decode the UTF-8 data to unicode. I'll spare the example, but after trying it myself it turns out parse() works on raw bytes (which I find a bit odd, but okay). So you can read the entire contents of the URL into an io.BytesIO which is just an in-memory file-like object that supports random access:
>>> u = urlopen(fltr)
>>> s = io.BytesIO(u.read())
>>> v = parse(s)
WARNING: W42: None:2:0: W42: No XML namespace specified [astropy.io.votable.tree]
>>> v.get_first_table().to_table(use_names_over_ids=True)
<Table masked=True length=58>
Wavelength Transmission
AA
float32 float32
---------- ------------
12890.0 0.0
13150.0 0.0
... ...
18930.0 0.0
19140.0 0.0
This is, in general, the way in Python to do something with some data as though it were a file, without writing an actual file to the filesystem.
Note, however, that this won't work if the entire file can't fit in memory. In that case you still might need to write it out to disk. But if it's just for some temporary processing and you don't want to litter your disk with a tmp.xml like in your example, you can always use the tempfile module to, among other things, create temporary files that are automatically deleted once they're no longer in use.
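As a rough sketch of the tempfile route: the helper below writes the downloaded bytes into a temporary directory, hands the path to a path-based parser, and lets the directory disappear afterwards. The name parse_via_temp_file is made up for illustration, and xml.etree.ElementTree stands in for astropy's parse so the snippet runs without astropy or network access.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_via_temp_file(payload, parse_path):
    """Write `payload` (bytes) to a temporary file and return parse_path(path).

    The temporary directory and the file in it are removed when the `with`
    block exits, so `parse_path` must finish reading before it returns.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "download.xml")
        with open(path, "wb") as out:
            out.write(payload)
        return parse_path(path)


# Stand-in data and parser; with astropy one would pass its parse function instead.
payload = b"<VOTABLE><TABLE name='demo'/></VOTABLE>"
root = parse_via_temp_file(payload, lambda path: ET.parse(path).getroot())
print(root.tag)
```

A temporary directory is used rather than tempfile.NamedTemporaryFile because reopening a named temporary file by path while it is still open does not work on every platform.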

wkb: could not create geometry because of errors while reading input

I am trying to translate EWKB coordinates into the associated longitude and latitude in Python. The EWKB strings are listed in a one-column csv file (named "/home/nick/Documents/Sepi/WKB_coordinates_sing.csv").
I deleted the other columns for the sake of simplicity, but eventually I would like to use the original data set and read just the right column with ewkb.
Moreover, I would like to read and translate one line at a time, because I have files with millions of lines and coordinates to process.
I wrote the following code:
from shapely import wkb
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for line in f:
        hexloc=f.readline()
        print(hexloc)
        point=wkb.loads(hexloc,hex=True)
        print(point.x,point.y)
However, when I run it, I get the following:
~$ python /home/nick/Documents/Sepi/ewkb.py
0101000020E610000072604C0D47AA37402C306475ABA85140
ParseException: Premature end of HEX string
Traceback (most recent call last):
File "/home/nick/Documents/Sepi/ewkb.py", line 7, in <module>
point=wkb.loads(hexloc,hex=True)
File "/home/nick/anaconda3/lib/python3.6/site-packages/shapely/wkb.py", line 14, in loads
return reader.read_hex(data)
File "/home/nick/anaconda3/lib/python3.6/site-packages/shapely/geos.py", line 409, in read_hex
"Could not create geometry because of errors "
shapely.errors.WKBReadingError: Could not create geometry because of errors while reading input.
However, I can obtain longitude and latitude if I run the following code with the first hexadecimal string from my csv file as argument of wkb.loads:
Code:
from shapely import wkb
hexloc="0101000020E610000072604C0D47AA37402C306475ABA85140"
print(hexloc)
point=wkb.loads(hexloc,hex=True)
print(point.x,point.y)
Result:
~$ python /home/nick/Documents/Sepi/ewkb.py
0101000020E610000072604C0D47AA37402C306475ABA85140
23.665146666666665 70.63546500000001
Thank you in advance!
There seem to be several possible issues. First, your code snippet is mixing iteration and direct "read" methods. With this example:
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for line in f:
        hexloc=f.readline()
        #do something with hexloc
hexloc will effectively take only every second line of the input file. You might want to replace this with:
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for hexloc in f:
        pass  # do something with hexloc
Moreover, when you read the input lines like this, they retain the trailing newline, which confuses the loads method. I suggest trying:
with open ("/home/nick/Documents/Sepi/WKB_coordinates_sing.csv") as f:
    for line in f:
        hexloc = line.strip()
        point = wkb.loads(hexloc, hex=True)
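Both points can be checked without shapely. The sketch below first reproduces the skipping behaviour on an in-memory file, then reads a single column row by row with the csv module, which is one way to handle the original multi-column data set. The helper name iter_column and the sample rows are assumptions made for the demonstration; each yielded value is what you would pass to wkb.loads(value, hex=True).

```python
import csv
import io


def iter_column(lines, column=0):
    """Yield the stripped, non-empty values of one CSV column, one row at a time."""
    for row in csv.reader(lines):
        if len(row) > column:
            value = row[column].strip()
            if value:
                yield value


HEX = "0101000020E610000072604C0D47AA37402C306475ABA85140"
# Four lines: two normal rows, one with stray whitespace, one blank line.
sample = io.StringIO(f"{HEX},first\n{HEX} ,second\n\n{HEX},third\n")

# Mixing iteration with readline() consumes two lines per pass,
# so only every second line ends up in `skipping`.
skipping = []
for line in sample:
    skipping.append(sample.readline())

# Reading through csv.reader visits every row and drops the newline.
sample.seek(0)
values = list(iter_column(sample))
print(len(skipping), len(values))
```

Because csv.reader pulls lines lazily from the file object, this still processes one row at a time and should cope with files of millions of lines.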

Doesn't Python's ffmpy work with temporary files made using tempfile.TemporaryFile?

I am writing a function that takes an mp3 file, then analyses and processes it. Following this SO answer, I create a temporary wav file and then use the Python ffmpy library to convert the mp3 (the given file) to wav. The catch is that I pass the temporary wav file created above to ffmpy as the output file, i.e. I am doing this:
import ffmpy
import tempfile
from scipy.io import wavfile
# audioFile variable is known here
tempWavFile = tempfile.TemporaryFile(suffix="wav")
ff_obj = ffmpy.FFmpeg(
    global_options="hide_banner",
    inputs={audioFile:None},
    outputs={tempWavFile: " -acodec pcm_s16le -ac 1 -ar 44000"}
)
ff_obj.run()
[fs, frames] = wavfile.read(tempWavFile)
print(" fs is: ", fs)
print(" frames is: ", frames)
But on line ff_obj.run() this error occurs:
File "/home/tushar/.local/lib/python3.5/site-packages/ffmpy.py", line 95, in run
stderr=stderr
File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.5/subprocess.py", line 1490, in _execute_child
restore_signals, start_new_session, preexec_fn)
TypeError: Can't convert '_io.TextIOWrapper' object to str implicitly
So, my question is:
When I replace tempWavFile = tempfile.TemporaryFile(suffix="wav") with tempWavFile = tempfile.mktemp('.wav'), no error occurs. Why?
What does this error mean, what causes it, and how can it be corrected?
tempfile.TemporaryFile returns a file-like object:
>>> tempWavFile = tempfile.TemporaryFile(suffix="wav")
>>> tempWavFile
<_io.BufferedRandom name=12>
On the other hand, tempfile.mktemp returns a string: a path name that did not exist at the time of the call. The file itself is not created:
>>> tempWavFile = tempfile.mktemp('.wav')
>>> tempWavFile
'/var/folders/f1/9b4sf0gx0dx78qpkq57sz4bm0000gp/T/tmpf2117fap.wav'
After creating tempWavFile, you pass it to ffmpy.FFmpeg, which will aggregate input and output files and parameters in a single command, to be passed to subprocess. The command-line takes the form of a list, and probably looks something like the following: ["ffmpeg", "-i", "input.wav", "output.wav"].
Finally, ffmpy passes this list to subprocess.Popen and that's where it explodes when you use tempfile.TemporaryFile. This is normal because subprocess does not know anything about your arguments and expects all of them to be strings. When it sees the _io.BufferedRandom object returned by tempfile.TemporaryFile, it doesn't know what to do.
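So, to fix it, pass a path string instead of a file object. tempfile.mkstemp creates the file securely and returns a tuple of an open file descriptor and the path; give the path to ffmpy. It is also safer than tempfile.mktemp.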
From the Python docs:
tempfile.mkstemp(suffix=None, prefix=None, dir=None, text=False)
Creates a temporary file in the most secure manner possible.
...
Unlike TemporaryFile(), the user of mkstemp() is responsible for deleting the temporary file when done with it.
You originally mentioned mktemp, which is deprecated since Python 2.3 (see docs) and should be replaced by mkstemp.
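A minimal sketch of the mkstemp pattern follows. To keep it runnable without ffmpeg or ffmpy installed, a small Python child process stands in for the converter; that substitution is an assumption for illustration only. What matters is that every element of the command list is a str and that the caller removes the file afterwards.

```python
import os
import pathlib
import subprocess
import sys
import tempfile

# mkstemp returns a tuple: (open OS-level file descriptor, path string).
fd, wav_path = tempfile.mkstemp(suffix=".wav")
os.close(fd)  # the child process opens the path itself, so the descriptor is not needed

try:
    # Stand-in for the ffmpeg invocation: a child process that writes to the path.
    child_code = "import pathlib, sys; pathlib.Path(sys.argv[1]).write_bytes(b'RIFF')"
    subprocess.run([sys.executable, "-c", child_code, wav_path], check=True)
    data = pathlib.Path(wav_path).read_bytes()
finally:
    os.remove(wav_path)  # unlike TemporaryFile, mkstemp leaves cleanup to the caller

print(data)
```

With the real tool, wav_path would be the key of the outputs dictionary. Since mkstemp has already created the file and ffmpeg normally asks before overwriting an existing output, its -y option would likely be needed among the global options.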

How to visualize Pyfst transducers via dot files

I am learning how to create transducers with Pyfst and I am trying to visualize the ones I create. The ultimate goal is to be able to write the transducers to dot files and see them in Graphviz.
I took a sample code to see how to visualize the following acceptor.
a = fst.Acceptor()
a.add_arc(0, 1, 'x', 0.1)
a[1].final = -1
a.draw()
When I use draw(), which comes with the package, I get an error:
File "/Users/.../tests.py", line 42, in <module>
a.draw()
File "_fst.pyx", line 816, in fst._fst.StdVectorFst.draw
(fst/_fst.cpp:15487)
File "/Users/.../venv-3.6/lib/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: cannot use a string pattern on a bytes-like object
If I try to write the above mentioned acceptor to .dot via this:
def fst_dot(dot_object, filename):
    path, file = split(filename)
    new_path = join(dot_files_folder_path, path)
    if not os.path.exists(new_path):
        os.makedirs(new_path)
    if hasattr(dot_object, 'dotFormat'):
        draw_string = dot_object.dotFormat()
    else:
        draw_string = dot_object.draw()
    open(join(dot_files_folder_path, filename + ".dot"), "w").write(draw_string)
then also I get the following error:
File "/Users/...tests.py", line 43, in <module>
fst_dot(a, 'acceptor')
File "/Users/...tests.py", line 22, in fst_dot
draw_string = dot_object.draw()
File "_fst.pyx", line 816, in fst._fst.StdVectorFst.draw
(fst/_fst.cpp:15487)
File "/Users/.../venv-3.6/lib/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: cannot use a string pattern on a bytes-like object
So, both errors look the same - there is some kind of an issue with draw().
On the pyfst site it says that draw is used for dot format representation of the transducer.
I can't understand how to fix the error. Any help would be greatly appreciated.
I am using OSX and PyCharm.
You might try using Python2 to see if that helps at all.
However, I think you'll be better off using the Python bindings that are included with OpenFST 1.5+. That library also has the ability to write to GraphViz .dot files. There is documentation available here:
http://www.openfst.org/twiki/bin/view/FST/PythonExtension
I recommend the fstdraw command from OpenFst.
First call a.write('tmp.fst') in Python.
Then run $ fstdraw tmp.fst > tmp.dot in the shell.
EDIT:
Finally, I found that UFAL's forked pyfst works fine with python3.
https://github.com/UFAL-DSG/pyfst

a bytes-like object is required, not 'str': typeerror in compressed file

I am searching for a substring in a compressed file using the following Python script. I am getting "TypeError: a bytes-like object is required, not 'str'". Can anyone please help me fix this?
from re import *
import re
import gzip
import sys
import io
import os
seq={}
with open(sys.argv[1],'r') as fh:
    for line1 in fh:
        a=line1.split("\t")
        seq[a[0]]=a[1]
abcd="AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
print(a[0],"\t",seq[a[0]])
count={}
with gzip.open(sys.argv[2]) as gz_file:
    with io.BufferedReader(gz_file) as f:
        for line in f:
            for b in seq:
                if abcd in line:
                    count[b] +=1
for c in count:
    print(c,"\t",count[c])
fh.close()
gz_file.close()
f.close()
The first input file is:
TruSeq2_SE AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
The second file is a compressed text file. The line "if abcd in line:" raises the error.
The "BufferedReader" class gives you byte strings, not text strings, and you can't directly compare the two kinds of object in Python 3.
Since these strings just use a few ASCII characters and are not actually text, you can work with byte strings throughout your code.
So, whenever you "open" a file (not gzip.open), open it in binary mode, i.e. use open(sys.argv[1],'rb') instead of 'r'.
Also prefix your hardcoded string with a b so that Python uses a byte string internally: abcd=b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG". This will avoid a similar error on your if abcd in line, though the error message should be different from the one you presented.
Alternatively, use everything as text. This gives you more methods to work with the strings (Python 3's byte strings are somewhat crippled) and a nicer presentation of the data when printing, and it should not be much slower. In that case, instead of the changes suggested above, add an extra line to decode each line fetched from your data file:
with io.BufferedReader(gz_file) as f:
    for line in f:
        line = line.decode("latin1")
        for b in seq:
(Besides the error, your program logic seems to be a bit faulty, as you don't actually use a variable string in your innermost comparison, just the fixed abcd value, but I suppose you can fix that once you get rid of the errors.)
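A variant of the text-based approach is to let gzip do the decoding by opening the file in "rt" mode, so every line arrives as a str. The sketch below is one possible reading of what the script is meant to do: count, per adapter name, the lines that contain that adapter's sequence. The function name count_adapters, the sample reads and the temporary file are assumptions made for illustration.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_adapters(gz_path, adapters):
    """Count, per adapter name, the lines of a gzipped text file that contain its sequence."""
    counts = Counter({name: 0 for name in adapters})  # start every adapter at zero
    with gzip.open(gz_path, "rt", encoding="latin1") as reads:
        for line in reads:  # each line is a str because of the "rt" mode
            for name, sequence in adapters.items():
                if sequence in line:
                    counts[name] += 1
    return counts


adapters = {"TruSeq2_SE": "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"}

# Build a small gzipped sample so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "reads.txt.gz")
    with gzip.open(gz_path, "wt", encoding="latin1") as out:
        out.write("ACGT" + adapters["TruSeq2_SE"] + "TT\n")
        out.write("ACGTACGTACGT\n")
        out.write(adapters["TruSeq2_SE"] + "\n")
    counts = count_adapters(gz_path, adapters)

print(counts["TruSeq2_SE"])
```

When building the adapter dictionary from the tab-separated file, strip each line first; otherwise the sequence keeps its trailing newline and will only match at the end of a line. Starting each count at zero also avoids the KeyError that count[b] += 1 would raise on an empty dictionary.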
