Python 3, can I tell if a file opened in binary mode contains text? - python-3.x

I am upgrading some code from python 2 to python 3.
There is a function to open and read files. In Python 2 there was no need to specify whether a file should be opened in binary or text mode, while in Python 3 I should specify the mode.
The python 2 code is:
with open(f_path, mode=open_mode) as fp:
    content = fp.read()
This is causing me problems as it is called by various other functions where I don't necessarily know the file type in advance. (Sometimes the data is written to a zip file, other times the data is returned via an HTTP endpoint).
I expect most data will be binary image files, though CSV and text files will also be present.
What would be the best way of opening a file of unknown type and detecting if it is binary or string data?
Is it possible for example to open a file in binary mode, then detect that it contains text and convert it (or alternatively generate an exception and open it in string mode instead)?

You might try the binaryornot library.
pip install binaryornot
Then in the code:
from binaryornot.check import is_binary
is_binary(f_path)
Here is their documentation:
https://pypi.org/project/binaryornot/
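For example, a minimal sketch of how this could be wired into a generic reader (read_file here is a hypothetical helper, not part of your existing code) might be:
from binaryornot.check import is_binary

def read_file(f_path):
    # is_binary() only inspects the start of the file, so this is one cheap
    # extra read before choosing the open mode
    if is_binary(f_path):
        with open(f_path, 'rb') as fp:
            return fp.read()   # bytes
    with open(f_path, 'r') as fp:
        return fp.read()       # str
The caller can then check isinstance(content, bytes) if it needs to know which kind of data came back.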

Related

Is 3to2 python backporting valid?

I have Python 3 code which I am backporting to Python 2. I have used 3to2, and there are certain changes which I am not sure of.
3to2 is prepending u'' to each string, including docstrings, headers, print statements, and even when accessing response JSON from the API.
E.g. requests.get(u'geturl')
Headers[u'Authorization']
Is this valid, or can they simply be plain strings?
Regarding open() when opening files, it has been translated as
from io import open
f = open(filename, u'rb')
Why not use the built-in open, and why the u'' before the 'rb'?

Loading .npz with Python 3.5 always crashes

In this simple tutorial written in Python 2.7, they have a line loading the numpy array.
train_data = np.load(open('../musicnet.npz','rb'))
Then, they get the data by calling different keys
X,Y = train_data['2494']
Everything works well in Python 2.7.
The data type of train_data is numpy.lib.npyio.NpzFile.
My problem
However, whenever I try to do the same in Python 3.5, most of the lines work fine, except the line X,Y = train_data['2494'], where it just freezes forever. I would like to use Python 3.5 because my other projects are written in Python 3.5.
How to rewrite this line so that it runs with Python 3.5?
Error Message
I finally managed to get the error message in the terminal.
It freezes there because there is a huge amount of output right after the error message; my Jupyter notebook just cannot handle that much information.
Solution
Change the encoding to 'bytes'
train_data = np.load('../musicnet.npz', encoding='bytes')
Then everything works fine.
You first said things crashed, now you say it freezes when trying to access a specific array. numpy has the same syntax in 3.5 as in 2.7. You shouldn't have to rewrite anything.
np.load does have a couple of parameters that deal with differences between Py2 and Py3. But I'm not sure these are an issue for you.
fix_imports : bool, optional
Only useful when loading Python 2 generated pickled files on Python 3,
which includes npy/npz files containing object arrays. If `fix_imports`
is True, pickle will try to map the old Python 2 names to the new names
used in Python 3.
encoding : str, optional
What encoding to use when reading Python 2 strings. Only useful when
loading Python 2 generated pickled files in Python 3, which includes
npy/npz files containing object arrays. Values other than 'latin1',
'ASCII', and 'bytes' are not allowed, as they can corrupt numerical
data. Default: 'ASCII'
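For example, a sketch only, assuming the archive really was written by Python 2 and contains pickled object arrays:
import numpy as np

# pickle-compatibility options for a Python 2 generated .npz
# (on newer numpy versions you may also need allow_pickle=True)
train_data = np.load('../musicnet.npz', fix_imports=True, encoding='latin1')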
Try
print(list(train_data.keys()))
This should show the array names that were saved to the zip archive. Do they match the names in the Py2 load? Do they include the '2494' name?
A couple of things are unusual about:
X,Y = train_data['2494']
Naming an array in the zip archive by a string number, and unpacking the load into two variables.
Do you know anything about how this was saved with savez? What was saved?
Another question - are you loading this file from the same machine that Py2 worked on? Or has the file been transferred from another machine, and possibly corrupted?
As those parameters indicate, there are differences in the pickle code between Py2 and Py3. If the original save included object dtype arrays, or non-array objects, then they will be pickled and there might be incompatibilities in the pickle versions.
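Purely for illustration, a hypothetical way such an archive could have been produced (a guess, not the tutorial's actual code) is:
import numpy as np

# X and Y stand in for whatever was really stored under the '2494' key
X = np.zeros((100, 5))
Y = np.zeros(100)

# storing the pair as a single object-dtype array forces pickling, and
# pickled object arrays are exactly where the Py2/Py3 encoding issue bites
np.savez('../musicnet.npz', **{'2494': np.array([X, Y], dtype=object)})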
Try this,
with np.load('../musicnet.npz') as train_data:
    X, Y = train_data['2494']
There are two ways out, from my point of view:
1. Re-edit your code from
train_data = np.load(open('../musicnet.npz','rb'))
to
train_data = np.load(open('../musicnet.npz','r'))
because the behaviour of 'r' vs 'rb' differs between Python 2.7 and 3.5 in your situation.
2. Use the default debugger to point at the significant error. (This usually works, in my experience.)

Python 2.7 literal_eval() with a UTF-8 string

I'm updating an older application for Python 3, but trying to maintain compatibility if possible with Python 2.7. One of the issues I've encountered deals with inconsistencies in ast.literal_eval() between Python 2 & 3, when handling a UTF-8 string.
Specifically, one of the functions my application performs involves:
Reading a string from a UTF-8 encoded text file that represents a Python list of file names
Converting that UTF-8 string to a Python list via literal_eval()
Using that list to access those files and perform other processing.
My test .txt file has this string:
['FileName1.txt', 'CP1252-1-àlacrème.txt', 'dUTF8-1-木兰辞.txt']
I'm using this brief test script to emulate what the larger application does:
import io
from ast import literal_eval
with io.open('z.txt','r',encoding='utf_8') as inFile:
    inStr = inFile.read()
print('Input string is length '+str(len(inStr)))
fileList = literal_eval(inStr)
print(fileList)
Now, when I run this test script on Python 3, I get the following (all OK and as expected) result:
Input string is length 61
['FileName1.txt', 'CP1252-1-àlacrème.txt','dUTF8-1-???.txt']
(The question marks are expected as this is a Windows CMD window; it doesn't handle non-latin-1 characters)
But anyway, when I run the same script with the same file on Python 2.7, I get this result:
Input string is length 61
['FileName1.txt', 'CP1252-1-\xc3\xa0lacr\xc3\xa8me.txt', 'dUTF8-1-\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e.txt']
So literal_eval() isn't maintaining the UTF-8 encoding in the resulting list. (Or, I guess, it's trying to maintain the encoding but the best it can do is represent the non-ASCII data as individual byte values.)
My question is: is there any way to make the Python 2 literal_eval() give the same result as the Python 3 version? Or am I stuck with this as a limitation?
As mentioned in the comments, ast.literal_eval parses the input differently between Python 2 and 3. It is better not to write Python source as a data file, but to use a module like pandas with .csv files:
If the input is a UTF-8 file with content:
FileName1.txt,CP1252-1-àlacrème.txt,dUTF8-1-木兰辞.txt
Then pandas can read it with:
import pandas as pd
data = pd.read_csv('test.txt',encoding='utf8',header=None)
print(data)
Output (Windows terminal Python 3, need appropriate font):
0 1 2
0 FileName1.txt CP1252-1-àlacrème.txt dUTF8-1-木兰辞.txt
Output (Windows IDLE, Python 2 in console needs appropriate code page to view ideographs):
0 1 2
0 FileName1.txt CP1252-1-àlacrème.txt dUTF8-1-木兰辞.txt
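If the goal is still to end up with a Python list of file names, as with the original literal_eval() approach, the first row can be converted back to a list, for example:
fileList = data.iloc[0].tolist()
print(fileList)   # ['FileName1.txt', 'CP1252-1-àlacrème.txt', 'dUTF8-1-木兰辞.txt']
On Python 2 the elements should come back as unicode objects, which is what the question was after.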

How to open a binary file in my case .nii file using node.js

I want to open a binary file; or at least, when I try to open this file with the VS Code editor, it says that it can't be opened because it is a binary file.
Can someone explain to me what I can do in order to open this type of file and read its content?
About the .nii file format: it is NIFTI-1, used in medical visualization such as MRI.
What I am trying to do is read this file at the lowest level and then perform some computations.
I would like to use Node.js for this, not Python or C++.
More details about the file format can be found here.
https://nifti.nimh.nih.gov/
I don't know how VS Code handles binary files, but with Atom (or another text editor like vi), for example, you can open and view the content of a binary file. This is not very useful, however, as the content is not particularly human-readable, except maybe some metadata at the top of the file.
$ vim yourniifile.nii
Anyway, it all depends on what you want to do with that file, which "computations" you plan to apply to it, and how you will use it afterwards.
Luckily, there are some npm packages that can help you with the task of reading and processing that kind of file, like nifti-reader-js or nifti-js, for example:
const fs = require('fs');
const niftijs = require('nifti-js');
let rawData = fs.readFileSync('yourniifile.nii');
let data = niftijs.parse(rawData);
console.log(data);

pcap file viewing library in python 3

I'm looking at trying to read pcap files from various CTF events.
Ideally, I would like something that can do the breakdown of information that Wireshark does, but just being able to read the timestamp and return the packet as a bytestring of some kind would be welcome.
The problem is that there is little or no python 3 support with all the commonly cited libraries: dpkt, pylibpcap, pcapy, etc.
Does anyone know of a pcap library that works with python 3?
To my knowledge, there are at least two packages that seem to work with Python 3: pure-pcapfile and dpkt:
pure-pcapfile is easy to install in Python 3 using pip. It's very easy to use but still limited to decoding Ethernet and IP data. The rest is left to you. But it works right out of the box.
dpkt doesn't work right out of the box and needs some manipulation first. They are porting it to Python 3 and plan to have a Python 2 and 3 compatible version for version 2.0. Unfortunately, it's not there yet. However, it is far more complete than pure-pcapfile and can decode many protocols. If your packet embeds several layers of protocols, it will decode them automatically for you. The only problem is that you need to make a few corrections here and there to make it work (at the time of writing this comment).
pure-pcapfile
The only one that I found working for Python 3 so far is pcapfile. You can find it at https://pypi.python.org/pypi/pypcapfile/ or install it by doing pip3 install pypcapfile.
There are just basic functionalities but it works very well for me and has been updated quite recently (at the time of writing this message):
from pcapfile import savefile
file = open('mypcapfile.pcap', 'rb')
pcapfile = savefile.load_savefile(file, verbose=True)
If everything goes well, you should see something like this:
[+] attempting to load mypcapfile.pcap
[+] found valid header
[+] loaded 1234 packets
[+] finished loading savefile.
A few remarks now. I'm using Python 3.4.3, and doing import pcapfile will not import everything (I'm still a beginner with Python), only the basic information and functions from the package. Next, you have to explicitly open your file in read-binary mode by passing 'rb' as the mode in the open() function. The documentation doesn't say this explicitly.
The rest is like in the documentation:
packet = pcapfile.packets[12]
to access packet number 12 (the 13th packet, then, since the first one is at index 0). And you have basic functionality like
packet.timestamp
to get a timestamp or
packet.raw()
to get raw data.
The documentation mentions functions to do packet decoding of some standard formats like Ethernet and IP.
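Pulling those pieces together, a minimal loop over the whole capture (using only the attributes shown above) could look like this:
from pcapfile import savefile

with open('mypcapfile.pcap', 'rb') as f:
    capture = savefile.load_savefile(f, verbose=True)

for packet in capture.packets:
    # capture timestamp and length of the raw packet bytes
    print(packet.timestamp, len(packet.raw()))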
dpkt
dpkt is not available for Python 3, so you need to do the following, assuming you have access to a command line. The code is available at https://github.com/kbandla/dpkt.git and you must download it first:
git clone https://github.com/kbandla/dpkt.git
cd dpkt
git checkout --track origin/migrate_py3
git pull
These 4 commands do the following:
clone (download) the code from its git repository on github
go into the newly created directory named dpkt
switch to the branch named migrate_py3, which contains the Python 3 code. As you can see from the name of this branch, it's still experimental. So far it works for me.
(just in case) download the code again
Then copy the directory named dpkt into your project, or wherever Python 3 can find it.
Later on, in Python 3 here is what you have to do to get started:
import dpkt
file = open('mypcapfile.pcap','rb')
will open your file. Don't forget the 'rb' binary mode in Python 3 (same thing as in pure-pcapfile).
pcap = dpkt.pcap.Reader(file)
will read and decode your file
for ts, buf in pcap:
    eth = dpkt.ethernet.Ethernet(buf)
    print(eth)
will, for example, decode the Ethernet packets and print them. Then read the documentation on how to use dpkt. If your packets contain an IP or TCP layer, then dpkt.ethernet.Ethernet(buf) will decode them as well. Also note that in the for loop, we have access to the timestamps in ts.
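If you then want to go below the Ethernet layer, a small sketch (standard dpkt attribute access, assuming the capture actually contains IP/TCP traffic) might be:
import dpkt

with open('mypcapfile.pcap', 'rb') as file:
    pcap = dpkt.pcap.Reader(file)
    for ts, buf in pcap:
        eth = dpkt.ethernet.Ethernet(buf)
        ip = eth.data
        if isinstance(ip, dpkt.ip.IP):            # eth.data holds the IP layer when present
            tcp = ip.data
            if isinstance(tcp, dpkt.tcp.TCP):     # and ip.data the TCP layer
                print(ts, tcp.sport, tcp.dport, len(tcp.data))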
You may want to iterate over it in a less constrained way, and doing the following will help:
(ts,buf) = next(pcap)
eth = dpkt.ethernet.Ethernet(buf)
where the first line gets the next (timestamp, buffer) tuple from the pcap file; once the reader is exhausted, next() raises StopIteration.
