Reading from a binary file and decoding using Python - python-3.x

I have a binary file from a mainframe which I'm trying to read using Python and produce a human readable text file. I'm still gathering more information about the file. What I do know is that the file serves as input to COBOL programs.
I try to read the file into Python like this:
with open('P_MF.DAT', mode='rb') as f:
    file_content = f.read(500)
When I print(file_content) I get something like:
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00########\x00\x00###\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0####\x00\x00\x00\x00\x00\x00\x00\x00###\x00\x00\x00\x00######\xf0\xf0\xf0\xf4\x00\x00\x00\x00\x08\x02\x00\x00Q\x08c\x18\x1f\xc5###\x00\x00\x000\x00\x00\x0f\x00\x00\x00\x01\x11?\x00\x00\x10\x02F\x17o##\xd5#\xc9\xd5\xc7\xd3\xc9\xe2#\xd4\xc1\xd9\xe8#############################\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00'
Then I tried this using the codecs module which also gives me gibberish:
import codecs
file_content1 = codecs.decode(file_content, 'cp500')
But I can see a few readable characters in the output when I print(file_content1):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00 000000000000000 \x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00 0004\x00\x00\x00\x00\x97\x02\x00\x00é\x97Ä\x18\x1fE \x00\x00\x00\x90\x00\x00\x0f\x00\x00\x00\x01\x11\x1a\x00\x00\x10\x02ã\x87? N INGLIS MARY \x00\x00\x00\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00'
I've been googling around for a couple of days. Tried a number of things like this - Python read a binary file and decode
I feel like I'm getting nowhere with this problem. I also plan to ask how this file looks if read in a mainframe. I'd appreciate any info/help/advice at this point.
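One possible reading of the dump above is that the record mixes EBCDIC text fields with binary or packed-decimal (COMP-3) fields, which would explain why decoding the whole buffer as cp500 yields only partly readable output. The sketch below decodes field by field instead. The helper names are hypothetical, the field boundaries are assumptions (the real layout would come from the COBOL copybook), and treating the trailing `\x00...\x0c` run as a packed-decimal zero is a guess based on its shape.

```python
def unpack_comp3(raw):
    """Decode an IBM packed-decimal (COMP-3) field into an int.

    Each byte holds two 4-bit digits; the final nibble is the sign
    (0xB or 0xD means negative, other sign nibbles are treated as positive).
    """
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    if any(n > 9 for n in nibbles):
        raise ValueError("not a packed-decimal field: %r" % (raw,))
    value = int("".join(str(n) for n in nibbles)) if nibbles else 0
    return -value if sign in (0x0B, 0x0D) else value


def decode_text(raw, encoding="cp500"):
    """Decode a text-only slice of the record from EBCDIC."""
    return raw.decode(encoding)


# Bytes copied from the dump above; the slice positions are assumed.
surname = decode_text(b"\xc9\xd5\xc7\xd3\xc9\xe2")
amount = unpack_comp3(b"\x00\x00\x00\x00\x00\x00\x00\x0c")
```

With the copybook in hand, each record could be sliced at the documented offsets and each slice passed to whichever helper matches its PIC clause.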

Related

Julia: Using ProtoBuf to read messages from gzipped file

A sensor provides a stream of frames containing object coordinates, which are stored in ProtoBuf format in a gzipped file. I would like to read this file in Julia.
Using protoc, I have generated the Protobuf files for both Python and Julia, coordinate_push.py and coordinate_push.jl
My Python code is as follows:
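import gzip
from google.protobuf.internal.decoder import _DecodeVarint32
from src.proto import coordinate_push

frameList = []
with gzip.open(filePath) as f:
    data = f.read()
    next_pos, pos = 0, 0
    while pos < len(data):
        msg = coordinate_push.CoordinatesFrame()
        # _DecodeVarint32 returns (value, new position); here the value is the message length
        next_pos, pos = _DecodeVarint32(data, pos)
        msg.ParseFromString(data[pos:pos + next_pos])
        frameList.append(msg)
        pos += next_pos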
I'd like to rewrite the above in Julia, and don't know where to start. Part of the problem is that I haven't fully understood the Python script (IO is not my strong point).
I understand that I need:
to open the gzip file, presumably using using GZip; file = GZip.open(file_path, "r")
to read in the data, along the lines of using ProtoBuf; data = readproto(iob, CoordinatesFrame())
What I don't understand is:
how to define iob, and especially how to link it to file (in the Julia Protobuf manual, we had iob = PipeBuffer(), but here it's a gzip-file that we'd like to read)
how to replicate the while-loop in Julia, and in particular the mysterious _DecodeVarint32 (I'm on Windows, if it's related to that.)
whether the file coordinate_push.jl has to be in the same directory as my main file, and if not, how I can properly import it (it is currently in a proto subfolder, and in Python I'd import it using from src.proto import coordinate_push)
Insight on any of the three points would be highly appreciated.
You should open an issue on the Gzip GitHub repo and ask this first part of your question there (I am not a Gzip expert unfortunately).
On the second point, I suggest looking here: https://github.com/JuliaIO/FileIO.jl/blob/master/README.md for lots of examples of FileIO loops, which seem to be exactly what you need to replicate that Python loop. As for _DecodeVarint32, your best bet is to hunt down its definition on GitHub or in the docs.
For the third question, coordinate_push.jl does not need to be in the same folder as your "main file" (I am not sure what you mean by this, so it would help to add context on the structure of your files). To import that file, all you need to do is add include("path/to/coordinate_push.jl") at the top of the file you want to call/run the code from. It's worth noting that the path can either be the absolute path or the relative project path (in some cases).
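To make the Python loop less mysterious: it appears to read a stream of length-delimited messages, where each message is preceded by its byte length encoded as a base-128 varint. The sketch below re-implements that framing in plain Python, without protobuf, so the logic can be ported to Julia step by step. The names decode_varint and split_frames are hypothetical stand-ins, not part of any library.

```python
def decode_varint(data, pos):
    """Return (value, new_pos) for the base-128 varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear marks the last byte
            return result, pos
        shift += 7


def split_frames(data):
    """Split a buffer of varint-length-prefixed messages into raw byte frames."""
    frames = []
    pos = 0
    while pos < len(data):
        size, pos = decode_varint(data, pos)
        frames.append(data[pos:pos + size])
        pos += size
    return frames


# Two frames: 3 bytes, then 300 bytes (300 encodes as 0xAC 0x02).
stream = bytes([3]) + b"abc" + bytes([0xAC, 0x02]) + b"x" * 300
frames = split_frames(stream)
```

Each element of frames is what the original loop hands to ParseFromString, so a Julia port would need the same two steps: read the length prefix, then parse exactly that many bytes.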

Read .csv that contains commas

I have a .csv file that contains multiple columns with texts in it. These texts contain commas, which makes things messy when I try to read the file into Python.
When I tried:
import pandas as pd
directory = 'some directory'
dataset = pd.read_csv(directory)
I got the following error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 42, saw 5
After doing some research, I found the clevercsv package.
So, I ran:
import clevercsv as csv
dataset = csv.read_csv(directory)
Running this, I got the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4359705: character maps to <undefined>
To overcome this, I tried:
dataset = csv.read_csv(directory, encoding="utf8")
However, 10 hours later my computer was still working on reading it. So I expect that something went wrong there.
Furthermore, when I open the file in Excel, it does split cells well. Therefore, what I tried was to save the .csv file as a .xlsx and then save it as .csv in Python with an uncommon delimiter ('~'). However, when I save my .csv file as a .xlsx file, the size of the file gets smaller, which indicates that only a part of the file is saved and that is not what I want.
Lastly, I have tried the solutions given here and here. But neither seems to work for my problem.
Given that Excel reads in the file without problems, I do expect that it should be possible to read it into Python as well. Who can help me with this?
UPDATE:
When using dataset = pd.read_csv(directory, sep = ',', error_bad_lines=False) I manage to read in the .csv. But many lines are skipped. Is there a better way for this?
pandas should work: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Have you tried something like dataset = pd.read_csv(directory, sep = ',', header = None)?
Regards
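For context on why commas inside text need not break parsing: CSV readers treat a comma as a separator only when it falls outside a quoted field. The sketch below uses the standard library csv module on made-up sample data (the column names and rows are invented for illustration). Whether the real file quotes its text columns consistently is an assumption that would need checking, for example by inspecting line 42 from the error message.

```python
import csv
import io

# Invented sample: commas and doubled quotes inside quoted text fields.
sample = (
    'id,text,label\n'
    '1,"Hello, world",pos\n'
    '2,"He said ""hi"", then left",neg\n'
)


def read_rows(handle):
    """Parse CSV rows, keeping commas that sit inside double-quoted fields."""
    return list(csv.reader(handle, delimiter=",", quotechar='"'))


rows = read_rows(io.StringIO(sample))
```

pandas.read_csv uses the same double-quote convention by default, so a line that still fails there probably has unquoted commas or unbalanced quotes. Note also that error_bad_lines was deprecated in pandas 1.3 in favour of on_bad_lines.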

Search ’, Â, � etc... How to fix strange encoding characters in python

I tried to retrieve data from Google+ using API. When I wrote data into csv file, I observed weird and strange characters like 😀😄😚😉😠’
After googling, I concluded this is an encoding issue.
To write retrieved data in a file, I used the following code:
file = open('filename', 'a', encoding='utf-8')
writer = csv.writer(file)
writer.writerow(values)
To check my terminal encoding, I used
import sys
sys.getdefaultencoding()
Output is: utf-8
Don't know where is the problem?
Your minimal, reproducible example appears too minimal to be complete and verifiable. In any case, it looks like double mojibake:
value = "‘😀😄😚😉😠’" ### gathered from the question
print(value.encode('cp1252','backslashreplace').decode('utf-8','backslashreplace'))
‘😀😄😚😉😠’
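To illustrate the mechanism: mojibake of this kind typically arises when UTF-8 bytes are decoded as cp1252, and "double" mojibake when that happens twice. The sketch below builds the garbled text programmatically and then reverses it. The helper names are hypothetical, and the round trip only works when every intermediate byte is defined in cp1252, which is an assumption that holds for this sample but not for all text.

```python
def make_mojibake(text):
    """Simulate the bug: UTF-8 bytes wrongly decoded as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def fix_mojibake(garbled):
    """Reverse one round of the bug: back to bytes via cp1252, then decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


clean = "‘café’ 😀"
once = make_mojibake(clean)    # single mojibake
twice = make_mojibake(once)    # double mojibake, as in the question
repaired = fix_mojibake(fix_mojibake(twice))
```

Repairing the output after the fact treats the symptom. The more durable fix is to find where the data is first decoded with the wrong codec.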

Python 3.1 server-side can't output Unicode string to client

I'm using a free web host but choosing not to work with any Python framework, and am stuck trying to print Chinese characters saved in the source file (using emacs to save file encoded in utf-8) to the resulting HTML page. I thought Unicode "just works" in Python 3.1 so I am baffled. I found three solutions that aren't working. I might just be missing a detail or two.
The host is Alwaysdata, and it has been straightforward to use, so I have little clue about details of how they put together the parts. All I do is upload or edit (with ssh) Python files to a www folder, change permissions, point a browser to the right URL, and it works.
My first attempt, which works on local IDLE (and also the server's Python command line interactive shell, which makes me even more confused why it won't work when it's passed to a browser)
#!/usr/bin/python3.1
mystr = "你好世界"
print("Content-Type: text/html\n\n")
print("""<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body>""")
print(mystr)
The error is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3:
ordinal not in range(128)
Then I tried
print(mystr.encode("utf-8"))
resulting in no error, but the following undesired output to the browser:
b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
Third, the following lines were added but got an error:
import sys
sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'
Finally, replacing print with f.write:
import codecs
f = codecs.open(sys.stdout, "w", "utf-8")
mystr = "你好世界"
...
f.write(mystr)
error:
TypeError: invalid file: <_io.TextIOWrapper name='<stdout>'
encoding='ANSI_X3.4-1968'>
How do I get the output to work? Do I need to use a framework for a quick fix?
It sounds like you are using CGI, which is a stupid API as it's using stdout, made for output to humans, to output to your browser. This is the basic source of your problems.
You need to encode it in UTF-8, and then write to sys.stdout.buffer instead of sys.stdout.
And after that, get yourself a webframework. Really, you'll be a lot happier.
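A minimal sketch of that advice, assuming a plain CGI script: build the whole response as text, encode it to UTF-8 once, and write the bytes to the binary layer underneath stdout. The function names build_response and send are hypothetical, and the charset parameter in the header is an addition not present in the question's code.

```python
import io
import sys


def build_response(body_text):
    """Return the full CGI response (headers plus HTML) as UTF-8 bytes."""
    header = "Content-Type: text/html; charset=utf-8\r\n\r\n"
    html = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"></head>\n'
        "<body>" + body_text + "</body></html>\n"
    )
    return (header + html).encode("utf-8")


def send(payload, stream=None):
    """Write bytes to the given binary stream, defaulting to sys.stdout.buffer."""
    target = stream if stream is not None else sys.stdout.buffer
    target.write(payload)
    target.flush()


# Demonstration against an in-memory stream instead of the real stdout.
captured = io.BytesIO()
send(build_response("你好世界"), captured)
```

In the actual script, calling send(build_response(mystr)) with no stream argument would bypass the ASCII text wrapper that raised the UnicodeEncodeError.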

Combining multiple audio files in Python (with delay)

I'm looking to combine a range of different audio files (mp3) in Python. One of the requirements is that I need to be able to specify a delay at the end of each file. To illustrate, something like:
[file1.mp3--------3 seconds----------][delay---------2 seconds--------][file2.mp3]-------------4 seconds][delay---------2 seconds][file3.mp3----------3 seconds---------]
Does anyone here know of any mp3 libraries that can accomplish this? Python isn't really a necessity here. If it'll be easier in another language, that'll be fine.
I think FFmpeg can do this, given the right arguments. No real need to use a library.
To combine wav or aiff files, you can do something like this: (inspiration from here)
import aifc

def concatenate(*items):
    data = []
    for item in items:
        # Read each input's parameters and raw audio frames
        f = aifc.open(item, 'rb')
        data.append([f.getparams(), f.readframes(f.getnframes())])
        f.close()
    # Write everything out using the first file's parameters
    output = aifc.open('output.aif', 'wb')
    output.setparams(data[0][0])
    for item in data:
        output.writeframes(item[1])
    output.close()
See the link for the wav format (it's pretty much the same, but with the wave library)
To add silence, I would just make a one second silent file using your favorite audio editor and then concatenate in the proper amount of silence.
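As an alternative to preparing a silent file by hand, the silence can be generated in code. The sketch below uses the standard library wave module rather than aifc, and assumes uncompressed WAV inputs that all share the same channel count, sample width and frame rate. The function name concatenate_with_gaps is hypothetical. MP3 input would first have to be converted, for example with FFmpeg.

```python
import wave


def concatenate_with_gaps(paths, out_path, gap_seconds):
    """Join WAV files, inserting gap_seconds of silence between consecutive files."""
    clips = []
    params = None
    for path in paths:
        with wave.open(path, "rb") as src:
            if params is None:
                params = src.getparams()  # assume later files match the first
            clips.append(src.readframes(src.getnframes()))

    gap_frames = int(params.framerate * gap_seconds)
    # 8-bit WAV samples are unsigned (midpoint 0x80); wider samples are signed (zero).
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"
    silence = silent_byte * (gap_frames * params.sampwidth * params.nchannels)

    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for index, frames in enumerate(clips):
            if index:
                out.writeframes(silence)  # gap before every file except the first
            out.writeframes(frames)
    return gap_frames
```

A call such as concatenate_with_gaps(['file1.wav', 'file2.wav', 'file3.wav'], 'output.wav', 2) would reproduce the layout from the question, with two seconds between files.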
